Robot navigation is one of the fundamental challenges in robotics needed for autonomous intelligent agents. This problem is traditionally defined as finding a path from a start location to a target location and executing this path in a robust and safe manner, e.g. go from the office kitchen to the whiteboard. It requires the ability for the robot to perceive its environment, localize itself w.r.t. a target, reason about obstacles in its immediate vicinity, and develop a long range plan for getting to the target.
There is a huge body of work on navigation 
. Traditionally, navigation systems rely on feature extraction and geometric based reasoning to localize a robot and map its surroundings. When such maps are generated or given, the robot can use them to find a navigation path using planning algorithms .
More recently, learning navigation policies has emerged as a new line of work. Learned policies have the potential to leverage all aspects of the observations relevant to navigation, and not only the ones encoded manually in a geometric map of the environment. Furthermore, they can learn navigation behaviors which are not expressable with traditional planners.
Most of the learned policies capitalize on the recent advances in Deep Reinforcement Learning (RL). Despite their potential, the current approaches have been difficult to successfully deploy on real robots. Due to the high sample complexity of RL algorithms, these neural policies can often only be successfully trained in simulation environments. For navigation problems, such environments ought to capture both the physics and the visual complexity of the real world. Having both in one simulation environment is very challenging and practically non-existent (see Sec.II).
Contributions In this work, we present a neural net based policy for navigation, NavNet, which is (i) fully learned and (ii) directly deployable on a real robot (see Fig. 1). This is in contrast to most of learned policies which are not directly usable on a real robot (see Sec. II). It is a two level policy – a long range navigation policy, which outputs high level commands of the sorts ‘go forward’, ‘turn left’, etc.; and a short range policy, which executes the commands. The high level policy is trained to interpret real images from the deployment environment for the purpose of planning. While technically any local planner can be used to execute commands given by the high level policy, we show that training a short range policy specifically to complement the needs of the high level planner on a mobile base – being precise, keeping a straight path, and avoiding collisions – results in better performance in the real world.
This hierarchical separation into two policies, with the specified interface between them, allows for training these on different environments. As a second contribution, we propose to train the high-level on a ’streetview’-like environment of our indoor spaces. Training environments which exhibit high visual truthfulness while precisely modeling robot physics are hard to create. However, images of the deployment space organized in a semi-metric graph are easy to obtain. We show how to easily construct such a graph and train a high-level policy on it. Furthermore, simple maze-like synthetic environments, which model a robot mobile base, can be built using simulators. We train a low-level policy on such a synthetic environment.
We apply and extend  to learn our high-level policy. The supervision can come from a traditional path planner, as done in , or from human labeled paths. As a further contribution, we show how to utilize human demonstrations of navigation paths. These demonstrations allow for the robot to learn behaviors not easily expressed by traditional planners.
We extensively test the presented approach in a large () office building. We demonstrate that we can achieve success rate navigating to targets from random starting points across a building on a real robot, which is higher than the performance of a SLAM-based system in the same environment. Furthermore, this is quantitatively new performance from published neural models, which have not been deployed in the real world at this scale.
Ii Related Work
SLAM-based Navigation As a very old problem, there is an extensive literature on SLAM, planning and robot navigation [5, 1, 2, 6], which due to space limitations will not be discussed here. It suffices to say though, that our work falls roughly into mapless navigation. The neural policy requires neither a geometric map of the environment nor localization at test time.
Neural Net Based Navigation In recent years, RL-learned neural net policies have been explored for navigation. These are usually learned and tested in simulation. Examples include: using A3C on 3D mazes ; A3C on AITHOR ; ADDPG tested using a depth sensor on a real robot ; RL algorithms trained and tested on scans only of real spaces (no real robot) [10, 11] or SUNCG . Due to the large sample complexity of RL, the above methods cannot be learned directly on a real robot.
A high level environment has been used by Mirowski et al.  without deployment on a real robot. Bruce et al. [14, 15] deploy a RL trained neural net policy and use an environment constructed from a traversal. However, their system is not fully deployed on a real robot – the policy actions are executed by an operator, while our system employs a second low-level policy to execute these actions. Thus, contrary to them we provide a fully deployable navigation solution.
An explicit path planning strategy, similar to our approach, has been employed by Savinov et al. , in a learned topological graph. However, this path planning is used for inference only, and the system is evaluated in synthetic environments.
Beyond RL algorithms, investigations have been conducted into appropriate architectures, with emphasis on using models with external memory [17, 18, 19, 20]. These approaches, in their current form, are only applied in simulation. Learned low level controllers have been developed , and also combined with traditional PRM planning .
Realistic Training Environments for Navigation The majority of the simulation environments used for navigation experiments are not visually realistic, and as such the trained policies are not usable on a robot in the real world. Some of these environments use photo realistic rendering [23, 8], but still not extensively tested in real settings.
Recently, captures of real world environments are available in the form of train/test environments for robotics. The closest match to our desired environment is the Active Vision Dataset , which is a dense capture of homes. Unfortunately, these environments, while large in number, are individually too small to sufficiently challenge long range navigation. Larger environments ([25, 26, 27]), are captured too sparsely and contain rendered 3D reconstructions for locations not in the original scan. Although the reconstructions are of high visual fidelity, the rendered views still contain artifacts.
Iii Navigation Model
The navigation model is a two level policy (see Fig. 2). The high level policy is responsible for long range planning – for every observation and target label, it is trained to output a high level action bringing the robot closer to the target. The low level policy is responsible for executing the high level action on the specified platform. It is trained to maintain a straight path and safe distance to nearby objects.
The two policies have complementary properties. The high level policy knows and plans in the deployment environment, but cannot precisely guide the robot. The low level policy has no knowledge of the deployment environment, as it is trained on a different simulation environment, but moves the robot precisely and safely (see Table I).
|High Level Planning||Low Level Control|
|inputs||RGB images||1D depth|
|Training Environment Simulation Quality|
Iii-a Long Range High Level Planning Policy
The high level policy takes as inputs an RGB image , a target specification , and a binary proximity indicator . The latter indicator is the output of a radar reading and indicates if a collision is imminent. Currently, it is defined as iff there is an object within m.
We use two ways of specifying targets: a location label or a target image. To define a location label, we overlay a regular grid onto the world and define each cell as a potential target. In our case, we have a grid of size where
cells represent valid locations. Each location is defined as a one hot vector, where is the number of all possible locations.
The image-based target specification is defined by an embedding of an image from the target location. The embedding is obtained with same network used to embed observations. This second target definition is more flexible, but also more ambiguous as locations might look quite similar in visually repetitive environments.
The output action space is defined as three possible actions . The forward motion is intended to be , the turns are at . Note, however, that these values are approximate, as their semantics are established during training of the policy. The training environment, however, does not represent the above actions with very high precision (see Sec. IV-A)
With the above notation, the high level policy is trained to output a value estimating the progress towards the target , defined as the negative change in distance to if action is taken at observation . This value function can be used to estimate which action moves the robot closest to the target:
The above value function is implemented as a recurrent neural net taking as input the concatenated and transformed embeddings of the observation , target , and the proximity bit :
The recurrent net is a single layer LSTM , which maintains a state over the execution of a navigation run. This state serves to capture recent actions, which has shown cricual in avoiding oscillations (revisiting the same state over and over again).
Iii-B Low-Level Policy
While the high level commands can be executed on the robot using position control, their verbatim execution leads to inaccuracies, drifts, non-smooth behavior and eventually collisions. With the above definition of , we can safely execute rotations, but for ’forward’ we use a short range policy. This policy is trained to rectify inaccuracies of the high level plan, keep a straight path and avoid collisions.
The input to the policy is a 1-dimensional LiDAR reading. It captures the geometry of nearby objects, which is sufficient for short term safe control. The low level action space is continuous, defined by the kinematics of the robot. In our case, we work with a differential drive mobile base. Thus, the action space is a 2 dimensional real valued vector of the twist values for the two wheels (angular velocities for each wheel).
The policy is formulated as a ConvNet over both space and time. Since it is a ConvNet, it maintains no state. However, having as an input several recent observations allows for the policy to reason over robot motion, which we believe is important for low-level control.
In more details, the low level policy receives the last 3 LiDAR readings , and . Since these are 1-dimensional, they can be concatenated into an image, where the second dimension is time:
The network has following 4 layers: , where denotes convolution with kernel
and strideand is fully connected layer with output dimension .
Iv Training Environments and Algorithms
Iv-a High Level Planning Policy
WorldViewEnv The training uses an environment consisting of real images obtained via traversals. These images represent states of the robot in the world and are organized in a graph, whose edges represent actions moving the robot from one state to another.
To create WorldViewEnv, we use a rig of 6 cameras organized in a hexagonal shape. This rig captures a set of images every as it moves through the building.
We perform two operations with the raw captured data. First, the images are stitched into a 360 degree panorama, which can be cropped in any direction to obtain images of the desired field of view (fov). We use fov of 108 deg (width) and 90 deg (height). We crop each panorama into 24 views each 15 degrees apart. We define two rotational actions: ‘turn left’ and ‘turn right’, which shifts the robot’s view while remaining at the same location.
Second, we use the Cartographer API  to estimate the poses of the images. We define the ‘forward’ action as moving the robot to a nearby location in the direction of the current view. We attempt to move forward by , but there is no guarantee that an image was captured exactly away. We consider this action possible if an image is captured within from the new location, which we set as the result of this action. The final action space is (see Fig. 3).
Discussion WorldViewEnv exhibits high visual fidelity and the traversals cover most of the designated building spaces. However, the high level actions capture rough motions and can be used to express a navigation path, but cannot be executed robustly on a robot. Hence, we use this environment to train a high level policy only.
As the actions connecting views are high level, we need only locally correct SLAM and loop closure. Hence, we do not need the high precision necessary for global geometric maps. Moreover, no mapping of the surroundings is needed.
The training is a form of imitation learning, where we can imitate paths from traditional planners and human navigations. Furthermore, imitation learning provides supervision at every step of the policy execution, which leads to more stable optimization contrary to many of the RL setups suffering from sparse rewards.
These paths can be used as supervision at every step of the policy execution, when present. Since supervised learning has lower sample complexity, it plays as an advantage over RL when it comes to efficiency.
To define the training loss, consider a set of navigation paths leading to one of several pre-defined targets. These paths are defined over the graph underlying WorldViewEnv. In our experiments is either a set of all shortest paths to targets produced by a shortest path planner or a set of human navigation paths.
For a target , start and a path , we can denote by the distance from to along if both start and target are on the path in this order. If one of or both of them are not on the path, then the above distance is . Using the full , we can consider the shortest path in which leads from to :
Using , we can define the progress toward if we apply at state :
where is the image at which one arrives after taking action . This transition is defined by the WorldViewEnv.
The loss trains the high level policy to output values as close a possible to . As we use a recurrent net to define the policy (see Eq. (2)), we define the loss over whole navigation paths. Denote by a navigation path, then the loss reads
where the model is defined in Eq. (2). We use Adam Optimizer , where at each training step we unroll the current policy with random starts and formulate the above loss for states which are covered by . At the beginning of the training, the policy performs random actions, resulting in random paths. As the training progresses, the paths become more meaningful and the above loss emphasizes on situations which will be encountered at test time. This approach is very similar to DAGGER .
We utilize a scalable distributed training setup  in which the unrolling is done in separate processes for performance reasons. More precisely, we run collector jobs, which refresh their policy every steps. The unrolled episodes are written into a replay buffer. training jobs utilize the data in the buffer to perform the actual training. We use standard data augmentation techniques for images, learning rate of , batch size of , and unrolls of length of .
Iv-B Low Level Policy
Simple Synthetic Environment The low level policy is trained in a synthetic environment, consisting of several hallways and rooms (see Fig. 4, left). It is generated using a 2D layout, which is converted to 3D by extending the walls up. The observations are 1D depth images (see Fig. 4, right). Due to their simplicity, these can be simulated with high fidelity and the trained models transfer to the real robot. Visually, this environment is a maze of hallways of varying sizes.
In addition, we simulate the kinematics of the robot, a differential drive. It consists of two wheels which are controlled by their velocities. We do not simulate the robot’s dynamics.
Thus, we simulate the robot motion with simple depth perception. The experiments show that this is sufficient to learn a low level policy.
Training Our policy is to execute the ‘forward’ action while being efficient and safe. For training we use continuous Deep Q-Learning (DDPG) . The above requirements are encoded by the reward needed by DDPG for a given action at a state . The reward should be the highest if the robot is moving straight as quickly as possible without colliding:
where and denote the linear and angular velocity of the differential drive after applying the current action (in the current state which is omitted for brevity). If this action results in no collision, the reward is a function of how fast () and how straight () the robot moves. If there is a collision, then the robot incurs a large negative reward .
The above setup results in policies which are smooth and maintain a large distance from their surroundings. Empirically, they tend to follow the medial axis of hallways and rooms.
Implementation We borrow the implementation details and the training setup of , where we replace their reward with the one above. To outline, we use Adam optimizer  with learning rate of and batch size of . The employed DDPG algorithm uses a ‘critic network’, which approximates the Q value for given state and action . It has a very similar architecture to the low level policy, defined in Eq. 3 – it uses the same type of ConvNet to embed the observation, which is subsequently concatenated with the action. This vector is embedded to a single value using a two layer perceptron (layers of dimensions 10 and 8) .
To assess performance, we measure success rate as the portion of the runs which end within of the specified target.
For training, we scan a large x office building. In this scan, a panorama is captured every during a traversal. Note that these scans were acquired months apart in time, and have differences in the height of the camera. For testing we have two setups: (i) a second scan of the same building (5835 distinct locations; see Fig. 5). (ii) a real robot which runs the policy in the building.
V-a High Level Planning Policy
We evaluate how well the high level policy performs in isolation of the low level policy. For this we use WorldViewEnv, where we train on data from traversals and test using data from a holdout traversal. This simulates the scenario where we navigate the same environment but see different images from different poses. At test time, the policy traverses the test WorldViewEnv scan and evaluate navigating to locations (see Fig. 5). For each location, we randomly pick starting positions with each target location at least away, resulting in runs.
The high level policy achieves near perfect success rate of on the test traversal. This performance is consistently high across all targets (at least success rate). The average run length is steps which corresponds to approx. , resulting in total.
Generalization over Time It is to be noted that of the training traversal was taken 4 months apart from the testing one. Being an office building, some areas of the environment have undergone changes (moved furniture, different furniture). This creates a realistic setup, but also a challenging one.
Most of the failures were due to such changes. Also, some of the test traversal was missing areas present in the train one, which was causing additional failures, as the robot was seeking to go through these areas. Despite these challenges, our approach resulted in navigation performance close to perfect.
The results also indicate that the policy generalizes to novel views and changes in the environment. The test traversal does not follow the train one and was recorded a month later at a different camera height. The executed paths don’t follow verbatim the supervision (Fig. 7, right). Further, the policy can recover from missed turns (Fig. 7, left). In aggregate, the trained policy learns to get to the target efficiently with an average run length times the shortest possible one.
Generalization over Space The training data traverses every area three times. Although the operator didn’t intend to make the traversals different, they might cover the areas slightly differently. We train NavNet with 1, 2, and 3 traversals. The success rate for training with 2 and 3 traversals is , while with only 1 traversal, the success rate drops to . Some of the performance drop can also be attributed to the single traversal experiment using a scan captured at a higher viewpoint. This shows that more data from the scanned environment helps, but the optimal performance is reached without the need to re-scan the environment many times.
Image-based Target Specification Furthermore, our method allows more flexibility in specifying targets for the policy. While traditional navigation systems take a global target pose as an input, our training setup can allow arbitrary target images. To verify this, we run an experiment where instead of specifying the target as a one-hot vector, we provide an image from a target area. We input to the policy a target image sampled randomly from a target area. Due to the rotations at each location, there are usually several hundred views at each target location at each of the locations. At test time, we sample a target image in the same way.
Using this setup, we achieve success rate. Our analysis shows that the decrease in performance is mostly from receiving ambiguous target images. For example, an image of a wall or window provides very little information for the policy to determine where to go, or an image of a cubicle area may be hard to differentiate since there are a large number of cubicles in the environment (Fig. 8). To verify this, we also run an experiment where we also provide an additional target image at the same location but facing in the opposite direction to cover a larger field of view. With this additional input, we achieve success rate, showing that more views helps the policy disambiguate the target. These experiments also show that we can generalize to a large number of targets.
Shortest Path Planner vs Human Demonstration In Sec. IV-A we define the loss using a set of navigation paths. In the experiments above, these are generated by a shortest path planner. In addition, we can use demonstrations of human navigation, which allows us to learn behaviors otherwise hard or impossible to encode in a traditional planner.
For one of the targets, we ask an inhabitant of the office space to label plausible navigation trajectories leading to this target starting from various points in the building. We label demonstrations. A qualitative comparison between the shortest path supervision and demonstration supervision paths in the test WorldViewEnv is presented in Fig. 9. We see that the demonstration-based navigation largely avoids the middle vertical sections. These correspond to paths through cubicle spaces, which are naturally avoided by humans as they are socially disturbing. This shows that a learning-based policy can learn socially aware behavior without manually encoding rules in the planner.
V-B Real Robot Experiment
We use the Fetch robot mobile base  to run real world experiments. For each of the 10 targets we perform 4 runs. We run the robot without any manual intervention. of the training scan was collected 7 months before the experiment, the rest was collected 3 months before.
NavNet achieves success rate of over these 40 runs. This shows that the policy is capable of learning long range planning while avoiding obstacles. As shown in Fig. 6 these runs are long, over highly repetitive areas. The failed runs fall into roughly three categories: leaving the area it is trained for (4 runs); lost localization leading to spinning (1 run); collisions with objects due to low-level policy failures (2 runs).
To put these results into perspective, we evaluate an established off-the shelf SLAM-based model – the ROS Navigation Stack 
. This system requires scanning the environment to build a 2D occupancy map. NavNet only expects a collection of images with their relationship, without needing precise pose and mapping of surroundings. Furthermore, the ROS system requires manual localization at the beginning of every run, which we do not need. Finally, it utilizes more sensors – 1D LiDAR and an RGBD head camera. NavNet requires a monocular RGB camera for the high-level policy, and 1D LiDAR for the low level. The ROS Navigation Stack also requires careful tuning – manual cleaning of the occupancy map, manual tuning of hyperparameters for obstacle detection, etc.
With this setup, it achieves success rate in reaching the targets. This shows that a learned system can outperform an established navigation system, even at a lower setup cost and manual work. The ROS system exhibits two major failures – localization failure and collisions. NavNet, on the contrary, learns to extract from the RGB sensor the relevant information for best localization in the current environment, while ROS relies on non-adapted feature matching. Furthermore, the occupancy map used by ROS leads either to collisions or being stuck, depending on how the hyperparameters governing the safety margins are set. The learned policy seems to learn a better behavior.
WorldViewEnv vs Real Robot Runs Another indication that the learned policy generalizes to unseen views and thus is robust in real settings is a comparison of runs in WorldViewEnv with runs on the real robot. In Fig. 10 we show 4 runs with same start and target executed in simulation and in real. In 3 out of 4 cases the robot took identical paths. In one run (in red) it made a different but meaningful turn at the first intersection and then successfully reached the target in similar distance.
V-C Low Level Policy
Learned vs ROS To investigate the importance of our learned low level policy, we experiment using the ROS local planner to execute high level commands from NavNet. This requires a pre-generated occupancy map, which our low level policy doesn’t need. To determine whether we can step forward at any location, we use the local costmap, which fuses sensor data from a LiDAR scan with a pre-computed SLAM map, to determine if the area in front of the robot contains traversable space. We provide the planner with one of three relative poses according to which action the high level planner chose. We tune the local planner to allow a radius tolerance for “forward” action and radians for “rotate” actions. These parameters and several others were chosen after 1 day of manual tuning and testing.
We compare the two systems on 10 random runs. NavNet succeeds in 9, while NavNet with ROS succeeds in 6. Visualization of some of the successful trajectories are shown in Fig. 11. Failures roughly fall into the global policy taking the robot into areas where the local planner is not confident and gets stuck, unable to execute a “forward” command at any of the 24 rotations. More subjectively, the ROS local planner seems to result in more “zig-zag“ paths while our policy maintains smoother and more central paths.
Qualitative Evaluation We perform 4 runs on the robot with same start and target. For each of them, we place obstacles (2 bins and a chair) in the middle of the hallway for the robot to navigate through. In unobstructed settings, the robot goes through the middle of the hallway. With obstacles, the robot (i) still executes a correct high level plan and (ii) successfully changes the path execution to avoid the obstacles. As shown in Fig. 12, the area it has to go through is only approx. wider than the robot base. The low level policy manages to get the robot safely around.
In this paper we present a novel two level neural policy for navigation. We train it in a novel way using two different environments, which allows for easy deployment on a real robot. The system has high performance compared to SLAM-based methods and demonstrates that neural policies can be trained aßnd deployed in the real world.
The authors thank A. Angelova, A. Faust, J. Kosecka for insightful comments; D. Cheung, J. Davidson, R. Nguyen for organizational support.
-  F. Bonin-Font, A. Ortiz, and G. Oliver, “Visual navigation for mobile robots: A survey,” Journal of intelligent and robotic systems, vol. 53, no. 3, pp. 263–296, 2008.
-  S. Thrun, “Simultaneous localization and mapping,” in Robotics and cognitive approaches to spatial mapping. Springer, 2007, pp. 13–41.
-  S. Thrun, W. Burgard, and D. Fox, Probabilistic robotics. MIT press, 2005.
-  A. Mousavian, A. Toshev, M. Fiser, J. Kosecka, and J. Davidson, “Visual representations for semantic target driven navigation,” ICRA, 2019.
-  G. N. DeSouza and A. C. Kak, “Vision for mobile robot navigation: A survey,” IEEE Transactions on PAMI, vol. 24, no. 2, pp. 237–267, 2002.
-  I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese, “3D semantic parsing of large-scale indoor spaces,” in CVPR, 2016.
-  P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu et al., “Learning to navigate in complex environments,” in ICLR, 2017.
-  Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in ICRA, 2017.
-  L. Tai, G. Paolo, and M. Liu, “Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation,” in IROS, 2017.
-  M. Savva, A. X. Chang, A. Dosovitskiy, T. Funkhouser, and V. Koltun, “Minos: Multimodal indoor simulator for navigation in complex environments,” arXiv preprint arXiv:1712.03931, 2017.
-  S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, “Cognitive mapping and planning for visual navigation,” in CVPR, 2017.
-  Y. Wu, Y. Wu, G. Gkioxari, and Y. Tian, “Building generalizable agents with a realistic and rich 3d environment,” in ICLR, 2018.
-  P. Mirowski, M. K. Grimes, M. Malinowski, K. M. Hermann, K. Anderson, D. Teplyashin, K. Simonyan, K. Kavukcuoglu, A. Zisserman, and R. Hadsell, “Learning to navigate in cities without a map,” arXiv preprint arXiv:1804.00168, 2018.
-  J. Bruce, N. Sünderhauf, P. Mirowski, R. Hadsell, and M. Milford, “Learning deployable navigation policies at kilometer scale from a single traversal,” CoRL, 2018.
-  ——, “One-shot reinforcement learning for robot navigation with interactive replay,” arXiv preprint arXiv:1711.10137, 2017.
-  N. Savinov, A. Dosovitskiy, and V. Koltun, “Semi-parametric topological memory for navigation,” in ICLR, 2018.
-  J. Zhang, L. Tai, J. Boedecker, W. Burgard, and M. Liu, “Neural slam: Learning to explore with external memory,” arXiv preprint.
-  E. Parisotto and R. Salakhutdinov, “Neural map: Structured memory for deep reinforcement learning,” arXiv preprint arXiv:1702.08360, 2017.
-  J. Oh, V. Chockalingam, S. Singh, and H. Lee, “Control of memory, active perception, and action in minecraft,” in ICML, 2016.
-  A. Khan, C. Zhang, N. Atanasov, K. Karydis, V. Kumar, and D. D. Lee, “Memory augmented control networks,” in ICLR, 2018.
-  H.-T. L. Chiang, A. Faust, M. Fiser, and A. Francis, “Learning navigation behaviors end to end,” arXiv preprint arXiv:1809.10124, 2018.
-  A. Faust, O. Ramirez, M. Fiser, K. Oslund, A. Francis, J. Davidson, and L. Tapia, “Prm-rl: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning,” in ICRA, 2018.
-  A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” in CoRL, 2017.
-  P. Ammirato, P. Poirson, E. Park, J. Košecká, and A. C. Berg, “A dataset for developing and benchmarking active vision,” in ICRA, 2017.
-  I. Armeni, S. Sax, A. R. Zamir, and S. Savarese, “Joint 2d-3d-semantic data for indoor scene understanding,” arXiv preprint arXiv:1702.01105, 2017.
-  A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,” in 3DV, 2017.
-  F. Xia, A. R. Zamir, Z.-Y. He, A. Sax, J. Malik, and S. Savarese, “Gibson Env: real-world perception for embodied agents,” in CVPR, 2018.
S. Hochreiter and J. Schmidhuber, “Long short-term memory,”Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE CVPR, 2016.
-  M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “Ros: an open-source robot operating system,” in ICRA Workshop on Open Source Software, vol. 3. Kobe, Japan, 2009, p. 5.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
-  S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in AISTATS, 2011, pp. 627–635.
-  D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver, “Distributed prioritized experience replay,” in ICRA, 2018.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in ICLR, 2016.
-  “Fetch robot,” https://fetchrobotics.com/robotics-platforms/, accessed: 09-13-2018.
-  “Ros navigation stack,” http://wiki.ros.org/navigation.