Autonomous robot navigation, that is, how to convey a robot from one location to another, is one of the most fundamental and well-studied problems in robotics. This problem has been tackled in a number of different research communities over the last five decades. In the control community, robot navigation is typically studied for a known agent (or system) in a known environment, meaning that a dynamics model of the agent and a geometric map of the environment are available. Under these assumptions, optimal-control schemes can be used to obtain smooth, collision-free trajectories to reach the goal. Moreover, the obtained controller has many favorable properties: near-perfect generalization to variations in the underlying system, near-zero sample complexity, and provable robustness to system and environmental noise. This is the basis of how a number of real physical systems are controlled, such as airplanes and industrial robots. On the flip side, explicit knowledge of the environment severely limits the applicability of such techniques. Successful practical deployment requires either extensive environment instrumentation, or sophisticated sensor suites that can accurately reconstruct the environment from on-board measurements.
On the other hand, in the learning community, robot navigation is typically studied for an unknown agent in an unknown environment. No assumptions are made about the robot or the underlying environment. Policies are learned to directly map on-board sensor readings to motor torques (or control commands in general) in an end-to-end manner using trial and error. This is done by reinforcing random roll-outs that lead to favorable outcomes (as measured through some pre-defined reward functions). The lack of any assumptions makes this research very attractive: policies can be learned without knowing anything about the underlying system, and policies can work in un-instrumented environments merely via partial views of the environment as obtained through inexpensive on-board sensors. However, policies learned using such techniques are extremely specialized to the system they were trained on. They fail to generalize to even minor variations (such as mass and inertia) of the underlying system and fail to transfer even between similar agents. Furthermore, learning such policies is extremely sample-inefficient, requiring millions of samples to acquire the desired behavior.
In this paper, we study robot navigation in static environments under the assumption of perfect robot state measurement. We make the crucial observation that most interesting problems involve a known system in an unknown environment. This observation motivates the design of a factorized approach that uses learning to tackle unknown environments and leverages optimal control using known
system dynamics to produce smooth locomotion. More specifically, we train Convolutional Neural Network (CNN) based high-level policies that incrementally use the current RGB image observation to produce a sequence of intermediate states, or waypoints. These waypoints are produced to guide the robot to the desired target location via a collision-free path in previously unknown environments, and are used as targets for a model-based optimal controller to generate smooth, dynamically feasible control sequences to be executed on the robot. Our approach, which we refer to as WayPtNav (WayPoint-based Navigation) from here on, is summarized in Fig. 1.
Our factorized approach to robot navigation benefits from the advantages of classical control and learning-based approaches in a way that addresses their individual limitations. Learning leads to generalization in unknown environments without the need for expensive sensors or elaborate environment instrumentation. It can leverage statistical regularities to make predictions about the environment from partial views (RGB images) of the environment. Knowledge of the underlying dynamics for control leads to generalization to systems with different physical properties (such as mass and moment of inertia). At the same time, use of feedback-based control leads to smooth, continuous, and efficient trajectories that are naturally robust to noise in actuation. Furthermore, this factorization also takes the burden off the learning system. Learning now does not need to spend interaction samples to learn about the dynamics of the underlying system, and can exclusively focus on dealing with generalization to unknown environments. To summarize, our key contributions are:
1) we design an approach to robot navigation that combines a learning-based perception module and a dynamics model-based planning module to robustly maneuver the robot in novel, cluttered real environments using only a single on-board RGB camera;
2) through simulations and experiments on a ground mobile robot, we demonstrate that our factorized approach is more reliable and efficient at reaching goals, results in smoother trajectories, and generalizes better across different systems, as compared to End-to-End (E2E) learning approaches.
II. Related Work
Our work builds upon work in optimal control, and learning-based techniques for navigation. We discuss relevant works in these two areas. As our proposed solution seeks to combine model-free and model-based approaches in a hierarchical manner, we also discuss relevant work in hierarchical RL, and works that seek to combine model-free and model-based learning. As our work tackles navigation, we also provide an overview of navigation in classical robotics.
Optimal Control. Optimal control tackles the optimization problem of controlling a known system in a known environment to minimize a known cost function. A number of schemes have been proposed to solve optimal control problems: random shooting, collocation methods, dynamic programming, and trajectory optimization. Certain choices of the system dynamics and the cost function simplify the optimization. For example, if the dynamics are linear and the cost function is quadratic, the optimization can be solved efficiently via dynamic programming using the Linear Quadratic Regulator (LQR). Researchers have extended this efficiency to non-linear systems by iteratively solving a series of LQR problems on a linearized version of the system around the current trajectory. These optimization schemes can be made even more efficient for the class of differentially flat systems by using splines for the state trajectories [24, 44, 30].
Navigation in Classical Robotics. Classical robotics factorizes the navigation problem into the sub-problems of mapping, path planning, and trajectory tracking via optimal control as described above. Mapping refers to estimating the 3D structure of the world using a collection of RGB images, depth images, or LiDAR scans. Path planning involves computing a global plan to convey the robot to the desired location while avoiding the mapped obstacles. Finally, trajectory tracking refers to generating control commands to track the planned trajectory. Studying these three problems separately has led to incredible progress. However, the intermediate representations between these modules are typically geometric in nature. For example, maps are typically purely geometric entities devoid of any semantics: nothing is known unless explicitly observed, and there is no notion of navigational affordances (e.g., that to go far away, I should step into the hallway). To overcome these limitations, task planning approaches have been proposed that leverage object- and region-specific semantics during planning [13, 4, 33, 7, 9]
; however, these approaches often rely on hand-tuned heuristics to leverage the semantics in planning. Our work follows a similar decomposition of the problem. However, instead of obtaining waypoints via geometry-based reasoning or hand-tuned semantics, we employ learning. Given single images of the environment, we directly learn to predict good intermediate waypoints that will convey the robot to the desired target location without causing any collisions. The use of learning alleviates the need for explicit estimation of the exact specifics of the environment, and allows our method to automatically pick up semantic cues that are relevant for effective navigation from commodity on-board RGB images.
End-to-End Learning for Navigation. There has been recent interest in employing end-to-end learning to train policies for goal-driven navigation [45, 16, 23]. The typical motivation of this work is to incorporate semantics and common-sense reasoning into navigation policies. While Zhu et al. only learn policies that work well in training environments, Gupta et al. and Khan et al. design policies that generalize to previously unseen environments. Most such works [16, 45, 23] abstract out low-level control and work with a set of macro-actions (e.g., moving forward a fixed distance, turning in place by a fixed angle). Such ignorance of the dynamics of the underlying robot results in jerky and inefficient policies that exhibit stop-and-go behavior. Numerous other works have tackled navigation in synthetic game environments [31, 38, 35], and completely ignore considerations of real-world deployment, such as dynamics and state estimation.
Researchers have also employed learning to tackle locomotion problems [19, 14, 37]. These works learn policies for collision avoidance, i.e., how to move around in an environment without colliding. Kahn et al. use motion primitives, while Gandhi et al. and Sadeghi and Levine use velocity control for locomotion. Muller et al. focus on lane keeping and making turns in outdoor environments, using a scene segmentation module to generate the next desired position for the vehicle based on a user command. While all of these works marry low-level control to learning-based systems, none of them tackle the problem of goal-reaching, which is the focus of this work.
Hierarchical Reinforcement Learning. Our work is also related to hierarchical RL [3, 10, 43, 27], where a policy is factorized into sub-policies that are controlled via a master policy. While the intuition behind such a factorization is similar to ours, the focus in hierarchical RL is on jointly learning policies at both levels from trial-and-error interaction with the environment, rather than employing known models for the low-level policies as is the case in our work here.
Combining Optimal Control and Learning. A number of papers seek to combine the best of learning with optimal control. Drews et al.  learn a cost function from monocular images which is used with Model Predictive Control (MPC) for aggressive driving on a race track. Kaufmann et al. [20, 21] use learning to predict waypoints that are used with model-based control for drone racing. While these factorized approaches are similar in spirit to ours, the focus of these works is on aggressive control in training environments. In contrast, we seek to learn goal-driven policies that work well in completely novel testing environments. Other works such as that from Levine et al.  and Pan et al.  combine optimal control and end-to-end learning, by training neural network policies to mimic the output control values based on the raw images. We explicitly compare to such an approach in this work.
III. Problem Setup
In this work, we study the problem of autonomous navigation of a ground vehicle in previously unknown indoor environments. We assume that odometry is perfect (i.e., the exact vehicle state is available), and that the environment is static. Dealing with imperfect odometry and dynamic environments are problems in their own right, and we defer them to future work.
We model our ground vehicle as a three-dimensional non-linear Dubins car with the following dynamics:

$\dot{x} = v\cos\phi, \quad \dot{y} = v\sin\phi, \quad \dot{\phi} = \omega, \qquad (1)$

where $z = (x, y, \phi)$ is the state of the vehicle, $(x, y)$ is the position, $\phi$ is the heading, $v$ is the speed, and $\omega$ is the turn rate at time $t$. The input (control) to the system is $u = (v, \omega)$. The linear and angular speeds $v$ and $\omega$ are bounded within $[0, \bar{v}]$ and $[-\bar{\omega}, \bar{\omega}]$ respectively. We use a discretized version of the dynamics in Eqn. (1) for all planning purposes. The robot is equipped with a monocular RGB camera mounted at a fixed height and oriented at a fixed pitch.
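As a concrete sketch, a first-order (Euler) discretization of the dynamics in Eqn. (1), of the kind used for planning, can be written as follows. The numeric control bounds here are illustrative placeholders, not the paper's actual values:

```python
import math

def dubins_step(state, control, dt):
    """One Euler-discretized step of the Dubins car in Eqn. (1).

    state   = (x, y, phi): position and heading
    control = (v, omega):  commanded speed and turn rate
    The bounds v_max and omega_max are hypothetical values for
    illustration only.
    """
    v_max, omega_max = 0.6, 1.1              # hypothetical control bounds
    x, y, phi = state
    v, omega = control
    v = min(max(v, 0.0), v_max)              # linear speed in [0, v_max]
    omega = min(max(omega, -omega_max), omega_max)
    return (x + dt * v * math.cos(phi),
            y + dt * v * math.sin(phi),
            phi + dt * omega)
```

Applying the step repeatedly rolls out a state trajectory from any control sequence, which is all the planner needs from the dynamics model.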
The goal of this paper is to learn control policies for this robot for navigation in novel environments that it has not encountered before. More specifically, we study goal-oriented navigation tasks: the robot needs to go to a target position $g^*$, specified in the robot’s coordinate frame, without colliding with any obstacles. These tasks are to be performed in novel environments whose map or topology is not available to the robot. In particular, at a given time step $t$, the robot with state $z_t$ receives as input the RGB image of the environment, $I_t$, and the target position $g_t$ expressed in the current coordinate frame of the robot. The objective is to obtain a control policy that uses these inputs to guide the robot to the target as quickly as possible.
IV. Model-based Learning for Navigation
We use a waypoint-based approach to navigation (WayPtNav). WayPtNav makes use of two submodules: perception and planning. The perception module uses the current RGB observation with a CNN to predict the desired next position that the robot should navigate to. The planning module then uses the dynamics model of the robot to plan a smooth (minimum-jerk) trajectory to convey the robot to the predicted waypoint. Finally, we use a dynamics model-based linear feedback controller to track this trajectory. The WayPtNav framework is illustrated in Figure 2 and summarized in Algorithm 1.
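To make the perceive-plan-track loop concrete, here is a minimal, obstacle-free toy version of this loop. The learned perception module is replaced by a stub that proposes a waypoint a fixed distance toward the goal, and the spline planner and feedback tracker are collapsed into a simple heading controller executed over one control horizon before re-observing. All names and constants are illustrative, not the authors' implementation:

```python
import math

def toy_wayptnav(start, goal, horizon=0.5, dt=0.05, max_iters=200):
    """Toy WayPtNav-style replanning loop in an empty 2D world."""
    x, y, phi = start
    for _ in range(max_iters):
        if math.hypot(goal[0] - x, goal[1] - y) < 0.3:   # within 30 cm of goal
            return True, (x, y, phi)
        # "Perception": propose a waypoint up to 1 m toward the goal
        # (stub standing in for the CNN's prediction).
        gx, gy = goal
        d = math.hypot(gx - x, gy - y)
        step = min(1.0, d)
        wx = x + step * (gx - x) / d
        wy = y + step * (gy - y) / d
        # "Planning + control": steer toward the waypoint for one
        # control horizon, then re-observe and replan.
        t = 0.0
        while t < horizon:
            err = math.atan2(wy - y, wx - x) - phi
            err = math.atan2(math.sin(err), math.cos(err))  # wrap to [-pi, pi]
            v = 0.5                                         # constant speed
            omega = max(-1.1, min(1.1, 2.0 * err))          # saturated P-control
            x += dt * v * math.cos(phi)
            y += dt * v * math.sin(phi)
            phi += dt * omega
            t += dt
    return False, (x, y, phi)
```

The structure mirrors Algorithm 1: observe, predict a waypoint, execute a short plan, repeat; swapping in a real CNN, spline planner, and LQR tracker recovers the full system.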
IV-A. Perception Module
The goal of the perception module is to analyze the image and provide high-level guidance to the planning module. We implement the perception module using a CNN that takes as input the RGB image $I_t$ captured from the onboard camera, the target position $g_t$ specified in the vehicle’s current coordinate frame, and the vehicle’s current linear and angular speed $(v_t, \omega_t)$, and outputs a waypoint $\hat{w}_t$, the desired next state of the robot (Line 6 in Algorithm 1). The planning module uses this desired next state to design a trajectory, as well as a feedback controller, to convey the system to the desired state over a time horizon of $H$ seconds (also referred to as the control horizon). The entire process is repeated at time $t + H$. The network is trained by imitating an automatically computed expert sequence of waypoints, as explained in Section V.
Intuitively, the network can use semantic cues (such as presence of scene surfaces like floor and walls, common furniture objects like tables and chairs, exit points like doors) alongside the learned priors about the shapes of such common objects to generate an estimate of where it should roughly go next, even when the desired position is not visible in the current RGB image. This allows the robot to do a more guided and efficient exploration in the novel environments based on its prior experience. Qualitative results in Section VI show how our perception module is able to make predictions about where to go next even though the waypoint it outputs has not been observed explicitly.
IV-B. Planning Module
Given a waypoint $\hat{w}_t$ and the current linear and angular speed $(v_t, \omega_t)$, the planning module uses the system dynamics in Eqn. (1) to design a smooth trajectory from the current vehicle state to the waypoint. In this work, we represent the $x$ and $y$ trajectories using third-order splines. Since the heading of the vehicle can be determined from the $x$ and $y$ trajectories (i.e., the Dubins car model in Eqn. (1) is differentially flat with outputs $x$ and $y$), the generated spline trajectories are also dynamically feasible for Eqn. (1). Thus the spline-based planner provides the desired state and control trajectories, $\{z^*(t), u^*(t)\}$, that the robot should follow over the time horizon $H$ to reach the desired waypoint (Line 7).
Since the generated splines are third-order, the generated position, speed, and acceleration trajectories are smooth and differentiable. This is often an important consideration for real robots, where jerky trajectories might lead to compounding sensor errors, poor tracking, or even hardware damage. The parameters of these trajectories can be determined from the boundary conditions, i.e., the current state, the final state, and the current speed. This corresponds to solving a set of linear equations and can be done efficiently onboard. For more details on spline trajectory generation for a Dubins car, we refer the interested reader to the literature on differentially flat systems. Note that while we use splines in this work, other model-based planning schemes, such as MPC, can also be used for trajectory planning.
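Because the boundary conditions determine a cubic uniquely, the coefficient solve is tiny and closed-form. As a simplified sketch of fitting one third-order polynomial per coordinate (symbol names are ours, and this collapses the paper's spline to a single segment):

```python
def cubic_hermite_coeffs(p0, v0, pT, vT, T):
    """Coefficients (a, b, c, d) of p(t) = a + b*t + c*t^2 + d*t^3 on [0, T]
    satisfying the boundary conditions p(0)=p0, p'(0)=v0, p(T)=pT, p'(T)=vT.

    The four linear boundary equations are solved in closed form --
    the same kind of small linear system the planner solves onboard.
    """
    a, b = p0, v0
    # Remaining equations:  c*T^2 + d*T^3   = pT - p0 - v0*T
    #                       2c*T  + 3d*T^2  = vT - v0
    r1 = pT - p0 - v0 * T
    r2 = vT - v0
    c = (3.0 * r1 - r2 * T) / T ** 2
    d = (r2 * T - 2.0 * r1) / T ** 3
    return a, b, c, d

def eval_cubic(coeffs, t):
    """Evaluate the cubic at time t."""
    a, b, c, d = coeffs
    return a + b * t + c * t ** 2 + d * t ** 3
```

Fitting such a cubic to each of $x$ and $y$ (with the heading recovered from their derivatives, by differential flatness) yields a smooth, dynamically feasible trajectory to the waypoint.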
To track the generated trajectory $\{z^*(t), u^*(t)\}$, we design an LQR-based linear feedback controller, $u_t = u^*(t) + K(t)(z_t - z^*(t))$ (Line 8), where $u^*(t)$ and $K(t)(z_t - z^*(t))$ represent the feed-forward and feedback terms respectively. The LQR controller is obtained using the dynamics in Eqn. (1), linearized around the trajectory generated by the planning module. LQR is a widely used feedback controller in robotic systems, and has been shown to reject external disturbances and make planning robust to mismatches between the dynamics model and the actual system [42, 29]. The presence of this feedback controller also allows us to deploy the proposed framework directly on a real robot, even when it is trained only in simulation and even though the model in Eqn. (1) may not capture the true physics of the robot (provided that the perception module is robust to the visual differences between simulated and real-world images), as we demonstrate in Section VII.
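The backward Riccati recursion behind such a finite-horizon tracking controller can be sketched in one dimension. The paper's controller works on the full linearized Dubins model; this scalar version, with our own variable names, only illustrates the recursion:

```python
def lqr_gains_scalar(a, b, q, r, horizon):
    """Finite-horizon discrete LQR for a scalar linearized error system
    e_{t+1} = a*e_t + b*u_t with stage cost q*e^2 + r*u^2.

    Returns time-varying gains k_t (ordered forward in time) so that the
    tracking control is u_t = u*(t) + k_t * (z_t - z*(t)).
    """
    p = q                      # terminal cost-to-go weight
    gains = []
    for _ in range(horizon):
        k = -(b * p * a) / (r + b * p * b)   # optimal feedback gain
        p = q + a * p * (a + b * k)          # Riccati cost-to-go update
        gains.append(k)
    gains.reverse()            # recursion runs backward in time
    return gains
```

For $a=b=q=r=1$ the gain converges to $-(\sqrt{5}-1)/2 \approx -0.618$, and the closed-loop factor $a + bk \approx 0.382$ is stable, which is exactly the disturbance-rejection property the tracking controller relies on.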
The control commands generated by the LQR controller are executed on the system over a time horizon of $H$ seconds (Line 9), and then a new image is obtained. This image is used by the perception module to predict a new waypoint, which is subsequently used by the planner to generate a new plan. This process is repeated until the robot reaches the goal position.
V. Training Procedure
We train the perception module with supervised learning, using automatically generated globally optimal trajectories as a source of supervision. To generate these trajectories, we assume that the location of all obstacles is known during training time. However, during test time no such assumption is made and the robot relies only on the image data and other on-board sensors.
Fig. 3: We compare the proposed model-based planning approach (WayPtNav) to the end-to-end (E2E) learning approach across various metrics on the test navigation tasks. (a) Success rate of each approach in reaching the target (higher is better): WayPtNav is 26% more successful compared to E2E learning. (b) Average time and standard deviation to reach the target (lower is better): WayPtNav on average takes two-thirds as much time as the E2E learning approach. (c, d) Average acceleration and jerk along the robot trajectory (lower is better): trajectories produced by WayPtNav are considerably less jerky than those produced by the E2E learning approach, a significant advantage on real robots.
V-A. Expert Policy
Even when the positions and shapes of all obstacles are known, generating a trajectory from a start state to the goal state that does not collide with any obstacles is still a challenging optimal control problem, especially when the obstacles are non-convex, as is often the case in cluttered indoor environments. To generate the expert trajectory, we define a cost function that trades off the distance from the target position against the distance to the nearest obstacle, and optimize for the waypoint such that the resultant spline trajectory minimizes this cost function. More specifically, given a map of obstacles and the goal position, we compute signed distance functions from the obstacles and from the goal. For computing the signed distance from the goal, we mask the obstacles to obtain the shortest collision-free path to the goal. The cost function penalizes the squared signed distance from the goal. If the vehicle trajectory comes within a certain distance of an obstacle, there is an additional cubic penalty proportional to the inverse signed distance from the obstacles. This ensures that the vehicle trajectory does not go too close to an obstacle (we empirically found that it is significantly harder to learn to accurately predict waypoints whose trajectories pass very close to obstacles, as the prediction error margin for the neural network is much smaller in such cases). The overall cost function is the sum of these penalties across the entire trajectory.
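The two penalty terms above can be sketched directly. Here `sd_goal` and `sd_obs` stand for the precomputed signed distance functions; the safety margin `d_safe` and the relative weighting are illustrative choices, not the paper's tuned values:

```python
def expert_cost(traj, sd_goal, sd_obs, d_safe=0.3):
    """Sketch of the expert's trajectory cost.

    traj:    list of (x, y) points along a candidate spline trajectory
    sd_goal: signed distance to the goal (obstacles masked in, so it
             follows the shortest collision-free path)
    sd_obs:  signed distance to the nearest obstacle (positive in free space)
    """
    cost = 0.0
    for p in traj:
        cost += sd_goal(p) ** 2                   # squared goal-distance penalty
        d = sd_obs(p)
        if d < d_safe:                            # too close to an obstacle:
            cost += (1.0 / max(d, 1e-3)) ** 3     # cubic inverse-distance penalty
    return cost
```

Summing over all trajectory points reproduces the structure described above: trajectories ending near the goal are cheap, and those grazing obstacles become prohibitively expensive.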
Given this cost function, we optimize for the waypoint using a random sampling-based approach. We sample candidate waypoints within the ground-projected field of view of the vehicle (projected assuming no obstacles), and only consider waypoints that are reachable from the current state of the vehicle while respecting the vehicle’s control bounds. The obtained optimal waypoints, the corresponding RGB images, and the egocentric target positions are used as training data for the CNN in our perception module. Note that even though we use a random sampling-based approach to obtain the optimal waypoint, gradient-based optimization schemes could also be used.
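The sampling step amounts to random shooting over a ground-projected sector in the robot frame. In this sketch the sector geometry and sample count are illustrative, and the paper's reachability filter based on control bounds is omitted:

```python
import math
import random

def sample_best_waypoint(cost_of_waypoint, fov_half_angle=0.7, max_range=3.0,
                         n_samples=500, seed=0):
    """Random-shooting waypoint optimization.

    Samples candidate waypoints uniformly over a circular sector
    (a stand-in for the ground-projected field of view), scores each
    with the provided cost function, and keeps the cheapest.
    """
    rng = random.Random(seed)
    best_wp, best_cost = None, float("inf")
    for _ in range(n_samples):
        r = max_range * math.sqrt(rng.random())        # uniform over sector area
        theta = rng.uniform(-fov_half_angle, fov_half_angle)
        wp = (r * math.cos(theta), r * math.sin(theta))
        c = cost_of_waypoint(wp)
        if c < best_cost:
            best_wp, best_cost = wp, c
    return best_wp, best_cost
```

Plugging in the trajectory cost from the expert policy (by rolling a spline to each candidate and scoring it) yields the supervision waypoints used for training.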
V-B. Data Collection
We use the Stanford large-scale 3D Indoor Spaces (S3DIS) dataset to generate our training and test data. The dataset consists of 3D scans (in the form of textured meshes) collected in 6 large-scale indoor areas that originate from 3 different office buildings. Scans from 2 buildings were used for training, and the robot was tested on scans from the 3rd building. We pre-process the meshes to compute the locations of all obstacles and the space traversable by the robot. This traversable space is used to sample arbitrary start and target positions, and these samples, along with the obstacle locations, are used to generate the expert supervision as outlined in Section V-A.
VI. Simulation Experiments
In this section, we present our simulation results and compare WayPtNav with a state-of-the-art E2E learning baseline. For E2E learning, the CNN is trained to directly output the control commands over the control horizon $H$. To generate the supervision for E2E learning, we generate the smooth control commands corresponding to the expert trajectory using an LQR controller. We use a control horizon of 1.5 s (corresponding to a visual feedback rate of roughly 0.7 Hz) in all our simulations.
All our models are trained with a single GPU worker using TensorFlow. We use an MSE loss on the waypoint prediction (respectively, on the control command prediction) for training the CNN in our perception module (respectively, for E2E learning). We use the ADAM optimizer to optimize the loss function with a batch size of 64. We train both networks for 35K iterations with a constant learning rate of 0.0001 and use weight decay to regularize the network.
We use ResNet-50, pre-trained for ImageNet classification, to encode the image input. We remove the top layer and use a downsampling convolution layer, followed by 5 fully connected layers with 128 neurons each, to regress to the optimal waypoint (or control commands for E2E learning). The image features obtained at the last convolution layer are concatenated with the egocentric target position and the current linear and angular speed before being passed to the fully connected layers (see Figure 2). During training, the ResNet layers are finetuned along with the fully connected layers to learn features relevant to the navigation task. We use standard techniques to avoid overfitting, including dropout following each fully connected layer except the last, and data augmentation (such as randomly distorting brightness and contrast, adding blur, and applying perspective distortion) at training time.
We train our agent on 125K data points generated by our expert policy on 2 buildings of the S3DIS dataset. We use a training/validation split and cross-validation to choose all our hyperparameters for supervised learning. We use the third, unseen building for testing and keep a held-out set of 80 test navigation goals on which we evaluate our approach and the baseline.
The test problems are sampled to include a variety of tasks, such as going around obstacles, going out of a room, or into another hallway. The same sampling procedure is used during training and testing. We measure performance after running the learned network for a maximum duration of 80 s. We report multiple metrics: the success rate (the agent succeeds if it reaches within 30 cm of the target position without colliding with any obstacle on the way), the average time taken to reach the target position for the successful episodes, and the average acceleration and jerk along the robot trajectory.
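The smoothness metrics can be computed from sampled positions by finite differences. A simple sketch (the paper's exact discretization may differ):

```python
import math

def traj_metrics(positions, dt):
    """Finite-difference estimates of average acceleration and jerk
    magnitudes along a trajectory.

    positions: list of (x, y) samples at uniform time spacing dt
               (at least 4 samples are needed for a jerk estimate)
    """
    vel = [((x2 - x1) / dt, (y2 - y1) / dt)
           for (x1, y1), (x2, y2) in zip(positions, positions[1:])]
    acc = [((u2 - u1) / dt, (v2 - v1) / dt)
           for (u1, v1), (u2, v2) in zip(vel, vel[1:])]
    jerk = [((a2 - a1) / dt, (b2 - b1) / dt)
            for (a1, b1), (a2, b2) in zip(acc, acc[1:])]
    mean_mag = lambda seq: sum(math.hypot(px, py) for px, py in seq) / len(seq)
    return mean_mag(acc), mean_mag(jerk)
```

A constant-velocity trajectory scores zero on both metrics, while the stop-and-go behavior of E2E policies shows up as large acceleration and jerk values.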
Success Rate: The success rates for the test tasks for WayPtNav as well as for E2E learning are reported in Figure 3. WayPtNav is able to outperform the E2E learning approach by roughly 25%, indicating that the proposed approach is more reliable at navigating the robot to its target. We also visualize the robot trajectories for the two approaches for some sample test problems in Figure 4.
Note that the top-view trajectories are plotted only for visualization purposes; the networks are trained only on first-person RGB images. As evident from the trajectories, WayPtNav enables the robot to navigate through narrow hallways and to make tight turns around obstacles and corners, whereas E2E learning is not able to do so consistently. Intuitively, in our approach, the CNN only needs to learn “where to go next” based on the current observed image, since the control profile to reach the next desired state is generated by a model-based planner. In E2E learning, the CNN additionally needs to learn “how to go” to that next desired position. The latter can be a harder learning problem from a control perspective, especially when the robot needs to make aggressive maneuvers such as tight turns around obstacles. We also test the two approaches with a smaller control horizon, allowing for faster visual feedback, but do not see any significant changes in the success rates.
Control Profile: In addition to a higher success rate, WayPtNav also guides the robot to the target more quickly than the E2E learning approach. We report the average time taken to reach the target position, along with the standard deviation, for the two approaches in Figure 3. For a fair comparison, we report these numbers only over the test tasks at which the E2E learning approach succeeds. WayPtNav conveys the robot to its target 33% faster than the E2E learning approach.
We also plot the linear velocity control profile produced by the two approaches for a sample goal in Figure 5.
Even though smooth control profiles generated by the expert policy were used as supervision for E2E learning, the learned control profile is discontinuous and jerky. On the other hand, since the CNN acts only as a high-level planner in WayPtNav, the control profile is still generated by a model-based planner and is smooth. We report the average acceleration and jerk over the trajectories for the two approaches in Figure 3. WayPtNav has significantly lower average acceleration and jerk. This has significant implications for real robots, since the consumed power grows with the average acceleration: for the same battery capacity, WayPtNav would allow the robot to travel roughly twice as far as E2E learning. To encourage smoothness of the E2E control profile, we also try adding a smoothing penalty to the E2E loss function; however, this results in a significant decline in the success rate.
Semantics Learning: One of the main objectives of this work is to enable the perception module to learn the navigation cues and semantics that might be helpful in planning in novel environments. In this section, we visualize some of the test cases and demonstrate how the proposed approach is indeed able to learn such cues. In the first test case, the robot starts inside a conference room and is required to go to a target position that is in a different room. The problem setup is shown in Figure 6. The blue dot represents the robot’s start position and the black arrow represents the robot’s heading. The center of the green area represents the target position.
We mark some intermediate points on the robot trajectory in Figure 6, where the robot is required to predict a waypoint, and visualize the corresponding RGB images that it observed (Figure 6). The robot is initially facing some chairs and, based on the observed image, predicts a waypoint to rotate in place and explore the environment. Upon rotating, the robot sees part of the door, infers that it needs to exit the current room through this door to reach the target position, and predicts a waypoint towards the door. Next, it produces a waypoint to go out of the door, and eventually reaches the target position. We also show the full trajectory of the robot (in red) from the start position to the target position, with the obstacles marked in context, in Figure 6.
As a second example, we consider the test case in Fig. 7 where the robot needs to go from Hallway 1 to Hallway 2. We show an intermediate point in the robot trajectory (Marked 1 in Fig. 7) where the robot needs to turn right into Hallway 2 to reach the target. We also show the corresponding (stylized) field-of-view (projected on the ground) of the robot (the blue cone in Fig. 7). The corresponding RGB image that the robot observed is shown in Figure 7. Based on the observed image, the robot produces the cyan waypoint and the bold red desired trajectory. Even though the robot is not able to see its next desired position (it is blocked by the door in the field-of-view), it can see part of Hallway 2 and is able to reason that a hallway typically leads to another hallway. The robot uses this semantic prior to produce a waypoint leading into Hallway 2 and eventually reaches the target.
VII. Hardware Experiments
We next test WayPtNav on a hardware testbed. For our experiments, we use a TurtleBot 2. We use the network trained in simulation (as described in Section VI) and deploy it directly on the TurtleBot without any finetuning or additional training. The experiments demonstrate that our approach can handle the dynamics mismatch between a real robot and the simplified model used for training in simulation. For state measurement, we use the on-board odometry sensors of the TurtleBot. Recall that we trained our networks with random perspective distortions applied to the input images; this makes the perception module robust to differences in camera pitch and field of view between simulation and the real robot. We test our framework on three different navigation tasks:
1) Navigation Through Cluttered Environments: In this experiment, our goal is to test whether WayPtNav can enable navigation through real-world cluttered environments. The task requires the robot to navigate through a very narrow hallway with bikes on one side, and then around furniture such as chairs and sofas to reach the goal. A stylized top view of the task is shown in Fig. 8, while Fig. 9 shows the third-person view of the starting configuration of the robot, and the first-person RGB images observed by the robot at some of the intermediate trajectory points marked in Fig. 8. The robot starts very close to the bikes on its right and consequently predicts a waypoint to the left. It then predicts a series of waypoints to go straight through the narrow hallway, until it reaches the chairs and desks at the end of the hallway. The robot then navigates precisely between the chairs and sofas to reach its target position.
Even though the dynamics model does not exactly capture the physics of a real robot (it does not include any mass and inertia effects for example), the feedback controller provided by the model-based planner compensates for these differences, and enables the robot to closely track the desired trajectory while ensuring trajectory smoothness. This is crucial for generalizing from simulation to a physical robot.
2) Leveraging Navigation Cues In Real-World Environments: The goal of this experiment is to demonstrate that WayPtNav can leverage real-world visual semantics to efficiently explore the environment and complete the navigation task. In this task, the robot starts inside a room facing a wall and the target position is outside the room (Figure 8).
The robot first rotates to explore the room, and realizes that the target is outside the room. It then makes its way to the door to exit the room, to eventually reach the desired target position. Our policies automatically learn to pick up such semantic cues (that the room can be exited through a doorway), even though they were never explicitly trained to look for them. This allows for efficient goal-reaching behavior in novel environments.
3) Navigation In Dynamic Environments: Though in this work we focus only on static environments, our policy is reactive and can adapt to changes in the environment. We test this behavior by changing the environment as the robot executes its policy. The robot is tasked to go straight for 6 m (Figure 8) and must go around the brown chair. As the robot executes its policy, we move the chair to block its path (we show the moved chair locations in green, purple, and yellow, and mark the corresponding times at which the chair was moved on the robot's trajectory in the same colors). We decrease the control horizon to 0.5 s to allow for fast adaptation. The robot indeed reacts quickly to the moving chair and is able to re-plan a trajectory around the moved chair to reach its target position without any collisions.
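The reactive behavior described above can be summarized as a receding-horizon loop: re-plan from the latest observation every control horizon, so any change in the scene is picked up at the next re-planning step. The sketch below is a schematic stand-in; `predict_waypoint` and `track` are hypothetical placeholders for the learned perception network and the model-based planner/controller, respectively.

```python
import numpy as np

def predict_waypoint(state, goal):
    """Stand-in for the learned network: propose a waypoint up to 1 m
    toward the goal. (The real system predicts waypoints from images.)"""
    d = goal - state
    n = np.linalg.norm(d)
    return state + d / n * min(n, 1.0)

def track(state, waypoint, steps):
    """Stand-in for the model-based planner/controller: move along a
    straight line toward the waypoint for `steps` control ticks."""
    for _ in range(steps):
        d = waypoint - state
        n = np.linalg.norm(d)
        if n < 1e-6:
            break
        state = state + d / n * min(n, 0.05)
    return state

def navigate(start, goal, control_horizon_s=0.5, dt=0.05, max_replans=100):
    """Receding-horizon loop: re-plan every `control_horizon_s` seconds
    from the current state, so a moved obstacle is reflected in the
    very next waypoint prediction."""
    state = np.asarray(start, float)
    steps = int(control_horizon_s / dt)
    for _ in range(max_replans):
        if np.linalg.norm(goal - state) < 0.1:
            return state
        wp = predict_waypoint(state, goal)
        state = track(state, wp, steps)
    return state

end = navigate(np.zeros(2), np.array([6.0, 0.0]))
```

Shortening the control horizon, as in the experiment above, trades a little trajectory smoothness for faster reaction: the policy consults its sensors more often, so a chair moved mid-execution is avoided at the next re-plan.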
Comparison with the End-to-End Learning Approach: We repeat Experiments 1 (Navigation Through Cluttered Environments) and 2 (Leveraging Navigation Cues in Real-World Environments) for 5 trials each using both the proposed approach and the E2E method. The obtained results are shown in Figure 11. WayPtNav succeeds in 9 of 10 trials, failing only once in Experiment 2. The E2E approach, on the other hand, fails all 5 trials of Experiment 2 and succeeds in 2 of 5 trials of Experiment 1. Closer analysis of the failure cases supports our simulation-based conclusion that the E2E method struggles to predict aggressive, yet nuanced control profiles, even more so on a real robot.
We also compare the time taken to reach the target, the average acceleration, and the average jerk on the two trials in which the E2E method succeeded. Consistent with our simulation-based results, WayPtNav produces trajectories that are approximately 25 percent more efficient at reaching the goal than those of the E2E method. We note that the TurtleBot platform has an internal, low-level PID controller that naturally enforces velocity smoothing. Even so, WayPtNav achieves an average acceleration approximately one-third, and an average jerk approximately one-fifth, that of the E2E method, resulting in much smoother and more efficient trajectories on a real robot.
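The smoothness metrics above can be computed from a trajectory sampled at fixed intervals by repeated finite differencing. The function below is a plausible version of such a computation; the exact metric definitions used in our evaluation may differ.

```python
import numpy as np

def smoothness_metrics(positions, dt):
    """Mean acceleration and jerk magnitudes of a trajectory sampled
    every dt seconds, via repeated finite differences."""
    v = np.diff(positions, axis=0) / dt   # velocity between samples
    a = np.diff(v, axis=0) / dt           # acceleration
    j = np.diff(a, axis=0) / dt           # jerk
    return (np.linalg.norm(a, axis=1).mean(),
            np.linalg.norm(j, axis=1).mean())

# Sanity check: a constant-velocity straight line has zero acceleration
# and zero jerk.
t = np.arange(20) * 0.1
straight = np.stack([0.5 * t, np.zeros_like(t)], axis=1)
acc, jerk = smoothness_metrics(straight, dt=0.1)
```

A controller that issues abrupt velocity changes, as the E2E policy tends to, inflates both numbers, which is what the three-fold and five-fold gaps reported above capture.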
VIII. Conclusion and Future Work
We propose a model-based navigation approach that employs learning to leverage semantic cues for goal-driven navigation in novel indoor environments. Our approach reaches goals more reliably and more efficiently, and produces smoother trajectories, than an end-to-end learning approach. The use of a model-based feedback controller allows our approach to successfully generalize from simulation to physical robots.
Several interesting future directions emerge from this work. Our approach assumes perfect robot state estimation and employs a purely reactive policy. These assumptions and choices may not be optimal, especially for long-range tasks. Incorporating spatial or visual memory to address these limitations would be a fruitful future direction, as would a more detailed study of navigation in dynamic environments.
Acknowledgments
This research is supported in part by the DARPA Assured Autonomy program under agreement number FA8750-18-C-0101, by NSF under the CPS Frontier project VeHICaL (1545126), by NSF grants 1739816 and 1837132, by the UC-Philippine-California Advanced Research Institute under project IIID-2016-005, by SRC under the CONIX Center, and by Berkeley Deep Drive.