Combining Optimal Control and Learning for Visual Navigation in Novel Environments

03/06/2019 ∙ by Somil Bansal, et al. ∙ 10

Model-based control is a popular paradigm for robot navigation because it can leverage a known dynamics model to efficiently plan robust robot trajectories. However, it is challenging to use model-based methods in settings where the environment is a priori unknown and can only be observed partially through on-board sensors on the robot. In this work, we address this short-coming by coupling model-based control with learning-based perception. The learning-based perception module produces a series of waypoints that guide the robot to the goal via a collision-free path. These waypoints are used by a model-based planner to generate a smooth and dynamically feasible trajectory that is executed on the physical system using feedback control. Our experiments in simulated real-world cluttered environments and on an actual ground vehicle demonstrate that the proposed approach can reach goal locations more reliably and efficiently in novel, previously-unknown environments as compared to a purely end-to-end learning-based alternative. Our approach is successfully able to exhibit goal-driven behavior without relying on detailed explicit 3D maps of the environment, works well with low frame rates, and generalizes well from simulation to the real world. Videos describing our approach and experiments are available on the project website.



There are no comments yet.


page 1

page 5

page 6

page 7

page 8

page 9

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Autonomous robot navigation, that is, how to convey a robot from one location to another, is one of the most fundamental and well-studied problems in robotics. This problem has been tackled in a number of different research communities over the last five decades. In the control community, robot navigation is studied typically for a known agent (or system) in a known environment, meaning that a dynamics model of the agent and a geometric map of the environment are available. Under these assumptions, optimal-control schemes can be used to obtain smooth, collision-free trajectories to reach the goal. Moreover, the obtained controller has many favorable properties: near-perfect generalization to variations in the underlying system, near-zero sample complexity, and provable robustness to system and environmental noise. This is the basis of how a number of real physical systems are controlled, such as airplanes and industrial robots. On the flip side, explicit knowledge of the environment severely limits the applicability of such techniques. Successful practical deployment either requires extensive environment instrumentation, or use of sophisticated sensor suites that can accurately represent the environment from on-board sensors.

Fig. 1: Overview: We consider the problem of navigation from a start position to a goal position. Our approach (WayPtNav) consists of a learning-based perception module and a dynamics model-based planning module. The perception module predicts a waypoint based on the current first-person RGB image observation. This waypoint is used by the model-based planning module to design a controller that smoothly regulates the system to this waypoint. This process is repeated for the next image until the robot reaches the goal.
The first two authors contributed equally to this paper. University of California, Berkeley. Facebook AI Research. Project website:

On the other hand, in the learning community, robot navigation is typically studied for an unknown agent in an unknown environment. No assumptions are made about the robot or the underlying environment. Policies are learned to directly map on-board sensor readings to motor torques (or control commands in general) in an end-to-end manner using trial and error. This is done by reinforcing random roll-outs that lead to favorable outcomes (as measured through some pre-defined reward functions). The lack of any assumptions makes this research very attractive: policies can be learned without knowing anything about the underlying system, and policies can work in un-instrumented environments merely via partial views of the environment as obtained through inexpensive on-board sensors. However, policies learned using such techniques are extremely specialized to the system they were trained on. They fail to generalize to even minor variations (such as mass and inertia) of the underlying system and fail to transfer even between similar agents. Furthermore, learning such policies is extremely sample inefficient requiring millions of samples to acquire desired behavior [36].

In this paper, we study robot navigation in static environments under the assumption of perfect robot state measurement. We make the crucial observation that most interesting problems involve a known system in an unknown environment. This observation motivates the design of a factorized approach that uses learning to tackle unknown environments and leverages optimal control using known

system dynamics to produce smooth locomotion. More specifically, we train Convolutional Neural Network (CNN) based high-level policies, that incrementally use the current RGB image observations to produce a sequence of intermediate states or

waypoints. These waypoints are produced to guide the robot to the desired target location via a collision-free path in previously unknown environments, and are used as targets for a model-based optimal controller to generate smooth, dynamically feasible control sequences to be executed on the robot. Our approach, which we refer to as WayPtNav (WayPoint-based Navigation) here on, is summarized in Fig. 1.

Our factorized approach to robot navigation benefits from the advantages of classical control and learning-based approaches in a way that addresses their individual limitations. Learning leads to generalization in unknown environments without the need for expensive sensors or elaborate environment instrumentation. It can leverage statistical regularities to make predictions about the environment from partial views (RGB images) of the environment. Knowledge of the underlying dynamics for control leads to generalization to systems with different physical properties (such as mass and moment of inertia). At the same time, use of feedback-based control leads to smooth, continuous, and efficient trajectories that are naturally robust to noise in actuation. Furthermore, this factorization also takes the burden off the learning system. Learning now does not need to spend interaction samples to learn about the dynamics of the underlying system, and can exclusively focus on dealing with generalization to unknown environments. To summarize, our key contributions are:

  • we design an approach to robot navigation that combines a learning-based perception module and a dynamics model-based planning module to robustly maneuver the robot in novel, cluttered real environments using only a single on-board RGB camera,

  • through simulations and experiments on a ground mobile robot system, we demonstrate that our factorized approach is better and more efficient at reaching the goals, results in smoother trajectories, and generalizes well across different systems, as compared to End-to-End (E2E) learning approaches.

We review related work in Section II, describe the problem setup in Section III, detail our approach in Section IV and V, and describe the results of our experiments in Section VII.

Ii Related Work

Our work builds upon work in optimal control, and learning-based techniques for navigation. We discuss relevant works in these two areas. As our proposed solution seeks to combine model-free and model-based approaches in a hierarchical manner, we also discuss relevant work in hierarchical RL, and works that seek to combine model-free and model-based learning. As our work tackles navigation, we also provide an overview of navigation in classical robotics.

Optimal Control. Optimal control tackles the optimization problem of controlling a known system in a known environment to minimize a known cost function. A number of schemes have been proposed to solve optimal control problems: random shooting, collocation method, dynamic programming, and trajectory optimization. Certain choices of the system dynamics and the cost function simplify the optimization. For example, if the dynamics are linear, and the cost function is quadratic, optimization can be done efficiently using dynamic programming [5] via Linear Quadratic Regulator (LQR) [6]. Researchers have also extended this efficiency to non-linear systems, by iteratively solving a series of LQR problems on a linearized version of the system around the current trajectory [28]. These optimizations scheme can be made even more efficient for the class of differentially flat systems by using splines for the state trajectories [24, 44, 30].

Classical Robot Navigation. Classical robotics factorizes the problem of robot navigation [41, 12] into sub-problems of mapping and localization [41, 17], path planning [25, 22, 8]

and trajectory tracking via optimal control as described above. Mapping refers to estimating the 3D structure of the world using a collection of RGB images, depth images, or LiDAR scans. Path planning involves computing a global plan to convey the robot to the desired location while avoiding the mapped obstacles. Finally, trajectory tracking refers to generating control commands to track the planned trajectory. Studying these three problems separately has lead to incredible progress. However, intermediate representations between these modules are typically geometric in nature. For example, maps are typically purely geometric entities devoid of any semantics: nothing is known unless explicitly observed, and there is no notion of navigational affordances (that to go far away, I should step into the hallway, etc.). To overcome these limitations, task planning approaches have been proposed that leverage object and region specific semantics during planning

[13, 4, 33, 7, 9]

; however, often hand-tuned heuristics are used to leverage these semantics in planning. Our work follows a similar decomposition of the problem. However, instead of obtaining waypoints via geometry-based reasoning or hand-tuning the semantics, we employ learning. Given single images of the environment, we directly learn to predict good intermediate waypoints that will convey the robot to the desired target location without causing any collisions. Use of learning alleviates need for explicit estimation of the exact specifics of the environment, and allows our method to automatically pick up semantic navigation cues that are relevant for effective navigation from commodity on-board RGB images.

End-to-End Learning for Navigation. There has been a recent interest in employing end-to-end learning for training policies for goal-driven navigation [45, 16, 23]. The typical motivation of this work is to incorporate semantics and common-sense reasoning into navigation policies. While Zhu et al. [45] only learn policies that work well in training environments, Gupta et al. [16] and Khan et al. [23] design policies that generalize to previously unseen environments. Most such works [16, 45, 23] abstract out low-level control and work with a set of macro-actions (going forward cm, turning ). Such ignorance of the dynamics of the underlying robot, results in jerky and inefficient policies that exhibit stop and go behavior. Numerous other works have tackled navigation in synthetic game environments [31, 38, 35], and completely ignore considerations of real-world deployment of such policies, such as dynamics and state estimation.

Researchers have also employed learning to tackle locomotion problems [19, 14, 37]. These works learn policies for collision-avoidance, i.e. how to move around in an environment without colliding. Kahn et al. [19] use motion primitives, Gandhi et al. [14], and Sadeghi and Levine [37] use velocity control for locomotion. Muller et al. [32] focus on lane keeping and making turns in outdoor environments using a scene segmentation module which is used to generate the next desired position for the vehicle based on a user command. While all of these works marry low-level control to learning-based systems, none of these works tackle the problem of goal-reaching which is the focus of this work.

Hierarchical Reinforcement Learning.

Our work is also related to hierarchical RL [3, 10, 43, 27], where a policy is factorized into sub-policies that are controlled via a master policy. While the intuition behind such a factorization is similar to ours, the focus in hierarchical RL is on jointly learning policies at both these levels from trial-and-error interactions with the environment, rather than employing known models for the low-level policies as is the case in our work here.

Combining Optimal Control and Learning. A number of papers seek to combine the best of learning with optimal control. Drews et al. [11] learn a cost function from monocular images which is used with Model Predictive Control (MPC) for aggressive driving on a race track. Kaufmann et al. [20, 21] use learning to predict waypoints that are used with model-based control for drone racing. While these factorized approaches are similar in spirit to ours, the focus of these works is on aggressive control in training environments. In contrast, we seek to learn goal-driven policies that work well in completely novel testing environments. Other works such as that from Levine et al. [26] and Pan et al. [34] combine optimal control and end-to-end learning, by training neural network policies to mimic the output control values based on the raw images. We explicitly compare to such an approach in this work.

Fig. 2: Overall Proposed Framework: Our approach to navigation consists of a learning-based perception module and a dynamics model-based planning module. The perception module consists of a CNN that outputs a desired next state or waypoint. This waypoint is used by the model-based planning module to design a controller to smoothly regulate the system to the waypoint.

Iii Problem Setup

In this work, we study the problem of autonomous navigation of a ground vehicle in previously unknown indoor environments. We assume that odometry is perfect (i.e., the exact vehicle state is available), and that the environment is static. Dealing with imperfect odometry and dynamic environments are problems in their own right, and we defer them to future work.

We model our ground vehicle as a three-dimensional non-linear Dubins car with the following dynamics:


where is the state of vehicle, is the position, is the heading, is the speed, and is the turn rate at time . The input (control) to the system is . The linear and angular speeds and are bounded within and respectively. We use a discretized version of the dynamics in Eqn. (1) for all planning purposes. The robot is equipped with a monocular RGB camera mounted at a fixed height and oriented at a fixed pitch.

The goal of this paper is to learn control policies for this robot for navigation in novel environments that it has not encountered before. More specifically, we study goal-oriented navigation tasks: the robot needs to go to a target position, , specified in the robot’s coordinate frame (e.g., m forward, m left), without colliding with any obstacles. These tasks are to be performed in novel environments whose map or topology is not available to the robot. In particular, at a given time step , the robot with state receives as input the RGB image of the environment , , and the target position expressed in the current coordinate frame of the robot. The objective is to obtain a control policy that uses these inputs to guide the robot to the target as quickly as possible.

1: Goal location 
2:for  to  do
3:      Measured robot pose
4:      Measured linear, angular speed
5:     Every steps do Replan every steps      
6:          Goal location in the robot’s coordinate frame
7:          Predict next waypoint
8:          Plan spline-based smooth trajectory
9:          Tracking controller      
10:      Apply control
11:end for
Algorithm 1 Model-based Navigation via Learned Waypoint Prediction

Iv Model-based Learning for Navigation

We use a waypoint-based approach to navigation (WayPtNav). WayPtNav makes use of two submodules: perception and planning. The perception module uses current RGB observation with a CNN to predict the desired next position that the robot should navigate to. The planning module then uses the dynamics model of the robot to plan a smooth (minimum-jerk) trajectory to convey the robot to the predicted waypoint. Finally, we use a dynamics model-based linear feedback controller to track this trajectory. The WayPtNav framework is demonstrated in Figure 2 and summarized in Algorithm 1.

Iv-a Perception Module

The goal of the perception module is to analyze the image and provide high-level guidance to the planning module. We implement the perception module using a CNN that takes as input a pixel RGB image, , captured from the onboard camera, the target position, , specified in the vehicle’s current coordinate frame, and vehicle’s current linear and angular speed, , and outputs a waypoint , where is the desired next state of the robot (Line 6 in Algorithm 1). The planning module uses this desired next state to design a trajectory as well as a feedback controller to convey the system to the desired state over a time horizon of (also referred to as the control horizon). The entire process is repeated at time . The network is trained by imitating automatically computed expert sequence of waypoints, as explained in Section V.

Intuitively, the network can use semantic cues (such as presence of scene surfaces like floor and walls, common furniture objects like tables and chairs, exit points like doors) alongside the learned priors about the shapes of such common objects to generate an estimate of where it should roughly go next, even when the desired position is not visible in the current RGB image. This allows the robot to do a more guided and efficient exploration in the novel environments based on its prior experience. Qualitative results in Section VI show how our perception module is able to make predictions about where to go next even though the waypoint it outputs has not been observed explicitly.

Iv-B Planning Module

Given a waypoint , and the current linear and angular speed , the planning module uses the system dynamics in Eqn. (1) to design a smooth trajectory from the current vehicle state to the waypoint. In this work, we represent the and trajectories using a third-order spline. Since the heading of the vehicle can be determined from the and trajectories (i.e., Dubins car model in Eqn. (1) is differentially flat with outputs and ), the generated spline trajectories are also dynamically feasible for Eqn. (1). Thus a spline-based planner provides the desired state and control trajectories, , that robot should follow for the time horizon to reach the desired waypoint (Line 7).

Since the generated splines are third-order, the generated position, speed and acceleration trajectories are smooth and differentiable. This is often an important consideration for real robots, where jerky trajectories might lead to compounding sensor errors, poor tracking, or even hardware damage [15]. The parameters of these trajectories can be determined by using the boundary conditions, i.e., the current state, the final state and the current speed. This corresponds to solving a set of linear equations and can be done efficiently onboard. For more details on spline trajectory generation for Dubins car, we refer the interested readers to [44]. Note that while we use splines in this work, other model-based planning schemes, such as MPC, can also be used for trajectory planning.

To track the generated trajectory , we design a LQR [6] based linear feedback controller, (Line 8). Here and represent the feed-forward and feedback terms respectively. The LQR controller is obtained using the dynamics in Eqn. (1), linearized around the trajectory generated by the planning module. LQR is a widely used feedback controller in robotic systems, and has been shown to reject external disturbances and make planning robust to mismatches between the dynamics model and the actual system [42, 29]. The presence of this feedback controller in the planning module also allows us to deploy the proposed framework directly on a real robot, even when it is trained only in simulation (provided that the perception module is robust to the visual differences between simulated and real-world images) as we demonstrate in Section VII, even though the model in Eqn. (1) may not capture the true physics of the robot.

The control commands generated by the LQR controller are executed on the system over a time horizon of seconds (Line 9), and then a new image is obtained. This image is used by the perception module to predict a new waypoint, that is subsequently used by the planner to generate a new plan. This process is repeated until the robot reaches the goal position.

V Training Procedure

We train the perception module with supervised learning, using automatically generated globally optimal trajectories as a source of supervision. To generate these trajectories, we assume that the location of all obstacles is known during training time. However, during test time no such assumption is made and the robot relies only on the image data and other on-board sensors.

Success (%)
Average Time to Reach Target (s)
Average Acceleration ()
Average Jerk ()
Fig. 3: Quantitative Comparisons in Simulation:

We compare the proposed model-based planning approach (WayPtNav) to the end-to-end (E2E) learning approach across various metrics for the test navigation tasks. (a) Success rate of each approach in reaching the target (higher is better). WayPtNav is 26% more successful compares to E2E learning (b) Average time and standard deviation to reach the target (lower is better). WayPtNav on average takes 2/3rd as much time as the E2E learning approach to reach the target. (c, d) Average acceleration and jerk along the robot trajectory (lower is better). Trajectories produced by the WayPtNav are considerably less jerky than ones produced by E2E learning approach. This is a significant advantage on real robots.

V-a Expert policy

Even when the positions and shapes of all obstacles are known, generating a trajectory from a starting state to the goal state that does not collide with any obstacles is still a challenging optimal control problem, especially when the obstacles are not convex, which is often the case with the cluttered indoor environments. To generate the expert trajectory, we define a cost function that trades-off the distance from the target position and the nearest obstacle, and optimize for the waypoint such that the resultant spline trajectory minimizes this cost function. More specifically, given a map of obstacles and the goal position, we compute the signed distance functions from the obstacles and the goal. For computing the signed distance from the goal, we mask the obstacles to get the shortest collision-free path to the goal. In the cost function, we penalize the squared signed distance from the goal. If the vehicle trajectory is within a certain distance of an obstacle, there is an additional cubic penalty proportional to the inverse signed distance from the obstacles. This is to make sure that the vehicle trajectory does not go too close to the obstacle (we empirically found that it is significantly harder to learn to accurately predict the waypoints for which the vehicle trajectory goes too close to the obstacles, as the prediction error margin for the neutral network is much smaller in such a case). The overall cost function is given by the sum of these penalties across the entire trajectory.

Given this cost function, we optimize for the waypoints using a random sampling-based approach. For random sampling, we sample the waypoints within the ground-projected field-of-view of the vehicle (ground projected assuming no obstacles). We only consider the waypoints that are reachable from the current state of the vehicle while respecting the vehicle’s control bounds. The obtained optimal waypoints, the corresponding RGB images, and the egocentric target positions are used as the training data for the CNN in our perception module. Note, even though we use a random sampling-based approach to obtain the optimal waypoint, other gradient-based optimization schemes can also be used.

V-B Data Collection

We use the Stanford large-scale 3D Indoor Spaces (S3DIS) dataset to generate our training and test data [2]. The dataset consists of 3D scans (in the form of textured meshes) collected in 6 large-scale indoor areas that originate from 3 different office buildings. Scans from 2 buildings were used for training and the robot was tested on scans from the 3rd building. We pre-processed the meshes to compute the location of all obstacles, and the space traversable by the robot. This traversable space was used to sample arbitrary start and target positions, and these samples along with the location of obstacles were used to generate the expert supervision as outlined in Section V-A.

Vi Simulation Experiments

In this section, we present our simulation results and compare WayPtNav with a state-of-the-art E2E learning baseline. For E2E learning, the CNN is trained to directly output the control commands over the control horizon . To generate the supervision for E2E learning, we generate the smooth control commands corresponding to the expert trajectory using a LQR controller. We use a control horizon of 1.5s (corresponds to a frame rate of roughly 0.7 Hz) in all our simulations.

Fig. 4: Trajectory Visualization: We visualize the trajectories produced by the model-based planning approach (top row) and the end-to-end (E2E) learning approach (bottom row) for 5 sample test tasks. The E2E learning approach struggles to navigate around the tight corners or narrow hallways, whereas WayPtNav is able to produce a smooth, collision-free trajectory to reach the target position.
Fig. 5: Control Profiles: We visualize the linear speed control profile produced by our proposed approach (WayPtNav) and the end-to-end (E2E) learning approach for a sample test task. Both approaches are able to reach the target position; however, WayPtNav takes only 10s to reach the target whereas the E2E learning approach takes about 17s. Moreover, the control profile produced by the E2E learning approach is significantly more jerky than WayPtNav, which is often concerning for real robots as it can lead to significant errors in sensors and can cause hardware damage. Moreover, jerky control profiles are power inefficient and can reduce the run time of a robot.

All our models are trained with a single GPU worker using TensorFlow 


. We use MSE loss on the waypoint prediction (respectively on the control command prediction) for training the CNN in our perception module (respectively for E2E learning). We use ADAM optimizer to optimize the loss function with a batch size of 64. We train both networks for 35K iterations with a constant learning rate of 0.0001 and use a weight decay of

to regularize the network.

We use ResNet-50 [18]

, pre-trained for ImageNet Classification, to encode the image input. We remove the top layer, and use a downsampling convolution layer, followed by 5 fully connected layers with 128 neurons each to regress to the optimal waypoint (or control commands for E2E learning). The image features obtained at the last convolution layer are concatenated with the egocentric target position and the current linear and angular speed before passing them to the fully connected layers (see Figure


). During training, the ResNet layers are finetuned along with the fully connected layers to learn the features that are relevant to the navigation tasks. We use standard techniques used to avoid overfitting including dropout following each fully connected layer except the last (with dropout probability

) [40], and data augmentation (such as randomly distorting brightness, contrast, adding blur, perspective distortion) at training time [39].

We train our agent on 125K data points generated by our expert policy on 2 buildings of the S3DIS dataset. We use an , training, validation split and use cross validation to choose all our hyper parameters for supervised learning. We choose the third unseen building for testing and keep a held out set of 80 test navigational goals on which we evaluate our approach and baseline.

The test problems are sampled to include a variety of tasks, such as going around obstacles, going out of the room, or into another hallway. The same sampling procedure is used during training and testing. We measure performance after running the learned network for a maximum duration of 80s. We report multiple metrics, the success rate (the agent succeeds if it reaches within 30cm of the target position without colliding with any obstacle on the way), the average time taken to reach the target position for the successful episodes, and the average acceleration and jerk along the robot trajectory.

Success Rate: The success rates for the test tasks for WayPtNav as well as for E2E learning are reported in Figure 3. WayPtNav is able to outperform the E2E learning approach by roughly 25%, indicating that the proposed approach is more reliable at navigating the robot to its target. We also visualize the robot trajectories for the two approaches for some sample test problems in Figure 4.

Top view (for vis. only)
Image at Position 1
Image at Position 2
Image at Position 3
Image at Position 4
Fig. 6: We visualize the trajectory as well as the observed RGB images (at the marked points) for WayPtNav for a sample test task: the robot needs to go from the current room into another room. Our method is able to leverage semantic cues to exit the room through the doorway to get to the goal location. Such semantics cues enable the robot to do a more efficient and guided exploration in novel environments.
Top view (for vis. only)
Image at Position 1
Fig. 7: We visualize the trajectory as well as the observed RGB images (at the marked point) for our method (WayPtNav) for a sample test task. In this task, the robot needs to go from one hallway (Hallway 1) into another (Hallway 2). Even though the entire Hallway 2 is not explicitly visible, the robot is able to reason that a hallway typically leads into another hallway, and uses this statistical regularity to efficiently reach its target position.

Note that the top-view trajectories are plotted only for visualization purposes, the networks are trained only on first-person RGB images. As evident from the trajectories, WayPtNav enables the robot to navigate through narrow hallways, to make tight turns around the obstacles and corners, whereas E2E learning is not able to do so consistently. Intuitively, in our approach, the CNN only needs to learn “where to go next” based on the current observed image, since the control profile to go to the next desired state is generated by a model-based planner. On the other hand, in E2E learning, the CNN additionally needs to learn “how to go” to that next desired position. The latter can be a harder learning problem from a control perspective, especially when the robot needs to make aggressive maneuvers such as making tight turns around obstacles. We also test the two approaches on a smaller control horizon of and thus allowing for a faster visual feedback, but we do not see any significant changes in the success rates.

Control Profile: In addition to a higher success rate, WayPtNav also guides the robot to the target more quickly than the E2E learning approach. We report the average time taken to reach the target position along with the standard deviation for the two approaches in 3. For a fair comparison, we report these numbers only over the test tasks at which the E2E learning approach succeeds. WayPtNav conveys the robot 33% faster to its target as compared to the E2E learning approach.

We also plot the linear velocity control profile produced by the two approaches for a sample goal in Figure 5.

Even though smooth control profiles generated by the expert policy were used as supervision for E2E learning, the learned control profile is discontinuous and jerky. On the other hand, since the CNN only acts as a high-level planner in WayPtNav, the control profile is still generated by a model-based planner and is smooth. We report the average acceleration and jerk over the trajectories for the two approaches in Figure 3 and 3 respectively. WayPtNav has significantly lower average acceleration and jerk. This has a significant implication for actual robots since the consumed power is directly proportional to the average acceleration. As an example, for the same battery capacity, WayPtNav will allow the robot to travel roughly twice as much as compared to E2E learning. To facilitate the smoothness of the control profile, we also try adding a smoothing penalty in the loss function for E2E learning; however, we see a significant decline in the success rate as a result.

Semantics Learning: One of the main objectives of this work is to enable the perception module to learn the navigation cues and semantics that might be helpful in planning in novel environments. In this section, we visualize some of the test cases and demonstrate how the proposed approach is indeed able to learn such cues. In the first test case, the robot starts inside a conference room and is required to go to a target position that is in a different room. The problem setup is shown in Figure 6. The blue dot represents the robot’s start position and the black arrow represents the robot’s heading. The center of the green area represents the target position.

We mark some intermediate points on the robot trajectory in Figure 6 where it is required to predict a waypoint, and visualize the corresponding RGB images that it observed in Figures 6-6. The robot is initially facing some chairs (Fig. 6), and based on the observed image, it predicts a waypoint to rotate in place to explore the environment. Upon rotation, the robot sees part of the door (Fig. 6) and is able to learn that to go to the target position it needs to exit the current room through this door and predicts a waypoint to go towards the door (Fig. 6). Next, it produces a waypoint to go out of the door (Fig. 6), and eventually reaches the target position. We also show the full trajectory (the red plot) of the robot from the start position to the target position as well as mark the obstacles in context in Figure 6.

Experiment 1
Experiment 2
Experiment 3
Fig. 8: Hardware Experiment Visualizations: We show stylized top-views of the three experiment settings. The starting position of the robot is given by the blue dot and the target position is the center of the green area. The trajectory of the robot is shown in red. (a) In experiment 1, the robot needs to go through a narrow hallway followed by a cluttered area to reach the target position. The experiment setting requires a precise control of the robot to avoid the obstacles and the proposed approach is able to reach the target while maintaining a smooth trajectory at all times. We also mark some intermediate points at which we visualize the observed RGB images in Figure 9. (b) In experiment 2, the robot starts inside a room and based on its prior experience during training learns that to exit the room it needs to go through a door, and eventually reaches its target position in the hallway. (c) In experiment 3, we move the chair to block the robot’s path as it is traversing to its target. The robot is successfully able to react to the moving obstacle and replan a collision-free trajectory to the target. See project website for videos of the execution.
Third-person view.
RGB image at position 1.
RGB image at position 2.
RGB image at position 3.
RGB image at position 4.
Fig. 9: Hardware Experiment Visualizations: We visualize third-person view of the initial configuration of our experiment, as well as the RGB image observed by the robot to predict the waypoint at some of the intermediate points on its trajectory to the goal. These intermediate points are marked in Figure 8. Even though the environment is cluttered, the proposed approach is successfully able to convey the robot to the target position.

As a second example, we consider the test case in Fig. 7 where the robot needs to go from Hallway 1 to Hallway 2. We show an intermediate point in the robot trajectory (Marked 1 in Fig. 7) where the robot needs to turn right into Hallway 2 to reach the target. We also show the corresponding (stylized) field-of-view (projected on the ground) of the robot (the blue cone in Fig. 7). The corresponding RGB image that the robot observed is shown in Figure 7. Based on the observed image, the robot produces the cyan waypoint and the bold red desired trajectory. Even though the robot is not able to see its next desired position (it is blocked by the door in the field-of-view), it can see part of Hallway 2 and is able to reason that a hallway typically leads to another hallway. The robot uses this semantic prior to produce a waypoint leading into Hallway 2 and eventually reaches the target.

Vii Hardware Experiments

We next test WayPtNav on a hardware testbed. For our experiments, we use a TurtleBot 2. We use the network trained in simulation (as described in Section VI) and deploy it directly on the Turtlebot without any finetuning or additional training. The experiments demonstrate that our approach can handle dynamics mismatches between a real robot and the simplified model that is used for training in simulation. For state measurement, we use the on-board odometry sensors on the TurtleBot. Recall, we trained our networks with random perspective distortions applied to input images. This makes the perception module robust to differences in camera pitch and field-of-view between simulation and the real robot. We test our framework on three different navigation tasks:

1) Navigation Through Cluttered Environments: In this experiment, our goal is to test if WayPtNav can enable navigation through real-world cluttered environments. The target of the robot requires it to navigate through a very narrow hallway with bikes on one side, and then around furniture objects like chairs and sofas to reach the goal. A stylized top-view description of the task is shown in Fig. 8, while Fig. 9 shows the third-person view of the starting configuration of the robot, and the first-person RGB images observed by the robot at some of the intermediate trajectory points marked in Fig. 8. The robot starts very close to the bikes on its right and consequently predicts a waypoint to the left. It then predicts a series of waypoints to go straight through the narrow hallway, till it reaches the chairs and desks at the end of the hallway. The robot then navigates precisely between the chairs and sofas to reach its target position.

We also show the actual trajectory of the robot and the desired trajectory as predicted by our dynamics model in Eqn. (1) in Fig. 10.

Fig. 10: Robot’s actual trajectory as compared to the desired trajectory (as obtained via planning using the dynamics fromEqn. (1)). Corrections made by the feedback controller, enables tracking of the desired trajectory to an accuracy of 4cm on the actual robot.

Even though the dynamics model does not exactly capture the physics of a real robot (it does not include any mass and inertia effects for example), the feedback controller provided by the model-based planner compensates for these differences, and enables the robot to closely track the desired trajectory while ensuring trajectory smoothness. This is crucial for generalizing from simulation to a physical robot.

2) Leveraging Navigation Cues In Real-World Environments: The goal of this experiment is to demonstrate that WayPtNav can leverage real-world visual semantics to efficiently explore the environment and complete the navigation task. In this task, the robot starts inside a room facing a wall and the target position is outside the room (Figure 8).

The robot first rotates to explore the room, and realizes that the target is outside the room. It then makes its way to the door to exit the room, to eventually reach the desired target position. Our policies automatically learn to pick up such semantic cues (that the room can be exited through a doorway), even though they were never explicitly trained to look for them. This allows for efficient goal-reaching behavior in novel environments.

3) Navigation In Dynamic Environments: Though in this work, we only focus on static environments, our policy is reactive and can adapt to changes in the environment. We test this behavior by changing the environment as the robot executes its policy. The robot is tasked to go straight 6m (Figure 8). It must go around the brown chair. As it is executing the policy, we move the chair around to block the robot’s path (we show the moved chair location in green, purple and yellow and mark the corresponding time at which the chair was moved on the robot’s trajectory in the same color). We decrease the control horizon to 0.5s to allow for fast adaptation. The robot indeed reacts quickly to the moving chair and is able to re-plan a trajectory in light of the moved chair to reach its target position without any collisions.

Comparison with the End-to-End Learning Approach: We repeat Experiments 1 (Navigation Through Cluttered Environments) and 2 (Leveraging Navigation Cues in Real-World Environments) for 5 trials each using both the proposed approach and the E2E method. The obtained results are shown in Figure 11. WayPtNav succeeds in 9 of 10 trials, failing only once in Experiment 2. The E2E approach, on the other hand, fails all 5 trials of Experiment 2 and succeeds in 2 of 5 trials of Experiment 1. Closer analysis of the failure cases supports our simulation-based conclusion that the E2E method struggles to predict aggressive, yet nuanced control profiles, even more so on a real robot.

We also compare the time taken to reach the target, average acceleration, and average jerk on the 2 trials in which E2E method succeed. Consistent with our simulation-based results, WayPtNav produces trajectories which are approximately 25 percent more efficient at reaching the goal than the E2E method. We note that the Turtlebot platform has an internal, low-level PID controller which naturally enforces velocity smoothing. Regardless, WayPtNav still achieves average acceleration approximately one-third that of the E2E method and average jerk approximately one-fifth that of the E2E approach, thus resulting in much smoother and efficient trajectories compared to the E2E learning approach on a real robot.

Success (%)
Time to Reach Target (s)
Average Acceleration ()
Average Jerk ()
Fig. 11: Quantitative Comparisons for Hardware Experiments: We deploy both our proposed method (WayPtNav) and the end-to-end (E2E) method on a Turtlebot 2 hardware testbed in two navigational scenarios. We run both methods in both scenarios for 5 trials (total of 20 trials). WayPtNav succeeds on 9 / 10 trials while the E2E method only succeeds on 2 / 10. We compute the remaining metrics on the 2 trials on which both WayPtNav and E2E succeeded (one bar for each trial). Real world results are consistent with our findings in simulation: WayPtNav is more successful, efficient, and produces significantly smoother control profiles than the E2E approach.

Viii Conclusion and Future Work

We propose a model-based navigation approach that employs learning to leverage semantic cues for goal-driven navigation in novel indoor environments. Our proposed approach is better and more efficient at reaching goals, and results in smoother trajectories when compared to an End-to-End learning approach. Use of a model-based feedback controller, allows our proposed approach to successfully generalize from simulation to physical robots.

Several interesting future directions emerge from this work. Our proposed approach assumes perfect robot state estimate and employs a purely reactive policy. These assumptions and choices may not be optimal, specially for long range tasks. Incorporating spatial or visual memory to address these limitations would be fruitful future directions. Furthermore, a more detailed study of dynamic environments would be important and interesting.


This research is supported in part by the DARPA Assured Autonomy program under agreement number FA8750-18-C-0101, by NSF under the CPS Frontier project VeHICaL project (1545126), by NSF grants 1739816 and 1837132, by the UC-Philippine-California Advanced Research Institute under project IIID-2016-005, by SRC under the CONIX Center, and by Berkeley Deep Drive.