Visual Navigation Among Humans with Optimal Control as a Supervisor

03/20/2020 ∙ by Varun Tolani, et al. ∙ 5

Real world navigation requires robots to operate in unfamiliar, dynamic environments, sharing spaces with humans. Navigating around humans is especially difficult because it requires predicting their future motion, which can be quite challenging. We propose a novel framework for navigation around humans which combines learning-based perception with model-based optimal control. Specifically, we train a Convolutional Neural Network (CNN)-based perception module which maps the robot's visual inputs to a waypoint, or next desired state. This waypoint is then input into planning and control modules which convey the robot safely and efficiently to the goal. To train the CNN we contribute a photo-realistic bench-marking dataset for autonomous robot navigation in the presence of humans. The CNN is trained using supervised learning on images rendered from our photo-realistic dataset. The proposed framework learns to anticipate and react to peoples' motion based only on a monocular RGB image, without explicitly predicting future human motion. Our method generalizes well to unseen buildings and humans in both simulation and real world environments. Furthermore, our experiments demonstrate that combining model-based control and learning leads to better and more data-efficient navigational behaviors as compared to a purely learning based approach. Videos describing our approach and experiments are available on the project website.



There are no comments yet.


page 1

page 6

page 7

page 8

page 10

Code Repositories


Human Active Navigation Dataset

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Autonomous robot navigation has a potential to enable many critical robot applications, from service robots that deliver food and medicine, to logistics and search and rescue missions. In all these applications, it is imperative for robots to work safely among humans and be able to adjust their own motion plans based on observed human behavior.

For example, if a person is turning left, the robot should pass the human to the right to avoid cutting the person off. Or when a person is moving in the same direction as the robot, the robot should maintain a safe distance between itself and the person.

Specifically, the methods for autonomous robot navigation among people need to recognize humans, predict human behaviors, and react to them safely. Human recognition is difficult because people come in different shapes and sizes, and might even be partially occluded. Behavior prediction, on the other hand, is challenging because the human’s navigation goal (intent) is not known to the robot, and people have different temperaments and physical abilities which affect their motion (speed, paths etc.) [21]. The robot’s reactions to humans need to carefully balance reaching the robot’s destination, and avoiding humans. These aspects make navigating around humans particularly challenging, especially when the robot itself is operating in a new, a priori unknown environment. Classical solutions consists of a pipeline of specialized methods for perception, behavior prediction, planning, and control. While modular and easy to debug, this software stack is difficult to build, and often requires careful environment-specific tuning of each sub-module [23].

Fig. 1: (top) We consider the problem of autonomous visual navigation in a priori unknown, indoor environments with humans, using monocular RGB images (bottom right). To learn navigational behavior around humans, we create the HumANav dataset which allows for photorealistic rendering in simulated buildings environments with humans (bottom left). HumANav is designed to render visual cues relevant for navigation (i.e. if a human is moving quickly its legs will be spread apart). We demonstrate that a navigation framework trained on HumANav can successfully learn to navigate around humans and transfer the learned policies from simulation to reality.

Recently, end-to-end learning methods, which map sensor outputs directly to robot actions [6, 13, 8, 19], have shown promising results in reducing the stack complexity, increasing robustness to noise, and generalizing to new problem instances. These methods, however, tend to be sample inefficient and overspecialized to the system on which they were trained on [20].

Seeking to combine the benefits of classical and learning-based methods for navigation, the authors in [3] combine a learned perception module with a model based controller for navigation in a priori, unkown environments. This work leverages a learned Convolutional Neural Network (CNN), which predicts a waypoint, or the vehicle’s next desired state, and uses optimal control to actually reach the waypoint. This method has been shown to execute smooth, efficient trajectories in novel, static environments (both simulation and real).

In this work, we use the navigation architecture proposed in [3] but extend it to dynamic environments. Specifically, the robot is navigating in an a priori unknown space shared with a single human, whose trajectory is also unknown. However to train such a method requires training data, which can be obtained either from real robots or from simulation. While robot manipulation often trains from real data [13], training navigational agents on real robots poses safety, privacy, and logistical challenges. At the same time, training in simulation poses a challenge as it requires simulation of visually realistic humans and their motion [22, 17]. To the best of our knowledge, no such dataset exists. For this reason, training navigational agents often rely either on synthetic humans devoid of low-level visual cues such as color, texture, and pose [8], or training on photorealistic static environments [6, 11, 3] and evaluating in dynamic environments.

To address challenges of training visual navigation agents around humans, in this paper we introduce the Human Active Navigation Dataset (HumANav), an active dataset for navigation around humans in photo-realistic environments. We use it to generate supervision for optimal waypoints for the method described in [3]. The dataset consists of scans of 6000 synthetic but realistic humans from the SURREAL dataset [22] placed in office buildings from Stanford Large Scale 3D Indoor Spaces Dataset [1], though in principal scans from any indoor office environment can be used. HumANav allows for user manipulation of human agents within the building and provides photorealistic renderings (RGB, Depth, Surface Normals, etc.) via a standard perspective projection camera. Critically, HumANav also ensures important visual cues associated with human movement are present in images (i.e. if a human is walking quickly its legs will be farther apart than if it is moving slowly), facilitating reasoning about human motion. To create training data, we assume full knowledge of the environment and human trajectories, and use Model Predictive Control (MPC) to produce optimal waypoints and their associated trajectories. We then render the RGB image seen at each intermediate vehicle state along the trajectory and save it along with the optimal waypoint for training the CNN via supervised learning.

Our modular architecture trained on HumANav generalizes to previously unseen environments, humans, and human behaviors.

Through simulations and experiments on a mobile robot, we demonstrate that our framework is better and results in smoother

trajectories, as compared to end-to-end learning. At the same time, it does not require explicit state estimation and trajectory prediction of the human, unlike a classical navigation pipeline, leading to a more

reliable performance, especially when there is inevitable noise in human state estimation. Additionally, we show that our agent can learn to reason about the dynamic nature of humans, taking into account their future motion while planning its own path. The on-robot experiments demonstrate that the proposed framework transfers directly from simulation to reality.

Overall, this paper is of interest to the broader visual navigation, planning, and control communities a) to use the HumANav Dataset for benchmarking for visual navigation around humans, b) develop visual navigation algorithms, both classical and machine-learning based, that generalize to new environments and people while closing the sim-to-real gap, and c) as an example of autonomous data collection and benchmark creation in human occupied environments that avoids privacy concerns.

Ii Related Work

An extensive body of research studies autonomous navigation in static and dynamic environments. We cannot possibly hope to summarize all these works here, but we attempt to discuss the most closely related approaches.

Navigation among humans:

Classical robotics factorizes the problem of navigation among humans into sub-problems of detection and tracking [4], human motion prediction [21], and planning [14]. However, reliable state estimation of the human might be challenging, especially when the robot is using narrow field-of-view sensors such as a monocular RGB camera. Moreover, human motion prediction itself can be quite challenging [10, 2] and is an active area of research [21]. Our method uses learning to predict goal-driven, collision-avoidant waypoints to navigate around humans only using a monocular RGB image, without explicitly estimating human state or predicting human motion.

Learning-based approaches have also been explored for navigation among humans. Authors in [15]

use reinforcement learning to produce socially-compliant motion among humans from a dataset that combines eight different sources of real pedestrian motion. However, the method requires human trajectories, and relies on detection and tracking algorithms to locate the humans.

Our method, in contrast, uses an RGB camera and does not explicitly require human state estimation. Methods based on depth sensors [18, 8, 5, 6] do not require high visual fidelity, but require expensive wide-field of view LiDAR sensors.

Social visual navigation: [16] proposes a dataset on multi-modal social visual navigation with manual annotations that serves as a perception benchmark for person detection and tracking. The dataset is collected in real environments using real humans, manual annotation, and non-goal oriented navigation. In contrast, our dataset (HumANav), aims to serve as a benchmark to goal-oriented navigation in the presence of humans. Since the data is fully collected in simulation using synthetic humans, our dataset and method avoid privacy concerns that might arise from using real human subjects, while still learning social navigation that transfers to real robots. Another benchmark on navigation [8], similarly to us, uses simulation for training, but is unsuitable for visual navigation tests because humans in the scene have no visual texture and features, which are known to be important for closing sim-to-real gap reliably [22, 17].

Several other methods [19, 9, 6, 7] approach navigation in dynamic environments by using classical planning in static environments as the higher level planner, along with reinforcement learning for adaptive local planning and path tracking. This approach limits their ability to reason about the dynamic nature of human and planning optimal paths around it. Instead. We learn the waypoint placement for the high-level planner to be safe with respect to the human motion, and use optimal control to reach the waypoint.

Iii Problem Setup

We study the problem of autonomous robot navigation in an a priori unknown indoor space shared with a single human, whose trajectory is unknown to the robot. We assume that the robot odometry is perfect (i.e., the vehicle state can be estimated reliably) and that the human expects the robot to avoid her. We defer dealing with imperfect odometry and the effect of the robot trajectory on the human trajectory to future work.

We model the vehicle as a three-dimensional non-linear Dubins car system with dynamics:


where is the state of vehicle, is the position, is the heading, is the speed, and is the turn rate at time . The inputs (control) to the vehicle are that are bounded within and respectively. The robot observes the environment through a forward-facing, monocular RGB camera mounted at a fixed height and oriented at a fixed pitch.

We use a discretized version of the dynamics in Eqn. (1) for path planning.

The goal of this paper is to learn control policies for the robot to go to a target position, , specified in a global coordinate frame, while avoiding collision with any obstacles or humans. These tasks are to be performed in novel, dynamic environments with no prior information on the topology of the environment or the time-varying trajectory of the human. At time and state , the robot receives an RGB image of the environment , , as well as , the target position expressed in the current coordinate frame of the robot.

The objective is to obtain a control policy that uses these inputs to guide the robot to the target as quickly and as safely as possible.

Iv Model-based Learning for Navigation

We develop a learning-based approach to indoor ground robot navigation in a priori unknown environments in the presence of a human.

Iv-a Learning-Based WayPoint approach to Navigation (LB-WayPtNav)

We use the LB-WayPtNav framework proposed in [3] for navigation in static environments, and train it on the data collected in dynamic environments. LB-WayPtNav combines a learning-based perception module with a dynamics model-based planning and control module for navigation in novel environments (Fig. 2). We give a brief overview of the perception and planning/control modules of LB-WayPtNav here. Further details can be found in the [3].

Fig. 2: The LB-WayPtNav framework to navigate in a priori unknown static environments. The figure is taken from [3].

Iv-A1 Perception Module

The perception module of LB-WayPtNav leverages a CNN, which inputs , the robot’s current 224 224 RGB image, , the desired goal position in the robot’s coordinate frame, and , the robot’s current linear and angular speed. The CNN is trained to output the robot’s next desired state, or waypoint, .

Iv-A2 Planning and Control Module

Given the desired waypoint , a spline-based trajectory is used to plan a smooth, efficient, and dynamically feasible (with respect to the dynamics in (1)) trajectory from the robot’s current state to . A tracking controller is obtained by solving an LQR problem around the linearized spline trajectory. The controller is executed on the robot for a control horizon of seconds, at which point the robot receives a new image of the environment and repeats the process.

Iv-B Training Details

LB-WayPtNav is trained on data obtained from photorealistic static environments, which is not sufficient to navigate among humans. A key contribution of our work is to provide a method for generating training data for navigation in indoor environments in the presence of humans whose trajectories are not known beforehand.

To generate the training data, we first develop a photorealistic simulator for human-centric indoor environments (Section IV-C). Given a human trajectory, this simulator can generate visuals of the scene from any arbitrary viewpoints in the environment. Second, given the precise map of the indoor environment in simulation, we use Model Predictive Control (MPC) to generate realistic human trajectories and to compute optimal waypoints to plan around static obstacles and humans. These optimal waypoints along with the rendered images obtained from the simulator are used to train the CNN. Note that neither a map of the environment nor the trajectory of the human are available during test time– the robot only relies on the monocular RGB image obtained from the onboard camera. We now describe the human trajectory and optimal waypoint generation process.

To generate human trajectories, we model the human as a goal-driven dynamical system with state and dynamics given by (1). We additionally make the simplifying assumptions that at any given time, the human is either stationary (), moving straight (), or turning (), and that the human follows a piecewise constant velocity trajectory. Even though simplistic, these assumptions are often used in human-robot interaction literature to obtain a reasonable approximation of human trajectories [21].

The human and robot are both modeled as cylindrical agents. To generate the training data, we first sample the start positions (, ) and the goal positions (, ) for the robot and the human respectively, as well as a unique identity for the human (body shape, texture, gender). We then use receding horizon MPC to plan paths for both the robot and human for a horizon of . In particular, at time , the human solves for the optimal trajectory that minimizes the following cost function


Here represents the minimum collision-free distance between and the human goal position (also known as the FMM distance). represents the signed distance to the nearest static obstacle. The obstacle cost is penalized only when the human is closer than to the nearest obstacle. The coefficients are chosen to weight the different costs with respect to each other.

Given the optimal human trajectory for time horizon , , the robot optimizes for the waypoint, , such that the corresponding trajectory to that waypoint minimizes the following cost function:


Similar to the human’s cost function, represents the collision-free distance to robot’s goal, , represents the signed distance to the nearest obstacle, and represents the signed distance to the human at time . The robot’s distance to the human is only penalized when the robot and human are closer than to each other. The coefficients are chosen to weight the different costs with respect to each other.

The planning is done in a receding horizon fashion. In particular, both the robot and the human plan for a horizon of . The planned paths are executed by each agent for a control horizon of , and then paths are replanned. The process of planning, and executing trajectories is repeated until the robot reaches its goal position. We then render the image seen at each of the robot’s intermediate states and save the corresponding pair for training. We refer to LB-WayPtNav trained on this data as LB-WayPtNav-DH (Dynamic Human).

Iv-B1 Data Sampling Procedure

We found that training on data with rich interaction between the robot and both static and dynamic obstacles was crucial to success in test scenarios, especially on our hardware setup. Data with "rich interaction" include episodes where the robot must navigate around chairs, through doorways, behind a slowly moving human, cut across a humans path, etc.. To this end, we designed a data sampling procedure to stimulate such interaction.

  • Firstly, a start and goal position for the robot, , are sampled in the static environment. To force interaction between the two agents, we choose such that the human’s initial state lies approximately along the robot’s optimal path to its goal position in the absence of the human.

  • At training time, the robot has complete knowledge of the topology of the environment and the time-varying trajectory and footprint of the human. However, at test time, we seek to reproduce navigational behavior in new environments based solely on first person RGB imagery. To bridge the information gap between training and test time and help facilitate learning, we found restricting our expert to plan using information directly available in the current RGB image to be extremely effective. Specifically, is non-zero only when the human is visible in the robots current RGB image. This facilitates downstream learning as it ensures the human is visible when information about the human is used for planning.

Iv-C The Human Active Navigation Dataset (HumANav)

Data generation for the training procedure described in section IV-B is challenging as it requires an environment whose map is known a priori, a human whose geometry and time varying trajectory are known a priori, and capabilities for rendering realistic visuals of both the environment and the human from any viewpoint. To the best of our knowledge, no simulated dataset exists with all these capabilities, so we created a new dataset, the Human Active Navigation Dataset (HumANav).

We use MPC for generating human and robot trajectories as described in Sec. IV-B. For rendering purposes, we use the Swiftshader rendering engine, a CPU based rendering engine for photorealistic visuals (rgb, disparity, surface normal, etc.) from textured meshes used in [12]. We use mesh scans of office buildings from the Stanford Large Scale 3d Indoor Spaces Dataset (SD3DIS) [1], however the rendering engine is independent of the meshes used. In principle, textured meshes from scans of any office buildings can be used. For human meshes we turn to the SURREAL Dataset [22] which renders images of synthetic humans in a variety of poses, genders, body shapes, and lighting conditions. Though the meshes themselves are synthetic, the human poses in the SURREAL dataset come from real human motion capture data and contain a variety of actions including running, jumping, dancing, acrobatics, and walking. We focus on the subset of poses in which the human is walking.

The HumANav Dataset, shown in Figure 3 is an active dataset incorporating SURREAL and the Swiftshader engine using SD3DIS building data. The dataset contains 6000 walking, textured human meshes organized by body shape, gender, and velocity moving in scans of 5 Stanford office buildings. Our dataset allows a user to load a specified human into a desired building, at a specified position, orientation, speed, and angular speed. The user can manipulate human position, orientation, speed, and angular speed in the building, and render photo-realistic visuals (RGB, disparity, surface normals, etc.) of the human and building from arbitrary viewpoints using a standard perspective projection camera. To train LB-WayPtNav-DH we generate 180 thousand samples, and train on 125 thousand of them.

Fig. 3: HumANav incorporates SURREAL and SD3DIS building datasets for photorealistic rendering of the indoor environments with humans. It can generate humans with different appearances, as well as different visual modalities of the scene, such as RGB, disparity, and surface normal.

Crucially, HumANav preserves visual cues of humans that are relevant to navigation. When an image is rendered, a mesh with appropriate visual cues for that human’s walking speed is loaded into the static building. This ensures that visual cues for planning are present in images used for training (i.e. if the human is walking quickly its legs will be far apart in the image).

V Simulation Experiments

In this work, we investigate the following two key questions in indoor office environments with humans. We investigate both claims in simulation and hardware experiments

  • Can LB-WayPtNav-DH effectively plan goal-driven trajectories in novel environments while reasoning about the dynamic nature of humans?

  • What are the merits of combining model-based control with learning for perceptual understanding of humans, compared to fully learning-based methods and purely geometry based mapping and planning approaches?

Simulation Setup: Our simulation experiments are conducted using our active dataset described in IV-C. Scans from 3 buildings and 4800 human identities are used to generate training data. 154 test episodes in a 4th held out building and held out human identities are used to evaluate all methods; we sample training and test scenarios to include rich interaction with both static and dynamic obstacles such as planning around sudden turns by the human, following a human down a hallway, navigating around tight corners, etc. using the sampling procedure outlined in IV-B1.

Implementation Details: Following LB-WayPtNav, we use a pretrained ResNet-50 to initialize the CNN-based perception module and finetune it using a MSE loss and ADAM optimizer with learning rate and weight decay coefficient of on 125k points from our own dataset.

Baselines: We compare LB-WayPtNav-DH with seven baselines:

  • LB-WayPtNav [3]: trained on environments with no humans.

  • LB-WayPtNav-DH-FOV: LB-WayPtNav-DH trained to predict waypoints which always avoid the human, regardless of whether the human is visible in the robot’s current RGB image.

  • LB-WayPtNav-DH-Gray: LB-WayPtNav-DH trained on photos of humans that are colored in gray rather than with realistic clothing, skin color, and hair (see Figure 4).

  • E2E learning: CNN trained on the same data as LB-WayPtNav-DH, but instead of a waypoint directly regress to control commands corresponding to the optimal trajectory of the robot.

  • Mapping-SH (Static Human): a mapping method that computes the free space (treating human as any other static obstacle) in the current scene and use it for planning. Mapping-SH uses the known parameters of the robot’s camera to project its current depth image to an occupancy grid, which is then used for model-based planning.

  • Mapping-WC (Worst Case Human): same as Mapping-SH, but if the human is visible in the current frame, Maping-WC plans a path around all possible human behaviors assuming that the human’s current state, , is perfectly known and that it moves at any speed in for the entire planning horizon. Inferring directly from images, is itself a challenge, however, in our simulation results we assume that Mapping-WC has access to ground truth human state information when the human is visible in its current image.

  • Mapping-WC-Noisy

    : same as Mapping-WC, however to simulate noisy state estimation of the human we add zero-centered, uniformly distributed noise to


Fig. 4: (Left) We train LB-WayPtNav-DH on images from HumANav with realistic textures (clothing, hair, skin color, facial features). We also train a baseline, LB-WayPtNav-DH-Gray on the same data, but render human figures with gray textures (Right).

We use a control horizon of in all the simulations, allowing for a fast replanning around humans.

Metrics: We compare the success rate across all methods. An episode is successful if the robot reaches within meters of its goal without colliding. We further compare LB-WayPtNav-DH and other methods on episode specific metrics computed over the subset of goals where all methods succeed; we report the average time to reach the goal, average robot jerk, and acceleration along the successful trajectories (lower is better).

Agent Input Success (%)
Expert Full map 100
Learning Based Methods
LB-WayPtNav-DH (ours) RGB 80.52
LB-WayPtNav RGB 67.53
LB-WayPtNav-DH-FOV RGB 68.18
LB-WayPtNav-DH-Gray RGB 74.03
E2E RGB 52.60
Mapping Methods (memoryless)
Mapping-SH Depth 76.63
Mapping-WC Depth 94.81
Mapping-WC-Noisy Depth 87.01
TABLE I: Summary of success rate of LB-WayPtNav-DH (ours) and the baselines in simulation. Best results for learning-based methods and mapping-based methods shown in bold.

V-a Results

Agent Time Taken (s) Acceleration () Jerk ()
LB-WayPtNav-DH (ours) 12.94 ±3.06 .09 ±.02 .64 ±.13
LB-WayPtNav 14.25 ±2.80 .10 ±.02 .71 ±.14
LB-WayPtNav-DH-FOV 13.57 ±3.52 .09 ±.02 .66 ±.13
LB-WayPtNav-DH-Gray 12.96 ±2.18 .09 ±.02 .66 ±.12
TABLE II: Comparison between LB-WayPtNav-DH (ours) and the other LB-WayPtNav methods on the 67 episodes where all methods succeed. Note: lower is better.

Comparison with LB-WayPtNav: LB-WayPtNav-DH reaches the goal on average more than LB-WayPtNav (Table I) and on average faster than LB-WayPtNav (Table II). In Figure 5 we analyze a representative test scenario where LB-WayPtNav-DH plans a safe path, taking into account the human’s direction of motion, while LB-WayPtNav fails to do so and collides. Ultimately, LB-WayPtNav tends to fail in scenarios where human directionality plays a pivotal role in planning a collision-free path.

Fig. 5: (Left) We visualize a topview of the trajectories taken by LB-WayPtNav-DH (red- solid line) and LB-WayPtNav (light blue- dashed line) on one representative navigational goal which requires nuanced reasoning about the directionality of the human. Both agents start at the dark-blue circle and their goal is to reach the green circle while avoiding static obstacles (dark grey) and a human in the environment (magenta). Both LB-WayPtNav-DH and LB-WayPtNav behave similarly at the beginning of their trajectories, but diverge upon seeing the human (red circle and light blue circle respectively on the top view plot). The RGB images the agents see of the human are shown on the right. LB-WayPtNav plans a path to the right of the human (in its direction of motion), ultimately leading to collision. LB-WayPtNav-DH plans a trajectory (transparent red) to the left the of the dynamic agent, accounting for the human’s future motion, and ultimately leading to its success.

Comparison with LB-WayPtNav-DH-FOV: LB-WayPtNav-DH reaches the goal on average more than LB-WayPtNav-DH-FOV and on average faster than LB-WayPtNav-DH-FOV. This indicates that restricting our expert to choose waypoints only considering information within its current field of view, as described in IV-B1, facilitates downstream learning and ultimately performance for LB-WayPtNav-DH.

Comparison with LB-WayPtNav-DH-Gray: LB-WayPtNav-DH reaches the goal on average more than LB-WayPtNav-DH-Gray (Table I). Both methods take approximately the same time to reach the goal and result in trajectories of comparable smoothness (Table II) indicating that training LB-WayPtNav-DH with photorealistic textures (clothing, skin color, hair color, etc.), gives a mild boost in performance (when tested on similar photorealistic humans from HumANav) over training on gray colored human figures.

Comparison with End-to-End learning: LB-WayPtNav-DH is approximately more successful in reaching the goal than the End-to-End learning method (Table I). LB-WayPtNav-DH also reaches its goal approximately faster than the End-to-End method. The average acceleration magnitude is of the average acceleration of the End-to-End method and the average jerk magnitude is of the average jerk of the End-to-End method (Table III). On further analysis, we notice that E2E learning particularly struggles in the scenarios where it must predict nuanced trajectories which navigate around both static obstacles and dynamic obstacles. These results are consistent with results in static environments [3] where the use of model-based control significantly improves the navigational performance.

Agent Time Taken (s) Acceleration () Jerk ()
LB-WayPtNav-DH (ours) 13.24 ±3.34 .092 ±.02 .65 ±.15
E2E 15.59 ±5.36 .15 ±.02 4.10 ±1.14
TABLE III: Comparison between LB-WayPtNav-DH (ours) and the End to End learning method on 76 episodes where both methods succeed. Note: lower is better.

Comparison with Mapping-SH: LB-WayPtNav-DH reaches the goal on average more than Mapping-SH (Table I). When investigating the failure modes of Mapping-SH, we found the primary failure mode to be collision with a human while they are moving (dynamic). Mapping-SH rarely collides with static obstacles or humans while they are standing still (static), as it is designed to avoid them and in simulation can do so as it has access to perfect depth information. On the other hand, the failure modes of LB-WayPtNav-DH include collision with both static and dynamic humans, as the trained CNN can make erroneous predictions in both cases. Despite these collisions with static humans (which Mapping-SH does not suffer from), LB-WayPtNav-DH still reaches the goal more than Mapping-SH, indicating that training on dynamic humans is helpful for reasoning about the dynamic nature of humans in a navigational setting.

Mapping-SH primarily fails on episodes where inferring the dynamic nature of the human is crucial to success. On these 24 "dynamic" episodes Mapping-SH achieves a success rate. LB-WayPtNav-DH, however, succeeds on of these episodes indicating that is has learned to reason about the dynamic nature of the human. Furthermore, as shown in [3] the performance of Mapping-SH tends to degrade substantially on a real hardware platform due to inaccuracies in depth estimation. On the other hand, LB-WayPtNav-DH is trained to be robust to sensor noise and exhibits similar error profiles on real and synthetic imagery.

On the goals where both methods succeed, Mapping-SH is approximately faster than LB-WayPtNav-DH. This is unsurprising as these are goals where collision can be avoided by simply treating the human as a static obstacle. In these cases Mapping-SH excels, as it has access to perfect ground-truth depth information and can plan a path which barely avoids the human. LB-WayPtNav-DH, on the other hand is trained to take conservative trajectories which avoid the human’s potential future behavior.

Agent Time Taken (s) Acceleration () Jerk ()
LB-WayPtNav-DH (ours) 13.38 ±2.76 .09 ±.03 .67 ±.18
Mapping-SH 12.20 ±1.86 .10 ±.03 .77 ±.30
TABLE IV: Comparison between LB-WayPtNav-DH (ours) and Mapping-SH on 102 episodes where both methods succeed. Note: lower is better.

Comparison with Mapping-WC & Mapping-WC-Noisy: LB-WayPtNav-DH reaches the goal on average less than Mapping-WC (Table I). This is unsurprising as Mapping-WC has access to perfect depth information; this combined with perfect human state estimation and "worst-case" planning leads to near perfect success. In fact, Mapping-WC only fails in approximately of cases due to the receding horizon nature of its MPC planner, which plans a collision free path over its current planning horizon, moving the robot to a future state from which it cannot avoid collision.

Mapping-WC reaches the goal on average faster than LB-WayPtNav-DH on the 109 episodes where all three methods succeed (Table V). We found that in many cases, Mapping-WC is able to exploit the precise geometry of the scene and human to take an aggressive trajectory which barely avoids collision with the human. Our method, on the other hand is trained to take conservative trajectories which avoid potential future motion of the human. We analyze one such representative trajectory in Figure 6.

Interestingly, we found that on cases where Mapping-WC was forced to take overly conservative paths due to its "worst case" planning, LB-WayPtNav-DH is able to reason about the human’s likely trajectory, thus reaching the goal on average faster than Mapping-WC (Table V).

The comparison with Mapping-WC-Noisy is consistent with that of Mapping-WC, however we see that Mapping-WC-Noisy reaches the goal on average less than Mapping-WC, indicating that imperfect human state estimation can have a significant impact on navigational performance (Table I). Importantly we note that Mapping-WC is nontrivial to implement on a hardware platform as human state must be accurately inferred from monocular imagery, which itself is a challenging and error prone process.

Fig. 6: (Left) We visualize a topview of the trajectories taken by Mapping-WC (light-blue dashed line) and LB-WayPtNav-DH (solid-red line) from the same state . The corresponding RGB images with a superposition of these two trajectories is shown on the right. Mapping-WC reaches the goal faster than LB-WayPtNav-DH as it has access to precise geometry of the scene and the human state and thus plans a path between the human and the wall which just barely avoids collision. LB-WayPtNav-DH, on the other hand, has been trained to take more cautious trajectories around the human and its possible future motion.
Agent Time Taken (s) Acceleration () Jerk ()
Successful Episodes (109)
LB-WayPtNav-DH (ours) 13.63 ±2.83 .09 ±.03 .68 ±.18
Mapping-WC 12.16 ±2.04 0.10 ±.03 .72 ±.21
Mapping-WC-Noisy 12.13 ±2.15 0.10 ±.03 .71 ±.19
Overly Conservative Episodes (29)
LB-WayPtNav-DH (ours) 12.21 ±1.72 .08 ±.02 .56 ±.15
Mapping-WC 12.92 ±2.03 .11 ±.03 .75 ±.24
Mapping-WC-Noisy 12.70 ±2.12 .10 ±.02 .70 ±.18
TABLE V: Comparison between LB-WayPtNav-DH (ours), Mapping-WC, and Mapping-WC-Noisy on 109 episodes where all methods succeed. Mapping-WC and Mapping-WC-Noisy tend to reach the goal faster over all 109 goals than LB-WayPtNav-DH as they exploit the precise geometry of the scene. However, on the 29 episodes where worst case planning leads to overly conservative behavior, LB-WayPtNav-DH is more efficient in reaching the goal. Note: lower is better.

Learned Navigational Cues We designed HumANav such that relevant visual cues for navigation are rendered in imagery; i.e. a human’s legs will be spread apart if they are moving quickly and will stay closed if they are not moving. To test whether LB-WayPtNav-DH can learn to reason about such visual cues, we deploy LB-WayPtNav-DH twice on the same navigational goal where the human is blocking the robot’s optimal path to goal. In one scenario the human is stationary, while in the other the human is moving forward, as shown by the spread in their legs. The resulting trajectories shown in figure 7 clearly demonstrate that LB-WayPtNav-DH has learned to incorporate visual cues such as the spread in a human’s legs into the planning process.

Fig. 7: (Left) We visualize a topview of the trajectories taken by LB-WayPtNav-DH from the same state with a static human (light blue- dashed line) and with a dynamic human (red- solid line). The corresponding RGB images seen by the robot are shown on the right. LB-WayPtNav-DH is able to incorporate visual cues, i.e. spread of humans legs and direction of human toes, into path planning, planning a path which avoids the human’s current and future states.

Failure Modes: LB-WayPtNav-DH successfully navigates around dynamic and static obstacles in novel environments, however it is primarily limited in its ability to recognize and predict the motion of humans. These issues are tightly coupled with the robot’s reactive nature (uses only the current RGB image) and field of view (forward facing camera) as humans may approach the robot from outside or on the fringe of its field of view.

Vi Hardware Experiments

We test our algorithm on a Turtlebot 2 hardware platform.

We directly deploy the LB-WayPtNav-DH framework trained in simulation (described in Section V) on the Turtlebot without any finetuning or additional training. Our algorithm is tested in two never-before-seen buildings. Importantly, we note that our robot has only been trained on synthetic humans from the SURREAL dataset [22], constrained to piecewise constant velocity trajectories. Humans in our experiments, however, are not constrained to such dynamical constraints. For robot state measurement, we use the Turtlebot’s encoder based odometry.

The hardware experiments are specifically designed to evaluate whether the robot has learned to reason about the dynamic nature of humans. For this purpose, the static clutter in these environments is limited. In experiment 1, the human walks parallel to the robot but in the opposite direction; however, the human suddenly takes a turn towards the robot, requiring it to anticipate the human behavior to avoid a collision. In experiment 2, the robot and the human move in opposite directions, but cross each other near a tight corner, requiring the robot to take a cautious trajectory around the human in order to avoid a collision with the human or other obstacles in the environment.

We compare the performance of LB-WayPtNav-DH, LB-WayPtNav, and Mapping-SH on our hardware platform across two experimental settings for five trials each. Quantitative results are presented in Table VI. We do not compare to End-To-End or Mapping-WC on our hardware setup as the simulation performance of End-To-End is already very low and Mapping-WC requires access to the ground truth state information of the human, which we noticed was not reliable using our narrow field-of-view monocular RGB camera. The experiment videos for all methods can be found on the project website.

Experiment 1
Experiment 2
Fig. 8: Two examples from the experiments. Top: executed trajectory. Purple dot is human with an arrow depicting its direction. Bottom: RGB image from the robot. The robot selects the trajectories in the opposite direction from the human, even if is means diverging from the optimal path to the goal.

Comparison With LB-WayPtNav: LB-WayPtNav succeeds in only 3 trials out of 10 (Table VI). In both experiments LB-WayPtNav attempts to avoid the human, treating it as a static obstacle, however the human advances towards the robot before it can correct course. This is unsurprising as this method is trained purely on static obstacles and these experiments are specifically designed to test the agent’s understanding of the dynamic nature of humans. In cases where treating the human as a static obstacle does succeed however, the robot is approximately more efficient than LB-WayPtNav-DH as it does not take longer, cautious paths which allow for more human avoidant behavior.

Comparison With Mapping-SH: To implement Mapping-SH on the Turtlebot, we project the robot’s current depth image onto an occupancy grid on the ground plane using the known camera intrinsic and extrinsic parameters. Similar to LB-WayPtNav, the performance of Mapping-SH deteriorates even further in our hardware experiment, succeeding in only 2 trials. Performance of Mapping-SH is further impacted by its exact reliance on the geometry of the scene, which can lead to failure when the depth sensor gives erroneous readings.

In cases where Mapping-SH does succeed, it is very efficient, reaching the goal approximately faster than LB-WayPtNav-DH. This is expected as in these cases Mapping-SH is designed to exploit the exact geometry of the scene, barely avoiding obstacles on its way to the goal. Interestingly, given the reactive nature of Mapping-SH, its over-reliance on precise geometry, and lack of understanding of the dynamic nature of the human, when Mapping-SH does succeed it does so by executing a last-resort, reactive, aggressive turn or stop to avoid imminent collision with the human. This behavior is reflected in the exceptionally high jerk of .

Performance of LB-WayPtNav-DH: LB-WayPtNav-DH succeeds in all 10 trials by exhibiting behavior which takes into account the dynamic nature of the human agent. These results demonstrate the capabilities of a learning algorithm trained entirely in simulation on the HumANav dataset to generalize to navigational problems in real buildings with real people.

In experiment 1 (Figure 8), LB-WayPtNav-DH’s navigates around the human, by moving contrary to its direction of motion which allows it to reliably avoid a collision with the human. LB-WayPtNav and Mapping-SH, however, treat the human as a static obstacle and attempt to avoid it by moving in its direction of motion. Similarly, in experiment 2 ((Figure 8)) while navigating through hallways, LB-WayPtNav-DH takes a larger radius turn around a corner to leave space for the human moving in the other direction. LB-WayPtNav and Mapping-SH exhibit greedy behavior, trying to navigate around the tight corner in hope for a shorter path to the goal, but ultimately failing more often as they cannot react quickly enough to maneuver around the human.

LB-WayPtNav-DH is also noticeably less efficient than Mapping-SH and LB-WayPtNav. This is due to the entirely different class of behavior’s exhibited by LB-WayPtNav-DH; the trajectories it takes around the human are more cautious, and thus less-efficient as it has been trained to avoid the human’s current and short-term future trajectory.

Agent Success (%) Time taken (s) Acceleration () Jerk ()
LB-WayPtNav-DH (ours) 100.00 19.15 ±1.25 0.06 ±0.01 2.62 ±0.26
LB-WayPtNav 30.00 15.56 ±0.08 0.06 ±0.01 2.47 ±0.11
Mapping-SH 20.00 11.38 ±0.23 0.22 ±0.05 10.45 ±0.10

Summary of experimental results, averaged over five trials. Average and standard deviation of success rate, time, acceleration and jerk. Best results are in bold. LB-WayPtNav and Mapping-SH approaches lack the understanding of the dynamic nature of the human, ultimately leading to a collision with the human.

Vii Conclusion and Future Work

We propose LB-WayPtNav-DH, a framework that combines a learning-based perception module and a model-based planning module for autonomous navigation in a priori unknown indoor environments with humans. To train the perception module in LB-WayPtNav-DH, we also create a photorealistic bench-marking dataset, HumANav, that can render rich indoor environment scenes with humans. The dataset consists of synthetic humans and can be fully autonomously generated, avoiding privacy and logistic difficulties present when working with real human subjects. We demonstrate that LB-WayPtNav-DH trained on HumANav can successfully learn to navigate around humans and transfer the learned policies from simulation to reality.

In future work, it would be interesting to learn richer navigation behaviors in more complex and crowded scenes. Next, in the current work, we use the SURREAL dataset directly; however, in future, we would like to contribute towards improving the diversity in the human skin color, sizes, and shapes in the SURREAL dataset. Finally, dealing with noise in robot state estimation will be another interesting future direction.


This research is supported in part by the DARPA Assured Autonomy program under agreement number FA8750-18-C-0101, by NSF under the CPS Frontier project VeHICaL project (1545126), by NSF grants 1739816 and 1837132, by the UC-Philippine-California Advanced Research Institute under project IIID-2016-005, by SRC under the CONIX Center, by Berkeley Deep Drive, and by the Google-BAIR Commons program.


  • [1] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese (2016) 3D semantic parsing of large-scale indoor spaces. In CVPR, Cited by: §I, §IV-C.
  • [2] S. Bansal, A. Bajcsy, E. Ratner, A. D. Dragan, and C. J. Tomlin (2019) A hamilton-jacobi reachability-based framework for predicting and analyzing human motion for safe planning. ICRA. Cited by: §II.
  • [3] S. Bansal, V. Tolani, S. Gupta, J. Malik, and C. Tomlin (2019) Combining optimal control and learning for visual navigation in novel environments. CoRL. Cited by: §I, §I, §I, Fig. 2, §IV-A, 1st item, §V-A, §V-A.
  • [4] A. Brunetti, D. Buongiorno, G. F. Trotta, and V. Bevilacqua (2018) Computer vision and deep learning techniques for pedestrian detection and tracking: a survey. Neurocomputing. Cited by: §II.
  • [5] C. Chen, Y. Liu, S. Kreiss, and A. Alahi (2019) Crowd-robot interaction: crowd-aware robot navigation with attention-based deep reinforcement learning. In ICRA, Cited by: §II.
  • [6] H. L. Chiang, A. Faust, M. Fiser, and A. Francis (2019) Learning navigation behaviors end-to-end with autorl. RA-L. Cited by: §I, §I, §II, §II.
  • [7] H. L. Chiang, J. Hsu, M. Fiser, L. Tapia, and A. Faust (2019) RL-rrt: kinodynamic motion planning via learning reachability estimators from rl policies. RA-L. Cited by: §II.
  • [8] T. Fan, X. Cheng, J. Pan, P. Long, W. Liu, R. Yang, and D. Manocha (2019) Getting robots unfrozen and unlost in dense pedestrian crowds. RA-L. Cited by: §I, §I, §II, §II.
  • [9] A. Faust, O. Ramirez, M. Fiser, K. Oslund, A. Francis, J. Davidson, and L. Tapia (2018) PRM-rl: long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In ICRA, Cited by: §II.
  • [10] J. F. Fisac, A. Bajcsy, S. L. Herbert, D. Fridovich-Keil, S. Wang, C. J. Tomlin, and A. D. Dragan (2018) Probabilistically safe robot planning with confidence-based human predictions. RSS. Cited by: §II.
  • [11] W. Gao, D. Hsu, W. S. Lee, S. Shen, and K. Subramanian (2017) Intention-net: integrating planning and deep learning for goal-directed autonomous navigation. In CoRL, Cited by: §I.
  • [12] S. Gupta, V. Tolani, J. Davidson, S. Levine, R. Sukthankar, and J. Malik (2019) Cognitive mapping and planning for visual navigation. IJCV. Cited by: §IV-C.
  • [13] D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In CoRL, Cited by: §I, §I.
  • [14] S. M. LaValle (2006) Planning algorithms. Cambridge university press. Cited by: §II.
  • [15] M. Li, R. Jiang, S. S. Ge, and T. H. Lee (2018) Role playing learning for socially concomitant mobile robot navigation. CAAI Transactions on Intelligence Technology. Cited by: §II.
  • [16] R. Martín-Martín, H. Rezatofighi, A. Shenoi, M. Patel, J. Gwak, N. Dass, A. Federman, P. Goebel, and S. Savarese (2019) JRDB: a dataset and benchmark for visual perception for navigation in human environments. arXiv preprint arXiv:1910.11792. Cited by: §II.
  • [17] X. Peng, B. Sun, K. Ali, and K. Saenko (2015) Learning deep object detectors from 3d models. In ICCV, Cited by: §I, §II.
  • [18] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena (2017) From perception to decision: a data-driven approach to end-to-end motion planning for autonomous ground robots. In ICRA, Cited by: §II.
  • [19] A. Pokle, R. Martín-Martín, P. Goebel, V. Chow, H. M. Ewald, J. Yang, Z. Wang, A. Sadeghian, D. Sadigh, S. Savarese, and M. Vázquez (2019) Deep local trajectory replanning and control for robot navigation. In ICRA, Cited by: §I, §II.
  • [20] B. Recht (2018) A tour of reinforcement learning: the view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems. Cited by: §I.
  • [21] A. Rudenko, L. Palmieri, M. Herman, K. M. Kitani, D. M. Gavrila, and K. O. Arras (2019) Human motion trajectory prediction: a survey. arXiv preprint arXiv:1905.06113. Cited by: §I, §II, §IV-B.
  • [22] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid (2017) Learning from synthetic humans. In CVPR, Cited by: §I, §I, §II, §IV-C, §VI.
  • [23] W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun (2019) End-to-end interpretable neural motion planner. In CVPR, Cited by: §I.