Toward Agile Maneuvers in Highly Constrained Spaces: Learning from Hallucination

07/28/2020 ∙ by Xuesu Xiao, et al. ∙ The University of Texas at Austin 0

While classical approaches to autonomous robot navigation currently enable operation in certain environments, they break down in tightly constrained spaces, e.g., where the robot needs to engage in agile maneuvers to squeeze between obstacles. Recent machine learning techniques have the potential to address this shortcoming, but existing approaches require vast amounts of navigation experience for training, during which the robot must operate in close proximity to obstacles and risk collision. In this paper, we propose to side-step this requirement by introducing a new machine learning paradigm for autonomous navigation called learning from hallucination (LfH), which can use training data collected in completely safe environments to compute navigation controllers that result in fast, smooth, and safe navigation in highly constrained environments. Our experimental results show that the proposed LfH system outperforms three autonomous navigation baselines on a real robot, including those based on both classical and machine learning techniques (anonymized video:



There are no comments yet.


page 2

page 4

page 8

page 12

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Autonomous navigation in complex environments is an essential capability of intelligent mobile robots, and decades of robotics research has been devoted to developing autonomous systems that can navigate mobile robots in a collision-free manner in certain environments [6]. However, when facing highly constrained spaces that are barely larger than the robot, it is difficult for these conventional approaches to produce feasible motion without requiring so much computation that the robot needs to slow down or even stop.

Recently, machine learning approaches have also been used to successfully move robots from one point to another [17]

. Those approaches, based on techniques such as Reinforcement Learning (RL) and Imitation Learning (IL), have enabled new capabilities beyond those provided by classical navigation, such as terrain-based

[23] and social [2] navigation. Their initial success indicates a strong potential for learning-based methods to complement — and possibly to improve upon — classical approaches. However, most machine learning techniques require a large amount of training data before they can generalize to unseen environments. Furthermore, these approaches typically cannot provide verifiable guarantees that the robot will not collide with obstacles while navigating to its goal. While these shortcomings may not prove detrimental when applying machine learning to mobile robot navigation in relatively simple environments, their effects become disastrous in highly constrained spaces. In such environments, RL methods — which typically rely on random exploration — are unlikely to quickly find safe controllers, especially without catastrophic failures during training. IL methods are also unlikely to succeed due to the challenge of gathering demonstration data since highly constrained environments are typically difficult to navigate even for humans. In short, most existing machine learning paradigms for autonomous navigation lack both (1) the ability to generate sufficient training data for learning to navigate in highly constrained spaces and (2) safety assurances to prevent collisions.

In this paper, we introduce a novel machine learning paradigm for navigation, Learning from Hallucination (LfH), that addresses the shortcomings above and enables safe, fast, and smooth navigation in highly constrained spaces. To address challenge (1), i.e. data insufficiency, we introduce a self-supervised neural controller which can collect training data in an obstacle-free environment using a randomly-exploring policy. After performing various collision-free maneuvers, highly constrained configuration spaces that allow the same effective maneuvers are synthetically projected onto the recorded perceptual data so that the machine learner can be provided with training data as if the robot had been moving in those constrained spaces. We refer to this process of modifying the robot’s real perception as hallucination. Thanks to the inherent safety of navigating in an obstacle-free training environment, the robot can automatically generate a large amount of training data with minimal or no human supervision. In order to generalize to unseen deployment environments (e.g. environments that are less constrained), the robot’s perceptual stream is also hallucinated during runtime, using whichever perceptual stream is more constrained (between real and hallucinated) as input. To address challenge (2) regarding safety, we leverage the capabilities of classical navigation approaches: the robot assesses safety at runtime using classical techniques from model predictive control, and adjusts its motion by modulating its speed and aborting unsafe plans. LfH is fully implemented on a physical robot, and we show that it can achieve safer, faster, and smoother navigation compared to three classical and learning baselines in a highly constrained environment (Figure 1).

Figure 1: LfH navigation in a highly constrained obstacle course.

This paper makes three main contributions. First, from the motion planning perspective, LfH is a novel data-driven technique that enables safe, fast, and smooth maneuvers in previously unseen highly constrained spaces. Second, from the machine learning perspective, LfH is a novel, self-supervised learning technique that collects data offline in an obstacle-free environment and hallucinates the most constrained configuration space during training for better sample efficiency. Third, we implement LfH as a local planner for navigation which modulates motion with explicit safety estimation and provide empirical evidence of its efficacy. Combined with other conventional navigation components, our implementation achieves safer, faster, and smoother navigation in unseen environments, compared to classical and learning baselines.

2 Related Work

This section summarizes related work from the robotics community that has sought to address autonomous navigation in highly constrained spaces. We also review the literature from the machine learning community that has considered the general problem of mobile robot navigation.

2.1 Classical Navigation

Given a global path from a high-level global planner, such as Dijkstra’s algorithm [4], A* [8] or D* [5], classical mobile robot navigation systems aim to compute fine-grained motion commands to drive the robot along the global path while observing kinodynamic constraints and avoiding obstacles. Minguez and Montano [15] proposed a sophisticated rule-based Nearness Diagram approach to enable collision avoidance in very dense, cluttered, and complex scenarios and applied it on a simple holonomic robot. But for differential drive robots with non-trivial kinodynamics, researchers have relied heavily on sampling based techniques [11, 13]: Fox et al. [6] generated velocity samples achievable by the robot’s physical acceleration limit, and found the optimal sample according to some scoring function to move the robot towards a local goal, along a global path, and away from obstacles. Howard et al. [9] sampled in the robot’s state space instead of its action space, and subsequently generated motion trajectories for the sampled states. Given a highly constrained environment, the required sampling density has to increase so that a feasible motion command can be computed. Recently, Xiao et al. [24] established that, in constrained environments, robots must trade off between high computational requirements for increased sampling and reduced maximum speed in order to successfully navigate. Their APPLD algorithm manages this trade off. In contrast, LfH aims to enable safe, fast, and smooth navigation with limited computational requirements and without human demonstration.

2.2 Learned Navigation

A flurry of recent research activity has proposed several new approaches that apply machine learning techniques to the navigation task [17, 24, 18, 20, 26, 21, 25, 3, 7, 27, 22]. While these approaches are distinct in several ways (e.g., the particular way in which the navigation problem is formulated, the specific sensor data used, etc.), the machine learning paradigm employed is typically either reinforcement learning or imitation learning. Approaches based on RL [19] rely on hand-crafted reward functions for learners to discover effective navigation policies through self-generated experience. Approaches based on IL [1, 10], on the other hand, use demonstrations of effective navigation behaviors provided by other agents (e.g., humans) to learn policies that produce behaviors similar to what was demonstrated. Both paradigms have been successfully applied to the navigation task in certain scenarios, and have even enabled new navigation capabilities beyond what is typically possible with classical autonomous navigation approaches, e.g., terrain-based [23] and social navigation [2]. However, these approaches each impose substantial requirements at training time: RL-based techniques rely on large amounts of training experience gathered using a typically-random exploration policy within an environment similar to that in which the robot will be deployed, and IL-based techniques require a demonstration from the same type of environment. Unfortunately, neither of these requirements is easily satisfied in the challenging, highly constrained environments we study here: random exploration policies are too dangerous, and it is often difficult for an agent — artificial or human — to provide an expert-level demonstration. In contrast, the approach we present here utilizes imitation learning, but modifies the paradigm such that it can train using arbitrary demonstrations gathered from a different, safer environment.

3 Learning from Hallucination

We now describe the proposed technique, Learning from Hallucination, which can circumvent the difficulties of using traditional planning and learning approaches in highly constrained workspaces. In Section 3.1, we formulate the classical motion planning problem in such a way that it can be easily understood from the machine learning perspective. In Section 3.2, we propose to solve the reformulated motion planning problem with a new machine learning approach that addresses the inefficiencies of existing methods through a technique that we call hallucination. In Section 3.3, we describe the technique to ensure safe navigation in challenging spaces by incorporating both classical and learning techniques to adapt to motion uncertainties.

3.1 Motion Planning Formulation

In robotics, motion planning is formulated in configuration space (C-space) [14]. Given a particular mobile robot, the robot’s C-space represents the universe of all its possible configurations. Given a particular environment, the C-space can be decomposed as , where is the unreachable set of configurations due to obstacles, nonholonomic constraints, velocity bounds, etc., and is the set of reachable configurations. Let be a low-level action available to the robot (e.g., commanded linear and angular velocity ), and let a plan be a sequence of such actions , where is the space of all plans over time horizon . Then, using the notation above, the task of designing a motion planner is that of finding an optimal function that can be used to produce plans that result in the robot moving from the robot’s current configuration to a specified goal configuration without intersecting , while observing robot motion constraints and optimizing a particular cost function (e.g. distance, clearance, energy, and combinations thereof).

To address kinodynamics constraints, lack of explicit representation of and , high dimensionality, etc., sampling-based techniques are typically used to approximate : Probabilistic Roadmap (PRM) [11] samples in and assumes finding to connect two consecutive configurations and without entering is trivial, e.g. using a straight line. Rapidly-exploring Random Tree (RRT) [13] uses kinodynamic models to push samples in towards . Dynamic Window Approach (DWA) [6] directly generates samples in action space and finds the best sample to move the robot towards in . However, in the case of highly constrained spaces, a large sampling density is necessary. For example, PRM requires many samples to ensure the possibility of finding valid local connections between samples without entering . RRT requires many samples to efficiently progress in toward . DWA requires many samples to simply generate a viable action toward while keeping the robot configuration in . Meanwhile, the output of sampling is often not a smooth trajectory, thus requiring post-processing. However, smoothing in highly constrained spaces may be difficult without the path entering .

Instead of finding , consider now its “dual” problem, i.e., given (with and ), find the unreachable set that generated that plan. Since different can lead to the same plan, the left inverse of , , is not well defined (see Figure (a)a and (b)b). However, we can instead define a similar function such that , where denotes the C-space’s most constrained unreachable set corresponding to .111Technically, can be uniquely determined by and , but we include it as an input to for notational symmetry with . Formally, given a plan , we say

where is the optimal planner. Here, we assume every primitive control has deterministic effects, so plans have no uncertainty associated with their effects. Uncertainty will be addressed in Sections 3.2 and 3.3. We denote the corresponding reachable set of as (Figure (c)c and (d)d). We call the output of a hallucination (details can be found in Section 4.1 and supplemental materials), and this hallucination can be projected onto the robot’s sensors. For example, for a LiDAR sensor, we perform ray casting from the sensor to the boundary between and in order to project the hallucination onto the range readings (Figure (d)d). Given the hallucination for , the only viable (and therefore optimal) plan is . Note that is bijective and its inverse is well defined. All discussed mappings are shown in Figure 3.

Figure 2: Different unreachable sets (grey) lead to the same optimal control plan (red), which generates the same trajectory (blue). But that control plan has a unique corresponding most constrained unreachable set . During training, the hallucinated is mapped to the robot’s sensors, e.g. LiDAR (orange).

3.2 Machine Learning Solution Using Hallucination

Leveraging machine learning, is represented using a function approximator . Note that we aim to approximate instead of the original due to the vastly different domain size: while the domain of is all unreachable sets , ’s domain is only the most constrained ones (Figure 3). Solving demands high generalization over a large domain , while a simple model with a smaller domain can generalize better and robustly produce . An analysis showing the complexity difference is given in Appendix A (see supplemental materials).

Figure 3: Classical motion planning aims at finding a function that maps from a larger domain of unreachable sets to plans . We reduce to the most constrained unreachable sets with hallucination and then use learning for . During deployment we hallucinate the most constrained unreachable set and predict the resulting optimal plan.

During training, control plans generated by a random exploration policy are applied to drive the robot and the resulting sequence of robot configurations is recorded, where and is the number of recorded configurations. To guarantee safety during this exploration phase even without human supervision, we conduct training in an open space without any obstacles, i.e. and . is the most free (smallest) unreachable set, while is the most constrained one. In this obstacle-free space, we use to generate the most constrained unreachable set, in which is the current robot configuration at each time step, and the configuration after executing . The hallucinated can be viewed as all configurations occupied by the robot, , and . Note that we assume a deterministic world model for the hallucination and do not consider motion uncertainty. If necessary, the uncertainty can be addressed by adding an envelope around . The hallucinated is mapped to the robot’s sensors: for geometric sensors, e.g. LiDAR, we hallucinate the range readings based on , as shown in Figure (d)d

. In a data-driven manner, we use a function approximator, i.e. a neural network, to approximate the function

that maps from most constrained unreachable set to the control plan . Note that traditionally training data is collected using human demonstration or reinforcement learning, both of which are difficult in highly constrained spaces. But our inherently safe open training environment precludes the possibility of collision and allows training data to be collected in a self-supervised manner using a random exploration policy.

During deployment, the robot uses a global planner and perceives real unreachable set to produce a coarse path: (Figure (a)a). Each configuration in this sequence can have fewer dimensions than the robot’s original configuration , and also a low resolution. For example, with both translational and rotational components, while with only translations. This coarse global plan can be found very quickly, with conventional search algorithms such as Dijkstra’s [4], A* [8] or D* [5]. Being computed in real time, it is then used to approximate the most constrained hallucinated unreachable set: (Figure (b)b). Here, is an online hallucination function for deployment, which maps from a sequence of approximated planned configurations to the most constrained unreachable set, instead of from a plan of actions , as the case for . We hypothesize that machine learning can generalize over the difference between the codomain of during deployment and in training, and this hypothesis is validated by our experiments. Using the function learned during the training phase, a control plan is finally computed based on the hallucinated unreachable set:

Figure 4: (a) During deployment, a coarse global plan is computed quickly using real unreachable set. (b) Most constrained hallucinated unreachable set is then constructed based on the coarse global plan. (c) Before execution, safety is estimated by adding Gaussian noise to the computed plan.

3.3 Addressing Uncertainties

The LfH motion planner can only generalize well over hallucinated input that is similar to that seen in the training set, and the planning output is not expected to assure safety during deployment. These input differences between hallucination during deployment and hallucination in the training set and the lack of output safety assurance during deployment motivate addressing uncertainties from both the input and output perspectives.

Input Uncertainties Differences between hallucination during deployment and hallucination in the training set may stem from the coarse global path (Figure 4) being different from the robot trajectory in the training set constructed by real robot trajectories (Figure 2). For example, the robot trajectory in the training set may be smoother than the coarse global path computed during deployment. Another difference can arise when the global goal is behind the robot, and the coarse global planner does not consider nonholonomic motion constraints such that the planned global path may start from the current location and directly lead to somewhere behind the robot, while during training the robot has only driven forward. Therefore we need a pre-processing routine that makes sure the input to LfH resembles the training set. In particular, we use smoothing within to assure the coarse global path is as smooth as the robot trajectory during training so as to match the output of with . Before using the LfH planner, we use a feedback controller to rotate the robot to a configuration from which the planned global path is similar to the robot trajectories in the training set (see Section 4.1 for implementation details).

Output Uncertainties The control plan computed by the learned function approximator is a reaction to , learned from the hallucinated training set. Although it serves as an initial solution to the highly constrained workspace, it lacks both the ability to adapt to the uncertainties in real workspaces and any assurance of safety. To address those two problems, we combine the learned control plan with classical Model Predictive Control (MPC) techniques. When executing the learned plan , we assume the output uncertainty can be expressed by Gaussian noise over the nominal inputs . Therefore, within the computed plan , we sample noisy controls around the planned inputs, , and compose a perturbed plan . We then use MPC to simulate the robot under these inputs. We check each resulting trajectory in the real workspace for collision and compute the percentage of trajectories that will not enter : , where , as shown in Figure (c)c. The magnitude of the learned controls

is then heuristically modulated (see Appendix B in supplemental materials for details), while still observing kinodynamic constraints of the robot. For example, the robot may move fast when

is high and slow down when it is low. Before execution, the same MPC model checks if the modulated controls will result in a collision. If so, the controls are ignored and the robot executes a pre-specified recovery behavior. The learned control plan from hallucination can therefore both adapt to real world uncertainties and assure motion safety.

0:  , , , , , , pre-processing, recovery behavior.
1:  // Training
2:  collect motion plans from in free space and form training data , where
3:  train with by minimizing the error
4:  // Deployment (each time step)
5:  receive
6:  coarse
7:  if  does NOT resemble training set then
8:     return   with pre-processing routine
9:  end if
10:  hallucinate
11:  plan
12:  add noise to ,
13:  modulate based on
14:  if  is NOT safe then
15:     return   with recovery behavior
16:  end if
17:  return  
Algorithm 1 LfH Pipeline

LfH Pipeline

The LfH motion planning pipeline is shown in Algorithm 1. In lines 2 – 3, training data is collected via a random exploration policy in obstacle-free space and is trained on hallucinated data. At each step during deployment, line 6 computes a coarse global path from . Lines 7 – 9 correspond to the pre-processing routine based on a feedback controller to address input uncertainties. Line 10 hallucinates and line 11 produces motion plans. Lines 12 – 16 address the output uncertainties.

4 Experiments

In this section, LfH is implemented on a physical ground robot in the real world. We hypothesize LfH can achieve safer, faster, and smoother navigation than existing classical and learning approaches in highly constrained spaces. We first present our implementation of LfH on a robot. Baseline methods to compare LfH against are then described, before we provide quantitative and qualitative experimental results in a controlled test course and natural outdoor/indoor environments, respectively.

4.1 Implementation

A Clearpath Jackal, a differential drive four-wheeled ground robot, is used as a test platform for the LfH motion planner. Jackal’s nonholonomic constraints increase the difficulty of maneuvering in highly constrained spaces. Taking advantage of the widely-used Robot Operating System (ROS) move_base navigation stack [16], we replace its DWA local planner with our LfH pipeline (Algorithm 1), and use the same high-level global planner (Dijkstra’s). Input is instantiated by the 2D laser scan input from an onboard Velodyne LiDAR. Planning in a robot-centric view, is the origin and is a waypoint 1m away on the global path (orientation is ignored for simplicity).

Figure 5: Finite State Machine of the LfH Implementation

For training, data is collected in a self-supervised manner with a Jackal robot in real time simulation (line 2 in Algorithm 1). The planning horizon is chosen to be 1, i.e. only a single action (linear and angular velocity) is produced for faster computation and better accuracy. The random exploring policy is simulated by a human operator randomly pushing an Xbox joystick, with bounded in m/s and in rad/s. This could be easily replaced by a random walk exploration policy. Linear and angular velocities are randomly applied to the robot and perform a large variety of different maneuvers. We record the control inputs ( and ) as training labels. For simplicity, we directly record robot configurations (, , and ), instead of computing them based on and , to extract local goals and to hallucinate LiDAR as training input. The inherent safety provided by the collision-free open training environment allows completely self-supervised learning of a rich variety of motions. We speculate that even when training in an open space in the real world, safety can be assured by a collision avoidance policy to drive the robot back into the middle of the open space when it comes close to the environment boundary. We find the model learned from the less than four minutes simulation in real time can easily generalize to the real world. Training with the four-minute data takes less than one minute on an Intel Core i9-9980HK CPU, indicating high computational efficiency (line 3).

For deployment, we implement a Finite State Machine, shown in Figure 5. For line 6 in Algorithm 1, is smoothed by a Savitzky–Golay filter [12] on the global path. For lines 7 – 9, a PID controller in the pre-processing routine rotates the robot in place to address out-of-distribution scenarios unseen in the training set. Otherwise, the LfH module takes control. For line 10, is constructed as all configurations that are slightly more than the robot’s half-width away from the smoothed global path. We use ray casting to generate hallucinated LiDAR input. For safety, we take the minimum value between the hallucinated and real laser scan. Concatenated with the current linear and angular velocity and the local goal taken from the global plan, the hallucinated LiDAR is fed into the neural controller and a plan with planning horizon is produced (line 11). We add Gaussian noise to the produced and in line 12. The controls are modulated by the safety estimation by a MPC collision checker in line 13. For the recovery behavior routine in lines 14 – 16 when a collision is detected, the robot starts a three-phase recovery behavior designed by hand. If the first (decreasing and increasing iteratively) and second (increasing negated and original iteratively) phase are still unsafe, the robot drives back slowly in the third phase. Full implementation details are given in Appendix B in supplemental materials.

4.2 Results

The LfH controller is tested in a highly constrained obstacle course with minimum clearance roughly 1.3 the robot’s footprint (see Figure 1). The robot needs to perform agile maneuvers to navigate through this environment without a map.

We compare the LfH controller with three baseline local planners. First, we compare to the default DWA local planner in move_base, as a classical sampling-based motion planner. The default parameters recommended by the robot manufacturer are used. Second, a machine learning approach, similar to the Behavior Cloning (BC) work presented by Pfeiffer et al. [17], replaces the green LfH module in Figure 5 and maintains the yellow Turn in Place and Recovery Behavior modules to study the effect of hallucinated learning. We use the same neural network architecture as LfH, but with realistic instead of hallucinated LiDAR input. Third, as a recent improvement upon classical planners, APPLD [24] can fine-tune DWA parameters based on different navigation contexts to improve navigation. Therefore, we compare to APPLD-DWA as an upper bound of classical approaches. In particular, one of the authors with extensive Jackal teleoperation experience provides a demonstration in the same deployment environment for BC and APPLD. The demonstrator aims at quickly traversing the course in a safe manner. APPLD generates four sets of planner parameters to adapt to different regions of the obstacle course. Note that with demonstration in the same environment, we allow BC and APPLD to fit to the deployment environment, while LfH is not given any such training data. Full experimental details are given in Appendix C in supplemental materials.

DWA BC APPLD Hallucination
Success 0 2 8 9
Collision 0 5 2 1
Failure 10 3 0 0
Time 53.96.1 74.12.8 45.81.4
Table 1: Quantitative Results

Ten trials for each planner are executed in the obstacle course. The results of the four methods are summarized in Table 1

. We define “Success” as navigating through the course without any collision. Non-terminal “Collision”s are recorded and we allow the robot to keep navigating to the goal. “Failure” is when the robot fails to navigate through the course to reach the destination, e.g. getting stuck at a bottle neck or knocking down the obstacle course. We compute average traversal time based on the “Success” and “Collision” trials. For default DWA, all ten trials get stuck mostly because the default planner cannot sample viable velocity commands in such a highly constrained environment. BC does not finish three out of ten trials, mostly due to knocking down the course, and in five other trials, the robot collides with the obstacles. The safety check fails to prevent some collisions since the close distance to the obstacle is smaller than the LiDAR’s minimum range. The other two trials are successful, but we observe very jittery motion. Using four sets of parameters learned in eight hours from human demonstration, APPLD successfully navigates Jackal through the obstacle course without collision in eight trials, and causes one gentle collision in each of the other two trials. The average traversal time is longer than the demonstration (67.8s). The LfH controller succeeds in nine out of ten trials with a faster average speed and smaller variance. The motion is qualitatively smoother than other methods. One gentle collision with an obstacle happens in one trial.

Qualitatively, we also test the LfH local planner in both outdoor and indoor natural environments. It is able to successfully navigate through highly constrained spaces cluttered with everyday objects, including tables, chairs, doors, white boards, trash cans, etc. In relatively open space, LfH can also enable smooth and fast navigation. Note the robot has never seen any of those environments (Figure 1 and 6), nor even a single real obstacle, during training, and does not require human supervision.

Figure 6: Qualitative Results: Jackal navigates with the LfH local planner in outdoor and indoor environments with highly cluttered natural objects. Anonymized video of representative trials for each method and the qualitative tests:

5 Conclusion

This paper introduces the novel LfH technique to address motion planning (i.e., navigation) in highly constrained spaces. For robotics, the LfH method addresses the difficulty in planning motion when obstacle space occupies the majority of the surrounding C-space, which usually causes an increasing demand on sampling density and therefore computation using classical approaches. Seeking help from a data-driven perspective, we hallucinate the most constrained workspace that allows the same effective maneuvers in open space to the robot perception and learn a mapping from the hallucinated workspace to that optimal control input. For machine learning, the proposed LfH method provides a self-supervised training approach and largely improves sample efficiency, compared to traditional IL and RL. To combine the benefits of both sides, our local planner estimates motion safety with MPC and enables agile maneuvers in highly constrained spaces with learning. In a physical robot experiment, LfH outperforms classical sampling-based method with default and even with dynamically fine-tuned parameters, and also imitation learning deployed in the identical training environment. An interesting direction for future work is to extended LfH beyond 2D ground navigation, e.g. toward 3D aerial navigation or manipulation with higher degrees of freedom. Another interesting direction is to investigate hallucination techniques that not only hallucinate the most constrained partition

, but also any partition in between, for which the motion plan is still optimal. In that case, hallucination during deployment is no longer necessary.


  • [1] B. D. Argall, S. Chernova, M. Veloso, and B. Browning (2009) A survey of robot learning from demonstration. Robotics and autonomous systems 57 (5), pp. 469–483. Cited by: §2.2.
  • [2] Y. F. Chen, M. Everett, M. Liu, and J. P. How (2017) Socially aware motion planning with deep reinforcement learning. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1343–1350. Cited by: §1, §2.2.
  • [3] H. L. Chiang, A. Faust, M. Fiser, and A. Francis (2019) Learning navigation behaviors end-to-end with autorl. IEEE Robotics and Automation Letters 4 (2), pp. 2007–2014. Note: [2019-24] Continuation of PRM-RL. Tune RL parameters. Claim to beat DWA, but DWA is not tuned. Cited by: §2.2.
  • [4] E. W. Dijkstra (1959) A note on two problems in connexion with graphs. Numerische mathematik 1 (1), pp. 269–271. Cited by: §2.1, §3.2.
  • [5] D. Ferguson and A. Stentz (2006)

    Using interpolation to improve path planning: the field d* algorithm

    Journal of Field Robotics 23 (2), pp. 79–101. Cited by: §2.1, §3.2.
  • [6] D. Fox, W. Burgard, and S. Thrun (1997) The dynamic window approach to collision avoidance. IEEE Robotics & Automation Magazine 4 (1), pp. 23–33. Cited by: §1, §2.1, §3.1.
  • [7] W. Gao, D. Hsu, W. S. Lee, S. Shen, and K. Subramanian (2017)

    Intention-net: integrating planning and deep learning for goal-directed autonomous navigation

    arXiv preprint arXiv:1710.05627. Note: [2017-26] Similar idea of using planning for global and learning for local plan. Input to the learning (neural net) is two images: (1) camera feed, and (2) intention (goal) from the global plan represented as an image. Cited by: §2.2.
  • [8] P. Hart, N. Nilsson, and B. Raphael (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics 4 (2), pp. 100–107. External Links: Document, Link Cited by: §2.1, §3.2.
  • [9] T. M. Howard, C. J. Green, and A. Kelly (2008) State space sampling of feasible motions for high performance mobile robot navigation in highly constrained environments. In Field and Service Robotics, pp. 585–593. Cited by: §2.1.
  • [10] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne (2017) Imitation learning: a survey of learning methods. ACM Computing Surveys (CSUR) 50 (2), pp. 1–35. Cited by: §2.2.
  • [11] L. E. Kavraki, P. Svestka, J. Latombe, and M. H. Overmars (1996) Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation 12 (4), pp. 566–580. Cited by: §2.1, §3.1.
  • [12] S. R. Krishnan and C. S. Seelamantula (2012) On the selection of optimum savitzky-golay filters. IEEE Transactions on Signal Processing 61 (2), pp. 380–391. Cited by: §4.1.
  • [13] J. J. Kuffner and S. M. LaValle (2000) RRT-connect: an efficient approach to single-query path planning. In Proceedings 2000 ICRA. Millennium Conference. IEEE International Conference on Robotics and Automation. Symposia Proceedings (Cat. No. 00CH37065), Vol. 2, pp. 995–1001. Cited by: §2.1, §3.1.
  • [14] J. Latombe (2012) Robot motion planning. Vol. 124, Springer Science & Business Media. Cited by: §3.1.
  • [15] J. Minguez and L. Montano (2004) Nearness diagram (nd) navigation: collision avoidance in troublesome scenarios. IEEE Transactions on Robotics and Automation 20 (1), pp. 45–59. Cited by: §2.1.
  • [16] OSRF (2018) ROS wiki move_base. Note: Cited by: §4.1.
  • [17] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena (2017) From perception to decision: a data-driven approach to end-to-end motion planning for autonomous ground robots. In 2017 ieee international conference on robotics and automation (icra), pp. 1527–1533. Cited by: §1, §2.2, §4.2.
  • [18] M. Pfeiffer, S. Shukla, M. Turchetta, C. Cadena, A. Krause, R. Siegwart, and J. Nieto (2018) Reinforced imitation: sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations. IEEE Robotics and Automation Letters 3 (4), pp. 4423–4430. Cited by: §2.2.
  • [19] R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT Press. Cited by: §2.2.
  • [20] L. Tai, G. Paolo, and M. Liu (2017) Virtual-to-real deep reinforcement learning: continuous control of mobile robots for mapless navigation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 31–36. Note: [2017-184] Similar to our idea and to [1]: RL (ADDPG) from 10-dim laser range and goal to predict continuous v and w. Performance is inferior than move base, but superior to handicapped move base (handicapped to only 10-dim lidar). Cited by: §2.2.
  • [21] Y. Wang, H. He, and C. Sun (2018) Learning to navigate through complex dynamic environment with modular deep reinforcement learning. IEEE Transactions on Games 10 (4), pp. 400–412. Note: [2018-15] Modular DRL: DQN for global, two-stream (spatial and temporal) DQN for local, Deals with dynamic obstacles. Laser scan to discrete actions. Classic RL similar to classic PFM, modular RL beats PFM Cited by: §2.2.
  • [22] Y. Wang, B. Liu, J. Wu, Y. Zhu, S. S. Du, L. Fei-Fei, and J. B. Tenenbaum (2019) Dual sequential monte carlo: tunneling filtering and planning in continuous pomdps. arXiv preprint arXiv:1909.13003. Cited by: §2.2.
  • [23] M. Wigness, J. G. Rogers, and L. E. Navarro-Serment (2018) Robot navigation from human demonstration: learning control behaviors. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1150–1157. Note: [2018-8] Use IRL to learn local costmap Cited by: §1, §2.2.
  • [24] X. Xiao, B. Liu, G. Warnell, J. Fink, and P. Stone (2020) APPLD: adaptive planner parameter learning from demonstration. IEEE Robotics and Automation Letters 5 (3), pp. 4541–4547. Cited by: §2.1, §2.2, §4.2.
  • [25] L. Xie, S. Wang, S. Rosa, A. Markham, and N. Trigoni (2018) Learning with training wheels: speeding up training with a simple controller for deep reinforcement learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6276–6283. Note: [2018-16] RL start with a PID controller to speed up training. “we set MoveBase as the baseline approach to show that the learnt policy is comparable with existing deterministic approach.” Cited by: §2.2.
  • [26] J. Zeng, R. Ju, L. Qin, Y. Hu, Q. Yin, and C. Hu (2019) Navigation in unknown dynamic environments based on deep reinforcement learning. Sensors 19 (18), pp. 3837. Note: [2019-1] (this one deals with dynamic obstacle, a DRL approach) Actions are v and w. Add reward shaping, memory (GRU), and curriculum learning to A3C. Only compare to other A3C methods, not classic navigation approaches. Results look like no better than traditional approaches. Cited by: §2.2.
  • [27] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE international conference on robotics and automation (ICRA), pp. 3357–3364. Note: [2017-587] RL from vision and goal to high level motion command: move forward or turn left. Cited by: §2.2.

Appendix A: Quantifying the Complexity of Hallucination Space

In this section we use a simple gridworld to illustrate the change in input space complexity, i.e. difference in domain size between and , when hallucination is used.

Figure 7: A simple gridworld to illustrate the complexity of hallucination space

Considering a 2D gridworld as shown in Figure 7, the agent is at the current configuration and at each step can visit the upleft or upright grid in front of it. The possible configuration within the gridworld in front the agent is , since for each grid, there either exists or not exists an obstacle. However, the hallucination space is essentially of the same complexity as the possible planning trajectories, because there exists a one-to-one relationship between a planning trajectory and its most constraint . The part of gridworld marked in blue are the reachable configurations from , and there are in total possible planning trajectories, and therefore different . In general, if each time step the agent can go to nearby grids, then . The main message here is the original configuration space is much larger than the hallucination space, i.e. , where can be regarded as the planning horizon. This is the main cause of the data-efficiency and good generalization ability of the LfH method.

Appendix B: Implementation Details


Savitzky–Golay Filter

The purpose of the Savitzky–Golay filter is to smooth the coarse global path so that it looks similar to the robot trajectory in the training set, with which the hallucinated training data is generated. The difference between the two can be caused by (1) highly discrete global path waypoints vs. high-resolution robot trajectory, (2) imperfections from move_base global planner, e.g.jitters, irregularities, vs. smooth robot trajectory, and (3) two-dimensional (, ) global waypoints without orientation vs. actual robot trajectory that fully observes nonholonomic constraints.

To address the global planner’s inability to consider nonholonomic constraints, we replace the first ten global waypoints with artificial waypoints, evenly distributed between the middle points of the rear and front bumper. The distribution of these artificial waypoints can be adjusted based on the hallucinated LiDAR’s field-of-view. Then we apply Savitzky–Golay filter with . We empirically observe the filtered global path is similar to the robot trajectory in the training set.

PID Controller

The purpose of the PID controller is to drive the robot back from out-of-distribution scenarios unseen in the training set. One particular out-of-distribution scenario is when the planned global path leads to somewhere behind the robot. Instead of a U-turn with a smooth Dubin’s path, the coarse global path directly connects the robot to somewhere behind. In such scenarios, we use the PID controller to rotate the robot in place, so that the difference between the current robot heading and the current tangential direction of the global path always falls in the range . We empirically observe a proportional gain of suffices, while smoother turning can be enabled by adding integration and derivative parts. Paired with the Savitzky–Golay filter, any robot state within this range is in distribution of the training data.


Hallucination Footprint

The Jackal’s half-width is 0.165m. During training, we use a hallucination footprint of 0.18m, i.e. any space further than 0.18m to the left and right hand side of the robot trajectory is hallucinated as occupied. In this way, a “narrow corridor” is built and we hallucinate the LiDAR input using raycasting.

LiDAR Parameters

A 2D LiDAR suffices for our LfH controller, We use ROS pointcloud_to_laserscan package to transfer the 3D Velodyne pointcloud to 2D laserscan. The 360 field-of-view ( and ) with has 2095 laser beams and everything between the height m is treated as an obstacle.

Network Architecture

The LfH planner is a three-layer neural network, with 256 hidden neurons and ReLU activation for each layer. It takes in LiDAR input (normalized to

), current and (normalized to and , respectively), and coordiantes of the local goal. The output is one single action .


A snapshot of the collected training data is shown in Figure 8. The green path in Figure (a)a represents the global trajectory of the robot, while the blue and red profiles in Figure (b)b represent the linear and angular velocity, and , respectively. For each data point, we take its 100 neighbors to create the hallucination. Examples of the hallucinated training LiDAR data are displayed in Figure 9.

(a) Robot Trajectory
(b) Linear and Angular Velocities
Figure 8: Visualization of the Training Data
Figure 9: Examples of Hallucinated Training LiDAR Data: Green dot represents the robot and green arrow is the orientation. Blue and red dots are the hallucinated obstacles on the left and right hand side of the robot trajectory, respectively. The black rays represent LiDAR beams. Some “leaky” LiDAR beams exist due to numerical error.


Safety Estimation

Gaussian noise of zero mean and 10% standard deviation is added to the produced

and for safety estimation. The MPC collision checker looks 20 steps ahead with 0.0625s integration interval on a batch size of 32 samples. For each sample, we check if any of the 40 points evenly distributed on the four edges of the vehicle after motion is further to the origin than the corresponding LiDAR beam reading. If any of them is, this sample causes a collision. We then compute the percentage of safe samples.

Speed Modulation

We modulate the controls with

where the weights for the speed modulation and correspond to roughly 50% - 150% modulation. In order to prevent noisy being amplified by the modulation in safe spaces, we suppress any rad/s to .

Recovery Behavior

When a predicted and modulated pair is not safe according to the MPC collision checker, the robot performs a three-stage recovery behavior. In the first stage, the robot iteratively decreases and increases , both by 2%, until the resulted is safe. If it is still not safe after 20 iterations, the robot enters the second stage. It negates , and increases both negated and original by 2% in each step. If this is still not safe after 20 more iterations, the robot backs up at 0.1m/s in stage three.

Appendix C: Experiment Details


ROS move_base navigation stack’s default local planner, DWA, is used as our first baseline. We use the default parameters recommended by the robot manufacturer, Clearpath Robotics. The specific values of the parameters are shown in the second row of Table 2.

Behavior Cloning

We use the same neural network architecture as LfH, but with realistic in stead of hallucinated LiDAR input. The experienced human demonstrator aims at tele-operating the robot to traverse through the obstacle course in a fast and safe manner. The entire demonstration lasts 67.8s.


Using the same 67.8s demonstration, APPLD first automatically segments the demonstration into four contexts. Within each context, we use a black-box optimization technique, CMA-ES, to tune the optimal set of DWA parameters that imitate the human demonstration. The resulted values of the four sets of parameters after eight hours of training are shown in the last four rows of Table 2. The robot switches among these four sets of parameters using the prediction of a LiDAR-based context predictor (a neural network with one hidden layer of ten neurons).

v w s t o p g
Default 0.50 1.57 6 20 0.10 0.75 1.00
Context 1 0.40 0.85 17 15 0.69 0.96 0.75
Context 2 0.35 1.52 16 15 0.92 0.69 0.43
Context 3 0.42 2.18 13 19 0.76 0.06 0.39
Context 4 0.25 1.34 8 40 0.43 0.65 0.98
Table 2: Parameter Values Used by Default DWA and APPLD: max_vel_x (v), max_vel_theta (w), vx_samples (s), vtheta_samples (t), occdist_scale (o), pdist_scale (p), gdist_scale (g)