Residual Reactive Navigation: Combining Classical and Learned Navigation Strategies for Deployment in Unknown Environments
In this work we focus on improving the efficiency and generalisation of learned navigation strategies when transferred from its training environment to previously unseen ones. We present an extension of the residual reinforcement learning framework from the robotic manipulation literature and adapt it to the vast and unstructured environments that mobile robots can operate in. The concept is based on learning a residual control effect to add to a typical sub-optimal classical controller in order to close the performance gap, whilst guiding the exploration process during training for improved data efficiency. We exploit this tight coupling and propose a novel deployment strategy, switching Residual Reactive Navigation (sRNN), which yields efficient trajectories whilst probabilistically switching to a classical controller in cases of high policy uncertainty. Our approach achieves improved performance over end-to-end alternatives and can be incorporated as part of a complete navigation stack for cluttered indoor navigation tasks in the real world. The code and training environment for this project is made publicly available at https://github.com/krishanrana/2D_SRRN.READ FULL TEXT VIEW PDF
Learning-based approaches often outperform hand-coded algorithmic soluti...
Recent research efforts enable study for natural language grounded navig...
We study how robots can autonomously learn skills that require a combina...
This paper describes and analyzes a reactive navigation framework for mo...
We propose SplitNet, a method for decoupling visual perception and polic...
Developments in mobile robot navigation have enabled robots to operate i...
PointGoal Navigation is an embodied task that requires agents to navigat...
Residual Reactive Navigation: Combining Classical and Learned Navigation Strategies for Deployment in Unknown Environments
Deep reinforcement learning approaches have been shown to learn effective and complex reactive navigation policies, portraying behaviours which are difficult to derive analytically. They however require extensive amounts of online training data which is a limiting factor for real robot applications. Additionally, at deployment, these systems could yield undesirable behaviour in states where the policy failed to generalise, particularly in novel, unseen environments. This reduces the overall reliability we can place on these systems. Classical approaches in reactive robotic navigation on the other hand, are generally deterministic and can portray competent obstacle avoidance capabilities in any given domain. They however lack the complexity required to efficiently navigate in cluttered environments and are commonly plagued by oscillations , seizure in local minima  and poor path efficiency. Such high level solutions are difficult to hand-engineer explicitly and tend to deteriorate in performance when extensively tuned for a particular environment.
In this work we attempt to combine the classical and learned strategies in order to address their respective limitations. We build upon prior work in the area of residual reinforcement learning, which primarily focuses on robotic manipulation tasks, and adapt it to the goal-directed navigation task presented by Anderson et al. . The concept is based on learning to improve upon prior experience rather than learning the entire task structure from scratch. We train a continuous drive mobile robot equipped with a 180
laser scanner to efficiently navigate towards arbitrary goals in cluttered environments by learning a residual policy to close the performance gap of a classical reactive controller. All training is completed in simulation and zero shot transferred to a real robot. We present a novel deployment strategy that exploits this tight coupling of the two systems and probabilistically falls back to the classical controller in cases of high policy uncertainty. This reduces the chances of undesirable and potentially catastrophic failure. We show a significant improvement to the variance and training sample efficiency required to learn the navigation task and superior generalisation performance at deployment on a real robot when compared to end-to-end learned systems, the baseline prior and the commonly used move-base ROS navigation stack. We created a lightweight simulation environment for this navigation task and release it together with our code for this work athttps://github.com/krishanrana/2D_SRRN.
a novel deployment strategy for continuous control reactive navigation agents which incorporates model uncertainty to stochastically inform a switching condition when the policy is uncertain.
an approach which tightly couples classical reactive navigation strategies with learning-based methods, showing significant reductions to the sample complexity and variance when learning to navigate end-to-end.
we show that our approach can directly be transferred from simulation to the real-world for cluttered indoor navigation tasks without any fine-tuning and can outperform end-to-end trained systems.
With the recent surge in deep learning, several supervised and reinforcement learning based approaches have been presented for the task of reactive robot navigation. Supervised approaches for obstacle avoidance have been shown to enable a robot to navigate along forest trails using an imitation learning approach. Kim et al.  and Pfeiffer et al.  train a navigation agent using a global path planner as labelled training data. These systems are however limited to the performance of the training data. Zhu et al.  train a monocular-based robot, using deep reinforcement learning for target driven navigation. Given the significant amount of environment interaction required, the agent is trained in a high fidelity simulator and then fine tuned before deployment in the real world. Given the close correspondence of laser scans in simulation and the real-world,  and  show that training an agent in simulation can be transferred directly to a real robot. Despite showing reasonable performance, the robot was still shown to fail in scenarios it failed to generalise to. Additionally both approaches utilise a dense reward signal to accelerate the training process which we found to yeild inefficient routes to the target.
Navigation is a well developed field in the robotics community and can be divided into two categories: deliberative and reactive systems. While deliberative systems typically rely on the availability of a globally consistent map, reactive planners determine collision free paths based on immediate perception of their surrounding environment. They can handle dynamic obstacles and those unaccounted for in the global map. Potential field approaches 
are a popular set of reactive algorithms which utilises a local potential function to attract the robot towards the goal and repel it away from obstacles. Vector Field Histograms (VHF) is a real time motion planning algorithm that generates a polar histogram to represent the polar density of obstacles around the robot. The robot’s steering angle is then chosen based on least polar density and closeness to goal. Such an approach requires the the polar histogram to be computed at every timestep and is suited for sparse moving obstacles. Other approaches utilise short term memory to inherently build a local map , allowing them to escape from local minima. The bug family of algorithms  can guarantee completeness but is inefficient. A common disadvantage of these classical approaches is that they require significant tuning and hand engineering to achieve good performance, coupled with a tendency to deteriorate when deployed in domains with different characteristics. Further common problems are the occurrence of local minima , oscillations  and suboptimal path length.
There has been a recent interest in the way of combining the best of both worlds in the robotics community: classical and learned systems. Bansal et al.  train a perception module to produce obstacle-free waypoints with which an optimal controller can follow. Xie et al.  leverage a simple classical controller in form of a P-controller to speed up the training process for a navigation agent. The idea is based on the fact that a simple P-controller will yield higher rewards than random exploration alone. More closely related to this work is the residual reinforcement learning approach introduced by Johannink et al.  and Silver et al.  concurrently. The approach is based on learning a residual policy to improve upon a sub-optimal classical controller. Both works focused on robotic manipulation tasks where the residual learned to deal with contacts and friction, typically difficult to model analytically. We extend the approach for the first time to the navigation domain, which exhibits a larger unstructured operational space. We particularly focus on the deployment of these system where we take advantage of the close coupling of the learned and classical controller as a result of the joint training process. Other works in this area  show how a learned policy can be used to modulate Trajectory Generators in order to improve upon their base level of performance and show its applicability for real robot locomotion. In all cases we see how the learned system benefit from prior knowledge during training and the prior systems benefit from the learned system during deployment. However there has been little work in leveraging this close coupling to improve the generalisation capabilities of these system when deployed in novel environments. In this work we take a step towards this notion.
In this work we attempt to gain the efficiency and adept capabilities provided by learned strategies to reactive navigation whilst improving the generalisation of these systems when transferred to the real world. We closely couple the learned policy with a classical controller using residual reinforcement learning and utilise an uncertainty measure to inform a stochastic switching condition in cases of high policy uncertainty. We introduce the navigation task and key components of our strategy in this section before describing our approach in detail.
We formulate the problem as a decision making process where an agent is required to avoid obstacles and reach a target location in the shortest time possible in unknown environments. Given an input at time , the agent is required to determine a suitable action . After executing the action, the robot transitions to the next state and receives a reward from the environment. We found the standard dense reward used in prior work [26, 29, 28] to bias the policy towards learning long sub-optimal paths to the goal in its attempt to maximise its reward. Given our motivation to learn an efficient planner, we focus on the sparse reward setting, whilst being significantly difficult, allows the policy to identify novel and potentially efficient solutions. We define the reward as if and otherwise, where is the distance between the target and the agent and is a set threshold. The overall objective of the decision making process is to maximise the cumulative future rewards , where is a discount factor.
There have been several approaches introduced in the deep learning literature which attempt to extract uncertainty from neural networks[8, 16, 22, 18]. With the extension of deep neural networks in the reinforcement learning community, recent work has explored the applicability of these approaches to deep reinforcement learning algorithms and has shown impressive results particularly in the area of safe reinforcement learning. Kahn et al.  utilise a combination of MC-dropout  and bootstrapping [16, 21] to predict uncertainty in a collision prediction model which modulates the agent’s velocity. Lütjens et al.
utilise a similar uncertainty estimation method to enable safe pedestrian avoidance. Clementset al.  introduce an approach which can separately determine the aleatoric risk using a distributional approach and epistemic uncertainty using a Bayesian framework. Other works have suggested the use of discriminative models to provide estimates of uncertainty . Such estimates yield poor results in novel states which the model had not generalised well to, limiting its applicability.
We extend the field of residual reinforcement learning to the navigation domain and exploit the tight coupling between a classical and learned controller to address the generalisation problem faced by trained systems when transferred to a different domain. We achieve this by utilising an uncertainty estimate of the residual policy to inform a switching condition to the prior in cases of high uncertainty. We firstly describe the training process of our approach and then describe our novel deployment strategy.
We follow a similar process to Johannink et al.  for residual reinforcement learning. The system is decomposed into two components: a residual policy and classical controller referred to as the prior.
The resulting action effect from the two systems can be described as , where represents the learned residual policy and represents the prior. We refer to this combination as a hybrid action. The residual policy is parameterised by a neural network and produces a suitable action effect to modify the priors’ output. For the prior, we utilise the Artificial Potential Fields controller introduced by Warren et al. . It demonstrates a competent level of obstacle avoidance capabilities and exhibits the same limitations faced by most reactive planners . We note here that any other prior controller for this task can be utilised. The policy is trained using TD3, an off-policy deep reinforcement learning algorithm for continuous action spaces in conjunction with the prior as described in . The output of the prior is a 2 dimensional vector consisting of a continuous linear, and angular velocity. The output of the policy is a 2 dimensional vector consisting of the learned residual terms for both the linear, and angular velocities, . The range of the residual action space allows the network to negate the priors actions in states it deems necessary. To enable the extraction of the policies uncertainty at deployment, we train the actor network with MC-dropout with a ratio of 0.2. We note here that any epistemic uncertainty measure can be used here and the focus of this work is primarily on how it can be used within the given system.
We assume that the robot can localise itself within a global map in order to determine its relative position to a goal location. We divide the 180 laser scan range data into 15 bins and concatenate the robots angle and distance to goal, previous linear and angular velocity and additionally the linear and angular velocities determined by the prior controller for the current state . We found that this was a necessary step to improve the stability during training, which yielded smoother trajectories during deployment. The observation space is hence a 21 dimensional vector.
Directly transferring a policy trained in simulation to the real world is typically plagued by the failure of the policy to generalise to unseen states. We propose a novel deployment strategy which builds on the RRN approach described previously. The prior within the RRN system is enhanced by the learned residual; however in cases where the residual fails to generalise it could potentially deteriorate the performance of the system, limiting its reliability. We address this by introducing switching-RRN (sRNN) which probabilistically switches off the effect from the learned residual in cases of high policy uncertainty, instead only executing the prior. The stochastic switching is reminiscent of the -greedy strategy used in multi-armed bandit problems 
and removes the need for fixed thresholds which are difficult to determine beforehand. The higher the uncertainty, the higher the probability of choosing theprior over the hybrid action. The prior is a reliable fallback, which despite being sub-optimal, is deterministic and can be guaranteed to no have no collisions. The complete sRRN algorithm is given in Algorithm 1.
The residual policy action and corresponding uncertainty is extracted by finding the mean and variance of 100 stochastic forward passes of the actor network using MC-dropout.
We conducted several experiments to evaluate the performance of our approach in both simulation and the real world. Due to the lack of an efficient environment dedicated to learning navigation strategies using 2D laser scanners, we created a custom simulation environment using Box2D, a 2-dimensional physics simulation engine which provides a significant performance boost to the learning process over Gazebo and VREP used in prior work  . We freely release our environment111https://github.com/krishanrana/2D_SRRN to enable further development in this area. Given the close correspondence of 2D laser scans in simulation and the real world, we perform zero shot transfer of our trained policy to a real robot and evaluate the effectiveness of the system when incorporated into the ROS navigation stack for cluttered indoor navigation. Our approach, which we term Residual Reactive Navigation (RRN) is divided in two variants at deployment: with switching (sRNN) and without switching. The first approach executes the combined system at all stages during deployment and the second follows the switching strategy we introduce in Algorithm 1. We clip the resulting action outputs within the range -1 and 1. RRN is compared across 4 different baselines:
End-to-end: Using TD3, a policy is trained end-to-end to learn the entire navigation task from scratch. To keep consistent with reward definitions in this study we utilise the sparse reward signal for this method. The input to the network is a 19 dimensional state vector similar to that described in Section IV-A however does not include the prior controller actions. Both actions are clipped to the range -1 and 1 by a tanh()activation function.
Prior: This represents the analytically derived local navigation controller based on the Artificial Potential Fields approach . The input to this system is the angle to the goal and the raw 180 laser scan data.
The priors linear velocity is limited to be between 0 and 1 whilst, its angular velocity spans the range -1 and 1. Random:
Actions are randomly sampled from a uniform distribution between -1 and 1.
To evaluate the performance of these systems, we report the average episodic success which is achieved when the agent arrives within 0.2m of the goal. The episode is timed out after 300 timesteps and is considered unsuccessful thereafter. In addition to success rate, we report Success weighted by (normalised inverse) Path Length (SPL), a specific measure for the PointGoal task presented by Anderson et al. , and the episodic actuation time as a measure of efficiency of the approach. The SPL ratio requires a measure of the shortest path to the goal which we approximate using the path found by an A-Star search across a 2000 1000 grid.
We evaluated the training performance of each approach after every 10 episodes during training and the results are shown in Figure 2. All evaluations were made across 10 different seeds. The performance of the prior and random agent are overlayed for comparison. The end-to-end based approach shows considerably higher variance when compared to our approach, and requires a larger number of training episodes before convergence. The prior alone shows sub-optimal performance, but it is significantly better than a totally random agent which was incapable of achieving any reward. The residual learning approach shows significantly faster convergence which can be attributed to the prior guiding the exploration process, whilst allowing the policy to explore the regions around it for potential improved returns. In all the learning cases, the agent is driven by the sparse reward signal to identify the fastest route to the goal.
The main contribution in this work focuses on a novel stochastic deployment strategy which takes advantage of this close coupling to enable robust deployment in unseen environments. The evaluation is made both in simulation and on a real robot for target driven navigation tasks in cluttered indoor environments.
We evaluate the agent in two different scenarios for its goal and domain generalisation capabilities. The goal generalisation is reported for the agent’s performance to navigate towards a set of unseen goal locations during training. This is done within the environments used for training. The environment generalisation performance is evaluated within a set of novel environments previously unseen to the agent. We present the results in Table I.
|Goal Generalisation||Environment Generalisation|
|RRN (without switching)||1.00||0.906||64.1||0.867||0.887||62.1|
Our approach shows improved performance in all cases, achieving the highest success rate across both scenarios. We particularly note that the switching variant of RRN, as presented in Algorithm 1, plays a significant role when deploying the agent in novel unseen environments. This is attributed to the ability of the hybrid system (residual policy + prior) to fall back on to solely the prior in cases of high residual policy uncertainty. This is a limitation faced by the end-to-end approach which totally relies on the policy generalising to all unseen cases, which is not the case given finite training time in non-exhaustive environments. The prior, whist showing significantly longer actuation times, is still capable of reaching the goal reliably as indicated by its high success rate. The high actuation time is a result of the oscillations the prior controller is prone to when navigating. Whilst there are possible analytical solutions that can solve this problem we focus on showing how learning a residual can overcome this need and potentially learn far superior solutions. The high success rate of the prior makes it a dependable fall back for the sRRN system at the expense of an efficiency drop. This is a reasonable trade-off which improves the overall robustness of the system in novel environments. Figure 3 shows the trajectory taken by the 3 different systems.
The trajectory (given in yellow) is plotted using transparency to illustrate the amount of time the agent took in a given location before progressing. We show a typical failure case experienced by end-to-end trained systems, which in the presence of an unfamiliar state, acts in an undesirable manner. Both the prior and our approach successfully reach the goal however as shown by the more pronounced trajectory of the prior, the prior spent a longer period in a given location as a result of oscillations before progressing forward. This shows how the residual policy in RRN was capable of learning the appropriate control effect to modify the prior with in order to overcome such a drawback.
We conduct a study to examine the effect of the residual on the control action taken and how the uncertainty impacts a meaningful switching condition. A segment of an sRRN agent’s trajectory is recorded and the corresponding action components are plotted as a stacked bar plot given in Figure 4. The length of each coloured bar indicates the magnitude of the control effect from the residual and prior individually, and the total sum of their magnitudes is the total action effect executed in the environment. We overlay the residual policy’s uncertainty we extracted using dropout on the bar plots and indicate areas of interest from A to E. Segments A and C illustrate the residual policy slowing the priors’ angular transition from left to right to reduce the chances of oscillations by providing a gradual opposing velocity to the prior. In cases of agreement between the two systems, the residual amplifies the prior’s control effect in states it deems safe to, given its motivation to reach the goal as fast as possible. This is shown in Segments B and E. Segment D indicates a region of higher policy uncertainty and a case of prior fallback as shown by the sole orange bar in the region. We show that this ability to fallback plays a significant role when deployed on a real robot, particularly when transferred directly from simulation. We discuss this in more detail in Section V-B2.
Given the close correspondence of laser scans between simulation and the real world, we directly transfer the trained policy to a real robot for deployment after training. A PatrolBot mobile base shown in Figure 1 is used, which is equipped with a 180 laser scanner. All velocity outputs were scaled to 0.25 before execution on the robot.
The deployment environment was a typical office environment which had been previously mapped using the laser scanner. We utilise the AMCL ROS package to localise the robot within this map and pass the relevant state information to both the prior and policy. We note here that despite the availability of a global map, the robot is only provided with global pose information and no additional information about its operational space. The environment additionally contained clutter which was unaccounted for in the mapping process. We utilise a global planner to generate target locations 3 meters apart for our local planner to enable large traversals through the office space.
|Trajectory 1||Trajectory 2|
|RRN (without switching)||Fail||Fail||Fail||Fail|
We evaluate the performance at different start and target locations indicated as Trajectory 1 and Trajectory 2 in Figure 5. Trajectory 1 exhibits a range of obstacles, tight turns and dynamic human subjects along the traversal whilst Trajectory 2 consists of a long narrow office corridor. The second scenario and dynamic subjects were not encountered in any of the training environments. For a baseline comparison, we additionally evaluate the default move-base local planner distributed as part of the ROS navigation stack. The results are summarised in Table II. We note the shorter actuation times portrayed by our sRNN system over the prior only approach, and competitive results when compared to the widely used ROS move-base planner.
We overlay the trajectories taken by the four systems of interest to evaluate the reasons for the success of our approach. Given the two-fold benefit sRRN can provide, we firstly compare its performance to its non-switching variant and the end-to-end learned system.
The sRRN trajectory, shown in Figure 5 (a) is colour coded to indicate regions of prior fallback (purple) or execution of the hybrid (prior + policy) system (red). Dense regions of purple indicate higher residual policy uncertainty whereas the sparse switches seen along the trajectory are a result of the stochasiticity of the algorithm. Whist Trajectory 1 was successful for both the RRN variants, we can see how sRRN determined a smoother trajectory at region of interest A as a result of prior fallback (shown by the dense regions of purple). Trajectory 2 was a difficult scenario for both the non-switching variant of RRN and the end-to-end trained policy particularly because narrow corridors were different to any of the scenarios encountered during training. Both of these approaches failed at region of interest B. sRRN on the other hand was capable of progressing from this region as a result of prior fallback given the residual policy’s higher uncertainty in this region. We enlarge the key regions that our switching system contributed to the success of the trajectory shown within the blue outline.
The trajectories additionally show how learning a residual was capable of improving the efficiency of the prior. We direct the reader’s attention to the enlarged regions outlined in orange shown in Figure 5 (a) and (c) which illustrate the amount of time spend by the agent in a particular region. Darker regions indicate slow and inefficient oscillatory motions. It was interesting to note the reversing behaviour exhibited by RRN in the cases where it found itself in corners, or too close to obstacles. Whilst such behaviours can be hard-coded after identifying the dead-end, we show learning a residual was capable of modifying the prior’s behaviours to achieve this. We provide a video to demonstrate these behaviours on our project page.
We show that by learning an enhancement policy on top of a competent classical controller, we can attain the performance advantages of learned systems whilst exploiting the availability of the prior to fall back on in cases of high policy uncertainty. This allows for superior performance when transferring from simulation to a real robot when compared to end-to-end based approaches. Given that we allow the residual to develop reversing behaviours, we found our robot to get stuck in particular scenarios where it blindly reversed into a wall. This is a particular limitation in our implementation which can be addressed by utilising rear-facing senses to inform the agent. In future work we will explore alternate fusion strategies in cases where the uncertainty of the prior can be extracted.
international conference on machine learning, pp. 1050–1059. Cited by: §III-B.