Bayesian Controller Fusion: Leveraging Control Priors in Deep Reinforcement Learning for Robotics

by Krishan Rana, et al.

We present Bayesian Controller Fusion (BCF): a hybrid control strategy that combines the strengths of traditional hand-crafted controllers and model-free deep reinforcement learning (RL). BCF thrives in the robotics domain, where reliable but suboptimal control priors exist for many tasks, but RL from scratch remains unsafe and data-inefficient. By fusing uncertainty-aware distributional outputs from each system, BCF arbitrates control between them, exploiting their respective strengths. We study BCF on two real-world robotics tasks: navigation in a vast, long-horizon environment, and a complex reaching task that involves manipulability maximisation. For both of these domains, there exist simple hand-crafted controllers that can solve the task at hand in a risk-averse manner but do not necessarily exhibit the optimal solution given limitations in analytical modelling, controller miscalibration and task variation. As exploration is naturally guided by the prior in the early stages of training, BCF accelerates learning while substantially improving beyond the performance of the control prior as the policy gains more experience. More importantly, given the risk-aversity of the control prior, BCF ensures safe exploration and deployment, where the control prior naturally dominates the action distribution in states unknown to the policy. We additionally show BCF's applicability to the zero-shot sim-to-real setting and its ability to deal with out-of-distribution states in the real world. BCF is a promising approach for combining the complementary strengths of deep RL and traditional robotic control, surpassing what either can achieve independently. The code and supplementary video material are made publicly available at




I Introduction

As the adoption of autonomous robotic systems increases around us, there is a need for the controllers driving them to exhibit the level of sophistication required to operate in our everyday unstructured environments. Recent advances in reinforcement learning (RL), coupled with deep neural networks as function approximators, have shown impressive results across a range of complex control tasks in robotics, including dexterous in-hand manipulation [dexterous], quadrupedal locomotion [haarnoja2018learning], and targeted throwing [throwing].

Nevertheless, the widespread adoption of deep RL for robot control is bottlenecked by two key factors: sample efficiency and safety [Ibarz_2021]. Learning these behaviours requires large amounts of potentially unsafe interaction with the environment, and the deployment of these systems in the real world comes with little to no performance guarantees. The latter can be attributed to the black-box nature of neural networks, while the sample complexity stems from the tendency of RL agents to search the solution space randomly. This is further exacerbated in the sparse, long-horizon reward setting.

Fig. 1: Bayesian Controller Fusion: We learn a compositional policy (red) for robotic agents that combines an uncertainty-aware deep RL policy (green) and a traditional handcrafted controller (blue). Utilising this compositional policy to govern exploration allows for accelerated learning towards an optimal policy and safe behaviours in unknown states. It additionally allows for reliable sim-to-real transfer of RL policies.

A general avenue to addressing the sample complexity in RL is the deliberate use of inductive bias or prior knowledge to aid the exploratory process. This includes reward-shaping [schoettler2019deep, andrychowicz2018hindsight, rew_shaping_theoretical], curriculum learning [bengio2009curriculum], learning from demonstrations [hester2018deep, vecerik2017leveraging], and the use of behavioural priors [johannink2018residual, jeong2020learning, rana_mcf, lee2020guided]. The incorporation of prior knowledge in the form of behavioural priors has been gaining increasing traction in recent years. These priors include learned policies or hand-crafted controllers that capture the core capabilities for solving a task. They, however, are not necessarily the optimal solution to solving the task at hand. We refer to this class of methods as Reinforcement Learning from Behavioural Priors (RLBP). RLBP approaches can directly query an action from the prior at any given state. This allows for a diverse range of mechanisms for introducing inductive bias during training, including regularisation [tirumala2019exploiting_, galashov2019information_, cheng2019control], exploration bias [jeong2020learning, rana_mcf, cheng2019control] and residual learning [johannink2018residual, silver2018residual].

We focus on the robotics setting, where decades of research have yielded numerous behavioural priors in the form of hand-crafted controllers and algorithmic approaches for the vast majority of real-world physical systems (from mobile robots to humanoids) and tasks [siciliano2016springer]. These include classical feedback controllers [maxwell1868governors], trajectory generators [ijspeert2008central] and behaviour trees [colledanchise2016behavior]. While the explicit analytic nature of these controllers comes with certain performance guarantees including safety and robustness, they can be highly suboptimal with respect to variations in the task and tend to require great effort in system modelling and controller tuning to achieve higher levels of performance.

In this work, we present Bayesian Controller Fusion (BCF), a novel RLBP approach that closely couples hand-crafted robot controllers within the RL framework for accelerated learning and safety awareness during both training and deployment. BCF enables RL agents to learn substantially better behaviours than the underlying hand-crafted controllers, while exploiting their risk-aversity for safe behaviours in out-of-distribution states, a limiting factor for neural network based policies.

In order to exploit the respective strengths of each of these systems, we draw inspiration from the dual-process theory of decision making, seen in the human brain, that demonstrates the use of multiple neural controllers when acting [daw2005uncertainty_, dayan2008decision, hassabis2017neuroscience]. As opposed to leveraging just a single controller, this hybrid control strategy allows humans to exploit the strengths of each system when necessary in order to successfully complete a task. BCF follows this strategy, learning a compositional policy that interpolates between the behaviours specified by uncertainty-aware stochastic outputs generated by an RL policy and a hand-crafted controller as shown in Figure 1. Our Bayesian formulation allows the system exhibiting the least uncertainty to dominate control. This has important implications both during training and deployment. In states of high policy uncertainty, BCF biases the composite action distribution heavily towards the risk-averse prior, reducing the chances of catastrophic failure. In more certain states, it exploits the optimal behaviours discovered by the policy. We highlight our key contributions below:

  • A novel uncertainty-aware training strategy that accelerates learning in sparse-reward, long-horizon task settings while allowing for safe exploration in unknown environments.

  • The ability to leverage a suboptimal controller to aid learning without inhibiting the final policy from identifying the optimal behaviours.

  • A novel deployment strategy for robot control that combines the strengths of classical controllers and learned policies, allowing for reliable performance even in out-of-distribution states.

  • An evaluation of our deployment strategy when transferring a simulation-trained policy directly to the real world, for two different robotics tasks, and of its ability to act reliably in out-of-distribution states.

This paper is structured as follows. In Section II, we briefly introduce the most important RL concepts and algorithms used in this work, give an overview of control priors in robotics, and describe the mechanisms governing control in the human brain. Section III discusses related work and, based on its limitations, we formulate the problem in Section IV. In Section V we describe our algorithm, the derivation of the relevant components we require and its applicability to both exploration and sim-to-real transfer. Section VI describes our experimental setup, and in Sections VII and VIII we provide both qualitative and quantitative results for training and for sim-to-real deployment of our system on a physical robot, respectively. We provide additional ablation studies to explore the limitations of our strategy and conclude with Section IX.

II Background

In this section, we provide an overview of two broad areas for robot control: learned controllers based on the deep reinforcement learning framework, and traditional hand-crafted controllers. We additionally provide key insights into the control mechanisms governing behaviours in the human brain, its ability to exploit the strengths of two different sets of controllers and its applicability to the robotics setting.

II-A Reinforcement Learning

We consider the reinforcement learning framework in which an agent learns an optimal policy for a given task through environment interaction in order to solve a Markov Decision Process [sutton1998introduction]. A Markov Decision Process (MDP) is described by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma, T)$ where: $\mathcal{S}$ is the set of all possible states and $\mathcal{A}$ is the set of all possible actions. $\pi(a_t|s_t)$ is a stochastic policy mapping state-action pairs $(s_t, a_t)$ to the probability of choosing an action $a_t$ when in state $s_t$ at timestep $t$. $\mathcal{P}(s_{t+1}|s_t, a_t)$ is a state transition function mapping tuples $(s_t, a_t, s_{t+1})$ to the probability of arriving in state $s_{t+1}$ after taking action $a_t$ at state $s_t$. $\mathcal{R}(s_t, a_t)$ is the reward function that assigns a scalar reward value to each state-action pair. $\gamma \in [0, 1]$ is a discount factor and $T$ is the time horizon.

Using an MDP, we can generate a sequence of states and actions as follows. Given an initial state $s_0$, iteratively select the next action $a_t \sim \pi(\cdot|s_t)$ and evolve the state by sampling $s_{t+1} \sim \mathcal{P}(\cdot|s_t, a_t)$. This generates a sequence of states and actions $\tau = (s_0, a_0, s_1, a_1, \dots, s_T)$, called a trajectory. A reinforcement learning algorithm seeks to obtain an optimal policy, $\pi^*$, by optimising over the policy parameters, $\theta$, such that the expected sum of discounted rewards across a trajectory, $\tau$, is maximised:

$$\pi^* = \arg\max_{\pi_\theta} \, \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} \, \mathcal{R}(s_t, a_t)\right] \quad (1)$$

where $\gamma$ discounts rewards later in the trajectory. There exist two main strategies to optimise this objective: either directly via the policy gradient [williams1992simple, schulman2017trust], or indirectly via Q-learning [watkins1989learning, mnih2015human]. We focus on the latter, as Q-learning methods tend to be more sample efficient and are able to make effective use of multiple sources of data for exploration, a key component of our approach. We describe the specific deep RL algorithm used throughout this work in more detail below.
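As a minimal sketch of the quantity being maximised, the discounted return of a single trajectory can be computed directly from its reward sequence. The helper name `discounted_return` and the example rewards are illustrative assumptions and do not come from the original work.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one trajectory (illustrative helper)."""
    total = 0.0
    # Walk backwards so each step applies a single multiplication by gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Sparse-reward example: the only reward arrives at the final step.
sparse_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(sparse_rewards, gamma=0.5))
```

With a sparse reward, the return seen from the first step shrinks geometrically with the trajectory length, which is one reason long-horizon tasks are hard to explore from scratch.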

II-A1 Soft Actor-Critic

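SAC is an off-policy, actor-critic algorithm that has achieved state-of-the-art results in recent years for continuous control tasks [haarnoja2019soft_]. It is based on the maximum entropy RL framework that optimises a stochastic policy to maximise a trade-off between the expected return and policy entropy, $\mathcal{H}$:

$$\pi^* = \arg\max_{\pi_\theta} \, \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} \left(\mathcal{R}(s_t, a_t) + \alpha \, \mathcal{H}\left(\pi_\theta(\cdot|s_t)\right)\right)\right] \quad (2)$$

This is closely related to the exploration-exploitation trade-off [sutton1998introduction], encouraging exploration in previously unvisited states, where actions are chosen by sampling from a Gaussian distribution. While effective in exhaustively searching the solution space, the associated risk and sample inefficiency are unsuitable for robotics. In this work we focus on leveraging priors to better inform this action selection process. SAC learns three functions: two Q-functions $Q_1$ and $Q_2$ that play the role of the critics, and the policy $\pi_\theta$, referred to as the actor. The Q-functions are learnt individually by minimising the mean squared Bellman error (MSBE), augmented with the entropy regularisation term:

$$L(Q_i, \mathcal{D}) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(Q_i(s, a) - y(r, s')\right)^2\right] \quad (3)$$

where $\mathcal{D}$ represents the replay buffer and the target, $y$, is given by:

$$y(r, s') = r + \gamma \left(\min_{i=1,2} Q_{\text{targ},i}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}'|s')\right), \quad \tilde{a}' \sim \pi_\theta(\cdot|s') \quad (4)$$

where $Q_{\text{targ},i}$ are the target critics, $\tilde{a}'$ is an action sampled from the current policy and $\alpha$ explicitly controls the exploration-exploitation trade-off. The policy is a reparameterised Gaussian with a squashing function to ensure actions are in the range [-1,1]:

$$\tilde{a}_\theta(s, \xi) = \tanh\left(\mu_\theta(s) + \sigma_\theta(s) \odot \xi\right), \quad \xi \sim \mathcal{N}(0, I) \quad (5)$$

where $\xi$ represents independent sampled noise and $\mu_\theta$ and $\sigma_\theta$ are the mean and standard deviation of the policy outputs respectively. Finally, the policy parameters are updated by maximising the bootstrapped entropy-regularised future return:

$$\max_{\theta} \, \mathbb{E}_{s \sim \mathcal{D},\, \xi \sim \mathcal{N}(0, I)}\left[\min_{i=1,2} Q_i\left(s, \tilde{a}_\theta(s, \xi)\right) - \alpha \log \pi_\theta\left(\tilde{a}_\theta(s, \xi)|s\right)\right] \quad (6)$$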
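The two core SAC computations described above, the squashed reparameterised action sample and the entropy-regularised bootstrapped target, can be sketched for a one-dimensional action as follows. This is a minimal illustration under stated assumptions: the function names and numeric values are hypothetical, the transition is assumed non-terminal, and taking the minimum over the two critics follows the commonly used SAC formulation rather than anything stated explicitly in this section.

```python
import math
import random


def squashed_gaussian_sample(mean, std, rng):
    """Reparameterised sample: tanh(mean + std * noise), noise ~ N(0, 1)."""
    noise = rng.gauss(0.0, 1.0)
    return math.tanh(mean + std * noise)


def soft_q_target(reward, gamma, q1_next, q2_next, log_prob_next, alpha):
    """Entropy-regularised bootstrapped target for a non-terminal transition.

    q1_next and q2_next are the two critic values at the next state-action
    pair; the smaller one is used (assumption: clipped double-Q).
    """
    return reward + gamma * (min(q1_next, q2_next) - alpha * log_prob_next)


rng = random.Random(0)
action = squashed_gaussian_sample(mean=0.3, std=0.5, rng=rng)
target = soft_q_target(reward=1.0, gamma=0.99, q1_next=2.0, q2_next=1.5,
                       log_prob_next=-0.7, alpha=0.2)
print(action, target)
```

Because the noise is sampled independently of the policy parameters, the sampled action remains a differentiable function of the mean and standard deviation, which is what makes the policy update tractable in practice.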


II-B Control Priors

The classical robotics community has developed a plethora of algorithms and controllers that are capable of solving various tasks, or parts of more complex tasks, using explicit analytic derivations [siciliano2016springer]. In the context of our work, we refer to these as control priors. In their most general form we can view these control priors as deterministic mapping functions from state to action:

$$a = \psi(s) \quad (7)$$

Fig. 2: Examples of control priors in robotics. These include (a) traditional feedback controllers and (b) state machines. Note that despite the difference in structure of these priors, they can both be treated as mapping functions from state to action in their most general form.

These methods make the assumption that a significant amount is known about the system dynamics, such as the differential equations governing state transitions. Given these explicit models, an error signal can be generated and traditional feedback approaches (e.g. optimal [kirk2004optimal], model predictive [camacho2013model] and robust control [zhou1998essentials]) can be used to control the input of the system in order to attain the desired behaviours. In some cases, it may be difficult to solve the control problem directly using conventional feedback methods due to the fact that for many control objectives, the states need to fulfil conditions that cannot be expressed as deviations (errors) from desired states. This can often be the case when we know the final goal rather than a full trajectory. In such cases, algorithms have been developed to process the state information into a more suitable representation that captures the desired objective, or to incorporate sub-systems within a larger control stack, state machine or behavioural tree. Figure 2 shows examples of control priors and how we can treat them as mapping functions from state to actions.
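To illustrate the idea that structurally different control priors can be treated uniformly as state-to-action mappings, the sketch below defines a proportional feedback controller and a small state machine that are both called in the same way. The controllers, gains, thresholds and action encodings are hypothetical examples chosen for illustration, not the priors used in this work.

```python
def proportional_controller(state, goal=0.0, gain=0.8, limit=1.0):
    """Feedback control prior: action proportional to the error signal."""
    error = goal - state
    # Saturate the command so it stays within the actuator limits.
    return max(-limit, min(limit, gain * error))


class TwoStateMachine:
    """State-machine control prior: drive forward, turn when an obstacle is near."""

    def __init__(self, safe_distance=0.5):
        self.safe_distance = safe_distance
        self.mode = "forward"

    def __call__(self, obstacle_distance):
        # Update the discrete mode from the observation, then emit an action
        # encoded as (linear velocity, angular velocity).
        self.mode = "turn" if obstacle_distance < self.safe_distance else "forward"
        return {"forward": (1.0, 0.0), "turn": (0.0, 1.0)}[self.mode]


# Both priors are queried the same way: action = prior(state).
print(proportional_controller(2.0))
print(TwoStateMachine()(0.2))
```

Treating every prior as a callable from state to action is what lets a learning algorithm query it at any state without knowing its internal structure.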

Given their known explicit derivation and deterministic nature, we make the assumption that these controllers are risk-averse and ensure the safety of the robot. This is in contrast to the black-box nature of deep RL policies.

These control priors, however, tend to require significant amounts of hand engineering, linear approximations and domain knowledge to ensure that they are functional and can solve the task. As the complexity of the tasks increases and the operational environment for these controllers becomes unstructured, as seen in the real world, such hand-engineered solutions tend to be suboptimal. Explicitly engineering these controllers to meet the level of dexterity required becomes non-trivial, and impractical amounts of modelling and calibration may be required.

II-C Control in Biological Systems

The dual-process theory of decision making [dickinson2002role] proposes that two distinct neural mechanisms are involved when controlling action selection in biological systems, notably the prefrontal cortex and striatum. [lee2014neural] provide evidence for this theory based on the human brain, and show the existence of an arbitration mechanism that determines the extent to which the different neural controllers govern behaviours. The arbitrator bases its selection on specific performance measures exhibited by each controller, exploiting their respective strengths in a given state. The two main control types identified by [dickinson2002role] are habitual and deliberative. The deliberative system considers the outcomes of actions and uses this to plan a more suitable action, whereas the habitual system is more reactive and responds directly to stimuli.

Multiple studies have suggested normative arbitration mechanisms between the two systems that can explain a wide range of behavioural traits in humans. [keramati2011speed] suggest that the arbitration mechanism trades off speed and accuracy between the two systems, while [daw2005uncertainty_] and [lee2014neural] suggest that the reliability of each system is the governing factor. Focusing on the latter, [lee2014neural] show that the inferior lateral prefrontal and frontopolar cortex encode a reliability signal for each neural controller and additionally generate a comparison between these signals that is then used by the arbitrator. This allows the controller currently exhibiting the highest degree of reliability to dominate control at any given stage.

Inspired by the idea of jointly leveraging two controllers to govern behavioural actions, we explore the applicability of these findings to robotics, where we have the two broad areas of control methods described previously, each exhibiting their own respective strengths. By combining these two strategies, governed by an underlying arbitration mechanism, we seek to build more reliable and dexterous controllers for robotic systems.

III Related Work

In this section we review some of the key RLBP methods introduced in the literature. We focus on the specific set of approaches which leverage behavioural priors in the form of control priors within the RL framework, and discuss their current limitations when considering application to the robotics domain.

III-A Residual Reinforcement Learning

The residual reinforcement learning framework [johannink2018residual, silver2018residual, srouji18a] focuses on learning a corrective residual policy for a control prior. The executed action is generated by summing the outputs from a control prior and a learned policy, i.e., $a = \psi(s) + \pi_\theta(s)$. The residual policy, $\pi_\theta$, can be learned using any off-policy RL algorithm.
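The summation described above can be sketched in a few lines. The stand-in prior and residual functions are hypothetical, and the final clipping to actuator limits is an added assumption for illustration rather than part of the cited methods.

```python
def residual_action(state, control_prior, residual_policy, limit=1.0):
    """Executed action = prior action + learned correction, clipped to limits."""
    action = control_prior(state) + residual_policy(state)
    return max(-limit, min(limit, action))


# Hypothetical stand-ins: a proportional prior and a small constant correction.
def prior(state):
    return -0.5 * state


def residual(state):
    return 0.1


print(residual_action(1.0, prior, residual))
```

Because the learned term only shifts the prior's output, the executed action stays close to the prior whenever the residual is small, which is both the appeal and the limitation of this family of methods.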

[johannink2018residual, silver2018residual] utilise this approach to learn complex manipulation tasks that involve contacts and friction that are typically hard to model analytically. They leverage conventional feedback controllers as control priors that can solve part of the task, and utilise the residual to modify the prior's behaviour in order to learn the unmodelled contact dynamics. Although showing promise towards accelerating learning of manipulation tasks, they only demonstrate its ability to make small changes to the underlying controller, and the implications of utilising highly suboptimal controllers have not yet been explored.

[iscen_trajectory] extend this architecture to not only learn a residual, but additionally the parameters to modify the classical controller itself. They argue that this allows for a more expressive and flexible architecture, not attainable using solely a residual, and show its applicability to dexterously control a quadrupedal robot. However, allowing the policy direct access to modify the control prior weakens the safety guarantees the prior can provide, which is a limiting factor when considering the safety of the robot. [srouji18a] attempt to maintain the guarantees of the control prior by restricting the policy to solely learning a linear policy, limiting its ability to make significant modifications to the underlying behaviours of the prior. This strict decoupling additionally allows for a better interpretation of the two controllers, an important factor when assessing stability and robustness of the overall system. While an interesting perspective, this greatly limits the potential optimality of the final controller that the RL agent can achieve.

The combinational nature of residual RL allows the policy to explore within a constrained search space around the deterministic actions of the prior, biasing the agent towards the most relevant regions of the environment. This additionally prevents the agent from taking completely random actions, which is desired when training robotic agents [silver2018residual]. While a promising approach, residual RL is limited to learning a residual policy as opposed to a standalone policy, limiting its expressiveness and ability to learn more complex behaviours. In the case where a significant gap exists between the behavioural prior performance and the optimal policy, the residual policy may be incapable of expressing the optimal behaviours.

III-B KL-Regularised Reinforcement Learning

Recent algorithms directly optimise a regularised objective that trades off reward maximisation with the minimisation of the policy divergence from a behavioural prior which could be a fixed or learned reference distribution. The most common divergence measure used is the Kullback-Leibler (KL) divergence, which computes the degree of similarity (or dissimilarity) between two distributions. The KL-regularised RL objective is given by:

$$\pi^* = \arg\max_{\pi_\theta} \, \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} \left(\mathcal{R}(s_t, a_t) - \alpha \, D_{\mathrm{KL}}\left(\pi_\theta(\cdot|s_t) \,\|\, \pi_0(\cdot|s_t)\right)\right)\right] \quad (8)$$

Typically the reference distribution $\pi_0$ could be a uniform distribution [haarnoja2019soft_], learned policy [teh2017distral, pertsch2020accelerating, tirumala2019exploiting_, galashov2019information_, hunt2019composing, hausman2018learning], or an occupancy measure that characterises the distribution of state-action pairs when executing a policy [pmlr-v80-kang18a, jing2020reinforcement]. This approach is typically used for transfer and multi-task learning and has been shown to be an effective approach to distil prior knowledge into the policy during optimisation.
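For intuition on the trade-off between reward and divergence from a reference distribution, the sketch below uses the closed-form KL divergence between two univariate Gaussians and subtracts a weighted divergence from a per-step reward. The function names, the weighting value `alpha` and the example distributions are illustrative assumptions, not values from the cited works.

```python
import math


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) for univariate Gaussians p = N(mu_p, sigma_p^2), q likewise."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


def kl_regularised_reward(reward, alpha, policy, reference):
    """Per-step reward minus the weighted divergence from the reference.

    policy and reference are (mean, standard deviation) pairs.
    """
    return reward - alpha * gaussian_kl(*policy, *reference)


# Identical distributions incur no penalty.
print(gaussian_kl(0.0, 1.0, 0.0, 1.0))
# A policy whose mean has drifted from the reference is penalised.
print(kl_regularised_reward(1.0, 0.1, (0.5, 1.0), (0.0, 1.0)))
```

The larger the weighting, the more strongly the policy is held near the reference, which is why a suboptimal reference combined with a fixed weight can prevent the policy from reaching better behaviours.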

In the case of a uniform reference distribution, [pertsch2020accelerating] show that we recover the maximum entropy objective given in Equation 2. This is the least informative distribution when considering the incorporation of useful structure and prior knowledge to aid the learning process. A more informed distribution constitutes common behaviours [teh2017distral, galashov2019information_] that agents can share across multiple tasks, which allows them to significantly accelerate learning. Such behavioural priors are typically either learned in parallel or are already-trained policies that can solve simpler tasks [pertsch2020accelerating], forming a continual learning paradigm. This is an important motivation for our work, where it makes sense to build on the vast body of work already solved by the robotics community, as opposed to learning policies from scratch.

The KL-constrained setting can be seen as a hard constraint that can severely restrict the policies from attaining the optimal behaviours given the potential suboptimality of the behavioural priors used. [pertsch2020accelerating] address this by automatically learning the weighting parameter, and show the applicability of the regularised objective to accelerate learning of low-level skills within a hierarchical framework. Recent work by [pmlr-v80-kang18a] and [jing2020reinforcement] softens the constraint by gradually annealing the divergence tolerance. This allows the policy to deviate further away from the prior as training progresses in order to learn potentially better behaviours, an important factor when the prior is highly suboptimal. The choice of the annealing rate is, however, non-trivial and needs to be carefully selected.

While overall a promising approach to accelerate learning using behavioural priors, the KL regularised objective additionally provides limited safety guarantees to the agent during training. This is particularly important in the robotics case, where we additionally seek to ensure the safety of the robot during exploration. A more promising strategy would be to directly utilise the behavioural prior to influence the actions taken by the agent during exploration. We discuss these approaches below.

III-C Exploration Bias using Behavioural Priors

These methods can be interpreted as biasing the policy search close to the behavioural prior actions during exploration. This is done by utilising the prior as a source of exploration data, directly moderating its exploration in the given environment.

The simplest approach is policy reuse/intertwining proposed by [Fernndez2006ProbabilisticPR, jeong2020learning], who utilise an epsilon-greedy-like approach to balance exploration and the usage of a behavioural prior. While providing an attractive solution, this method is not suitable when considering the robotics scenario. Such an approach alternates between the policy, prior and random exploration throughout training without any concern for the safety of these actions. Taking completely random actions at the start of training can result in unsafe behaviours that can be detrimental to both the robot and its surroundings.
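A rough sketch of this kind of epsilon-greedy-like selection is given below. It is a simplified illustration and not the exact procedure of the cited works: the function name, the fixed probabilities `p_prior` and `p_random`, and the callable interfaces are all assumptions made for the example.

```python
import random


def policy_reuse_action(state, policy, prior, sample_random, rng,
                        p_prior=0.3, p_random=0.1):
    """Pick the acting source for one step: prior, random, or learned policy."""
    draw = rng.random()
    if draw < p_prior:
        return prior(state)
    if draw < p_prior + p_random:
        # Unconstrained random action: nothing here checks whether it is safe.
        return sample_random()
    return policy(state)


rng = random.Random(0)
action = policy_reuse_action(
    state=0.0,
    policy=lambda s: 0.2,
    prior=lambda s: -0.1,
    sample_random=lambda: rng.uniform(-1.0, 1.0),
    rng=rng,
)
print(action)
```

The selection depends only on fixed probabilities, so a random action can be drawn in any state regardless of how risky that state is.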

An alternative approach is to restrict exploration to be governed by the behavioural prior only until the policy is capable of safely exploring on its own accord. [rana_mcf] proposed a multiplicative fusion strategy for stochastic policies and behavioural priors. This approach utilises a gating function to initially allow exploratory actions to only be sampled from the behavioural prior and gradually transition control towards the policy in order to exploit its own actions. Tuning the gating function can be tedious, and requires the correct balance, while the hard switch towards the policy has limited safety guarantees since the behavioural prior has no impact after this point. This can result in unsafe behaviours as the policy continues to explore novel regions of the solution space on its own. In this work we build on this formulation, exploring alternatives to the hard-gating function in order to address this limitation and allow for continual safe exploration.

A better informed strategy is to govern this transition based on the policy's confidence. [cheng2019control] utilise the temporal difference error (TD-error) to estimate how confident the policy is to act in a given state. This in theory should allow the prior to dominate exploration during the early stages of training and, as the TD-error reduces, the policy can gradually take over. The TD-error, however, tends to be noisy in practice and can yield instabilities during training. In this work, we explore the use of epistemic uncertainty estimation techniques from the computer vision community [lakshminarayanan2017simple] as a more stable alternative to this TD-error estimate. [xie2018learning] propose a separately trained critic network to govern this arbitration. While this allows for expressive compositions of the two systems, the requirement to learn a separate critic increases the sample complexity of the overall approach, with limited ability to deal with out-of-distribution states. These are core limitations we address in this work, where both accelerating training and the safety of the agent are important factors to consider for robotic systems.
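One common way to obtain such an uncertainty estimate is to train an ensemble of Gaussian predictors and treat their outputs as a uniform mixture, whose variance grows when the members disagree. The sketch below computes the mixture moments for a one-dimensional output. Whether this exact form is the one used in this work is not stated in this section, so the function and the example numbers should be read as illustrative assumptions.

```python
def ensemble_moments(means, variances):
    """Mean and variance of a uniform mixture of univariate Gaussians.

    The mixture variance is the average member variance plus the spread
    of the member means, so disagreement between members inflates it.
    """
    n = len(means)
    mix_mean = sum(means) / n
    second_moment = sum(v + m ** 2 for m, v in zip(means, variances)) / n
    return mix_mean, second_moment - mix_mean ** 2


# Members agree (familiar state): variance stays at the member level.
print(ensemble_moments([0.5, 0.5, 0.5], [0.01, 0.01, 0.01]))
# Members disagree (unfamiliar state): variance grows.
print(ensemble_moments([-0.8, 0.1, 0.9], [0.01, 0.01, 0.01]))
```

Unlike a per-sample TD-error, this estimate is computed from the predictors' outputs at the queried state alone, so it does not require a reward signal at deployment time.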

Given the distribution mismatch between the current policy and the behavioural prior, an important factor to consider with these approaches is the exploration-exploitation trade-off between the two sources of information. Excessive reliance on the behavioural prior for experience collection, without adequately exploiting the policy's behaviours, can result in unstable learning due to extrapolation error [kumar2019stabilizing, fujimoto2018addressing, van2016deep]. Extrapolation error tends to occur as a result of the target Q-value estimate, $Q(s', a')$, that is used to update the value network, where $a'$ is selected by the current policy $\pi$. However, if all the exploration data is collected using the behavioural prior and the state-action pair $(s', a')$ is not contained in the training set, the value $Q(s', a')$ can be unpredictable. This is due to extrapolation from other state-action pairs as a result of function approximation using neural networks. We explore a novel uncertainty-based strategy to balance exploration with the two systems in order to avoid this instability.

Given the range of RLBP approaches discussed above and their respective limitations towards applicability in the robotics setting, we formulate the problem in Section IV before presenting our approach to address these limitations in Section V.

IV Problem Formulation

Decades of research and development have resulted in algorithmic solutions for most problems in robotics, distilling analytical approaches, domain knowledge, and human intuition [siciliano2016springer]. Instead of ignoring this vast resource and learning robotic tasks from scratch, we present a control strategy that seamlessly combines a learned policy with an existing hand-crafted algorithm $\psi$, during both training and deployment. We refer to $\psi$ as a control prior. We assume that $\psi$ is suboptimal, but has a reasonable degree of task competence, and that it is risk-averse, i.e. it conservatively balances performance and safety considerations.

We consider learning a policy $\pi_\theta$ for a robotics task. We drop the subscript $\theta$ for clarity. Using the formalism of an MDP, we leverage off-policy model-free RL to learn this policy. However, in contrast to most existing RL approaches, we exploit the knowledge and structure encoded in an existing prior $\psi$ during both training and deployment for accelerated learning and safer behaviours.

Our goals are to: (1) use the control prior to guide exploration during the early phases of the learning process, thereby accelerating training and improving sample efficiency; (2) naturally let the learned policy dominate control as it gains more knowledge, ensuring it can improve beyond the performance of the existing solution; and (3) monitor the uncertainty of $\pi$ during deployment and smoothly transfer control to $\psi$ as a safe-but-suboptimal fall-back in situations where the learned policy cannot be trusted. In Section V we explain how a Bayesian fusion between $\pi$ and $\psi$ can achieve the above goals.

V Bayesian Controller Fusion

We introduce Bayesian Controller Fusion (BCF), a hybrid control strategy that composes stochastic action outputs from two separate control mechanisms: an RL policy $\pi(\cdot|s)$, and a control prior $\psi(\cdot|s)$. These outputs are formulated as distributions over actions, where each distribution captures the relative state uncertainty for the system to act. The Bayesian composition of these two outputs forms our hybrid policy $\phi(\cdot|s)$. We show that governing the agent's behaviours in the environment using $\phi(\cdot|s)$ significantly accelerates learning towards an optimal policy while allowing for safer exploration and operation in unknown environments. We describe the intuition behind BCF in more detail below.

Fig. 3: Evolution of the composite BCF sampling distribution over the course of training. During the early stages, note the strong bias towards the control prior which provides the exploration guidance.

Accelerated Learning: As the action distribution generated by each system captures their relative uncertainty to act in a given state, our Bayesian fusion strategy allows the composite distribution to naturally bias towards the system exhibiting the least amount of uncertainty. Figure 3 shows the evolution of our exploration strategy throughout training. During the early stages of training, the policy exhibits high uncertainty across all states, biasing the composite distribution towards the control prior $\psi$. As opposed to random exploration, sampling actions from this distribution allows the control prior to strongly influence exploration at this stage. This quickly biases the agent towards the reward-yielding trajectories, while exploring the surrounding state-action space for potential improvements beyond the suboptimality of the control prior. As the policy becomes more certain about its environment it gradually dominates control, allowing the agent to exploit its learned behaviours and stabilise training.

Safe Exploration and Deployment:
Fig. 4: BCF hybrid control strategy for safe deployment on real robotic systems. We derive uncertainty-aware action outputs for each controller and compose these outputs to better inform the action selection process.

Figure 4 illustrates our hybrid control strategy that composes the outputs from the learned policy and the control prior. The uncertainty-aware compositional policy allows for safe exploration and deployment of RL agents. In states of high uncertainty, the compositional distribution naturally biases towards the reliable, risk-averse and potentially suboptimal behaviours suggested by the control prior. In states of lower uncertainty, it biases towards the policy, allowing the agent to exploit the optimal behaviours discovered by it. This is reminiscent of the arbitration mechanism suggested by [lee2014neural] for behavioural control in the human brain, where the most reliable controller influences control in a given situation. This dual-control perspective provides a reliable strategy for bringing RL to real-world robotics, where generalisation to all states is near impossible and the presence of a risk-averse control prior serves as a reliable fallback.

V-A Method

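Given a policy, $\pi(a \mid s)$, and control prior, $\psi(a \mid s)$, we can obtain two independent estimates of an executable action, $a$, in a given state. In a Bayesian context, we can utilise the normalised product to fuse these estimates under the assumption of a uniform prior, $p(a)$:

$$p(a \mid \theta_\pi, \theta_\psi) = \frac{p(\theta_\pi, \theta_\psi \mid a)\, p(a)}{p(\theta_\pi, \theta_\psi)} \tag{9}$$

We assume Gaussian distributional outputs from each system and represent $\pi(a \mid s) \approx p(a \mid \theta_\pi)$ and $\psi(a \mid s) \approx p(a \mid \theta_\psi)$, where $\theta_\pi = [\mu_\pi, \sigma_\pi^2]$ and $\theta_\psi = [\mu_\psi, \sigma_\psi^2]$ denote the distribution parameters for the policy and control prior outputs respectively, with one mean and one variance per dimension of the action space. We drop the state $s$ to simplify notation.

Assuming statistical independence of $\theta_\pi$ and $\theta_\psi$, we can expand our likelihood estimate, $p(\theta_\pi, \theta_\psi \mid a)$, as follows:

$$p(\theta_\pi, \theta_\psi \mid a) = p(\theta_\pi \mid a)\, p(\theta_\psi \mid a) \tag{10}$$

$$= \frac{p(a \mid \theta_\pi)\, p(\theta_\pi)}{p(a)} \cdot \frac{p(a \mid \theta_\psi)\, p(\theta_\psi)}{p(a)} \tag{11}$$

Substituting this result back into (9), we can simplify the fusion as a normalised product of the respective action distributions from each control mechanism:

$$p(a \mid \theta_\pi, \theta_\psi) = \eta\, p(a \mid \theta_\pi)\, p(a \mid \theta_\psi) \tag{12}$$

The composite distribution forms our hybrid policy output $\phi(a \mid s)$. As we approximate the distributional output from each system to be univariate Gaussian for each action, the composite distribution will also be univariate Gaussian, $\mathcal{N}(\mu_\phi, \sigma_\phi^2)$. As a result, we can compute the corresponding mean $\mu_\phi$ and variance $\sigma_\phi^2$ for the distribution:

$$\mu_\phi = \frac{\mu_\pi \sigma_\psi^2 + \mu_\psi \sigma_\pi^2}{\sigma_\pi^2 + \sigma_\psi^2} \tag{13}$$

$$\sigma_\phi^2 = \frac{\sigma_\pi^2\, \sigma_\psi^2}{\sigma_\pi^2 + \sigma_\psi^2} \tag{14}$$

where this expansion implicitly handles the normalisation constant $\eta$.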
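To make the fusion concrete, the following minimal sketch computes the mean and variance of the normalised product of two univariate Gaussians, applied independently to each action dimension. The function name `fuse_gaussians` and the example numbers are illustrative assumptions and are not taken from the authors' released code.

```python
import numpy as np

def fuse_gaussians(mu_pi, var_pi, mu_psi, var_psi):
    """Normalised product of two univariate Gaussians, per action dimension.

    mu_pi, var_pi   : mean and variance of the policy output
    mu_psi, var_psi : mean and variance of the control prior output
    Returns the mean and variance of the composite distribution.
    """
    mu_pi, var_pi = np.asarray(mu_pi, float), np.asarray(var_pi, float)
    mu_psi, var_psi = np.asarray(mu_psi, float), np.asarray(var_psi, float)
    denom = var_pi + var_psi
    # Each mean is weighted by the *other* system's variance, so the more
    # certain system pulls the fused mean towards itself.
    mu_phi = (mu_pi * var_psi + mu_psi * var_pi) / denom
    var_phi = (var_pi * var_psi) / denom
    return mu_phi, var_phi

# Illustrative values: an uncertain policy (variance 4.0) and a
# confident control prior (variance 1.0).
mu, var = fuse_gaussians([1.0], [4.0], [0.0], [1.0])
print(mu, var)
```

In this example the policy is far less certain than the control prior, so the fused mean lies close to the prior's suggestion, and the fused variance is smaller than either input variance.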

V-B Components

In order to leverage our proposed approach in practice, we describe the derivation of the distributional action outputs for each system below and provide the complete BCF algorithm for combining these systems in Algorithm 1.

V-B1 Uncertainty-Aware Policy

We leverage stochastic RL algorithms that output each action as an independent Gaussian $\mathcal{N}(\mu, \sigma^2)$, where $\mu$ denotes the mean and $\sigma^2$ denotes the corresponding variance. This distribution is optimised to reflect the action which would both maximise the returns from a given state, as well as the entropy [haarnoja2019soft_]. Such exploration distributions tend to be risk-seeking and do not capture the state uncertainty of the agent. The latter is a key component required for our BCF formulation. To attain an uncertainty-aware distribution, we leverage epistemic uncertainty estimation techniques suggested in the computer vision literature based on ensemble learning [lakshminarayanan2017simple]. We train an ensemble of $M$ agents to form a uniformly weighted Gaussian mixture model, and combine these predictions into a single univariate Gaussian whose mean and variance are respectively the mean, $\mu_\pi$, and variance, $\sigma_\pi^2$, of the mixture, as described in [lakshminarayanan2017simple]. The mean and variance of the mixture are given by:

$$\mu_\pi = \frac{1}{M} \sum_{i=1}^{M} \mu_{\pi_i} \tag{15}$$

$$\sigma_\pi^2 = \frac{1}{M} \sum_{i=1}^{M} \left( \sigma_{\pi_i}^2 + \mu_{\pi_i}^2 \right) - \mu_\pi^2 \tag{16}$$

where $\mu_{\pi_i}$ and $\sigma_{\pi_i}^2$ are the mean and variance predicted by the $i$-th ensemble member.
The empirical variance, $\sigma_\pi^2$, of the resulting output distribution approximates a measure of the policy’s epistemic uncertainty in a given state for a particular action. This allows for a broader distribution when presented with unknown states and a tighter distribution in familiar states. This plays an important role within our BCF formulation as described previously.
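As a minimal sketch of this step, the snippet below collapses the Gaussian outputs of an ensemble into a single Gaussian using the uniformly weighted mixture moments from [lakshminarayanan2017simple]. The function name and the toy ensemble outputs are assumptions made for illustration only.

```python
import numpy as np

def ensemble_to_gaussian(means, variances):
    """Moment-match a uniformly weighted Gaussian mixture with one Gaussian.

    means, variances : arrays of shape (M, action_dim), one row per
                       ensemble member.
    """
    means = np.asarray(means, float)
    variances = np.asarray(variances, float)
    mu = means.mean(axis=0)
    # Mixture variance = average member variance + spread of member means.
    var = (variances + means**2).mean(axis=0) - mu**2
    return mu, var

# Familiar state (assumed values): members agree, so the variance stays small.
print(ensemble_to_gaussian([[0.5], [0.5], [0.5]], [[0.01]] * 3))
# Unknown state (assumed values): members disagree, so the variance grows.
print(ensemble_to_gaussian([[-1.0], [0.0], [1.0]], [[0.01]] * 3))
```

The second call shows how disagreement between ensemble members inflates the variance even when every member is individually confident, which is the behaviour the fusion relies on in unfamiliar states.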

V-B2 Control Prior

In order to incorporate the inherently deterministic control priors developed by the robotics community within our stochastic RL framework, we require a distributional action output that captures its uncertainty to act in a given state. As the uncertainty is state centric, we empirically derive this action distribution by propagating noise (provided by the known sensor model) from the sensor measurements through to the action outputs using Monte Carlo (MC) sampling. By computing the mean and variance of the outputs, the distributional action output, $\psi(a \mid s) \approx \mathcal{N}(\mu_\psi, \hat{\sigma}_\psi^2)$, is given by:

$$\mu_\psi = \frac{1}{N} \sum_{i=1}^{N} a_i \tag{17}$$

$$\hat{\sigma}_\psi^2 = \frac{1}{N} \sum_{i=1}^{N} \left( a_i - \mu_\psi \right)^2 \tag{18}$$

where $a_i$ denotes a deterministic action output from the control prior for a given sampled state and $N$ is the number of sampled states. As this distributional output also serves as a structured action set that governs exploration during the early stages of training, we additionally set a minimum possible standard deviation, $\sigma_{\min}$, for this distribution. This allows the agent to explore and identify potentially better actions captured by this distribution while additionally preventing the distribution from collapsing to a single deterministic value that would hinder exploration. The resulting variance for the control prior distribution is defined as:

$$\sigma_\psi^2 = \max\left( \hat{\sigma}_\psi^2,\ \sigma_{\min}^2 \right) \tag{19}$$
The choice of this minimum standard deviation is left as a hyper-parameter for the user to set based on the specific controller used and its optimality towards solving the task. Figure 5 provides some intuition into this choice and its relation to exploration. The explorable region of the state space is denoted by a set that grows as the minimum standard deviation is increased and shrinks as it is decreased. Sampling actions from this trust region depicts the guided exploration provided by the control prior. The ability to bias the policy towards the optimal policy depends on this explorable set. If the optimal trajectory is within the explorable region, then we can learn the corresponding optimal policy; otherwise the policy will remain suboptimal. We conduct an ablation study to further explore the impact that this hyper-parameter has during training in Section VII-E.
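The Monte Carlo procedure can be sketched as follows. This is a minimal illustration that assumes additive zero-mean Gaussian sensor noise and uses a toy proportional controller; the function name, the noise model, and all numeric values are assumptions and do not describe the controllers used in this work.

```python
import numpy as np

def control_prior_distribution(controller, state, sensor_std, min_std,
                               n_samples=100, rng=None):
    """Estimate a Gaussian action distribution for a deterministic controller.

    controller : callable mapping a state vector to an action vector
    sensor_std : std of the assumed additive Gaussian sensor noise
    min_std    : user-chosen minimum standard deviation (hyper-parameter)
    """
    rng = np.random.default_rng() if rng is None else rng
    state = np.asarray(state, float)
    # Sample noisy copies of the state and push each through the controller.
    noisy_states = state + rng.normal(0.0, sensor_std,
                                      size=(n_samples, state.size))
    actions = np.array([controller(s) for s in noisy_states])
    mu = actions.mean(axis=0)
    var = actions.var(axis=0)
    # Floor the variance so the distribution never collapses to a point.
    var = np.maximum(var, min_std**2)
    return mu, var

# Toy proportional controller (hypothetical), robust to small sensor noise.
toy_controller = lambda s: -0.5 * s
mu, var = control_prior_distribution(toy_controller, [1.0, -2.0],
                                     sensor_std=1e-3, min_std=0.3,
                                     rng=np.random.default_rng(0))
print(mu, var)
```

Because the toy controller is barely affected by the sensor noise, the empirical variance is negligible and the returned variance equals the floor `min_std**2`.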

Given the formulation for the distributional outputs from each system, we present the complete BCF algorithm for governing action selection both during training and deployment in Algorithm 1.

Fig. 5: Illustration of optimal trajectory vs. control prior trajectory with the explorable set extracted. The chosen standard deviation should reflect the optimality of the controller such that the optimal behaviours can be captured within the distribution.

VI Experimental Setup

In this section we describe the two different robotics tasks that we evaluate BCF on. For completeness, we additionally detail the analytic derivation of the control priors we use for these tasks, and their respective limitations. We note here that the control priors used are just one particular example of existing hand-crafted solutions to the tasks, and alternative approaches could also be used.

VI-A Tasks

We study two different task domains (shown in Figure 6) to evaluate the performance of BCF as a viable strategy towards exploiting the strengths of RL and classical control, both during training and real-world deployment. To provide direct comparison with previous work [rana_mcf], we conduct experiments in the PointGoal Navigation task which was first presented by [anderson2018evaluation] as well as a complex reaching task that requires manipulability maximisation across the trajectory. For both tasks, we assume the sparse reward, long horizon setting and the presence of existing control priors that can provide some structure towards solving the task. We describe each task in more detail below.

1 Given: Ensemble of $M$ policies, control prior and its minimum variance
Input: State $s$
Output: Action $a$
2 Approximate the policy ensemble predictions as a unimodal Gaussian as described in Equations (15) and (16)
3 Compute the control prior action distribution as given in Equations (17) and (19)
4 Compute the composite distribution as given in Equations (13) and (14)
5 Select action $a$ from the composite distribution
Algorithm 1 Bayesian Controller Fusion

PointGoal Navigation:

The objective of this task is to navigate a robot from a start location to a goal location in the shortest time possible, while avoiding obstacles along the way. We utilise the training environment provided by [rana2019residual], which consists of five arenas with different configurations of obstacles. The goal and start location of the robot are randomised at the start of every episode, each placed on the extreme opposite ends of the arena (see Figure 6 (a)). This sets the long-horizon nature of the task. As we focus on the sparse reward setting, the reward takes one value if the distance between the agent and the goal is within a set threshold and another value otherwise. The action consists of two continuous values: linear velocity and angular velocity. We assume that the robot can localise itself within a global map in order to determine its relative position to a goal location. The 180° laser scan range data is divided into 15 bins and concatenated to the robot’s angle. The overall state of the environment is comprised of:

  • The binned laser scan data (15 values),

  • The polar pose error between the robot’s pose and the goal location (2 values),

  • The previous executed linear and angular velocity (2 values),

for a total of 19 dimensions. The length of each episode is set to a maximum of 500 steps and does not terminate once the goal is achieved.
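The 19-dimensional state described above can be assembled as in the following sketch. The source only states that the scan is divided into 15 bins; taking the minimum range within each bin is an assumption made here for illustration, as are the function name and the example readings.

```python
import numpy as np

def build_navigation_state(scan, pose_error, prev_action, n_bins=15):
    """Concatenate binned laser data, polar pose error and previous action.

    scan        : 1-D array of laser ranges covering the 180 degree field
    pose_error  : (distance, angle) to the goal
    prev_action : (linear velocity, angular velocity) executed last step
    """
    scan = np.asarray(scan, float)
    bins = np.array_split(scan, n_bins)
    # Assumption: keep the closest return in each bin (most conservative).
    binned = np.array([b.min() for b in bins])
    return np.concatenate([binned, pose_error, prev_action])

# Hypothetical scan with 180 readings between 1 m and 5 m.
scan = np.linspace(1.0, 5.0, 180)
state = build_navigation_state(scan, [2.0, 0.1], [0.0, 0.0])
print(state.shape)
```

The resulting vector has 15 + 2 + 2 = 19 entries, matching the state dimensionality stated above.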

Fig. 6: Simulation training environments and real world deployment environments for (a) PointGoal Navigation and (b) Maximum Manipulability Reacher tasks. Note the stark discrepancy in obstacle profiles for the navigation task between the simulation environment and real world environments.
Reaching with Maximum Manipulability:

The objective of this task is to actuate each joint of a manipulator within a closed-loop velocity controller such that the end-effector moves towards and reaches the goal point, while the manipulability of the manipulator is maximised. The manipulability index describes how easily the manipulator can achieve any arbitrary velocity. The ability of the manipulator to achieve an arbitrary end-effector velocity is a function of the manipulator Jacobian. While there are many methods which seek to summarise this, the manipulability index proposed by [manip] is the most used and accepted within the robotics community [manip2]. Utilising Jacobian-based indices in existing controllers has several limitations: it requires greater engineering effort than simple inverse-kinematics-based reaching systems, as well as precise tuning in order to ensure the system is operational [mmc]. We explore the use of RL to learn such behaviours by leveraging simple reaching controllers as priors. We utilise the PyRep simulation environment [james2019pyrep], with the Franka Emika Panda as our manipulator as shown in Figure 6 (b). For this task, we generate a random initial joint configuration and a random end-effector goal pose. We use a sparse goal reward that depends on whether the spatial translational error between the end-effector and the goal is within a set threshold, and on the manipulability of the robot at the particular joint configuration,

$$m = \sqrt{\det\left( J(q)\, J(q)^{\top} \right)} \tag{20}$$

where $J(q)$ is the manipulator Jacobian. The action space consists of the manipulator joint velocities $\dot{q} \in \mathbb{R}^n$, where the values are continuous and $n$ is the number of joints within the manipulator. In this work the manipulator used consists of 7 joints. The state, $s$, of the environment is comprised of:

  • The joint coordinate vector (7 values),

  • The joint velocity vector (7 values),

  • The translation error between the manipulator’s end-effector and the goal (3 values),

  • The end-effector translation vector (3 values),

for a total of 20 dimensions. Similar to the navigation task, the episode length for this task is fixed at 1000 steps and only terminates at the end of the episode.
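For reference, the widely used manipulability index attributed to [manip] is the square root of the determinant of the Jacobian multiplied by its transpose. The sketch below computes it with NumPy for small hand-written Jacobians; the matrices are toy values chosen for illustration and are not Jacobians of the Panda.

```python
import numpy as np

def manipulability(J):
    """Manipulability index sqrt(det(J J^T)) for a Jacobian J (rows: task dims)."""
    J = np.asarray(J, float)
    # Clamp tiny negative determinants caused by floating-point round-off.
    return float(np.sqrt(max(np.linalg.det(J @ J.T), 0.0)))

# Toy 2x3 Jacobian (assumed values): well conditioned, index is 2.
print(manipulability([[1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0]]))
# Rank-deficient Jacobian (assumed values): the index drops to zero,
# which corresponds to a singular configuration.
print(manipulability([[1.0, 0.0, 0.0],
                      [2.0, 0.0, 0.0]]))
```

A value near zero signals proximity to a singularity, which is why the task rewards configurations that keep this index high.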

VI-B Control Prior Derivation

For each of the above tasks, we utilise existing classical controllers and algorithms already developed by the robotics community for solving the task at hand. They are, however, not necessarily optimal, due to limitations in analytical modelling, controller miscalibration and task variations. We note here that each controller was calibrated, and scripted conditions were implemented to ensure they exhibited safe and risk-averse behaviours across the state space. We describe each of these controllers in more detail below.

Artificial Potential Fields:

Assuming the availability of global localisation and map information for our robot, PointGoal Navigation can mostly be solved using classical reactive navigation techniques. These systems rely on the immediate perception of their surrounding environment, which allows them to handle dynamic objects and those unaccounted for in the global map. In this work we focus on the Artificial Potential Fields (APF) family of algorithms [warren1989global, koren1991potential], which compute a local attractive potential that attracts the robot towards the goal, while a repulsive potential repels it away from obstacles. Each potential is scaled by a gain term, and the repulsive potential is a function of the obstacle dimensions and their distance from the robot, evaluated over all obstacles. The repulsive term can be tuned such that the robot is always a safe distance away from obstacles, ensuring that it does not experience collisions. The resulting potential function is then computed by combining these components.

By taking the gradient of the resultant potential function, we can determine the best direction to move within the environment that avoids obstacles while heading towards the goal. For a detailed derivation, we refer the reader to [koren1991potential]. This direction is used to generate an error signal with which we derive a linear feedback controller, with a proportional gain, to determine the suitable velocities to control the robot. We implement a variant of this algorithm that utilises only the direct information provided from a laser scanner with no additional global information about the obstacles in the environment. This is a typical scenario in real-world dynamic environments where surroundings may constantly change and the robot has to quickly respond to these changes.
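As a rough illustration of the idea, the sketch below implements a common textbook form of APF: a quadratic attractive potential and a repulsive potential that is active only within an influence distance of each obstacle. This is not the laser-based variant used in this work; the potential forms, the gains `k_att` and `k_rep`, the influence distance `d0`, and the point-obstacle model are all assumptions.

```python
import numpy as np

def apf_direction(p, goal, obstacles, k_att=1.0, k_rep=1.0, d0=1.0):
    """Unit direction of steepest descent of a textbook potential field.

    Attractive potential: 0.5 * k_att * ||p - goal||^2
    Repulsive potential : 0.5 * k_rep * (1/d - 1/d0)^2 for d < d0, else 0
    """
    p, goal = np.asarray(p, float), np.asarray(goal, float)
    obstacles = np.asarray(obstacles, float).reshape(-1, p.size)
    grad = k_att * (p - goal)  # gradient of the attractive term
    for o in obstacles:
        diff = p - o
        d = np.linalg.norm(diff)
        if 0.0 < d < d0:
            # Gradient of the repulsive term; it points towards the obstacle,
            # so descending it pushes the robot away.
            grad += -k_rep * (1.0 / d - 1.0 / d0) * diff / d**3
    direction = -grad
    norm = np.linalg.norm(direction)
    return direction / norm if norm > 0 else direction

# Obstacle far away (assumed positions): head straight for the goal.
print(apf_direction([0.0, 0.0], [1.0, 0.0], [[5.0, 5.0]]))
# Obstacle just below the straight-line path: the direction bends upwards.
print(apf_direction([0.0, 0.0], [2.0, 0.0], [[0.5, -0.2]]))
```

The second call also hints at the limitations discussed next: close to obstacles the repulsive term can outweigh the attractive one, which is one source of oscillation and local minima.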

A key problem faced by most classical solutions to reactive navigation, including APF, is the need for extensive tuning and hand engineering to achieve good performance, and a tendency to deteriorate in performance when overfit to a particular region [koren1991potential, khatib1986real]. This makes them susceptible to oscillations, entrapment in local minima, and suboptimal path efficiency.

Resolved Rate Motion Control:

This controller allows for direct control of a robotic manipulator’s end-effector, without expensive path planning. It serves as a simple prior for a reaching task. For a given goal pose, Resolved Rate Motion Control (RRMC) provides the robot with suitable joint velocity commands to move the end-effector in a straight line towards the goal.

For a given manipulator joint configuration $q \in \mathbb{R}^n$, where $n$ is the number of joints within the manipulator, we can calculate the forward kinematics through the non-linear surjective mapping

$$r = k(q) \tag{25}$$

where $r$ is some parameterization of the end-effector pose, and the mapping function $k(\cdot)$ is a function of the robot’s geometric structure. We are using a manipulator with a task space $\mathrm{SE}(3)$, and therefore $r \in \mathbb{R}^6$. Taking the time derivative of (25) gives:

$$\nu = J(q)\, \dot{q} \tag{26}$$

where $J(q) \in \mathbb{R}^{6 \times n}$ is the manipulator Jacobian for the robot at joint configuration $q$. RRMC exploits the mapping from a Cartesian end-effector velocity $\nu$ to the manipulator’s joint velocity $\dot{q}$ provided by the differential kinematics in (26) [rrmc]. By rearranging (26), the required joint velocities to achieve an arbitrary end-effector velocity can be calculated as:

$$\dot{q} = J(q)^{-1}\, \nu \tag{27}$$

(27) can only be solved when $J(q)$ is square and non-singular. For redundant robots (where $n > 6$), $J(q)$ is not square and therefore no unique solution exists for (27). Consequently, the most common solution is to use the Moore-Penrose pseudoinverse in (27) as follows:

$$\dot{q} = J(q)^{+}\, \nu \tag{28}$$

where $(\cdot)^{+}$ denotes the pseudoinverse operation. The pseudoinverse will find the $\dot{q}$ with the minimum Euclidean norm. RRMC can be wrapped into a closed-loop velocity controller using position-based servoing (PBS). PBS seeks to drive the robot’s end-effector in a straight line from its current pose to a desired pose. This control scheme is formulated as

$$\nu_e = \beta\, \Lambda\!\left( \left({}^{0}T_{e}\right)^{-1}\, {}^{0}T_{e}^{*} \right) \tag{29}$$

where $\beta$ is a gain term, ${}^{0}T_{e}$ is the current end-effector pose expressed in the robot’s base frame, ${}^{0}T_{e}^{*}$ is the desired end-effector pose expressed in the robot’s base frame, and $\Lambda(\cdot)$ is a function which converts a homogeneous transformation matrix to a spatial twist. The end-effector velocity $\nu_e$ can be substituted in (28) to compute the corresponding joint velocities to reach the target. We use the Robotics Toolbox for Python [rtb] for the PBS scheme and to calculate the forward and differential kinematics.
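The pseudoinverse step for a redundant arm can be sketched in a few lines of NumPy. The Jacobian below is a random 6×7 placeholder rather than the Jacobian of a real manipulator, and the function name is hypothetical; in practice the Jacobian would come from a kinematics library such as the Robotics Toolbox for Python.

```python
import numpy as np

def rrmc_joint_velocities(J, nu):
    """Minimum-norm joint velocities that realise the end-effector twist nu."""
    J = np.asarray(J, float)
    nu = np.asarray(nu, float)
    return np.linalg.pinv(J) @ nu

rng = np.random.default_rng(0)
J = rng.normal(size=(6, 7))       # placeholder Jacobian for a 7-joint arm
nu = np.array([0.1, 0.0, 0.0,     # desired linear velocity
               0.0, 0.0, 0.0])    # desired angular velocity
qd = rrmc_joint_velocities(J, nu)

# For a full-row-rank Jacobian the commanded twist is reproduced exactly.
print(np.allclose(J @ qd, nu))
```

Among all joint velocity vectors that achieve the requested twist, the pseudoinverse returns the one with the smallest Euclidean norm, which is why it makes no attempt to improve manipulability.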

While this controller can reach arbitrary goals within the robot’s workspace, it tends to result in poor manipulability performance across the trajectory. This reduces the robustness of the robot’s behaviours and increases the chances of it hitting a singularity. Additionally, the robot’s final pose tends to be ill-conditioned for completion of a consecutive task. This renders it suboptimal with respect to the reaching task at hand.

With the given experimental setup, we investigate to what extent BCF learns faster and more safely than model-free RL alone, improves upon the given control prior, and safely deploys RL policies in the real world.

| Property | BCF (Ours) | Residual RL | MCF | KL Regularised | CORE-RL | SAC |
|---|---|---|---|---|---|---|
| Accelerated Learning | | | | | | |
| Uncertainty-Aware Exploration | | | | | | |
| Improvement from Control Prior - PointGoal Navigation | 116% | 95% | 109% | 21% | 42% | - |
| Improvement from Control Prior - Maximum Manipulability Reacher | 282% | 110% | 229% | - | - | - |

TABLE I: Summary of training results. BCF demonstrates accelerated learning, uncertainty-aware exploration, and achieves the highest improvement beyond the suboptimal control priors used in comparison to existing RLBP approaches.
Fig. 7: Learning curves of BCF and existing RLBP baselines for PointGoal Navigation and Reaching with Maximum Manipulability tasks. Note the faster convergence and lower variance across multiple seeds exhibited by our proposed approach.

VII Evaluation of Training Performance

We provide an evaluation of training performance when compared to four different RLBP baselines that have been proposed in related work. We additionally compare training curves for vanilla end-to-end trained policies, and indicate the average performance of the control prior used. For all experiments, we utilise SAC as the underlying RL algorithm and train each system across 10 random seeds. We present the training curves for both tasks in Figure 7 and a summary of the key characteristics demonstrated by each approach in Table I.

VII-A Baselines

  1. Residual Reinforcement Learning: Implementation of the residual reinforcement learning algorithm proposed by [johannink2018residual].

  2. KL Regularised RL: Modified SAC algorithm which utilises a KL regularised objective towards a prior behaviour as proposed by [pertsch2020accelerating]. This method utilises an auto temperature adjustment for the KL objective.

  3. CORE-RL: Implementation of the TD-error based exploration strategy to balance exploration between a control prior and the policy as proposed by [cheng2019control].

  4. MCF: Our prior work that leverages a fixed gating function to switch between the control prior and policy over the course of training [rana_mcf].

  5. SAC: Vanilla SAC algorithm using maximum entropy based Gaussian exploration [haarnoja2019soft_].

  6. Control Prior: Classical controller based on the algorithms described in Section VI-B.

VII-B Accelerated Learning

Across both tasks, BCF consistently demonstrates its ability to substantially accelerate training and achieve significantly higher final returns than the baselines. While all the RLBP baselines show the ability to accelerate learning when compared to vanilla SAC alone, it is also important to note their ability to improve beyond the control prior used. We quantify this improvement by computing the greatest change in performance attained by the approach as a fraction of the control prior’s performance. The results are summarised in Table I.
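One plausible reading of this metric is the gap between the best return reached during training and the control prior's return, divided by the magnitude of the prior's return. The sketch below follows that reading; the function name and the numbers are hypothetical and are not the values behind Table I.

```python
def improvement_over_prior(returns, prior_return):
    """Best improvement over the control prior, as a fraction of its return.

    returns      : sequence of evaluation returns logged during training
    prior_return : average return achieved by the control prior alone
    """
    if prior_return == 0:
        raise ValueError("prior_return must be non-zero to form a ratio")
    return (max(returns) - prior_return) / abs(prior_return)

# Hypothetical learning curve and prior return, for illustration only.
ratio = improvement_over_prior([40.0, 150.0, 216.0], prior_return=100.0)
print(f"{ratio:.0%}")
```

Under this reading, a value above 100% means the learned behaviour more than doubled the return of the control prior.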

For the navigation task, both KL-regularised RL and CORE-RL converge towards a final policy that exhibits suboptimal performance, while failing to learn at all in the reaching task. Residual RL and MCF both yield improvements beyond the control prior similar to that attained by BCF in the navigation task, attaining a 95% and 109% increase in performance respectively. However, in the reacher task, both approaches yield a substantially lower improvement when compared to that attained by BCF. We can speculate that this significant drop in performance between the two tasks is related to the performance gap between the control prior and the optimal policy for that task. For the Residual RL case, we can speculate that for highly suboptimal control priors, the residual’s ability to express the required modifications to achieve the optimal behaviours is limited. The performance drop across MCF, KL regularised RL and CORE-RL may be related to the significant distribution mismatch between the control prior’s behaviours and those of the current policy, where the current policy’s behaviours are inadequately exploited. This can cause instabilities in the Q-value updates as seen in the offline RL setting [kumar2019stabilizing, fujimoto2018addressing]. BCF, on the other hand, covers a broad distribution across both the control prior and its own behaviours, mitigating this phenomenon and achieving the best final performance.

To gain a better intuition into the success of BCF and the limitations of the existing RLBP approaches, we conducted a focused experiment in the navigation domain for a fixed start and goal location. Figure 8 shows the state-space coverage of the agent over the course of training for each of the RLBP approaches. The dashed line in the figure indicates the deterministic trajectory taken by the control prior, and the coloured regions indicate the states visited by the agent. The figure illustrates a key attribute across all RLBP approaches to bias the search space during exploration towards the most relevant regions for solving the task. This allows them to significantly accelerate training, particularly in the sparse, long-horizon reward setting. This is in stark contrast to the exhaustive exploration carried out by vanilla SAC, which hardly progresses beyond a third of the arena over the course of training, limiting its ability to learn at all.

As shown in Figure 8 (d) and 8 (e), the KL regularised and CORE-RL approaches both heavily constrain the behaviour of the policy towards that of the control prior used. This limits their ability to learn new behaviours. The stochastic nature of BCF, on the other hand, allows for a broader search space around the control prior’s behaviours, allowing it to identify potentially optimal behaviours, while still biasing the agent towards the most relevant regions of the search space to accelerate learning.

Fig. 8: State space coverage during exploration. The dashed line illustrates the deterministic path taken by the control prior. Note how our formulation explores regions of the state space that are likely to lead to a meaningful progression to the goal, while still exploring a diverse region around the deterministic control prior for potential improvements. The yellow crosses indicate collisions with an obstacle that resulted in a failed episode.
Fig. 9: Uncertainty-aware exploration induced by BCF during training allows for safer behaviours when presented with unseen states. As the policy uncertainty rises, BCF naturally relies more on the prior controller to guide exploration, while transitioning to the policy in more certain states.

VII-C Uncertainty-Aware Exploration

We additionally investigate the ability of our uncertainty-aware exploration strategy to allow for safer exploratory behaviours when compared to existing RLBP approaches. This is a less explored area in existing RLBP literature that can have significant benefits as we gradually transition towards training these systems in the real world. Particularly in the case of robotics, we exploit the risk-aversity of the control priors developed and leverage these traits to allow for safer exploratory behaviours. In Figure 8, we additionally indicate obstacle collisions experienced by the agent that result in an overall failed episode during training. We mark the mean location for the collisions as yellow crosses in the figure. BCF completes training in this experiment without any collisions. We can attribute this to the uncertainty-aware formulation of the composite policy, which allows the risk-averse control prior to dominate control in states that the agent has never experienced before. This allows the agent to steer clear of these unsafe states throughout training, safely guiding the agent towards the goal. In contrast, all the baseline RLBP approaches, while successfully constraining the search space, experience multiple collisions throughout training. It is important to note that while the KL regularised and CORE-RL approaches do exhibit a lower number of collisions on average, this comes at the expense of a significant drop in the overall optimality of the final policy, as they over-constrain exploration. BCF, on the other hand, is able to balance these two characteristics naturally.

In Figure 9 we take a closer look at how the uncertainty-aware BCF formulation operates at the distributional output level when presented with unknown states during training. We explore its ability to balance the guided exploration provided by the control prior and the exploitation of the policy. We conduct the experiment in another focused PointGoal Navigation setting, where we expose the agent to different arenas over the course of training. We train the agent within the first arena until convergence and switch to a novel unknown environment after 390k steps, indicated by the dashed line in Figure 9. We indicate the performance of the agent over the course of training, as well as the empirical uncertainty of the policy. Directly below this graph we illustrate the corresponding distributional composition from BCF for the linear velocity component of the mobile robot at three key locations. As shown in the figure, once the agent has converged to the optimal policy, and the policy is highly certain of its surroundings, the policy component predominantly governs the exploratory behaviours of the agent. Upon switching to the unknown environment, we see a significant increase in the uncertainty of the policy and the transition of the composite distribution towards the behaviours suggested by the control prior. While we see a significant drop in performance towards that of the control prior, it is important to note that its risk-averse behaviours help guide exploration, allowing for safer exploratory behaviours than the highly uncertain black-box policy outputs. As the agent becomes more certain about its surrounding states, we see BCF transition control to the policy, allowing it to exploit its newly found behaviours. This additionally enables it to further explore surrounding state-action pairs, allowing for improvements beyond the performance of the control prior.

VII-D Impact of Control Prior Performance

Fig. 10: Control prior competency spectrum for the PointGoal Navigation task.

In this study we analyse the impact that the competency of the control prior has on the overall training process. We seek to identify if too much structure from the control prior strongly biases the policy to a particular solution or if the control prior should solely serve as hints to guide the policy in a general direction. Figure 10 shows the range of control priors we evaluated with BCF in the PointGoal navigation environment. On the least competent end of the spectrum we test a random prior, which is a naive controller that represents a system with no knowledge towards solving the task at hand. At the mid-range, we utilise a proportional (P) controller, which is a simple controller based on the Euclidean distance to the goal. It provides basic structural knowledge towards reaching the goal, but is incapable of avoiding any obstacles. For the more competent prior we utilise the standard APF controller used throughout this work, which can head to the goal while avoiding obstacles.

The training curves for each of these controllers are given in Figure 11. The controller exhibiting the most competency towards the task yielded the best performance in terms of sample efficiency and low variance. It is interesting to note that despite the P-controller being a severely limited control prior, the agent still benefits from the exploration bias it provides and is capable of attaining a higher performing policy. The random prior does not provide any additional benefit to exploration and the policy is incapable of learning within the given number of training steps.

Fig. 11: Training curves exploring the impact of the control prior’s performance on the ability for BCF to accelerate learning in the PointGoal Navigation task. The corresponding dashed lines indicate the average return each control prior could achieve in the given environment.

These results indicate that BCF is capable of leveraging a control prior to assist learning, and is not crippled from attaining a better policy regardless of the level of the initial performance of the control prior. This supports the results observed for the reaching task as shown in Table I. It also suggests that control priors provide a useful form of positive bias to guide exploration as opposed to random exploration alone. It is important to note that despite its ability to accelerate learning, the competency of the control prior does play an important role when it comes to the safety of the agent. For example, the P-controller is not risk-averse and is prone to obstacle collision. This makes it unsuitable when considering the application of BCF to safely training real-world systems.

VII-E Impact of Control Prior Variance

Fig. 12: Learning curves for BCF using different default control prior standard deviations. The dashed line indicates the average performance of the control prior in the environment.

A key component of BCF is the distributional nature of the policy and the control prior. As most control priors are deterministic by nature, we approximate a Gaussian distribution by propagating any noise from the state inputs through to the action outputs using Monte Carlo sampling. We additionally set a default standard deviation to allow for adequate exploration during training and to prevent collapse of the distribution towards a deterministic value. In such a case the control prior would always dominate and the impact of the policy would be rendered ineffective. In practice, we found that this default value tended to dominate over the empirical distribution derived using MC sampling, given the inherent robustness to noise of the control priors used in this work. As the value of this default standard deviation is left as a hyper-parameter that needs to be carefully selected, we study its impact on the overall training performance of BCF by sweeping across a range of fixed standard deviation values. Note that the choice of standard deviation is dependent on the particular action type and its corresponding unit of measurement.

We conducted these experiments in the PointGoal navigation environment utilising the APF controller as the underlying control prior. The resulting learning curves are provided in Figure 12. The chosen standard deviation was fixed for both the linear and angular velocity components. With low standard deviation values, the agent fails to learn at all. In this setting, the control prior exhibits a high confidence and hence strongly biases the composite sampling distribution towards its own behaviours. This limits the ability of the policy to exploit its own actions during exploration, which results in it failing to learn. Such a setting is reminiscent of the offline RL setting, where the agent is solely trained on data not pertaining to its own behaviours, resulting in compounding errors due to overestimation bias. Standard deviation values set in the range 0.5 to 1 exhibit the best performance. Such distributions provide a softer constraint on exploration, allowing the agent to balance exploration and exploitation. Larger standard deviation values, greater than 1, resulted in the agent not learning at all. This is a result of the BCF formulation constantly rendering the policy as the more confident system and hence limiting the impact the control prior could have on the overall system. This is equivalent to the agent learning without any guidance from the control prior, similar to vanilla SAC.
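The effect described above can be illustrated numerically. Under a Gaussian product, the fused mean is a variance-weighted average of the two means, so the share of the fused mean contributed by the control prior depends only on the two standard deviations. The sketch below evaluates that share for a fixed, assumed policy standard deviation of 0.5; the numbers are illustrative and are not taken from the ablation.

```python
def prior_weight(policy_std, prior_std):
    """Weight placed on the control prior's mean in the fused Gaussian mean."""
    return policy_std**2 / (policy_std**2 + prior_std**2)

policy_std = 0.5  # assumed, fixed policy uncertainty for illustration
for prior_std in (0.1, 0.5, 2.0):
    print(prior_std, round(prior_weight(policy_std, prior_std), 3))
```

With a prior standard deviation of 0.1 the prior contributes roughly 96% of the fused mean, with 0.5 the two systems are weighted equally, and with 2.0 the prior contributes only about 6%, which mirrors the two failure regimes at either end of the sweep.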

VIII Evaluation of Deployed System

An important motivation for this work is to leverage RL policies as a reliable strategy to control robots. In this section we assess the ability of BCF to attain this during deployment, and address the current limitations of both RL and classical robotics. As shown in Figure 9, BCF thrives in out-of-distribution states, a common occurrence when considering the sim-to-real setting. As opposed to a neural network-based policy catastrophically failing in these states, BCF has the ability to naturally transfer control to the risk-averse control prior that dominates control until a more suitable state is presented to the policy. In this setting, we gain the optimality of the learned policy in less uncertain states and the safety of the hand-crafted control prior otherwise. We thoroughly evaluate this control strategy for the two robotics tasks in both simulation and the real world. We evaluate the individual components in isolation against our compositional BCF formulation. We provide details of the evaluated systems below.

  1. Control Prior: The deterministic classical controller derived using analytic methods.

  2. SAC: Vanilla SAC agent, trained to convergence using BCF, and deployed as a standalone policy.

  3. BCF: Our proposed hybrid control strategy that combines uncertainty-aware outputs from the control prior and the learned RL policy.

Note that all the policies used in this evaluation were trained to convergence in the simulation environments. We provide a detailed account for each task in the following sections.

VIII-A PointGoal Navigation

In this experiment, we examine whether BCF could overcome the limitations of an existing reactive navigation controller, in this case APF, while leveraging this control prior to safely deal with out-of-distribution states that the policy could fail in. The APF controller used exhibited suboptimal oscillatory behaviours particularly in between obstacles and tended to stagnate within local minima.

VIII-A1 Simulation Environment Evaluation

For this task, we report the Success weighted by (normalized inverse) Path Length (SPL) metric proposed by [anderson2018evaluation]. SPL weighs success by how efficiently the agent navigated to the goal relative to the shortest path. The metric requires a measure of the shortest path to the goal, which we approximate using the path found by an A-Star search across a 2000 × 1000 grid. An episode is deemed successful when the robot arrives within 0.2 m of the goal. The episode is timed out after 500 steps and is considered unsuccessful thereafter. We additionally report the average actuation time it takes for the agent to reach the goal.
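As a brief illustration of the metric, SPL averages, over episodes, the binary success indicator multiplied by the ratio of the shortest path length to the longer of the shortest and actually travelled path lengths. The sketch below follows that standard definition. The function and argument names are illustrative, and the shortest path lengths are assumed to be supplied by a separate planner such as the A-Star search mentioned above.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by (normalised inverse) Path Length.

    successes        : 1 for a successful episode, 0 otherwise
    shortest_lengths : shortest path length from start to goal per episode
    path_lengths     : length of the path the agent actually travelled
    """
    total = 0.0
    for success, shortest, travelled in zip(successes, shortest_lengths,
                                            path_lengths):
        total += success * shortest / max(travelled, shortest)
    return total / len(successes)


# Three hypothetical episodes: optimal success, inefficient success, failure.
print(spl([1, 1, 0], [10.0, 10.0, 10.0], [10.0, 20.0, 15.0]))
```

An agent that succeeds but takes twice the shortest path contributes only 0.5 for that episode, and a failed episode contributes 0 regardless of the distance travelled.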

As shown in Table II, we divide this evaluation across the known training environment and an unseen environment in order to better evaluate the impact of BCF. Across both settings, BCF attains superior performance when compared to the control prior and is able to learn a controller that surpasses the performance of the control prior. The lower actuation time indicates its ability to overcome the inefficient oscillatory motion while the higher SPL indicates its ability to attain a shorter path to the goal. More importantly we note that BCF and SAC attain the same performance in the known training environment. This is an important result that shows that BCF does not impact the optimality of the learned policy. In the unseen environment however we see that BCF surpasses the performance of the SAC agent, given its ability to reliably deal with out-of-distribution states. We extend this evaluation to the sim-to-real setting, where we explore BCF as a reliable transfer strategy for an RL policy.

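| Method | Training Environment: SPL | Training Environment: Actuation Time | Unseen Environment: SPL | Unseen Environment: Actuation Time |
|---|---|---|---|---|
| Control Prior | 0.299 | 462 ± 78.1 | 0.273 | 401 ± 149 |
| SAC | 0.958 | 141 ± 97.6 | 0.780 | 185 ± 179 |
| BCF (Ours) | 0.958 | 104 ± 11.9 | 0.909 | 149 ± 129 |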
TABLE II: Evaluation of PointGoal Navigation in the Simulation Environment

VIII-A2 Real-World Evaluation

We utilise a GuiaBot mobile robot equipped with a 180° laser scanner, matching that used in the simulation environment. The velocity outputs from the policies are scaled to a fixed maximum before execution on the robot at a rate of 100 Hz. The system was deployed in a cluttered indoor office space that was previously mapped using the laser scanner. We utilise the ROS AMCL package to localise the robot within this map and extract the necessary state inputs for the policy network and control prior. Despite having a global map, the agent is only provided with global pose information with no additional information about its operational space. The environment also contained clutter which was unaccounted for in the mapping process. To enable large traversals through the office space, we utilise a global planner to generate target sub-goals for our reactive agents to navigate towards. We do not report the SPL metric for the real robot experiments as we did not have access to an optimal path. We do, however, provide the distance travelled along each path and compare it to the distance travelled by a fine-tuned ROS move_base controller. This controller is not necessarily the optimal solution but serves as a practical example of a commonly used controller on the GuiaBot.

The evaluation was conducted on two different trajectories indicated as Trajectory 1 and 2 in Figure 13 and Table III. Trajectory 1 consisted of a lab space with multiple obstacles, tight turns, and dynamic human subjects along the trajectory, while Trajectory 2 consisted of narrow corridors never seen by the robot during training. We terminated a trajectory once a collision occurred and marked the run as a failed attempt. We summarise the results in Table III.

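| Method | Trajectory 1: Distance Travelled | Trajectory 1: Actuation Time | Trajectory 2: Distance Travelled | Trajectory 2: Actuation Time |
|---|---|---|---|---|
| Control Prior | 42.3 | 274 | 35.3 | 277 |
| Policy Only | Fail | Fail | Fail | Fail |
| Move Base | 62.6 | 263 | 35.8 | 258 |
| BCF | 41.2 | 135 | 30.4 | 117 |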
TABLE III: Evaluation of PointGoal Navigation in the Real-World
Fig. 13: Trajectories taken by the real robot for different start (orange) and goal locations in a cluttered office environment with long narrow corridors. The trajectory was considered unsuccessful if a collision occurred. The trajectory taken by BCF is colour coded to represent the uncertainty in the linear velocity of the trained policy. We illustrate the behaviour of the fused distributions at key areas along the trajectory. The symbols 1 and 2 indicate the start locations for each trajectory and G indicates the corresponding goal locations.

Across both trajectories, the standalone SAC agent failed to complete a trajectory without any collisions, exhibiting sporadic reversing behaviours in out-of-distribution states. We can attribute this behaviour to its poor generalisation in such states, given the discrepancies between the obstacle profiles seen during training in simulation and those encountered in the real world, as shown in Figure 6 (a). The control prior was capable of completing all trajectories but required significantly longer actuation times. We can attribute this to its inefficient oscillatory motion when moving through passageways and in between obstacles. BCF was successful across both trajectories, exhibiting the lowest actuation times across all methods. This indicates its ability to exploit the optimal behaviours learned by the agent while ensuring it did not act sporadically when presented with out-of-distribution states. It also demonstrates superior results when compared with the fine-tuned ROS move_base controller.

To gain a better understanding of the reasons for BCF's success when compared to the control prior and SAC agent acting in isolation, we examine the trajectories taken by these systems as shown in Figure 13. The trajectory attained using BCF is colour-coded to illustrate the uncertainty of the policy's actions as given by the outputs of the ensemble. We draw the reader's attention to the region marked A, which exhibits higher values of policy uncertainty. The composition of the respective distributions at this region is shown within the orange ring. Given the higher policy uncertainty at this point, the resulting composite distribution was biased more towards the control prior, which displayed greater certainty, allowing the robot to progress beyond this point safely. We note here that this is the particular region in which the SAC agent failed, as shown in Figure 13 (c). The purple ring at region C illustrates a region of low policy uncertainty with the composite distribution biased closer towards the policy. To compare the performance benefit gained over the control prior in such a case, we draw the reader's attention to regions B and D, which show the path profile taken by the respective agents. The dense darker path shown by the control prior indicates regions of high oscillatory behaviour and significant time spent at a given location. On the other hand, we see that BCF does not exhibit this and attains a smoother trajectory, which we can attribute to the learned policy having higher precedence in these regions, stabilising the oscillatory effects of the control prior. This illustrates the ability of BCF to exploit the relative strengths of each component throughout deployment.
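One common way to obtain a single uncertainty estimate from an ensemble of Gaussian policy heads is to moment-match the equally weighted mixture to one Gaussian. The sketch below shows that computation. Whether this matches the ensemble aggregation used in this work is an assumption, and the function name and example values are hypothetical.

```python
def ensemble_to_gaussian(means, stds):
    """Collapse an equally weighted mixture of 1-D Gaussians to (mean, std).

    The returned variance combines the members' own variances with the
    spread of their means, so disagreement between members raises it.
    """
    n = len(means)
    mu = sum(means) / n
    second_moment = sum(s ** 2 + m ** 2 for m, s in zip(means, stds)) / n
    var = max(second_moment - mu ** 2, 0.0)  # guard against rounding below 0
    return mu, var ** 0.5


# Members that agree (familiar state) versus members that disagree
# (hypothetical out-of-distribution state).
print(ensemble_to_gaussian([0.5, 0.5, 0.5], [0.1, 0.1, 0.1]))
print(ensemble_to_gaussian([-0.5, 0.5], [0.1, 0.1]))
```

In the second call the members disagree on the action, so the combined standard deviation is much larger than any individual member's. This is the kind of signal that would shift a fused distribution towards the control prior, as in region A.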

VIII-B Maximum Manipulability Reacher

We evaluate the ability of BCF to build upon the basic structure provided by an RRMC reaching controller for a 7 DoF robot arm in order to learn a more complex manipulability-maximising reaching controller. While RRMC provides the policy with the knowledge to reach a goal, the agent has to learn how to modify the individual joint velocities in order to maximise the manipulability of the controller.

Fig. 14: Manipulability and uncertainty curves for known and out-of-distribution goals for the reacher task, deployed on a real robot. The red cross indicates a failed trajectory.
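| Method | Training Environment: Average Manipulability | Training Environment: Success Rate | Unseen Environment: Average Manipulability | Unseen Environment: Success Rate |
|---|---|---|---|---|
| Control Prior | 0.0633 ± 0.0126 | 1.00 | 0.0628 ± 0.0197 | 1.00 |
| SAC | 0.0949 ± 0.00476 | 1.00 | 0.0810 ± 0.0296 | 0.800 |
| BCF (Ours) | 0.0972 ± 0.00593 | 1.00 | 0.0924 ± 0.0161 | 1.00 |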
TABLE IV: Evaluation of Maximum Manipulability Reacher in the Simulation Environment

VIII-B1 Simulation Environment Evaluation

For this task, we report the average manipulability across an entire trajectory and the success rate of the agent out of 10 trials. The robot was trained with a subset of goals randomly sampled from the positive x-axis region of its workspace frame as shown in Figure 6. We classify goal states sampled from outside this region as out-of-distribution states during evaluation. Similar to the navigation task, BCF attains the best performance in both settings, improving the manipulability of the control prior by 34.9% without any failure cases. While the standalone SAC agent successfully attained optimal behaviours in goal states within the training distribution, it exhibited sporadic and unsafe behaviours when presented with out-of-distribution goal states. In these cases, the robot was seen to constantly crash into the counter top or hit its joint limits. We provide videos demonstrating these behaviours on our project page.
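For readers unfamiliar with the metric, manipulability is commonly quantified with Yoshikawa's measure, the square root of the determinant of the Jacobian multiplied by its transpose. The sketch below computes that measure and averages it over a trajectory. It assumes this is the measure reported here, and it uses a hypothetical planar two-link arm rather than the 7 DoF arm from the experiments so that the result is easy to check by hand.

```python
import math


def gram_matrix(J):
    """Return J @ J^T for a matrix given as a list of rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in J] for ri in J]


def determinant(M):
    """Determinant via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n = len(A)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            A[col], A[pivot] = A[pivot], A[col]
            det = -det
        det *= A[col][col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
    return det


def manipulability(J):
    """Yoshikawa measure sqrt(det(J J^T)); zero at a singular configuration."""
    return math.sqrt(max(determinant(gram_matrix(J)), 0.0))


def planar_2link_jacobian(q1, q2, l1=1.0, l2=1.0):
    """Position Jacobian of a hypothetical planar two-link arm."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def average_manipulability(joint_trajectory):
    """Mean manipulability over a list of (q1, q2) configurations."""
    values = [manipulability(planar_2link_jacobian(q1, q2))
              for q1, q2 in joint_trajectory]
    return sum(values) / len(values)


print(average_manipulability([(0.3, 0.0), (0.3, math.pi / 2)]))
```

For this toy arm with unit link lengths the measure reduces to the absolute value of sin(q2). It vanishes when the arm is fully stretched and peaks when the elbow is at a right angle, which gives some intuition for why a reaching controller that ignores manipulability can end up in poorly conditioned configurations.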

VIII-B2 Real-World Evaluation

To ensure that the simulation-trained policies could be transferred directly to a real robot, we matched the coordinate frames of the PyRep simulator with the real Franka Emika Panda robot setup shown in Figure 6. The state and action spaces were matched with those used in the training environment, with all actions scaled down to a fixed maximum before being published to the robot at a rate of 100 Hz.

Table V shows the results obtained when evaluating the agent on a random set of goals sampled from the robot's entire workspace. In all cases BCF attains the highest manipulability and success rate, surpassing both the control prior and the SAC policy and illustrating its ability to deal with higher dimensional action spaces. We take a closer look at individual trajectories across known and out-of-distribution goal states in order to better understand how BCF attains successful trajectories when compared to a standalone SAC policy. Figure 14 shows the manipulability curves of the robot across these two sets, for three different goals sampled from each region. For each goal, we additionally indicate the performance of the control prior and SAC agent, together with a separate plot of the ensemble uncertainty estimate across the trajectory indicated by the red curves.

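| Method | Average Manipulability | Final Manipulability | Success Rate |
|---|---|---|---|
| Control Prior | 0.0629 ± 0.00926 | 0.0658 ± 0.0165 | 98.2% |
| SAC | 0.0803 ± 0.00514 | 0.07812 ± 0.0150 | 78.6% |
| BCF | 0.0836 ± 0.0156 | 0.0889 ± 0.0177 | 98.2% |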
TABLE V: Evaluation of Maximum Manipulability Reacher in the Real-World

In the case of the known goals, BCF and the SAC agent both attain similar performance, maximising the manipulability of the agent across the trajectory. This is in stark contrast to the control prior, which exhibits significantly poorer performance across the trajectory. Note here that while the control prior exhibits poor performance with regard to manipulability, it is still successful in completing the reaching task at hand without any failures. It is interesting to note the high uncertainty of the ensemble at the start of a trajectory, which quickly drops to a significantly lower value. The high uncertainty could be a result of the multiple possible trajectories that the robot could take at the start, a set which quickly narrows once the robot begins to move. Note that once the policy ensemble exhibits a lower uncertainty, the performance of BCF closely resembles that of the standalone SAC agent, indicating that BCF does not cripple the optimality of the learned policy.

When evaluating the agents on out-of-distribution goals, BCF plays an important role in ensuring that the robot can successfully and safely complete the task. Note the higher levels of uncertainty across these trajectories when compared to the known goals case. In all these cases, the standalone SAC agent fails to successfully complete a trajectory, frequently self-colliding or exhibiting sporadic behaviours. We indicate these failed trajectories with a red cross in Figure 14. BCF is seen to closely follow the behaviours of the control prior in states of high uncertainty, steering the robot away from such catastrophic failures. While the composite control strategy works well to ensure the safety of the robot, the higher reliance of the system on the control prior results in suboptimal behaviour with regard to manipulability. The trade-off between task optimality and the safety of the robot is an interesting dilemma that BCF attempts to balance naturally. The fixed standard deviation chosen for the prior controller could serve as a tuning parameter to allow the user to control this trade-off at deployment. A smaller standard deviation would bias the resulting controller more strongly towards the control prior, yielding more conservative and suboptimal actions, whereas a larger standard deviation would allow for close to optimal behaviours at the expense of the robot's safety. We leave the exploration of this idea to future work.

We provide videos illustrating the behaviours of the real robot for both the navigation and reaching tasks for each of the different controllers on our project site.

IX Conclusion

Building on the large body of work already developed by the robotics community can greatly help accelerate the use of RL based systems, allowing us to develop better controllers for robots as they move towards solving more complex tasks. The ideas presented in this paper demonstrate a strategy that closely couples traditional controllers with learned systems, exploiting the strengths of each approach in order to attain more reliable and robust behaviours. We see this as a promising step towards bringing reinforcement learning to real-world robotics.

Our Bayesian Controller Fusion (BCF) approach combines uncertainty-aware outputs from the two control modalities. In doing this, we show that we not only accelerate training, but additionally learn a final policy that can substantially improve beyond the performance of the handcrafted controller, regardless of its degree of suboptimality. We show results across a navigation task and a reaching task, where BCF attains a final policy exhibiting a 116% and 282% improvement, respectively, beyond the initial performance of the control prior used, substantially higher than that attained by existing approaches. More importantly, we show that our approach can exploit the risk-aversity provided by these classical controllers to allow for safe exploratory behaviours when presented with unknown states.

At deployment, we show that forming a hybrid controller with BCF allows us to exploit the respective strengths of each controller, enabling reliable performance of RL policies in the real world. Across two real-world tasks for navigation and reaching, we show that BCF can safely deal with out-of-distribution states in the sim-to-real setting, succeeding where a typical standalone policy would fail, while attaining the optimality of the learned behaviours in known states. In the navigation domain, we overcome the inefficient oscillatory motion of an existing reactive navigation controller, decreasing the overall actuation time during real-world navigation runs by 50.7%. For the reaching task, we show that our hybrid controller achieves the highest success rate, and improves the manipulability of an existing reaching controller by 34.9%, a system typically difficult to attain using analytical approaches.

While the uncertainty-based compositional policy we derive using BCF does train with the control prior in the loop, the policy is not directly aware of the control prior’s presence. This could impact its overall ability to work in synergy with the control prior at deployment. In future work, we propose to incorporate the control prior in the Q-value update or alternatively learn a gating parameter to better inform the fusion process. This should allow the hybrid controller to operate on more complex tasks involving higher dimensional action spaces. We are also interested in exploring alternative state uncertainty estimation techniques for RL, that are faster than the ensemble-based method used in this work. This includes work from the supervised learning literature for out-of-distribution detection and distance-based uncertainty estimation techniques.


This research was supported by the Australian Research Council Centre of Excellence for Robotic Vision (project number CE140100016) and the QUT Centre for Robotics. The authors would like to thank Jake Bruce, Robert Lee, Mingda Xu, Dimity Miller, Thomas Coppin and Jordan Erskine for their valuable and insightful discussions towards this contribution.