Safe Driving via Expert Guided Policy Optimization

by   Zhenghao Peng, et al.

When learning common skills like driving, beginners usually have domain experts standing by to ensure the safety of the learning process. We formulate such a learning scheme as Expert-in-the-loop Reinforcement Learning (ERL), where a guardian is introduced to safeguard the exploration of the learning agent. While allowing sufficient exploration in the uncertain environment, the guardian intervenes in dangerous situations and demonstrates the correct actions to avoid potential accidents. Thus ERL enables both exploration and the expert's partial demonstration as two training sources. Following this setting, we develop a novel Expert Guided Policy Optimization (EGPO) method which integrates the guardian in the loop of reinforcement learning. The guardian is composed of an expert policy to generate demonstrations and a switch function to decide when to intervene. In particular, a constrained optimization technique is used to tackle the trivial solution in which the agent deliberately behaves dangerously to deceive the expert into taking over. An offline RL technique is further used to learn from the partial demonstration generated by the expert. Safe driving experiments show that our method achieves superior training and test-time safety, outperforms baselines by a substantial margin in sample efficiency, and preserves generalizability to unseen environments at test time. Demo video and source code are available at:




Code Repositories


Official implementation of CoRL 2021 paper "Safe Driving via Expert Guided Policy Optimization".


1 Introduction

Reinforcement Learning (RL) shows promising results in human-interactive applications ranging from autonomous driving [kendall2019learning] and the power systems of smart buildings [mason2019review] to surgical robotic arms [richter2019open]. However, training and test-time safety remains a major concern for real-world applications of RL. This problem draws significant attention since the agent needs to explore the environment sufficiently in order to optimize its behaviors. It might be inevitable for the agent to experience dangerous situations before it can learn how to avoid them [chow2018lyapunov], even when the training algorithms contain sophisticated techniques to reduce the probability of failures [achiam2017constrained, stooke2020responsive, bharadhwaj2020conservative].

We humans do not learn purely from trial-and-error exploration, for the sake of safety and efficiency. In daily life, when learning common skills like driving, we usually ensure safety by involving a domain expert to safeguard the learning process. The expert not only demonstrates the correct actions but also acts as a guardian that allows our own safe exploration in the uncertain environment. For example, as illustrated in Fig. 1, when learning to drive, the student with a learner's permit can directly operate the vehicle in the driver's seat while the instructor stands by. When a risky situation happens, the instructor takes over the vehicle to avoid the potential accident. Thus the student can learn how to handle tough situations both from exploration and from the instructor's demonstrations.

In this work, we formulate such a learning scheme as Expert-in-the-loop RL (ERL). As shown in the right panel of Fig. 1, ERL incorporates a guardian in the interaction between agent and environment. The guardian contains a switch mechanism and an expert policy. The switch decides to intervene in the free exploration of the agent when the agent is behaving unreasonably or a potential critical failure is about to happen. In those cases the expert takes over the main operation and starts providing demonstrations on solving the task or avoiding dangers. Our setting of ERL extends previous works on Human-in-the-loop RL in two ways. First, the guardian inspects the exploration all the time and actively intervenes if necessary, instead of passively advising which action is good [mandel2017add] or evaluating the collected trajectories after the agent rolls out [christiano2017deep, guan2020explanation]. This feature guarantees safe exploration at training time. Second, the guardian does not merely interrupt the exploration and terminate the episode [saunders2018trial]. Instead, it demonstrates to the agent the correct actions to escape risky states. Those demonstrations become effective training data for the agent.

Figure 1: The expert intervenes the learner in dangerous situations. We model it through the Expert-in-the-loop RL scheme on the right panel where a guardian is introduced in the loop of the interaction between agent and environment.

Following the setting of ERL, we develop a novel method called Expert Guided Policy Optimization (EGPO). EGPO addresses two challenges in ERL. First, the learning agent may abuse the guardian and consistently cause interventions so that it can exploit the high performance and safety of the expert. To tackle this issue, we apply the Lagrangian method to the policy optimization to limit the intervention frequency. Moreover, we use a PID controller to update the Lagrangian multiplier, which substantially improves the dual optimization with an off-policy RL algorithm. The second issue is the partial demonstration data collected from the guardian. Since these data are highly off-policy with respect to the learning agent, we introduce an offline RL technique into EGPO to stabilize training with the off-policy partial demonstration. The experiments show that our method achieves superior training safety while yielding a well-performing policy with high safety at test time. Furthermore, our method exhibits better generalization performance compared to previous methods.

As a summary, the main contributions of this work are: (1) We formulate the Expert-in-the-loop RL (ERL) framework that incorporates the guardian as a demonstrator as well as a safety guardian. (2) We develop a novel ERL method called Expert Guided Policy Optimization (EGPO) with a practical implementation of guardian mechanism and learning pipeline. (3) Experiments show that our method achieves superior training and test safety, outperforms baselines with a large margin in sample efficiency, and generalizes well to unseen environments in test time.

2 Related Work

Safe RL. Learning RL policies under safety constraints [garcia2015comprehensive, amodei2016concrete, bharadhwaj2020conservative] has become an important topic in the community due to safety concerns in real-world applications. Many methods based on constrained optimization have been developed, such as trust region methods [achiam2017constrained], Lagrangian methods [achiam2017constrained, stooke2020responsive, calian2020balancing], barrier methods [taylor2020learning, liu2020ipo], and Lyapunov methods [chow2018lyapunov, sikchi2020lyapunov]. Another direction is based on the safety critic, where an additional value estimator is learned to predict cost, apart from the primal critic estimating the discounted return [bharadhwaj2020conservative, srinivasan2020learning]. [saunders2018trial] propose HIRL, a scheme for safe RL that requires extra manual effort to demonstrate and train an imitation-learning decider that intervenes when the agent is endangered. In contrast, in our work the guardian does not terminate the exploration but instead continues the trajectory, with the expert demonstrating the proper actions to escape risky states. However, the majority of the aforementioned methods share the issue that only an upper bound on the failure probability of the learning agent can be guaranteed theoretically, and there is no mechanism to explicitly prevent the occurrence of critical failures. Another approach assumes that the cost function is a linear transformation of the action and thus equips the policy network with a safety layer that can modulate the output action into an absolutely safe action. The proposed EGPO utilizes the guardian to ensure safe exploration without assuming the structure of the cost function.

Learning from Demonstration. Many works consider leveraging collected demonstrations to improve the policy. Behavior Cloning (BC) [widrow1964pattern] and Inverse RL [sun2021adversarial] use supervised learning to fit the policy function or the reward function, respectively, to produce the same actions as the expert. GAIL [ho2016generative, Sun_2020_corl, huanglearning2020] and SQIL [reddy2019sqil] ask the learning agent to act in the environment and collect trajectories to evaluate the divergence between the agent and the expert. This exposes the agent to possibly dangerous states. DAgger [ross2011reduction] periodically queries the expert for new demonstrations and has been successfully applied to many domains [kelly2019hg, zhang2016query]. Recently, offline RL, which learns a policy from a dataset generated by arbitrary policies, has drawn wide attention [levine2020offline, fujimoto2018off, wu2019behavior]. The main challenge of offline RL is out-of-distribution (OOD) actions [fujimoto2018off]. Conservative Q-Learning (CQL) [kumar2020conservative] addresses the impact of OOD actions by learning a conservative Q-function that estimates lower bounds of the true Q values. In this work, we use the CQL technique to improve training on the trajectories with partial demonstrations given by the guardian.

Human-in-the-loop RL. An increasing number of works focus on incorporating humans into the training loop of RL. The human is responsible for evaluating the trajectories sampled by the learning agent [christiano2017deep, ibarz2018reward, guan2020explanation], or acts as a consultant who suggests which action to take when the agent requests it [mandel2017add]. Besides, the human can also actively monitor the training process, for example by deciding whether to terminate the episode when potential danger arises [abel2017agent, saunders2018trial]. Human-Gated DAgger (HG-DAgger) [kelly2019hg] and Expert Intervention Learning (EIL) [spencer2020learning] utilize experts to intervene in the exploration and carry the agent to safe states before giving back control. However, previous works have explored much less how to (1) optimize the agent to minimize interventions, (2) efficiently utilize the data generated in free exploration, and (3) learn from the takeover trajectories given by the expert. Addressing these challenges, our work is derived from the Human-in-the-loop framework, where the guardian plays the role of the human expert and provides feedback to the learning agent.

3 Expert Guided Policy Optimization

Extending the setting of Human-in-the-loop RL, we frame the Expert-in-the-loop RL (ERL) that incorporates the guardian to ensure training safety and improve efficiency. We develop a novel method called Expert Guided Policy Optimization (EGPO) to implement the guardian mechanism.

3.1 Overview of the Guardian Mechanism

Figure 2: Flowchart of the guardian mechanism.

Taking learning to drive as a motivating example, generally speaking, the student driver learns the skills of driving from the instructor through two approaches: (1) Student learns from instructor’s demonstrations. At the early stage of training, the student observes the demonstrations given by the instructor and learns rapidly by imitating the behaviors. Besides, the student also learns how the expert tackles dangerous situations; (2) Student in driver’s seat operates the vehicle in an exploratory way while the instructor serves as guardian. The student can explore freely until the instructor conducts takeover of the vehicle in dangerous situations. Therefore, the student learns to drive from both the imitation of the expert and the free exploration.

Based on this motivating example, we have the framework of Expert-in-the-loop RL (ERL). As illustrated in the right panel of Fig. 1, we introduce the component of guardian on top of the conventional RL scheme, which resembles the instructor who not only provides high-quality demonstrations to accelerate the learning, but also safeguards the exploration of agent in the environment. In the proposed EGPO method, the guardian is composed of two parts: an expert and a switch function.

The expert policy $\mathcal{E}$ can output safe and reliable actions $a^E$ most of the time. Besides, it can provide the probability of taking the action $a$ produced by the agent: $\mathcal{E}(a|s)$. This probability reflects the agreement of the expert with the agent's action, which serves as an indicator for intervention in the switch function. We assume access to such a well-performing expert policy. The switch is the other part of the guardian, which decides in which states and at what time the expert should intervene and demonstrate the correct actions to the learning agent. As shown in Fig. 2, the switch function $\mathcal{T}$ considers the agent action $a$ as well as the expert $\mathcal{E}$ and outputs the modulated action $\hat{a}$ fed to the environment and the intervention occurrence $\hat{c}$ indicating whether the guardian is taking over the control:

$$\mathcal{T}(s, a, \mathcal{E}) = (\hat{a}, \hat{c}) = \begin{cases} (a, 0), & \text{if } a \in \mathcal{A}_\eta(s), \\ (a^E \sim \mathcal{E}(\cdot|s), 1), & \text{otherwise}, \end{cases} \qquad \mathcal{A}_\eta(s) = \{a : \mathcal{E}(a|s) \ge \eta\}, \tag{1}$$

wherein $\eta$ is the confidence level on the expert action probability and $\mathcal{A}_\eta(s)$ is the confident action space of the expert. The switch mechanism leads to the formal representation of the behavior policy:

$$\hat{\pi}(a|s) = \pi(a|s)\,\mathbb{1}[a \in \mathcal{A}_\eta(s)] + \mathcal{E}(a|s)\,F(s), \tag{2}$$

wherein $F(s) = \int_{a' \notin \mathcal{A}_\eta(s)} \pi(a'|s)\,da'$ is a function denoting the probability of the agent choosing an action that will be rejected by the switch. Emulating how human drivers judge risky situations, we rely on the expert's evaluation of safety during training, instead of any external objective criterion.
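The switch described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-dimensional Gaussian expert, its steering rule, and the names `GaussianExpert` and `switch` are hypothetical and not taken from the paper's implementation. Note that for a continuous action the expert "probability" is a density, so in this sketch the threshold is compared against a density value.

```python
import math
import random


class GaussianExpert:
    """Hypothetical 1-D expert: a Gaussian policy whose mean depends on the state."""

    def __init__(self, std=0.2):
        self.std = std

    def mean(self, state):
        # Illustrative rule (assumption): steer back toward the lane centre,
        # where `state` is the lateral offset of the vehicle.
        return -0.5 * state

    def prob(self, state, action):
        """Density of the expert policy evaluated at the agent's action."""
        z = (action - self.mean(state)) / self.std
        return math.exp(-0.5 * z * z) / (self.std * math.sqrt(2.0 * math.pi))

    def sample(self, state, rng):
        return rng.gauss(self.mean(state), self.std)


def switch(state, agent_action, expert, eta, rng):
    """Return (applied_action, intervention_flag).

    The agent action is kept when the expert assigns it at least `eta`
    probability; otherwise the expert takes over and its own action is applied.
    """
    if expert.prob(state, agent_action) >= eta:
        return agent_action, 0
    return expert.sample(state, rng), 1


rng = random.Random(0)
expert = GaussianExpert()
accepted = switch(0.0, 0.0, expert, eta=0.5, rng=rng)   # agent agrees with expert
rejected = switch(0.0, 1.0, expert, eta=0.5, rng=rng)   # agent action is far off
```

With a threshold of zero every agent action is accepted, so the guardian never intervenes. Raising the threshold shrinks the set of accepted actions and makes takeovers more frequent.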

We derive a guarantee on training safety from the introduction of the guardian. We first make an assumption on the expert:

Assumption 1 (Failure probability of the expert).

For all states $s$, the step-wise probability of the expert producing an unsafe action is bounded by a small value $\epsilon$: $\mathbb{E}_{a \sim \mathcal{E}(\cdot|s)}\,\mathbb{I}(s, a) \le \epsilon$, wherein $\mathbb{I}(s, a)$ is a Boolean denoting whether the next state is a ground-truth unsafe state.

We use the expected cumulative probability of failure to measure the expected risk encountered by the behavior policy: $V_{\hat{\pi}} = \mathbb{E}_{\tau \sim P_{\hat{\pi}}} \sum_{(s_t, a_t) \in \tau} \gamma^t\,\mathbb{I}(s_t, a_t)$, wherein $P_{\hat{\pi}}$ refers to the trajectory distribution induced by the behavior policy $\hat{\pi}$. We propose the main theorem of this work:

Theorem 1 (Upper bound of the training risk).

The expected cumulative probability of failure $V_{\hat{\pi}}$ of the behavior policy $\hat{\pi}$ in EGPO is bounded from above by a quantity that depends on the step-wise failure probability of the expert $\epsilon$ and on the confidence level $\eta$, through a term that has negative correlation to $\eta$.

When $\epsilon$ is fixed, increasing the confidence level $\eta$ will shrink the upper bound of $V_{\hat{\pi}}$, leading to better training safety. The proof is given in the Appendix.

In the implementation, the actions from agent are firstly modulated by the guardian and the safe actions will be applied to the environment. We update the learning agent with off-policy RL algorithm. Meanwhile, we also leverage a recent offline RL technique to address the partial demonstrations provided by the guardian and further improve the learning stability. The policy learning is presented in Sec. 3.2. Since the intervention from guardian indicates the agent has done something wrong, we also optimize the policy to reduce intervention frequency through the constrained optimization in Sec. 3.3.

3.2 Learning Policy from Exploration and Partial Demonstration

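The proposed EGPO method can work with most RL algorithms to train the safe policy, since the guardian mechanism does not impose any assumption on the underlying RL method. In this work, we use the off-policy actor-critic method Soft Actor-Critic (SAC) [haarnoja2018soft] to train the agent. The method utilizes two neural networks: a Q network estimating the state-action value, $Q_\phi(s, a)$, and a policy network, $\pi_\theta(a|s)$, where $\phi$ and $\theta$ are the parameters. The training algorithm alternates between policy evaluation and policy improvement in each iteration. The policy evaluation process updates the estimated Q function by minimizing the squared (L2) entropy-regularized TD error:

$$L_Q(\phi) = \mathbb{E}_{(s_t, \hat{a}_t, r_t, s_{t+1}) \sim \mathcal{B}} \Big[ \tfrac{1}{2} \big( Q_\phi(s_t, \hat{a}_t) - \big( r_t + \gamma\,\mathbb{E}_{a_{t+1} \sim \pi_\theta(\cdot|s_{t+1})} [ Q_{\phi'}(s_{t+1}, a_{t+1}) - \alpha \log \pi_\theta(a_{t+1}|s_{t+1}) ] \big) \big)^2 \Big]. \tag{3}$$

Here $\mathcal{B}$ is the replay buffer, $\hat{a}_t$ is the action applied to the environment, $\phi'$ denotes the delayed parameters, and $\alpha$ is a temperature parameter. On the other hand, the policy improvement objective, which should be minimized, is written as:

$$L_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{B}}\,\mathbb{E}_{a \sim \pi_\theta(\cdot|s_t)} \big[ \alpha \log \pi_\theta(a|s_t) - Q_\phi(s_t, a) \big]. \tag{4}$$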
Since we use a safety-ensured mixed policy to explore the environment, part of the collected transitions contain actions from the expert. This part of the data constitutes the partial demonstration, namely the transitions whose intervention occurrence is $\hat{c}_t = 1$, and it leads to a distributional shift problem. Many works have been proposed to overcome this problem, such as V-trace in the on-policy algorithm IMPALA [espeholt2018impala], the advantage-weighted actor-critic [peng2019advantage] in the off-policy setting, and many other offline RL methods [wu2019behavior, fujimoto2018off, kumar2020conservative]. To train with the off-policy data produced by the guardian, we adopt the recent Conservative Q-Learning (CQL) [kumar2020conservative], known as an effective offline RL method, in our Learning from Partial Demonstration (LfPD) setting. The objective to update the Q function becomes:

$$\min_\phi\ \underbrace{\mathbb{E}_{(s_t, \hat{c}_t) \sim \mathcal{B}} \big[ \hat{c}_t\,\mathbb{E}_{a \sim \pi_\theta(\cdot|s_t)} Q_\phi(s_t, a) \big]}_{\text{1st Term}} - \underbrace{\mathbb{E}_{(s_t, \hat{a}_t, \hat{c}_t) \sim \mathcal{B}} \big[ \hat{c}_t\,Q_\phi(s_t, \hat{a}_t) \big]}_{\text{2nd Term}} + \underbrace{L_Q(\phi)}_{\text{3rd Term}}. \tag{5}$$

Note that the 1st Term and 2nd Term are expectations over only the partial demonstration, since the factor $\hat{c}_t$ masks out every transition without intervention, instead of the whole batch $\mathcal{B}$. In the partial demonstration data, the 1st Term reduces the Q values for the actions taken by the agent, while the 2nd Term increases the Q values of the expert actions $\hat{a}_t$ that were applied. The 3rd Term is the original TD learning objective in Eq. 3. CQL reflects the following idea: be conservative about the actions sampled by the agent, and be optimistic about the actions sampled by the expert. Minimizing Eq. 5 can lead to a better and more stable Q function. In the next section, we discuss another hurdle in training and propose a solution for intervention minimization.
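The three-term structure described above can be illustrated with a small, framework-free sketch. It is a simplification under stated assumptions: the first term uses the single agent action stored with each transition instead of fresh samples from the policy, no weighting coefficient is applied between the terms, and the transition keys (`s`, `a_agent`, `a_applied`, `c`, `target`) and the function name are hypothetical.

```python
def partial_demo_q_loss(batch, q_fn):
    """Sample-based estimate of a conservative Q loss on partial demonstrations.

    batch: list of dicts with keys
        s          state
        a_agent    action proposed by the learning agent
        a_applied  action actually applied (the expert's action if intervened)
        c          intervention flag, 1 if the guardian took over, else 0
        target     precomputed TD target for (s, a_applied)
    q_fn: callable (state, action) -> float, the current Q estimate
    """
    conservative = 0.0
    td = 0.0
    for tr in batch:
        q_agent = q_fn(tr["s"], tr["a_agent"])
        q_applied = q_fn(tr["s"], tr["a_applied"])
        # Terms 1 and 2: only transitions with c == 1 contribute. The loss
        # pushes Q down on the rejected agent action and up on the expert action.
        conservative += tr["c"] * (q_agent - q_applied)
        # Term 3: ordinary squared TD error on every transition.
        td += 0.5 * (q_applied - tr["target"]) ** 2
    n = len(batch)
    return conservative / n + td / n


# Toy check with a linear Q function (illustrative values only).
toy_q = lambda s, a: s + 2.0 * a
toy_batch = [
    {"s": 0.0, "a_agent": 1.0, "a_applied": 0.0, "c": 1, "target": 0.0},
    {"s": 1.0, "a_agent": 0.5, "a_applied": 0.5, "c": 0, "target": 2.0},
]
loss = partial_demo_q_loss(toy_batch, toy_q)
```

In the toy batch only the first transition was an intervention, so it alone contributes to the conservative part. The second transition enters only through the TD error.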

3.3 Intervention Minimization via Constrained Optimization

The guardian intervenes in the exploration of the agent once it behaves dangerously or inefficiently. However, if no measure is taken to limit the intervention frequency, the learning policy is prone to rely heavily on the guardian. It deceives the guardian mechanism by always taking dangerous actions so that the guardian takes over all the time. In this case, the learning policy receives high reward under the supervision of the guardian but fails to finish tasks independently.

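In this section, we consider intervention minimization as a constrained optimization problem and apply the Lagrangian method to the policy improvement process. Concretely, the optimization problem becomes: $\max_\theta \mathbb{E}_\tau \big[ \sum_t \gamma^t r_t \big]$ s.t. $\mathbb{E}_\tau \big[ \sum_t \gamma^t \hat{c}_t \big] \le C$, wherein the expectation is over trajectories $\tau$ generated by the behavior policy, $\hat{c}_t$ is the intervention occurrence at step $t$, and $C$ is the intervention frequency limit in one episode. The Lagrangian dual form of the above problem becomes an unconstrained optimization problem with a penalty term:

$$\max_\theta \min_{\lambda \ge 0}\ \mathbb{E}_\tau \Big[ \sum_t \gamma^t r_t - \lambda \Big( \sum_t \gamma^t \hat{c}_t - C \Big) \Big], \tag{6}$$

where $\lambda$ is known as the Lagrangian multiplier. The optimization over $\theta$ and $\lambda$ can be conducted iteratively between policy gradient ascent and stochastic gradient descent (SGD).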

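We additionally introduce an intervention critic $Q^I_\psi(s_t, a_t)$ to estimate the cumulative intervention occurrence $\mathbb{E} \big[ \sum_{t' \ge t} \gamma^{t'-t} \hat{c}_{t'} \big]$, where $\hat{c}_{t'}$ is the intervention occurrence at step $t'$. This network can be optimized following Eq. 3 with the reward replaced by the intervention occurrence. The intervention minimization objective can be written as:

$$L^I_\pi(\theta) = \mathbb{E}_{s_t \sim \mathcal{B}}\,\mathbb{E}_{a \sim \pi_\theta(\cdot|s_t)} \big[ Q^I_\psi(s_t, a) \big]. \tag{7}$$

Now we can update the policy by combining the policy improvement objective Eq. 4 with the intervention minimization objective Eq. 7 into the final objective:

$$L(\theta) = \mathbb{E}_{s_t \sim \mathcal{B}}\,\mathbb{E}_{a \sim \pi_\theta(\cdot|s_t)} \big[ \alpha \log \pi_\theta(a|s_t) - Q_\phi(s_t, a) + \lambda\,Q^I_\psi(s_t, a) \big], \tag{8}$$

with $\lambda$ the Lagrangian multiplier. Conducting SGD on Eq. 8 w.r.t. $\theta$ can improve the return while reducing the intervention.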
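To make the effect of the penalty concrete, the following sketch evaluates a combined objective of the kind described above (entropy term, minus the return critic, plus the multiplier times the intervention critic) on hand-picked numbers. The function name `policy_objective` and all values are hypothetical. The point is only to show how the multiplier shifts the preference between a high-return action that triggers takeovers and a slightly lower-return action that does not.

```python
def policy_objective(samples, alpha, lam):
    """Average of alpha * log_prob - q_value + lam * q_intervention.

    samples: list of (log_prob, q_value, q_intervention) tuples, one per
    sampled action. Lower is better, since the objective is minimized.
    """
    total = 0.0
    for log_prob, q_value, q_intervention in samples:
        total += alpha * log_prob - q_value + lam * q_intervention
    return total / len(samples)


# Illustrative values (assumptions): a "risky" action with high return but
# many expected interventions, and a "careful" action with fewer of both.
risky = [(-1.0, 10.0, 5.0)]
careful = [(-1.0, 8.0, 0.5)]

without_penalty = (policy_objective(risky, 0.2, 0.0),
                   policy_objective(careful, 0.2, 0.0))
with_penalty = (policy_objective(risky, 0.2, 1.0),
                policy_objective(careful, 0.2, 1.0))
```

Without the penalty the risky action has the lower (better) objective value. With a multiplier of 1.0 the ordering flips and the careful action is preferred, which is the behavior the constraint is meant to induce.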

SAC with the Lagrangian method has been proposed in [ha2020learning]. When attempting to reproduce the result in our task, we find that directly optimizing the Lagrangian dual in the off-policy RL algorithm SAC is highly unstable. [stooke2020responsive] analyze that optimizing the Lagrangian multiplier brings oscillations and overshoot, which destabilize the policy learning. This is because the update of the multiplier is an integral control from the perspective of control theory. Introducing extra proportional and derivative control to update the Lagrangian multiplier can reduce the oscillations and the corresponding cost violations. We thus adopt a PID controller to update $\lambda$ and form the responsive intervention minimization as:

$$\lambda \leftarrow K_p\,\delta^k + K_i \sum_{i=1}^{k} \delta^i + K_d\,(\delta^k - \delta^{k-1}), \qquad \delta^k = \mathbb{E}_\tau \Big[ \sum_t \hat{c}_t \Big] - C, \tag{9}$$

where we denote the training iteration as $k$, $\delta^k$ is the violation of the intervention limit $C$ measured at iteration $k$, and $K_p$, $K_i$, $K_d$ are the hyper-parameters. Optimizing $\lambda$ with Eq. 6 reduces to the proportional term in Eq. 9, while the integral and derivative terms compensate the accumulated error and overshoot in the intervention occurrence. We apply the PID controller in EGPO, as well as in the baseline SAC-Lagrangian method in the experiments. Empirical results validate that PID control on $\lambda$ brings more stable learning and robustness to hyper-parameters.
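A PID-controlled multiplier of this kind can be sketched as a small stateful class. This is a generic positional PID controller written under assumptions, not the authors' implementation: the class name, the gain values in the usage lines, and the projection of the multiplier onto non-negative values are all illustrative choices.

```python
class PIDLagrangeMultiplier:
    """Hypothetical PID update for a Lagrangian multiplier.

    The error signal is the measured episodic intervention occurrence minus
    the allowed limit. A positive error means the agent is intervened on too
    often, so the penalty weight should grow.
    """

    def __init__(self, kp, ki, kd, limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.limit = limit
        self.integral = 0.0      # running sum of errors (integral term)
        self.prev_error = 0.0    # last error, for the derivative term
        self.value = 0.0         # current multiplier

    def update(self, episodic_intervention):
        error = episodic_intervention - self.limit
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        raw = self.kp * error + self.ki * self.integral + self.kd * derivative
        # Assumption: keep the multiplier non-negative, as a Lagrangian
        # multiplier for an inequality constraint should be.
        self.value = max(0.0, raw)
        return self.value


pid = PIDLagrangeMultiplier(kp=1.0, ki=0.1, kd=0.5, limit=2.0)
first = pid.update(5.0)    # far above the limit: multiplier jumps up
second = pid.update(2.0)   # exactly at the limit: derivative term pulls it down
```

The derivative term reacts to the change in the violation rather than its size, which is what damps the overshoot that a purely integral update tends to produce.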

4 Experiments

Figure 3: A. The interface of the environment from MetaDrive [li2021metadrive]. B. The observations fed to the target vehicle. C. Examples of the scenes we use in training and test. D. The three events creating costs: crashing into a warning triangle, a cone, or another vehicle. A cost is incurred each time one of these events occurs.

4.1 Experimental Settings

Environment. We evaluate the proposed method and baselines in the recent driving simulator MetaDrive [li2021metadrive]. The environment supports generating an unlimited number of scenes via procedural generation. Each scene includes the vehicle agent, a complex road network, dense traffic flow, and many obstacles such as cones and warning triangles, as shown in Fig. 3D. The task for the agent is to steer the target vehicle with low-level signals, namely acceleration, brake and steering, to reach the predefined destination. Each collision with traffic vehicles or obstacles yields an environmental cost. The episodic cost at test time measures the safety of a policy and is independent of whether the expert is used during training. The reward function only contains a dense driving reward and a sparse terminal reward. The dense reward is the longitudinal movement toward the destination in Frenet coordinates. The sparse reward is given when the agent arrives at the destination. We build our testing benchmark on MetaDrive rather than other RL environments like Safety Gym [safety_gym_Ray2019] because we target the application of autonomous driving and the generalization of RL methods. Unlike those environments, MetaDrive can generate an infinite number of driving scenes, which allows evaluating the generalization of different methods by splitting the scenes into training and test sets in the context of safe RL.

Split of training and test sets. Different from the conventional RL setting where the agent is trained and tested in the same fixed environment, we focus on evaluating the generalization through testing performance. We split the scenes into the training set and test set with 100 and 50 different scenes respectively. At the beginning of each episode, a scene in the training or test set is randomly selected. After each training iteration, we roll out the learning agent without guardian in the test environments and record the percentage of successful episodes over multiple evaluation episodes, called success rate. Besides, we also record the episodic cost given by the environment and present it in following tables and figures.
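The two reported metrics can be computed from per-episode records as in the sketch below. The record format, the function names, and the use of the sample standard deviation across seeds are assumptions made for illustration. The text above only states that the success rate is the percentage of successful evaluation episodes and that the episodic cost comes from the environment.

```python
from statistics import mean, stdev


def summarize_episodes(episodes):
    """Summarize evaluation rollouts performed without the guardian.

    episodes: list of (success, cost) pairs, where `success` is True if the
    agent reached the destination and `cost` is the episodic cost.
    Returns (success_rate, mean_episodic_cost).
    """
    success_rate = sum(1 for ok, _ in episodes if ok) / len(episodes)
    episodic_cost = mean(cost for _, cost in episodes)
    return success_rate, episodic_cost


def aggregate_over_seeds(values):
    """Mean and sample standard deviation of one metric across random seeds."""
    return mean(values), stdev(values)


# Illustrative records for one evaluation round (made-up values).
rate, cost = summarize_episodes([(True, 0.0), (False, 2.0), (True, 1.0), (True, 1.0)])
seed_mean, seed_std = aggregate_over_seeds([0.80, 0.90, 0.85])
```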

Training expert policy. In our experiment, the expert policy is a stochastic policy trained with Lagrangian PPO [safety_gym_Ray2019] using a batch size as large as 160,000 and a long training time. To further improve the performance of the expert, we apply reward engineering by doubling the cost and adding a complex penalty for dangerous actions.

Implementation details. We conduct experiments using RLLib [liang2018rllib], a distributed learning system which allows large-scale parallel experiments. Generally, we host 8 concurrent trials in an Nvidia GeForce RTX 2080 Ti GPU. Each trial consumes 2 CPUs with 8 parallel rollout workers. Each trial is trained over roughly environmental steps, which corresponds to hours of individual driving experience. All experiments are repeated 5 times with different random seeds. Information about other hyper-parameters is given in Appendix.

4.2 Results

Table 1: The test performance of different approaches.

| Category | Method | Episodic Return | Episodic Cost | Success Rate |
|---|---|---|---|---|
| Expert | PPO-Lag | 392.38 ± 99.47 | 1.26 ± 0.57 | 0.86 ± 0.05 |
| RL | SAC-RS | 346.49 ± 16.51 | 8.68 ± 3.34 | 0.68 ± 0.10 |
| RL | PPO-RS | 294.10 ± 22.28 | 3.93 ± 4.19 | 0.41 ± 0.09 |
| Safe RL | SAC-Lag | 333.90 ± 19.00 | 2.21 ± 1.08 | 0.65 ± 0.14 |
| Safe RL | PPO-Lag | 288.04 ± 53.72 | 1.03 ± 0.34 | 0.43 ± 0.21 |
| Safe RL | CPO | 194.06 ± 108.86 | 1.71 ± 1.02 | 0.21 ± 0.29 |
| Offline RL | CQL | 373.95 ± 8.89 | 0.24 ± 0.30 | 0.72 ± 0.11 |
| IL | BC | 362.18 ± 6.39 | 0.13 ± 0.17 | 0.57 ± 0.12 |
| IL | DAgger | 346.16 ± 22.62 | 0.67 ± 0.23 | 0.66 ± 0.12 |
| IL | GAIL | 309.66 ± 12.47 | 0.68 ± 0.20 | 0.60 ± 0.07 |
| Ours | EGPO | 388.37 ± 10.01 | 0.56 ± 0.35 | 0.85 ± 0.05 |

Figure 4: Comparison between our method and safe RL baselines.
Figure 5: The learning dynamics of EGPO and baseline methods during training.
Figure 6: The curves of EGPO agents when varying the quality of experts.

Compared to RL and Safe RL baselines. We evaluate two RL baselines, PPO [schulman2017proximal] and SAC [haarnoja2018soft], with the reward shaping (RS) method that treats the negative cost as an auxiliary reward. We also evaluate three safe RL methods, namely the Lagrangian versions of PPO and SAC [stooke2020responsive, ha2020learning] and Constrained Policy Optimization (CPO) [achiam2017constrained]. As shown in Fig. 4 and Table 1, EGPO shows superior training and test-time safety compared to the baselines. During training, EGPO limits the occurrence of dangers, measured by the episodic cost, to almost zero. Notably, EGPO achieves a lower cost than the expert policy. EGPO also learns rapidly and reaches a high test success rate.

Compared to Imitation Learning and Offline RL baselines. We use the expert to generate a dataset of transitions from the training environments and use this dataset to train Behavior Cloning (BC), GAIL [ho2016generative], DAgger [ross2011reduction], and the offline RL method CQL [kumar2020conservative]. As shown in Table 1, EGPO yields a better test-time success rate than the imitation learning baselines. BC outperforms our method in test-time safety, but we find that the BC agent learns conservative behaviors, resulting in a poor success rate and a low average velocity of 15.05 km/h, while EGPO drives normally at 27.52 km/h, as shown in Fig. 6.

Table 2: The test performance when ablating components in EGPO.

| Experiment | Episodic Return | Episodic Cost | Success Rate |
|---|---|---|---|
| (a) W/ rule-based switch | 339.10 ± 11.41 | 0.91 ± 0.60 | 0.57 ± 0.09 |
| (b) W/o intervention min. | 38.31 ± 3.61 | 1.00 ± 0.00 | 0.00 ± 0.00 |
| (c) W/o PID in SAC-Lag. | 338.80 ± 16.23 | 0.59 ± 0.40 | 0.67 ± 0.10 |
| (d) W/o CQL loss | 378.00 ± 6.77 | 0.43 ± 0.54 | 0.80 ± 0.08 |
| (e) W/o environmental reward | 379.91 ± 7.87 | 0.43 ± 0.26 | 0.79 ± 0.06 |
| EGPO | 388.37 ± 10.01 | 0.56 ± 0.35 | 0.85 ± 0.05 |

Figure 7: Ablation study on the confidence level $\eta$.

Learning dynamics. We measure the intervention frequency by the average episodic intervention occurrence. As illustrated in Fig. 6, at the beginning of training, the guardian is involved more frequently to provide driving demonstrations and prevent the agent from entering dangerous states. After acquiring primary driving skills, the agent tends to choose actions that are more acceptable to the guardian, and thus the takeover frequency decreases.

4.3 Ablation Studies

The impact of expert quality. To investigate the impact of an expert whose quality is not as good as that of the well-performing expert used in the main experiments, we involve two expert policies with lower test success rates in the training of EGPO. Those two policies are retrieved from intermediate checkpoints saved when training the expert. The result of training EGPO with the inferior experts is shown in Fig. 6. We can see that improving the expert's quality reduces the training cost. This result also empirically supports Theorem 1, where the training safety is bounded by the expert safety. Besides, we find that a better expert leads to a better EGPO agent in terms of the episodic return. We hypothesize this is because using premature policies as the expert makes the switch function produce chaotic intervention signals that confuse the exploration of the agent.

The impact of confidence level. The confidence level $\eta$ is a hyper-parameter. As shown in Fig. 7, we find that the performance decreases as $\eta$ increases over part of its range. This is because a higher $\eta$ means less freedom for exploration. In the extreme case where $\eta$ is so large that no agent action is accepted, all data is collected by the expert. In this case, the intervention minimization multiplier $\lambda$ grows to a large value, which damages the training. When $\eta = 0$, every agent action is accepted and the whole algorithm reduces to vanilla SAC.

Ablations of the guardian mechanism. (a) We adopt a rule-based switch to validate the effectiveness of the statistical switch in Sec. 3.1. The intervention happens when the distance to nearby vehicles or to the boundary of the road is too small. We find that the statistical switch performs better than rules. This is because it is hard to enumerate manual rules that cover all possible dangerous situations. (b) When the intervention minimization technique is removed, the takeover frequency becomes extremely high and the agent learns to drive directly toward the boundary of the road. This causes consistent out-of-the-road failures, resulting in a zero success rate and an episodic cost of 1.0 (Table 2). This result shows the importance of the intervention minimization in Sec. 3.3. (c) We find that removing the PID controller for updating $\lambda$ in intervention minimization causes highly unstable training. This is consistent with the result in [stooke2020responsive]. We therefore need to use the PID controller to optimize $\lambda$ in EGPO and SAC-Lag. (d) Removing the CQL loss in Eq. 5 damages the performance. We find this ablation reduces the training stability. (e) We set the environment reward always to zero in EGPO, so that the only supervision signal to train the policy is the intervention occurrence. This method outperforms the IL baselines by a large margin, but remains lower than EGPO in return and success rate. This suggests EGPO can be turned into a practical online Imitation Learning method.

Human-in-the-loop experiment. To demonstrate the potential of EGPO, we conduct a human-in-the-loop experiment, where a human expert supervises the learning progress of the agent. The evaluation result suggests that EGPO can achieve a 90% success rate with merely 15,000 environmental steps of training, while SAC-Lag takes 185,000 steps to achieve similar results. EGPO also outperforms the Behavior Cloning method by a large margin, even though BC consumes more human data. Please refer to the Appendix for more details.

5 Conclusion

We develop the Expert Guided Policy Optimization method for Expert-in-the-loop Reinforcement Learning. The method incorporates the guardian mechanism in the interaction between agent and environment to ensure safe and efficient exploration. The experiments on safe driving show that the proposed method achieves training and test-time safety and outperforms previous safe RL and imitation learning baselines. In future work we will explore the potential of involving humans to provide feedback in the learning process.

This project was supported by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd under the Innovation and Technology Fund.


Appendix A Rationale on the Evaluation

Evaluation on driving simulator. The major focus of this work is safety. However, in the domain of autonomous driving, evaluating a system's safety on a real robot is costly and sometimes infeasible. We therefore benchmark the safety performance of the baseline methods and the proposed EGPO method in a driving simulator. Prototyping in a simulator allows us to focus on the algorithmic part of the problem, and the exactly reproducible environments and vehicles allow safe and effective evaluation of different safe training algorithms. In this work, we conduct experiments on the driving simulator MetaDrive [li2021metadrive] instead of CARLA because we want to evaluate the generalization of different safe exploration methods. Unlike the fixed maps in CARLA, MetaDrive uses procedural generation to synthesize an unlimited number of driving maps that can be split into training and test sets, which is useful for benchmarking the generalization capability of different reinforcement learning methods in the context of safe driving. MetaDrive also supports scattering diverse obstacles in the driving scenes, such as fixed or movable traffic vehicles, traffic cones and warning triangles. The simulator is also extremely efficient and flexible. These features of MetaDrive enable us to develop new algorithms and benchmark different approaches. We intend to validate and extend the proposed method with real data in the following two ways.

Extension to the human-in-the-loop framework. We are extending the proposed method by replacing the pre-trained policy in the guardian with a real human. A preliminary experiment is provided in Appendix B. We invite a human expert to supervise the real-time exploration of the learning agent with hands on the steering wheel. When a dangerous situation is about to happen, the human guardian takes over the vehicle by pressing the paddle and steering the wheel. Such trajectories are explicitly marked as “intervention occurred”. EGPO can incorporate data generated by either a virtual policy or a human being, so it can be applied to such a human-in-the-loop framework directly. We are working on further improving the sample efficiency of the proposed method to accommodate the limited budget of human intervention.

Extension to the mobile robot platform. We design a workflow to migrate EGPO to a real robot in future work. Our system includes several components: (1) a computer that controls the vehicle remotely and trains the agent with EGPO; (2) a human expert who steers the vehicle while watching the images from the camera on the robot; and (3) a UGV robot that simulates a full-scale vehicle (as shown in Fig. 8). During exploration, the on-board processor receives the low-level actions from the human and queries the policy network for the agent's action. The on-board processor then executes the action on the robot and receives new sensory data. The data is recorded and used to train the agent. The EGPO algorithm can train such a real-world robot based on the above workflow.

To summarize, the essential ideas proposed in this work, such as the expert as guardian, intervention minimization, and learning from partial demonstration, are sufficiently evaluated through the safe driving experiments in the driving simulator. In ongoing efforts, we are validating our method with real data from the human-in-the-loop framework and extending it to real-world mobile robot experiments.

Figure 8: We extend the proposed EGPO to the human-in-the-loop setting and to a real mobile robot platform.

Appendix B Preliminary Human-in-the-loop Experiment

To further demonstrate the capacity of the proposed framework, in this experiment a human expert supervises the learning progress of the agent in a single training map. The expert takes over whenever he or she feels it is necessary by pressing the paddle on the steering wheel. When this happens, an intervention cost is incurred, and the action sequences of the expert are recorded and fed into the replay buffer.
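The bookkeeping described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the `Transition` fields, the `record_step` helper and the unit intervention cost are hypothetical names and values, not taken from the released code.

```python
from collections import deque
from typing import Deque, NamedTuple, Optional, Sequence


class Transition(NamedTuple):
    """One recorded step (hypothetical layout for illustration)."""

    obs: Sequence[float]
    agent_action: Sequence[float]
    executed_action: Sequence[float]
    takeover: bool
    intervention_cost: float


def record_step(
    obs: Sequence[float],
    agent_action: Sequence[float],
    human_action: Optional[Sequence[float]],
    buffer: Deque[Transition],
) -> Sequence[float]:
    """Store one step and return the action that is actually executed.

    `human_action` is None while the expert is not pressing the paddle.
    """
    takeover = human_action is not None
    executed = human_action if takeover else agent_action
    # Assumption: each takeover step incurs a unit intervention cost.
    cost = 1.0 if takeover else 0.0
    buffer.append(Transition(obs, agent_action, executed, takeover, cost))
    return executed
```

Keeping both the agent's proposed action and the executed action in the same record lets a learner treat the takeover steps as partial demonstrations while still counting how often interventions occur.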

Table 3 reports the result of this experiment. We find that EGPO with a human expert achieves a high success rate in merely 15,000 environmental steps, while SAC-Lagrangian (with PID update) takes 185,000 steps to achieve similar results. We also ask the expert to generate 15,000 steps of demonstrations (note that in the EGPO experiment, only a small part of the 15,000 steps is provided by the expert) and train a BC agent on those demonstrations. However, BC fails to learn a satisfactory policy. This experiment shows the applicability of the proposed framework even with human experts.

| Experiment | Total Training Cost | Test Reward | Test Cost | Test Success Rate |
|---|---|---|---|---|
| Human expert (20 episodes) | - | 219.50 ± 39.53 | 0.30 ± 0.550 | 0.95 |
| Behavior Cloning | - | 33.21 ± 5.46 | 0.990 ± 0.030 | 0.000 ± 0.000 |
| PPO-Lagrangian (200K steps) | 285.1 | 197.76 ± 7.90 | 0.427 ± 0.043 | 0.598 ± 0.029 |
| SAC-Lagrangian (185K steps) | 452.5 | 221.381 ± 7.90 | 0.060 ± 0.049 | 0.940 ± 0.049 |
| EGPO (with human expert) (15K steps) | 6.14 | 221.058 ± 32.562 | 0.120 ± 0.325 | 0.900 ± 0.300 |
Table 3: Human-in-the-loop experiment results

Appendix C Proof of Main Theorem

In this section, we derive the upper bound of the discounted probability of failure of EGPO, showing that we can bound the training safety with the guardian.

Notations. We first recap the notation. The switch function used in this work is:


Therefore, at a given state, we can split the action space into two parts: actions that trigger an intervention when sampled, and actions that do not. We call the latter the confident action space; it depends on the expert as well as on the confidence level. We also define a ground-truth indicator denoting whether an action leads to an unsafe state. Whether a state is unsafe is determined by the environment and is not revealed to the learning algorithm:


Therefore, at a given state, the step-wise probability of failure of an arbitrary policy is the expectation of this indicator over actions sampled from that policy.
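As a toy illustration of the split described above, the sketch below assumes a one-dimensional action, a Gaussian expert, and a switch that intervenes whenever the expert's density at the agent's action falls below the confidence level `eta`. The function names and the Gaussian form are assumptions made for this example only.

```python
import math
from typing import Callable, Tuple


def gaussian_pdf(a: float, mean: float, std: float) -> float:
    """Density of a 1-D Gaussian expert policy at action `a`."""
    z = (a - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def switch(
    agent_action: float,
    expert_mean: float,
    expert_std: float,
    eta: float,
    sample_expert: Callable[[], float],
) -> Tuple[float, bool]:
    """Return (executed_action, intervened) for one step."""
    if gaussian_pdf(agent_action, expert_mean, expert_std) >= eta:
        # The agent's action lies in the confident action space.
        return agent_action, False
    # Otherwise the expert takes over and supplies the action.
    return sample_expert(), True


def confident_width(expert_std: float, eta: float) -> float:
    """Length of the interval of actions whose expert density is >= eta."""
    z = eta * expert_std * math.sqrt(2.0 * math.pi)
    if z >= 1.0:
        # The threshold is at or above the density peak.
        return 0.0
    return 2.0 * expert_std * math.sqrt(-2.0 * math.log(z))
```

In this toy setting `confident_width` decreases as `eta` grows, which mirrors the remark in the proof that tightening the guardian shrinks the confident action space.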

We next define the cumulative discounted probability of failure of a policy, which accounts for the chance of entering dangerous states at the current time step as well as along the future trajectories induced by that policy. The expected cumulative discounted probability of failure of the expert is defined in the same way.

For simplicity, we consider the actions post-processed by the guardian mechanism during training to be sampled from a mixed behavior policy, whose action probability can be written as:


Here the second term captures the situation that the learning agent takes arbitrary action that triggers the expert to take over and chooses the action . For simplicity, we use a shorthand .
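One way to write the two-term mixture described here is sketched below. The symbols are assumptions introduced for this sketch: $\pi$ is the learning agent, $\pi^E$ the expert, $\pi_b$ the behavior policy, $\mathcal{A}_\eta(s)$ the confident action space at confidence level $\eta$, and $F(s)$ the probability that the agent's sampled action triggers a takeover.

```latex
\pi_b(a \mid s)
  = \pi(a \mid s)\,\mathbb{1}\!\left[a \in \mathcal{A}_\eta(s)\right]
  + \pi^E(a \mid s)\,F(s),
\qquad
F(s) = \int_{a' \notin \mathcal{A}_\eta(s)} \pi(a' \mid s)\,\mathrm{d}a'.
```

Integrating the right-hand side over all actions gives $(1 - F(s)) + F(s) = 1$, so the mixture is a valid probability density.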

Following the same definition as , we can also write the expected cumulative discounted probability of failure of the behavior policy as: .

Assumption. Now we introduce one important assumption on the expert.

Assumption 2.

For all states, the step-wise probability that the expert produces an unsafe action is bounded by a small constant:


This assumption does not impose any constraint on the structure of the expert policy.

Lemmas. We state several useful lemmas and their proofs, which are used in the main theorem.

Lemma 2 (The performance difference lemma).

Here the expectation over states is taken with respect to the marginal state distribution induced by the behavior policy, and the advantage is that of the expert at the current state-action pair, defined through the next state. This lemma was proposed and proved by [kakade2002approximately] and is useful for showing the safety of the behavior policy. In the original proposition, the value function and the advantage represent the expected discounted return and the advantage with respect to the reward, respectively. Here we replace the reward with the failure indicator, so that the value function and the advantage are expressed in terms of the expected cumulative failure probability.
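For reference, the standard form of the performance difference lemma, adapted to the failure indicator, can be sketched as follows. The notation is assumed for this sketch: $\pi_b$ is the behavior policy, $\pi^E$ the expert, $C(s,a)$ the failure indicator, $\gamma$ the discount factor, and $d_{\pi_b}$ the normalized discounted state distribution induced by $\pi_b$ from the initial state $s_0$.

```latex
V_{\pi_b}(s_0) - V_{\pi^E}(s_0)
  = \frac{1}{1-\gamma}\,
    \mathbb{E}_{s \sim d_{\pi_b}}\,
    \mathbb{E}_{a \sim \pi_b(\cdot \mid s)}
    \left[ A_{\pi^E}(s, a) \right],
\qquad
A_{\pi^E}(s, a)
  = C(s, a) + \gamma\,\mathbb{E}_{s'}\!\left[ V_{\pi^E}(s') \right] - V_{\pi^E}(s).
```

An upper bound on the expert's advantage under the behavior policy therefore translates directly into an upper bound on how much riskier the behavior policy can be than the expert.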

Lemma 3.

Only a small subspace of the confident action space of expert covers the ground-truth unsafe actions:


According to Assumption 2, we have:


Following the definition of , we get . Therefore:


Therefore the claimed bound holds. ∎

Lemma 4.

The cumulative probability of failure of the expert is bounded for all states:


Theorem. We introduce the main theorem of this work, which shows that the training safety is related to the safety of the expert and the confidence level .

Theorem 5 (Upper bound of the training risk).

The expected cumulative probability of failure of the behavior policy in EGPO is bounded by the step-wise failure probability of the expert as well as the confidence level :

wherein is negatively correlated to .


We use the performance difference lemma to show the upper bound. We begin by decomposing the advantage according to the two terms of the behavior policy:


The second term is proportional to the expectation of the expert's advantage under the expert's own policy, which is equal to zero by the definition of the advantage. We therefore only need to compute the first term. We first split the integral over the whole action space into the confident action space and the non-confident action space (which drops out of the first term), and then expand the advantage into its detailed form:


By Lemma 3, term (a) can be bounded as:


By Lemma 4, term (b) can be written as:


wherein the newly introduced quantity denotes the area of the feasible region in the action space. It is a function of the expert and of the confidence level. If we tighten the guardian by increasing the confidence level, the confident action space determined by the expert shrinks and this area decreases. The area is therefore negatively correlated with the confidence level. Term (c) is always non-negative, so after applying the minus sign it is always non-positive.

Aggregating the upper bounds of three terms, we have the bound on the advantage:


Substituting Eq. 22 and Lemma 4 into the performance difference lemma (Lemma 2), we have:


Here we have . Now we have proved the upper bound of the cumulative probability of failure for the behavior policy in EGPO.

Appendix D Detail on Simulator and the Safe Driving Environments

The MetaDrive simulator is implemented on top of Panda3D [goslin2004panda3d] and the Bullet engine, which provide high efficiency as well as accurate physics-based 3D kinetics. Traffic cones and broken vehicles (with warning triangles) are scattered in the road network, as shown in Fig. 9. A collision with any object incurs an environmental cost. The cost signal can be used to train agents or to evaluate the safety of trained agents.

In all environments, the observation of the vehicle contains (1) current states such as the steering, heading, velocity and relative distance to boundaries, (2) the navigation information that guides the vehicle toward the destination, and (3) the surrounding information, encoded as a vector of 240 Lidar-like cloud points that measure the nearby vehicles up to a maximum detecting distance.

Figure 9: The demonstrations of generated safety environments.

Appendix E Learning Curves

Fig. 10 and Fig. 11 present the detailed learning curves of the different approaches. Note that in CQL, the first 200,000 steps are a warm-up phase trained with behavior cloning. In each DAgger iteration, a mixed policy explores the environment and collects new data, which is aggregated into the dataset. The mixed policy chooses between the expert's action and the agent's action according to a mixing parameter that anneals from 1 to 0 during training. The DAgger agent therefore achieves a high training success rate at the beginning. In the DAgger experiment, we only plot the result after each DAgger iteration.
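The annealed mixture used for DAgger data collection can be sketched as below. This is a generic illustration rather than the experiment code: the linear schedule and the helper names `linear_beta` and `mixed_action` are assumptions.

```python
import random
from typing import TypeVar

Action = TypeVar("Action")


def linear_beta(
    iteration: int,
    num_iterations: int,
    beta_start: float = 1.0,
    beta_end: float = 0.0,
) -> float:
    """Linearly anneal the mixing parameter over the DAgger iterations."""
    if num_iterations <= 1:
        return beta_start
    frac = iteration / (num_iterations - 1)
    return beta_start + frac * (beta_end - beta_start)


def mixed_action(
    expert_action: Action,
    agent_action: Action,
    beta: float,
    rng: random.Random,
) -> Action:
    """Execute the expert's action with probability beta, else the agent's."""
    return expert_action if rng.random() < beta else agent_action
```

With the mixing parameter equal to 1 the rollout is driven entirely by the expert, which is why the training success rate of DAgger starts high and then reflects the agent more and more as the parameter decays.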

We find that EGPO achieves an expert-level training success rate at the very beginning of training, owing to the takeover mechanism. In addition, the test success rate improves rapidly and reaches results similar to those of the expert. In contrast, the other baselines show inferior training efficiency.

In terms of safety, thanks to the guardian mechanism, EGPO can constrain the training cost to a minimal value. Interestingly, at test time the EGPO agent shows even better safety than the expert. However, according to the main table in the paper and the curves in Fig. 11, the BC agent can achieve a lower cost than the EGPO agent. We find that this is because the BC agent drives conservatively at low velocity, while the EGPO agent drives more naturally, with a velocity similar to that of the expert.

Figure 10: Detailed learning curves of EGPO and Safe RL baselines.
Figure 11: Detailed learning curves of BC, CQL, GAIL and DAgger.

Appendix F Hyper-parameters

| Hyper-parameter | Value |
|---|---|
| Discount Factor | 0.99 |
| τ for target network update | 0.005 |
| Learning Rate | 0.0001 |
| Environmental horizon | 1500 |
| Steps before Learning start | 10000 |
| Intervention Occurrence Limit | 20 |
| Number of Online Evaluation Episode | 5 |
| CQL Loss Temperature | 3.0 |

Table 4: EGPO

| Hyper-parameter | Value |
|---|---|
| KL Coefficient | 0.2 |
| λ for GAE [schulman2018highdimensional] | 0.95 |
| Discount Factor | 0.99 |
| Number of SGD epochs | |
| Train Batch Size | 2000 |
| SGD mini batch size | 100 |
| Learning Rate | 0.00005 |
| Clip Parameter | 0.2 |
| Cost Limit for PPO-Lag | 1 |

Table 5: PPO/PPO-Lag

| Hyper-parameter | Value |
|---|---|
| Discount Factor | 0.99 |
| τ for target network update | 0.005 |
| Learning Rate | 0.0001 |
| Environmental horizon | 1500 |
| Steps before Learning start | 10000 |
| Cost Limit for SAC-Lag | 1 |
| BC iterations for CQL | 200000 |
| CQL Loss Temperature | 5 |
| Min Q Weight Multiplier | 0.2 |

Table 6: SAC/SAC-Lag/CQL

| Hyper-parameter | Value |
|---|---|
| Dataset Size | 250000 |
| SGD Batch Size | 32 |
| SGD Epoch | 200000 |
| Learning Rate | 0.0001 |

Table 7: BC

| Hyper-parameter | Value |
|---|---|
| SGD Batch Size | 64 |
| SGD Epoch | 2000 |
| Learning Rate | 0.0005 |
| Number of DAgger Iteration | 5 |
| Initial mixing parameter | 0.3 |
| Batch Size to Aggregate | 5000 |

Table 8: DAgger

| Hyper-parameter | Value |
|---|---|
| Dataset Size | 250000 |
| SGD Batch Size | 64 |
| Sample Batch Size | 12800 |
| Generator Learning Rate | 0.0001 |
| Discriminator Learning Rate | 0.005 |
| Generator Optimization Epoch | 5 |
| Discriminator Optimization Epoch | 2000 |
| Clip Parameter | 0.2 |

Table 9: GAIL