1 Introduction
Autonomous driving has attracted considerable research interest over the past two decades, as it promises to release drivers from exhausting driving. While great advances have been achieved in path planning, perception, and control, high-level decision-making remains a challenge, especially in mixed traffic with complex and dynamic driving environments. Recently, numerous reinforcement learning (RL) approaches have been applied to autonomous driving tasks, with promising results reported sallab2017deep; wang2018automated; shi2019driving; chen2020autonomous; chen2021interpretable; wu2021human. However, conventional RL algorithms learn by interacting with the environment, sometimes through trial-and-error exploratory actions that make the vehicles vulnerable to accidents in real-world traffic.
Offline RL (also known as batch RL) has been proposed as a promising framework to address this safety issue: agents learn from pre-collected datasets without interacting with the real-world environment. As such, it has received increasing interest in safety-critical applications such as decision making in healthcare, robotics, and autonomous driving levine2020offline. In particular, the batch-constrained deep Q-learning (BCQ) algorithm is proposed in fujimoto2019off, where a state-dependent generative model is used to restrict predicted actions to be similar to previously observed ones, which tackles the extrapolation error caused by erroneously estimating the values of unseen state-action pairs. In addition, the authors in wu2019behavior use a value penalty factor and policy regularization in the value and policy objective functions to regularize the learned policy towards the expert policy, and obtain notable performance gains over recently proposed offline RL methods. These behavior-constrained approaches essentially restrict the learned policy distribution to resemble the dataset in order to mitigate extrapolation error, which in turn generally drives the agents to act conservatively without efficiently exploring the state and action space fujimoto2019off. This tends to result in poor diversity of seen state-action pairs, which impairs the learning performance.
Learning to explore is an emerging paradigm to address insufficient exploration fujimoto2019off; lillicrap2015continuous; plappert2017parameter; xu2018learning. For instance, plappert2017parameter shows improved exploratory behavior by adding Gaussian noise to the parameter vectors of three off-policy deep RL algorithms. Deep Deterministic Policy Gradient (DDPG) lillicrap2015continuous is also used to train an exploration policy independently, combined with autocorrelated noise added to the actor policy. Despite promising results, these approaches apply state-independent noise to enhance exploration, which may not adapt satisfactorily to more diverse environments such as autonomous driving.
In this paper, we build upon the state-of-the-art offline RL algorithm, BCQ, and develop a more efficient RL framework with learnable parameter noise in the perturbation model to enhance exploration and increase the diversity of seen actions. Furthermore, Lyapunov-based safety regulation is adopted to enhance safety during exploration. The main contributions and technical advancements of this paper are summarized as follows.

We build upon BCQ and develop a more efficient and safety-enhanced offline RL framework that is applicable to many safety-critical real-world applications.

A novel learnable parameter noise scheme is employed to enhance the diversity of seen actions, and a Lyapunov-based risk factor is constructed to restrict the exploratory state space to the safe region.

We conduct comprehensive experiments on autonomous driving in both highway and parking traffic scenarios, and the results show that our approach consistently outperforms standard RL and several state-of-the-art offline RL algorithms in terms of driving safety and efficiency.
The remainder of this paper is organized as follows. Section 2 briefly introduces the preliminaries of RL, offline RL, and Lyapunov stability theory. The proposed offline RL framework with enhanced safety and exploration efficiency is described in Section 3, whereas experiments, results, and discussions are presented in Section 4. Finally, we conclude the paper and discuss future work in Section 5.
2 Background
2.1 Preliminaries of Reinforcement Learning
In an RL setting, the objective is to learn an optimal policy that maximizes the accumulated return $R = \sum_{t=0}^{\infty} \gamma^{t} r_t$, where $r_t$ is the reward at time step $t$ and $\gamma$ is the discount factor. More specifically, the agent observes the state $s_t$ of the environment at each time $t$, and interacts with the environment by performing an action $a_t$ according to a policy $\pi$. The state-action value function (or Q-function) $Q^{\pi}(s, a)$ of a policy $\pi$ is the expected return when following the policy after taking action $a$ in state $s$. The optimal value function $Q^{*}(s, a)$, representing the expected return of taking action $a$ in state $s$ and then following the optimal policy through greedy action choices, can be obtained from the following Bellman equation:
$$Q^{*}(s, a) = \mathcal{T} Q^{*}(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \left[ r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \right]$$ (1)
where $\mathcal{T}$ denotes the Bellman operator and $P(s' \mid s, a)$ is the transition probability. Off-policy algorithms like Q-learning sutton2018reinforcement; mnih2013playing fit the Q-function with a parametric model $Q_{\theta}$ and update the parameters $\theta$ with data sampled from the experience buffer dataset $\mathcal{B}$ lin1992self. Actor-critic methods like DDPG lillicrap2015continuous adopt two networks: an actor network $\pi_{\phi}$ for policy learning and a critic network $Q_{\theta}$ to reduce variance, where the policy network is updated as:
$$\phi \leftarrow \arg\max_{\phi} \mathbb{E}_{s \sim \mathcal{B}} \left[ Q_{\theta}(s, \pi_{\phi}(s)) \right]$$ (2)
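As a small illustration of the Bellman equation (Eqn. 1), the following sketch repeatedly applies the Bellman optimality operator to a tabular Q-function. The two-state MDP, the function names, and the number of sweeps are hypothetical choices made only for this example; they are not taken from the paper, which works with neural network function approximators rather than tables.

```python
def bellman_backup(q, rewards, transitions, gamma):
    """Apply the Bellman optimality operator once to a tabular Q-function.

    q[s][a]           : current Q estimate
    rewards[s][a]     : immediate reward r(s, a)
    transitions[s][a] : dict mapping next state s' to P(s' | s, a)
    """
    new_q = []
    for s in range(len(q)):
        row = []
        for a in range(len(q[s])):
            # Expected greedy value of the next state under P(. | s, a).
            expected_next = sum(p * max(q[s2]) for s2, p in transitions[s][a].items())
            row.append(rewards[s][a] + gamma * expected_next)
        new_q.append(row)
    return new_q


def value_iteration(rewards, transitions, gamma, sweeps=200):
    """Iterate the backup from Q = 0; converges because the operator is a contraction."""
    q = [[0.0 for _ in row] for row in rewards]
    for _ in range(sweeps):
        q = bellman_backup(q, rewards, transitions, gamma)
    return q


# Hypothetical two-state, two-action MDP used only for illustration.
REWARDS = [[0.0, 1.0], [2.0, 0.0]]
TRANSITIONS = [
    [{0: 1.0}, {1: 1.0}],  # state 0: action 0 stays, action 1 moves to state 1
    [{1: 1.0}, {0: 1.0}],  # state 1: action 0 stays, action 1 moves to state 0
]
Q_STAR = value_iteration(REWARDS, TRANSITIONS, gamma=0.5)
```

In this toy problem the fixed point can be verified by hand: staying in state 1 yields a value of 2 / (1 - 0.5) = 4, so moving from state 0 to state 1 is worth 1 + 0.5 * 4 = 3.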
2.2 Offline Reinforcement Learning
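Offline reinforcement learning is essentially a type of off-policy RL that works on a pre-collected, static dataset without requiring continuous interaction with the environment levine2020offline; ardoinextracting. Typically, a dataset $\mathcal{B}$ of unknown quality is obtained first. Batch-Constrained deep Q-learning (BCQ) fujimoto2019off is a state-of-the-art offline RL method that forces the learned policy to be similar to the behavior policy exhibited in the data. BCQ aims to solve a key challenge in offline RL, namely that the values of unseen state-action pairs are often erroneously estimated (also known as the extrapolation error phenomenon). Towards that end, BCQ samples multi-step actions from a generative model (i.e., a VAE kingma2013auto), which is then used to train the policy by producing actions similar to the ones in the observed data batch:
$$\pi(s) = \arg\max_{a_i + \xi_{\phi}(s, a_i, \Phi)} Q_{\theta_1}\big(s, a_i + \xi_{\phi}(s, a_i, \Phi)\big), \quad \{a_i \sim G_{\omega}(s)\}_{i=1}^{n}$$ (3)
where $a_i \sim G_{\omega}(s)$, with $a_i$ being the action generated from a generative model $G_{\omega}$ and $\xi_{\phi}(s, a_i, \Phi)$ being a perturbation model added to increase the diversity of seen actions fujimoto2019off. The perturbation model is updated as:
$$\phi \leftarrow \arg\max_{\phi} \sum_{(s, a) \in \mathcal{B}} Q_{\theta_1}\big(s, a + \xi_{\phi}(s, a, \Phi)\big)$$ (4)
and the critic network is updated as:
$$\theta_j \leftarrow \arg\min_{\theta_j} \sum_{(s, a, r, s') \in \mathcal{B}} \big(y - Q_{\theta_j}(s, a)\big)^2, \quad j = 1, 2$$ (5)
where $y$ is a combination of the two target Q-values, $Q_{\theta'_1}$ and $Q_{\theta'_2}$, from the target networks and is defined as:
$$y = r + \gamma \max_{a_i} \left[ \lambda \min_{j=1,2} Q_{\theta'_j}(s', a_i) + (1 - \lambda) \max_{j=1,2} Q_{\theta'_j}(s', a_i) \right]$$ (6)
where $\lambda$ is a parameter that controls the uncertainty introduced from future time steps.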
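The target in Eqn. 6 and the action selection in Eqn. 3 can be sketched in a few lines, following the formulation in fujimoto2019off. This is a minimal sketch under stated assumptions: actions are scalars, and `perturb` and `q_value` are hypothetical stand-ins for the perturbation network and the critic, passed in as plain Python callables.

```python
def bcq_target(reward, q1_next, q2_next, gamma, lam):
    """Target value in the style of Eqn. 6.

    q1_next and q2_next hold the two target critics' values at the next
    state, one entry per candidate (perturbed) action.
    """
    mixed = [
        lam * min(q1, q2) + (1.0 - lam) * max(q1, q2)
        for q1, q2 in zip(q1_next, q2_next)
    ]
    # Greedy choice over the candidate actions, then the usual one-step backup.
    return reward + gamma * max(mixed)


def select_action(state, candidates, perturb, q_value):
    """Pick the perturbed candidate with the highest critic value (Eqn. 3 style).

    candidates: actions sampled from the generative model (scalars here).
    perturb(state, action): hypothetical perturbation model output.
    q_value(state, action): hypothetical critic.
    """
    perturbed = [a + perturb(state, a) for a in candidates]
    return max(perturbed, key=lambda a: q_value(state, a))
```

With `lam` close to 1 the target leans on the minimum of the two critics, which penalizes actions whose value estimates disagree; with `lam = 0.5` it reduces to their average.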
2.3 Lyapunov Stability
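Consider the following dynamical system:
$$\dot{x} = f(x, u)$$ (7)
where $x \in \mathcal{D} \subseteq \mathbb{R}^{n}$ is the state vector with $\mathcal{D}$ being the domain, and $u \in \mathbb{R}^{m}$ is the control input vector. The closed-loop system is stable at the origin if for any $\epsilon > 0$, there exists $\delta > 0$, such that if $\|x(0)\| < \delta$ then $\|x(t)\| < \epsilon$ for all $t \geq 0$. Furthermore, the system is asymptotically stable if it is stable and the state goes to zero asymptotically, i.e., $\lim_{t \to \infty} \|x(t)\| = 0$ for all $\|x(0)\| < \delta$ chang2019neural.
Lyapunov theory Khalil2002NLsys is a well-studied method to characterize the stability conditions. Specifically, the closed-loop system is asymptotically stable at the origin if there exists a continuously differentiable function $V: \mathcal{D} \to \mathbb{R}$ such that
$$V(0) = 0, \quad \text{and} \quad V(x) > 0, \; L_f V(x) < 0 \quad \forall x \in \mathcal{D} \setminus \{0\}$$ (8)
Here $L_f V$ is the Lie derivative of $V$ along $f$, defined as
$$L_f V(x) = \sum_{i=1}^{n} \frac{\partial V}{\partial x_i} \dot{x}_i = \nabla V(x)^{\top} f(x, u)$$ (9)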
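To make the Lyapunov conditions concrete, the sketch below evaluates them numerically for a hand-picked example. The linear closed-loop dynamics and the quadratic candidate $V(x) = \|x\|^2$ are assumptions chosen for illustration only; for this pair the Lie derivative works out to $-2\|x\|^2$, so both conditions hold away from the origin. Checking random samples is a sanity test, not a proof of stability.

```python
import random


def dynamics(x):
    """Hypothetical closed-loop dynamics x_dot = A x with A = [[-1, 1], [-1, -1]]."""
    return [-x[0] + x[1], -x[0] - x[1]]


def lyapunov_candidate(x):
    """Candidate V(x) = x1^2 + x2^2, which satisfies V(0) = 0."""
    return x[0] ** 2 + x[1] ** 2


def grad_lyapunov(x):
    """Gradient of the candidate V."""
    return [2.0 * x[0], 2.0 * x[1]]


def lie_derivative(x):
    """L_f V(x) = grad V(x) . f(x), the time derivative of V along trajectories."""
    return sum(g * f for g, f in zip(grad_lyapunov(x), dynamics(x)))


def satisfies_lyapunov_conditions(samples):
    """Check V(x) > 0 and L_f V(x) < 0 on every sampled nonzero state."""
    return all(
        lyapunov_candidate(x) > 0.0 and lie_derivative(x) < 0.0 for x in samples
    )


rng = random.Random(0)
SAMPLES = [[rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)] for _ in range(1000)]
```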
3 Methodology
3.1 Learning to Explore
In BCQ, a perturbation model $\xi_{\phi}$ parameterized by $\phi$ is used to generate a noise signal, which is added to the VAE-generated action to facilitate exploration and increase the diversity of the seen actions. As reported in plappert2017parameter, injecting parameter noise into traditional RL methods generally promotes exploration. As such, we extend the BCQ algorithm by adding learnable parameter noise fortunato2017noisy to the perturbation model, denoted $\xi_{\tilde{\phi}}$. Take a fully-connected layer $y = wx + b$ as an example, where $x$ and $y$ are the input and output features, respectively, and $(w, b)$ are the network parameters. The corresponding layer with parameter noise is modified as:
$$y = (\mu^{w} + \sigma^{w} \odot \varepsilon^{w})\, x + \mu^{b} + \sigma^{b} \odot \varepsilon^{b}$$ (10)
where $\mu^{w}$, $\mu^{b}$, $\sigma^{w}$, and $\sigma^{b}$ are learnable parameters of the perturbation network. Here, $\varepsilon^{w}$ and $\varepsilon^{b}$ are noise random variables, whose scales $\sigma^{w}$ and $\sigma^{b}$ are learned through backpropagation. The modified perturbation model is thus updated as:
$$\tilde{\phi} \leftarrow \arg\max_{\tilde{\phi}} \sum_{(s, a) \in \mathcal{B}} Q_{\theta_1}\big(s, a + \xi_{\tilde{\phi}}(s, a, \Phi)\big)$$ (11)
where $\tilde{\phi}$ is the parameter of the new perturbation model after incorporating the learnable noise parameters.
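A noisy fully-connected layer in the spirit of fortunato2017noisy can be sketched as follows. This is a forward-pass illustration only, written without a deep learning library: the class name, the initialization ranges, and the initial noise scale are assumptions made for this example, and in practice the mean and scale parameters would be trained by backpropagation inside the perturbation network.

```python
import random


class NoisyLinear:
    """Forward pass of y = (mu_w + sigma_w * eps_w) x + mu_b + sigma_b * eps_b."""

    def __init__(self, in_features, out_features, sigma_init=0.017, seed=0):
        self.rng = random.Random(seed)
        bound = (3.0 / in_features) ** 0.5  # assumed uniform init range
        self.mu_w = [
            [self.rng.uniform(-bound, bound) for _ in range(in_features)]
            for _ in range(out_features)
        ]
        self.mu_b = [self.rng.uniform(-bound, bound) for _ in range(out_features)]
        # Noise scales; these are the learnable "sigma" parameters.
        self.sigma_w = [[sigma_init] * in_features for _ in range(out_features)]
        self.sigma_b = [sigma_init] * out_features

    def forward(self, x, noisy=True):
        """Return the layer output; fresh Gaussian noise is drawn when noisy=True."""
        out = []
        for j in range(len(self.mu_b)):
            eps_b = self.rng.gauss(0.0, 1.0) if noisy else 0.0
            total = self.mu_b[j] + self.sigma_b[j] * eps_b
            for i, x_i in enumerate(x):
                eps_w = self.rng.gauss(0.0, 1.0) if noisy else 0.0
                total += (self.mu_w[j][i] + self.sigma_w[j][i] * eps_w) * x_i
            out.append(total)
        return out
```

Calling `forward(x, noisy=False)` recovers the ordinary linear layer, which is a convenient way to evaluate the learned policy without exploration noise.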
3.2 Learning to Provide Safety Guarantee
We consider the case where the operation space is defined and restricted based upon the states observed within the static dataset $\mathcal{B}$. We aim to enhance the BCQ algorithm with guaranteed safety. Towards that end, we use a joint learning framework to obtain the system dynamics in Eqn. 7 together with its Lyapunov function. This joint learning scheme ensures system stability according to the Lyapunov stability criterion introduced in Section 2.3. Specifically, we define the “nominal” closed-loop system dynamics $\hat{f}$ and the corresponding Lyapunov function $V$ as two neural networks. From manek2020learning, it follows that:
$$f(x) = \hat{f}(x) - \nabla V(x) \frac{\mathrm{ReLU}\big(\nabla V(x)^{\top} \hat{f}(x) + \alpha V(x)\big)}{\|\nabla V(x)\|_2^2}$$ (12)
where the structure of $\hat{f}$ can be conveniently chosen as an arbitrary fully connected network, whereas the network for Lyapunov function learning is generally chosen as an Input Convex Neural Network (ICNN) amos2017input. Here $\alpha$ is an assigned parameter, and $\sigma$ is a smoothed ReLU activation with a quadratic region in $[0, d]$:
$$\sigma(x) = \begin{cases} 0 & \text{if } x \leq 0 \\ x^2 / (2d) & \text{if } 0 < x < d \\ x - d/2 & \text{otherwise} \end{cases}$$ (13)
By enforcing that no positive component of $\hat{f}(x)$ is along the direction of $\nabla V(x)$, according to the aforementioned Lyapunov stability theory, the stability of $f$ is guaranteed.
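The projection idea from manek2020learning and the smoothed ReLU can be sketched without any neural network machinery. In the sketch below, `f_hat`, `grad_v`, and `v` are plain numbers standing in for the outputs of the nominal dynamics network, the gradient of the Lyapunov network, and its value at one state; these inputs, the function names, and the default `d = 1.0` are assumptions made for illustration.

```python
def smoothed_relu(x, d=1.0):
    """ReLU with a quadratic blend on (0, d), so the function is continuously differentiable."""
    if x <= 0.0:
        return 0.0
    if x < d:
        return x * x / (2.0 * d)
    return x - d / 2.0


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def project_dynamics(f_hat, grad_v, v, alpha):
    """Remove from f_hat the component along grad V that violates the decrease condition.

    After projection, grad_v . f + alpha * v <= 0 holds at this state.
    """
    violation = max(0.0, dot(grad_v, f_hat) + alpha * v)
    norm_sq = dot(grad_v, grad_v)
    if violation == 0.0 or norm_sq == 0.0:
        return list(f_hat)  # already satisfies the condition, or gradient is zero
    scale = violation / norm_sq
    return [f - scale * g for f, g in zip(f_hat, grad_v)]
```

For example, with $V(x) = \|x\|^2$ at $x = (1, 0)$, a nominal velocity pointing away from the origin is corrected only along the gradient direction, while its tangential component is left unchanged.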
Furthermore, in addition to system stability, we also seek to provide safety guarantees for the optimized solution from the exploration policy. According to Eqn. 8, an extended Lyapunov function design can be formulated as the following minimax cost function chang2019neural:
$$\inf_{V} \sup_{x \in \mathcal{D}} \Big( \max\big(0, -V(x)\big) + \max\big(0, L_f V(x)\big) + V^2(0) \Big)$$ (14)
Note that even though the convexity of the ICNN ensures that $V$ has only a single global optimum amos2017input, it does not require the optimum to be at $x = 0$. To address this issue while avoiding increased computational burden and maintaining convexity, we shift the function internally manek2020learning to achieve $V(0) = 0$. In the meantime, a small positive term is added to ensure strict positive-definiteness:
$$V(x) = \sigma\big(g(x) - g(0)\big) + \epsilon \|x\|_2^2$$ (15)
where $\epsilon$ is a small constant and $g$ is an ICNN. In practice, Eqn. 14 can be solved as the following empirical Lyapunov risk index through Monte Carlo estimation,
$$L_{N,\rho} = \frac{1}{N} \sum_{i=1}^{N} \Big( \max\big(0, -V(x_i)\big) + \max\big(0, L_f V(x_i)\big) \Big) + V^2(0)$$ (16)
where $x_i$ is the state variable sampled according to the distribution $\rho$ from the data batch $\mathcal{B}$. Finally, the Lyapunov risk $L_{N,\rho}$ is added to the critic network update as:
$$\theta_j \leftarrow \arg\min_{\theta_j} \sum_{(s, a, r, s') \in \mathcal{B}} \big(y - Q_{\theta_j}(s, a)\big)^2 + L_{N,\rho}$$ (17)
Pseudocode of the proposed offline RL algorithm with enhanced safety and promoted exploration is summarized in Algorithm 1, and the major changes from the BCQ algorithm are highlighted in blue.
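The empirical Lyapunov risk described above, in the form given by chang2019neural, can be sketched as a Monte Carlo average over sampled states. The function name and the use of plain callables for the Lyapunov function and its Lie derivative are assumptions for this illustration; in the actual framework both would come from the learned networks.

```python
def empirical_lyapunov_risk(states, v, lie_derivative):
    """Average violation of the Lyapunov conditions over sampled states.

    states         : list of state vectors sampled from the data batch
    v(x)           : candidate Lyapunov function
    lie_derivative : function returning L_f V(x)
    """
    zero_state = [0.0] * len(states[0])
    violation = sum(
        max(0.0, -v(x)) + max(0.0, lie_derivative(x)) for x in states
    )
    # The last term penalizes a candidate that is not zero at the origin.
    return violation / len(states) + v(zero_state) ** 2
```

The risk is zero when the candidate is positive, decreasing along trajectories at every sample, and zero at the origin; any positive value measures how strongly the sampled states violate those conditions.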
4 Experiments
4.1 Experimental Setup
We apply our new offline RL framework to autonomous driving tasks, where the open-source gym-based simulator highway-env (https://highway-env.readthedocs.io/en/latest/) is adapted as our simulation platform. In this platform, vehicle trajectories are generated based on the kinematic bicycle model polack2017kinematic, where the vehicles take continuous-valued actions for steering and throttle controls as defined in highway-env. To collect data for offline RL training, a DDPG agent is trained for 5,000 time steps and its experience buffer is stored as the dataset. We use the DDPG implementation from the OpenAI baselines (https://stable-baselines.readthedocs.io/en/master/). The proposed approach is evaluated on the following two traffic scenarios.
4.1.1 Highway scenario
The highway environment is illustrated in Fig. 1, where the autonomous vehicle (AV, blue) intends to navigate as fast as possible without colliding with the human-driven vehicles (HDVs, green). The AV is expected to make lane changes to overtake slow-moving vehicles whenever possible to achieve a higher speed. The reward function is defined as:
$$r = a \, \frac{v - v_{\min}}{v_{\max} - v_{\min}} - b \cdot \text{collision}$$ (18)
where $v$, $v_{\min}$, and $v_{\max}$ are the current, minimum, and maximum speed of the ego vehicle, respectively, and $a$ and $b$ are two weighting coefficients.
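A reward of this kind, trading a normalized speed term against a collision penalty, can be written as a short function. The default weights and the clipping of the speed to the allowed range are assumptions made for this sketch and are not values reported in the paper.

```python
def highway_reward(speed, v_min, v_max, collided, a=0.4, b=1.0):
    """Speed reward normalized to [0, a], minus a penalty b on collision.

    a and b are hypothetical weighting coefficients.
    """
    clipped = min(max(speed, v_min), v_max)  # assumption: keep the speed term in range
    speed_term = a * (clipped - v_min) / (v_max - v_min)
    collision_term = b if collided else 0.0
    return speed_term - collision_term
```

With these example weights, driving at the midpoint of the speed range without a collision yields half of the maximum speed reward, and any collision outweighs the best achievable speed term.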
4.1.2 Parking scenario
Fig. 1 shows the parking scenario, where the objective of the AV is to park within a desired space with an appropriate heading without colliding with the obstacles (dark green boxes). In this scenario, the reward is defined as:
$$r = -\|s - s_g\|_{W,p}^{p} - b \cdot \text{collision}$$ (19)
where $s = [x, y, v_x, v_y, \cos\psi, \sin\psi]$ represents the current state of the AV, whereas $s_g = [x_g, y_g, 0, 0, \cos\psi_g, \sin\psi_g]$ is the goal position and orientation. The violation term $b \cdot \text{collision}$ represents the penalty for hitting obstacles. Here $\psi$ is the heading angle, and $W$ and $b$ are two weighting coefficients.
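A goal-conditioned parking reward of this type can be sketched as a weighted distance between the current state and the goal, minus a collision penalty. The state layout, the weight vector, the exponent `p`, and the penalty `b` below are all illustrative assumptions; the exact norm and coefficients used in the experiments are not recoverable from the text.

```python
import math


def parking_state(x, y, vx, vy, heading):
    """Encode the heading by its cosine and sine so angles wrap around smoothly."""
    return [x, y, vx, vy, math.cos(heading), math.sin(heading)]


def parking_reward(state, goal, weights, collided, p=0.5, b=5.0):
    """Negative weighted distance to the goal, raised to the power p, minus a collision penalty.

    weights, p, and b are hypothetical values chosen for illustration.
    """
    weighted_distance = sum(
        w * abs(s - g) for w, s, g in zip(weights, state, goal)
    )
    collision_term = b if collided else 0.0
    return -(weighted_distance ** p) - collision_term


# Hypothetical weights: position matters most, heading slightly, velocity not at all.
WEIGHTS = [1.0, 0.3, 0.0, 0.0, 0.02, 0.02]
GOAL = parking_state(0.0, 0.0, 0.0, 0.0, 0.0)
```

The reward is zero only when the vehicle sits exactly at the goal pose without a collision, and it decreases with both position error and heading error.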
4.2 Baselines
To demonstrate the effectiveness of our proposed approach, we compare it with a state-of-the-art conventional off-policy RL algorithm and with BCQ, a state-of-the-art offline RL algorithm, along with an ablated variant of our method:

Deep Deterministic Policy Gradient (DDPG) lillicrap2015continuous: DDPG is an off-policy, deterministic, model-free RL algorithm that can handle continuous action spaces. We adapt the implementation from OpenAI stable baselines (https://stable-baselines.readthedocs.io/en/master/).

Batch-Constrained deep Q-learning (BCQ) fujimoto2019benchmarking: BCQ is a state-of-the-art offline RL algorithm for continuous control, with a state-dependent generative model used to restrict predicted actions to be similar to previously observed ones.

Noisy BCQ: In this version, we extend BCQ by adding only the exploration-promotion strategy on the policy as detailed in Section 3.1, without employing any safety-enhancement scheme.

Ours: The framework extends BCQ by incorporating a new perturbation model with learnable parameter noise as well as a Lyapunov-based safety-enhancement scheme.
For this comparison, we train all algorithms over 200 episodes and evaluate the models every 10 episodes with 5 different random seeds, where the same random seeds are shared among the models. We set the discount factor to 0.7.
4.3 Performance Comparison
4.3.1 Comparison with state-of-the-art benchmarks
The comparison between the proposed algorithm and the state-of-the-art off-policy and offline algorithms is shown in Fig. 2 for the highway and parking scenarios. Our proposed approach consistently outperforms the benchmark algorithms in terms of evaluation returns and training efficiency, which we attribute to the proposed parameter noise injection and safety guarantee schemes that facilitate exploration and enhance system safety. Noisy BCQ also outperforms standard BCQ in both traffic scenarios, which demonstrates that adding parameter noise to the perturbation model can promote efficient exploration in BCQ.
To show the correlation between state and action pairs, we plot the state-action density in the parking scenario in Fig. 3, where we transform the multi-dimensional features of state and action into one-dimensional vectors using principal component analysis (PCA) to show the diversity of the observed state-action pairs. It can be seen that BCQ explores rather “cautiously”, covering a very limited state and action space. In contrast, Noisy BCQ exhibits more efficient and “aggressive” exploration, surveying a much larger state-action space. This demonstrates that the proposed parameter noise injection scheme can effectively promote exploration in BCQ. With the additional Lyapunov-based safety enhancement, our proposed approach covers the same range of visited action space as Noisy BCQ but restricts the state space to a reasonable range, striking a good balance between exploration and safety, as can be seen next.
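The one-dimensional PCA projection used for such a density plot can be sketched with the standard library alone. The version below finds the first principal component by power iteration on the sample covariance matrix; the function names and the iteration count are choices made for this illustration, and a numerical library would normally be used instead.

```python
def first_principal_component(rows, iterations=200):
    """Return (feature means, unit vector of the first principal component)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration; the all-ones start fails only if it is orthogonal
    # to the leading eigenvector, which is ignored in this sketch.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(v * v for v in nxt) ** 0.5
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    return means, vec


def project_to_1d(rows):
    """Project each multi-dimensional sample onto the first principal component."""
    means, vec = first_principal_component(rows)
    return [
        sum((r[j] - means[j]) * vec[j] for j in range(len(vec))) for r in rows
    ]
```

Applying `project_to_1d` separately to the logged states and to the logged actions gives one scalar per sample for each, which can then be binned into a two-dimensional density.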
4.3.2 Performance of safety enhancement
To evaluate the performance of the proposed safety scheme, we compare the proposed approach with Noisy BCQ, which only has the parameter noise injection scheme without safety enhancement. Fig. 4 shows the minimum distance to the surrounding vehicles in the highway scenario for the proposed approach and Noisy BCQ. Our approach maintains a much larger minimum distance than Noisy BCQ, which frequently leads to distances smaller than 5 m. This is because Noisy BCQ only promotes exploration without considering safety.
Furthermore, we compare the performance of our approach with Noisy BCQ in the parking scenario in terms of steering angle, acceleration, and success rate. As shown in Fig. 5, our proposed approach produces a smoother steering angle than Noisy BCQ, whose sharp changes in steering angle are risky and lead to poor ride comfort in real-world driving. The acceleration plots in Fig. 5 indicate that our approach also has a lower acceleration than Noisy BCQ. Higher and more oscillatory accelerations can cause poor ride comfort and reduce the lifespan of vehicles. Above all, our approach achieves a higher success rate than BCQ and Noisy BCQ in the parking scenario, as shown in Fig. 6.
5 Conclusion and Future Work
In this paper, we developed an efficient and safety-enhanced offline RL framework with application to autonomous driving in highway and parking traffic scenarios. To facilitate exploration, we improved the BCQ algorithm by exploiting learnable parameter noise in the perturbation model. A novel safety scheme was developed using Lyapunov stability theory to enhance safety during exploration. Comprehensive experiments on autonomous driving were conducted to compare our approach with several state-of-the-art algorithms, and the results demonstrated that the proposed approach consistently outperformed the benchmark approaches in terms of training efficiency and safety. In future work, we plan to collect and employ more diverse data, such as data from conventional control methods and real-world data from autonomous vehicles, to further improve performance.