The increasing interest in RL is primarily stimulated by its data-driven nature, which requires little prior knowledge of the environmental dynamics, and by its combination with powerful function approximators such as deep neural networks. In spite of these advantages, many purely data-driven RL algorithms suffer from slow convergence in the continuous action spaces of stochastic systems, which hinders their widespread adoption in real-world applications [3, 4].
To alleviate this problem, researchers have investigated model-driven RL algorithms, which search for the optimal policy with known environmental models by employing the principle of Bellman optimality [5, 6, 7, 8]. Model-driven RL has shown faster convergence than its data-driven counterparts, since environmental models provide information about the environmental evolution over the whole state-action space; gradient calculation can thus be easier and more accurate than when merely using data samples. To solve the Bellman equation in a continuous action space, most existing RL methods adopt an iterative technique to gradually find the optimum. One classic framework is policy iteration RL, which consists of two steps: 1) policy evaluation (PEV), which solves the self-consistency condition equation to evaluate the current policy, and 2) policy improvement (PIM), which seeks to optimize the corresponding value function [10, 11].
A number of prior works focus on improving the PEV step through model-driven value expansion, which corrects the cumulative return or the approximated value function using environmental models [12, 13]. However, due to inherent model inaccuracy, this technique is not suitable for long-term PEV. To partly solve this problem, the model-based value expansion algorithm adopts a hybrid scheme that uses the environmental dynamic model only to simulate a short-term horizon, and utilizes the explored data to estimate the long-term value beyond the simulation horizon. Nevertheless, the inaccuracy problem still hinders the application of environmental models in PEV.
So far, the environmental model has had limited application in the PIM step due to two main issues: 1) the inaccuracy and overfitting of environmental dynamic models, and 2) policy oscillation caused by time-varying models, since the system model is iteratively learned or updated during training [15, 16, 17]. Prior works provide the model ensemble technique to address these problems. For example, the model-ensemble trust region policy optimization (TRPO) algorithm limits model over-training by using an ensemble metric during policy search. The stochastic ensemble value expansion, an extension of model-based value expansion, interpolates between many different horizon lengths and different models to favor those that generate more accurate estimates. Although ensemble techniques effectively avoid over-fitting, they bring extra computational overhead.
Facing the aforementioned challenges of RL algorithms, this paper proposes a mixed reinforcement learning (mixed RL) algorithm that utilizes a dual representation of the environmental dynamics to improve both learning accuracy and training speed. The environmental model, either empirical or theoretical, is used as prior information to avoid overfitting, while the model error is iteratively compensated with the measured data of states and actions using Bayesian estimation. Specifically, the contributions of this paper are as follows:
A dual representation of environmental dynamics is utilized in RL by integrating the designer’s knowledge with the measured data. An iterative Bayesian estimator (IBE) with explored data is designed for improving the model accuracy and computation efficiency.
A mixed RL algorithm is developed by embedding the iterative Bayesian estimator into policy iteration. We propose a sufficient condition for recursive stability and convergence, which bounds the difference between the estimates of the iterative Bayesian estimator in two consecutive iterations, and we prove that this sufficient condition holds with probability one after sufficiently many iterations.
The rest of this paper is organized as follows. Section II defines the mixed RL problem. Section III introduces the mixed representation of environmental dynamics. Section IV and Section V present the mixed RL algorithm, as well as the parametrization of the policy and value function. Section VI evaluates the effectiveness of mixed RL on a double lane change task with an automated vehicle, and Section VII concludes this paper.
II Problem Description
We consider a discrete-time environment with additive stochastic uncertainty, whose actual dynamics is mathematically described as

$x_{t+1} = f(x_t, u_t) + \xi_t$, (1)

where $t$ is the current time, $x_t$ is the state, $u_t$ is the action, $f$ is the deterministic part of the environmental dynamics, and $\xi_t$ is the additive stochastic uncertainty with unknown mean $\mu$ and covariance $\Sigma$. In this study, we assume that the additive stochastic uncertainty follows the Gaussian distribution $\xi_t \sim \mathcal{N}(\mu, \Sigma)$. The parameters $\mu$ and $\Sigma$ can be completely independent of $(x_t, u_t)$ or form a functional relationship with it.
As shown in Fig. 1, the actual environmental dynamics contains both the deterministic part $f$ and the uncertain part $\xi_t$, where $p(\xi_t)$ is the probability density of $\xi_t$ and $p(x_{t+1} \mid x_t, u_t)$ is the probability density of $x_{t+1}$ given $(x_t, u_t)$.
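As a concrete illustration of dynamics of the form (1), the following sketch simulates a toy scalar system with additive Gaussian uncertainty and recovers samples of $\xi_t$ as residuals between observed and predicted transitions. The dynamics $f$, noise parameters, and policy below are illustrative assumptions, not the vehicle model of Section VI.

```python
import numpy as np

# Sketch of a system of the form (1): x_{t+1} = f(x_t, u_t) + xi_t,
# with additive Gaussian uncertainty xi_t ~ N(mu, sigma^2).

rng = np.random.default_rng(0)
mu, sigma = 0.3, 0.1          # true (unknown to the learner) mean and std of xi

def f(x, u):
    """Deterministic part of the dynamics (toy nonlinear example)."""
    return 0.9 * x + 0.5 * np.tanh(u)

def step(x, u):
    """One step of the actual environment: deterministic part plus noise."""
    return f(x, u) + rng.normal(mu, sigma)

# Roll out a fixed policy and record (x_i, u_i, x_{i+1}) triples;
# the residual x_{i+1} - f(x_i, u_i) is a sample of xi.
x, data = 0.0, []
for _ in range(1000):
    u = -0.5 * x                      # arbitrary fixed policy
    x_next = step(x, u)
    data.append((x, u, x_next))
    x = x_next

residuals = np.array([xn - f(xi, ui) for xi, ui, xn in data])
print(residuals.mean(), residuals.std())   # close to (0.3, 0.1)
```

The residuals are exactly the noise draws, which is what the Bayesian estimator of Section III consumes.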
The objective of mixed RL is to minimize the expectation of the cumulative cost under the distribution of the additive stochastic uncertainty $\xi$, shown as (2):

$V^{\pi}(x_0) = \mathbb{E}_{\xi}\Big\{\sum_{t=0}^{\infty} \gamma^{t}\, l(x_t, u_t)\Big\}$, (2)

where $\pi$ is the policy, $V^{\pi}$ is the state value, which is a function of the initial state $x_0$, $l(x_t, u_t)$ is the utility function, which is positive definite, $\gamma$ is the discounting factor with $0 < \gamma < 1$, and $\mathbb{E}_{\xi}$ is the expectation w.r.t. the additive stochastic uncertainty $\xi$. Here, the policy is a deterministic mapping $u_t = \pi(x_t)$.
The optimal cost function is defined as

$V^{*}(x_0) = \min_{u_{0:\infty}} \mathbb{E}_{\xi}\Big\{\sum_{t=0}^{\infty} \gamma^{t}\, l(x_t, u_t)\Big\}$,

where $u_{t:\infty}$ is the action sequence starting from time $t$. In mixed RL, the self-consistency condition (5) is used to describe the relationship of state values between the current time and the next time:

$V^{\pi}(x_t) = \mathbb{E}_{\xi}\big\{ l(x_t, \pi(x_t)) + \gamma V^{\pi}(x_{t+1}) \big\}$. (5)
By using Bellman’s principle of optimality, we have the well-known Bellman equation:

$V^{*}(x_t) = \min_{u_t} \mathbb{E}_{\xi}\big\{ l(x_t, u_t) + \gamma V^{*}(x_{t+1}) \big\}$.

The Bellman equation implies that the optimal policy can be calculated in a step-by-step backward manner. Therefore, the optimal action is

$u_t^{*} = \pi^{*}(x_t) = \arg\min_{u_t} \mathbb{E}_{\xi}\big\{ l(x_t, u_t) + \gamma V^{*}(x_{t+1}) \big\}$,
where $\pi^{*}$ represents the optimal policy that maps an arbitrary state $x_t$ to its optimal action $u_t^{*}$. Similar to other indirect RL problems, mixed RL aims to find an optimal policy by minimizing the cost (2) subject to the constraints of the environmental dynamics. The searching procedure can be replaced by solving the Bellman equation in an iterative way. Obviously, the performance of the generated policy depends on the accuracy of the representation of the environmental dynamics. In fact, either an analytical model or state-action samples can be a useful representation, which corresponds to the so-called model-driven RL and data-driven RL, respectively. The analytical model is usually inaccurate due to environmental uncertainties, which impairs the optimality of the generated policy. The state-action samples, on the other hand, have low sampling efficiency and slow down the training process.
III Dual Representation of Environmental Dynamics
In mixed RL, the environmental dynamics are dually represented by both an analytical model $\mathcal{M}$ and state-action data $\mathcal{D}$. The former represents the designer’s knowledge about the environmental dynamics. It is defined over the whole state-action space and can be used to accelerate training. The latter comes from direct measurement of state-action pairs during learning. It is generally more accurate than $\mathcal{M}$, and can therefore improve the estimation of the uncertain part of the analytical model. Mixed RL uses this dual representation, i.e., both the analytical model $\mathcal{M}$ and the state-action data $\mathcal{D}$, to search for the optimal policy. Such a dual representation allows faster training than purely data-driven RL while achieving better policy accuracy than a purely model-driven counterpart.
The analytical model is similar to (1):

$x_{t+1} = f(x_t, u_t) + \bar{\xi}_t, \quad \bar{\xi}_t \sim \mathcal{N}(\bar{\mu}, \bar{\Sigma})$,

where the mean $\bar{\mu}$ and covariance $\bar{\Sigma}$ of $\bar{\xi}_t$ are given in advance by designers. The given distribution $\mathcal{N}(\bar{\mu}, \bar{\Sigma})$ can be quite different from the actual distribution $\mathcal{N}(\mu, \Sigma)$ due to modelling errors. Here, $\bar{\mu}$ and $\bar{\Sigma}$ are taken as the prior knowledge of the environmental dynamics.
The state-action data, i.e., a sequence of triples $(x_i, u_i, x_{i+1})$, is denoted by $\mathcal{D}$:

$\mathcal{D} = \{(x_i, u_i, x_{i+1})\}_{i=1}^{N}$,

where $x_i$ is the $i$-th state in $\mathcal{D}$, $u_i$ is the $i$-th action in $\mathcal{D}$, and $N$ is the length of the data samples. Obviously, the measured data also inherently contain the distribution information of $\xi$, and are taken as the posterior knowledge of the environmental dynamics.
If the environmental dynamics were exactly known, the optimal policy could be computed by using only the dynamic model, which is also the most efficient form of RL. However, an exact model is inaccessible in reality, and thus the generated policy might not converge to the optimum. Although collecting samples is less efficient, samples can represent the environment quite accurately and can thus improve the generated policy. Therefore, the mixed representation is able to utilize the advantages of both model and data to improve training efficiency and policy accuracy.
Improving the model $\mathcal{M}$ by using the data $\mathcal{D}$:

We utilize the data samples to improve the estimation of the additive stochastic uncertainty in the analytical model. The uncertainty that inherently exists in a state-action triple $(x_i, u_i, x_{i+1})$ is equal to

$\xi_i = x_{i+1} - f(x_i, u_i)$.

A Bayesian estimator is adopted to fuse the distribution information of the additive stochastic uncertainty from both the model and the data. The Bayesian estimator aims to maximize the posterior probability. In general, we introduce $p(\mu)$ and $p(\Sigma)$ as the prior distributions of $\mu$ and $\Sigma$; the maximum a posteriori problem then becomes (11). Under the assumption that the data are i.i.d., (11) can be rewritten in iterative form:
Therefore, we can build an iterative Bayesian estimator with the following general form,
Here, we discuss two simplified cases of the Bayesian estimator:
Case 1: Assume that the covariance $\Sigma$ is known and independent of $x_t$ and $u_t$. We introduce the distribution $\mathcal{N}(\bar{\mu}, \bar{\Sigma})$ provided by the model as the prior distribution of $\mu$. Thus, the objective function of the Bayesian estimation becomes,

where $p(\mu)$ is the prior distribution and the normalizing term is a constant. The optimal estimate of $\mu$ is calculated by (15).

The estimate $\hat{\mu}$ can be iteratively computed using the IBE. Defining appropriate running statistics of the samples, the iterative Bayesian estimator is
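The Case 1 update can be sketched as a standard conjugate Gaussian computation. The 1-D prior and noise parameters below are illustrative assumptions, not the paper's exact equations; the point is that the recursive, one-sample-at-a-time update reproduces the batch posterior mean.

```python
import numpy as np

# Iterative Bayesian estimator for Case 1 (a minimal 1-D sketch): the noise
# variance sigma^2 is known, and the model supplies a Gaussian prior
# N(m0, s0^2) for the unknown mean mu. With a Gaussian likelihood this is a
# conjugate problem, so the posterior mean can be updated one sample at a time.

rng = np.random.default_rng(1)
sigma = 0.1                     # known noise std
mu_true = 0.5                   # actual mean of xi (unknown to the estimator)
m0, s0 = 0.0, 1.0               # prior from the analytical model

samples = rng.normal(mu_true, sigma, size=200)

# Recursive (iterative) form: fold in one residual sample per step.
lam, m = 1.0 / s0**2, m0        # posterior precision and mean
for xi in samples:
    lam_new = lam + 1.0 / sigma**2
    m = (lam * m + xi / sigma**2) / lam_new
    lam = lam_new

# Batch posterior mean for comparison: both forms must agree.
lam_batch = 1.0 / s0**2 + len(samples) / sigma**2
m_batch = (m0 / s0**2 + samples.sum() / sigma**2) / lam_batch
```

As the number of samples grows, the data term dominates the prior and the estimate approaches the sample mean, consistent with Lemma 2 later in the paper.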
Case 2: Assume that both the mean $\mu$ and the covariance $\Sigma$ are unknown. The same prior distribution as in Case 1 is applied to $\mu$. The covariance is estimated by maximum likelihood estimation, since the parameters of a prior distribution for $\Sigma$ are inconvenient for a human designer to determine. Subsequently, the optimal estimates of $\mu$ and $\Sigma$ are as follows,

Defining the corresponding running statistics, $\hat{\mu}$ and $\hat{\Sigma}$ can be iteratively computed by the following IBE,
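A Case 2 style estimator can likewise be sketched with running sums: the mean keeps a Gaussian-prior correction while the covariance uses the maximum likelihood estimate. The prior weight and update rule below are an illustration under assumed parameters, not the paper's exact equations.

```python
import numpy as np

# Case 2 sketch: both mean and covariance of xi are unknown. The mean keeps
# the prior from the analytical model (here folded in as a pseudo-count
# weight lam0 on the prior mean m0), while the covariance is the MLE from
# running sums, so each new sample costs O(d^2) work.

rng = np.random.default_rng(2)
mu_true = np.array([0.5, -0.2])
sigma_true = np.diag([0.04, 0.01])
samples = rng.multivariate_normal(mu_true, sigma_true, size=2000)

m0 = np.zeros(2)                # prior mean from the analytical model
lam0 = 1.0                      # prior "pseudo-count" weight on m0

n, s, ss = 0, np.zeros(2), np.zeros((2, 2))
for xi in samples:
    n += 1
    s += xi                                      # running first moment
    ss += np.outer(xi, xi)                       # running second moment
    mu_hat = (lam0 * m0 + s) / (lam0 + n)        # prior-corrected mean
    sigma_hat = ss / n - np.outer(s / n, s / n)  # MLE covariance
```

With enough data the prior term is washed out and both estimates approach the sample statistics.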
For more general cases where $\xi$ is related to $x_t$ and $u_t$, i.e., $\xi_t = g(x_t, u_t; \theta)$, where $g$ is a general function with parameter $\theta$, the likelihood becomes (19) and the optimal estimate of $\theta$ is the minimizer of (19).
IV Mixed RL Algorithm
IV-A Mixed RL Algorithm Framework
Existing RL algorithms that compute the optimal policy via the Bellman equation are known as indirect RL, and they usually involve the PEV and PIM steps. Different from traditional indirect RL algorithms, mixed RL consists of three alternating steps, i.e., IBE, PEV and PIM, as shown in Fig. 2. The IBE proposed in Section III is used to iteratively estimate the mean and covariance of the additive stochastic uncertainty. PEV seeks to numerically solve a group of algebraic equations governed by the self-consistency condition (5) under the current-step policy, and PIM searches for a better policy by minimizing a “weak” Bellman equation.
In the first step, the IBE calculates $\hat{\mu}_k$ and $\hat{\Sigma}_k$ with the latest data, and the mixed model is updated accordingly. In the second step, PEV solves the self-consistency condition under the mixed model,

where $\pi_k$ is the current policy at the $k$-th iteration, and $V^{\pi_k}$ is the state value to be solved under policy $\pi_k$. In the third step, PIM computes an improved policy by minimizing (22):

where $\pi_{k+1}$ is the new policy. The use of the estimated uncertainty naturally embeds both the analytical model and the state-action data into RL, which improves the accuracy of the additive stochastic uncertainty and achieves a high convergence speed. The mixed RL algorithm is summarized in Algorithm 1.
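The three alternating steps can be sketched on a scalar linear-quadratic toy problem, where PEV is approximated by a finite rollout under the mixed model and PIM by a grid search over affine policies. Every quantity below (system, cost, grids, horizons) is an illustrative assumption, not the algorithm's exact implementation.

```python
import numpy as np

# Toy mixed RL loop (IBE -> PEV -> PIM) on x' = a*x + b*u + xi,
# xi ~ N(mu_true, sigma^2) with sigma assumed known.

rng = np.random.default_rng(3)
a, b = 0.9, 1.0
mu_true, sigma = 0.5, 0.1
q, r, gamma = 1.0, 0.1, 0.95

def rollout_cost(K, c, mu_hat, T=60, x0=1.0):
    """PEV surrogate: discounted cost of u = K*x + c under the mixed model."""
    x, cost = x0, 0.0
    for t in range(T):
        u = K * x + c
        cost += gamma**t * (q * x**2 + r * u**2)
        x = a * x + b * u + mu_hat   # mixed model: deterministic part + est. mean
    return cost

def pim(mu_hat):
    """PIM surrogate: grid search for the best affine policy."""
    grid_K = np.linspace(-1.0, 0.0, 21)
    grid_c = np.linspace(-1.0, 1.0, 41)
    best = min((rollout_cost(K, c, mu_hat), K, c)
               for K in grid_K for c in grid_c)
    return best[1], best[2]

# IBE state: Gaussian prior N(0, 1) on the mean of xi, known sigma.
lam, m = 1.0, 0.0
K, c = 0.0, 0.0                  # initial (do-nothing) policy
for it in range(3):
    # IBE: collect residuals from the real system under the current policy.
    x = 1.0
    for _ in range(100):
        u = K * x + c
        x_next = a * x + b * u + rng.normal(mu_true, sigma)
        resid = x_next - (a * x + b * u)          # sample of xi
        lam_new = lam + 1.0 / sigma**2
        m = (lam * m + resid / sigma**2) / lam_new
        lam, x = lam_new, x_next
    # PEV + PIM under the updated mixed model.
    K, c = pim(m)
```

The learned offset `c` compensates for the estimated nonzero mean of the uncertainty, which a purely model-driven policy (with a wrong prior mean) would miss.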
IV-B Recursive Stability and Convergence Under Fixed $\hat{\xi}$
In this section, we prove the recursive stability and convergence under a fixed additive uncertainty estimate $\hat{\xi}$.
IV-B1 Recursive Stability
Recursive stability means that $\pi_{k+1}$ can stabilize the plant so long as $\pi_k$ can. We call the closed-loop stochastic system stable in probability if, for any $\epsilon > 0$, the following equality holds,
Lemma 1 (Lyapunov stability criterion):

If there exists a positive definite Lyapunov sequence on the state space which satisfies

then the stochastic system is stable in probability, where $\varphi(\cdot)$ is a continuous function with $\varphi(0) = 0$.
Next, we prove the recursive stability criterion for the mixed RL algorithm under fixed $\hat{\xi}$ using Lemma 1.
Theorem 1 (Recursive stability theorem):
For any step $k$ in mixed RL, $\pi_{k+1}$ is stable in probability if $\pi_k$ is stable in probability and the discount factor $\gamma$ is selected appropriately under the mixed model.
Since $\pi_{k+1}$ is optimal for the “weak” Bellman equation, while $\pi_k$ is non-optimal for the $(k+1)$-step value, we have:

where $x_{t+1}$ is the next state under $u_t = \pi_{k+1}(x_t)$ with the mixed model, and $x'_{t+1}$ is the next state under $u_t = \pi_k(x_t)$. Therefore,

Since $\pi_k$ is stable in probability, $V^{\pi_k}(x_t)$ is bounded; thus, $V^{\pi_{k+1}}(x_t)$ is bounded. Considering that $0 < \gamma < 1$ and the utility function $l$ is positive definite, $V^{\pi_{k+1}}(x_t) > 0$ holds, except for $x_t = 0$, which is stable in probability naturally.

We choose a proper $\gamma$ to satisfy:

Therefore, $V^{\pi_{k+1}}(x_t)$ is monotonically decreasing w.r.t. time with an appropriate $\gamma$, i.e.,

In short, $\pi_{k+1}$ is stable in probability.
IV-B2 Convergence of Mixed RL
The convergence property describes whether the generated policy $\pi_k$ can converge to the optimum under mixed RL. Here, we prove the convergence of the mixed RL algorithm under fixed $\hat{\xi}$.
Theorem 2 (State value decreasing theorem):
For any $x_t$ under the additive stochastic uncertainty $\hat{\xi}$, $V^{\pi_k}(x_t)$ is monotonically decreasing with respect to $k$, i.e.,
The key is to examine the following inequality (except for the trivial case $x_t = 0$):

At each RL iteration, we initialize the $(k+1)$-step value function with the converged value of the previous step, i.e., $V^{\pi_{k+1}}_0 = V^{\pi_k}$. The first PEV iteration for $\pi_{k+1}$ is
With respect to (25), we know
For the following PEV iterations, we reuse the inequality (32):
Similarly, the subsequent PEV iterates satisfy the same inequality. Therefore, $\{V^{\pi_k}(x_t)\}$ is a monotonically decreasing sequence bounded below by 0, since $V \geq 0$ always holds. Finally, $V^{\pi_k}(x_t)$ converges as $k \to \infty$, and we have $V^{\pi_k} \to V^{*}$.
IV-C Recursive Stability and Convergence Under Varying $\hat{\xi}$
In this section, we discuss the recursive stability and convergence under a varying additive uncertainty estimate $\hat{\xi}_k$, and propose a sufficient condition by designing an upper bound for the difference between $\hat{\xi}_k$ and $\hat{\xi}_{k+1}$.
Under $\hat{\xi}_k$, the self-consistency condition is

Since $\pi_{k+1}$ is the optimal policy with respect to the value of $\pi_k$ in the $k$-th iteration, we have

which is the key inequality in the proof in Section IV-B.
However, when the estimate is updated from $\hat{\xi}_k$ to $\hat{\xi}_{k+1}$, the variation of $\hat{\xi}$ should be bounded in the interest of stability and convergence. Here, we give a sufficient condition for recursive stability and convergence under varying $\hat{\xi}$, namely the maximum variation condition (MVC) of the additive stochastic uncertainty (38).
Define $V^{\pi}_{\hat{\xi}}$ as the expected cumulative cost under the additive stochastic uncertainty $\hat{\xi}$,
Theorem 3 (Sufficient condition for recursive stability and convergence):
For any step $k$ in mixed RL, $\pi_{k+1}$ is recursively stable and $V^{\pi_k}$ is monotonically decreasing with respect to $k$, if the following MVC is satisfied:
where the right-hand side is the decrease of the cumulative cost achieved by PIM.

The MVC requires that the change of $\hat{\xi}$ has less impact on the cumulative cost than the improvement brought by PIM in the last iteration.
When the MVC is satisfied, the decrease brought by PIM dominates the change caused by updating $\hat{\xi}$; thus, we have
Next, we first present Lemma 2, which will be used for the convergence analysis of the IBE; then we prove that the MVC is satisfied with probability one.
Lemma 2 (Convergence criterion of Bayesian estimation):

In Bayesian estimation, if the empirical data and the parameter's prior distribution obey Gaussian distributions and the covariance matrix of the prior distribution is full rank, then the estimates $\hat{\mu}$ and $\hat{\Sigma}$ converge to the sample mean and covariance asymptotically.
Theorem 4 (MVC satisfied with probability one):

The MVC is satisfied with probability one after sufficiently many iterations, under the assumption that the IBE converges faster than PIM and PEV.
Using the Kolmogorov strong law of large numbers, we have
Since $\hat{\mu}_k \to \mu$ and $\hat{\Sigma}_k \to \Sigma$, and both the estimated and the actual uncertainties obey Gaussian distributions, the KL-divergence between $\mathcal{N}(\hat{\mu}_k, \hat{\Sigma}_k)$ and $\mathcal{N}(\mu, \Sigma)$ converges to 0 with probability one, i.e.,
Thus, we have
Therefore, after sufficiently many iterations, the MVC (38) holds with probability one, i.e.,
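The shrinking KL-divergence argument can be illustrated numerically (a 1-D sketch with assumed parameters): as the estimator sees more residual samples, the fitted Gaussian approaches the true one and the divergence approaches 0.

```python
import numpy as np

# Sketch: KL( N(mu_hat, var_hat) || N(mu, var) ) shrinks as the sample
# count grows. Parameters are illustrative assumptions.

rng = np.random.default_rng(4)
mu, var = 0.5, 0.01

def kl_gauss(m0, v0, m1, v1):
    """KL( N(m0, v0) || N(m1, v1) ) for 1-D Gaussians."""
    return 0.5 * (v0 / v1 + (m1 - m0)**2 / v1 - 1.0 + np.log(v1 / v0))

def estimate(n):
    """Fit mean and (MLE) variance from n samples of the true Gaussian."""
    s = rng.normal(mu, np.sqrt(var), size=n)
    return s.mean(), s.var()

kl_small = kl_gauss(*estimate(20), mu, var)      # few samples: typically larger
kl_large = kl_gauss(*estimate(20000), mu, var)   # many samples: near zero
```

This is only a numerical illustration of the limit, not the paper's probability-one proof.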
In general, the MVC indicates that an excessive difference between $\hat{\xi}_k$ and $\hat{\xi}_{k+1}$ should be avoided. In mixed RL, we update the distribution of the additive stochastic uncertainty by Bayesian estimation. As shown in Fig. 3, if a single data batch deviates greatly from the total data, the Bayesian estimator can reduce the deviation between the posterior distribution and the total data distribution by introducing an appropriate prior distribution of the parameters.
V Mixed RL with Parameterized Functions
For large state spaces, both the value function and the policy are parameterized in mixed RL, as shown in (48). The parameterized value function with parameter $w$ is called the “critic”, and the parameterized policy with parameter $\theta$ is called the “actor”.
The parameterized critic is trained to minimize the mean square error (49) in PEV, i.e.,
The semi-gradient of the critic is
where the bootstrapped target is held fixed when taking the gradient (hence the term “semi-gradient”).
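A minimal numerical sketch of the semi-gradient idea, with an assumed linear feature $V_w(x) = w x^2$, a fixed stable closed loop, and illustrative step sizes (none of which are the paper's actual networks):

```python
import numpy as np

# Semi-gradient PEV sketch: fit V_w(x) = w * x^2 to the self-consistency
# condition under a fixed policy whose closed loop is x' = 0.9 * x with
# utility l(x) = x^2. The TD target is treated as a constant (no gradient
# flows through it), which is what "semi-gradient" means.

rng = np.random.default_rng(5)
gamma, alpha = 0.9, 0.01
w = 0.0                                    # critic parameter

for _ in range(5000):
    x = rng.uniform(-2.0, 2.0)             # sample a state
    x_next = 0.9 * x                       # closed loop under the fixed policy
    target = x**2 + gamma * w * x_next**2  # utility + discounted bootstrap
    w -= alpha * (w * x**2 - target) * x**2

# Analytic fixed point for this toy case: w* = 1 / (1 - gamma * 0.9**2)
```

Because the true value is representable by the feature, the iteration converges to the exact fixed point of the self-consistency condition.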
The parameterized actor is trained to minimize the “weak” Bellman condition, i.e., to minimize the following objective function,

where $\hat{\mu}$ and $\hat{\Sigma}$ are the mean and covariance of $\hat{\xi}$. The gradient of the objective is calculated as follows,
In essence, this parameterized method is a generalized policy iteration (GPI). Different from traditional policy iteration, PEV and PIM each take only one step in GPI, which greatly improves computational efficiency when RL is combined with neural networks.
Since the gradient descent of PIM is carried out only once in each GPI cycle, the maximum variation condition (MVC) may not be satisfied. We propose an Adaptive GPI (AGPI) method to solve this problem. In every iteration, we check whether the PIM result satisfies the MVC. If not, the algorithm continues the gradient descent steps in PIM until the MVC is satisfied or the maximum number of internal steps is reached. The mixed RL algorithm with AGPI is summarized in Algorithm 2.
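The AGPI inner loop can be sketched schematically. The quadratic objective, learning rate, and improvement margin below are hypothetical placeholders standing in for the actor loss and the actual MVC check; only the control flow (repeat PIM steps until a required condition or a step budget) mirrors the description above.

```python
def pim_objective(theta):
    """Stand-in for the actor loss; a toy quadratic with minimum at theta = 2."""
    return (theta - 2.0) ** 2

def grad(theta):
    """Gradient of the toy objective."""
    return 2.0 * (theta - 2.0)

def agpi_pim(theta, required_decrease=1e-3, lr=0.1, max_steps=50):
    """Repeat PIM gradient steps until the loss drop meets the margin
    (a stand-in for the MVC check) or the step budget is exhausted."""
    start = pim_objective(theta)
    for step in range(1, max_steps + 1):
        theta -= lr * grad(theta)
        if start - pim_objective(theta) >= required_decrease:
            break
    return theta, step

theta, steps_used = agpi_pim(theta=0.0)
```

When a single step already improves the objective enough, the inner loop exits immediately, which is exactly the single-step GPI behavior; otherwise extra steps are spent until the condition holds.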
VI Numerical Experiments
We consider a typical optimal control problem for stochastic non-affine nonlinear systems, i.e., the combined lateral and longitudinal control of an automated vehicle with stochastic disturbance (e.g., the influence of small road slopes and road bumps). The vehicle is subjected to a random longitudinal interference force in the tracking process, and the vehicle dynamics is shown in (53).
where the state consists of the lateral velocity, the yaw rate, the difference between the longitudinal velocity and the desired velocity, the yaw angle, and the distance between the vehicle’s centroid and the target trajectory. The control input consists of the front wheel angle and the longitudinal acceleration. The lateral tire forces of the front and rear tires are calculated by the Fiala tire model. In the tire model, the tire-road friction coefficient is set as 1.0. The front and rear wheel cornering stiffnesses are set as 88000 N/rad and 94000 N/rad, respectively. The mass is set as 1500 kg, and the distances from the centroid to the front and rear axles are set as 1.14 m and 1.40 m, respectively. The polar moment of inertia at the centroid is set as 2420 kg·m². The vehicle is subject to a random longitudinal interference force, and the desired velocity is set as 12 m/s.
For comparison, a double-lane change task was simulated with three different RL algorithms. The task is to track the desired trajectory in the lateral direction while maintaining the desired longitudinal velocity under the longitudinal interference. Hence, the optimal control problem with the discretized stochastic system equation is given by
where $\gamma$ is the discounting factor, $f$ is the deterministic part of the discretized system equation of (47), and $\xi$ is the additive stochastic uncertainty, with a fixed simulation time interval. In this simulated task, we compared the performance of mixed RL with both model-driven RL and data-driven RL. The data-driven RL computes the control policy only from the state-action data with a typical data-driven algorithm (i.e., DDPG). The model-driven RL computes the policy by GPI directly using the given empirical model
where the prior distribution of the uncertainty is Gaussian with a diagonal covariance matrix specified by the designer.
The convergence performance of the three algorithms is compared in Fig. 4. The mixed RL and model-driven RL converge within 1 iteration, while the data-driven RL needs 4 iterations to converge under the same hyper-parameters.
For control performance, we test the policies calculated by the three methods in the double lane change task. As shown in Fig. 5, all three policies stably track the target trajectory, but with different control errors. In fact, as shown in Fig. 6, the mixed RL has the minimum longitudinal speed error, since it enables the vehicle to decelerate rapidly at sharp turns and to adjust back appropriately after passing them. In contrast, due to the model error, the model-driven RL has a higher speed error and insufficient deceleration when making turns. Due to its slow convergence, the data-driven RL generates a poor solution and has the largest speed error.
The mixed RL also outperforms the other two benchmark methods in terms of the lateral position error. As shown in Fig. 7, the mixed RL has the minimum steady-state lateral position error, while data-driven RL has the largest lateral position error and frequent speed fluctuation.
The mean absolute errors of the three methods are compared in Table I. The longitudinal speed error of mixed RL is 77.41% less, and the lateral position error is 33.77% less, than those of the data-driven RL. Besides, the longitudinal speed error of mixed RL is 58.82% less, and the lateral position error 15.64% less, than those of the model-driven RL.
Table I: Position error and speed error of the three methods.
In summary, mixed RL exhibits the fastest convergence speed during training and the best control performance in the double lane change task. The model-driven RL has a similar convergence speed to the mixed RL, but a higher control error due to the model error. The data-driven RL has the slowest convergence rate and the largest control error, due to the difficulty of finding the optimal policy from state-action data alone.
This paper proposes a mixed reinforcement learning approach with improved convergence speed and policy accuracy for non-linear systems with additive Gaussian uncertainty. Mixed RL utilizes an iterative Bayesian estimator to accurately model the environmental dynamics by integrating the designer’s knowledge with the measured state transition data. The convergence and recursive stability of the learned policy were proved via Bellman’s principle of optimality and Lyapunov analysis. It is observed that mixed RL achieves a faster convergence rate and a more stable training process than its data-driven counterpart. Meanwhile, mixed RL has a lower policy error than its model-driven counterpart, since the environmental model is refined iteratively by Bayesian estimation. The benefits of mixed RL are demonstrated on a double-lane change task with an automated vehicle. The potential of mixed RL for more general environmental dynamics and non-Gaussian uncertainties will be investigated in the future.
-  E. Gibney, “Google ai algorithm masters ancient game of go,” Nature News, vol. 529, no. 7587, p. 445, 2016.
-  V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
-  T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
-  J. Duan, Y. Guan, Y. Ren, S. E. Li, and B. Cheng, “Addressing value estimation errors in reinforcement learning with a state-action return distribution function,” arXiv preprint arXiv:2001.02811, 2020.
-  T. Bian, Y. Jiang, and Z.-P. Jiang, “Adaptive dynamic programming and optimal control of nonlinear nonaffine systems,” Automatica, vol. 50, no. 10, pp. 2624–2632, 2014.
-  D. P. Bertsekas, Dynamic Programming and Optimal Control. Belmont, MA: Athena Scientific, 1995, vol. 1.
-  J. Duan, S. E. Li, Z. Liu, M. Bujarbaruah, and B. Cheng, “Generalized policy iteration for optimal control in continuous time,” arXiv preprint arXiv:1909.05402, 2019.
-  J. Duan, Z. Liu, S. E. Li, Q. Sun, Z. Jia, and B. Cheng, “Deep adaptive dynamic programming for nonaffine nonlinear optimal control problem with state constraints,” arXiv preprint arXiv:1911.11397, 2019.
-  F. L. Lewis and D. Liu, Reinforcement learning and approximate dynamic programming for feedback control. John Wiley & Sons, 2013, vol. 17.
-  D. P. Bertsekas, “Approximate policy iteration: a survey and some new methods,” Journal of Control Theory and Applications, vol. 9, no. 3, pp. 310–335, 2011.
-  Y. Guan, S. E. Li, J. Duan, J. Li, Y. Ren, and B. Cheng, “Direct and indirect reinforcement learning,” arXiv preprint arXiv:1912.10600, 2019.
-  S. Bansal, R. Calandra, K. Chua, S. Levine, and C. Tomlin, “Mbmf: Model-based priors for model-free reinforcement learning,” arXiv preprint arXiv:1709.03153, 2017.
-  A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7559–7566.
-  V. Feinberg, A. Wan, I. Stoica, M. Jordan, J. Gonzalez, and S. Levine, “Model-based value expansion for efficient model-free reinforcement learning,” in Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.
-  S. Levine and P. Abbeel, “Learning neural network policies with guided policy search under unknown dynamics,” in Advances in Neural Information Processing Systems, 2014, pp. 1071–1079.
-  M. C. Yip and D. B. Camarillo, “Model-less feedback control of continuum manipulators in constrained environments,” IEEE Transactions on Robotics, vol. 30, no. 4, pp. 880–889, 2014.
-  R. Lioutikov, A. Paraschos, J. Peters, and G. Neumann, “Sample-based information-theoretic stochastic optimal control,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 3896–3902.
-  T. Kurutach, I. Clavera, Y. Duan, A. Tamar, and P. Abbeel, “Model-ensemble trust-region policy optimization,” arXiv preprint arXiv:1802.10592, 2018.
-  J. Buckman, D. Hafner, G. Tucker, E. Brevdo, and H. Lee, “Sample-efficient reinforcement learning with stochastic ensemble value expansion,” in Advances in Neural Information Processing Systems, 2018, pp. 8224–8234.
-  H. Deng, M. Krstic, and R. J. Williams, “Stabilization of stochastic nonlinear systems driven by noise of unknown covariance,” IEEE Transactions on automatic control, vol. 46, no. 8, pp. 1237–1253, 2001.
-  P. Diaconis and D. Freedman, “On the consistency of Bayes estimates,” The Annals of Statistics, pp. 1–26, 1986.
-  K. Chung, “The strong law of large numbers,” Selected Works of Kai Lai Chung, pp. 145–156, 2008.
-  J. M. H. Lobato, “Expectation propagation for approximate Bayesian inference,” 2007.
-  R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in neural information processing systems, 2000, pp. 1057–1063.
-  S. E. Li, H. Chen, R. Li, Z. Liu, Z. Wang, and Z. Xin, “Predictive lateral control to stabilise highly automated vehicles at tire-road friction limits,” Vehicle System Dynamics, pp. 1–19, 2020.
-  Y.-H. J. Hsu, S. M. Laws, and J. C. Gerdes, “Estimation of tire slip angle and friction limits using steering torque,” IEEE Transactions on Control Systems Technology, vol. 18, no. 4, pp. 896–907, 2009.
-  S. Xu, S. E. Li, B. Cheng, and K. Li, “Instantaneous feedback control for a fuel-prioritized vehicle cruising system on highways with a varying slope,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 5, pp. 1210–1220, 2016.
-  D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, no. 2, pp. 477–484, 2009.