Mixed Reinforcement Learning with Additive Stochastic Uncertainty

by Yao Mu, et al.

Reinforcement learning (RL) methods often rely on massive exploration data to search for optimal policies and suffer from poor sampling efficiency. This paper presents a mixed reinforcement learning (mixed RL) algorithm that simultaneously uses dual representations of the environmental dynamics to search for the optimal policy, with the purpose of improving both learning accuracy and training speed. The dual representations are the environmental model and the state-action data: the former can accelerate the learning process of RL, while its inherent model uncertainty generally leads to worse policy accuracy than the latter, which comes from direct measurements of states and actions. In the framework design of mixed RL, the compensation of the additive stochastic model uncertainty is embedded inside the policy iteration RL framework by using explored state-action data via an iterative Bayesian estimator (IBE). The optimal policy is then computed in an iterative way by alternating between policy evaluation (PEV) and policy improvement (PIM). The convergence of mixed RL is proved using Bellman's principle of optimality, and the recursive stability of the generated policy is proved via Lyapunov's direct method. The effectiveness of mixed RL is demonstrated by a typical optimal control problem of stochastic non-affine nonlinear systems (i.e., a double lane change task with an automated vehicle).






I Introduction

Reinforcement learning (RL) has been successfully applied in a variety of challenging tasks, such as the game of Go and robotic control [1, 2]. The increasing interest in RL is primarily stimulated by its data-driven nature, which requires little prior knowledge of the environmental dynamics, and by its combination with powerful function approximators, e.g., deep neural networks. In spite of these advantages, many purely data-driven RL algorithms suffer from slow convergence in the continuous action spaces of stochastic systems, which hinders their widespread adoption in real-world applications [3, 4].

To alleviate this problem, researchers have investigated model-driven RL algorithms, which search for the optimal policy with known environmental models by employing the principle of Bellman optimality [5, 6, 7, 8]. Model-driven RL shows faster convergence than its data-driven counterparts, since environmental models provide information about the environmental evolution over the whole state-action space; gradient calculation can thus be easier and more accurate than merely using data samples [9]. To solve the Bellman equation in a continuous action space, most existing RL methods adopt an iterative technique to gradually find the optimum. One classic framework is policy iteration RL, which consists of two steps: 1) policy evaluation (PEV), which aims to solve the self-consistency condition equation and evaluate the current policy, and 2) policy improvement (PIM), which seeks to optimize the corresponding value function [10, 11].

A number of prior works focus on improving the PEV step by using model-driven value expansion, which corrects the cumulative return or the approximated value function by using environmental models [12, 13]. However, due to inherent model inaccuracy, this technique is not suitable for long-term PEV. To partly solve this problem, the model-based value expansion algorithm adopts a hybrid scheme that uses the environmental dynamic model only to simulate a short-term horizon and utilizes the explored data to estimate the long-term value beyond the simulation horizon [14]. Nevertheless, the inaccuracy problem still hinders the application of environmental models in PEV.

So far, the environmental model has had limited application in the PIM step due to two main issues: 1) the inaccuracy and overfitting of environmental dynamic models, and 2) policy oscillation caused by time-varying models, since the system model is iteratively learned or updated during training [15, 16, 17]. Prior works address these problems with model-ensemble techniques. For example, the model-ensemble trust region policy optimization (TRPO) algorithm [18] limits model over-training by using an ensemble metric during policy search. The stochastic ensemble value expansion [19], an extension of model-based value expansion, interpolates between many different horizon lengths and different models to favor models that generate more accurate estimates. Although ensemble techniques effectively avoid over-fitting, they bring extra computational overhead.

Facing the aforementioned challenges, this paper proposes a mixed reinforcement learning (mixed RL) algorithm that utilizes dual representations of the environmental dynamics to improve both learning accuracy and training speed. The environmental model, whether empirical or theoretical, is used as prior information to avoid overfitting, while the model error is iteratively compensated with measured state-action data using Bayesian estimation. Specifically, the contributions of this paper are as follows:

  1. A dual representation of environmental dynamics is utilized in RL by integrating the designer’s knowledge with the measured data. An iterative Bayesian estimator (IBE) with explored data is designed for improving the model accuracy and computation efficiency.

  2. A mixed RL algorithm is developed by embedding the iterative Bayesian estimator into policy iteration. We propose a sufficient condition for recursive stability and convergence, which bounds the change in the Bayesian estimate between two consecutive iterations, and we prove that this condition holds with probability one after sufficiently many iterations.

The rest of this paper is organized as follows. Section II defines the mixed RL problem. Section III introduces the mixed representation of environmental dynamics. Sections IV and V present the mixed RL algorithm and the parametrization of the policy and value function. Section VI evaluates the effectiveness of mixed RL on the double lane change task with an automated vehicle, and Section VII concludes the paper.

II Problem Description

We consider a discrete-time environment with additive stochastic uncertainty, whose actual dynamics is mathematically described as

x_{k+1} = f(x_k, u_k) + ξ_k,   (1)

where k is the current time, x_k is the state, u_k is the action, f is the deterministic part of the environmental dynamics, and ξ_k is the additive stochastic uncertainty with unknown mean μ and covariance Σ. In this study, we assume that the additive stochastic uncertainty follows the Gaussian distribution, ξ_k ∼ N(μ, Σ). The parameters μ and Σ can be completely independent of (x_k, u_k) or form a functional relationship with it.

As shown in Fig. 1, the actual environmental dynamics contains both the deterministic part f(x_k, u_k) and the uncertain part ξ_k, where p(ξ) is the probability density of ξ_k and p(x_{k+1} | x_k, u_k) is the probability density of x_{k+1} given (x_k, u_k).

Fig. 1: Dynamics for the stochastic environment.

The objective of mixed RL is to minimize the expectation of the cumulative cost under the distribution of the additive stochastic uncertainty ξ, shown as (2):

V^π(x_0) = E_ξ { Σ_{k=0}^{∞} γ^k l(x_k, u_k) },   (2)

where π is the policy, V^π(x_0) is the state value, which is a function of the initial state x_0, l(x_k, u_k) is the utility function, which is positive definite, γ is the discounting factor with 0 < γ < 1, and E_ξ is the expectation w.r.t. the additive stochastic uncertainty ξ. Here, the policy is a deterministic mapping:

u_k = π(x_k).   (3)

The optimal cost function is defined as

V*(x_k) = min_{u_k, u_{k+1}, …} E_ξ { Σ_{i=k}^{∞} γ^{i−k} l(x_i, u_i) },   (4)

where u_k, u_{k+1}, … is the action sequence starting from time k. In mixed RL, the self-consistency condition (5) is used to describe the relationship of state values between the current time and the next:

V^π(x_k) = E_ξ { l(x_k, π(x_k)) + γ V^π(x_{k+1}) }.   (5)

By using Bellman's principle of optimality, we have the well-known Bellman equation:

V*(x_k) = min_{u_k} E_ξ { l(x_k, u_k) + γ V*(x_{k+1}) }.   (6)

The Bellman equation implies that the optimal policy can be calculated in a step-by-step backward manner. Therefore, the optimal action is

u_k* = π*(x_k) = arg min_{u_k} E_ξ { l(x_k, u_k) + γ V*(x_{k+1}) },   (7)

where π* represents the optimal policy that maps an arbitrary state x to its optimal action u*. Similar to other indirect RL problems, mixed RL aims to find an optimal policy by minimizing the cost (2) while being subject to the constraints of the environmental dynamics. This search can be replaced by solving the Bellman equation in an iterative way. Obviously, the performance of the generated policy depends on the accuracy of the representation of the environmental dynamics. In fact, either an analytical model or state-action samples can be a useful representation, corresponding to the so-called model-driven RL and data-driven RL, respectively. The analytical model is usually inaccurate due to environmental uncertainties, which impairs the optimality of the generated policy. The state-action samples, on the other hand, have low sampling efficiency and slow down the training process.
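
To make the objective concrete, the discounted cost (2) can be estimated by Monte-Carlo rollouts. The following Python sketch does this for a hypothetical scalar system; the dynamics f, utility l, policy pi, and all gains are illustrative assumptions of this sketch, not the systems used in the paper.

```python
import random

# Monte-Carlo estimate of the expected discounted cost for a hypothetical
# scalar system x_{k+1} = f(x_k, u_k) + xi_k under a fixed linear policy.
# f, l, pi, and all constants below are assumptions of this sketch.

GAMMA = 0.95  # discount factor, 0 < gamma < 1

def f(x, u):
    """Deterministic part of the dynamics."""
    return 0.9 * x + u

def l(x, u):
    """Positive definite utility (stage cost)."""
    return x * x + 0.1 * u * u

def pi(x):
    """A simple stabilizing linear policy u = pi(x)."""
    return -0.5 * x

def discounted_cost(x0, horizon=200, noise_std=0.1, rng=None):
    """One rollout of sum_k gamma^k * l(x_k, u_k)."""
    rng = rng or random.Random(0)
    x, cost = x0, 0.0
    for k in range(horizon):
        u = pi(x)
        cost += (GAMMA ** k) * l(x, u)
        x = f(x, u) + rng.gauss(0.0, noise_std)  # additive uncertainty xi_k
    return cost

def value_estimate(x0, n_rollouts=500):
    """Monte-Carlo estimate of V^pi(x0), the expectation over xi."""
    rng = random.Random(42)
    return sum(discounted_cost(x0, rng=rng) for _ in range(n_rollouts)) / n_rollouts
```

Averaging many rollouts approximates the expectation E_ξ in (2); a larger initial state yields a larger estimate, reflecting the positive definite utility.
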

III Dual Representation of Environmental Dynamics

In mixed RL, the environmental dynamics are dually represented by both an analytical model M and state-action data D. The former represents the designer's knowledge about the environmental dynamics; it is defined on the whole state-action space and can be used to accelerate training. The latter comes from direct measurement of state-action pairs during learning; it is generally more accurate than M and can therefore improve the estimation of the uncertain part of the analytical model. Mixed RL uses this dual representation, i.e., both the analytical model M and the state-action data D, to search for the optimal policy. Such a dual representation allows faster training than purely data-driven RL while achieving better policy accuracy than the purely model-driven counterpart.

The analytical model M is similar to (1):

x_{k+1} = f(x_k, u_k) + ξ̄_k,   ξ̄_k ∼ N(μ̄, Σ̄),   (8)

where the mean μ̄ and covariance Σ̄ of ξ̄_k are given in advance by the designers. The given distribution N(μ̄, Σ̄) can be quite different from the actual distribution N(μ, Σ) due to modelling errors. Here, μ̄ and Σ̄ are taken as the prior knowledge of the environmental dynamics.

The state-action data, i.e., a sequence of triples (x_i, u_i, x_{i+1}), is denoted by D:

D = { (x_i, u_i, x_{i+1}) | i = 1, 2, …, N },   (9)

where x_i is the i-th state in D, u_i is the i-th action in D, and N is the length of the data samples. Obviously, the measured data also inherently contain the distribution information of ξ, and are taken as the posterior knowledge of the environmental dynamics.

If the environmental dynamics were exactly known, the optimal policy could be computed by only using the dynamic model, which is also the most efficient form of RL. However, an exact model is inaccessible in reality, and thus the generated policy might not converge to π*. Although collecting samples is less efficient, the data represent the environment quite accurately and can therefore improve the generated policy. The mixed representation is thus able to combine the advantages of model and data to improve both training efficiency and policy accuracy.

Improve model M by using data D:

We utilize the data samples to improve the estimation of the additive stochastic uncertainty ξ in the analytical model M. The uncertainty that inherently exists in a state-action triple (x_i, u_i, x_{i+1}) is equal to

ξ_i = x_{i+1} − f(x_i, u_i).   (10)

A Bayesian estimator is adopted to fuse the distribution information of the additive stochastic uncertainty from both the model M and the data D. The Bayesian estimator aims to maximize the posterior probability p(μ, Σ | D). In general, we introduce p(μ) and p(Σ) as the prior distributions of μ and Σ; then the maximum a posteriori problem becomes

max_{μ, Σ}  p(D | μ, Σ) p(μ) p(Σ).   (11)

Under the assumption that the data are i.i.d., (11) can be rewritten in an iterative form over data batches D_t:

max_{μ, Σ}  p(D_t | μ, Σ) p(μ | D_{1:t−1}) p(Σ | D_{1:t−1}).   (12)

Therefore, we can build an iterative Bayesian estimator with the following general form, in which the posterior of batch t−1 serves as the prior for batch t:

(μ̂_t, Σ̂_t) = IBE(μ̂_{t−1}, Σ̂_{t−1}, D_t).   (13)

Here, we discuss two simplified cases of the Bayesian estimator:

Case 1: Assume that the covariance Σ is known and is independent of x and u. We introduce N(μ̄, Σ̄_μ), provided by the model M, as the prior distribution of μ. Thus, the objective function of the Bayesian estimation becomes

J(μ) = log p(D | μ) + log p(μ) + C,   (14)

where p(μ) is the prior distribution and C is a constant. The optimal estimate of μ is calculated by (15):

μ̂ = (Σ̄_μ^{−1} + N Σ^{−1})^{−1} (Σ̄_μ^{−1} μ̄ + Σ^{−1} Σ_{i=1}^{N} ξ_i).   (15)

The estimate μ̂ can be iteratively computed by the IBE. Defining the posterior precision Λ_t, the iterative Bayesian estimator is

Λ_t = Λ_{t−1} + |D_t| Σ^{−1},   μ̂_t = Λ_t^{−1} ( Λ_{t−1} μ̂_{t−1} + Σ^{−1} Σ_{i∈D_t} ξ_i ),   (16)

with Λ_0 = Σ̄_μ^{−1} and μ̂_0 = μ̄.

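
Case 1 is the standard conjugate Gaussian update and can be sketched in a few lines. Everything here is scalar, and all names and constants are assumptions of the sketch: precisions simply add batch by batch, and the posterior of one batch becomes the prior of the next, exactly as in the iterative form (13).

```python
import random

# Scalar sketch of the Case 1 iterative Bayesian estimator (IBE): the noise
# variance sigma2 is known, the prior N(m, s2) on the mean comes from the
# designer's model, and each batch's posterior becomes the next prior.

def ibe_update(m, s2, batch, sigma2):
    """Conjugate Gaussian update of the prior N(m, s2) with one data batch."""
    n = len(batch)
    xbar = sum(batch) / n
    post_prec = 1.0 / s2 + n / sigma2   # precisions add
    m_new = (m / s2 + n * xbar / sigma2) / post_prec
    return m_new, 1.0 / post_prec

def run_ibe(true_mu=0.3, sigma2=0.25, n_batches=200, batch_size=20, seed=0):
    """Feed batches drawn from the true distribution through the IBE."""
    rng = random.Random(seed)
    m, s2 = 0.0, 1.0  # designer's prior: mean 0, variance 1
    for _ in range(n_batches):
        batch = [rng.gauss(true_mu, sigma2 ** 0.5) for _ in range(batch_size)]
        m, s2 = ibe_update(m, s2, batch, sigma2)
    return m, s2
```

After enough batches, the posterior mean approaches the true mean and the posterior variance shrinks toward zero, which is the asymptotic behaviour that Lemma 2 later relies on.
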
Case 2: Assume that both the mean μ and the covariance Σ are unknown. The same prior distribution as in Case 1 is applied to μ. The covariance Σ is estimated by maximum likelihood estimation, since the parameters of a prior distribution for Σ are inconvenient for a human designer to determine. Subsequently, the optimal estimates of μ and Σ are as follows:

μ̂ = (Σ̄_μ^{−1} + N Σ̂^{−1})^{−1} (Σ̄_μ^{−1} μ̄ + Σ̂^{−1} Σ_{i=1}^{N} ξ_i),   Σ̂ = (1/N) Σ_{i=1}^{N} (ξ_i − μ̂)(ξ_i − μ̂)^T.   (17)

Define S_t = Σ_{i∈D_{1:t}} ξ_i and Q_t = Σ_{i∈D_{1:t}} ξ_i ξ_i^T. Then μ̂_t and Σ̂_t can be iteratively computed by the following IBE, which only accumulates S_t and Q_t instead of storing all data:

S_t = S_{t−1} + Σ_{i∈D_t} ξ_i,   Q_t = Q_{t−1} + Σ_{i∈D_t} ξ_i ξ_i^T,   (18)

from which Σ̂_t and μ̂_t are recovered as in (17) with N_t, the total number of samples up to batch t.

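
A minimal scalar sketch of Case 2 follows, assuming the mean keeps the conjugate update of Case 1 while the variance is replaced by a running maximum-likelihood estimate. The warm-up length and all initial values are assumptions of this sketch.

```python
import random

# Scalar sketch of Case 2: both the mean and the variance of xi are unknown.
# The mean keeps the conjugate update of Case 1; the variance is a running
# MLE over the data seen so far. The MLE is centred on the sample mean,
# a simplification of (17), which centres it on the Bayesian mean estimate.

def ibe_case2(samples, m0=0.0, s20=1.0, sigma2_init=1.0, warmup=10):
    m, s2 = m0, s20          # prior on the mean from the designer's model
    sigma2 = sigma2_init     # current variance estimate
    sum_x = sum_x2 = 0.0
    for n, x in enumerate(samples, start=1):
        sum_x += x
        sum_x2 += x * x
        if n >= warmup:      # MLE of the variance around the sample mean
            sigma2 = max(sum_x2 / n - (sum_x / n) ** 2, 1e-8)
        # one-sample conjugate update of the mean with the current sigma2
        post_prec = 1.0 / s2 + 1.0 / sigma2
        m = (m / s2 + x / sigma2) / post_prec
        s2 = 1.0 / post_prec
    return m, sigma2
```

Feeding i.i.d. Gaussian samples through this loop recovers both parameters; the warm-up keeps an early, unreliable variance estimate from distorting the mean update.
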
For more general cases where Σ is related to x and u, i.e., Σ = Σ_θ(x, u), where Σ_θ is a general function with parameter θ, the likelihood becomes (19) and the optimal estimate of θ is the minimizer of the negative log-likelihood L(θ):

L(θ) = Σ_{i=1}^{N} [ log det Σ_θ(x_i, u_i) + (ξ_i − μ)^T Σ_θ(x_i, u_i)^{−1} (ξ_i − μ) ].   (19)

IV Mixed RL Algorithm

IV-A Mixed RL Algorithm Framework

Existing RL algorithms that compute the optimal policy via the Bellman equation are known as indirect RL, and they usually involve PEV and PIM steps. Different from traditional indirect RL algorithms, mixed RL consists of three alternating steps, i.e., IBE, PEV and PIM, as shown in Fig. 2. The IBE proposed in Section III iteratively estimates the mean and covariance of the additive stochastic uncertainty. PEV seeks to numerically solve a group of algebraic equations governed by the self-consistency condition (5) under the current policy, and PIM searches for a better policy by minimizing a "weak" Bellman equation.

Fig. 2: The framework of the mixed RL algorithm.

In the first step, the IBE calculates μ̂_t and Σ̂_t with the latest data, and the mixed model is updated accordingly, i.e.,

x_{k+1} = f(x_k, u_k) + ξ̂_k,   ξ̂_k ∼ N(μ̂_t, Σ̂_t),   (20)

where the IBE is defined in (13). The optimal policy is searched by policy iteration with the mixed model (20). In the second step, PEV solves (21) under the estimated distribution of ξ:

V^{π_k}(x) = E_{ξ̂} { l(x, π_k(x)) + γ V^{π_k}(x′) },   (21)

where π_k is the current policy at the k-th iteration, and V^{π_k} is the state value to be solved under policy π_k. In the third step, PIM computes an improved policy π_{k+1} by minimizing (22):

π_{k+1}(x) = arg min_u E_{ξ̂} { l(x, u) + γ V^{π_k}(x′) },   (22)

where π_{k+1} is the new policy. The use of the estimated distribution N(μ̂_t, Σ̂_t) naturally embeds both the analytical model and the state-action data into RL, which improves the accuracy of the additive stochastic uncertainty estimate and achieves a high convergence speed. The mixed RL algorithm is summarized in Algorithm 1.

  Initialize IBE parameters μ̂_0 = μ̄ and Σ̂_0 = Σ̄
  Initialize state value V^0 and policy π_0
  repeat
     update the distribution of ξ and the mixed model with the latest data D_t by the IBE (20)
     PEV with the mixed model: solve (21) for V^{π_k}
     PIM with the mixed model: compute π_{k+1} by (22)
  until V^{π_k} and π_k converge
Algorithm 1 Mixed RL algorithm
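
A structural sketch of Algorithm 1 on a hypothetical scalar system follows. The IBE step is the Case 1 conjugate update, PEV is done by rollout under the mixed model, and PIM is a coarse grid search standing in for gradient-based improvement; the system, the policy class u = −k·x − μ̂, and all constants are assumptions of this sketch, not the paper's implementation.

```python
import random

# Structural sketch of Algorithm 1 (IBE + PEV + PIM) on a hypothetical scalar
# system x' = A*x + u + xi, xi ~ N(mu, SIGMA^2), where mu = 0.2 is unknown
# to the learner. The IBE estimates mu from observed residuals; PEV/PIM then
# use the mixed model with the current estimate.

A, GAMMA, SIGMA = 0.9, 0.95, 0.1
TRUE_MU = 0.2  # ground-truth mean of xi, hidden from the algorithm

def rollout_cost(k_gain, mu_hat, rng, horizon=100):
    """Discounted cost of one rollout under the mixed model N(mu_hat, SIGMA^2)."""
    x, cost = 1.0, 0.0
    for t in range(horizon):
        u = -k_gain * x - mu_hat          # feedback plus bias compensation
        cost += (GAMMA ** t) * (x * x + 0.1 * u * u)
        x = A * x + u + mu_hat + rng.gauss(0.0, SIGMA)
    return cost

def mixed_rl(iterations=5, batch=200):
    rng = random.Random(0)
    m, s2 = 0.0, 1.0                      # designer's prior on mu (model M)
    k_gain = 0.5
    for _ in range(iterations):
        # IBE step: residuals xi_i = x_{i+1} - f(x_i, u_i) observed on data
        data = [TRUE_MU + rng.gauss(0.0, SIGMA) for _ in range(batch)]
        xbar = sum(data) / len(data)
        prec = 1.0 / s2 + len(data) / SIGMA ** 2
        m = (m / s2 + len(data) * xbar / SIGMA ** 2) / prec
        s2 = 1.0 / prec
        # PEV + PIM: evaluate each candidate gain, keep the best
        k_gain = min((k / 20.0 for k in range(1, 20)),
                     key=lambda g: rollout_cost(g, m, rng))
    return k_gain, m
```

With each outer iteration, the mean estimate tightens and the search re-optimizes the gain against the improved mixed model, mirroring the IBE → PEV → PIM cycle of Algorithm 1.
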

IV-B Recursive Stability and Convergence Under Fixed ξ̂

In this section, we prove the recursive stability and convergence of mixed RL under a fixed estimate ξ̂ of the additive uncertainty.

IV-B1 Recursive stability

Recursive stability means that π_{k+1} can stabilize the plant so long as π_k can. We call the closed-loop stochastic system stable in probability if, for any ε > 0, the following equality holds: lim_{x_0 → 0} P { sup_{k ≥ 0} ‖x_k‖ > ε } = 0.

Lemma 1 (Lyapunov stability criterion [20]):

If there exists a positive definite Lyapunov sequence V(x_k), k = 0, 1, 2, …, which satisfies

E { V(x_{k+1}) | x_k } − V(x_k) ≤ −κ(‖x_k‖),

then the stochastic system is stable in probability, where κ is a continuous non-negative function with κ(0) = 0.

Next, we prove the recursive stability criterion for the mixed RL algorithm under fixed ξ̂ using Lemma 1.

Theorem 1 (Recursive stability theorem):

For any step k in mixed RL, π_{k+1} is stable in probability if π_k is stable in probability and the discount factor γ is selected appropriately under the mixed model.


Proof: Since π_{k+1} is optimal for the "weak" Bellman equation (22), and π_k is non-optimal for the k-step value, we have:

E_{ξ̂} { l(x, π_{k+1}(x)) + γ V^{π_k}(x′_{k+1}) } ≤ E_{ξ̂} { l(x, π_k(x)) + γ V^{π_k}(x′_k) } = V^{π_k}(x),

where x′_{k+1} is the next state under u = π_{k+1}(x) with the mixed model, and x′_k is the next state under u = π_k(x). Therefore,

γ E_{ξ̂} { V^{π_k}(x′_{k+1}) } − V^{π_k}(x) ≤ −E_{ξ̂} { l(x, π_{k+1}(x)) }.

Since π_k is stable in probability, V^{π_k} is bounded; thus, E_{ξ̂} { V^{π_k}(x′_{k+1}) } is bounded. Considering the fact that l is a positive definite function, l(x, π_{k+1}(x)) > 0 holds, except for x = 0, which is stable in probability naturally.

We choose a proper γ to satisfy:

E_{ξ̂} { γ V^{π_k}(x′_{k+1}) } − V^{π_k}(x) ≤ −κ(‖x‖)

for some continuous non-negative κ with κ(0) = 0. Therefore, V^{π_k} is monotonically decreasing w.r.t. time along the closed-loop trajectory generated by π_{k+1}, i.e.,

E { V^{π_k}(x_{t+1}) | x_t } ≤ V^{π_k}(x_t).

In short, by Lemma 1 with V^{π_k} as the Lyapunov sequence, π_{k+1} is stable in probability. ∎

IV-B2 Convergence of mixed RL

The convergence property describes whether the generated policy π_k can converge to the optimum under mixed RL. Here, we prove the convergence of the mixed RL algorithm under fixed ξ̂.

Theorem 2 (State value decreasing theorem):

For any x under the additive stochastic uncertainty ξ̂, V^{π_k}(x) is monotonically decreasing with respect to the iteration index k, i.e.,

V^{π_{k+1}}(x) ≤ V^{π_k}(x).

Proof: The key is to examine (except for x = 0)

E_{ξ̂} { l(x, π_{k+1}(x)) + γ V^{π_k}(x′) } ≤ V^{π_k}(x),

which holds since π_{k+1} minimizes the left-hand side and π_k attains the right-hand side. At each RL iteration, we initialize the (k+1)-step value function by V_0^{π_{k+1}} = V^{π_k}. The first PEV iteration for π_{k+1} is

V_1^{π_{k+1}}(x) = E_{ξ̂} { l(x, π_{k+1}(x)) + γ V_0^{π_{k+1}}(x′) }.

With respect to the inequality above, we know

V_1^{π_{k+1}}(x) ≤ V_0^{π_{k+1}}(x) = V^{π_k}(x).

For the following PEV iterations, we reuse the same inequality:

V_{j+1}^{π_{k+1}}(x) = E_{ξ̂} { l(x, π_{k+1}(x)) + γ V_j^{π_{k+1}}(x′) } ≤ V_j^{π_{k+1}}(x).

Similarly, V^{π_{k+1}}(x) ≤ V^{π_k}(x). Therefore, { V^{π_k}(x) } is a monotonically decreasing sequence bounded below by 0, since l > 0 always holds. Finally, V^{π_k} will converge:

lim_{k→∞} V^{π_k}(x) = V*(x).

So we have π_k → π*. ∎

IV-C Recursive Stability and Convergence Under Varying ξ̂

In this section, we discuss the recursive stability and convergence under a varying estimate of the additive uncertainty, and propose a sufficient condition by designing an upper bound for the difference between the consecutive estimates N(μ̂_t, Σ̂_t) and N(μ̂_{t+1}, Σ̂_{t+1}).

Under the estimate ξ̂_t, the self-consistency condition is

V_t^{π_k}(x) = E_{ξ̂_t} { l(x, π_k(x)) + γ V_t^{π_k}(x′) }.

Since π_{k+1}(x) is the optimal action with respect to the value V_t^{π_k} of the k-th iteration, we have

E_{ξ̂_t} { l(x, π_{k+1}(x)) + γ V_t^{π_k}(x′) } ≤ V_t^{π_k}(x),

which is the key inequality in the proofs of Section IV-B.

However, when the estimate is updated from ξ̂_t to ξ̂_{t+1}, the variation of the estimated distribution should be bounded in the interest of stability and convergence. Here, we give a sufficient condition for recursive stability and convergence under varying ξ̂, namely the maximum variation condition (MVC) of the additive stochastic uncertainty (38).

Define J_t(π) as the expected cumulative cost of policy π under the additive stochastic uncertainty estimate ξ̂_t.

Theorem 3 (Sufficient condition for recursive stability and convergence):

For any step k in mixed RL, π_{k+1} is recursively stable and V^{π_k} is monotonically decreasing with respect to k, if the following MVC is satisfied:

| J_{t+1}(π_{k+1}) − J_t(π_{k+1}) | ≤ ΔJ_PIM,   (38)

where ΔJ_PIM = J_t(π_k) − J_t(π_{k+1}) ≥ 0 is the decrease of the cumulative cost after PIM.

The MVC requires that the change of the estimated uncertainty distribution have less impact on the cumulative cost calculation than the improvement achieved by PIM in the last iteration.


Proof: Since ΔJ_PIM ≥ 0, when the MVC is satisfied, the cost variation caused by updating the estimate cannot cancel the improvement of PIM; thus, we have

J_{t+1}(π_{k+1}) ≤ J_t(π_k).

Subsequently, the recursive stability theorem (41) and the state value decreasing theorem (42) under varying ξ̂ can be proved similarly to Section IV-B. ∎


Next, we first present Lemma 2, which will be used for the convergence analysis of the IBE; then we prove that the MVC is satisfied with probability one.

Lemma 2 (Convergence criterion of Bayesian estimation [21]):

In Bayesian estimation, if the empirical data and the parameter's prior distribution obey Gaussian distributions and the covariance matrix of the prior distribution is full rank, then the estimates μ̂ and Σ̂ will converge to the samples' mean and covariance asymptotically.

Theorem 4 (MVC satisfied with probability one):

The MVC is satisfied with probability one after sufficiently many iterations, under the assumption that the IBE converges faster than PIM and PEV.


Proof: Using the Kolmogorov strong law of large numbers [22], we have

P { lim_{N→∞} ‖ (1/N) Σ_{i=1}^{N} ξ_i − μ ‖ < ε_μ } = 1,   P { lim_{N→∞} ‖ (1/N) Σ_{i=1}^{N} (ξ_i − μ)(ξ_i − μ)^T − Σ ‖ < ε_Σ } = 1,   (43)

where ε_μ and ε_Σ are arbitrarily small positive constants, and μ and Σ are the true mean and covariance of ξ. Thus, using Lemma 2 and (43), we know that, when N → ∞, μ̂_t and Σ̂_t converge to μ and Σ with probability one [21], i.e.,

μ̂_t → μ,   Σ̂_t → Σ   (w.p. 1).

Since both N(μ̂_t, Σ̂_t) and N(μ̂_{t+1}, Σ̂_{t+1}) obey Gaussian distributions, the KL-divergence between them converges to 0 with probability one [23], i.e.,

D_KL( N(μ̂_{t+1}, Σ̂_{t+1}) ‖ N(μ̂_t, Σ̂_t) ) → 0   (w.p. 1).

Thus, we have

| J_{t+1}(π_{k+1}) − J_t(π_{k+1}) | → 0   (w.p. 1).

Since ΔJ_PIM ≥ 0, when t → ∞, the MVC (38) holds with probability one, i.e.,

P { | J_{t+1}(π_{k+1}) − J_t(π_{k+1}) | ≤ ΔJ_PIM } = 1. ∎
In general, the MVC indicates that an excessive difference between the consecutive estimates N(μ̂_t, Σ̂_t) and N(μ̂_{t+1}, Σ̂_{t+1}) should be avoided. In mixed RL, we update the distribution of the additive stochastic uncertainty by Bayesian estimation. As shown in Fig. 3, if a single data batch deviates strongly from the total data, the Bayesian estimator can reduce the deviation between the posterior distribution and the total data distribution by introducing an appropriate prior distribution of the parameters.

Fig. 3: The IBE is effective in preventing excessive differences between consecutive estimates, owing to the use of an appropriate prior distribution of the parameters.

V Mixed RL with Parameterized Functions

For large state spaces, both the value function and the policy are parameterized in mixed RL, as shown in (48). The parameterized value function V_w(x) with parameter w is called the "critic", and the parameterized policy π_θ(x) with parameter θ is called the "actor" [24].

V(x) ≈ V_w(x),   π(x) ≈ π_θ(x).   (48)

The parameterized critic aims to minimize the average square error (49) in PEV, i.e.,

J_critic(w) = (1/2) E { ( l(x, π_θ(x)) + γ V_w(x′) − V_w(x) )² }.   (49)

The semi-gradient of the critic is

∇_w J_critic = −E { ( l(x, π_θ(x)) + γ V_w(x′) − V_w(x) ) ∇_w V_w(x) },

where x′ = f(x, π_θ(x)) + ξ̂ is the successor state under the mixed model, and the bootstrapped target l + γ V_w(x′) is treated as a constant when differentiating.

The parameterized actor aims to minimize the "weak" Bellman condition, i.e., to minimize the following objective function:

J_actor(θ) = E_{ξ̂} { l(x, π_θ(x)) + γ V_w(x′) },

where μ̂_t and Σ̂_t are the mean and covariance of the estimated uncertainty ξ̂. The gradient of J_actor is calculated by the chain rule through both the utility and the mixed model:

∇_θ J_actor = E_{ξ̂} { ∇_u l(x, u) ∇_θ π_θ(x) + γ ∇_{x′} V_w(x′) ∇_u f(x, u) ∇_θ π_θ(x) } |_{u=π_θ(x)}.

In essence, this parameterized method is called generalized policy iteration (GPI). Different from traditional policy iteration, PEV and PIM each take only one gradient step per GPI cycle, which greatly improves computational efficiency when RL is combined with neural networks.
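
The semi-gradient critic update of (49) can be sketched as a TD(0)-style loop with linear features. The scalar dynamics, the fixed policy, and the feature choice are assumptions of this sketch; the defining property is that the bootstrapped target is held constant in each step, which is what makes the gradient a semi-gradient.

```python
import random

# TD(0)-style sketch of the semi-gradient critic update: a linear value
# approximation V_w(x) = w0 + w1 * x^2 is fitted to the bootstrapped target
# l(x, u) + gamma * V_w(x'), with the target held constant when forming the
# gradient. Dynamics, policy, and features are assumptions of this sketch.

GAMMA, ALPHA = 0.95, 0.05

def features(x):
    return (1.0, x * x)

def v(w, x):
    p = features(x)
    return w[0] * p[0] + w[1] * p[1]

def critic_step(w, x, x_next, stage_cost):
    """One semi-gradient descent step on the squared TD error."""
    target = stage_cost + GAMMA * v(w, x_next)  # treated as a constant
    err = target - v(w, x)
    p = features(x)
    return [w[0] + ALPHA * err * p[0], w[1] + ALPHA * err * p[1]]

def train_critic(steps=5000, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0]
    x = 1.0
    for _ in range(steps):
        u = -0.5 * x                                # fixed actor for the sketch
        x_next = 0.9 * x + u + rng.gauss(0.0, 0.1)  # mixed-model transition
        w = critic_step(w, x, x_next, x * x + 0.1 * u * u)
        x = x_next
    return w
```

Because the target l + γV_w(x′) is not differentiated, only the prediction V_w(x) contributes to the gradient, matching the semi-gradient described above; the learned curvature weight comes out positive, as the positive definite cost requires.
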

Since the gradient descent of PIM is carried out only once in each GPI cycle, the maximum variation condition (MVC) may not be satisfied. We propose an adaptive GPI (AGPI) method to solve this problem. In every iteration, we check whether the PIM result satisfies the MVC; if not, the algorithm continues the gradient descent steps in PIM until the MVC is satisfied or the maximum number of internal steps is reached. The mixed RL algorithm with AGPI is summarized in Algorithm 2.

  Initialize IBE parameters μ̂_0 = μ̄ and Σ̂_0 = Σ̄
  Initialize network weights w and θ, and choose appropriate learning rates α_w and α_θ
  repeat
     update the distribution of ξ and the mixed model with the latest data D_t by the IBE (20)
     update critic with the semi-gradient of (49): w ← w − α_w ∇_w J_critic
     update actor:
        repeat
           update policy net: θ ← θ − α_θ ∇_θ J_actor
        until MVC (38) is satisfied or the maximum internal step count is reached
  until w and θ converge
Algorithm 2 Mixed RL with parameterized value and policy
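
The AGPI inner loop can be sketched abstractly: gradient steps on the policy parameter are repeated until a surrogate improvement test is met or a step cap is reached. The quadratic example and the improvement threshold below are hypothetical stand-ins for the MVC check (38), which in the paper compares the uncertainty-induced cost change with the decrease achieved by the last PIM step.

```python
# Abstract sketch of the AGPI inner loop: gradient steps on the policy
# parameter theta are repeated until a surrogate improvement threshold "tol"
# is met or the step cap n_max is reached. "tol" is a hypothetical stand-in
# for the MVC check of the paper, not the paper's exact criterion.

def agpi_pim(cost, grad, theta, lr=0.1, n_max=50, tol=1e-3):
    """Repeat gradient steps until the cost drop since entry exceeds tol."""
    start = cost(theta)
    step = 0
    for step in range(1, n_max + 1):
        theta = theta - lr * grad(theta)
        if start - cost(theta) >= tol:  # surrogate for the MVC test
            break
    return theta, step
```

On the quadratic cost (θ − 2)², starting from θ = 0 with tol = 3 and lr = 0.1, the iterate is θ_n = 2 − 2·0.8ⁿ, so four steps are taken before the accumulated improvement passes the threshold.
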

VI Numerical Experiments

We consider a typical optimal control problem of stochastic non-affine nonlinear systems, i.e., the combined lateral and longitudinal control of an automated vehicle with stochastic disturbance (e.g., the influence of small road slopes and road bumps). The vehicle is subjected to a random longitudinal interference force in the tracking process, and the vehicle dynamics is shown in (53) [25].


where the state , is the lateral velocity, is yaw rate, is the difference between longitudinal velocity and desired velocity, is the yaw angle, and is the distance between vehicle’s centroid and the target trajectory. For the control input , where is the front wheel angle and is the longitudinal acceleration. The and are the lateral tire forces of the front and rear tires respectively, which are calculated by the Fiala tire model [26]. In the tire model,the tire-road friction coefficient is set as 1.0. The front wheel cornering stiffness and rear wheel cornering stiffness are set as 88000 and 94000 respectively. The mass is set as 1500 , the and are the distances from centroid to front axle and rear axle, and set as 1.14 and 1.40

respectively. The polar moment of inertia

at centroid is set as 2420 . The random longitudinal interference force and the desired velocity is set as 12 [27].

For comparison purposes, a double-lane change task was simulated with three different RL algorithms. The task is to track the desired trajectory in the lateral direction while maintaining the desired longitudinal velocity under the longitudinal interference. Hence, the optimal control problem with the discretized stochastic system equation is given by

min_π E_ξ { Σ_{k=0}^{∞} γ^k l(x_k, u_k) }   s.t.   x_{k+1} = f(x_k, u_k) + ξ_k,

where γ is the discounting factor, f is the deterministic part of the discretized system equation of (53), ξ is the additive stochastic uncertainty, and a fixed simulation time interval is used. In this simulated task, we compared the performance of mixed RL with both model-driven RL and data-driven RL. The data-driven RL computes the control policy by using only the state-action data with a typical data-driven algorithm (i.e., DDPG) [3]. The model-driven RL computes the policy by GPI [28], directly using the given empirical model

x_{k+1} = f(x_k, u_k) + ξ̄_k,   ξ̄_k ∼ N(μ̄, Σ̄),

where the prior distribution N(μ̄, Σ̄) is specified by the designer and Σ̄ is a diagonal matrix.

The convergence performance of the three algorithms is compared in Fig. 4. The mixed RL and model-driven RL converge in roughly a quarter of the iterations required by the data-driven RL under the same hyper-parameters.

Fig. 4: Convergence rate comparison between mixed RL, model-driven RL, and data-driven RL.

For control performance, we test the policies calculated by the three methods in the double lane change task. As shown in Fig. 5, all three policies stably track the target trajectory, but with different control errors. In fact,

Fig. 5: Tracking performance comparison.

as shown in Fig. 6, the mixed RL has the minimum longitudinal speed error, since it enables the vehicle to decelerate rapidly at sharp turns and adjust back appropriately after passing the turns. In contrast, due to the model error, the model-driven RL has higher speed error and its deceleration when making turns is insufficient. Due to the slow convergence, the data-driven RL generates a poor solution and has the largest speed error.

Fig. 6: Longitudinal speed error.

The mixed RL also outperforms the other two benchmark methods in terms of the lateral position error. As shown in Fig. 7, the mixed RL has the minimum steady-state lateral position error, while data-driven RL has the largest lateral position error and frequent speed fluctuation.

Fig. 7: Lateral position error.

The mean absolute errors of the three methods are compared in Table I. The longitudinal speed error of mixed RL is 77.41% lower and the lateral position error 33.77% lower than those of the data-driven RL. Besides, the longitudinal speed error of mixed RL is 58.82% lower and the lateral position error 15.64% lower than those of the model-driven RL.

Method           Position error [m]   Speed error [m/s]
Mixed RL         0.151                0.021
Data-driven RL   0.228                0.093
Model-driven RL  0.179                0.051
TABLE I: Mean absolute errors of the three methods

In summary, mixed RL exhibits the fastest convergence during training and the best control performance in the double lane change task. The model-driven RL has a similar convergence speed to mixed RL but higher control error due to the model error. The data-driven RL has the slowest convergence rate and the largest control error, owing to the difficulty of finding the optimal policy from state-action data alone.

VII Conclusion

This paper proposes a mixed reinforcement learning approach with improved convergence speed and policy accuracy for nonlinear systems with additive Gaussian uncertainty. Mixed RL utilizes an iterative Bayesian estimator to accurately model the environmental dynamics by integrating the designer's knowledge with the measured state-transition data. The convergence and recursive stability of the learned policy were proved via Bellman's principle of optimality and Lyapunov analysis. It is observed that mixed RL achieves a faster convergence rate and a more stable training process than its data-driven counterpart. Meanwhile, mixed RL has lower policy error than its model-driven counterpart, since the environmental model is refined iteratively by Bayesian estimation. The benefits of mixed RL are demonstrated by a double-lane change task with an automated vehicle. The potential of mixed RL for more general environmental dynamics and non-Gaussian uncertainties will be investigated in the future.