I Introduction
Reinforcement learning (RL) is a machine learning technique that has been widely studied from the computational intelligence and machine learning perspectives in the artificial intelligence community [1, 2, 3, 4]. RL refers to an actor or agent that interacts with its environment and aims to learn the optimal actions, or control policies, by observing the responses from the environment. In [2], Sutton and Barto suggested a definition of an RL method: any method that is well suited for solving the RL problem can be regarded as an RL method, where the RL problem is defined in terms of optimal control of Markov decision processes. This clearly establishes the relationship between RL methods and the control community. Moreover, RL methods are able to find an optimal control policy in an unknown environment, which makes RL a promising approach for control design of real systems.
In the past few years, many RL approaches [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] have been introduced for solving optimal control problems. In particular, some important results were reported on using RL to solve the optimal control problem of discrete-time systems [7, 10, 14, 17, 18, 22]. For example, Liu and Wei suggested a finite-approximation-error based iterative adaptive dynamic programming approach [17] and a novel policy iteration (PI) method [22] for discrete-time nonlinear systems. For continuous-time systems, Murray et al. [5] presented two PI algorithms that avoid the necessity of knowing the internal system dynamics. Vrabie et al. [8, 9, 13] introduced the idea of PI and proposed an important framework of integral reinforcement learning (IRL). Modares et al. [21] developed an experience-replay based IRL algorithm for nonlinear partially unknown constrained-input systems. In [19], an online neural network (NN) based decentralized control strategy was developed for stabilizing a class of continuous-time nonlinear interconnected large-scale systems. In addition, it is worth mentioning that RL ideas have also been introduced to solve the optimal control problem of partial differential equation systems [6, 12, 15, 16, 23]. However, for most practical systems, the existence of external disturbances is usually unavoidable.
To reduce the effects of disturbances, a robust controller is required for disturbance rejection. One effective solution is the $H_\infty$ control method, which achieves disturbance attenuation in the $L_2$-gain setting [24, 25, 26], that is, it designs a controller such that the ratio of the objective output energy to the disturbance energy is less than a prescribed level. Over the past few decades, a large number of theoretical results on nonlinear $H_\infty$ control have been reported [27, 28, 29], where the $H_\infty$ control problem is transformed into solving the so-called Hamilton-Jacobi-Isaacs (HJI) equation. However, the HJI equation is a nonlinear partial differential equation (PDE), which is difficult or impossible to solve analytically, and may not have global analytic solutions even in simple cases.
Thus, some works have been reported to solve the HJI equation approximately [27, 30, 31, 32, 33, 34, 35]. In [27], it was shown that there exists a sequence of policy iterations on the control input such that the HJI equation is successively approximated with a sequence of Hamilton-Jacobi-Bellman (HJB)-like equations. The methods for solving the HJB equation can then be extended to the HJI equation. In [36], the HJB equation was successively approximated by a sequence of linear PDEs, which were solved with Galerkin approximation in [30, 37, 38]. In [39], the successive approximation method was extended to solve the discrete-time HJI equation. Similar to [30], a policy iteration scheme was developed in [31] for the constrained-input system. To implement this scheme, a neuro-dynamic programming approach was introduced in [40] and an online adaptive method was given in [41]. This approach suits the case in which the saddle point exists; the situation in which a smooth saddle point does not exist was considered in [42]. In [32], a synchronous policy iteration method was developed, which is an extension of the work in [43]. To improve the efficiency of computing the solution of the HJI equation, Luo and Wu [44] proposed a computationally efficient simultaneous policy update algorithm (SPUA). In addition, in [45] the solution of the HJI equation was approximated by a Taylor series expansion, and an efficient algorithm was furnished to generate the coefficients of the Taylor series. Most of these methods [27, 30, 31, 32, 33, 35, 40, 44, 45] are model-based, that is, the full system model is required. However, an accurate system model is usually unavailable or costly to obtain for many practical systems. Thus, some RL approaches have been proposed for $H_\infty$ control design of linear systems [46, 47] and nonlinear systems [48] with unknown internal system model.
But these methods are on-policy learning approaches [32, 46, 41, 47, 48, 49], where the cost function should be evaluated by using system data generated with the policies being evaluated. There are several drawbacks (to be discussed in Section III) in applying on-policy learning to real $H_\infty$ control problems.
To overcome this problem, this paper introduces an off-policy RL method to solve the nonlinear continuous-time $H_\infty$ control problem with unknown internal system model. The rest of the paper is organized as follows. Sections II and III present the problem description and the motivation. The off-policy learning methods for nonlinear systems and linear systems are developed in Sections IV and V, respectively. The simulation studies are conducted in Section VI and a brief conclusion is given in Section VII.
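Notations: $\mathbb{R}$, $\mathbb{R}^n$ and $\mathbb{R}^{n\times m}$ are the set of real numbers, the $n$-dimensional Euclidean space and the set of all real $n\times m$ matrices, respectively. $\|\cdot\|$ denotes the vector norm or matrix norm in $\mathbb{R}^n$ or $\mathbb{R}^{n\times m}$, respectively. The superscript $T$ is used for the transpose and $I$ denotes the identity matrix of appropriate dimension. $\nabla \triangleq \partial/\partial x$ denotes the gradient operator. For a symmetric matrix $M$, $M > (\geq)\, 0$ means that it is a positive (semi-positive) definite matrix. $\|v\|_M^2 \triangleq v^T M v$ for some real vector $v$ and symmetric matrix $M > (\geq)\, 0$ with appropriate dimensions. $C^1(\mathcal{X})$ is the space of functions on $\mathcal{X}$ whose first derivatives are continuous. $L_2[0,\infty)$ is a Banach space, where for $\forall w(t) \in L_2[0,\infty)$, $\int_0^\infty \|w(t)\|^2 dt < \infty$. Let $\mathcal{X}$, $\mathcal{U}$ and $\mathcal{W}$ be compact sets, and denote $\mathcal{D} \triangleq \{(x,u,w) \mid x \in \mathcal{X}, u \in \mathcal{U}, w \in \mathcal{W}\}$. For column vector functions $s_1(x,u,w)$ and $s_2(x,u,w)$, where $(x,u,w) \in \mathcal{D}$, define the inner product $\langle s_1, s_2\rangle_{\mathcal{D}} \triangleq \int_{\mathcal{D}} s_1^T s_2 \, d(x,u,w)$ and the norm $\|s_1\|_{\mathcal{D}} \triangleq \left(\int_{\mathcal{D}} s_1^T s_1 \, d(x,u,w)\right)^{1/2}$. $H^{m,p}(\mathcal{X})$ is a Sobolev space that consists of functions in the space $L^p(\mathcal{X})$ such that their derivatives of order at least $m$ are also in $L^p(\mathcal{X})$.
II Problem description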
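Let us consider the following affine nonlinear continuous-time dynamical system:
$$\dot{x}(t) = f(x(t)) + g(x(t))u(t) + k(x(t))w(t) \tag{1}$$
$$z(t) = h(x(t)) \tag{2}$$
where $x \in \mathcal{X} \subset \mathbb{R}^n$ is the state, $u \in \mathcal{U} \subset \mathbb{R}^m$ is the control input with $u(t) \in L_2[0,\infty)$, $w \in \mathcal{W} \subset \mathbb{R}^q$ is the external disturbance with $w(t) \in L_2[0,\infty)$, and $z \in \mathbb{R}^p$ is the objective output. $f(x)$ is Lipschitz continuous on a set $\mathcal{X}$ that contains the origin, and $f(0) = 0$. $f(x)$ represents the internal system model, which is assumed to be unknown in this paper. $g(x)$, $k(x)$ and $h(x)$ are known continuous vector or matrix functions of appropriate dimensions.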
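The $H_\infty$ control problem under consideration is to find a state feedback control law $u(x)$ such that the system (1) is closed-loop asymptotically stable and has $L_2$-gain less than or equal to $\gamma$, that is,
$$\int_0^\infty \left(\|z(t)\|^2 + \|u(t)\|_R^2\right) dt \leq \gamma^2 \int_0^\infty \|w(t)\|^2 dt \tag{3}$$
for all $w(t) \in L_2[0,\infty)$, where $R > 0$ and $\gamma > 0$ is some prescribed level of disturbance attenuation. From [27], this problem can be transformed into solving the so-called HJI equation, which is summarized in Lemma 1.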
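Lemma 1.
Assume the system (1) and (2) is zero-state observable. For $\gamma > 0$, suppose there exists a solution $V^*(x)$ to the HJI equation
$$[\nabla V^*(x)]^T f(x) + h^T(x)h(x) - \frac{1}{4}[\nabla V^*(x)]^T g(x) R^{-1} g^T(x) \nabla V^*(x) + \frac{1}{4\gamma^2}[\nabla V^*(x)]^T k(x) k^T(x) \nabla V^*(x) = 0 \tag{4}$$
where $V^*(x) \in C^1(\mathcal{X})$, $V^*(x) \geq 0$ and $V^*(0) = 0$. Then, the closed-loop system with the state feedback control
$$u(t) = u^*(x(t)) = -\frac{1}{2} R^{-1} g^T(x) \nabla V^*(x) \tag{5}$$
has $L_2$-gain less than or equal to $\gamma$, and the closed-loop system (1), (2) and (5) (when $w(t) \equiv 0$) is locally asymptotically stable.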
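As a quick sanity check of Lemma 1, consider a hypothetical scalar linear case that is not taken from the paper: $\dot{x} = ax + bu + dw$, $z = cx$, with scalar weight $R = r > 0$. The sketch below assumes the HJI equation takes the standard form $[\nabla V^*]^T f + h^T h - \frac{1}{4}[\nabla V^*]^T g R^{-1} g^T \nabla V^* + \frac{1}{4\gamma^2}[\nabla V^*]^T k k^T \nabla V^* = 0$ and uses the trial solution $V^*(x) = p x^2$ with $p \geq 0$; the symbols $a$, $b$, $c$, $d$, $r$, $p$ are illustrative names only.

$$
\begin{aligned}
\nabla V^*(x) &= 2px, \\
0 &= \left(2ap + c^2 - \left(\frac{b^2}{r} - \frac{d^2}{\gamma^2}\right) p^2\right) x^2, \\
u^*(x) &= -\frac{1}{2} r^{-1} b \,(2px) = -\frac{bp}{r}\, x .
\end{aligned}
$$

The HJI equation therefore collapses to a scalar quadratic in $p$. For instance, with $a = -1$, $b = c = d = r = 1$ and $\gamma = 2$, the nonnegative root is $p = (-1 + \sqrt{1.75})/0.75 \approx 0.4305$, which gives the feedback $u^*(x) \approx -0.4305\,x$.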
III Motivation from investigation of related work
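From Lemma 1, it is noted that the $H_\infty$ control (5) relies on the solution of the HJI equation (4). Therefore, a model-based iterative method was proposed in [30], where the HJI equation is successively approximated by a sequence of linear PDEs:
$$[\nabla V_{i+1}^{(j)}]^T \left(f + g u^{(j)} + k w^{(i)}\right) + h^T h + \|u^{(j)}\|_R^2 - \gamma^2 \|w^{(i)}\|^2 = 0 \tag{6}$$
and the control and disturbance policies are then updated with
$$w^{(i+1)} = \frac{1}{2}\gamma^{-2} k^T \nabla V_{i+1}^{(j)} \tag{7}$$
$$u^{(j+1)} = -\frac{1}{2} R^{-1} g^T \nabla V^{(j+1)} \tag{8}$$
with $V^{(j+1)} \triangleq \lim_{i\to\infty} V_{i+1}^{(j)}$. From [27, 30], it was indicated that $V_{i+1}^{(j)}$ can converge to the solution of the HJI equation, i.e., $\lim_{i,j\to\infty} V_{i+1}^{(j)} = V^*$.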
Remark 1.
Note that the key point of the iterative scheme (6)-(8) is the solution of the linear PDE (6). Thus, several related methods were developed, such as Galerkin approximation [30], synchronous policy iteration [32], the neuro-dynamic programming approach [31, 40] and the online adaptive control method [41] for constrained-input systems, and the Galerkin approximation method for discrete-time systems [39]. The iteration (6)-(8) generates two iterative loops, since the control and disturbance policies are updated at different iterative steps: the inner loop updates the disturbance policy with index $i$, and the outer loop updates the control policy with index $j$. The outer loop is not activated until the inner loop has converged, which results in low efficiency. Therefore, Luo and Wu [44] proposed a simultaneous policy update algorithm (SPUA), where the control and disturbance policies are updated at the same iterative step, and thus only one iterative loop is required. It is worth noting that the word “simultaneous” in [44] and the word “synchronous/simultaneous” in [32, 41] have different meanings. The former emphasizes the same “iterative step,” while the latter emphasizes the same “time instant”. In other words, the SPUA in [44] updates the control and disturbance policies at the “same” iterative step, while the algorithms in [32, 41] update the control and disturbance policies at “different” iterative steps.
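The procedure of the model-based SPUA is given in Algorithm 1.
Algorithm 1.
Model-based SPUA.

Step 1: Give an initial function $V^{(0)}$ (its admissible set is determined by Lemma 5 in [44]). Let $i = 0$;

Step 2: Update the control and disturbance policies with
$$u^{(i)}(x) = -\frac{1}{2} R^{-1} g^T(x) \nabla V^{(i)}(x) \tag{9}$$
$$w^{(i)}(x) = \frac{1}{2}\gamma^{-2} k^T(x) \nabla V^{(i)}(x) \tag{10}$$
Step 3: Solve the following linear PDE for $V^{(i+1)}(x)$:
$$[\nabla V^{(i+1)}]^T \left(f + g u^{(i)} + k w^{(i)}\right) + h^T h + \|u^{(i)}\|_R^2 - \gamma^2 \|w^{(i)}\|^2 = 0 \tag{11}$$
where $V^{(i+1)}(x) \in C^1(\mathcal{X})$, $V^{(i+1)}(x) \geq 0$ and $V^{(i+1)}(0) = 0$.

Step 4: Let $i = i + 1$, go back to Step 2 and continue.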
It is worth noting that Algorithm 1 is an infinite iterative procedure, which is used for theoretical analysis rather than for implementation. That is to say, Algorithm 1 converges to the solution of the HJI equation (4) as the iteration goes to infinity. By constructing a fixed point equation, the convergence of Algorithm 1 is established in [44] by proving that it is essentially a Newton's iteration method for finding the fixed point. As the index $i$ increases, the sequence $\{V^{(i)}\}$ obtained by the SPUA with equations (9)-(11) converges to the solution of the HJI equation (4), i.e., $\lim_{i\to\infty} V^{(i)} = V^*$.
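To make the structure of Algorithm 1 concrete, the following is a minimal sketch for a hypothetical scalar linear system $\dot{x} = ax + bu + dw$, $z = cx$, $R = r$, with the quadratic trial function $V^{(i)}(x) = p_i x^2$. Under these assumptions (which are not part of the paper), the policy updates become linear gains and the linear PDE of Step 3 reduces to one scalar linear equation in $p_{i+1}$. The function names `spua_scalar` and `hji_root`, the parameter values and the initial guess $p_0 = 0$ are illustrative choices only.

```python
import math


def spua_scalar(a, b, d, c, r, gamma, p0=0.0, iters=20):
    """Model-based SPUA sketch for x' = a*x + b*u + d*w, z = c*x, V_i(x) = p_i*x**2.

    Assumed scalar forms of the update steps:
      control policy      u_i(x) = -(b*p_i/r) * x
      disturbance policy  w_i(x) = (d*p_i/gamma**2) * x
      linear equation     2*p_next*(a - b*ku + d*kw) + c**2 + r*ku**2 - gamma**2*kw**2 = 0
    """
    p = p0
    history = [p]
    for _ in range(iters):
        ku = b * p / r                # control gain: u_i(x) = -ku * x
        kw = d * p / gamma ** 2       # disturbance gain: w_i(x) = kw * x
        closed_loop = a - b * ku + d * kw
        stage_cost = c * c + r * ku * ku - gamma ** 2 * kw * kw
        p = -stage_cost / (2.0 * closed_loop)   # solve the linear equation for p_next
        history.append(p)
    return p, history


def hji_root(a, b, d, c, r, gamma):
    """Nonnegative root of 2*a*p + c**2 - s*p**2 = 0 with s = b**2/r - d**2/gamma**2 > 0."""
    s = b * b / r - d * d / gamma ** 2
    return (a + math.sqrt(a * a + s * c * c)) / s


if __name__ == "__main__":
    p_final, hist = spua_scalar(-1.0, 1.0, 1.0, 1.0, 1.0, 2.0)
    print(hist[:4], p_final, hji_root(-1.0, 1.0, 1.0, 1.0, 1.0, 2.0))
```

In this scalar setting the update simplifies to $p_{i+1} = (c^2 + s p_i^2) / (2(s p_i - a))$ with $s = b^2/r - d^2/\gamma^2$, which is exactly Newton's step on $2ap + c^2 - sp^2 = 0$, in line with the Newton interpretation mentioned above.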
Remark 2.
It is necessary to explain the rationale for using equations (9) and (10) to update the control and disturbance policies. The $H_\infty$ control problem (1)-(3) can be viewed as a two-player zero-sum differential game problem [26, 33, 40, 47, 32, 34, 41]. The game problem is a minimax problem, where the control policy acts as the minimizing player and the disturbance policy is the maximizing player. The game problem aims at finding the saddle point $(u^*, w^*)$, where $u^*$ is given by expression (5) and $w^*$ is given by $w^*(x) = \frac{1}{2}\gamma^{-2} k^T(x) \nabla V^*(x)$. Correspondingly, for the $H_\infty$ control problem (1)-(3), $u^*$ and $w^*$ are the associated $H_\infty$ control policy and the worst disturbance signal [26, 31, 40, 47, 32, 34], respectively. Thus, it is reasonable to use expressions (9) and (10), which are consistent with $u^*$ and $w^*$ in form, for the control and disturbance policy updates. Similar control and disturbance policy update methods can be found in [27, 30, 31, 40, 34, 41].
Observe that both iterative equations (6) and (11) require the full system model. For the $H_\infty$ control problem in which the internal system dynamic $f(x)$ is unknown, data-based methods [47, 48] were suggested to solve the HJI equation online. However, most of the related existing online methods are on-policy learning approaches [32, 41, 47, 48, 49]. From the definition of on-policy learning [2], the cost function should be evaluated with data generated from the policies being evaluated. For example, $V_{i+1}^{(j)}$ in equation (6) is the cost function of the policies $w^{(i)}$ and $u^{(j)}$, which means that $V_{i+1}^{(j)}$ should be evaluated with system data generated by using the evaluating policies $w^{(i)}$ and $u^{(j)}$. These on-policy learning approaches for solving the $H_\infty$ control problem have several drawbacks:

1) For real implementation of on-policy learning methods [32, 41, 48, 49], the approximate evaluating control and disturbance policies (rather than the actual policies) are used to generate data for learning their cost function. In other words, the on-policy learning methods use “inaccurate” data to learn their cost function, which increases the accumulated error. For example, to learn the cost function $V_{i+1}^{(j)}$ in equation (6), some approximate policies $\hat{w}^{(i)}$ and $\hat{u}^{(j)}$ (rather than the actual policies $w^{(i)}$ and $u^{(j)}$, which are usually unknown because of estimation error) are employed to generate data;

2) The evaluating control and disturbance policies are required to generate data for on-policy learning, thus the disturbance signal should be adjustable, which is usually impractical for most real systems;

3) It is known [2, 50] that the issue of “exploration” is extremely important in RL for learning the optimal control policy, and the lack of exploration during the learning process may lead to divergence. Nevertheless, for on-policy learning, exploration is restricted because only the evaluating policies can be used to generate data. From the literature investigation, it is found that the “exploration” issue is rarely discussed in existing work that uses RL techniques for control design;

4) Most of the existing approaches [32, 41, 47, 48, 49] are implemented online, thus they are difficult to use for real-time control because the learning process is often time-consuming. Furthermore, online control design approaches use only current data and discard past data, which implies that the measured system data is used only once and thus results in low utilization efficiency.
To overcome the drawbacks mentioned above, we propose an off-policy RL approach to solve the $H_\infty$ control problem with unknown internal system dynamic $f(x)$.
IV Off-policy reinforcement learning for $H_\infty$ control
In this section, an off-policy RL method for $H_\infty$ control design is derived and its convergence is proved. Then, an NN-based critic-actor structure is developed for implementation purposes.
IV-A Off-policy reinforcement learning
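To derive the off-policy RL method, we rewrite the system (1) as:
$$\dot{x} = f + g u^{(i)} + k w^{(i)} + g[u - u^{(i)}] + k[w - w^{(i)}] \tag{12}$$
for $\forall u \in \mathcal{U}, w \in \mathcal{W}$. Let $V^{(i+1)}(x)$ be the solution of the linear PDE (11); then taking its derivative along the state of system (12) yields
$$\frac{dV^{(i+1)}(x)}{dt} = [\nabla V^{(i+1)}]^T \left(f + g u^{(i)} + k w^{(i)}\right) + [\nabla V^{(i+1)}]^T g [u - u^{(i)}] + [\nabla V^{(i+1)}]^T k [w - w^{(i)}] \tag{13}$$
With the linear PDE (11), integrating both sides of equation (13) over the time interval $[t, t+\Delta t]$ and rearranging terms yield
$$\int_t^{t+\Delta t} [\nabla V^{(i+1)}(x(\tau))]^T g(x(\tau)) [u(\tau) - u^{(i)}(x(\tau))] d\tau + \int_t^{t+\Delta t} [\nabla V^{(i+1)}(x(\tau))]^T k(x(\tau)) [w(\tau) - w^{(i)}(x(\tau))] d\tau + V^{(i+1)}(x(t)) - V^{(i+1)}(x(t+\Delta t)) = \int_t^{t+\Delta t} \left(h^T(x(\tau)) h(x(\tau)) + \|u^{(i)}(x(\tau))\|_R^2 - \gamma^2 \|w^{(i)}(x(\tau))\|^2\right) d\tau \tag{14}$$
It is observed from equation (14) that the cost function $V^{(i+1)}$ can be learned by using arbitrary input signals $u$ and $w$, rather than the evaluating policies $u^{(i)}$ and $w^{(i)}$. Then, replacing the linear PDE (11) in Algorithm 1 with (14) results in the off-policy RL method. To show its convergence, Theorem 1 establishes the equivalence between iterative equations (11) and (14).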
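Theorem 1.
Let $V^{(i+1)}(x) \in C^1(\mathcal{X})$, $V^{(i+1)}(x) \geq 0$ and $V^{(i+1)}(0) = 0$. $V^{(i+1)}(x)$ is the solution of equation (14) if and only if it is the solution of the linear PDE (11), i.e., equation (14) is equivalent to the linear PDE (11).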
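Proof. From the derivation of equation (14), it is concluded that if $V^{(i+1)}(x)$ is the solution of the linear PDE (11), then $V^{(i+1)}(x)$ also satisfies equation (14). To complete the proof, we have to show that $V^{(i+1)}(x)$ is the unique solution of equation (14). The proof is by contradiction.
Before starting the contradiction proof, we derive a simple fact. For a continuous function $\hbar(t)$, consider
$$\lim_{\Delta t \to 0} \frac{1}{\Delta t} \int_t^{t+\Delta t} \hbar(\tau) d\tau = \lim_{\Delta t \to 0} \frac{1}{\Delta t} \left(\int_0^{t+\Delta t} \hbar(\tau) d\tau - \int_0^{t} \hbar(\tau) d\tau\right) = \frac{d}{dt} \int_0^t \hbar(\tau) d\tau = \hbar(t) \tag{15}$$
From (14), we have
$$\frac{dV^{(i+1)}(x)}{dt} = \lim_{\Delta t \to 0} \frac{1}{\Delta t} \left[V^{(i+1)}(x(t+\Delta t)) - V^{(i+1)}(x(t))\right] = \lim_{\Delta t \to 0} \frac{1}{\Delta t} \int_t^{t+\Delta t} [\nabla V^{(i+1)}]^T g [u - u^{(i)}] d\tau + \lim_{\Delta t \to 0} \frac{1}{\Delta t} \int_t^{t+\Delta t} [\nabla V^{(i+1)}]^T k [w - w^{(i)}] d\tau - \lim_{\Delta t \to 0} \frac{1}{\Delta t} \int_t^{t+\Delta t} \left(h^T h + \|u^{(i)}\|_R^2 - \gamma^2 \|w^{(i)}\|^2\right) d\tau \tag{16}$$
By using the fact (15), equation (16) is rewritten as
$$\frac{dV^{(i+1)}(x)}{dt} = [\nabla V^{(i+1)}]^T g [u - u^{(i)}] + [\nabla V^{(i+1)}]^T k [w - w^{(i)}] - \left(h^T h + \|u^{(i)}\|_R^2 - \gamma^2 \|w^{(i)}\|^2\right) \tag{17}$$
Suppose that $W(x) \in C^1(\mathcal{X})$ is another solution of equation (14) with boundary condition $W(0) = 0$. Thus, $W(x)$ also satisfies equation (17), i.e.,
$$\frac{dW(x)}{dt} = [\nabla W]^T g [u - u^{(i)}] + [\nabla W]^T k [w - w^{(i)}] - \left(h^T h + \|u^{(i)}\|_R^2 - \gamma^2 \|w^{(i)}\|^2\right) \tag{18}$$
Subtracting equation (18) from (17) yields
$$\frac{d}{dt}\left(V^{(i+1)}(x) - W(x)\right) = [\nabla (V^{(i+1)} - W)]^T g [u - u^{(i)}] + [\nabla (V^{(i+1)} - W)]^T k [w - w^{(i)}] \tag{19}$$
This means that equation (19) holds for $\forall u \in \mathcal{U}, w \in \mathcal{W}$. Letting $u = u^{(i)}$ and $w = w^{(i)}$, we have
$$\frac{d}{dt}\left(V^{(i+1)}(x) - W(x)\right) = 0 \tag{20}$$
Then, $V^{(i+1)}(x) - W(x) = c$ for $\forall x \in \mathcal{X}$, where $c$ is a real constant, and $c = V^{(i+1)}(0) - W(0) = 0$. Thus, $V^{(i+1)}(x) - W(x) = 0$, i.e., $W(x) = V^{(i+1)}(x)$ for $\forall x \in \mathcal{X}$. This completes the proof.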
Remark 3.
It follows from Theorem 1 that equation (14) is equivalent to equation (11), and thus the convergence of the off-policy RL is guaranteed, i.e., the solution of the iterative equation (14) converges to the solution of the HJI equation (4) as the iteration step $i$ increases. Different from equation (11) in Algorithm 1, the off-policy RL with equation (14) uses system data instead of the internal system dynamic $f(x)$. Hence, the off-policy RL can be regarded as a direct learning method for $H_\infty$ control design, which avoids the identification of $f(x)$. In fact, the information of $f(x)$ is embedded in the measured system data. That is to say, the lack of knowledge about $f(x)$ does not have any impact on the ability of the off-policy RL to obtain the solution of the HJI equation (4) and the $H_\infty$ control policy. It is worth pointing out that equation (14) is similar in form to the IRL [8, 9], which is an important framework for control design of continuous-time systems. The IRL in [8, 9] is an online optimal control learning algorithm for partially unknown systems.
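The data-based iteration can be illustrated with a minimal sketch on a hypothetical scalar system $\dot{x} = ax + bu + dw$, $z = cx$, $R = r$, again assuming the trial function $V^{(i)}(x) = p_i x^2$. Data are generated once under arbitrary behaviour inputs (here a sine and a cosine, chosen purely for illustration) and are then reused in every iteration; the learner uses $b$ and $d$ but never the drift coefficient $a$. The names `collect_data` and `off_policy_iteration`, the forward-Euler simulation, the Riemann-sum integrals and all parameter values are assumptions of this sketch, not the paper's implementation.

```python
import math


def collect_data(a, b, d, x0=1.0, dt=1e-4, steps_per_interval=500, intervals=40):
    """Simulate x' = a*x + b*u + d*w under arbitrary behaviour inputs.

    Returns one tuple per time interval:
      (x(t)^2 - x(t+Dt)^2, integral of x^2, integral of x*u, integral of x*w).
    The drift coefficient `a` is used only to generate data, never by the learner.
    """
    data = []
    x, t = x0, 0.0
    for _ in range(intervals):
        x_start = x
        i_xx = i_xu = i_xw = 0.0
        for _ in range(steps_per_interval):
            u = math.sin(3.0 * t)           # exploratory behaviour control (assumed)
            w = 0.5 * math.cos(2.0 * t)     # behaviour disturbance (assumed)
            i_xx += x * x * dt
            i_xu += x * u * dt
            i_xw += x * w * dt
            x += dt * (a * x + b * u + d * w)   # forward Euler step
            t += dt
        data.append((x_start * x_start - x * x, i_xx, i_xu, i_xw))
    return data


def off_policy_iteration(data, b, d, c, r, gamma, p0=0.0, iters=20):
    """Iterate the scalar form of the integral equation, reusing the same data set."""
    p = p0
    for _ in range(iters):
        ku = b * p / r               # current control policy u_i(x) = -ku * x
        kw = d * p / gamma ** 2      # current disturbance policy w_i(x) = kw * x
        num = den = 0.0
        for dx2, i_xx, i_xu, i_xw in data:
            # Coefficient multiplying p_next on the left-hand side.
            rho = 2.0 * b * (i_xu + ku * i_xx) + 2.0 * d * (i_xw - kw * i_xx) + dx2
            # Right-hand side: integral of c^2 x^2 + r u_i^2 - gamma^2 w_i^2.
            pi = (c * c + r * ku * ku - gamma ** 2 * kw * kw) * i_xx
            num += rho * pi
            den += rho * rho
        p = num / den                # least-squares solution over all intervals
    return p


if __name__ == "__main__":
    samples = collect_data(a=-1.0, b=1.0, d=1.0)
    print(off_policy_iteration(samples, b=1.0, d=1.0, c=1.0, r=1.0, gamma=2.0))
```

With $a = -1$, $b = c = d = r = 1$ and $\gamma = 2$, the learned $p$ should approach the nonnegative root of $2ap + c^2 - (b^2/r - d^2/\gamma^2)p^2 = 0$, up to the discretization error of the Euler and Riemann-sum approximations.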
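IV-B Implementation based on neural network
To solve equation (14) for the unknown function $V^{(i+1)}(x)$ based on system data, we develop an NN-based actor-critic structure. From the well-known high-order Weierstrass approximation theorem [51], a continuous function can be represented by an infinite-dimensional linearly independent basis function set. In practical applications, it is usually required to approximate the function on a compact set with a finite-dimensional function set. We consider the critic NN for approximating the cost function $V^{(i)}(x)$ on a compact set $\mathcal{X}$. Let $\varphi(x) \triangleq [\varphi_1(x) \ \cdots \ \varphi_L(x)]^T$ be the vector of linearly independent activation functions for the critic NN, where $L$ is the number of critic NN hidden layer neurons. Then, the output of the critic NN is given by
$$\hat{V}^{(i)}(x) = \sum_{l=1}^{L} \theta_l^{(i)} \varphi_l(x) = \varphi^T(x)\theta^{(i)} \tag{21}$$
for $\forall i = 0, 1, 2, \ldots$, where $\theta^{(i)} \triangleq [\theta_1^{(i)} \ \cdots \ \theta_L^{(i)}]^T$ is the critic NN weight vector. It follows from (9), (10) and (21) that the control and disturbance policies are given by:
$$\hat{u}^{(i)}(x) = -\frac{1}{2} R^{-1} g^T(x) \nabla\varphi^T(x) \theta^{(i)} \tag{22}$$
$$\hat{w}^{(i)}(x) = \frac{1}{2}\gamma^{-2} k^T(x) \nabla\varphi^T(x) \theta^{(i)} \tag{23}$$
for $\forall i = 0, 1, 2, \ldots$, where $\nabla\varphi(x) \triangleq [\partial\varphi_1/\partial x \ \cdots \ \partial\varphi_L/\partial x]^T$ is the Jacobian of $\varphi(x)$. Expressions (22) and (23) can be viewed as actor NNs for the control and disturbance policies respectively, where $-\frac{1}{2} R^{-1} g^T(x) \nabla\varphi^T(x)$ and $\frac{1}{2}\gamma^{-2} k^T(x) \nabla\varphi^T(x)$ are the activation function vectors and $\theta^{(i)}$ is the actor NN weight vector.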
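Due to estimation errors of the critic and actor NNs (21)-(23), the replacement of $V^{(i+1)}$, $w^{(i)}$ and $u^{(i)}$ in the iterative equation (14) with $\hat{V}^{(i+1)}$, $\hat{w}^{(i)}$ and $\hat{u}^{(i)}$ respectively, yields the following residual error:
$$\sigma^{(i)}(x(t), u(t), w(t)) \triangleq \int_t^{t+\Delta t} [u(\tau) - \hat{u}^{(i)}(x(\tau))]^T g^T(x(\tau)) \nabla\varphi^T(x(\tau)) \theta^{(i+1)} d\tau + \int_t^{t+\Delta t} [w(\tau) - \hat{w}^{(i)}(x(\tau))]^T k^T(x(\tau)) \nabla\varphi^T(x(\tau)) \theta^{(i+1)} d\tau + [\varphi(x(t)) - \varphi(x(t+\Delta t))]^T \theta^{(i+1)} - \int_t^{t+\Delta t} \left(h^T(x(\tau)) h(x(\tau)) + \|\hat{u}^{(i)}(x(\tau))\|_R^2 - \gamma^2 \|\hat{w}^{(i)}(x(\tau))\|^2\right) d\tau \tag{24}$$
For notation simplicity, define
$$\rho_{\Delta\varphi}(x(t)) \triangleq [\varphi(x(t)) - \varphi(x(t+\Delta t))]^T, \quad \rho_{g\varphi}(x(t)) \triangleq \int_t^{t+\Delta t} \nabla\varphi \, g R^{-1} g^T \nabla\varphi^T d\tau, \quad \rho_{k\varphi}(x(t)) \triangleq \int_t^{t+\Delta t} \nabla\varphi \, k k^T \nabla\varphi^T d\tau,$$
$$\rho_{u\varphi}(x(t), u(t)) \triangleq \int_t^{t+\Delta t} u^T g^T \nabla\varphi^T d\tau, \quad \rho_{w\varphi}(x(t), w(t)) \triangleq \int_t^{t+\Delta t} w^T k^T \nabla\varphi^T d\tau, \quad \rho_{h}(x(t)) \triangleq \int_t^{t+\Delta t} h^T h \, d\tau,$$
then, with (22) and (23), expression (24) is rewritten as
$$\sigma^{(i)} = \rho_{u\varphi}\theta^{(i+1)} + \frac{1}{2}(\theta^{(i)})^T \rho_{g\varphi}\theta^{(i+1)} + \rho_{w\varphi}\theta^{(i+1)} - \frac{1}{2}\gamma^{-2}(\theta^{(i)})^T \rho_{k\varphi}\theta^{(i+1)} + \rho_{\Delta\varphi}\theta^{(i+1)} - \frac{1}{4}(\theta^{(i)})^T \rho_{g\varphi}\theta^{(i)} + \frac{1}{4}\gamma^{-2}(\theta^{(i)})^T \rho_{k\varphi}\theta^{(i)} - \rho_h \tag{25}$$
For description convenience, expression (25) is represented in the compact form
$$\sigma^{(i)} = \bar{\rho}^{(i)} \theta^{(i+1)} - \pi^{(i)} \tag{26}$$
where
$$\bar{\rho}^{(i)} \triangleq \rho_{u\varphi} + \frac{1}{2}(\theta^{(i)})^T \rho_{g\varphi} + \rho_{w\varphi} - \frac{1}{2}\gamma^{-2}(\theta^{(i)})^T \rho_{k\varphi} + \rho_{\Delta\varphi}, \qquad \pi^{(i)} \triangleq \frac{1}{4}(\theta^{(i)})^T \rho_{g\varphi}\theta^{(i)} - \frac{1}{4}\gamma^{-2}(\theta^{(i)})^T \rho_{k\varphi}\theta^{(i)} + \rho_h .$$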
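For description simplicity, denote $\bar{\rho}^{(i)} = [\bar{\rho}_1^{(i)} \ \cdots \ \bar{\rho}_L^{(i)}]$. Based on the method of weighted residuals [52], the unknown critic NN weight vector $\theta^{(i+1)}$ can be computed in such a way that the residual error $\sigma^{(i)}(x, u, w)$ (for $\forall t \geq 0$) of (26) is forced to be zero in some average sense. Thus, the residual error is projected onto $d\sigma^{(i)}/d\theta^{(i+1)}$ and the result is set to zero on the domain $\mathcal{D}$ using the inner product $\langle \cdot, \cdot \rangle_{\mathcal{D}}$, i.e.,
$$\left\langle d\sigma^{(i)}/d\theta^{(i+1)}, \ \sigma^{(i)} \right\rangle_{\mathcal{D}} = 0 \tag{27}$$
Then, the substitution of (26) into (27) yields
$$\left\langle \bar{\rho}^{(i)}, \bar{\rho}^{(i)} \right\rangle_{\mathcal{D}} \theta^{(i+1)} - \left\langle \bar{\rho}^{(i)}, \pi^{(i)} \right\rangle_{\mathcal{D}} = 0$$
where the notations $\langle \bar{\rho}^{(i)}, \bar{\rho}^{(i)} \rangle_{\mathcal{D}}$ and $\langle \bar{\rho}^{(i)}, \pi^{(i)} \rangle_{\mathcal{D}}$ are given by
$$\left\langle \bar{\rho}^{(i)}, \bar{\rho}^{(i)} \right\rangle_{\mathcal{D}} \triangleq \begin{bmatrix} \langle \bar{\rho}_1^{(i)}, \bar{\rho}_1^{(i)} \rangle_{\mathcal{D}} & \cdots & \langle \bar{\rho}_1^{(i)}, \bar{\rho}_L^{(i)} \rangle_{\mathcal{D}} \\ \vdots & \ddots & \vdots \\ \langle \bar{\rho}_L^{(i)}, \bar{\rho}_1^{(i)} \rangle_{\mathcal{D}} & \cdots & \langle \bar{\rho}_L^{(i)}, \bar{\rho}_L^{(i)} \rangle_{\mathcal{D}} \end{bmatrix}$$
and
$$\left\langle \bar{\rho}^{(i)}, \pi^{(i)} \right\rangle_{\mathcal{D}} \triangleq \left[ \langle \bar{\rho}_1^{(i)}, \pi^{(i)} \rangle_{\mathcal{D}} \ \cdots \ \langle \bar{\rho}_L^{(i)}, \pi^{(i)} \rangle_{\mathcal{D}} \right]^T .$$
Thus, $\theta^{(i+1)}$ can be obtained with
$$\theta^{(i+1)} = \left\langle \bar{\rho}^{(i)}, \bar{\rho}^{(i)} \right\rangle_{\mathcal{D}}^{-1} \left\langle \bar{\rho}^{(i)}, \pi^{(i)} \right\rangle_{\mathcal{D}} \tag{28}$$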
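The computation of the inner products $\langle \bar{\rho}^{(i)}, \bar{\rho}^{(i)} \rangle_{\mathcal{D}}$ and $\langle \bar{\rho}^{(i)}, \pi^{(i)} \rangle_{\mathcal{D}}$ involves many numerical integrals on the domain $\mathcal{D}$, which are computationally expensive. Thus, the Monte-Carlo integration method [53] is introduced, which is especially competitive on multi-dimensional domains. We now illustrate the Monte-Carlo integration for computing $\langle \bar{\rho}^{(i)}, \bar{\rho}^{(i)} \rangle_{\mathcal{D}}$. Let $I_{\mathcal{D}} \triangleq \int_{\mathcal{D}} d(x,u,w)$, and let $\mathcal{S}_M \triangleq \{(x_m, u_m, w_m) \mid (x_m, u_m, w_m) \in \mathcal{D}, \ m = 1, 2, \ldots, M\}$ be the set sampled on the domain $\mathcal{D}$, where $M$ is the size of the sample set $\mathcal{S}_M$. Then, $\langle \bar{\rho}^{(i)}, \bar{\rho}^{(i)} \rangle_{\mathcal{D}}$ is approximately computed with
$$\left\langle \bar{\rho}^{(i)}, \bar{\rho}^{(i)} \right\rangle_{\mathcal{D}} = \int_{\mathcal{D}} [\bar{\rho}^{(i)}]^T \bar{\rho}^{(i)} \, d(x,u,w) \approx \frac{I_{\mathcal{D}}}{M} \sum_{m=1}^{M} [\bar{\rho}^{(i)}(x_m, u_m, w_m)]^T \bar{\rho}^{(i)}(x_m, u_m, w_m) .$$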
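The Monte-Carlo idea above can be sketched in a few lines. The example below is a generic illustration rather than the paper's implementation: it estimates an inner product of two scalar functions over a box-shaped domain by uniform sampling, scaling the sample mean by the domain volume. The function name `mc_inner_product`, the box-shaped domain, the uniform sampling and the test functions are all assumptions made for this sketch.

```python
import random


def mc_inner_product(s1, s2, bounds, num_samples=20000, seed=0):
    """Estimate <s1, s2>_D = integral over D of s1(p) * s2(p) dp by uniform sampling.

    `bounds` is a list of (low, high) pairs describing a box-shaped domain D;
    `s1` and `s2` take a list of coordinates and return a float.
    """
    rng = random.Random(seed)
    volume = 1.0
    for low, high in bounds:
        volume *= high - low          # plays the role of the domain volume
    total = 0.0
    for _ in range(num_samples):
        point = [rng.uniform(low, high) for low, high in bounds]
        total += s1(point) * s2(point)
    return volume * total / num_samples   # volume times the sample mean


if __name__ == "__main__":
    # Integral of x*y over [0, 1]^2 is exactly 0.25.
    print(mc_inner_product(lambda p: p[0], lambda p: p[1], [(0.0, 1.0), (0.0, 1.0)]))
```

For vector-valued integrands, the same estimator is applied entry by entry, which yields the matrix and vector inner products needed for the weight update. The estimation error decreases roughly as $1/\sqrt{M}$ regardless of the dimension of the domain, which is why sampling is attractive on multi-dimensional domains.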