, Sutton and Barto suggested a definition of an RL method: any method that is well suited to solving the RL problem can be regarded as an RL method, where the RL problem is defined in terms of optimal control of Markov decision processes. This definition directly links RL methods to the control community. Moreover, RL methods can find an optimal control policy in an unknown environment, which makes RL a promising approach for control design of real systems.
In the past few years, many RL approaches [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23] have been introduced for solving optimal control problems. In particular, some important results were reported on using RL to solve the optimal control problem of discrete-time systems [7, 10, 14, 17, 18, 22]. For example, Liu and Wei suggested a finite-approximation-error-based iterative adaptive dynamic programming approach and a novel policy iteration (PI) method for discrete-time nonlinear systems. For continuous-time systems, Murray et al. presented two PI algorithms that avoid the need to know the internal system dynamics. Vrabie et al. [8, 9, 13] built on the idea of PI and proposed the important framework of integral reinforcement learning (IRL). Modares et al. developed an experience-replay-based IRL algorithm for nonlinear partially unknown constrained-input systems. An online neural network (NN) based decentralized control strategy has also been developed for stabilizing a class of continuous-time nonlinear interconnected large-scale systems. In addition, it is worth mentioning that RL ideas have also been introduced to solve the optimal control problem of partial differential equation systems [6, 12, 15, 16, 23]. However, for most practical systems, external disturbances are unavoidable.
To reduce the effects of disturbances, a robust controller is required for disturbance rejection. One effective solution is the $H_\infty$ control method, which achieves disturbance attenuation in the $L_2$-gain setting [24, 25, 26], that is, it designs a controller such that the ratio of the objective output energy to the disturbance energy is less than a prescribed level. Over the past few decades, a large number of theoretical results on nonlinear $H_\infty$ control have been reported [27, 28, 29], in which the $H_\infty$ control problem is transformed into solving the so-called Hamilton-Jacobi-Isaacs (HJI) equation. However, the HJI equation is a nonlinear partial differential equation (PDE), which is difficult or impossible to solve analytically and may not have a global analytic solution even in simple cases.
Thus, several works have sought to solve the HJI equation approximately [27, 30, 31, 32, 33, 34, 35]. It has been shown that there exists a sequence of policy iterations on the control input such that the HJI equation is successively approximated by a sequence of Hamilton-Jacobi-Bellman (HJB)-like equations, so that methods for solving the HJB equation can be extended to the HJI equation. The HJB equation was in turn successively approximated by a sequence of linear PDEs, which were solved with Galerkin approximation in [30, 37, 38]. The successive approximation method was also extended to solve the discrete-time HJI equation. Similarly, a policy iteration scheme was developed for constrained-input systems. To implement this scheme, a neuro-dynamic programming approach and an online adaptive method were introduced. This approach suits the case where the saddle point exists; the situation where a smooth saddle point does not exist has also been considered. A synchronous policy iteration method was later developed as an extension of earlier work. To improve the efficiency of computing the solution of the HJI equation, Luo and Wu proposed a computationally efficient simultaneous policy update algorithm (SPUA). In addition, the solution of the HJI equation has been approximated by a Taylor series expansion, with an efficient algorithm furnished to generate the coefficients of the series. Most of these methods [27, 30, 31, 32, 33, 35, 40, 44, 45] are model-based, requiring the full system model. However, an accurate system model is usually unavailable or costly to obtain for many practical systems. Thus, some RL approaches have been proposed for $H_\infty$ control design of linear systems [46, 47] and nonlinear systems with unknown internal system model.
However, these methods are on-policy learning approaches [32, 41, 46, 47, 48, 49], where the cost function must be evaluated using system data generated with the policies being evaluated. Applying on-policy learning to real $H_\infty$ control problems has several drawbacks (to be discussed in Section III).
To overcome these drawbacks, this paper introduces an off-policy RL method to solve the nonlinear continuous-time $H_\infty$ control problem with unknown internal system model. The rest of the paper is organized as follows. Sections II and III present the problem description and the motivation. The off-policy learning methods for nonlinear systems and linear systems are developed in Sections IV and V, respectively. Simulation studies are conducted in Section VI, and a brief conclusion is given in Section VII.
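Notations: $\mathbb{R}$, $\mathbb{R}^n$ and $\mathbb{R}^{n \times m}$ are the set of real numbers, the $n$-dimensional Euclidean space and the set of all real $n \times m$ matrices, respectively.
$\|\cdot\|$ denotes the vector norm or matrix norm in $\mathbb{R}^n$ or $\mathbb{R}^{n \times m}$, respectively. The superscript $T$ is used for the transpose and $I$ denotes the identity matrix of appropriate dimension. $\nabla \triangleq \partial/\partial x$ denotes the gradient operator. For a symmetric matrix $M$, $M > (\geq)\, 0$ means that it is a positive (semi-positive) definite matrix. $\|v\|_M^2 \triangleq v^T M v$ for some real vector $v$ and symmetric matrix $M > (\geq)\, 0$ with appropriate dimensions. $C^1(\mathcal{X})$ is the space of functions on $\mathcal{X}$ whose first derivatives are continuous. $L_2[0, \infty)$ is a Banach space: for $w(t) \in L_2[0, \infty)$, $\int_0^\infty \|w(t)\|^2 dt < \infty$. Let $\mathcal{X}$, $\mathcal{U}$ and $\mathcal{W}$ be compact sets, and denote $\mathcal{D} \triangleq \{(x, u, w) \mid x \in \mathcal{X}, u \in \mathcal{U}, w \in \mathcal{W}\}$. For column vector functions $s_1(x, u, w)$ and $s_2(x, u, w)$, where $(x, u, w) \in \mathcal{D}$, define the inner product $\langle s_1, s_2 \rangle_{\mathcal{D}} \triangleq \int_{\mathcal{D}} s_1^T s_2 \, d(x, u, w)$ and the norm $\|s_1\|_{\mathcal{D}} \triangleq \langle s_1, s_1 \rangle_{\mathcal{D}}^{1/2}$. $H^{m,p}(\mathcal{X})$ is a Sobolev space that consists of functions in $L^p(\mathcal{X})$ whose derivatives up to order $m$ are also in $L^p(\mathcal{X})$.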
II Problem description
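Let us consider the following affine nonlinear continuous-time dynamical system:
$$\dot{x}(t) = f(x(t)) + g(x(t))u(t) + k(x(t))w(t) \tag{1}$$
$$z(t) = h(x(t)) \tag{2}$$
where $x \in \mathcal{X} \subset \mathbb{R}^n$ is the state, $u \in \mathcal{U} \subset \mathbb{R}^m$ is the control input with $u(t) \in L_2[0, \infty)$, $w \in \mathcal{W} \subset \mathbb{R}^q$ is the external disturbance with $w(t) \in L_2[0, \infty)$, and $z \in \mathbb{R}^p$ is the objective output. $f(x)$ is Lipschitz continuous on a set $\mathcal{X}$ that contains the origin, and $f(0) = 0$. $f(x)$ represents the internal system model, which is assumed to be unknown in this paper. $g(x)$, $k(x)$ and $h(x)$ are known continuous vector or matrix functions of appropriate dimensions.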
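The $H_\infty$ control problem under consideration is to find a state feedback control law $u(x)$ such that system (1) is closed-loop asymptotically stable and, for a weighting matrix $R > 0$, has $L_2$-gain less than or equal to a prescribed level $\gamma > 0$, that is,
$$\int_0^\infty \left( \|z(t)\|^2 + \|u(t)\|_R^2 \right) dt \leq \gamma^2 \int_0^\infty \|w(t)\|^2 dt \tag{3}$$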
where and . Then, the closed-loop system with the state feedback control
III Motivation from investigation of related work
From Lemma 1, it is noted that the $H_\infty$ control policy (5) relies on the solution of the HJI equation (4). Therefore, a model-based iterative method was proposed, where the HJI equation is successively approximated by a sequence of linear PDEs:
and then update control and disturbance policies with
Note that the iterative scheme (6)-(8) hinges on the solution of the linear PDE (6). Thus, several related methods were developed, such as Galerkin approximation, synchronous policy iteration, neuro-dynamic programming [31, 40], an online adaptive control method for constrained-input systems, and a Galerkin approximation method for discrete-time systems. The iteration (6)-(8) generates two iterative loops, since the control and disturbance policies are updated at different iterative steps: an inner loop that updates the disturbance policy and an outer loop that updates the control policy. The outer loop is not activated until the inner loop has converged, which results in low efficiency. Therefore, Luo and Wu proposed a simultaneous policy update algorithm (SPUA), where the control and disturbance policies are updated at the same iterative step, so that only one iterative loop is required. It is worth noting that the word “simultaneous” in the SPUA and the words “synchronous/simultaneous” in [32, 41] have different meanings. The former emphasizes the same “iterative step,” while the latter emphasizes the same “time instant”. In other words, the SPUA updates the control and disturbance policies at the same iterative step, while the algorithms in [32, 41] update them at different iterative steps.
The procedure of model-based SPUA is given in Algorithm 1.
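Step 1: Give an initial function $V^{(0)}$ from the admissible set determined by Lemma 5 of the SPUA reference. Let $i = 0$;
Step 2: Update the control and disturbance policies with
$$u^{(i)}(x) = -\frac{1}{2} R^{-1} g^T(x) \nabla V^{(i)}(x) \tag{9}$$
$$w^{(i)}(x) = \frac{1}{2} \gamma^{-2} k^T(x) \nabla V^{(i)}(x) \tag{10}$$
Step 3: Solve the following linear PDE for $V^{(i+1)}(x)$:
$$[\nabla V^{(i+1)}]^T \left( f + g u^{(i)} + k w^{(i)} \right) + h^T h + \|u^{(i)}\|_R^2 - \gamma^2 \|w^{(i)}\|^2 = 0 \tag{11}$$
where $V^{(i+1)}(x) \in C^1(\mathcal{X})$ and $V^{(i+1)}(0) = 0$.
Step 4: Let $i = i + 1$, go back to Step 2 and continue.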
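To make the single-loop structure of Algorithm 1 concrete, the following is a minimal sketch for a hypothetical scalar linear system $\dot{x} = ax + bu + kw$ with $h^T h = q x^2$ and cost function $V(x) = p x^2$. It assumes the policy updates take the usual form $u^{(i)} = -\frac{1}{2} R^{-1} g^T \nabla V^{(i)}$ and $w^{(i)} = \frac{1}{2}\gamma^{-2} k^T \nabla V^{(i)}$, and that the linear PDE is $[\nabla V^{(i+1)}]^T (f + g u^{(i)} + k w^{(i)}) + h^T h + \|u^{(i)}\|_R^2 - \gamma^2 \|w^{(i)}\|^2 = 0$. The function names, the numerical values and the initial guess $p = 0$ are illustrative assumptions, not taken from the paper.

```python
import math


def spua_scalar(a, b, k, q, r, gamma, p0=0.0, iters=20):
    """Model-based simultaneous policy update for V(x) = p * x**2.

    Assumed scalar system: x' = a*x + b*u + k*w, with h(x)^2 = q*x**2.
    Returns the final coefficient p and the list of all iterates.
    """
    p = p0
    history = [p]
    for _ in range(iters):
        # Step 2: both policies are updated at the same iterative step.
        ku = -b * p / r            # u_i(x) = ku * x
        kw = k * p / gamma ** 2    # w_i(x) = kw * x
        a_cl = a + b * ku + k * kw
        if a_cl >= 0.0:
            raise ValueError("closed loop is not stable for this iterate")
        # Step 3: for V = p_next * x**2 the linear PDE collapses to
        # 2*p_next*a_cl + q + r*ku**2 - gamma**2*kw**2 = 0.
        p = -(q + r * ku ** 2 - gamma ** 2 * kw ** 2) / (2.0 * a_cl)
        history.append(p)
    return p, history


def hji_residual(p, a, b, k, q, r, gamma):
    """Residual of the scalar HJI equation for V(x) = p * x**2."""
    return 2.0 * a * p + q - p * p * (b * b / r - k * k / gamma ** 2)
```

For instance, `spua_scalar(-1.0, 1.0, 1.0, 1.0, 1.0, 2.0)` can be checked by passing the returned coefficient to `hji_residual`, which should be close to zero once the iteration has converged. In this scalar setting the recursion coincides with Newton's method applied to the quadratic HJI residual, which is consistent with the Newton interpretation of the SPUA.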
It is worth noting that Algorithm 1 is an infinite iterative procedure, which is used for theoretical analysis rather than for implementation. That is to say, Algorithm 1 converges to the solution of the HJI equation (4) as the iteration goes to infinity. By constructing a fixed point equation, the convergence of Algorithm 1 was established by proving that it is essentially a Newton's iteration method for finding the fixed point. As the index $i$ increases, the sequence $\{V^{(i)}\}$ obtained by the SPUA with equations (9)-(11) converges to the solution of the HJI equation (4), i.e., $V^{(i)} \to V^*$ as $i \to \infty$.
It is necessary to explain the rationale for using equations (9) and (10) to update the control and disturbance policies. The $H_\infty$ control problem (1)-(3) can be viewed as a two-player zero-sum differential game problem [26, 32, 33, 34, 40, 41, 47]. The game problem is a minimax problem, where the control policy acts as the minimizing player and the disturbance policy as the maximizing player. The game problem aims at finding the saddle point $(u^*, w^*)$, where $u^*$ is given by expression (5) and $w^*$ is given by $w^*(x) = \frac{1}{2} \gamma^{-2} k^T(x) \nabla V^*(x)$. Correspondingly, for the $H_\infty$ control problem (1)-(3), $u^*$ and $w^*$ are the associated $H_\infty$ control policy and the worst disturbance signal [26, 31, 32, 34, 40, 47], respectively. Thus, it is reasonable to use expressions (9) and (10), which are consistent in form with $u^*$ and $w^*$, to update the control and disturbance policies. Similar policy update methods can be found in [27, 30, 31, 34, 40, 41].
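The form of the saddle-point policies can be recovered with a short calculation. The sketch below assumes the standard Hamiltonian of the zero-sum game for system (1) with output (2) and weighting matrix $R > 0$; the symbol $\mathcal{H}$ is introduced here for illustration only.

```latex
\begin{align*}
\mathcal{H}(x, \nabla V, u, w)
  &= [\nabla V]^T \big( f(x) + g(x)u + k(x)w \big)
     + h^T(x) h(x) + u^T R u - \gamma^2 w^T w, \\
\frac{\partial \mathcal{H}}{\partial u}
  &= g^T(x) \nabla V + 2 R u = 0
  \;\Longrightarrow\;
  u = -\tfrac{1}{2} R^{-1} g^T(x) \nabla V, \\
\frac{\partial \mathcal{H}}{\partial w}
  &= k^T(x) \nabla V - 2 \gamma^2 w = 0
  \;\Longrightarrow\;
  w = \tfrac{1}{2} \gamma^{-2} k^T(x) \nabla V .
\end{align*}
```

Since $\mathcal{H}$ is convex in $u$ (because $R > 0$) and concave in $w$, the first stationary point is a minimizer and the second a maximizer, which matches the minimizing and maximizing roles of the two players.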
Observe that both iterative equations (6) and (11) require the full system model. For the $H_\infty$ control problem in which the internal system dynamics are unknown, data-based methods [47, 48] were suggested to solve the HJI equation online. However, most existing online methods are on-policy learning approaches [32, 41, 47, 48, 49]. By the definition of on-policy learning, the cost function should be evaluated with data generated from the policies being evaluated. For example, the solution of equation (6) is the cost function of the current control and disturbance policies, which means that it should be evaluated with system data generated by applying those policies. These on-policy learning approaches for solving the $H_\infty$ control problem have several drawbacks:
1) In real implementations of on-policy learning methods [32, 41, 48, 49], approximations of the evaluating control and disturbance policies (rather than the actual policies) are used to generate data for learning their cost function. In other words, the on-policy learning methods use “inaccurate” data to learn the cost function, which increases the accumulated error. For example, to learn the cost function in equation (6), approximate policies are employed to generate data rather than the actual policies, which are usually unknown because of estimation error;
2) The evaluating control and disturbance policies are required to generate data for on-policy learning, so the disturbance signal must be adjustable, which is impractical for most real systems;
3) It is known [2, 50] that “exploration” is extremely important in RL for learning the optimal control policy, and that a lack of exploration during the learning process may lead to divergence. Nevertheless, for on-policy learning, exploration is restricted because only the evaluating policies can be used to generate data. A survey of the literature shows that the “exploration” issue is rarely discussed in existing work that uses RL techniques for control design;
4) Most existing approaches [32, 41, 47, 48, 49] are implemented online, so they are difficult to use for real-time control because the learning process is often time-consuming. Furthermore, online control design approaches use only current data and discard past data, which means that the measured system data is used only once and is therefore utilized inefficiently.
To overcome the drawbacks mentioned above, we propose an off-policy RL approach to solve the $H_\infty$ control problem with unknown internal system dynamics $f(x)$.
IV Off-policy reinforcement learning for $H_\infty$ control
In this section, an off-policy RL method for $H_\infty$ control design is derived and its convergence is proved. Then, an NN-based critic-actor structure is developed for implementation.
IV-A Off-policy reinforcement learning
To derive the off-policy RL method, we rewrite system (1) as:
$$\dot{x} = f(x) + g(x)u^{(i)}(x) + k(x)w^{(i)}(x) + g(x)[u - u^{(i)}(x)] + k(x)[w - w^{(i)}(x)] \tag{12}$$
It is observed from equation (IV-A) that the cost function $V^{(i+1)}$ can be learned using arbitrary input signals $u$ and $w$, rather than the evaluating policies $u^{(i)}$ and $w^{(i)}$. Then, replacing the linear PDE (11) in Algorithm 1 with (IV-A) results in the off-policy RL method. To show its convergence, Theorem 1 establishes the equivalence between the iterative equations (11) and (IV-A).
Proof. From the derivation of equation (IV-A), it is concluded that if $V^{(i+1)}(x)$ is the solution of the linear PDE (11), then $V^{(i+1)}(x)$ also satisfies equation (IV-A). To complete the proof, we have to show that $V^{(i+1)}(x)$ is the unique solution of equation (IV-A). The proof is by contradiction.
Before starting the contradiction proof, we derive a simple fact. Consider
From (IV-A), we have
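This means that equation (IV-A) holds for all $u$ and $w$. Letting $u = u^{(i)}$ and $w = w^{(i)}$, we have
Then, the difference between the two solutions equals a real constant for all $x$, and this constant is zero because both solutions vanish at the origin. Thus, the two solutions coincide for all $x$, which contradicts the assumption. This completes the proof.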
It follows from Theorem 1 that equation (IV-A) is equivalent to equation (11), and thus the convergence of the off-policy RL is guaranteed, i.e., the solution of the iterative equation (IV-A) converges to the solution of the HJI equation (4) as the iteration step increases. Unlike equation (11) in Algorithm 1, the off-policy RL with equation (IV-A) uses system data instead of the internal system dynamics $f(x)$. Hence, the off-policy RL can be regarded as a direct learning method for $H_\infty$ control design, which avoids the identification of $f(x)$. In fact, the information about $f(x)$ is embedded in the measured system data. That is to say, the lack of knowledge about $f(x)$ has no impact on the ability of the off-policy RL to obtain the solution of the HJI equation (4) and the $H_\infty$ control policy. It is worth pointing out that equation (IV-A) is similar in form to the IRL [8, 9], which is an important framework for control design of continuous-time systems. The IRL in [8, 9] is an online optimal control learning algorithm for partially unknown systems.
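The data-based character of the method can be illustrated on the same kind of hypothetical scalar system $\dot{x} = ax + bu + kw$ with $V(x) = p x^2$. The sketch below assumes that the off-policy equation has the integral form $V^{(i+1)}(x(t+\Delta t)) - V^{(i+1)}(x(t)) = \int_t^{t+\Delta t} [\nabla V^{(i+1)}]^T \big( g[u - u^{(i)}] + k[w - w^{(i)}] \big) d\tau - \int_t^{t+\Delta t} \big( h^T h + \|u^{(i)}\|_R^2 - \gamma^2 \|w^{(i)}\|^2 \big) d\tau$, which is what differentiating $V^{(i+1)}$ along the rewritten system and integrating gives. The exploratory input signals, the Euler simulator, the window length and all function names are illustrative assumptions; the simulator only stands in for a real plant, and the learning step never reads the drift coefficient `a`.

```python
import math


def simulate(a, b, k, x0, dt, steps):
    """Euler simulation of x' = a*x + b*u + k*w under arbitrary inputs.

    Stands in for measuring a real plant; returns samples of x, u and w.
    """
    xs, us, ws = [x0], [], []
    x = x0
    for n in range(steps):
        t = n * dt
        u = 0.5 * math.sin(3.0 * t)   # arbitrary behaviour control input
        w = 0.3 * math.cos(2.0 * t)   # arbitrary disturbance signal
        us.append(u)
        ws.append(w)
        x = x + dt * (a * x + b * u + k * w)
        xs.append(x)
    return xs, us, ws


def off_policy_step(p, xs, us, ws, b, k, q, r, gamma, dt, window):
    """One off-policy iteration for V(x) = p * x**2, using recorded data only.

    The drift coefficient a is never used here; only b, k and data appear.
    """
    ku = -b * p / r            # evaluating control policy u_i(x) = ku * x
    kw = k * p / gamma ** 2    # evaluating disturbance policy w_i(x) = kw * x
    num = 0.0
    den = 0.0
    for start in range(0, len(us) - window + 1, window):
        end = start + window
        cross = 0.0
        cost = 0.0
        for n in range(start, end):
            x = xs[n]
            # Integrand of the policy-mismatch term, divided by p_next.
            cross += 2.0 * x * (b * (us[n] - ku * x) + k * (ws[n] - kw * x)) * dt
            # Integrand of the running cost under the evaluating policies.
            cost += (q + r * ku ** 2 - gamma ** 2 * kw ** 2) * x * x * dt
        # Each window gives one linear equation phi * p_next = -cost.
        phi = xs[end] ** 2 - xs[start] ** 2 - cross
        num += phi * (-cost)
        den += phi * phi
    return num / den   # least-squares solution over all windows
```

A possible usage is to record one trajectory with `simulate` and then call `off_policy_step` repeatedly on that same data set, starting from `p = 0.0`. Up to discretization error, the iterates should approach the coefficient produced by the model-based recursion, even though the data were generated by inputs unrelated to the evaluating policies and are reused at every iteration.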
IV-B Implementation based on neural network
To solve equation (IV-A) for the unknown function $V^{(i+1)}(x)$ based on system data, we develop an NN-based actor-critic structure. From the well-known high-order Weierstrass approximation theorem, a continuous function can be represented by an infinite-dimensional linearly independent basis function set. In practical applications, it is usually required to approximate the function on a compact set with a finite-dimensional function set. We consider the critic NN for approximating the cost function $V^{(i)}(x)$ on a compact set $\mathcal{X}$. Let
$\varphi(x) \triangleq [\varphi_1(x) \ \cdots \ \varphi_L(x)]^T$ be the vector of linearly independent activation functions for the critic NN, where
$L$ is the number of critic NN hidden layer neurons. Then, the output of the critic NN is given by
for , and is the Jacobian of . Expressions (22) and (23) can be viewed as actor NNs for the disturbance and control policies respectively, where and are the activation function vectors and is the actor NN weight vector.
For notational simplicity, define
then expression (IV-B) is rewritten as
For convenience of description, expression (IV-B) is written in the compact form
For description simplicity, denote . Based on the method of weighted residuals , the unknown critic NN weight vector can be computed in such a way that the residual error (for ) of (26) is forced to be zero in some average sense. Thus, projecting the residual error onto and setting the result to zero on domain using the inner product, , i.e.,
where the notations and are given by
Thus, can be obtained with
Computing these inner products involves many numerical integrals over the domain $\mathcal{D}$, which are computationally expensive. Thus, the Monte-Carlo integration method is introduced, which is especially competitive on multi-dimensional domains. We now illustrate Monte-Carlo integration for computing one of these inner products. Let $I_{\mathcal{D}} \triangleq \int_{\mathcal{D}} d(x, u, w)$, and let $S_M$ be a set of samples drawn on domain $\mathcal{D}$, where $M$ is the size of the sample set $S_M$. Then, the inner product is approximately computed with