1 Introduction
Reinforcement learning (RL) has achieved impressive success in a variety of domains, such as the games of Go [30] and Atari [19, 20]. These successes have led to interest in solving real-world problems such as robotics [34, 9], autonomous driving [24, 1], and unmanned aerial vehicles [14, 15]. While providing high performance, RL algorithms lack the safety certificates required for safety-critical systems. The difficulty of obtaining such certificates is seen in, for example, stability guarantees for a system controlled by a neural network. A neural network generally has nonlinear activation functions. As the number or size of the hidden layers increases, the feedback system tends to become highly nonlinear, which causes stability tests for the neural network controller to fail. Providing stability guarantees therefore requires an understanding of both stability theory and RL algorithms, which suggests that stability-certified RL is an essential topic in both control theory and RL research.
The Lyapunov theorem is fundamental for the stability analysis of dynamical systems. Analysis based on the Lyapunov theorem requires constructing a Lyapunov function to ensure stability [13]. Lyapunov stability methods have been widely used in control engineering [13] and have recently been investigated for RL algorithms [8]. As another approach, input-output stability is also used for the same purpose [10]. Unlike these approaches, this article revisits spectral normalization [18] and proposes two methods from different perspectives. The first ensures the global stability of the feedback system based on the small-gain theorem, with the stability condition derived from spectral normalization. However, this condition is highly conservative and may result in insufficient control performance. To overcome this difficulty, the second method improves performance while ensuring local stability with a larger region of attraction (ROA). In the second method, stability is verified by solving linear matrix inequalities (LMIs) after training the neural network controller. Spectral normalization improves the feasibility of this a posteriori stability test by constructing tighter local sectors. The contributions of this article are summarized as follows:
In the development of the first method, the relationship between spectral normalization and the global stability of the feedback system is described from a control point of view.

Regarding local stability, the second method improves the feasibility of the a posteriori stability test and enlarges the ROA.

Numerical experiments show that the second method provides a stability certificate while giving almost the same performance level as the existing RL algorithms.
The layout of this article is as follows. Section 2 describes related work. Section 3 describes the problem formulation and the stability analysis of the feedback system. Section 4 describes stability-certified RL based on spectral normalization. Section 5 gives the results of numerical experiments, and Section 6 concludes the article.
The notation is fairly standard. The symbol $\mathbb{S}^n_{++}$ denotes the set of $n$-by-$n$ symmetric positive definite matrices, and $P \succ 0$ ($P \prec 0$) means that $P$ is positive (negative) definite. If $v, w \in \mathbb{R}^n$, then the inequality $v \le w$ is elementwise, i.e. $v_i \le w_i$ for $i = 1, \dots, n$. The symbol $\|x\|$ denotes the 2-norm of the signal $x$, and $\|x\|_{\ell_2}$ denotes the total energy of a discrete-time signal $x$, i.e., $\|x\|_{\ell_2} = \sqrt{\sum_{k=0}^{\infty} \|x(k)\|^2}$.
2 Related Works
This article is closely related to work on RL with stability guarantees. The key points of the related work are summarized as follows.
The first is the robust control framework, which ensures the nominal/robust stability of systems in the presence of nonlinearities and uncertainties. In Ref. [16], integral quadratic constraints (IQCs) are used to deal with the nonlinear activation function and the variation of the weight matrices in the neural network. The IQC framework has been successfully combined with RL research on recurrent neural networks [2], semidefinite feasibility problems with bounded gradients [10], and local quadratic constraints for obtaining the ROA [32]. As other examples using the robust control framework, a neural network certification algorithm [31] and learning control methods [21, 7] have been proposed to obtain robustness against adversarial conditions in environments; see also [17, 4, 33].
The second key point is the technique of jointly learning the policy and the Lyapunov function. In Refs. [3, 25], a candidate Lyapunov function is represented by a neural network, and the policy and the Lyapunov function are then obtained through training, which provides a larger ROA than that of the classical linear quadratic regulator (LQR). In Ref. [11], a barrier function is also learned to ensure that the policy does not run into unsafe regions. These methods can be applied to a large class of nonlinear systems and have the potential for further improvement.
This article differs from the above works in that the stability constraint is given by spectral normalization. Specifically, this article first proposes a method for ensuring the global stability of the feedback system. In the first method, the gain of the feedback system is bounded to be less than 1, that is, spectral normalization is performed so as to satisfy the small-gain theorem. However, the first method may provide insufficient performance due to its conservative stability condition. To improve practicality, the second method is proposed, which provides sufficient performance while ensuring local stability with a larger ROA. In summary, this article presents a trade-off between stability and performance of spectral normalization on the neural network controller through theoretical groundwork and numerical experiments.
3 Preliminaries
3.1 Problem Formulation
Let us consider the feedback system consisting of a plant $G$ and a controller $\pi$ as shown in Fig. 1. In this article, we assume that the plant is a discrete-time linear time-invariant (LTI) model represented as follows:
$$x(k+1) = A x(k) + B u(k), \tag{1}$$
where $x(k) \in \mathbb{R}^{n_x}$, $u(k) \in \mathbb{R}^{n_u}$, $A$, and $B$ are the state, the input, the state matrix, and the input matrix, respectively. The controller is an $\ell$-layer neural network with nonlinear activation functions in the hidden layers:
$$w^0(k) = x(k), \tag{2a}$$
$$w^i(k) = \phi^i\big(W^i w^{i-1}(k) + b^i\big), \quad i = 1, \dots, \ell, \tag{2b}$$
$$u(k) = W^{\ell+1} w^{\ell}(k) + b^{\ell+1}, \tag{2c}$$
where $w^i \in \mathbb{R}^{n_i}$ is the output from the $i$-th layer. Symbols $W^i$ and $b^i$ are respectively the weight matrix and the bias of the $i$-th layer in the neural network. Symbol $\phi^i$ is the activation function, which is applied elementwise for a given vector $v \in \mathbb{R}^{n_i}$:
$$\phi^i(v) = [\varphi(v_1), \dots, \varphi(v_{n_i})]^\top, \tag{3}$$
where $\varphi$ is the scalar activation function, e.g., tanh, ReLU, and sigmoid. In RL, the neural network defined by Eq. (2) is trained by algorithms such as trust region policy optimization (TRPO) [27], proximal policy optimization (PPO) [28], and soft actor-critic (SAC) [6]. In this article, the controller is also referred to as the policy, in accordance with RL terminology.
3.2 Stability of Feedback Systems
3.2.1 Small-Gain Theorem
The following definition of the system stability is introduced.
Definition 1.
Stability
Consider a system whose input-output relation is represented by a mapping $H: u \mapsto y$.
The gain of the system is the worst-case ratio between the total output energy and the total input energy:
$$\gamma = \sup_{u \ne 0} \frac{\|y\|_{\ell_2}}{\|u\|_{\ell_2}}. \tag{4}$$
If $\gamma$ is finite, then the system is said to be finite-gain stable.
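For an LTI system, the gain in Definition 1 coincides with the $\mathcal{H}_\infty$ norm of its transfer function. The following is a minimal sketch of a frequency-grid estimate, assuming NumPy is available. The function name `l2_gain_lti`, the output matrix `C`, and the scalar test systems are illustrative assumptions and are not taken from the article. A grid only gives a lower estimate of the supremum, so a dedicated solver would be preferable when a certificate is needed.

```python
import numpy as np

def l2_gain_lti(A, B, C, n_grid=2001):
    """Grid estimate of the l2 gain of x(k+1) = A x(k) + B u(k), y(k) = C x(k)."""
    A, B, C = (np.atleast_2d(np.asarray(M, dtype=float)) for M in (A, B, C))
    # A pole on or outside the unit circle is treated here as "no finite gain".
    if np.max(np.abs(np.linalg.eigvals(A))) >= 1.0:
        return float("inf")
    n = A.shape[0]
    gain = 0.0
    for omega in np.linspace(0.0, np.pi, n_grid):
        z = np.exp(1j * omega)
        G = C @ np.linalg.solve(z * np.eye(n) - A, B)  # G(e^{j omega})
        gain = max(gain, np.linalg.norm(G, 2))         # largest singular value
    return float(gain)

# Hypothetical scalar systems: G(z) = 1 / (z - 0.5) and an unstable variant.
stable_gain = l2_gain_lti([[0.5]], [[1.0]], [[1.0]])
unstable_gain = l2_gain_lti([[1.1]], [[1.0]], [[1.0]])
```

For the stable example the peak occurs at zero frequency, where $|G(1)| = 1/0.5 = 2$; the unstable example is reported as having no finite gain.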
Let us consider the feedback connection shown in Fig. 2. The small-gain theorem gives a sufficient condition for the stability of the feedback connection.
Theorem 1.
Small-Gain Theorem
Suppose that the systems $H_1$ and $H_2$ have finite gains $\gamma_1$ and $\gamma_2$, respectively.
Suppose further that the feedback system is well defined in the sense that for every pair of inputs $u_1$ and $u_2$, there exist unique signal pairs $(e_1, e_2)$ and $(y_1, y_2)$ as shown in Fig. 2.
Then, the feedback connection is finite-gain stable if $\gamma_1 \gamma_2 < 1$.
Proof.
See [13]. ∎
3.2.2 Lyapunov Condition
In this subsection, the stability analysis for the neural network controller proposed in [32] is briefly described. By introducing local quadratic constraints (QCs) for the activation functions, the stability analysis in [32] provides stability certificates and estimates the region of attraction when global stability tests are infeasible. In accordance with [32], the following definitions are introduced to obtain the Lyapunov condition for the neural network controller. The state $x_*$ is an equilibrium point of the feedback system with input $u_*$ if the following conditions hold:
$$x_* = A x_* + B u_*, \tag{5a}$$
$$u_* = \pi(x_*). \tag{5b}$$
Suppose that $(x_*, u_*)$ satisfies Eq. (5). Then, the equilibrium point can be propagated through the neural network to obtain the equilibrium values $v_*^i$, $w_*^i$ for the inputs/outputs of each activation function $\phi^i$, and thus $(x_*, u_*, v_*, w_*)$ is an equilibrium point of Eqs. (1) and (2) if:
(6a)  
(6b)  
(6c) 
where is the combined nonlinearity by stacking the activation functions. The matrix is defined by
(7) 
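Finally, the ellipsoid $\mathcal{E}(P, x_*)$ is defined for $P \in \mathbb{S}^{n_x}_{++}$ and $x_* \in \mathbb{R}^{n_x}$:
$$\mathcal{E}(P, x_*) = \{x \in \mathbb{R}^{n_x} : (x - x_*)^\top P (x - x_*) \le 1\}. \tag{8}$$
With the above definitions, the following property and theorem are obtained from Ref. [32].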
Property 1.
(Local Sector [32]) Let be an equilibrium value of the activation input and be the corresponding value at the first layer. Select with . Let denote the activation input generated by Eq. (2b) from the input at the first layer. There exist such that satisfies the offset local sector around the point for all with .
Theorem 2.
(Lyapunov Condition [32]) Consider the feedback system of plant in (1) and the neural network controller in (2) with equilibrium point satisfying (6). Let , , and be given vectors satisfying Property 1 for the neural network. Denote the row of the first weight by and define matrices
(9) 
Define further matrices
(10) 
where with . If there exists a matrix and vector with such that
(11)  
(12) 
then: (i) the feedback system consisting of $G$ and $\pi$ is locally stable around $x_*$, and (ii) the set $\mathcal{E}(P, x_*)$ defined by Eq. (8) is an inner approximation of the ROA.
3.3 Spectral Normalization
Spectral normalization [18] was proposed for generative adversarial networks (GANs) [5]. The training of GANs is stabilized by regularizing the Lipschitz norm of the network layers. The spectral norm of the matrix $W$ is defined by
$$\sigma(W) = \max_{h \ne 0} \frac{\|W h\|_2}{\|h\|_2} = \max_{\|h\|_2 \le 1} \|W h\|_2, \tag{13}$$
which is equivalent to the largest singular value of $W$. The spectral normalization is defined by
$$\bar{W}_{\mathrm{SN}}(W) = \frac{W}{\sigma(W)}. \tag{14}$$
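The definitions above can be sketched numerically as follows, assuming NumPy. The spectral norm is computed exactly from the singular values and, for comparison, approximated by power iteration, which is a common way to avoid a full decomposition during training. The function names and the diagonal test matrix are illustrative assumptions.

```python
import numpy as np

def spectral_norm(W):
    """Largest singular value of W, i.e. the spectral norm."""
    return float(np.linalg.svd(W, compute_uv=False)[0])

def spectral_norm_power_iteration(W, n_iter=100, seed=0):
    """Approximate the spectral norm by power iteration on W^T W."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

def spectral_normalize(W):
    """Rescale W so that its spectral norm becomes 1."""
    return W / spectral_norm(W)

W = np.array([[3.0, 0.0], [0.0, 1.0]])   # singular values 3 and 1
sigma_exact = spectral_norm(W)
sigma_approx = spectral_norm_power_iteration(W)
W_sn = spectral_normalize(W)
```

For this matrix both estimates agree on the value 3, and the normalized matrix has spectral norm 1.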
4 Method
In this section, the relationship between spectral normalization and the stability of the feedback system is described. The original spectral normalization [18] simply normalizes the weight matrix $W$ as $W / \sigma(W)$, where $\sigma(W)$ is the spectral norm of $W$, which means that the spectral norm of the normalized matrix is 1. In this article, however, the following weight normalization is introduced for flexibility in the policy design:
$$\bar{W} = \delta \, \frac{W}{\sigma(W)}, \tag{15}$$
where $\bar{W}$ and $\delta > 0$ are respectively the normalized weight matrix and a positive constant. The parameter $\delta$ allows the spectral norm of the weight matrix to be set arbitrarily, with $\sigma(\bar{W}) = \delta$. The key point is to select the parameter $\delta$ appropriately to ensure the stability of the feedback system. In Section 4.1, a method for ensuring the global stability of the feedback system is described. In Section 4.2, a more practical method is proposed for improving control performance while ensuring the local stability of the feedback system.
In the following, the bias term of each layer is omitted for simplicity (i.e. the equilibrium points are $x_* = 0$ and $u_* = 0$). In accordance with [32], the nonlinear operation of the activation function is isolated from the linear operation. Define $v^i$ as the input to the activation function $\phi^i$ as follows:
$$v^i(k) = W^i w^{i-1}(k), \quad i = 1, \dots, \ell. \tag{16}$$
The neural network controller is thus represented as follows:
$$w^0(k) = x(k), \tag{17a}$$
$$v^i(k) = W^i w^{i-1}(k), \quad i = 1, \dots, \ell, \tag{17b}$$
$$w^i(k) = \phi^i\big(v^i(k)\big), \quad i = 1, \dots, \ell, \tag{17c}$$
$$u(k) = W^{\ell+1} w^{\ell}(k). \tag{17d}$$
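The scaled normalization and the bias-free controller with isolated activation inputs can be sketched as follows, assuming NumPy and tanh activations. The layer sizes, the random weights, and the value 0.5 used for the scaling constant are hypothetical choices made only for illustration.

```python
import numpy as np

def scaled_spectral_normalize(W, delta):
    """Rescale W so that its spectral norm equals delta > 0."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    sigma = np.linalg.svd(W, compute_uv=False)[0]
    return delta * W / sigma

def forward(x, hidden_weights, output_weight, phi=np.tanh):
    """Bias-free controller; returns the output u and the activation inputs."""
    w = np.asarray(x, dtype=float)
    v_list = []
    for W in hidden_weights:
        v = W @ w          # linear part of the hidden layer
        w = phi(v)         # elementwise activation
        v_list.append(v)
    return output_weight @ w, v_list   # linear output layer

rng = np.random.default_rng(0)
deltas = [0.5, 0.5]   # hypothetical scaling constants for two hidden layers
raw = [rng.standard_normal((4, 2)), rng.standard_normal((4, 4))]
hidden = [scaled_spectral_normalize(W, d) for W, d in zip(raw, deltas)]
W_out = rng.standard_normal((1, 4))
x = np.array([1.0, -2.0])
u, v_list = forward(x, hidden, W_out)
norms = [np.linalg.svd(W, compute_uv=False)[0] for W in hidden]
```

After normalization each hidden weight has spectral norm 0.5, so the norm of every activation input is at most 0.5 times the norm of the signal entering that layer.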
4.1 Pre-Guaranteed RL
The method described in this section is a standard approach: the gain of the feedback system is kept below 1 by normalizing all weight matrices in the neural network. This method is called pre-guaranteed RL in this article. Although pre-guaranteed RL is essentially the same as the original spectral normalization [18] except for the parameter $\delta$ introduced in Eq. (15), the importance of this parameter is explained as follows. Let us consider the linear operation in the hidden layers (i.e. Eq. (17b)). The spectral normalization defined by Eq. (15) is performed for each weight matrix of the hidden layers:
$$\bar{W}^i = \delta^i \, \frac{W^i}{\sigma(W^i)}, \quad i = 1, \dots, \ell, \tag{18}$$
where $\bar{W}^i$ and $\delta^i > 0$ are respectively the normalized weight matrix and the positive constant for the $i$-th layer. This means that, with the tuning parameter $\delta^i$, the spectral norm of the weight matrix is given by $\sigma(\bar{W}^i) = \delta^i$. Subsequently, let us define a diagonal matrix whose diagonal elements are the gains of the activation function for a given input vector, and the matrix corresponding to the operation of Eqs. (17a) to (17c). Note that the spectral norm of the diagonal matrix is at most 1 (this property is satisfied for the activation functions commonly used in neural networks, e.g. tanh, ReLU, and sigmoid; see Ref. [16] for the details in the case of tanh), and the following inequality is obtained:
(19) 
Next, the spectral normalization is performed for the linear operation of Eq. (17d), i.e.
$$\bar{W}^{\ell+1} = \delta^{\ell+1} \, \frac{W^{\ell+1}}{\sigma(W^{\ell+1})}. \tag{20}$$
The neural network controller represented in Eq. (17) is then reformulated by
;  (21a)  
(21b)  
;  (21c)  
,  (21d) 
and the spectral norm of the matrix , which corresponds to the mapping , is given by
(22) 
This means that the output of the neural network controller (and also the input/output of the activation functions in the hidden layers) is bounded as follows:
(23) 
To investigate the stability of the feedback system, it is sufficient to have the input-output relation of the neural network controller, i.e.
$$\|u(k)\| \le \bar{\delta} \, \|x(k)\|, \tag{24}$$
where
$$\bar{\delta} = \prod_{i=1}^{\ell+1} \delta^i. \tag{25}$$
Finally, consider the gain $\gamma_\pi$ of the neural network controller:
$$\gamma_\pi = \sup_{x \ne 0} \frac{\|u\|_{\ell_2}}{\|x\|_{\ell_2}} \le \bar{\delta}. \tag{26}$$
This means that the mapping $x \mapsto u$ has a finite gain that is less than or equal to $\bar{\delta}$. The upper bound of the gain of the neural network controller can be calculated by Eq. (25). Thus, the stability condition of the feedback system shown in Fig. 1 is given as follows.
Theorem 3.
(Stability Condition) Consider the feedback system shown in Fig. 1. Suppose that the plant $G$ has a finite gain that is less than or equal to $\gamma_G$. Suppose further that each weight matrix of the neural network controller is normalized as in Eq. (15). Assuming the feedback connection is well defined, the feedback system is finite-gain stable by the small-gain theorem if
$$\gamma_G \, \bar{\delta} < 1, \tag{27}$$
where $\bar{\delta}$ is defined by Eq. (25).
Proof.
Let $\gamma_\pi$ be the gain of the neural network controller. From Eq. (26), we have the upper bound $\gamma_\pi \le \bar{\delta}$ with $\bar{\delta} = \prod_{i=1}^{\ell+1} \delta^i$. The product of the gains around the feedback loop can then be bounded by $\gamma_G \gamma_\pi \le \gamma_G \bar{\delta}$. Therefore, $\gamma_G \gamma_\pi < 1$, and the feedback system is finite-gain stable by the small-gain theorem. ∎
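The stability condition can be checked with a few lines of arithmetic. The sketch below assumes that the controller gain is bounded by the product of the per-layer spectral norms, which holds for a bias-free network whose activations are 1-Lipschitz and vanish at zero, such as tanh. The function names are illustrative, the per-layer value 0.31 is the one listed for PPO-SN (Pre) in the aircraft experiment, and the plant gains 30 and 40 are hypothetical.

```python
import math

def controller_gain_bound(deltas):
    """Product of the per-layer spectral norms (hidden layers and output layer)."""
    if any(d <= 0 for d in deltas):
        raise ValueError("every delta must be positive")
    return math.prod(deltas)

def small_gain_ok(plant_gain, deltas):
    """True if plant_gain times the controller bound is below 1 (sufficient only)."""
    return math.isfinite(plant_gain) and plant_gain * controller_gain_bound(deltas) < 1.0

deltas = [0.31, 0.31, 0.31]             # per-layer spectral norms
bound = controller_gain_bound(deltas)   # 0.31 ** 3
max_plant_gain = 1.0 / bound            # largest plant gain this test can certify

certified = small_gain_ok(30.0, deltas)       # hypothetical plant gain
not_certified = small_gain_ok(40.0, deltas)   # hypothetical plant gain
```

With these values the controller bound is about 0.0298, so the test can only certify plants whose gain is below roughly 33.6. This illustrates how quickly the condition restricts the policy as the plant gain grows.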
To summarize pre-guaranteed RL, the spectral normalization in Eq. (15) is proposed and the stability condition of the feedback system is derived from it. Stability is ensured by normalizing the weight matrices of all layers under the condition of Eq. (27). This method can be applied to many existing RL algorithms since it does not require any change to their basic framework. However, the following limitations exist:

(a) it is only applicable to a plant that has a finite gain,

(b) Eq. (27) imposes a strict limitation on the neural network.
Regarding (a), there are traditional RL problems for which pre-guaranteed RL cannot be used. For example, the linearized model of the inverted pendulum at the equilibrium point $(\theta, \dot{\theta}) = (0, 0)$, where $\theta$ and $\dot{\theta}$ are respectively the angle and the angular velocity, does not have a finite gain. Regarding (b), the designed policy may not satisfy the performance requirement due to the constraint of the stability condition. As the gain of the plant becomes larger, a smaller policy gain is required to satisfy the stability condition, resulting in insufficient performance of the designed policy. To overcome these difficulties, a second method is proposed in the next section.
4.2 Post-Guaranteed RL
Although pre-guaranteed RL explicitly includes the stability condition in the training of the policy through spectral normalization, the stability condition in Theorem 3 may impose severe constraints on the policy. Moreover, pre-guaranteed RL cannot be used in practice for a plant with a large gain, since the gain of the neural network controller must then be set close to 0. Guaranteeing global stability therefore often leads to poor performance of the feedback system. In contrast, focusing on local stability may improve the control performance. In order to investigate local stability, let us consider the region of attraction (ROA) of the feedback system defined by
$$\mathcal{R} = \{x_0 \in \mathbb{R}^{n_x} : \lim_{k \to \infty} \chi(k; x_0) = x_*\}, \tag{28}$$
where $\chi(k; x_0)$ is the solution to the feedback system at time $k$ from the initial condition $x(0) = x_0$. The ROA is the set of all initial conditions from which the state converges to the equilibrium point $x_*$ as $k \to \infty$ [13]. The goal is to obtain a policy whose ROA is larger than a design requirement. In this article, the method that ensures stability in the sense of the ROA is called post-guaranteed RL, as explained below. To avoid the limitations due to the stability condition of Eq. (27), in post-guaranteed RL the spectral normalization is applied only to the hidden layers and not to the output layer. The neural network controller in post-guaranteed RL is then represented as follows:
;  (29a)  
;  (29b)  
;  (29c)  
.  (29d) 
Note that the stability of the feedback system is not ensured by normalizing only the hidden layers. However, this is a compromise that allows a policy with improved performance to be obtained (and the optimization algorithm can be applied even to a system that does not have a finite gain). After obtaining the policy, an a posteriori analysis is performed to confirm local stability together with its ROA (i.e., the stability of the feedback system is ensured a posteriori). Specifically, post-guaranteed RL may improve the feasibility of the a posteriori stability tests and enlarge the ROA.
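The notion of a region of attraction can be illustrated by simulation. The sketch below, assuming NumPy, rolls out a closed loop from a given initial state and checks whether the state ends near the origin. The scalar plant, the one-unit tanh policy, and all numerical values are hypothetical and chosen only so that the behavior is easy to verify by hand. Sampling initial states in this way is not a certificate; it only complements the LMI-based analysis.

```python
import numpy as np

def rollout(A, B, policy, x0, n_steps=200):
    """Simulate x(k+1) = A x(k) + B pi(x(k)) and return the final state."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = A @ x + B @ policy(x)
    return x

def converges(A, B, policy, x0, tol=1e-6, n_steps=200):
    """Empirical check that the trajectory from x0 ends near the origin."""
    return bool(np.linalg.norm(rollout(A, B, policy, x0, n_steps)) < tol)

# Hypothetical scalar example: unstable plant with a one-unit tanh policy.
A = np.array([[1.2]])
B = np.array([[1.0]])
W1 = np.array([[1.0]])      # hidden weight with spectral norm 1
W2 = np.array([[-1.0]])     # output weight, not normalized in this setting
policy = lambda x: W2 @ np.tanh(W1 @ x)

inside = converges(A, B, policy, [3.0])    # initial state inside the ROA
outside = converges(A, B, policy, [6.0])   # initial state outside the ROA
```

In this example the closed loop is $x(k+1) = 1.2\,x(k) - \tanh(x(k))$, whose nonzero fixed points satisfy $0.2x = \tanh(x)$ and lie just below $|x| = 5$. Trajectories starting between them converge to the origin, while those starting beyond them diverge, so the origin is only locally stable even though the plant has no finite gain.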
In accordance with [32], let be the equilibrium point at the first layer. Assume the bounds and with and . Note that as explained in [32], the assumed bounds can be used to obtain the local sector bounds for all the nonlinear activation functions () in the neural network. In this article, the bias in the neural network is set to zero, and thus, and . Therefore, if the stability condition formulated by the LMIs in Theorem 2 is feasible, the ROA exists in the following region:
(30) 
For understanding the relationship between the set of the ellipsoid and the spectral normalization defined by Eq. (15), let us consider a simple example of the plant with the state . In this example, it is assumed that with , where is the positive scalar value and is the allones vector. Figure 3 shows an example of the regions given by Eq. (30), which means that the set of the ellipsoid exists in the region of . In other words, the region of that the ROA exists is determined by two parameters and .
In the stability analysis, the first-layer bound needs to be set to a small value (i.e. tighter local sectors are assumed) to make the LMIs in Theorem 2 feasible. This means that the region in Eq. (30) cannot be enlarged through this bound because of the feasibility constraint. In post-guaranteed RL, the effectiveness of introducing the parameters $\delta^i$, $i = 1, \dots, \ell$, is explained as follows. First, setting $\delta^1$ to a small value enlarges the region in Eq. (30) even under a small first-layer bound, which provides the potential to obtain a larger ROA. Second, from Eq. (19), the input to the nonlinear activation function is limited. Thus, the assumed sectors for all nonlinear functions can be smaller, which improves the feasibility of the stability analysis formulated by the LMIs. (The sectors after the second layer can be obtained by forward propagation from those of the first layer; see [32] for the details. If the weights of the hidden layers are unbounded, the calculated sectors may tend to be large, resulting in the infeasibility of the stability condition formulated by the LMIs in Theorem 2.) Note that the feasibility analysis of LMIs has been well investigated in control engineering; see Ref. [29] for the details of the relationship between the region size and the feasibility of LMIs in conjunction with the pre-guaranteed and post-guaranteed approaches.
Figure 4 shows three types of sector-bounded nonlinearity regions, in which tanh is selected as the nonlinear activation function. The red solid line shows the global sector, the green dashed line the local sector used for the stability analysis, and the black dash-dot line the local sector bounded by the spectral normalization. From Fig. 4, it can be confirmed that the spectral normalization enforces a smaller sector bound. On the other hand, values of $\delta^i$, $i = 1, \dots, \ell$, that are too small strictly limit the nonlinearities of the activation functions, which degrades the performance of the policy.
To summarize post-guaranteed RL, the sectors can be made arbitrarily small to make the LMIs feasible while the ROA is enlarged by properly choosing the first-layer bound and the parameters $\delta^i$, $i = 1, \dots, \ell$. In other words, the potential size of the ROA can be changed (to satisfy a design requirement) through the trade-off between stability and performance of the neural network controller.
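The effect of a bounded activation input on the local sector can be sketched for tanh with a zero equilibrium. Because $\tanh(v)/v$ decreases as $|v|$ grows, tanh restricted to $|v| \le \bar{v}$ lies in the sector $[\tanh(\bar{v})/\bar{v},\, 1]$, so a smaller bound $\bar{v}$ gives a tighter sector. The function names, the bound 3.0, and the values 0.5 and 2.0 below are illustrative assumptions rather than settings from the article.

```python
import math

def tanh_local_sector(v_bar):
    """Sector [alpha, beta] containing tanh on the interval |v| <= v_bar."""
    if v_bar <= 0:
        raise ValueError("v_bar must be positive")
    return math.tanh(v_bar) / v_bar, 1.0

def in_sector(v, alpha, beta):
    """Check (tanh(v) - alpha v)(beta v - tanh(v)) >= 0 up to rounding."""
    y = math.tanh(v)
    return (y - alpha * v) * (beta * v - y) >= -1e-12

def activation_input_bound(delta, w_norm):
    """Bound on each entry of v = W w when sigma(W) = delta and ||w|| <= w_norm."""
    return delta * w_norm

alpha_wide, beta = tanh_local_sector(3.0)                               # loose bound
alpha_tight, _ = tanh_local_sector(activation_input_bound(0.5, 2.0))    # v_bar = 1
grid = [-1.0 + 0.01 * k for k in range(201)]
all_inside = all(in_sector(v, alpha_tight, beta) for v in grid)
```

Here the tighter bound raises the lower sector slope from about 0.33 to about 0.76, which is the mechanism by which bounding the hidden-layer weights narrows the sectors assumed in the a posteriori analysis.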
5 Experiments
Two numerical experiments are performed for discrete-time LTI systems: one is the inverted pendulum and the other is the longitudinal motion of an aircraft. The details of the environments used in this article are provided in Appendix A. The policy is modeled by a fully connected multilayer perceptron with tanh as the activation function, which is trained with PPO [28]. In this article, PPO is selected as the baseline RL algorithm. To investigate the effectiveness of the spectral normalization, pre-guaranteed RL and post-guaranteed RL are compared with the baseline PPO (pre-guaranteed RL is tested only for the stable system, i.e. the aircraft control task).
5.1 Inverted Pendulum
The inverted pendulum is a traditional benchmark problem for RL algorithms. The state is $x = [\theta, \dot{\theta}]^\top$, where $\theta$ and $\dot{\theta}$ are respectively the angle (rad) and the angular velocity (rad/s). The input $u$ is the torque (N·m). For the experiment, the nonlinear model is linearized around the equilibrium point in accordance with [23]. Note that the linearized model around the equilibrium point is unstable and does not have a finite gain, which means that pre-guaranteed RL cannot be used for this problem. The policy network has two hidden layers of 64 units.
Figure 5: (a) Learning curves. (b) Trajectories. (c) Calculated lower bounds. (d) ROA.
Figure 5 shows the results of the inverted pendulum task, in which the blue line/marker shows the baseline PPO and the red line/marker shows PPO with spectral normalization (SN) using post-guaranteed RL, denoted PPO-SN (Post). Figure 5(a) shows the learning curves, where the solid line corresponds to the average and the shaded region to the minimum/maximum returns of evaluation rollouts without exploration noise over three random seeds. Figure 5(b) shows the trajectories obtained by the policy without exploration noise from given initial states. Figure 5(c) shows the calculated lower bounds for the a posteriori stability analysis, and Fig. 5(d) shows the obtained ROA. The results shown in Figs. 5(b) to (d) are for the first random seed in the experiments.
From Figs. 5(a) and (b), PPO-SN (Post) performs comparably to the baseline PPO and achieves the control task with smoother trajectories. A remarkable difference between PPO and PPO-SN (Post) is seen in the results of the a posteriori stability analysis; see Figs. 5(c) and (d). In the case of PPO, small sector bounds for the first layer must be assumed in order to make the LMIs feasible, which shrinks the ROA. On the other hand, in the case of PPO-SN (Post), larger sector bounds than those of PPO can be assumed while keeping the LMIs feasible, since the spectral norms of the weight matrices in the first and second layers are bounded (see also Table 1). Thus, with spectral normalization, a larger ROA can be obtained in the a posteriori stability analysis.
Figure 6: (a) Learning curves. (b) Trajectories. (c) Calculated lower bounds. (d) ROA.
Figure 6 compares different sizes of the spectral norm (i.e. the parameters $\delta^i$, $i = 1, 2$) in the case of PPO-SN (Post). The blue, red, and green lines/markers show the three tested values of the spectral norm (0.5, 1.0, and 1.5; see Table 1). Although sample efficiency improves as the spectral norm increases (Fig. 6(a)), the trajectories obtained after training for the total number of steps are almost the same (Fig. 6(b)). From Figs. 6(c) and (d), a larger ROA is obtained by setting $\delta^i$, $i = 1, 2$, to a smaller value. Note that the sector sizes for the first layer are set to the same value in the stability analysis of each case. These results confirm the trade-off between performance and stability due to the spectral normalization.
Table 1 shows the spectral norms of the weight matrices obtained on the inverted pendulum task. For the weight matrices of the first and second layers, the norms of PPO are larger than those of PPO-SN (Post). The difference in the norms leads to the difference in the calculated lower bounds of the second layer in the a posteriori stability analysis, as explained above. From these results, the spectral normalization improves the feasibility of the a posteriori analysis and enlarges the ROA.
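Table 1: Spectral norms of the weight matrices obtained on the inverted pendulum task.
Weight matrix  PPO    PPO-SN (Post)  PPO-SN (Post)  PPO-SN (Post)
                      δ^i = 0.5      δ^i = 1.0      δ^i = 1.5
First layer    3.423  0.5000         1.000          1.500
Second layer   4.201  0.5000         1.000          1.500
Output layer   1.171  6.836          1.882          1.123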
Figure 7: (a) Learning curves. (b) Trajectories. (c) Calculated lower bounds. (d) ROA.
5.2 Aircraft
The aircraft model used in this article is the generic transport model (GTM) developed by NASA [12], whose nonlinear simulation model is available in [22]. The linearized aircraft model is obtained by linearizing the nonlinear model at a trim point in accordance with [26]. For this experiment, the model is discretized via a zero-order hold at a fixed sampling period. The aircraft model has a finite gain (i.e. $\mathcal{H}_\infty$ norm). For pre-guaranteed RL, the spectral normalization parameters are set to $\delta^i = 0.31$, $i = 1, 2, 3$, to satisfy the stability condition of the feedback system. The policy network has two hidden layers of 32 units.
Figure 7 shows the results of the aircraft control task. The plot layout is the same as in Fig. 5, except that plots (a) and (b) include the result of pre-guaranteed RL. Regarding the learning curve, pre-guaranteed RL shows insufficient performance due to the stability condition derived from the small-gain theorem. On the other hand, post-guaranteed RL shows almost the same performance as the baseline PPO. The time history of the state also shows that post-guaranteed RL achieves sufficient performance. Moreover, the ROA of post-guaranteed RL is larger than that of the baseline PPO. These results show the effectiveness of spectral normalization for stability-certified RL.
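Table 2: Spectral norms of the weight matrices obtained on the aircraft control task.
Weight matrix  PPO    PPO-SN (Pre)  PPO-SN (Post)
First layer    9.598  0.3100        1.000
Second layer   9.360  0.3100        1.000
Output layer   2.411  0.3100        24.00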
6 Conclusion
In this article, to achieve stability-certified RL, we have revisited spectral normalization and proposed two methods from different perspectives. The first is pre-guaranteed RL, which enforces the stability condition derived from the small-gain theorem. While it explicitly includes the stability condition in the training of the policy, pre-guaranteed RL may provide insufficient performance due to the strict stability condition. To improve performance while ensuring stability, post-guaranteed RL has been proposed, which in many cases greatly improves the feasibility of the a posteriori stability analysis formulated by LMIs. The numerical experiments show that post-guaranteed RL achieves almost the same performance as the baseline PPO while certifying stability with a larger ROA.
Acknowledgements
The authors would like to thank Prof. Takashi Shimomura at Osaka Prefecture University for providing advice based on his expert knowledge of control theory.
References
 [1] (2020) Robust reinforcement learning-based autonomous driving agent for simulation and real world. arXiv preprint arXiv:2009.11212.
 [2] (2007) Robust reinforcement learning control using integral quadratic constraints for recurrent neural networks. IEEE Transactions on Neural Networks 18 (4), pp. 993–1002.
 [3] (2019) Neural Lyapunov control. In Advances in Neural Information Processing Systems, pp. 3245–3254.
 [4] (2020) Enforcing robust control guarantees within neural network policies. arXiv preprint arXiv:2011.08105.
 [5] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
 [6] (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
 [7] (2019) Model-free reinforcement learning with robust stability guarantee. arXiv preprint arXiv:1911.02875.
 [8] (2020) Actor-critic reinforcement learning for control with stability guarantee. arXiv preprint arXiv:2004.14288.
 [9] (2020) Offline learning of counterfactual perception as prediction for real-world robotic reinforcement learning. arXiv preprint arXiv:2011.05857.
 [10] (2018) Stability-certified reinforcement learning: a control-theoretic perspective. arXiv preprint arXiv:1810.11505.
 [11] (2020) Neural certificates for safe control policies. arXiv preprint arXiv:2006.08465.
 [12] (2004) Development of a dynamically scaled generic transport model testbed for flight research experiments.
 [13] (2002) Nonlinear systems. Vol. 3, Prentice Hall, Upper Saddle River, NJ.
 [14] (2019) Neuroflight: next generation flight control firmware. arXiv preprint arXiv:1901.06553.
 [15] (2019) Reinforcement learning for UAV attitude control. ACM Transactions on Cyber-Physical Systems 3 (2), pp. 1–21.
 [16] (2000) A synthesis of reinforcement learning and robust control theory. Colorado State University, Fort Collins, CO.
 [17] (2014) Off-policy reinforcement learning for control design. IEEE Transactions on Cybernetics 45 (1), pp. 65–76.
 [18] (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
 [19] (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
 [20] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
 [21] (2001) Robust reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1061–1067.
 [22] Flight Dynamics Simulation of a Generic Transport Model. https://software.nasa.gov/software/LAR176251
 [23] (2019) Control approach combining reinforcement learning and model-based control. In 2019 12th Asian Control Conference (ASCC), pp. 1419–1424.
 [24] (2020) Simulation-based reinforcement learning for real-world autonomous driving. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 6411–6418.
 [25] (2018) The Lyapunov neural network: adaptive stability certification for safe learning of dynamical systems. arXiv preprint arXiv:1808.00924.
 [26] (2019) On the worst disturbance of airplane longitudinal motion using the generic transport model. TISCI 32 (8), pp. 309–317.
 [27] (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
 [28] (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
 [29] (2005) Gain-scheduled control under common Lyapunov functions: conservatism revisited. In Proceedings of the 2005 American Control Conference, pp. 870–875.
 [30] (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489.
 [31] (2019) Verification of neural network control policy under persistent adversarial perturbation. arXiv preprint arXiv:1908.06353.
 [32] (2020) Stability analysis using quadratic constraints for systems with neural network controllers. arXiv preprint arXiv:2006.07579.
 [33] (2020) Policy optimization for linear control with robustness guarantee: implicit regularization and global convergence. In Learning for Dynamics and Control, pp. 179–190.
 [34] (2020) The ingredients of real-world robotic reinforcement learning. arXiv preprint arXiv:2004.12570.
Appendix A Environment Details
Inverted Pendulum
The linearized equation of motion for the inverted pendulum is given as follows:
(31) 
where the parameters are the length of the pendulum (m), the mass (kg), the gravitational constant (m/s$^2$), the friction coefficient, and the sampling period (s). The parameters are set to the same values as in [23] and are summarized in Table 3. The state and the input are given by
(32) 
where the state consists of the angle (rad) and the angular velocity (rad/s), and the input is the torque (N·m). The system does not have a finite gain. At each step, the reward is given by
(33) 
where and . The current episode is terminated if rad. The max episode step is 200.
Table 3: Parameters of the inverted pendulum.
Definition                      Value
Length of pendulum (m)          0.5
Mass (kg)                       0.15
Gravitational constant (m/s^2)  9.8
Friction coefficient            0.05
Sampling period (s)             0.1
Aircraft
The aircraft model is the generic transport model (GTM) developed by NASA, which is a dynamically scaled 5.5% free-flying model of a twin-jet commercial transport aircraft [12]. The nonlinear simulation model is available in [22]. The linearized model of the GTM is obtained by linearizing the nonlinear model at a trim point. In this article, the trim condition and the continuous-time model are taken from [26]. For the experiment in this article, the model is discretized via a zero-order hold at a fixed sampling period. Thus, the discrete-time model for the aircraft longitudinal dynamics is given as follows.
(34) 
The state and the input are given as follows.
(35) 
where the state consists of the velocity perturbations in the $x$- and $z$-directions (m/s), the pitch rate (rad/s), and the pitch angle (rad), and the input is the elevator deflection (rad). The model has a finite gain (i.e. $\mathcal{H}_\infty$ norm). At each step, the reward is given by
(36) 
where and . The max episode step is 200. If rad, m/s, or m/s, the current episode is terminated.
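The zero-order-hold discretization mentioned above can be sketched with the augmented matrix exponential, assuming NumPy. The double integrator used here is a hypothetical test system and not one of the models in this appendix, and the truncated Taylor series is adequate only for small matrices and short sampling periods; `scipy.linalg.expm` or `scipy.signal.cont2discrete` would be more robust choices in practice.

```python
import numpy as np

def expm_taylor(M, n_terms=30):
    """Truncated Taylor series of the matrix exponential (fine for small ||M||)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, n_terms):
        term = term @ M / k
        result = result + term
    return result

def zoh_discretize(A_c, B_c, dt):
    """Zero-order-hold discretization of dx/dt = A_c x + B_c u with period dt."""
    n, m = B_c.shape
    M = np.zeros((n + m, n + m))
    M[:n, :n] = A_c
    M[:n, n:] = B_c
    Phi = expm_taylor(M * dt)   # exp([[A_c, B_c], [0, 0]] dt) = [[A_d, B_d], [0, I]]
    return Phi[:n, :n], Phi[:n, n:]

# Double integrator as a hypothetical test system.
A_c = np.array([[0.0, 1.0], [0.0, 0.0]])
B_c = np.array([[0.0], [1.0]])
A_d, B_d = zoh_discretize(A_c, B_c, dt=0.1)
```

For the double integrator the series terminates after the quadratic term, giving the discrete-time state matrix with rows (1, 0.1) and (0, 1) and the input matrix with entries 0.005 and 0.1.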