Stability-Certified Reinforcement Learning via Spectral Normalization

12/26/2020 ∙ by Ryoichi Takase, et al. ∙ The University of Tokyo

In this article, two methods based on spectral normalization are described for ensuring the stability of a system controlled by a neural network. In the first method, the L2 gain of the feedback system is bounded below 1 so that the stability condition derived from the small-gain theorem is satisfied. While this method explicitly incorporates the stability condition, its strictness may leave the neural network controller with insufficient performance. To overcome this difficulty, a second method is proposed that improves performance while ensuring local stability with a larger region of attraction. In the second method, stability is ensured by solving linear matrix inequalities after the neural network controller has been trained. The spectral normalization proposed in this article improves the feasibility of the a-posteriori stability test by constructing tighter local sectors. Numerical experiments show that the second method attains sufficient performance compared with the first one while providing stronger stability guarantees than existing reinforcement learning algorithms.


1 Introduction

Reinforcement learning (RL) has achieved impressive success in a variety of domains, such as the games of Go [30] and Atari [19, 20]. These successes have led to an interest in solving real-world problems such as robotics [34, 9], automated driving [24, 1], and unmanned aerial vehicles [14, 15]. While providing high performance, RL algorithms lack the safety certificates required for safety-critical systems. This difficulty appears, for example, in guaranteeing the stability of a system controlled by a neural network. A neural network generally has nonlinear activation functions, and as the number and size of the hidden layers increase, the feedback system becomes highly nonlinear, causing stability tests for the neural network controller to fail. Ensuring stability guarantees requires an understanding of both stability theory and RL algorithms, which makes stability-certified RL an essential topic in both the control theory and RL communities. The Lyapunov theorem is fundamental for the stability analysis of dynamical systems. Analysis based on the Lyapunov theorem requires constructing a Lyapunov function to certify stability [13]. Lyapunov stability methods have been widely used in control engineering [13] and have recently been investigated for ensuring the stability of RL algorithms [8]. As another approach, input-output stability has also been used for the same purpose [10]. Unlike these approaches, this article revisits spectral normalization [18] and proposes two methods from different perspectives. The first ensures the global stability of the feedback system based on the small-gain theorem, where the stability condition is enforced through spectral normalization. However, this condition is highly conservative and may result in insufficient control performance. To overcome this difficulty, the second method is proposed, which improves performance while ensuring local stability with a larger region of attraction (ROA). In the second method, stability is ensured by solving linear matrix inequalities (LMIs) after training the neural network controller. The spectral normalization improves the feasibility of the a-posteriori stability test by constructing tighter local sectors. The contributions of this article are summarized as follows:

  • In the development of the first method, the relationship between spectral normalization and the global stability of the feedback system is described from a control-theoretic point of view,

  • Regarding local stability, the second method improves the feasibility of the a-posteriori stability test and enlarges the ROA,

  • Numerical experiments show that the second method provides sufficient stability while achieving almost the same performance level as existing RL algorithms.

The layout of this article is as follows. Section 2 describes related works. Section 3 describes the problem formulation and the stability analysis of the feedback system. Section 4 describes the stability-certified RL methods based on spectral normalization. Section 5 gives the results of the numerical experiments, and Section 6 concludes the article. The notation is fairly standard. The symbol S^n_{++} denotes the set of n-by-n symmetric positive definite matrices, and P ≻ 0 (P ≺ 0) means that P is positive (negative) definite. If x, y ∈ R^n, then the inequality x ≤ y is element-wise, i.e., x_i ≤ y_i for i = 1, …, n. The symbol ‖x‖ denotes the 2-norm of the vector x, and ‖x‖_{ℓ2} denotes the total energy of a discrete-time signal x(k), i.e., ‖x‖_{ℓ2} = ( Σ_{k=0}^{∞} ‖x(k)‖^2 )^{1/2}.

2 Related Works

This article is closely related to works on RL with stability guarantees. The key points of the related works are summarized as follows. The first is the robust control framework for ensuring the nominal/robust stability of systems in the presence of nonlinearities and uncertainties. In Ref. [16], integral quadratic constraints (IQCs) are used to deal with the nonlinear activation functions and the variation of the weight matrices in the neural network. The IQC framework has successfully been combined with RL research, such as recurrent neural networks [2], semidefinite feasibility problems with bounded gradients [10], and local quadratic constraints for estimating the ROA [32]. As other examples using the robust control framework, a neural network certification algorithm [31] and learning control methods [21, 7] have been proposed to obtain robustness against adversarial conditions in the environment; see also [17, 4, 33] for related works. The second key point is the technique of jointly learning the policy and the Lyapunov function. In Refs. [3, 25], a candidate Lyapunov function is represented by a neural network, and then the policy and the Lyapunov function are obtained through training, which provides a larger ROA than that of the classical linear quadratic regulator (LQR). In Ref. [11], a barrier function is also learned to ensure that the policy does not enter unsafe regions. These methods can be applied to a large class of nonlinear systems and have the potential for further improvement of the related works. This article differs from the above works in that the stability constraint is given by spectral normalization. Specifically, this article first proposes a method for ensuring the global stability of the feedback system. In the first method, the gain of the feedback loop is bounded below 1, that is, spectral normalization is performed so that the small-gain theorem is satisfied. However, the first method may provide insufficient performance due to its conservative stability condition. To improve practicality, the second method is proposed, which provides sufficient performance while ensuring local stability with a larger ROA. To conclude this section, this article presents the trade-off between the stability and the performance of the neural network controller induced by spectral normalization, through theoretical analysis and numerical experiments.

3 Preliminaries

3.1 Problem Formulation

Figure 1: Feedback system consisting of the plant G and the neural network controller π.

Let us consider the feedback system consisting of a plant G and a controller π as shown in Fig. 1. In this article, we assume that the plant G is a discrete-time linear time-invariant (LTI) model represented as follows:

x(k+1) = A x(k) + B u(k),  (1)

where x(k), u(k), A, and B are the state, the input, the state matrix, and the input matrix, respectively. The controller π is an L-layer neural network with nonlinear activation functions in the hidden layers:

w^0(k) = x(k),  (2a)
w^i(k) = φ(W^i w^{i-1}(k) + b^i),  i = 1, …, L,  (2b)
u(k) = W^{L+1} w^L(k) + b^{L+1},  (2c)

where w^i is the output of the i-th layer. Symbols W^i and b^i are respectively the weight matrix and the bias of the i-th layer of the neural network. The symbol φ is the activation function, which is applied element-wise to a given vector v = [v_1, …, v_n]^T:

φ(v) = [ϕ(v_1), ϕ(v_2), …, ϕ(v_n)]^T,  (3)

where ϕ is the scalar activation function, e.g., tanh, ReLU, or sigmoid. In RL, the neural network defined by Eq. (2) is trained by an algorithm such as trust region policy optimization (TRPO) [27], proximal policy optimization (PPO) [28], or soft actor-critic (SAC) [6]. In this article, the controller is also referred to as the policy, in accordance with the RL literature.
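To make the interconnection of Eqs. (1) and (2) concrete, the following sketch rolls out a closed loop consisting of a small discrete-time LTI plant and a two-hidden-layer tanh network with the biases set to zero. The plant matrices, layer sizes, and random weights are placeholders for illustration only; they are not the models used in the experiments of Section 5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder discrete-time LTI plant x(k+1) = A x(k) + B u(k), cf. Eq. (1).
A = np.array([[0.95, 0.10],
              [0.00, 0.90]])
B = np.array([[0.0],
              [0.1]])

# Placeholder 2-hidden-layer tanh controller with biases set to zero, cf. Eq. (2).
W1 = 0.5 * rng.standard_normal((16, 2))
W2 = 0.5 * rng.standard_normal((16, 16))
W3 = 0.1 * rng.standard_normal((1, 16))

def policy(x):
    """Neural network controller u = pi(x) with element-wise tanh activations."""
    w = np.tanh(W1 @ x)
    w = np.tanh(W2 @ w)
    return W3 @ w

# Roll out the closed loop of Fig. 1 from an initial condition.
x = np.array([0.3, 0.0])
for k in range(50):
    u = policy(x)
    x = A @ x + B @ u

print("state after 50 steps:", x)
```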

3.2 Stability of Feedback Systems

3.2.1 Small-Gain Theorem

The following definition of the system stability is introduced.

Definition 1.

Stability
Consider a system whose input-output relation is represented by a mapping H: w ↦ z. The gain of the system is the worst-case ratio between the total output energy and the total input energy:

γ = sup_{‖w‖_{ℓ2} ≠ 0} ‖z‖_{ℓ2} / ‖w‖_{ℓ2}.  (4)

If γ is finite, then the system is said to be finite-gain stable.

Let us consider the feedback connection shown in Fig. 2. The small-gain theorem gives a sufficient condition for the stability of the feedback connection.

Theorem 1.

Small-Gain Theorem
Suppose that the two systems in the loop respectively have the finite gains γ_1 and γ_2. Suppose further that the feedback system is well defined in the sense that, for every pair of exogenous inputs, there exist unique internal signals as shown in Fig. 2. Then, the feedback connection is finite-gain stable if γ_1 γ_2 < 1.

Proof.

See [13]. ∎

Figure 2: Feedback connection.
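For a stable discrete-time LTI plant, the gain in Definition 1 coincides with the H-infinity norm of its transfer function, which can be approximated by sweeping the frequency response over the unit circle. The sketch below estimates this gain for a placeholder stable plant (with the full state taken as the output, since the controller is a state feedback) and then checks the small-gain product of Theorem 1 against an assumed bound on the controller gain; all numerical values are illustrative.

```python
import numpy as np

# Placeholder stable discrete-time plant x(k+1) = A x(k) + B u(k) with output y = x.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [0.1]])

def l2_gain(A, B, n_freq=2000):
    """Approximate the l2 gain (H-infinity norm) of the map u -> x by gridding
    the largest singular value of G(z) = (zI - A)^{-1} B over the unit circle."""
    n = A.shape[0]
    gains = []
    for w in np.linspace(0.0, np.pi, n_freq):
        z = np.exp(1j * w)
        G = np.linalg.solve(z * np.eye(n) - A, B)
        gains.append(np.linalg.svd(G, compute_uv=False)[0])
    return max(gains)

gamma_1 = l2_gain(A, B)   # gain of the plant
gamma_2 = 0.5             # assumed upper bound on the controller gain (illustrative)

print(f"plant gain        ~ {gamma_1:.3f}")
print(f"small-gain check  gamma_1 * gamma_2 < 1: {gamma_1 * gamma_2 < 1.0}")
```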

3.2.2 Lyapunov Condition

In this subsection, the stability analysis for the neural network controller proposed in [32] is briefly described. By introducing local quadratic constraints (QCs) for the activation functions, the stability analysis in [32] provides stability certificates and estimates the region of attraction (when global stability tests are infeasible). In accordance with [32], the following definitions are introduced to obtain the Lyapunov condition for the neural network controller. The state x* is an equilibrium point of the feedback system with input u* if the following conditions hold:

x* = A x* + B u*,  (5a)
u* = π(x*).  (5b)

Suppose that (x*, u*) satisfies Eq. (5). Then, the equilibrium point can be propagated through the neural network to obtain equilibrium values for the inputs and outputs of each activation function, and thus (x*, u*) is an equilibrium point of Eqs. (1) and (2) if:

(6a)
(6b)
(6c)

where the combined nonlinearity is obtained by stacking the activation functions of all layers. The corresponding matrix is defined by

(7)

Finally, the ellipsoid is defined for a given symmetric positive definite matrix and the equilibrium point x*:

(8)

With the above definitions, the following property and theorem are obtained from Ref. [32].

Property 1.

(Local Sector [32]) Let the equilibrium value of the activation input and the corresponding equilibrium value at the first layer be given. Select an element-wise bound on the deviation of the first-layer input from its equilibrium. Then there exist local sector bounds such that the activation input generated by Eq. (2b) satisfies the offset local sector condition around the equilibrium point for all first-layer inputs within the selected bound.

Theorem 2.

(Lyapunov Condition [32]) Consider the feedback system of the plant G in (1) and the neural network controller π in (2) with an equilibrium point satisfying (6). Let the vectors satisfying Property 1 be given for the neural network. Denote the rows of the first weight matrix W^1 accordingly and define the matrices

(9)

Define further matrices

(10)

where the blocks are constructed from the sector bounds of Property 1. If there exist a symmetric positive definite matrix and a nonnegative vector such that

(11)
(12)

then: (i) the feedback system consisting of the plant G and the controller π is locally stable around the equilibrium point x*, and (ii) the set defined by Eq. (8) is an inner approximation of the ROA.
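The matrices entering the LMIs (11)-(12) are constructed in [32] from the local sector bounds and are not reproduced here. As a much-simplified sketch of what an a-posteriori LMI test looks like in code, the snippet below uses cvxpy to search for a quadratic Lyapunov function of the closed-loop linearization at the equilibrium; this is only a basic local check under an assumed local gain of the trained policy, not the full sector-based certificate of Theorem 2, and all matrices are placeholders.

```python
import numpy as np
import cvxpy as cp

# Placeholder plant and an assumed local gain K = d(pi)/dx at the equilibrium.
A = np.array([[1.0, 0.1],
              [0.2, 1.0]])
B = np.array([[0.0],
              [0.1]])
K = np.array([[-8.0, -3.0]])   # illustrative linearization of the trained policy

A_cl = A + B @ K               # closed-loop linearization at the equilibrium

# Discrete-time Lyapunov LMI: find P > 0 with A_cl' P A_cl - P < 0.
n = A_cl.shape[0]
P = cp.Variable((n, n), symmetric=True)
eps = 1e-6
constraints = [P >> eps * np.eye(n),
               A_cl.T @ P @ A_cl - P << -eps * np.eye(n)]
problem = cp.Problem(cp.Minimize(0), constraints)
problem.solve()

print("LMI feasible:", problem.status == cp.OPTIMAL)
```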

3.3 Spectral Normalization

Spectral normalization [18] was originally proposed for generative adversarial networks (GANs) [5]; the training of GANs is stabilized by regularizing the Lipschitz norm of the network layers. The spectral norm of a matrix W is defined by

σ(W) = max_{h ≠ 0} ‖W h‖_2 / ‖h‖_2,  (13)

which is equivalent to the largest singular value of W. The spectral normalization is defined by

W_SN = W / σ(W).  (14)
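A minimal sketch of Eqs. (13) and (14): the largest singular value is estimated by power iteration, as in [18], and the weight matrix is then divided by it. The matrix size and the number of iterations are arbitrary choices.

```python
import numpy as np

def spectral_norm(W, n_iter=50, seed=0):
    """Estimate sigma(W), the largest singular value of W, by power iteration."""
    u = np.random.default_rng(seed).standard_normal(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return float(u @ W @ v)

W = np.random.default_rng(1).standard_normal((64, 64))
sigma = spectral_norm(W)
W_sn = W / sigma   # Eq. (14): normalized weight with sigma(W_sn) ~ 1

print(f"sigma(W)    ~ {sigma:.3f}  (exact: {np.linalg.svd(W, compute_uv=False)[0]:.3f})")
print(f"sigma(W_sn) ~ {spectral_norm(W_sn):.3f}")
```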

4 Method

In this section, the relationship between the spectral normalization and the stability of the feedback system is described. The original spectral normalization [18] simply normalizes the weight matrix as in Eq. (14), which means that the spectral norm of the normalized matrix is 1. In this article, however, the following weight normalization is introduced for flexibility in the policy design:

W̄ = β W / σ(W),  (15)

where W̄ and β are respectively the normalized weight matrix and a positive constant. Introducing the parameter β allows the spectral norm of the normalized weight matrix to be set arbitrarily, namely σ(W̄) = β. The key point is to appropriately select β so that the stability of the feedback system is ensured. In Section 4.1, a method for ensuring the global stability of the feedback system is described. In Section 4.2, a more practical method is proposed, which improves control performance while ensuring the local stability of the feedback system. In the following, the bias term of each layer is omitted for simplicity (i.e., the equilibrium points are x* = 0 and u* = 0). In accordance with [32], the nonlinear operation of each activation function is isolated from the linear operation. Define v^i as the input to the i-th activation function as follows:

v^i(k) = W^i w^{i-1}(k),  i = 1, …, L.  (16)

The neural network controller is thus represented as follows:

w^0(k) = x(k),  (17a)
v^i(k) = W^i w^{i-1}(k),  i = 1, …, L,  (17b)
w^i(k) = φ(v^i(k)),  i = 1, …, L,  (17c)
u(k) = W^{L+1} w^L(k).  (17d)

4.1 Pre-Guaranteed RL

The method described in this section is a standard approach: ensuring that the gain of the feedback system is less than 1 by normalizing all weight matrices in the neural network. This method is called pre-guaranteed RL in this article. Although the pre-guaranteed RL is essentially the same as the original spectral normalization [18] except for the introduction of the parameter β defined in Eq. (15), the importance of this introduction is explained as follows. Let us consider the linear operations in the hidden layers (i.e., Eq. (17b)). The spectral normalization defined by Eq. (15) is performed for each weight matrix of the hidden layers:

W̄^i = β_i W^i / σ(W^i),  i = 1, …, L,  (18)

where W̄^i and β_i are respectively the normalized weight matrix and the positive constant for the i-th layer. This means that, with the tuning parameter β_i, the spectral norm of the normalized weight matrix is given by σ(W̄^i) = β_i. Subsequently, let us define the diagonal matrix Λ^i(k), whose diagonal elements are the gains of the activation function at v^i(k), and define the matrix N(k) corresponding to the operations of Eqs. (17a)-(17c), i.e., w^L(k) = N(k) x(k) with N(k) = Λ^L(k) W̄^L ⋯ Λ^1(k) W̄^1. Note that the spectral norm of each diagonal matrix Λ^i(k) is at most 1 (this property is satisfied by the activation functions commonly used in neural networks, e.g., tanh, ReLU, and sigmoid; see Ref. [16] for the details in the case of tanh), and the following inequality is obtained:

σ(N(k)) ≤ ∏_{i=1}^{L} β_i.  (19)

Next, the spectral normalization is performed for the linear operation of the output layer, Eq. (17d):

W̄^{L+1} = β_{L+1} W^{L+1} / σ(W^{L+1}).  (20)

The neural network controller represented in Eq. (17) is then reformulated as

w^0(k) = x(k),  (21a)
v^i(k) = W̄^i w^{i-1}(k),  i = 1, …, L,  (21b)
w^i(k) = φ(v^i(k)),  i = 1, …, L,  (21c)
u(k) = W̄^{L+1} w^L(k),  (21d)

and the spectral norm of the matrix M(k) = W̄^{L+1} N(k), which corresponds to the mapping from x(k) to u(k), is bounded by

σ(M(k)) ≤ ∏_{i=1}^{L+1} β_i.  (22)

This means that the output of the neural network controller (and also the inputs and outputs of the activation functions in the hidden layers) is bounded as follows:

‖u(k)‖ ≤ ( ∏_{i=1}^{L+1} β_i ) ‖x(k)‖.  (23)

To investigate the stability of the feedback system, it is sufficient to have the input-output relation of the neural network controller, i.e.,

‖u(k)‖ ≤ γ̄_π ‖x(k)‖,  (24)

where

γ̄_π = ∏_{i=1}^{L+1} β_i.  (25)

Finally, consider the gain of the neural network controller:

γ_π = sup_{‖x‖_{ℓ2} ≠ 0} ‖u‖_{ℓ2} / ‖x‖_{ℓ2} ≤ γ̄_π.  (26)

This means that the mapping from x to u has a finite gain and that it is less than or equal to γ̄_π. The upper bound of the gain of the neural network controller can thus be calculated by Eq. (25). The stability condition of the feedback system shown in Fig. 1 is then given as follows.

Theorem 3.

(Stability Condition) Consider the feedback system shown in Fig. 1. Suppose that the plant G has a finite gain and that it is less than or equal to γ_G. Suppose further that each weight matrix of the neural network controller is normalized as in Eq. (15). Assuming that the feedback connection is well defined, the feedback system is finite-gain stable by the small-gain theorem if

γ_G γ̄_π < 1,  (27)

where γ̄_π is defined by Eq. (25).

Proof.

Let γ_π be the gain of the neural network controller. From Eq. (26), we have the upper bound γ_π ≤ γ̄_π with γ̄_π given by Eq. (25). The loop gain of the feedback system can therefore be bounded by γ_G γ_π ≤ γ_G γ̄_π. Hence, if γ_G γ̄_π < 1, then γ_G γ_π < 1, and the feedback system is finite-gain stable by the small-gain theorem. ∎
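The following sketch illustrates the pre-guaranteed normalization under an assumed upper bound γ_G on the plant gain: all layers are rescaled as in Eqs. (18) and (20) with equal β_i chosen so that the product satisfies Eq. (27), and the bound of Eq. (23) is then verified empirically on random inputs. The layer sizes, the value of γ_G, and the equal-β_i choice are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

gamma_G = 5.0                                # assumed upper bound on the plant gain
n_layers = 3                                 # two hidden layers + one output layer
beta = (0.9 / gamma_G) ** (1.0 / n_layers)   # equal beta_i so that gamma_G * prod(beta_i) = 0.9 < 1

def normalize(W, beta):
    """Scaled spectral normalization of Eq. (15): sigma of the result equals beta."""
    return beta * W / np.linalg.svd(W, compute_uv=False)[0]

Ws = [rng.standard_normal(s) for s in [(32, 2), (32, 32), (1, 32)]]
Ws_bar = [normalize(W, beta) for W in Ws]

def policy(x, weights):
    w = x
    for W in weights[:-1]:
        w = np.tanh(W @ w)       # normalized hidden layers, Eqs. (21b)-(21c)
    return weights[-1] @ w       # normalized linear output layer, Eq. (21d)

# Empirical check of ||u|| <= (prod beta_i) ||x||, Eq. (23).
bound = beta ** n_layers
ratios = [np.linalg.norm(policy(x, Ws_bar)) / np.linalg.norm(x)
          for x in rng.standard_normal((1000, 2))]

print(f"gamma_G * prod(beta_i) = {gamma_G * bound:.2f}  (< 1 required by Eq. (27))")
print(f"max ||u||/||x|| over samples = {max(ratios):.3f} <= {bound:.3f}")
```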

To summarize the pre-guaranteed RL, the spectral normalization of Eq. (15) is applied to every layer, and the stability condition of the feedback system is derived based on this normalization. The stability is ensured by normalizing the weight matrices of all layers so that Eq. (27) holds. This method can be applied to many existing RL algorithms since it does not require any change to the basic framework of those algorithms. However, it has the following limitations:

  1. it is only applicable to a plant that has a finite gain,

  2. it imposes a strict limitation on the neural network due to Eq. (27).

Regarding limitation 1, there exist traditional RL problems for which the pre-guaranteed RL cannot be used; for example, the linearized model of the inverted pendulum at the upright equilibrium, where the state consists of the angle and the angular velocity, does not have a finite gain. Regarding limitation 2, the designed policy may not satisfy the performance requirement due to the constraint imposed by the stability condition. As the gain of the plant becomes larger, a smaller gain of the policy is required to satisfy the stability condition, resulting in insufficient performance of the designed policy. To overcome these difficulties, the method in the next section is proposed.

4.2 Post-Guaranteed RL

Although the pre-guaranteed RL explicitly includes the stability condition in the training of the policy through the spectral normalization, the stability condition in Theorem 3 may impose severe constraints on the policy. Moreover, it is easily confirmed that the pre-guaranteed RL cannot be used for a plant with a large gain, since the gain of the neural network controller must then be set nearly to 0. If one intends to guarantee global stability, this often leads to poor performance of the feedback system. In contrast, if one focuses on local stability, the control performance of the feedback system may be improved. In order to investigate the local stability, let us consider the region of attraction (ROA) of the feedback system, defined by

R = { x_0 : lim_{k→∞} x(k; x_0) = x* },  (28)

where x(k; x_0) is the solution of the feedback system at time k from the initial condition x_0. The ROA is the set of all initial conditions from which the state converges to the equilibrium point as k → ∞ [13]. The goal is to obtain a policy whose ROA is larger than a design requirement. In this article, the method that ensures stability in the sense of the ROA is called post-guaranteed RL, as explained below. In order to avoid the limitations caused by the stability condition of Eq. (27), in the post-guaranteed RL the spectral normalization is applied only to the hidden layers and not to the output layer. The neural network controller in the post-guaranteed RL is then represented as follows:

w^0(k) = x(k),  (29a)
v^i(k) = W̄^i w^{i-1}(k),  i = 1, …, L,  (29b)
w^i(k) = φ(v^i(k)),  i = 1, …, L,  (29c)
u(k) = W^{L+1} w^L(k).  (29d)

Note that the stability of the feedback system is not ensured by normalizing only the hidden layers. However, this is a compromise that allows a policy with improved performance to be obtained (and allows the optimization algorithm to be applied even to a system that does not have a finite gain). After obtaining the policy, an a-posteriori analysis is performed to confirm the local stability and its ROA (i.e., the stability of the feedback system is ensured a posteriori). Specifically, the post-guaranteed RL may improve the feasibility of the a-posteriori stability tests and enlarge the ROA.

Figure 3: Example of the ROA for the feedback system with a two-dimensional state.

In accordance with [32], let the equilibrium point of the first-layer activation input be v^{1*}, and assume the element-wise bounds v^{1*} − δ ≤ v^1 ≤ v^{1*} + δ with δ ≥ 0. Note that, as explained in [32], the assumed bounds can be used to obtain the local sector bounds for all the nonlinear activation functions in the neural network. In this article, the bias of the neural network is set to zero, and thus x* = 0 and v^{1*} = 0. Therefore, if the stability condition formulated by the LMIs in Theorem 2 is feasible, the ROA exists in the following region:

{ x : |W̄^1 x| ≤ δ },  (30)

where the inequality is element-wise. For understanding the relationship between the set of the ellipsoid and the spectral normalization defined by Eq. (15), let us consider a simple example of a plant with a two-dimensional state. In this example, it is assumed that δ = δ̄ 1, where δ̄ is a positive scalar and 1 is the all-ones vector. Figure 3 shows an example of the region given by Eq. (30), which means that the ellipsoid exists in the region determined by δ̄ and the first-layer weight. In other words, the region in which the ROA can exist is determined by the two parameters δ̄ and β_1.

Figure 4: Three types of regions of sector bounded nonlinearity.

In the stability analysis, δ̄ needs to be set to a small value (i.e., tight local sectors are assumed) to make the LMIs in Theorem 2 feasible. This means that the region of Eq. (30) cannot be enlarged through δ̄ alone due to the feasibility constraint. In the post-guaranteed RL, the effectiveness of introducing the parameters β_i, i = 1, …, L, is explained as follows. First, setting β_1 to a small value enlarges the region of Eq. (30) even under a small value of δ̄, which provides the potential to obtain a larger ROA. Second, from Eq. (19), the inputs to the nonlinear activation functions are limited. Thus, the assumed sectors for all nonlinear functions can be made smaller, which improves the feasibility of the stability analysis formulated by the LMIs. (The sectors after the second layer can be obtained by forward propagation from those of the first layer; see [32] for the details. If the weights of the hidden layers are unbounded, the calculated sectors may become large, resulting in infeasibility of the stability condition formulated by the LMIs in Theorem 2.) Note that the feasibility analysis of LMIs has been well investigated in the field of control engineering; see Ref. [29] for the details of the relationship between the region size and the feasibility of LMIs in conjunction with the pre-guaranteed and post-guaranteed approaches. Figure 4 shows three types of sector-bounded regions for the nonlinearity, in which tanh is selected as the nonlinear activation function. The red solid line shows the global sector, the green dashed line shows the local sector used for the stability analysis, and the black dash-dot line shows the tighter local sector obtained when the activation input is bounded by the spectral normalization. From Fig. 4, it can be confirmed that the spectral normalization enforces a smaller sector bound. On the other hand, too small values of β_i, i = 1, …, L, strictly limit the nonlinearities of the activation functions, which means that too small values of β_i lead to poor performance of the policy. To summarize the post-guaranteed RL, the sectors can be made arbitrarily small to render the LMIs feasible while the ROA is enlarged by properly choosing the parameters δ̄ and β_i, i = 1, …, L. In other words, the potential size of the ROA can be adjusted (to satisfy a design requirement) with a trade-off between the stability and the performance of the neural network controller.
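As a concrete instance of the local sectors sketched in Fig. 4, the snippet below computes, for the tanh activation, the sector [α, 1] that holds whenever the pre-activation is confined to |v| ≤ v̄, and shows how v̄ (and hence the sector) tightens as the first-layer spectral norm β_1 is reduced for states with ‖x‖ ≤ r. The radius r and the values of β_1 are illustrative.

```python
import numpy as np

def tanh_local_sector(v_bar):
    """Lower sector bound alpha such that tanh(v)/v lies in [alpha, 1] for all 0 < |v| <= v_bar
    (tanh(v)/v is even and decreasing in |v|, so the worst case is attained at |v| = v_bar)."""
    return np.tanh(v_bar) / v_bar

# Illustrative setting: states with ||x|| <= r and a bias-free first layer with sigma(W1) = beta_1,
# so that each first-layer pre-activation satisfies |v_i| <= beta_1 * r (cf. Eq. (19)).
r = 1.0
for beta_1 in [3.0, 1.0, 0.5]:
    v_bar = beta_1 * r
    alpha = tanh_local_sector(v_bar)
    print(f"beta_1 = {beta_1:.1f}:  |v| <= {v_bar:.2f},  local sector [{alpha:.3f}, 1]")
```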

5 Experiments

Two numerical experiments are performed for discrete-time LTI systems: one is the inverted pendulum and the other is the longitudinal motion of an aircraft. The details of the environments used in this article are provided in Appendix A. The policy is modeled by a fully-connected multi-layer perceptron with tanh as the activation function, which is trained using the policy optimization algorithm PPO [28]. In this article, PPO is selected as the baseline RL algorithm. To investigate the effectiveness of the spectral normalization, the pre-guaranteed RL and the post-guaranteed RL are compared with the baseline PPO (the pre-guaranteed RL is tested only for the stable system, i.e., the aircraft control task).
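One possible way to implement the post-guaranteed policy network is sketched below in PyTorch: torch.nn.utils.spectral_norm keeps the spectral norm of each wrapped weight close to 1 via power iteration, and an extra scale factor β recovers the normalization of Eq. (15); only the hidden layers are wrapped, matching Eq. (29). The layer sizes, the value of β, and the omission of the value network and the PPO training loop are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class SNPolicy(nn.Module):
    """tanh MLP policy with scaled spectral normalization on the hidden layers only
    (post-guaranteed setting); the output layer is left unnormalized."""

    def __init__(self, obs_dim, act_dim, hidden=64, beta=1.0):
        super().__init__()
        self.beta = beta
        # spectral_norm keeps sigma(weight) close to 1 via power iteration;
        # multiplying the layer output by beta realizes Eq. (15).
        self.h1 = spectral_norm(nn.Linear(obs_dim, hidden, bias=False))
        self.h2 = spectral_norm(nn.Linear(hidden, hidden, bias=False))
        self.out = nn.Linear(hidden, act_dim, bias=False)  # unnormalized output layer

    def forward(self, x):
        w = torch.tanh(self.beta * self.h1(x))
        w = torch.tanh(self.beta * self.h2(w))
        return self.out(w)

policy = SNPolicy(obs_dim=2, act_dim=1, beta=1.0)
print(policy(torch.zeros(1, 2)).shape)  # torch.Size([1, 1])
```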

5.1 Inverted Pendulum

The inverted pendulum is a traditional benchmark problem for RL algorithms. The state is x = [θ, θ̇]^T, where θ and θ̇ are respectively the angle (rad) and the angular velocity (rad/s). The input u is the torque (N·m). For the experiment, the nonlinear model is linearized around the upright equilibrium point in accordance with [23]. Note that the linearized model around this equilibrium point is unstable and does not have a finite gain, which means that the pre-guaranteed RL cannot be used for this problem. The policy network has two hidden layers of 64 units.


Figure 5: Comparison between PPO and PPO-SN (Post) on the inverted pendulum task. (a) Learning curves. (b) Trajectories. (c) Calculated lower bounds. (d) ROA.

Figure 5 shows the results of the inverted pendulum task, in which the blue lines/markers show the baseline PPO and the red lines/markers show PPO with the spectral normalization (SN) using the post-guaranteed RL. Figure 5-(a) shows the learning curves, where the solid line corresponds to the average and the shaded region to the minimum/maximum returns of evaluation rollouts without exploration noise over three random seeds. Figure 5-(b) shows the trajectories obtained by the policy without exploration noise, where the initial state of the system is set to a fixed value. Figure 5-(c) shows the calculated lower bounds for the a-posteriori stability analysis, and Fig. 5-(d) shows the obtained ROA. The results shown in Figs. 5-(b) to (d) correspond to the first random seed of the experiments. From Figs. 5-(a) and (b), PPO-SN (Post) performs comparably to the baseline PPO and achieves the control task with smoother trajectories. A remarkable difference between PPO and PPO-SN (Post) is seen in the results of the a-posteriori stability analysis, see Figs. 5-(c) and (d). In the case of PPO, small sector bounds for the first layer have to be assumed (i.e., the sectors are made nearly tight) in order to make the LMIs feasible, which shrinks the size of the ROA. On the other hand, in the case of PPO-SN (Post), larger sector bounds than those of PPO can be assumed while keeping the LMIs feasible, since the spectral norms of the weight matrices in the first and second layers are bounded (see also Table 1). Thus, with the spectral normalization, a larger ROA can be obtained in the a-posteriori stability analysis.


Figure 6: Comparison of PPO-SN (Post) by spectral norm size on the inverted pendulum task. (a) Learning curves. (b) Trajectories. (c) Calculated lower bounds. (d) ROA.

Figure 6 shows the results of the comparison by the size of the spectral norm (i.e., the parameters β_i, i = 1, …, L) in the case of PPO-SN (Post). The blue, red, and green lines/markers correspond to the three settings of the hidden-layer spectral norm, β_i = 0.5, 1.0, and 1.5 (see Table 1). Although better sample efficiency is observed as the scale of the spectral norm increases (Fig. 6-(a)), the trajectories obtained after training for the total number of steps are almost identical (Fig. 6-(b)). From Figs. 6-(c) and (d), a larger ROA can be obtained by setting β_i, i = 1, …, L, to smaller values. Note that the sector sizes for the first layer are set to the same values in the stability analysis of each case. From these results, the trade-off between performance and stability due to the spectral normalization is confirmed. Table 1 shows the spectral norms of the weight matrices obtained on the inverted pendulum task. Regarding the weight matrices of the first and second layers, the norms of PPO are larger than those of PPO-SN (Post). This difference in norm size leads to the difference in the calculated lower bounds of the second layer in the a-posteriori stability analysis, as explained above. From these results, the spectral normalization improves the feasibility of the a-posteriori analysis and enlarges the ROA.

                               PPO      PPO-SN (Post)
σ(W^1) (first hidden layer)    3.423    0.5000    1.000    1.500
σ(W^2) (second hidden layer)   4.201    0.5000    1.000    1.500
σ(W^3) (output layer)          1.171    6.836     1.882    1.123
Table 1: Spectral norms of the weight matrices obtained on the inverted pendulum task. The spectral normalization is performed for the hidden layers in PPO-SN (Post); the three PPO-SN (Post) columns correspond to β_i = 0.5, 1.0, and 1.5.

Figure 7: Comparison between PPO and PPO-SN (Pre/Post) on the aircraft control task. (a) Learning curves. (b) Trajectories. (c) Calculated lower bounds. (d) ROA.

5.2 Aircraft

The aircraft model used in this article is the generic transport model (GTM) developed by NASA [12], whose nonlinear simulation model is available in [22]. The linearized aircraft model is obtained by linearizing the nonlinear model at a trim point in accordance with [26]. For this experiment, the model is discretized via a zero-order hold at a fixed sampling period. The gain (i.e., the H∞ norm) of the aircraft model is finite. Regarding the pre-guaranteed RL, the parameters of the spectral normalization are set to β_i = 0.31 for all layers so as to satisfy the stability condition of the feedback system. The policy network has two hidden layers of 32 units. Figure 7 shows the results of the aircraft control task. The plot layout is the same as in Fig. 5, except that plots (a) and (b) also include the result of the pre-guaranteed RL. Regarding the learning curves, the pre-guaranteed RL shows insufficient performance due to the stability condition derived from the small-gain theorem. On the other hand, the post-guaranteed RL shows performance almost similar to the baseline PPO. Regarding the time history of the state as well, the post-guaranteed RL shows sufficient performance. Moreover, the ROA of the post-guaranteed RL is larger than that of the baseline PPO. These results show the effectiveness of the spectral normalization for stability-certified RL.

                               PPO      PPO-SN (Pre)    PPO-SN (Post)
σ(W^1) (first hidden layer)    9.598    0.3100          1.000
σ(W^2) (second hidden layer)   9.360    0.3100          1.000
σ(W^3) (output layer)          2.411    0.3100          24.00
Table 2: Spectral norms of the weight matrices obtained on the aircraft control task. The spectral normalization is performed for all layers in PPO-SN (Pre) and only for the hidden layers in PPO-SN (Post).

6 Conclusion

In this article, to achieve stability-certified RL, we have revisited the spectral normalization and proposed two methods from different perspectives. The first one is the pre-guaranteed RL, which ensures the stability condition derived from the small-gain theorem. While explicitly including the stability condition in the training of the policy, the pre-guaranteed RL may provide insufficient performance due to its strict stability condition. In order to improve the performance while ensuring stability, the post-guaranteed RL has been proposed, which greatly improves the feasibility of the a-posteriori stability analysis formulated by LMIs in many cases. The numerical experiments show that the post-guaranteed RL achieves performance almost similar to the baseline PPO while providing sufficient stability with a larger ROA.

Acknowledgements

The authors would like to thank Prof. Takashi Shimomura at Osaka Prefecture University for providing advice based on his expert knowledge of control theory.

References

  • [1] P. Almasi, R. Moni, and B. Gyires-Toth (2020) Robust reinforcement learning-based autonomous driving agent for simulation and real world. arXiv preprint arXiv:2009.11212.
  • [2] C. W. Anderson, P. M. Young, M. R. Buehner, J. N. Knight, K. A. Bush, and D. C. Hittle (2007) Robust reinforcement learning control using integral quadratic constraints for recurrent neural networks. IEEE Transactions on Neural Networks 18 (4), pp. 993–1002.
  • [3] Y. Chang, N. Roohi, and S. Gao (2019) Neural Lyapunov control. In Advances in Neural Information Processing Systems, pp. 3245–3254.
  • [4] P. L. Donti, M. Roderick, M. Fazlyab, and J. Z. Kolter (2020) Enforcing robust control guarantees within neural network policies. arXiv preprint arXiv:2011.08105.
  • [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [6] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290.
  • [7] M. Han, Y. Tian, L. Zhang, J. Wang, and W. Pan (2019) Model-free reinforcement learning with robust stability guarantee. arXiv preprint arXiv:1911.02875.
  • [8] M. Han, L. Zhang, J. Wang, and W. Pan (2020) Actor-critic reinforcement learning for control with stability guarantee. arXiv preprint arXiv:2004.14288.
  • [9] J. Jin, D. Graves, C. Haigh, J. Luo, and M. Jagersand (2020) Offline learning of counterfactual perception as prediction for real-world robotic reinforcement learning. arXiv preprint arXiv:2011.05857.
  • [10] M. Jin and J. Lavaei (2018) Stability-certified reinforcement learning: a control-theoretic perspective. arXiv preprint arXiv:1810.11505.
  • [11] W. Jin, Z. Wang, Z. Yang, and S. Mou (2020) Neural certificates for safe control policies. arXiv preprint arXiv:2006.08465.
  • [12] T. Jordan, W. Langford, C. Belcastro, J. Foster, G. Shah, G. Howland, and R. Kidd (2004) Development of a dynamically scaled generic transport model testbed for flight research experiments.
  • [13] H. K. Khalil and J. W. Grizzle (2002) Nonlinear Systems. Vol. 3, Prentice Hall, Upper Saddle River, NJ.
  • [14] W. Koch, R. Mancuso, and A. Bestavros (2019) Neuroflight: next generation flight control firmware. arXiv preprint arXiv:1901.06553.
  • [15] W. Koch, R. Mancuso, R. West, and A. Bestavros (2019) Reinforcement learning for UAV attitude control. ACM Transactions on Cyber-Physical Systems 3 (2), pp. 1–21.
  • [16] R. M. Kretchmar (2000) A synthesis of reinforcement learning and robust control theory. Colorado State University, Fort Collins, CO.
  • [17] B. Luo, H. Wu, and T. Huang (2014) Off-policy reinforcement learning for control design. IEEE Transactions on Cybernetics 45 (1), pp. 65–76.
  • [18] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
  • [19] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
  • [21] J. Morimoto and K. Doya (2001) Robust reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1061–1067.
  • [22] NASA. Flight Dynamics Simulation of a Generic Transport Model. https://software.nasa.gov/software/LAR-17625-1
  • [23] Y. Okawa, T. Sasaki, and H. Iwane (2019) Control approach combining reinforcement learning and model-based control. In 2019 12th Asian Control Conference (ASCC), pp. 1419–1424.
  • [24] B. Osinski, A. Jakubowski, P. Ziecina, P. Milos, C. Galias, S. Homoceanu, and H. Michalewski (2020) Simulation-based reinforcement learning for real-world autonomous driving. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 6411–6418.
  • [25] S. M. Richards, F. Berkenkamp, and A. Krause (2018) The Lyapunov neural network: adaptive stability certification for safe learning of dynamical systems. arXiv preprint arXiv:1808.00924.
  • [26] K. Sawada, Y. Hamada, H. Fukunaga, and S. Shin (2019) On the worst disturbance of airplane longitudinal motion using the generic transport model. TISCI 32 (8), pp. 309–317.
  • [27] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz (2015) Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897.
  • [28] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • [29] T. Shimomura and T. Kubotani (2005) Gain-scheduled control under common Lyapunov functions: conservatism revisited. In Proceedings of the 2005 American Control Conference, pp. 870–875.
  • [30] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489.
  • [31] Y. Wang, T. Weng, and L. Daniel (2019) Verification of neural network control policy under persistent adversarial perturbation. arXiv preprint arXiv:1908.06353.
  • [32] H. Yin, P. Seiler, and M. Arcak (2020) Stability analysis using quadratic constraints for systems with neural network controllers. arXiv preprint arXiv:2006.07579.
  • [33] K. Zhang, B. Hu, and T. Basar (2020) Policy optimization for linear control with robustness guarantee: implicit regularization and global convergence. In Learning for Dynamics and Control, pp. 179–190.
  • [34] H. Zhu, J. Yu, A. Gupta, D. Shah, K. Hartikainen, A. Singh, V. Kumar, and S. Levine (2020) The ingredients of real-world robotic reinforcement learning. arXiv preprint arXiv:2004.12570.

Appendix A Environment Details

Inverted Pendulum

The linearized equation of motion for the inverted pendulum is given as follows:

(31)

where l, m, g, μ, and T are the length of the pendulum (m), the mass (kg), the gravitational constant (m/s^2), the friction coefficient, and the sampling period (s), respectively. The parameters are set to the same values as in [23] and are summarized in Table 3. The state and the input are given by

x(k) = [θ(k), θ̇(k)]^T,  u(k) = τ(k),  (32)

where θ, θ̇, and τ are the angle (rad), the angular velocity (rad/s), and the torque (N·m), respectively. The system does not have a finite gain. At each step, the reward is given by

(33)

where the weighting coefficients are positive constants. The current episode is terminated if the angle θ (rad) exceeds a given threshold. The maximum episode length is 200 steps.

Symbol   Definition                       Value
l        Length of the pendulum (m)       0.5
m        Mass (kg)                        0.15
g        Gravitational constant (m/s^2)   9.8
μ        Friction coefficient             0.05
T        Sampling period (s)              0.1
Table 3: Parameters of the inverted pendulum.
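For reference, the sketch below builds one possible discrete-time linearization from the parameters in Table 3, assuming the standard point-mass pendulum dynamics m l^2 θ̈ = m g l sin θ − μ θ̇ + u, linearized at the upright equilibrium (sin θ ≈ θ) and discretized by forward Euler with period T; the exact model of [23] used in Eq. (31) may differ.

```python
import numpy as np

# Parameters from Table 3 (symbols assumed: l, m, g, mu, T).
l, m, g, mu, T = 0.5, 0.15, 9.8, 0.05, 0.1

# Assumed dynamics m*l^2*theta_ddot = m*g*l*sin(theta) - mu*theta_dot + u,
# linearized at the upright equilibrium (sin(theta) ~ theta) and discretized by forward Euler.
A = np.array([[1.0,        T                        ],
              [T * g / l,  1.0 - T * mu / (m * l**2)]])
B = np.array([[0.0],
              [T / (m * l**2)]])

print("A =\n", A)
print("B =\n", B)
print("spectral radius of A:", max(abs(np.linalg.eigvals(A))))  # > 1: open-loop unstable
```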

Aircraft

The aircraft model is the generic transport model (GTM) developed by NASA, which is a dynamically scaled 5.5% free-flying model of a twin-jet commercial transport aircraft [12]. The nonlinear simulation model is available in [22]. The linearized model of the GTM is obtained by linearizing the nonlinear model at a trim point. In this article, the trim condition and the continuous-time model are taken from [26]. For the experiment in this article, the model is discretized via a zero-order hold at a fixed sampling period. Thus, the discrete-time model of the aircraft longitudinal dynamics is given as follows.

(34)

The state and the input are given as follows.

(35)

where the state consists of the velocity perturbations in the x- and z-directions (m/s), the pitch rate (rad/s), and the pitch angle (rad), and the input is the elevator deflection (rad). The gain (i.e., the H∞ norm) of the model is finite. At each step, the reward is given by

(36)

where the weighting coefficients are positive constants. The maximum episode length is 200 steps. The current episode is terminated if the pitch angle (rad) or the velocity perturbations (m/s) exceed given thresholds.
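Since the GTM matrices of [26] are not reproduced in this article, the sketch below only illustrates the zero-order-hold discretization step used to obtain Eq. (34), with placeholder continuous-time matrices and a placeholder sampling period; the actual matrices and sampling period should be substituted.

```python
import numpy as np
from scipy.signal import cont2discrete

# Placeholder continuous-time longitudinal dynamics dx/dt = Ac x + Bc u (4 states, 1 input);
# replace Ac, Bc, and dt with the GTM linearization from [26] and the actual sampling period.
Ac = np.array([[-0.02,  0.05,  0.0, -9.8],
               [-0.10, -0.60,  1.0,  0.0],
               [ 0.00, -1.00, -1.2,  0.0],
               [ 0.00,  0.00,  1.0,  0.0]])
Bc = np.array([[0.0], [-0.1], [-2.0], [0.0]])
C = np.eye(4)
D = np.zeros((4, 1))

dt = 0.02  # placeholder sampling period (s)
Ad, Bd, Cd, Dd, _ = cont2discrete((Ac, Bc, C, D), dt, method="zoh")

print("Ad =\n", np.round(Ad, 4))
print("Bd =\n", np.round(Bd, 4))
```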