1 Introduction
With the growing complexity of cyber-physical systems and robotics, guarantees of performing certain tasks successfully are required while an agent interacts with the environment. This requirement is an essential feature in automation: for example, multiple agents interacting with the environment should be ensured to achieve the specified objectives. An ambitious goal is to design an autonomous system that synthesizes controller(s) for the system regardless of how the environment behaves. This is a fundamental problem in AI and computer science that has been extensively studied under different titles by the control communities [1, 2, 3]. Modern cyber-physical systems rely heavily on the efficacy of their feedback controllers, which is generally measured by a controller's capability of keeping the system stable at an equilibrium point and making it follow a reference trajectory. Moreover, in real-world safety-critical systems we also need to constrain control to ensure safe exploration by avoiding states from which an agent cannot recover. In other words, for safety-critical systems another important property is the Region of Attraction (ROA) [4], the region of the state space from which, if the system initiates its operation there, it is guaranteed to remain inside the region during its entire operation and eventually reach an equilibrium state. Though the feedback controller synthesis problem for stability and trajectory tracking has been widely studied, synthesizing a feedback controller that maximizes the ROA of a nonlinear dynamical system has received much less attention. In this paper we build on recent techniques proposed by Berkenkamp and colleagues [5, 6], which compute the ROA of a nonlinear dynamical system by combining Gaussian Process (GP) learning, to approximate the model uncertainties, with Lyapunov stability theory, to estimate the safe operating region. The key limitation, however, is that given a controller these techniques only provide an algorithm to compute its ROA; the radical question remains: given a nonlinear dynamical system, how do we synthesize a feedback controller that achieves the best possible ROA? This also bears on one of the most pivotal decisions, parameter tuning, while designing a controller for a dynamical system.
Some earlier techniques used quantization-error criteria to synthesize feedback controllers [7]. Of late, many developments have used deep reinforcement learning (deep RL) methods to learn continuous control policies for dynamical systems [8]. The goal is generally to synthesize a feedback controller with very good trajectory-tracking performance, captured in the form of the so-called LQR cost [9]. However, the controller with the best LQR cost may not achieve the best ROA for the dynamical system.
In this paper we address these limitations and make the following contributions. We present a novel method for synthesizing an optimal controller that provides very good tracking performance together with the best possible ROA for the closed-loop system. We employ a stochastic optimization technique to solve the best-ROA feedback controller synthesis problem, while also keeping the LQR cost in mind as a performance index. As the specific technique, we use the Particle Swarm Optimization (PSO) [10] method to evaluate the best policies.
2 Problem
2.1 Preliminaries
In this section we review some control-theoretic definitions and notation that will be used in our analysis, but first we introduce the problem statement formally. Consider a nonlinear, deterministic, discrete-time dynamical system:
$$x_{k+1} = f(x_k, u_k), \qquad (1)$$
where $x_k \in \mathcal{X} \subseteq \mathbb{R}^n$ and $u_k \in \mathcal{U} \subseteq \mathbb{R}^m$ are the states and control inputs at discrete time index $k$. The system is controlled by a feedback policy $\pi : \mathcal{X} \to \mathcal{U}$, so the closed-loop dynamics are given by $x_{k+1} = f(x_k, \pi(x_k))$. The policy is safe to use within $\mathcal{S}_\pi$, i.e., $\mathcal{S}_\pi$ is the true largest ROA of this policy.
Notation. We use the following symbols throughout the paper:
Definition 2.1
(Cost function and LQR controller) Given a discrete-time state-space model
$$x_{k+1} = A x_k + B u_k,$$
the associated quadratic cost function to be minimized is
$$J = \sum_{k=0}^{\infty} \left( x_k^\top Q x_k + u_k^\top R u_k \right).$$
The solution of this control problem for a linear system is given by the state-feedback law
$$u_k = -K x_k,$$
where $K$ is the optimal gain matrix.
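As an illustration of Definition 2.1, the optimal gain $K$ can be obtained by iterating the discrete-time Riccati equation until it converges. The following sketch is a generic numerical recipe (not the paper's implementation), applied to a double-integrator model chosen here as an assumption:

```python
import numpy as np

def dlqr(A, B, Q, R, iters=500):
    """Solve the discrete-time LQR problem by iterating the Riccati equation.

    Returns the gain K such that u_k = -K x_k minimizes
    sum_k x_k^T Q x_k + u_k^T R u_k subject to x_{k+1} = A x_k + B u_k.
    """
    P = Q.copy()
    for _ in range(iters):
        # K = (R + B^T P B)^{-1} B^T P A, then the Riccati update for P.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

# Double integrator discretized with time step dt = 0.1 (illustrative).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
K = dlqr(A, B, Q=np.eye(2), R=np.eye(1))

# The closed-loop matrix A - B K should be stable (spectral radius < 1).
rho = max(abs(np.linalg.eigvals(A - B @ K)))
```

In practice one would use a dedicated Riccati solver, but the fixed-point iteration makes the structure of the optimal gain explicit.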
Definition 2.2
(ROA) [4] Let $x^*$ be an asymptotically stable equilibrium point (state) of the closed-loop system. The set of all initial points (states) from which trajectories converge to $x^*$ as $k \to \infty$ is called the Region of Attraction.
A reliable way to estimate the ROA is by using a Lyapunov function. We have the following theorems in this regard.
Theorem 2.1
(Lyapunov stability) [11]
Suppose $f$ is locally Lipschitz continuous and has an equilibrium point at $x^* = 0$, and let $V : \mathcal{X} \to \mathbb{R}$ be locally Lipschitz continuous on $\mathcal{X}$. Suppose there exists a set $\mathcal{D} \subseteq \mathcal{X}$ containing $x^*$ on which $V$ is positive-definite and
$$V(f(x, \pi(x))) - V(x) < 0 \quad \text{for all } x \in \mathcal{D} \setminus \{x^*\};$$
then $x^*$ is an asymptotically stable equilibrium. In this case, $V$ is known as a Lyapunov function for the closed-loop dynamics $f(x, \pi(x))$, and $\mathcal{D}$ is the Lyapunov decrease region for $V$.
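The decrease condition of Theorem 2.1 can be checked numerically by evaluating $V(f(x)) - V(x)$ on a grid of states. The sketch below does this for a quadratic candidate and a stable linear closed-loop map; both the system matrix and the candidate are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Closed-loop dynamics of a simple stable linear system (illustrative choice).
A_cl = np.array([[0.9, 0.2], [0.0, 0.8]])
f = lambda x: A_cl @ x

# Quadratic Lyapunov candidate V(x) = x^T P x with P positive definite.
P = np.eye(2)
V = lambda x: x @ P @ x

# Check the decrease condition V(f(x)) - V(x) < 0 on a grid around the
# origin, excluding the equilibrium itself.
grid = np.linspace(-1.0, 1.0, 21)
decrease_ok = all(
    V(f(np.array([a, b]))) < V(np.array([a, b]))
    for a in grid for b in grid if abs(a) + abs(b) > 1e-12
)
```

A grid check of this kind certifies the condition only at sampled states; the cited works use Lipschitz arguments to extend such finite checks to the continuous level set.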
Since computational methods for finding a pertinent Lyapunov candidate are often constrained to a very specific class of functions, Richards et al. [6]
propose instead to construct the Lyapunov candidate as an inner product of a feed-forward neural network, $V_\theta(x) = \phi_\theta(x)^\top \phi_\theta(x)$. The network $\phi_\theta$ consists of a sequence of layers, each parameterized by a suitable weight matrix whose output feeds into a fixed elementwise activation function.
Theorem 2.2
(Lyapunov Neural Network) [6]
Consider $V_\theta(x) = \phi_\theta(x)^\top \phi_\theta(x)$ as a Lyapunov candidate function, where $\phi_\theta$ is a feed-forward neural network. Suppose that, for each layer of $\phi_\theta$, the activation function and the weight matrix each have a trivial nullspace. Then $\phi_\theta$ has a trivial nullspace as well, and $V_\theta$ is positive-definite with $V_\theta(0) = 0$ and $V_\theta(x) > 0$ for all $x \neq 0$.
Moreover, if the activation function of each layer is Lipschitz continuous, then $V_\theta$ is also locally Lipschitz continuous.
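A minimal construction in the spirit of Theorem 2.2 can be sketched as follows; the layer sizes, the $W = G^\top G + \epsilon I$ weight parameterization, and the tanh activation are our assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_weight(n, eps=0.1):
    """Weight W = G^T G + eps*I: positive definite, hence trivial nullspace."""
    G = rng.standard_normal((n, n))
    return G.T @ G + eps * np.eye(n)

# Two layers, no biases; tanh has a trivial nullspace (tanh(z) = 0 iff z = 0)
# and is Lipschitz continuous, satisfying the theorem's hypotheses.
W1, W2 = make_weight(2), make_weight(2)

def phi(x):
    return np.tanh(W2 @ np.tanh(W1 @ x))

def V(x):
    # Lyapunov candidate as an inner product of the network output.
    return phi(x) @ phi(x)

# By construction V(0) = 0 and V(x) > 0 for every x != 0.
```

Trainable versions parameterize $G$ and optimize it by gradient descent while keeping the positive-definiteness guarantee intact.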
Following this method of estimating as much of the true ROA as possible for a given policy, we develop an algorithm to identify the best possible policies based on the performance measures of LQR cost and ROA. We use the following optimization technique to learn such a policy through sequential iteration.
Definition 2.3
(PSO) Particle Swarm Optimization is an evolutionary computational method developed by the American scholars Kennedy and Eberhart in the 1990s, inspired by the social behavior of fish schooling and bird flocking [12]. The underlying idea is to seek an optimal solution through particles (agents) whose trajectories are adjusted by a stochastic and a deterministic component. Each particle keeps track of its coordinates in the search space and is influenced by its best achieved position, called pbest, and the group's best position, called gbest. An objective function, also known as a fitness function, evaluates the effectiveness of each particle in an iteration, and over several iterations the swarm searches for the optimal solution. The iterations continue until convergence or for a pre-specified number of steps. In our methodology, the fitness function maps a candidate feedback controller in the search space to the size of its ROA and its trajectory-tracking performance in terms of LQR cost. The controllers are the particles in the search space, and in each step we identify the gbest particle according to the fitness function.
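A fitness function of the kind described here could, as one hedged sketch, combine the two objectives linearly; the weights and the exact form of the combination are hypothetical, not taken from the paper:

```python
def fitness(roa_size, lqr_cost, w_roa=1.0, w_cost=0.01):
    """Illustrative fitness: reward a large ROA, penalize a large LQR cost.

    The weights w_roa and w_cost are hypothetical tuning parameters; any
    monotone trade-off between the two objectives would fit the description.
    """
    return w_roa * roa_size - w_cost * lqr_cost
```

With such a scalarization, enlarging the ROA or reducing the LQR cost both strictly improve a particle's fitness.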
2.2 Algorithm
Initially, a set of particles (controllers) with random positions and velocities is generated. PSO then updates these coordinates iteratively to obtain the globally best particle (gbest). The local best (pbest), i.e., the position with the best fitness (small LQR cost and large ROA) found by a particle so far, is stored in a memory variable; gbest is then found by searching among the set of pbests. The optimal solution is thus reached through iteration.
For each particle i and each dimension d
    Initialize the position x_{i,d} and the velocity v_{i,d} randomly within the permissible range
END For
DO
    For each particle i, compute the ROA of the corresponding controller pi_i:
        Input: closed-loop dynamics f(x, pi_i(x)); parametric Lyapunov candidate V_theta; level-set expansion multiplier c > 1
        Initialize a level set S = {x : V_theta(x) <= k} such that S lies inside the current ROA estimate
        Repeat
            Sample a finite batch of states from the expanded level set {x : V_theta(x) <= c k}
            Forward-simulate the batch with f over finitely many time steps
            Update theta via batch SGD on a loss that enlarges the verified level set
        Continue until convergence
        Fitness value = a weighted combination of the ROA size and the LQR cost
        If the fitness value is better than the best in this particle's history
            Set the current position as pbest_i
        END If
    END For
    Choose gbest = the pbest_i with the best fitness over all particles
    For each particle i and each dimension d
        v_{i,d} <- w v_{i,d} + c1 r1 (pbest_{i,d} - x_{i,d}) + c2 r2 (gbest_d - x_{i,d})
        x_{i,d} <- x_{i,d} + v_{i,d}
    END For
WHILE the stopping criterion is not attained
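The outer PSO loop can be sketched generically as follows. This is a minimal, self-contained implementation over an arbitrary fitness function; the hyperparameters (inertia w, acceleration coefficients c1, c2, bounds) are conventional defaults, not the paper's settings, and the toy fitness stands in for the ROA/LQR evaluation:

```python
import numpy as np

def pso(fitness, dim, n_particles=20, iters=100, bounds=(-5.0, 5.0),
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimization that maximizes `fitness`."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))     # particle positions
    v = rng.uniform(-1.0, 1.0, (n_particles, dim))  # particle velocities
    pbest = x.copy()
    pbest_val = np.array([fitness(p) for p in x])
    gbest = pbest[np.argmax(pbest_val)].copy()
    for _ in range(iters):
        # Stochastic attraction toward each particle's pbest and the gbest.
        r1 = rng.uniform(size=(n_particles, dim))
        r2 = rng.uniform(size=(n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([fitness(p) for p in x])
        improved = vals > pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        gbest = pbest[np.argmax(pbest_val)].copy()
    return gbest, pbest_val.max()

# Toy check: maximize -(x - 1)^2 - (y + 2)^2, whose optimum is at (1, -2).
best, best_val = pso(lambda p: -(p[0] - 1.0)**2 - (p[1] + 2.0)**2, dim=2)
```

In our setting each particle position would encode a controller's parameters, and the fitness call would run the inner ROA-estimation loop plus an LQR-cost evaluation.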
2.3 Example
In this section we examine the effectiveness of our proposed method on a real-world application. We implement Algorithm 2.2
on the inverted pendulum benchmark using TensorFlow
[13] based Python code. A neural network with three layers of activation units is used to construct the Lyapunov candidate. The code for computing the ROA with respect to a fixed LQR policy is available on GitHub [14]. The system is governed by the second-order differential equation:
$$m l^2 \ddot{\theta} = m g l \sin\theta - \lambda \dot{\theta} + u,$$
where $m l^2$ is the moment of inertia of the pendulum, $\lambda$ is a frictional coefficient, $\theta$ is the angle from the upright equilibrium position, $x = [\theta, \dot{\theta}]^\top$ is the state vector, and $g$ is the gravitational constant. Given this system we derive three different controllers on the basis of the fitness function: one with minimal LQR cost but a rather inferior ROA, one with the best possible ROA but maximal LQR cost, and finally an optimal controller with a moderate LQR cost and an impressive ROA as well. Table 1 reflects the comparison. The state space is discretized with a fixed number of states along each dimension; after the outer (PSO) and inner (ROA-estimation) iterations we obtain the following ROA size for each controller.

Table 1: Comparison of the three controllers.

policy                   LQR cost   ROA size   % of original safe set
best-ROA controller      287.65     21,776     90.08
minimal-LQR controller   245.067    19,721     81.58
optimal controller       254.54     20,514     84.86
Figure 1 shows the results of training and also visualizes how the optimal controller's level set improves on both of the other candidates (yellow and blue ellipsoids) and almost adapts to the shape of the true safe region. On the benchmark of largest ROA alone, the best-ROA controller is the best policy, but Table 1 also reflects its drawback of having the worst LQR cost. Combining both results, we conclude that the optimal feedback controller has relatively good trajectory-tracking performance while also achieving a near-best ROA for the closed-loop system.
We have also created a Simulink model to simulate the relation between a controller and its corresponding ROA. Figures (a)-(c) demonstrate the efficacy of these controllers. The angle-time plots clearly show the success or failure of each controller in maintaining stability under various initial angles of the pendulum.
From Figure (b) we observe that the minimal-LQR controller fails to reach the upright equilibrium point when the pendulum is released from a certain initial angle, while the other two controllers remain stable in this case. However, the optimal controller also becomes unstable when released from a more extreme position, as shown in Figure (c); evidently the best-ROA controller is stable throughout the experiment. In other words, these figures distinguish the controllers on the basis of their ROAs, and show that a sufficiently large disturbance can drive the system out of the ROA, after which it fails to return to the stable equilibrium point.
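The kind of rollout used to test whether an initial angle lies inside a controller's ROA can be sketched as follows. The pendulum parameters, the Euler integration, and the hand-picked stabilizing gain are all illustrative assumptions, not the paper's values or controllers:

```python
import numpy as np

# Illustrative pendulum parameters (not the paper's exact values).
m, l, g, lam = 0.15, 0.5, 9.81, 0.1
dt = 0.01

def step(theta, omega, u):
    """One Euler step of m*l^2 * theta'' = m*g*l*sin(theta) - lam*theta' + u."""
    alpha = (m * g * l * np.sin(theta) - lam * omega + u) / (m * l * l)
    return theta + dt * omega, omega + dt * alpha

def simulate(theta0, K, steps=2000):
    """Roll out the closed loop u = -K @ [theta, omega]; return the final state."""
    theta, omega = theta0, 0.0
    for _ in range(steps):
        u = -(K[0] * theta + K[1] * omega)
        theta, omega = step(theta, omega, u)
    return theta, omega

# A hand-picked stabilizing gain for these parameters (an assumption).
K = np.array([2.0, 0.5])
theta_f, omega_f = simulate(0.5, K)  # start 0.5 rad from upright
```

If the final state is close to the upright equilibrium, the initial angle is (empirically) inside the closed loop's ROA; sweeping the initial angle reproduces the kind of success/failure comparison shown in the angle-time plots.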
3 Related Work
In this section, we will briefly explore some of the major works that are closely related to ours.
Safe learning is an active area of research that has drawn prominent attention from researchers in both the machine-learning and control communities
[15, 16, 17]. Discrete Markov Decision Processes and Model Predictive Control schemes are a few frameworks in which the existence of feasible return trajectories to a safe region of the state space with high probability has been considered. Nevertheless, for a nonlinear dynamical system,
Lyapunov functions are the most convenient tools for safety certification and hence for ROA estimation [18, 19, 20]. Even though searching for such a function analytically is not a straightforward task, one can be identified efficiently via a semidefinite program [21, 22] or using SOS polynomial methods [23]. Other methods to obtain the ROA include optimizing volume over system trajectories, sampling-based approaches [24], and so on.

4 Conclusion and Future Work
We have developed a novel method for synthesizing control policies for general nonlinear dynamical systems.
Our work borrows insights from recent advances in estimating the ROA of nonlinear dynamical systems, resulting in algorithms that can learn competitive policies with continuous action spaces while minimizing the LQR cost as well as expanding the ROA, which is essential for energy-efficient, safe, and high-performance operation of life-critical and mission-critical embedded applications.
An interesting direction we want to explore broadly in the future is to initiate the learning process and update the weights of the neural network through deep RL techniques. Once the deep neural network for RL is initialized, a reward function that captures both the trajectory-tracking capability and the size of the ROA will be used to improve the feedback controller. We also plan to synthesize feedback controllers for complex dynamical systems such as quadcopters and other drones. Our final goal is to fly a UAV using the feedback controllers synthesized through the proposed techniques.
References
 [1] Rajeev Alur, Salar Moarref, and Ufuk Topcu. Compositional and symbolic synthesis of reactive controllers for multi-agent systems. Information and Computation, 261:616–633, 2018.
 [2] Alberto Camacho, Jorge A Baier, Christian Muise, and Sheila A McIlraith. Synthesizing controllers: On the correspondence between LTL synthesis and non-deterministic planning. In Canadian Conference on Artificial Intelligence, pages 45–59. Springer, 2018.
 [3] Eugene Asarin, Oded Maler, Amir Pnueli, and Joseph Sifakis. Controller synthesis for timed automata. IFAC Proceedings Volumes, 31(18):447–452, 1998.
 [4] Claudiu C Remsing. AM3.2 Linear Control, 2006.
 [5] Felix Berkenkamp, Riccardo Moriconi, Angela P Schoellig, and Andreas Krause. Safe learning of regions of attraction for uncertain, nonlinear systems with gaussian processes. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 4661–4666. IEEE, 2016.
 [6] Spencer M Richards, Felix Berkenkamp, and Andreas Krause. The Lyapunov neural network: Adaptive stability certification for safe learning of dynamical systems. arXiv preprint arXiv:1808.00924, 2018.
 [7] Rupak Majumdar, Indranil Saha, and Majid Zamani. Synthesis of minimal-error control software. In Proceedings of the Tenth ACM International Conference on Embedded Software, pages 123–132. ACM, 2012.
 [8] Hassan K Khalil. Nonlinear systems. Upper Saddle River, 2002.
 [9] Frank L Lewis, Draguna Vrabie, and Vassilis L Syrmos. Optimal control. John Wiley & Sons, 2012.
 [10] Yuhui Shi et al. Particle swarm optimization: developments, applications and resources. In Proceedings of the 2001 congress on evolutionary computation (IEEE Cat. No. 01TH8546), volume 1, pages 81–86. IEEE, 2001.
 [11] Rudolf E Kalman and John E Bertram. Control system analysis and design via the "second method" of Lyapunov: I. Continuous-time systems. 1960.
 [12] Russell Eberhart and James Kennedy. A new optimizer using particle swarm theory. In MHS'95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science, pages 39–43. IEEE, 1995.
 [13] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
 [14] Spencer M Richards and Felix Berkenkamp. Safe learning: https://github.com/befelix/safe_learning, 2018.
 [15] Anil Aswani, Humberto Gonzalez, S Shankar Sastry, and Claire Tomlin. Provably safe and robust learningbased model predictive control. Automatica, 49(5):1216–1226, 2013.
 [16] Felix Berkenkamp, Andreas Krause, and Angela P Schoellig. Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics. arXiv preprint arXiv:1602.04450, 2016.
 [17] Anayo K Akametalu, Jaime F Fisac, Jeremy H Gillula, Shahab Kaynama, Melanie N Zeilinger, and Claire J Tomlin. Reachability-based safe learning with Gaussian processes. In 53rd IEEE Conference on Decision and Control, pages 1424–1431. IEEE, 2014.
 [18] Anthony Vannelli and M Vidyasagar. Maximal Lyapunov functions and domains of attraction for autonomous nonlinear systems. Automatica, 21(1):69–80, 1985.
 [19] JM Gomes da Silva and Sophie Tarbouriech. Anti-windup design with guaranteed regions of stability: an LMI-based approach. IEEE Transactions on Automatic Control, 50(1):106–111, 2005.
 [20] David J Hill and Iven MY Mareels. Stability theory for differential/algebraic systems with application to power systems. IEEE transactions on circuits and systems, 37(11):1416–1423, 1990.
 [21] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
 [22] Pablo A Parrilo. Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. PhD thesis, California Institute of Technology, 2000.
 [23] Didier Henrion and Milan Korda. Convex computation of the region of attraction of polynomial control systems. IEEE Transactions on Automatic Control, 59(2):297–312, 2013.
 [24] Ruxandra Bobiti and Mircea Lazar. A sampling approach to finding lyapunov functions for nonlinear discretetime systems. In 2016 European Control Conference (ECC), pages 561–566. IEEE, 2016.