1 Introduction
Researchers over the last few decades have sought distributed control solutions for interconnected cyber-physical systems, in which the controller of each subsystem receives measurement signals from some neighboring subsystems, instead of the full system state, in order to arrive at a decision. Moreover, in most practical systems, only a subset of the system state is available for feedback control. These are typical examples of structured feedback control, in which a predefined structure is imposed on the feedback mechanism for practical implementation. Many research works have been geared toward structured control solutions for different structures and different system models. Optimal control laws that capture a decentralized [1] or a more general distributed [3] structure have been investigated under the notions of quadratic invariance (QI) [21, 14], structured linear matrix inequalities (LMIs) [13, 16], sparsity-promoting optimal control [4], etc. Reference [6] discusses a structured feedback gain computation for discrete-time systems with known dynamics.
It is worth noting that, for practical dynamic systems, the dynamic model and its parameters may not always be known accurately (e.g., the US Eastern Interconnection transmission grid model). Several non-idealities, such as unmodeled dynamics from coupled processes and parameter drift over time, can make model-based control computations inadequate. Unfortunately, most of the aforementioned techniques for structured control system design assume that the designer has explicit knowledge of the dynamic system.
In recent times, much attention has been given to model-free decision making for dynamical systems by marrying ideas from machine learning with classical control theory, resulting in a flourishing of the area of reinforcement learning (RL) [23]. In the RL framework, the control agent tries to optimize its actions based on interactions with the environment or the dynamic system, quantified in terms of the rewards earned from those interactions. These techniques were traditionally introduced in the context of sequential decision making using Markov decision processes (MDPs) [26, 2, 20], and have since been a driving force behind many advanced RL algorithms with applications to games, robotics, etc. Although many sophisticated machine-learning-based control algorithms have been developed to achieve specific tasks, they often lack stability and optimality guarantees due to the multiple heuristics involved in their designs. Recent works such as
[25, 9] have brought together the best of both worlds for the control of dynamical systems: the ability to learn without a model, from machine learning, and the ability to make decisions with rigorous stability guarantees, from automatic control theory. Essentially, these works leveraged the conceptual equivalence of RL/ADP algorithms with classical optimal [15] and adaptive control [7] of dynamic systems. References [25, 9, 10, 24, 11, 18, 19] cover much of this work for partially or completely model-free designs. However, the area of structure-based RL/ADP design for dynamic systems remains relatively unexplored. In [19, 17], a projection-based reduced-dimensional RL variant has been proposed for singularly perturbed systems, along with a block-decentralized design for two-time-scale networks; [8] presents a decentralized design for a class of interconnected systems; and [5] presents a recent structured design for discrete-time time-varying systems. Along this line of research, this paper presents a structured optimal feedback control methodology using reinforcement learning without knowledge of the state dynamics of the system.
We first formulate a model-based constrained optimal control criterion using the methodologies and guarantees of dynamic programming. Subsequently, we formulate a model-free RL gain computation algorithm that can encode a general structural constraint on the optimal control. This structured RL algorithm (SRL) encapsulates stability and convergence guarantees, along with sub-optimality bounds, for controllers with the specified structure. We substantiate our design on a six-agent network system. The paper is organized as follows. We discuss the problem formulation and the required assumptions in Section 2. In Section 3, we discuss the development of the structured RL algorithm incorporating the structural constraint. A numerical simulation example is given in Section 4, and concluding remarks are given in Section 5.
2 Model and Problem Formulation
We consider a linear time-invariant (LTI) continuous-time dynamic system:
(1) $\dot{x}(t) = Ax(t) + Bu(t),$
where $x(t) \in \mathbb{R}^n$ are the states and $u(t) \in \mathbb{R}^m$ are the control inputs. We hereby make the following assumption.
Assumption 1: The dynamic state matrix $A$ is unknown.
With this unknown state matrix, we would like to learn an optimal feedback gain $K$ for the state feedback control $u = -Kx$. However, instead of an unrestricted control gain, we impose some structure on the gain. We would like to have $K \in \mathcal{K}$, which we call the structural constraint, where $\mathcal{K}$ is the set of all structured controllers such that:
(2) $\mathcal{K} = \{K \in \mathbb{R}^{m \times n} : h(K) = 0\}.$
Here $h(\cdot)$ is the matricial function that can capture any structure in the control gain matrix. It can encode which elements of the matrix may be nonzero; for example, for the multi-agent system in Fig. 1, the control for the first agent can be constrained to the form:
(3) 
where $K^{(1)}$ denotes the first row of $K$. Similarly, the feedback communication requirements of all other agents can be encoded. All such feedback gains with the specified structure are captured in the set $\mathcal{K}$. Therefore, we make the following assumption on the control gain structure.
Assumption 2: The communication structure required to implement the feedback control is known to the designer, and it is sparse in nature.
This assumption means that the structure itself is known. It captures limitations in the feedback communication infrastructure, and can also represent a deliberately sparser infrastructure that minimizes the cost of deploying the network. For many networked physical systems, the communication infrastructure may already exist; for example, in a peer-to-peer architecture, agents can only receive feedback from their neighbors. Another very commonly designed control structure is block-decentralized, where only local states are used for feedback control. Our general constraint set will encompass all such scenarios. We also make the standard stabilizability assumption.
Assumption 3: The pair $(A, B)$ is stabilizable and $(A, Q^{1/2})$ is observable, where $Q \succeq 0$ is the state weighting matrix of the control objective.
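Assumption 3 can be checked numerically on any nominal model. The following is a minimal sketch assuming the standard LQR conditions, namely that $(A, B)$ is stabilizable and $(A, Q^{1/2})$ is observable, on an illustrative two-state system (not the paper's network):

```python
import numpy as np

# Hypothetical system used only for illustration; the paper's A is unknown
# to the learner, but Assumption 3 can be verified on any nominal model.
A = np.array([[0.0, 1.0],
              [-1.0, -2.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)

n = A.shape[0]

# Controllability matrix [B, AB, ..., A^{n-1}B]; full rank implies
# controllability, which is sufficient for stabilizability of (A, B).
ctrb = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])

# Observability matrix of (A, Q^{1/2}); full rank gives observability.
C = np.linalg.cholesky(Q).T  # a square root of Q
obsv = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])

print(np.linalg.matrix_rank(ctrb))  # full rank -> (A, B) controllable
print(np.linalg.matrix_rank(obsv))  # full rank -> (A, Q^{1/2}) observable
```

A rank test on the controllability matrix is conservative (it certifies controllability, which implies stabilizability); a PBH test on the unstable eigenvalues would be the exact check.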
3 Structured Reinforcement Learning (SRL)
To develop the learning control design we take the following route. We first formulate a model-based solution of the optimal control problem via a modified algebraic Riccati equation (ARE) in continuous time that solves the structured optimal control problem when all the state matrices are known. Thereafter, we formulate a learning algorithm that does not require knowledge of the state matrix by embedding the model-based framework, with its guarantees, into RL, which then automatically ensures the convergence and stability of the learned solution.
First of all, we present the following central result of this paper.
Theorem 1: For any matrix $M \in \mathbb{R}^{m \times n}$,
let $P = P^\top \succ 0$ be the solution of the following modified Riccati equation
(5) $A^\top P + PA + Q - PBR^{-1}B^\top P + M^\top R^{-1}M = 0.$
Then, the control gain
(6) $K = R^{-1}\left(B^\top P + M\right)$
will ensure closed-loop stability, i.e., $A - BK$ is Hurwitz.
Proof:
We look into the optimal control solution of the dynamic system (1) with a quadratic objective using dynamic programming (DP) to ensure theoretical guarantees. We assume that at time $t$ the state is $x(t) = x$. We define the finite-time optimal value function with the unconstrained control as:
(7) $V(x, t) = \min_{u[t, T]} \int_t^{T} \left(x^\top Q x + u^\top R u\right) d\tau,$
with $Q = Q^\top \succeq 0$ and $R = R^\top \succ 0$. Starting from state $x$, the optimal $u$ gives the minimum LQR cost-to-go. Now, as the value function is quadratic, we can write it in the quadratic form $V(x, t) = x^\top P x$. For a small time interval $[t, t + \Delta t]$, where $\Delta t$ is small, we assume that the control is constant at $u$ and is optimal. Then the cost incurred over the interval is
(8) $\left(x^\top Q x + u^\top R u\right)\Delta t.$
Also, the control input $u$ evolves the state at time $t + \Delta t$ to
(9) $x(t + \Delta t) = x + \left(Ax + Bu\right)\Delta t.$
Then, the minimum cost-to-go from $x(t)$ is:
(10) $V(x, t) = \min_u \left[\left(x^\top Qx + u^\top Ru\right)\Delta t + V\!\left(x + (Ax + Bu)\Delta t,\; t + \Delta t\right)\right].$
Expanding $V\!\left(x + \Delta x,\, t + \Delta t\right)$ as a Taylor series, we have
(11) $V(x + \Delta x, t + \Delta t) = V(x, t) + (\nabla_x V)^\top \Delta x + \frac{\partial V}{\partial t}\Delta t + \text{h.o.t.}$
(12) $\qquad = V(x, t) + 2x^\top P\left(Ax + Bu\right)\Delta t + x^\top \dot{P} x\,\Delta t + \text{h.o.t.}$
Here, we neglect the higher-order terms. Therefore, the total cost is
(13) $V(x, t) = \min_u\left[\left(x^\top Qx + u^\top Ru\right)\Delta t + V(x, t) + 2x^\top P\left(Ax + Bu\right)\Delta t + x^\top \dot{P}x\,\Delta t\right]$
(14) $0 = \min_u\left[x^\top Qx + u^\top Ru + 2x^\top P\left(Ax + Bu\right) + x^\top \dot{P}x\right].$
If the control is optimal, then the total cost must be minimized. Minimizing over $u$, we have
(15) $2Ru + 2B^\top P x = 0,$
(16) $u^* = -R^{-1}B^\top P x.$
Now, this gives us the optimal gain $K^* = R^{-1}B^\top P$, which solves the unconstrained LQR problem. However, we are not interested in the unconstrained optimal gain, as it cannot impose any structure as such. In order to impose structure on the feedback gain, the feedback control will have to deviate from the optimal solution $u^*$, and following [6], we introduce another matrix $M$ such that
(17) $K = K^* + R^{-1}M$
(18) $\;\;\, = R^{-1}\left(B^\top P + M\right).$
The matrix $M$ will help us impose the structure, i.e., $K \in \mathcal{K}$, which we will discuss later. Therefore, the structured implemented control is given by
(19) $u_s = -R^{-1}\left(B^\top P + M\right)x.$
We have $u_s \in \mathcal{U}_\mathcal{K}$, where $\mathcal{U}_\mathcal{K}$ is the set of all control inputs obtained when following gains in $\mathcal{K}$. Now, with a slight abuse of notation, we denote by $P$ the solution corresponding to the structured optimal control. The Hamilton-Jacobi equation with the structured control is given by
(20) $x^\top \dot{P}x + x^\top Qx + u_s^\top Ru_s$
(21) $\qquad + 2x^\top P\left(Ax + Bu_s\right) = 0.$
Putting in (19), neglecting higher-order terms, and simplifying, we get
(22) $\dot{P} + A^\top P + PA + Q - PBR^{-1}B^\top P + M^\top R^{-1}M = 0.$
For the steady-state solution, we will have
(23) $A^\top P + PA + Q - PBR^{-1}B^\top P + M^\top R^{-1}M = 0.$
This proves the modified Riccati equation of the theorem. Now let us look into the stability of the closed-loop system with the gain $K = R^{-1}(B^\top P + M)$. We can consider the Lyapunov function:
(24) $V(x) = x^\top P x, \quad P \succ 0.$
Therefore, the time derivative along the closed-loop trajectory $\dot{x} = (A - BK)x$ is given as
(25) $\dot{V} = x^\top\left[(A - BK)^\top P + P(A - BK)\right]x$
(26) $\;\;= x^\top\left[A^\top P + PA - K^\top B^\top P - PBK\right]x$
(27) $\;\;= x^\top\left[-Q + PBR^{-1}B^\top P - M^\top R^{-1}M - K^\top B^\top P - PBK\right]x$
(28) $\;\;= x^\top\left[-Q - PBR^{-1}B^\top P - M^\top R^{-1}M - PBR^{-1}M - M^\top R^{-1}B^\top P\right]x$
(29) $\;\;= -x^\top\left[Q + \left(B^\top P + M\right)^\top R^{-1}\left(B^\top P + M\right)\right]x,$
where (27) uses (23). Now, as $R$ is positive definite, terms of the form $\left(B^\top P + M\right)^\top R^{-1}\left(B^\top P + M\right)$ are at least positive semi-definite. Therefore, we have
(30) $\dot{V} \leq -x^\top Q x \leq 0.$
This ensures closed-loop system stability. For the linear system, with the assumption that $(A, Q^{1/2})$ is observable, global asymptotic stability can be proved using LaSalle's invariance principle. This completes the proof.
At this point, we investigate the matricial structure constraint more closely. Let $\mathcal{I}_\mathcal{K}$ denote the indicator matrix for the structured matrix set $\mathcal{K}$, where this matrix contains the element $1$ whenever the corresponding element of the gains in $\mathcal{K}$ is allowed to be nonzero, and $0$ otherwise. For the example (3), the first row of $\mathcal{I}_\mathcal{K}$ is
(31) 
Therefore, the structural constraint is simply written as:
(32) $K \circ \mathcal{I}_\mathcal{K}^c = 0.$
Here, $\circ$ denotes the elementwise/Hadamard product, and $\mathcal{I}_\mathcal{K}^c$ is the complement of $\mathcal{I}_\mathcal{K}$ (with ones and zeros interchanged). We hereafter state the following theorem on the choice of $M$ to impose structure on $K$. It follows a form similar to the discrete-time condition of [6].
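The indicator-matrix constraint (32) is straightforward to check computationally. Below is a small sketch with a hypothetical $2 \times 4$ gain and mask (all values illustrative):

```python
import numpy as np

# Hypothetical 2x4 indicator matrix (1 = entry may be nonzero).
I_K = np.array([[1, 1, 0, 0],
                [0, 0, 1, 1]])
I_K_c = 1 - I_K  # complement of the indicator

K_structured = np.array([[1.5, -0.2, 0.0, 0.0],
                         [0.0,  0.0, 0.7, 0.3]])
K_dense = np.ones((2, 4))

def satisfies_structure(K, mask_c):
    """Check the constraint K o I^c = 0 (elementwise/Hadamard product)."""
    return bool(np.all(K * mask_c == 0))

print(satisfies_structure(K_structured, I_K_c))  # True
print(satisfies_structure(K_dense, I_K_c))       # False
```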
Theorem 2: Let $P$ be the solution of the modified ARE (5), where $R$ is diagonal and $M = -\left(B^\top P\right)\circ \mathcal{I}_\mathcal{K}^c$. Then the control gain designed as in Theorem 1 will satisfy the structural constraint $K \in \mathcal{K}$.
Proof: We have,
(33) $K = R^{-1}\left(B^\top P + M\right)$
(34) $\;\;\, = R^{-1}\left(B^\top P - \left(B^\top P\right)\circ \mathcal{I}_\mathcal{K}^c\right)$
(35) $\;\;\, = R^{-1}\left(\left(B^\top P\right)\circ \mathcal{I}_\mathcal{K}\right)$
(36) $K \circ \mathcal{I}_\mathcal{K}^c = R^{-1}\left(\left(B^\top P\right)\circ \mathcal{I}_\mathcal{K} \circ \mathcal{I}_\mathcal{K}^c\right) = 0,$
where the last step uses the fact that the diagonal $R^{-1}$ only scales the rows of its argument and hence commutes with the Hadamard masking. This concludes the proof.
We note that the implicit assumption here is the existence of a solution of the modified ARE (5) with this choice of $M$. Necessary and sufficient conditions on the structure for the existence of this solution remain an open question. However, once a solution exists, we can iteratively compute it and the associated control gain using the following model-based algorithm.
Theorem 3: Let $K_1 \in \mathcal{K}$ be such that $A - BK_1$ is Hurwitz. Then, for $i = 1, 2, \dots$:
1. Solve for $P_i$ (Policy Evaluation):
(37) $\left(A - BK_i\right)^\top P_i + P_i\left(A - BK_i\right) + Q + K_i^\top R K_i = 0.$
2. Update the control gain (Policy Update):
(38) $K_{i+1} = R^{-1}\left(B^\top P_i + M_i\right), \quad M_i = -\left(B^\top P_i\right)\circ \mathcal{I}_\mathcal{K}^c.$
Then $A - BK_{i+1}$ is Hurwitz, and $K_i$ and $P_i$ converge to the structured $K$ and $P$ as $i \to \infty$.
Proof: The theorem is formulated by taking motivation from Kleinman's algorithm [12], which has been used for unstructured feedback gain computation. Compared with Kleinman's algorithm, the control gain update step is now modified to impose the structure. With some mathematical manipulation it can be shown that, using $K = R^{-1}(B^\top P + M)$ at a fixed point $P_i = P_{i-1} = P$ of the iteration, we have
(39) $\left(A - BK\right)^\top P + P\left(A - BK\right) + Q + K^\top RK$
(40) $\qquad = A^\top P + PA + Q - PBR^{-1}B^\top P + M^\top R^{-1}M.$
Therefore, (37) is equivalent to (5) at the fixed point of the iteration. As we have shown stability and convergence via dynamic programming and Lyapunov analysis in Theorem 1, the equivalence of this iterative version with Theorem 1 proves the theorem.
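The model-based iteration of Theorem 3 can be sketched in a few lines. The example below is a minimal sketch under stated assumptions: a hypothetical open-loop-stable two-state system (so $K_1 = 0$ is an admissible structured initial gain), a diagonal $R$ (so imposing the structure reduces to elementwise masking of $R^{-1}B^\top P_i$), and a decentralized (diagonal) gain structure:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative system (all values hypothetical). A is open-loop stable,
# so K_1 = 0 is an admissible stabilizing structured initial gain.
A = np.array([[-1.0, 0.5],
              [0.2, -2.0]])
B = np.eye(2)
Q = np.eye(2)
R = np.eye(2)                  # diagonal R assumed for the masking step
mask = np.array([[1.0, 0.0],
                 [0.0, 1.0]])  # decentralized (diagonal) gain structure

K = np.zeros((2, 2))
for i in range(50):
    A_cl = A - B @ K
    # Policy evaluation (37): solve (A-BK)^T P + P (A-BK) = -(Q + K^T R K)
    P = solve_continuous_lyapunov(A_cl.T, -(Q + K.T @ R @ K))
    # Policy update (38): unstructured gain, then mask, which for diagonal
    # R equals choosing M_i = -(B^T P_i) o I^c
    K_new = (np.linalg.inv(R) @ B.T @ P) * mask
    if np.linalg.norm(K_new - K) < 1e-9:
        K = K_new
        break
    K = K_new

print(K * (1 - mask))                                 # masked entries stay 0
print(bool(np.all(np.linalg.eigvals(A - B @ K).real < 0)))  # True: Hurwitz
```

The masked entries of $K$ are zero by construction at every iterate, and for this weakly coupled example the closed loop remains Hurwitz throughout, consistent with Theorem 3.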
Although we have formulated an iterative solution to compute structured feedback gains, the algorithm is still model-dependent. We hereby start to move to a state-matrix-agnostic design using reinforcement learning. We write (1), incorporating the iterate gain $K_i$, as
(41) $\dot{x} = \left(A - BK_i\right)x + B\left(u + K_ix\right).$
We explore the system by injecting a probing signal $u_0$ such that the system states do not become unbounded [10]. For example, following [10], one can design $u_0$ as a sum of sinusoids. Thereafter, we consider a quadratic Lyapunov function $V_i(x) = x^\top P_i x$; we can take the time derivative along the state trajectories and use Theorem 3 to alleviate the dependency on the state matrix $A$:
(42) $\dot{V}_i = x^\top\left[(A - BK_i)^\top P_i + P_i(A - BK_i)\right]x + 2\left(u + K_ix\right)^\top B^\top P_i x = -x^\top Q_i x + 2\left(u + K_ix\right)^\top B^\top P_ix,$
where $Q_i = Q + K_i^\top RK_i$. Therefore, starting with an arbitrary stabilizing control policy and integrating (42) over $[t, t + \delta t]$, we have
(43) $x^\top P_i x\Big|_t^{t + \delta t} = -\int_t^{t + \delta t} x^\top Q_i x\, d\tau + 2\int_t^{t + \delta t}\left(u + K_ix\right)^\top B^\top P_ix\, d\tau.$
We thereafter solve (43) by formulating an iterative algorithm using measurements of the state and control trajectories. The design requires gathering data matrices over a sufficient number of time samples (discussed shortly), where
(44) $\delta_{xx} = \left[\,x\otimes x\big|_{t_1}^{t_1 + \delta t},\; \dots,\; x\otimes x\big|_{t_l}^{t_l + \delta t}\right]^\top$
(45) $I_{xx} = \left[\int_{t_1}^{t_1 + \delta t} x\otimes x\, d\tau,\; \dots,\; \int_{t_l}^{t_l + \delta t} x\otimes x\, d\tau\right]^\top$
(46) $I_{xu} = \left[\int_{t_1}^{t_1 + \delta t} x\otimes u\, d\tau,\; \dots,\; \int_{t_l}^{t_l + \delta t} x\otimes u\, d\tau\right]^\top.$
1. Gather sufficient data:
Store data ($x$ and $u$) for the intervals $[t_k, t_k + \delta t]$, $k = 1, \dots, l$.
Then construct the data matrices (44)-(46) such that the rank condition of Remark 2 is satisfied.
2. Controller update iteration: Starting with a stabilizing $K_1$, compute $P_i$ and $K_{i+1}$ iteratively using the following equations
for $i = 1, 2, \dots$
A. Solve for $P_i$ and $B^\top P_i$:
(47) 
B. Compute $M_i = -\left(B^\top P_i\right)\circ \mathcal{I}_\mathcal{K}^c$ using the feedback structure matrix.
C. Update the gain $K_{i+1} = R^{-1}\left(B^\top P_i + M_i\right)$.
D. Terminate the loop when $\|P_i - P_{i-1}\| < \epsilon$, where $\epsilon$ is a small threshold.
end for
3. Applying $K$ on the system: Finally, apply $u = -K_ix$, and remove the probing signal $u_0$.
The algorithm above presents the steps to compute the structured feedback gain without knowing the state matrix $A$.
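Step 1 of the algorithm (data gathering and the rank check of Remark 2) can be sketched as follows. Everything here is illustrative: the two-state system, probing frequencies, and interval lengths are assumptions, and the true $A$ appears only to synthesize the measured trajectories, never inside the estimator:

```python
import numpy as np

# Sketch of the data-gathering step on a hypothetical 2-state, 2-input
# system. A_true is used only to simulate measurements.
A_true = np.array([[-0.5, 1.0],
                   [-1.0, -0.5]])
B = np.eye(2)

dt = 1e-3          # simulation step
delta = 0.1        # length of each data interval [t_k, t_k + delta]
n_intervals = 60

def probe(t):
    # Sum-of-sinusoids exploration signal (frequencies are arbitrary).
    return np.array([np.sin(t) + 0.5 * np.sin(3.3 * t),
                     np.cos(2.1 * t) + 0.5 * np.sin(5.7 * t)])

def vech(S):
    # Distinct entries of the symmetric matrix x x^T (upper triangle).
    return S[np.triu_indices(S.shape[0])]

x = np.array([1.0, -1.0])
t = 0.0
d_xx, I_xx, I_xu = [], [], []
for _ in range(n_intervals):
    xx0 = vech(np.outer(x, x))
    ixx = np.zeros(3)   # n(n+1)/2 = 3 distinct quadratic monomials
    ixu = np.zeros(4)   # n*m = 4 cross terms
    for _ in range(int(delta / dt)):       # Euler integration over interval
        u = probe(t)
        ixx += vech(np.outer(x, x)) * dt
        ixu += np.kron(x, u) * dt
        x = x + (A_true @ x + B @ u) * dt
        t += dt
    d_xx.append(vech(np.outer(x, x)) - xx0)
    I_xx.append(ixx)
    I_xu.append(ixu)

theta = np.hstack([np.array(I_xx), np.array(I_xu)])
# Full column rank (n(n+1)/2 + n*m = 7 here) means the least-squares
# unknowns are identifiable from the exploration data.
print(np.linalg.matrix_rank(theta))
```

With a structured gain, the required rank drops to $n(n+1)/2 + n_\mathcal{K}$ since only the nonzero gain entries enter the least squares; the dense count is used here for simplicity.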
Remark 1: If $A$ is Hurwitz, then the controller update iteration in (47) can be started without any stabilizing initial control. Otherwise, a stabilizing $K_1$ is required, as is commonly encountered in the RL literature [10]. This is mainly due to the equivalence with the modified Kleinman's algorithm in Theorem 3.
Remark 2: The rank condition dictates the number of data samples that need to be gathered. For this algorithm we need $\operatorname{rank}\left(\left[I_{xx},\, I_{xu}\right]\right) = n(n+1)/2 + n_\mathcal{K}$, where $n_\mathcal{K}$ is the number of nonzero elements in the structured feedback gain matrix. This is based on the number of unknown variables in the least squares. The number of data samples can be taken to be twice this number to guarantee convergence.
The next theorem describes the suboptimality of the structured solution with respect to the unconstrained objective.
Theorem 4: The difference between the optimal structured control objective value and the optimal unstructured objective value is bounded as:
(48) 
for any control structure. Here , where  is an operator defined as:
(49) 
Proof: Let the unstructured solution of the ARE be denoted as , so that the unstructured objective value is , whereas the learned structured control results in the objective . Therefore, we have
(50) 
Using , and  as defined in the statement of Theorem 4, and following [22, Theorem 3] with , we have:
As such, the difference between the optimal values  and  is bounded by
(51) 
for any structure of the control. We note that  and  are not dependent on the control structure. Therefore, inequality (51) indicates that the difference between the optimal objective value with $K \in \mathcal{K}$ and the optimal unstructured objective value is linearly bounded by the Kronecker combination of the initial values, for any control structure.
4 Numerical Example
We consider a multi-agent network with six agents following the interaction structure shown in Fig. 1. We consider each agent to follow consensus dynamics with its neighbors such that:
(52) $\dot{x}_k = \sum_{j \in \mathcal{N}_k} \alpha_{kj}\left(x_j - x_k\right) + u_k,$
where $\alpha_{kj}$ are the coupling coefficients and $\mathcal{N}_k$ is the neighbor set of agent $k$. We consider the state and input matrices to be:
(53) 
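For concreteness, a consensus (negative-Laplacian) state matrix for six agents can be constructed as below. The edge set and uniform coupling weight are hypothetical, standing in for the Fig. 1 topology and the coefficients of (52):

```python
import numpy as np

# Hypothetical interaction graph for six agents (a ring plus one chord);
# not the paper's Fig. 1 topology, whose edges and weights are not given here.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (1, 4)]
alpha = 1.0  # uniform coupling coefficient

n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] += alpha
    A[j, i] += alpha
    A[i, i] -= alpha
    A[j, j] -= alpha   # rows sum to zero: A = -(graph Laplacian)

eigs = np.linalg.eigvals(A)
print(bool(np.isclose(eigs.real.max(), 0.0)))  # True: zero eigenvalue on axis
```

Because the rows of $A$ sum to zero, one eigenvalue sits exactly at the origin; the controller's task in the experiments below is to damp such poorly damped modes.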
The dynamics given above is generally referred to as Laplacian dynamics, with the row sums of the state matrix equal to zero, resulting in a zero eigenvalue. We would like the controller to improve the damping of the eigenvalues closest to instability. The eigenvalues of the open-loop system are . We choose the initial conditions as . We consider two scenarios with structured gains: A. , and B. along with the sparsity pattern of scenario A, we also impose . Please note that we consider an arbitrary sparsity pattern. We assume that the states of all the agents can be measured. We consider the design parameters to be . First, we describe scenario A. Here we have , and the number of nonzero elements of the structured gain is ; therefore, we need to gather at least  samples, which is  data samples. We consider the time step to be  s, and gather data for  s. The iteration for  and  took around  s on average on a MacBook laptop with Catalina OS and a 2.8 GHz Quad-Core Intel Core i7 with 16 GB RAM. During exploration, we used a sum-of-sinusoids perturbation signal to persistently excite the network. Please note that the majority of the learning time is spent on exploration of the system because of the requirement of persistent excitation; the least-squares iteration is an order of magnitude faster than the exploration. With faster processing units, the least-squares iteration can be made even faster. Fig. 4 shows the state trajectories of the agents during the exploration and control implementation phases. The structured control gain learned in this scenario is given as:
(54) 
The total cost comes out to be  units. Fig. 4 shows that the $P$ and $K$ iterations converge within a small number of iterations. The structured solution also matches the model-based solution from Theorem 3 with high accuracy. When the learning is performed for the unstructured LQR control gain, the solution comes out to be:
(55) 
with the objective of units.
We then perform a similar experiment for scenario B, where we have removed an additional set of communication links. Here, the number of nonzero elements of the structured gain is . We need to gather at least  data samples; therefore, we perform around  s of exploration with a  s time step. The structured control gain learned for this scenario is given by,
(56) 
with the objective of  units. The state trajectories for scenario B during the exploration and control implementation phases are given in Fig. 7. The convergence of the least-squares iterates for $P$ and $K$ is shown in Fig. 7, where we can see that convergence is reached within a small number of iterations. Also, the damping of the eigenvalues is improved, with the closed-loop eigenvalues placed at .
5 Conclusions and path forward
This paper discussed model-free learning-based computation of stable and close-to-optimal feedback gains that are constrained to a specific yet general structure. We first formulated a model-based condition to compute structured feedback control, which resulted in a modified algebraic Riccati equation (ARE). Techniques from dynamic programming and Lyapunov-based stability criteria ensured stability and convergence. Thereafter, we embedded the model-based condition into an RL algorithm in which trajectories of states and controls are used to compute the gains. The algorithm interleaves policy evaluation and policy improvement steps. We also analyzed the sub-optimality of the structured optimal control solution in comparison with the unstructured optimal solution. Simulations on a multi-agent network with a constrained communication infrastructure validated the theoretical results. Our future work will investigate the possibility of embedding a prescribed degree of stability margin along with the structural constraint, as well as necessary and sufficient conditions for the existence of the optimal structured control.
References
 [1] (2008) Decentralized control: an overview. Annual Reviews in Control 32, pp. 87–98.
 [2] (2012) Dynamic programming and optimal control: approximate dynamic programming, 4th ed.
 [3] (2016) Control of interconnected systems with distributed model knowledge. PhD thesis, TU München, Germany.
 [4] (2011) Sparsity-promoting optimal control for a class of distributed systems. In Proceedings of the 2011 American Control Conference, pp. 2050–2055.
 [5] (2020) Learning the globally optimal distributed LQ regulator. arXiv:1912.08774v3.
 [6] (1984) Structural constrained controllers for linear discrete dynamic systems. IFAC Proceedings Volumes 17 (2), pp. 435–440. Note: 9th IFAC World Congress.
 [7] (2006) Adaptive control tutorial.
 [8] (2012) Robust adaptive dynamic programming for large-scale systems with an application to multimachine power systems. IEEE Transactions on Circuits and Systems II: Express Briefs 59 (10), pp. 693–697.
 [9] (2012) Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 48, pp. 2699–2704.
 [10] (2017) Robust adaptive dynamic programming.
 [11] (2018) Optimal and autonomous control using reinforcement learning: a survey. IEEE Transactions on Neural Networks and Learning Systems.
 [12] (1968) On an iterative technique for Riccati equation computations. IEEE Transactions on Automatic Control 13 (1), pp. 114–115.
 [13] (2004) Distributed control design for systems interconnected over an arbitrary graph. IEEE Transactions on Automatic Control 49 (9), pp. 1502–1519.
 [14] (2011) Quadratic invariance is necessary and sufficient for convexity. In Proceedings of the 2011 American Control Conference, pp. 5360–5362.
 [15] (1995) Optimal control.
 [16] (2009) Distributed control for identical dynamically coupled systems: a decomposition approach. IEEE Transactions on Automatic Control 54 (1), pp. 124–135.
 [17] (2019) Block-decentralized model-free reinforcement learning control of two-time-scale networks. In American Control Conference 2019, Philadelphia, PA, USA.
 [18] (2018) On model-free reinforcement learning of reduced-order optimal control for singularly perturbed systems. In IEEE Conference on Decision and Control 2018, Miami, FL, USA.
 [19] (2020) Reduced-dimensional reinforcement learning control using singular perturbation approximations. arXiv:2004.14501.
 [20] (2007) Approximate dynamic programming.
 [21] (2006) A characterization of convex problems in decentralized control. IEEE Transactions on Automatic Control 51 (2), pp. 274–286.
 [22] (1998) Perturbation theory for algebraic Riccati equations. SIAM Journal on Matrix Analysis and Applications 19, pp. 39–65.
 [23] (1998) Reinforcement learning: an introduction.
 [24] (2017) Q-learning for continuous-time linear systems: a model-free infinite horizon optimal control approach. Systems and Control Letters 100, pp. 14–20.
 [25] (2009) Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 45, pp. 477–484.
 [26] (1989) Learning from delayed rewards. PhD thesis, King's College, Cambridge.