1 Introduction
Over the last few decades, automatic control researchers have extensively investigated control designs for large-scale and interconnected systems. For such systems, distributed control solutions are far more feasible and computationally cheaper than full-scale centralized control designs. In many cyber-physical systems, the communication infrastructure used to implement feedback controllers has its own limitations. These computational and practical bottlenecks have spurred research on distributed and structured control designs. Optimal control laws that can capture a decentralized [1] or a more generic distributed [3] structure have been investigated under the notions of quadratic invariance (QI) [21, 14], structured linear matrix inequalities (LMIs) [13, 15], sparsity-promoting optimal control [4], etc. Reference [6] discusses a structural feedback gain computation for discrete-time systems with known dynamics.
However, the above works assume that the dynamic model of the system is known to the designer. In practice, an accurate dynamic model of a large-scale system may not always be known. For example, the U.S. Eastern Interconnection transmission grid model consists of so many buses that tractable model-based control design becomes very difficult. Moreover, the dynamic system may be coupled with unmodeled dynamics from coupled processes, parameter drift, etc., which has motivated research on learning control gains from system trajectory measurements. In recent times, ideas from machine learning and control theory have been unified under the notion of reinforcement learning (RL) [23]. In the RL framework, the controller tries to optimize its policies based on its interaction with the system or environment to achieve higher rewards. Traditionally, RL techniques have been developed in the Markov decision process (MDP) framework [26, 2, 20], which has since been the driving force behind many advanced algorithmic solutions for sequential decision making. On a slightly different path, to incorporate more rigorous control-theoretic guarantees in the context of dynamic system control, where systems are modeled by differential equations, works such as [25, 9, 24, 11, 16, 18] brought together the best of optimal and adaptive control with machine learning ingredients, sometimes under the notion of adaptive dynamic programming (ADP). Learning control designs have been reported for partially or completely model-free settings. Works such as [7, 10, 17] present robust ADP/RL designs for dynamic system control. In this article, we concentrate on learning control designs from the dynamic system viewpoint, and present a novel structured optimal control learning solution that incorporates robustness for continuous-time linear systems with unknown state matrix.
Although reference [27] presents a survey of multi-agent and distributed RL solutions in the context of MDPs, the area of distributed and structured learning control designs from the dynamic system viewpoint is still less explored. In [18], a projection-based reduced-dimensional RL variant has been proposed for singularly perturbed systems, [8] presents a decentralized design for a class of interconnected systems, and, more recently, [5] presents a structured design for discrete-time time-varying systems. Continuing along these lines of research, this article presents a robust structured optimal control learning methodology using ideas from ADP/RL.
We first consider the dynamic system without any exogenous perturbations, and formulate a model-based structural optimal control solution for continuous-time LTI systems using dynamic programming. Thereafter, we perform the robustness analysis in the presence of exogenous inputs and provide guarantees with sufficient stability conditions. Subsequently, these model-based formulations are translated into a data-driven gain computation framework, RSRL, that can encapsulate the structural and robustness constraints while retaining stability, convergence, and suboptimality guarantees. It is worth noting that the “robustness” of this structural feedback control to the exogenous inputs is shown in two ways. First, if the exogenous inputs are entirely unknown, then the closed-loop system is input-to-state stable with respect to the exogenous inputs. Second, if the exogenous inputs are bounded and intermittently measurable, i.e., measurements of the exogenous inputs are available at least for some disjoint intervals, then the closed-loop system is globally asymptotically stable. This implies that the intermittently measurable exogenous inputs are fully compensated when computing the structured feedback controls. We validate our design on a multi-agent dynamic network.
2 Model and Problem formulation
We consider a perturbed linear time-invariant (LTI) continuous-time dynamic system as
(1) 
where are the states and control inputs. The perturbation is caused by the influence of the exogenous input . The function represents a functional coupling of the dynamic system with some extraneous processes that can influence it. We consider the input matrix for the control and the exogenous input to be the same, i.e., we use matched input ports. We make the following set of assumptions.
Assumption 1: The dynamic state matrix is unknown, and the input matrix is assumed to be known.
Thereafter, we characterize the nature of the coupling variable with the following assumption.
Assumption 2: The exogenous input measurements are available at least for some disjoint intervals where is the current time and is a small time increment [7], and satisfy the boundedness property given as:
(2) 
In our numerical example, we considered to be of structure with . With this unknown state matrix, and in the presence of the bounded disturbance given in Assumption 2, we would like to learn an optimal feedback gain . However, instead of an unrestricted control gain , we impose some structure on the gain. We would like to have , which we call the structural constraint, where is the set of all structured controllers such that:
(3) 
Here is the matricial function that can capture any structure in the control gain matrix. This makes it possible to implement only the permitted (nonzero) control communication links in the gain matrix.
We now make the following assumption on the control gain structure .
Assumption 3: The communication structure required to implement the feedback control is known to the designer, and is sparse in nature.
This assumption means that the structure is known, and it captures the limitations of the feedback communication infrastructure. For many networked physical systems, the communication infrastructure may already exist. For example, in some peer-to-peer architectures, agents can only receive feedback from their neighbors. Another very commonly designed control structure is block-decentralized, where only local states are used to perform feedback control. Our general constraint set encompasses all such scenarios. We also make the standard stabilizability assumption.
Assumption 4: The pair is stabilizable and is observable, where is responsible to penalize the states as described in the problem statement.
3 Structured Reinforcement Learning and Robustness
To develop the learning control design we will take the following route. First, we consider an unperturbed dynamic system, i.e., and then formulate a framework where we can impose the structural constraint. Thereafter, we will consider the perturbation caused by , and then refine the framework to add robustness guarantees.
3.1 Structural Constraint on the Feedback Control Gain
We start with the unperturbed dynamic system:
(5) 
and we want to design the control such that .
We will first formulate a model-based solution of the optimal control problem via a modified algebraic Riccati equation (ARE) in continuous time that can solve the structured optimal control problem when all the state matrices are known. The scenario with the structural constraint and without any exogenous inputs has been presented in our recent work [19]. In this article we present the detailed formulation for better readability. We now present an important model-based theorem.
Theorem 1: For any matrix ,
let be the solution of the following modified Riccati equation
(6) 
Then, the control gain
(7) 
will ensure closed-loop stability of (5), i.e., without any external perturbation.
Proof: See Appendix.
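As a minimal numerical sketch of Theorem 1, the scalar case can be worked by hand. The snippet assumes that the modified Riccati equation (6) has the scalar form 2ap - (b^2/r)p^2 + q + r l^2 = 0 and that the gain (7) is k = bp/r - l. This form, and the names a, b, q, r, l, p, k, are assumptions made for illustration and are not confirmed by the text above.

```python
import math

def modified_are_scalar(a, b, q, r, l):
    """Positive root p of the ASSUMED scalar modified ARE
    2*a*p - (b**2 / r) * p**2 + q + r * l**2 = 0."""
    s = b * b / r              # quadratic coefficient b^2 / r
    c = q + r * l * l          # constant term q + r*l^2 (positive when q > 0)
    return (a + math.sqrt(a * a + s * c)) / s

def structured_gain_scalar(a, b, q, r, l):
    """Return (p, k) with the ASSUMED gain k = b*p/r - l."""
    p = modified_are_scalar(a, b, q, r, l)
    return p, b * p / r - l

# Hypothetical open-loop unstable plant dx/dt = a*x + b*u with a > 0.
a, b, q, r = 1.0, 1.0, 1.0, 1.0
for l in (-2.0, 0.0, 0.5, 3.0):
    p, k = structured_gain_scalar(a, b, q, r, l)
    residual = 2 * a * p - (b * b / r) * p * p + q + r * l * l
    closed_loop = a - b * k    # closed-loop pole under u = -k*x
    print(f"l={l:5.2f}  p={p:.4f}  k={k:.4f}  "
          f"a-bk={closed_loop:.4f}  residual={residual:.1e}")
```

Under these assumptions the closed-loop pole equals b l - sqrt(a^2 + (b^2/r)(q + r l^2)), which is negative for every l when q > 0, in line with the claim that any choice of the free matrix still yields a stabilizing gain.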
At this point, we investigate the matricial structure constraint more closely. Let denote the indicator matrix for the structured matrix set , where this matrix contains the element 1 whenever the corresponding element in is nonzero. The structural constraint is simply written as:
(8) 
Here, denotes the elementwise (Hadamard) product, and is the complement of . We now state the following theorem on the choice of to impose structure on . It follows a form similar to the discrete-time condition of [6].
Theorem 2: Let
where Then, the control gain designed as in Theorem 1 will satisfy the structural constraint
Proof: We have,
(9)  
(10)  
(11)  
(12)  
(13) 
This concludes the proof.
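The role of the indicator matrix and its complement can be sketched in a few lines. The snippet assumes that the gain is obtained as K = F - L with L equal to the complement indicator multiplied elementwise with F, where F stands for the unstructured gain expression; the names F, L, K, the helper functions, and the 3-agent graph are all hypothetical and chosen only for illustration.

```python
def hadamard(X, Y):
    """Elementwise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def complement(I):
    """Complement of a 0/1 indicator matrix."""
    return [[1 - e for e in row] for row in I]

def neighbor_indicator(n, edges):
    """Indicator for a neighbor-based structure: agent i may use its own
    state and the states of its neighbors (one state and one input per
    agent is assumed here)."""
    I = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        I[i][j] = I[j][i] = 1
    return I

def impose_structure(F, I):
    """Return (K, L) with L = I^c o F and K = F - L, so that K = I o F."""
    L = hadamard(complement(I), F)
    K = [[f - l for f, l in zip(rf, rl)] for rf, rl in zip(F, L)]
    return K, L

# Hypothetical 3-agent example: only agents 0 and 1 communicate.
I_K = neighbor_indicator(3, [(0, 1)])
F = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
K, L = impose_structure(F, I_K)
print(K)                                # entries outside the pattern are zero
print(hadamard(complement(I_K), K))     # structural constraint: all zeros
```

A block-decentralized structure is obtained in the same way by placing blocks of ones on the diagonal of the indicator matrix.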
The next theorem describes the suboptimality of the structured solution with respect to the unconstrained objective.
Theorem 3: The difference between the optimal structured control objective value and the optimal unstructured objective is bounded as:
(14) 
for any control structure. Here , where is a operator defined as:
(15) 
Proof: Let the unstructured solution of the ARE be denoted as . Then the unstructured objective value is , whereas the learned structured control results in the objective . Therefore we have,
(16) 
Following [22, Theorem 3] we use , and as defined in Theorem 3. With , we will have,
As such, the difference between the optimal values and is bounded by
(17) 
for any structure of the control. We note that and do not depend on the control structure. Therefore, the inequality (17) indicates that the difference between the optimal control value with and the optimal unstructured control value is linearly bounded by the initial value for any control structure.
3.2 Robustness
So far, we have formulated a modified algebraic Riccati equation that can restrict the optimal control solutions to the structural constraint. We will now investigate the robustness guarantees and the necessary modifications for the system (1) with .
Theorem 4: For any bounded exogenous input , the structural control where computed following Theorems 1 and 2 will render the system (1) input-to-state stable (ISS) with respect to .
Proof: The closed-loop dynamics with control is given by:
(18) 
Therefore, we have
(19) 
As is stable we have, . Subsequently, we can bound as,
(20) 
Therefore, given the global asymptotic stability under the structured control, we can conclude that as long as is bounded, the closed-loop system is ISS with respect to .
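The ISS argument can be written out as a short worked bound. The notation below is an assumption made for illustration: the closed-loop system is taken as \dot{x} = A_c x + B g(w) with A_c = A - BK Hurwitz, so that \|e^{A_c t}\| \le c\, e^{-\alpha t} for some constants c \ge 1 and \alpha > 0.

```latex
\begin{aligned}
x(t) &= e^{A_c t} x_0 + \int_0^t e^{A_c (t-\tau)} B\, g(w(\tau))\, d\tau, \\
\|x(t)\| &\le c\, e^{-\alpha t} \|x_0\|
  + c\, \|B\| \int_0^t e^{-\alpha (t-\tau)} \|g(w(\tau))\|\, d\tau \\
&\le c\, e^{-\alpha t} \|x_0\|
  + \frac{c\, \|B\|}{\alpha} \sup_{0 \le \tau \le t} \|g(w(\tau))\| .
\end{aligned}
```

The first term decays with the initial condition and the second is proportional to the supremum of the exogenous term, which is the standard form of an ISS estimate for a stable linear system.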
If the exogenous inputs are intermittently measurable as in Assumption 2, then they can be entirely compensated by the structural feedback control to ensure global asymptotic stability of the closed-loop system, as stated in the following theorem, which will be necessary when developing the data-driven algorithm. Please note that although the ISS condition in Theorem 4 can be ensured when measurements of the bounded exogenous inputs are not available, the control learning depends on state trajectories that are perturbed by the exogenous inputs, and we will therefore require measurements of the exogenous inputs.
Theorem 5: With Assumption 2, let be the solution of the following Riccati equation
(21) 
then the control will ensure closed-loop stability of (1) with , and where and are the maximum and minimum eigenvalues of .
Proof: See Appendix.
4 Reinforcement Learning Algorithm
The gain can be computed iteratively using the following model-based algorithm.
Theorem 6: Let be such that is Hurwitz. Then, for
1. Solve for (Policy Evaluation) :
(22) 
2. Update the control gain (Policy update):
(23) 
Then is Hurwitz and and converge to structured , and as .
Proof: The theorem is motivated by Kleinman’s algorithm [12], which has been used for unstructured feedback gain computations. Compared with Kleinman’s algorithm, the control gain update step is now modified to impose the structure. With some mathematical manipulations it can be shown that using , where we have,
(24)  
(25) 
Therefore, (22) is equivalent to (21) for the iteration. As we have shown stability and convergence via dynamic programming and Lyapunov analysis for the Riccati equation (21), the theorem follows from the equivalence of that result with this iterative version.
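The policy evaluation and structured policy update steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact iteration of Theorem 6: the evaluation step is taken as the Lyapunov equation A_j^T P_j + P_j A_j + Q + K_j^T K_j = 0 with A_j = A - B K_j and R = I, the robustness terms of (21) are omitted, and the update is taken as K_{j+1} = I_K o (B^T P_j). All matrices and helper names below are hypothetical.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[i][j] -= f * aug[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (aug[i][n] - sum(aug[i][j] * z[j]
                                for j in range(i + 1, n))) / aug[i][i]
    return z

def lyap(Ac, W):
    """Solve Ac^T P + P Ac + W = 0 by row-major vectorization of P."""
    n = len(Ac)
    At = transpose(Ac)
    M = [[0.0] * (n * n) for _ in range(n * n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                M[i * n + j][k * n + j] += At[i][k]   # (Ac^T P)_ij
                M[i * n + j][i * n + k] += Ac[k][j]   # (P Ac)_ij
    p = solve(M, [-W[i][j] for i in range(n) for j in range(n)])
    return [p[i * n:(i + 1) * n] for i in range(n)]

def structured_kleinman(A, B, Q, I_K, K0, iters=20):
    """Kleinman-type iteration with a structured gain update (R = I assumed)."""
    n, K = len(A), K0
    for _ in range(iters):
        BK = matmul(B, K)
        Ac = [[A[i][j] - BK[i][j] for j in range(n)] for i in range(n)]
        KtK = matmul(transpose(K), K)
        W = [[Q[i][j] + KtK[i][j] for j in range(n)] for i in range(n)]
        P = lyap(Ac, W)                               # policy evaluation
        F = matmul(transpose(B), P)                   # unstructured update
        K = [[I_K[i][j] * F[i][j] for j in range(n)]  # keep allowed entries
             for i in range(len(F))]
    return P, K

# Hypothetical 2-state example with a decentralized (diagonal) gain.
A = [[-1.0, 0.5], [0.5, -2.0]]     # Hurwitz, so K0 = 0 is stabilizing
B = [[1.0, 0.0], [0.0, 1.0]]
Q = [[1.0, 0.0], [0.0, 1.0]]
I_K = [[1, 0], [0, 1]]
K0 = [[0.0, 0.0], [0.0, 0.0]]
P, K = structured_kleinman(A, B, Q, I_K, K0)
print("P =", P)
print("K =", K)
```

The only change relative to the unstructured Kleinman iteration is the last line of the loop, which masks the updated gain with the indicator matrix.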
Although we have formulated an iterative solution to compute structured feedback gains, the algorithm is still model-dependent. We now move to a design that is agnostic to the state matrix using reinforcement learning. We write (1) incorporating as
(26) 
We explore the system by injecting a probing signal such that the system states do not become unbounded [10]. For example, following [10] one can design as a sum of sinusoids. Thereafter, we consider a quadratic Lyapunov function , take its time derivative along the state trajectories, and use Theorem 6 to remove the dependency on the state matrix .
(27)  
where, . Therefore, starting with an arbitrary control policy we have,
(28) 
We, thereafter, solve (4) by formulating an iterative algorithm using measurements of the state and control trajectories. The algorithm formulates an iterative least-squares problem to solve for the structured gain. The design requires gathering data matrices for a sufficient number of time samples (discussed shortly), where denotes the Kronecker product and , as follows:
(29)  
(30)  
(31)  
(32) 
1. Gather sufficient data: Store data ( and ) for interval . Then construct the following data matrices such that rank(. Select where is estimated from the allowable implementation gains.
2. Controller update iteration: Starting with a stabilizing , compute iteratively () using the following iterative equation
for
A. Solve for and :
(33) 
B. Compute using the feedback structure matrix.
C. Update the gain .
D. Terminate the loop when , where is a small threshold.
endfor
3. Applying K on the system: Finally, apply , and remove .
Algorithm 1 presents the steps to compute the structured feedback gain without knowing the state matrix .
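The idea of replacing the state matrix by trajectory data can be sketched in the scalar case. This is a simplified illustration and not Algorithm 1 itself: it assumes a hypothetical plant dx/dt = a x + b u with b known, no exogenous input, and no structure (the gain is a scalar), so a single unknown p is estimated per iteration instead of the Kronecker-product data matrices of (29)-(32). For an evaluated gain k, the trajectory identity p [x^2(t+T) - x^2(t) - 2b \int x (u + k x) d\tau] = -(q + r k^2) \int x^2 d\tau holds along any trajectory, and it is solved in the least-squares sense over several windows.

```python
import math

# Hypothetical scalar plant dx/dt = a*x + b*u; the learner never reads a_true.
a_true, b, q, r = 1.0, 1.0, 1.0, 1.0
k0 = 2.0                            # stabilizing initial gain (a_true - b*k0 < 0)
dt, T_win, n_win = 1e-4, 0.1, 20    # step size, window length, window count

def probe(t):
    """Sum-of-sinusoids exploration signal."""
    return 0.5 * (math.sin(3.0 * t) + math.sin(7.0 * t))

def collect(x0=1.0):
    """Run u = -k0*x + probe(t) and return, per window, the tuple
    (x^2 difference, integral of x^2, integral of x*u)."""
    steps = int(round(T_win / dt))
    data, x, t = [], x0, 0.0
    for _ in range(n_win):
        x_start, i_xx, i_xu = x, 0.0, 0.0
        for _ in range(steps):
            u = -k0 * x + probe(t)
            x_next = x + dt * (a_true * x + b * u)      # forward Euler step
            t_next = t + dt
            u_next = -k0 * x_next + probe(t_next)
            i_xx += 0.5 * dt * (x * x + x_next * x_next)    # trapezoid rule
            i_xu += 0.5 * dt * (x * u + x_next * u_next)
            x, t = x_next, t_next
        data.append((x * x - x_start * x_start, i_xx, i_xu))
    return data

def learn(data, iters=8):
    """Data-driven policy iteration: least-squares policy evaluation for p,
    followed by the gain update k = b*p/r."""
    k = k0
    for _ in range(iters):
        theta = [dxx - 2.0 * b * (ixu + k * ixx) for dxx, ixx, ixu in data]
        xi = [-(q + r * k * k) * ixx for _, ixx, _ in data]
        p = sum(th * v for th, v in zip(theta, xi)) / sum(th * th for th in theta)
        k = b * p / r
    return p, k

p, k = learn(collect())
print("learned p, k:", p, k)
print("model-based p:", r * (a_true + math.sqrt(a_true**2 + b * b * q / r)) / (b * b))
```

The same data set is reused in every iteration, which mirrors the off-policy nature of the algorithm: data are gathered once under the exploratory input, and only the least-squares problem is re-solved as the gain changes.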
Remark 1: If is Hurwitz, then the controller update iteration in (33) can be started without any stabilizing initial control. Otherwise, a stabilizing is required, as commonly encountered in the RL literature [10]. This is mainly due to its equivalence with the modified Kleinman’s algorithm in Theorem 6.
Remark 2: The rank condition dictates the number of data samples that need to be gathered. For this algorithm we need rank(, where is the number of nonzero elements in the structured feedback control matrices. This is based on the number of unknown variables in the least-squares problem. The number of data samples can be taken to be twice this number to guarantee convergence.
The condition on in Theorem 5 ensures robustness of the design. Although the designer can estimate , the numerical values during the initial iterations may not satisfy the condition on , and the designer may therefore need to retune the parameter .
Theorem 7: Performing Algorithm 1 using , and will recover the structured , and corresponding to Theorem 6 for (1).
Proof: Performing Algorithm 1 using , and is equivalent to solving the trajectory relationship (4). As (4) has been constructed using Theorem 6, any solution from Theorem 6 will satisfy the iteration of the following equation:
(34) 
Hence, any solution of Theorem 6 is also a solution of (34). When the condition rank( is satisfied, will have full column rank. As such, equation (34) has a unique solution , from which we have a unique solution . Since this solution is unique, it is also the solution of Theorem 6. Considering this equivalence of Algorithm 1 with the modified Kleinman update in Theorem 6, we can conclude that the structured , and can be recovered using Algorithm 1.
5 Numerical Example
We consider a multi-agent network with agents following the interaction structure shown in Fig. 1. We consider each agent to follow consensus dynamics with its neighbors such that:
(35) 
where are the coupling coefficients. We consider the state and input matrix to be:
(36) 
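Consensus dynamics of this kind lead to a Laplacian-type state matrix. As a small sketch, the snippet below assumes the common form dx_i/dt = \sum_j a_{ij} (x_j - x_i) plus a control input, and builds the state matrix for a hypothetical 4-agent ring; the graph, the weights, and the function name are illustrative assumptions and are not the network of Fig. 1.

```python
def consensus_state_matrix(n, weighted_edges):
    """State matrix A = -Laplacian for dx_i/dt = sum_j a_ij * (x_j - x_i),
    built from undirected edges (i, j, a_ij)."""
    A = [[0.0] * n for _ in range(n)]
    for i, j, w in weighted_edges:
        A[i][j] += w
        A[j][i] += w
        A[i][i] -= w
        A[j][j] -= w
    return A

# Hypothetical ring of 4 agents with arbitrary coupling coefficients.
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 1.5), (3, 0, 0.5)]
A = consensus_state_matrix(4, edges)
for row in A:
    print(row, "row sum =", sum(row))
```

Every row sums to zero, so the vector of ones lies in the null space of the state matrix. This is the zero eigenvalue that makes the open-loop network only marginally stable.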
We thereafter consider the exogenous input perturbation to be , with the estimate of to be . We also set , and , which ensures that the condition on can be satisfied. We consider . Dynamics of this form are generally referred to as Laplacian dynamics, with resulting in a zero eigenvalue. We would like the controller to improve the damping of the eigenvalues closer to instability. The eigenvalues of the system are . We choose the initial conditions as . We consider an arbitrary sparsity pattern, with the assumption that the states of all the agents can be measured. We experiment with the following structure constraint:
.
Here we have , and the number of nonzero elements of is . Therefore, we need to gather data for at least samples, which is data samples. We use a time step of with an exploration of . The iteration for and took around s on average in MATLAB R2019a on a MacBook laptop running macOS Catalina, with a 2.8 GHz quad-core Intel Core i7 and 16 GB of RAM. During exploration, we used a sum-of-sinusoids perturbation signal to persistently excite the network. Please note that the majority of the learning time is spent on the exploration of the system because of the requirement of persistent excitation, and the least-squares iteration time is an order of magnitude smaller than the exploration time. With faster processing units, the least-squares iteration can be made much faster. Fig. 4 shows the state trajectories of the agents during exploration and during the control implementation phase. The structured control gain learned in this scenario is given as:
(37) 
The total cost comes out to be units. Fig. 44 show that the and iterations converge after around iterations. The solution also closely matches the model-based computation. On the other hand, the unstructured optimal control solution is given as:
(38) 
with the optimal cost units. With the structured solution , the damping of the closed-loop poles is improved as they are placed at . This example brings out various intricacies of the algorithm and validates our theoretical results.
6 Conclusions
This paper presented a reinforcement learning-based robust optimal control design for linear systems with unknown state dynamics when the control is subject to a structural constraint. We first formulated an extended algebraic Riccati equation (ARE) from the model-based analysis, encompassing dynamic programming and robustness analysis with sufficient stability and convergence guarantees. Subsequently, a policy iteration-based RL algorithm was formulated using these model-based results; it retains the rigorous guarantees and can compute the structured suboptimal gains using trajectory measurements of states, controls, and exogenous inputs. The suboptimality of the learned structured gain was also quantified by comparing it with the unconstrained optimal solution. Simulations on a multi-agent network with a constrained communication infrastructure and exogenous influences at each agent substantiate our theoretical and algorithmic formulations. Future research will investigate the feasibility and methodology of robust design variants when measurements of the exogenous inputs are not available, by exploiting some underlying knowledge about the exogenous input and its corresponding gain through the dynamic system.
References
[1] (2008) Decentralized control: an overview. Annual Reviews in Control 32, pp. 87–98.
[2] (2012) Dynamic programming and optimal control: approximate dynamic programming, 4th ed.
[3] (2016) Control of interconnected systems with distributed model knowledge. PhD thesis, TU Munchen, Germany.
[4] (2011) Sparsity-promoting optimal control for a class of distributed systems. In Proceedings of the 2011 American Control Conference, pp. 2050–2055.
[5] (2020) Learning the globally optimal distributed LQ regulator. arXiv:1912.08774v3.
[6] (1984) Structural constrained controllers for linear discrete dynamic systems. IFAC Proceedings Volumes 17 (2), pp. 435–440. 9th IFAC World Congress.
[7] (2013) Robust adaptive dynamic programming. In Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, F. Lewis and D. Liu (Eds.).
[8] (2012) Robust adaptive dynamic programming for large-scale systems with an application to multimachine power systems. IEEE Transactions on Circuits and Systems II: Express Briefs 59 (10), pp. 693–697.
[9] (2012) Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 48, pp. 2699–2704.
[10] (2017) Robust adaptive dynamic programming.
[11] (2018) Optimal and autonomous control using reinforcement learning: a survey. IEEE Trans. on Neural Networks and Learning Systems.
[12] (1968) On an iterative technique for Riccati equation computations. IEEE Trans. on Automatic Control 13 (1), pp. 114–115.
[13] (2004) Distributed control design for systems interconnected over an arbitrary graph. IEEE Transactions on Automatic Control 49 (9), pp. 1502–1519.
[14] (2011) Quadratic invariance is necessary and sufficient for convexity. In Proceedings of the 2011 American Control Conference, pp. 5360–5362.
[15] (2009) Distributed control for identical dynamically coupled systems: a decomposition approach. IEEE Transactions on Automatic Control 54 (1), pp. 124–135.
[16] On model-free reinforcement learning of reduced-order optimal control for singularly perturbed systems. In IEEE Conference on Decision and Control 2018, Miami, FL, USA.
[17] (2020) On robust model-free reduced-dimensional reinforcement learning control for singularly perturbed systems. In 2020 American Control Conference (ACC), pp. 3914–3919.
[18] (2020) Reduced-dimensional reinforcement learning control using singular perturbation approximations. arXiv 2004.14501.
[19] (2020) Reinforcement learning of structured control for linear systems with unknown state matrix. arXiv 2011.01128.
[20] (2007) Approximate dynamic programming.
[21] (2006) A characterization of convex problems in decentralized control. IEEE Transactions on Automatic Control 51 (2), pp. 274–286.
[22] (1998) Perturbation theory for algebraic Riccati equations. SIAM Journal on Matrix Analysis and Applications 19, pp. 39–65.
[23] (1998) Reinforcement learning - an introduction.
[24] (2017) Q-learning for continuous-time linear systems: a model-free infinite horizon optimal control approach. Systems and Control Letters 100, pp. 14–20.
[25] (2009) Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 45, pp. 477–484.
[26] (1989) Learning from delayed rewards. PhD thesis, King’s College, Cambridge.
[27] (2019) Multi-agent reinforcement learning: a selective overview of theories and algorithms. arXiv 1911.10635.
7 A1. Proof of Theorem 1
We look into the optimal control solution of the unperturbed dynamic system (5) with the objective (2) using dynamic programming (DP) so that we can ensure theoretical guarantees. We assume that at time , the state is at . We define the finite-time optimal value function with the unconstrained control as:
(39) 
with . Starting from state the optimal gives the minimum LQR cost-to-go. Now, as the value function is quadratic, we can write it in a quadratic form as, We next look into a small time interval , where is small, and in this small time interval we assume that the control is constant at and is optimal. Then the cost incurred over the interval is
(40) 
Also, the control input evolves the states at time to,
(41) 
Then, the minimum cost-to-go from is:
(42) 
Expanding as we have,
(43)  
(44) 
Therefore, the total cost ,
(45)  
(46) 
If the control is optimal then the total cost must be minimized. Minimizing over we have,
(47)  
(48) 
Now this gives us an optimal gain which solves the unconstrained LQR problem. However, we are not interested in the unconstrained optimal gain, as it cannot impose any structure per se. In order to impose structure on the feedback gains, the feedback control will have to deviate from the optimal solution of , and following [6], we introduce another matrix such that,
(49)  
(50) 
The matrix will help us to impose the structure, i.e., , which we will discuss later. Therefore, the structured implemented control is given by,
(51) 
We have , where is the set of all control inputs when following . Now, with a slight abuse of notation, we denote the matrix to be the solution corresponding to the structured optimal control. The Hamilton-Jacobi equation with the structured control is given by,
(52)  
(53) 
Substituting (51), neglecting higher-order terms, and simplifying, we get,
(54) 
For the steady-state solution, we will have,
(55) 
This proves the modified Riccati equation of the theorem. Now let us look into the stability of the closed-loop system with the gain . We can consider the Lyapunov function:
(56) 
Therefore, the time derivative along the closed-loop trajectory of (5) is given as,
(57)  
(58)  
(59)  
(60)  
(61)  
(62) 
Now, as is positive definite, the terms of the form are at least positive semidefinite. Therefore, we have,
(63) 
This ensures the stability of the closed-loop system (5). Since the linear system (5) is autonomous and is observable, global asymptotic stability can be proved using LaSalle’s invariance principle. This completes the proof.
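The Lyapunov step of this proof can be written out as a short worked derivation. The notation is an assumption made for illustration: V = x^\top P x, the modified Riccati equation is taken as A^\top P + PA - PBR^{-1}B^\top P + Q + L^\top R L = 0, and the structured gain as K = R^{-1}B^\top P - L.

```latex
\begin{aligned}
A_c &= A - BK = A - BR^{-1}B^\top P + BL, \\
\dot{V} &= x^\top \left( A_c^\top P + P A_c \right) x \\
&= x^\top \left( A^\top P + PA - 2PBR^{-1}B^\top P + L^\top B^\top P + PBL \right) x \\
&= x^\top \left( -Q - L^\top R L - PBR^{-1}B^\top P + L^\top B^\top P + PBL \right) x \\
&= -x^\top Q x - x^\top \left( R^{-1}B^\top P - L \right)^\top R \left( R^{-1}B^\top P - L \right) x \\
&= -x^\top \left( Q + K^\top R K \right) x \;\le\; -x^\top Q x .
\end{aligned}
```

The third line substitutes the assumed Riccati equation for A^\top P + PA, and the fourth completes the square in R, which is where positive definiteness of R is used.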
8 A2. Proof of Theorem 5
We start by considering the quadratic Lyapunov function . The closed-loop dynamics is now given by,
(64) 
Using Assumption 2, the derivative of the Lyapunov function along the closed-loop trajectories will satisfy
(65)  
(66)  
(67)  
(68) 
Denote then, Hence, . On the other hand