Reinforcement Learning of Structured Control for Linear Systems with Unknown State Matrix

by Sayak Mukherjee, et al.

This paper delves into designing stabilizing feedback control gains for continuous linear systems with unknown state matrix, in which the control is subject to a general structural constraint. We bring forth the ideas from reinforcement learning (RL) in conjunction with sufficient stability and performance guarantees in order to design these structured gains using the trajectory measurements of states and controls. We first formulate a model-based framework using dynamic programming (DP) to embed the structural constraint to the Linear Quadratic Regulator (LQR) gain computation in the continuous-time setting. Subsequently, we transform this LQR formulation into a policy iteration RL algorithm that can alleviate the requirement of known state matrix in conjunction with maintaining the feedback gain structure. Theoretical guarantees are provided for stability and convergence of the structured RL (SRL) algorithm. The introduced RL framework is general and can be applied to any control structure. A special control structure enabled by this RL framework is distributed learning control which is necessary for many large-scale cyber-physical systems. As such, we validate our theoretical results with numerical simulations on a multi-agent networked linear time-invariant (LTI) dynamic system.




1 Introduction

Researchers over the last decades have been intrigued by the problem of finding distributed control solutions for interconnected cyber-physical systems, in which the controller of each subsystem receives measurement signals from some neighboring subsystems, instead of the full system state, in order to arrive at a decision. Moreover, in most practical systems, only a subset of the system state is available for feedback control. These are typical examples of structured feedback control, in which a predefined structure is imposed on the feedback control mechanism for practical implementation. Many research works have been geared toward structured control solutions for different structures and different system models. Optimal control laws that can capture a decentralized [1] or a more generic distributed [3] structure have been investigated under notions of quadratic invariance (QI) [21, 14], structured linear matrix inequalities (LMIs) [13, 16], sparsity-promoting optimal control [4], etc. [6] discusses structured feedback gain computation for discrete-time systems with known dynamics.

It is worth noting that, for practical dynamic systems, the dynamic model and its parameters may not always be known accurately (e.g., the US Eastern Interconnection transmission grid model). Several non-idealities, such as unmodeled dynamics from coupled processes and parameter drift over time, can make model-based control computations insufficient. Unfortunately, most of the aforementioned techniques for structured control system design assume that the designer has explicit knowledge of the dynamic system.

In recent times, much attention has been given to model-free decision making for dynamical systems by marrying ideas from machine learning with classical control theory, resulting in a flourishing of the area of reinforcement learning. In the RL framework, the control agent tries to optimize its actions based on interactions with the environment or the dynamic system, quantified in terms of the rewards those interactions produce. These techniques were traditionally introduced in the context of sequential decision making using Markov decision processes (MDPs) [26, 2, 20], and have since been a driving force in the development of many advanced RL algorithms with applications to games, robotics, etc.

Although many sophisticated machine learning-based control algorithms have been developed to achieve certain tasks, these algorithms often suffer from a lack of stability and optimality guarantees due to the multiple heuristics involved in their designs. Recent works such as [25, 9] have brought together the best of the two worlds for the control of dynamical systems: the capability to learn without a model, from machine learning, and the capability to make decisions with rigorous stability guarantees, from automatic control theory. Essentially, these works leveraged the conceptual equivalence of RL/ADP algorithms with the classical optimal [15] and adaptive control [7] of dynamic systems. References [25, 9, 10, 24, 11, 18, 19] cover many such works for systems with partially or completely model-free designs.

However, the area of structure-based RL/ADP design for dynamic systems remains relatively unexplored. In [19, 17], a projection-based reduced-dimensional RL variant has been proposed for singularly perturbed systems, along with a block-decentralized design for two-time-scale networks; [8] presents a decentralized design for a class of interconnected systems; and [5] presents a structured design for discrete-time time-varying systems. Along this line of research, this paper presents a structured optimal feedback control methodology using reinforcement learning, without knowing the state dynamic model of the system.

We first formulate a model-based constrained optimal control criterion using the methodologies and guarantees from dynamic programming. Subsequently, we formulate a model-free RL gain computation algorithm that can encode a general structural constraint on the optimal control. This structured learning algorithm, SRL, provides stability and convergence guarantees, along with sub-optimality bounds, for controllers with the specified structure. We substantiate our design on a 6-agent network system. The paper is organized as follows. We discuss the problem formulation and the required assumptions in Section 2. In Section 3, we discuss the development of the structured RL algorithm incorporating the structural constraint. A numerical simulation example is given in Section 4, and concluding remarks are given in Section 5.

2 Model and Problem Formulation

We consider a linear time-invariant (LTI) continuous-time dynamic system:


where are the states and control inputs. We, hereby, make the following assumption.
Assumption 1: The dynamic state matrix is unknown.

With this unknown state matrix, we would like to learn an optimal feedback gain . However, instead of an unrestricted control gain , we impose some structure on the gain. We would like to have , which we call the structural constraint, where is the set of all structured controllers such that:


Here is the matricial function that can capture any structure in the control gain matrix. It can encode which elements in the matrix will be non-zero; for example, for the multi-agent system shown in Fig. 1, the control for the agent can be constrained in the form:


denotes the first row of . Similarly, the feedback communication requirement on all other agents can be encoded. Here all such feedback gains with the specified structure will be captured in the set . Therefore, we make the following assumption on the control gain structure.
Assumption 2: The communication structure required to implement the feedback control is known to the designer, and it is sparse in nature.

This assumption means that the structure is known. It captures limitations in the feedback communication infrastructure, and can also sometimes represent a much less dense infrastructure that minimizes the cost of deploying the network. For many networked physical systems, the communication infrastructure may already exist; for example, in some peer-to-peer architectures, agents can only receive feedback from their neighbors. Another very commonly designed control structure is of block-decentralized nature, where local states are used to perform feedback control. Therefore, our general constraint set will encompass all such scenarios. We also make the standard stabilizability assumption.

Figure 1: An example of structured feedback for agent

Assumption 3: The pair is stabilizable and is observable.

We can now state the model-free learning control problem as follows.
P. For the system (1) satisfying Assumptions , learn structured control policies , where described in (2), such that the closed-loop system is stable and the following objective is minimized


3 Structured Reinforcement Learning (SRL)

To develop the learning control design, we take the following route. We first formulate a model-based solution of the optimal control problem via a modified Algebraic Riccati Equation (ARE) in continuous time that can solve the structured optimal control when all the state matrices are known. Thereafter, we formulate a learning algorithm that does not require knowledge of the state matrix by embedding the model-based framework, with its guarantees, into RL, which then automatically ensures the convergence and stability of the learning solution. We first present the following central result of this paper.
Theorem 1: For any matrix , let be the solution of the following modified Riccati equation


Then, the control gain


will ensure closed-loop stability, i.e., .
Proof: We look into the optimal control solution of the dynamic system (1) with the objective (2) using dynamic programming (DP) to ensure theoretical guarantees. We assume that at time , the state is at . We define the finite-time optimal value function with the unconstrained control as:


with . Starting from state , the optimal gives the minimum LQR cost-to-go. Now, as the value function is quadratic, we can write it in a quadratic form as . For a small time interval , where is small, we assume that the control is constant at and is optimal. Then the cost incurred over the interval is


Also, the control input evolves the states at time to,


Then, the minimum cost-to-go from is:


Expanding as we have


Here, we neglect higher-order terms. Therefore, the total cost is


If the control is optimal then the total cost must be minimized. Minimizing over we have,


Now, this gives us an optimal gain which solves the unconstrained LQR. However, we are not interested in the unconstrained optimal gain, as that cannot impose any structure as such. In order to impose structure in the feedback gains, the feedback control will have to deviate from the optimal solution of , and following [6], we introduce another matrix such that,


The matrix will help us to impose the structure, i.e., , which we will discuss later. Therefore, the structured implemented control is given by,


We have , where is the set of all control inputs when following . Now, with slight abuse of notation, we denote the matrix to be the solution corresponding to the structured optimal control. The Hamilton-Jacobi equation with the structured control is given by,


Substituting (19), neglecting higher-order terms, and simplifying, we get


For the steady-state solution, we have


This proves the modified Riccati equation of the theorem. Now let us look into the stability of the closed-loop system with the gain . We can consider the Lyapunov function:


Therefore, the time derivative along the closed-loop trajectory is given as,


Now, as is positive definite, terms of the form are at least positive semi-definite. Therefore, we have


This ensures closed-loop stability. For the linear system, with the assumption that is observable, global asymptotic stability can be proved using LaSalle's invariance principle. This completes the proof.
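As a quick numerical sanity check of this kind of stability argument, the sketch below solves the standard (unconstrained) continuous-time ARE for an arbitrary system and verifies that the resulting LQR gain renders the closed loop Hurwitz, mirroring the Lyapunov reasoning above. It does not solve the paper's modified ARE; the system matrices are random illustrative data.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Arbitrary (generically stabilizable) system, purely for illustration
rng = np.random.default_rng(0)
n, m = 4, 2
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
Q = np.eye(n)
R = np.eye(m)

# Standard unconstrained ARE and the corresponding LQR gain K = R^{-1} B' P
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)

# Closed loop A - BK must be Hurwitz (all eigenvalues in the open left half-plane)
eigs = np.linalg.eigvals(A - B @ K)
assert np.all(eigs.real < 0)
```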

At this point, we look more closely at the matricial structure constraint. Let denote the indicator matrix for the structured matrix set , which contains the element wherever the corresponding element in is non-zero. For the example (3), we will have for the first row,


Therefore, the structural constraint is simply written as:


Here, denotes the element-wise (Hadamard) product, and is the complement of . We now state the following theorem on the choice of to impose structure on . This follows a form similar to the discrete-time condition of [6].
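To make the indicator-matrix encoding concrete, here is a minimal sketch: the structure holds exactly when the Hadamard product of the gain with the complement of the indicator matrix vanishes. The 2-input, 4-state sparsity pattern is hypothetical; the paper's own pattern comes from Fig. 1.

```python
import numpy as np

# Indicator matrix S: 1 wherever the gain K may be non-zero (hypothetical pattern)
S = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])
S_c = 1 - S  # complement of S

# A gain that respects the pattern
K = np.array([[0.5, -1.2, 0.0, 0.0],
              [0.0,  0.0, 2.0, 0.3]])

def is_structured(K, S_c):
    """Check the structural constraint: K (Hadamard) S_c = 0."""
    return np.allclose(K * S_c, 0.0)

assert is_structured(K, S_c)
assert not is_structured(np.ones_like(K), S_c)  # a dense gain violates it
```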

Theorem 2: Let be the solution of the modified ARE (5), where and . Then, the control gain designed as in Theorem 1 will satisfy the structural constraint .

Proof: We have,


This concludes the proof.

We note that the implicit assumption here is the existence of a solution of the modified ARE (5), where and . It remains an open question what necessary and sufficient conditions on the structure guarantee the existence of this solution. However, once a solution exists, we can compute it, and the associated control gain, iteratively using the following model-based algorithm.

Theorem 3: Let be such that is Hurwitz. Then, for
1. Solve for (Policy Evaluation) :


2. Update the control gain (Policy update):


Then is Hurwitz and and converge to structured , and as .

Proof: The theorem is motivated by Kleinman's algorithm [12], which has been used for unstructured feedback gain computation. Compared with Kleinman's algorithm, the control gain update step is now modified to impose the structure. With some mathematical manipulation it can be shown that, using , where , we have


Therefore, (37) is equivalent to (5) for the iteration. As we have shown stability and convergence via dynamic programming and Lyapunov analysis in Theorem 1, and considering the equivalence of Theorem 1 with this iterative version, the theorem follows.
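The model-based iteration of Theorem 3 can be sketched as below. Since the paper's M-matrix construction is not fully reproduced in the text, this sketch imposes the structure by masking the Kleinman policy-update step with the indicator matrix S (element-wise product), which coincides with the Hadamard condition of Theorem 2 when R is diagonal; the system data are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def structured_policy_iteration(A, B, Q, R, S, K0, tol=1e-9, max_iter=100):
    """Kleinman-style policy iteration with a structural mask in the update."""
    K = K0.copy()
    for _ in range(max_iter):
        Acl = A - B @ K
        # Policy evaluation: solve Acl' P + P Acl + Q + K' R K = 0
        P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
        # Policy update with structural masking (assumption: R diagonal)
        K_new = np.linalg.solve(R, B.T @ P) * S
        if np.linalg.norm(K_new - K) < tol:
            return P, K_new
        K = K_new
    return P, K

# Small hypothetical example with a diagonal (decentralized) structure
A = np.array([[0.0, 1.0], [1.0, 0.0]])   # unstable open loop
B = np.eye(2)
Q, R = np.eye(2), np.eye(2)
S = np.eye(2)                            # only diagonal gain entries allowed
K0 = 2.0 * np.eye(2)                     # stabilizing initial gain
P, K = structured_policy_iteration(A, B, Q, R, S, K0)

assert np.allclose(K * (1 - S), 0.0)                  # structure preserved
assert np.all(np.linalg.eigvals(A - B @ K).real < 0)  # closed loop Hurwitz
```

In this symmetric example the iteration converges to a diagonal gain of roughly 1.732 on each channel.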

Although we have formulated an iterative solution to compute structured feedback gains, the algorithm is still model-dependent. We now move toward a state-matrix-agnostic design using reinforcement learning. We write (1), incorporating , as


We explore the system by injecting a probing signal such that the system states do not become unbounded [10]. For example, following [10] one can design as a sum of sinusoids. Thereafter, we consider a quadratic Lyapunov function , and we can take the time-derivative along the state trajectories, and use Theorem to alleviate the dependency on the state matrix .
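A sum-of-sinusoids probing signal of the kind suggested in [10] can be sketched as follows; the amplitudes, frequencies, and phases below are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def probing_signal(t, amplitudes, freqs, phases):
    """u0(t) = sum_i a_i * sin(w_i * t + phi_i), evaluated element-wise on t."""
    return sum(a * np.sin(w * t + p)
               for a, w, p in zip(amplitudes, freqs, phases))

rng = np.random.default_rng(1)
amps = [0.5, 0.3, 0.2]
freqs = [1.0, 3.7, 9.2]                 # incommensurate frequencies aid excitation
phases = rng.uniform(0.0, 2 * np.pi, size=3)

t = np.linspace(0.0, 10.0, 1001)
u0 = probing_signal(t, amps, freqs, phases)

# The signal is bounded by the sum of amplitudes, so states stay bounded
assert np.max(np.abs(u0)) <= sum(amps) + 1e-9
```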


where, . Therefore, starting with an arbitrary control policy we have,


We thereafter solve (3) by formulating an iterative algorithm using measurements of the state and control trajectories. The design requires gathering data matrices over a sufficient number of time samples (discussed shortly), where


1. Gather sufficient data: Store data ( and ) for interval . Then construct the following data matrices such that rank(.

2. Controller update iteration: starting with a stabilizing , compute () iteratively using the following equation

A. Solve for and :


B. Compute using the feedback structure matrix.
C. Update the gain .
D. Terminate the loop when , where is a small threshold.

3. Applying K on the system : Finally, apply , and remove .

Algorithm 1 Structured Reinforcement Learning (SRL) Control

Algorithm 1 presents the steps to compute the structured feedback gain without knowing the state matrix .

Remark 1: If is Hurwitz, then the controller update iteration in (47) can be started without any stabilizing initial control. Otherwise, a stabilizing is required, as is commonly encountered in the RL literature [10]. This is mainly due to the equivalence with the modified Kleinman's algorithm in Theorem 3.

Remark 2: The rank condition dictates the amount of data that needs to be gathered. For this algorithm we need rank(, where is the number of non-zero elements in the structured feedback control matrices. This is based on the number of unknown variables in the least squares. The number of data samples can be taken to be twice this number to guarantee convergence.
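As a back-of-the-envelope sketch of this sample-count rule, assuming the least-squares unknowns are the n(n+1)/2 entries of the symmetric matrix plus the non-zero gain entries (the exact rank expression is not reproduced in the text above, so this accounting is an assumption):

```python
def min_samples(n_states, n_nonzero_K, safety_factor=2):
    """Rough minimum sample count: unknowns in P (symmetric) plus non-zero
    gain entries, doubled per the remark's rule of thumb for convergence."""
    rank_needed = n_states * (n_states + 1) // 2 + n_nonzero_K
    return safety_factor * rank_needed

# Hypothetical 6-state system with 10 allowed gain entries:
# 21 P-unknowns + 10 gain entries = 31, doubled -> 62 samples
assert min_samples(6, 10) == 62
```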

The next theorem describes the sub-optimality of the structured solution with respect to the unconstrained objective.
Theorem 4: The difference between the optimal structured control objective value and the optimal unstructured objective is bounded as:


for any control structure. Here , where is an operator defined as:


Proof: Let the unstructured solution of the ARE be denoted by ; then the unstructured objective value is , whereas the learned structured control results in the objective . Therefore, we have


Using , and as defined in the Theorem 4 statement, following [22, Theorem 3] with , we have,

As such, the difference between the optimal values and is bounded by


for any structure of the control. We note that and do not depend on the control structure. Therefore, the inequality (51) indicates that the difference between the optimal objective value with and the optimal unstructured objective value is linearly bounded by the Kronecker combination of the initial value, for any control structure.

4 Numerical Example

We consider a multi-agent network with agents following the interaction structure shown in Fig. 1. We consider each agent to follow consensus dynamics with its neighbors, such that:


where are the coupling coefficients. We consider the state and input matrix to be:


The dynamics given above are generally referred to as Laplacian dynamics, with resulting in a zero eigenvalue. We would like the controller to improve the damping of the eigenvalues closest to instability. The eigenvalues of the system are . We choose initial conditions . We consider two scenarios with structured gains: A. , and B. along with the sparsity pattern in scenario A, we also have . Note that we consider an arbitrary sparsity pattern. We assume that the states of all agents can be measured.
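The Laplacian structure and its zero eigenvalue can be checked numerically. The 6-agent ring topology below is hypothetical, standing in for the graph of Fig. 1, and the unit coupling coefficients are illustrative.

```python
import numpy as np

# Build a graph Laplacian L for a hypothetical 6-agent ring; consensus
# dynamics are then x_dot = A x (+ B u) with A = -L.
n = 6
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
L = np.zeros((n, n))
for i, j in edges:
    L[i, j] -= 1.0
    L[j, i] -= 1.0
    L[i, i] += 1.0
    L[j, j] += 1.0

A = -L
eigs = np.linalg.eigvals(A)

# A Laplacian always has a zero eigenvalue (row sums are zero),
# which is the marginally stable mode the controller must damp.
assert np.isclose(np.min(np.abs(eigs)), 0.0, atol=1e-9)
```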

We consider the design parameters . First, we describe scenario A. Here we have , and the number of non-zero elements of is ; therefore, we need to gather data for at least samples, which is data samples. We consider the time step to be s, and gather data for s. The iteration for and took around s on average on a MacBook (Catalina OS, 2.8 GHz Quad-Core Intel Core i7, 16 GB RAM). During exploration, we used a sum-of-sinusoids perturbation signal to persistently excite the network. Note that the majority of the learning time is spent on exploration of the system, because of the requirement of persistent excitation; the least-squares iteration takes an order of magnitude less time than the exploration. With faster processing units, the least-squares iteration can be made even faster. Fig. 2 shows the state trajectories of the agents during exploration, and also during the control implementation phase. The structured control gain learned in this scenario is:


The total cost comes out to be units. Figs. 3-4 show that the and iterations converge after around iterations. The structured solution also matches the model-based solution from Theorem 3 with high accuracy. When the learning is performed for the unstructured LQR control gain, the solution comes out to be:

Figure 2: Scenario A: State trajectories during exploration and control implementation
Figure 3: Scenario A: P convergence
Figure 4: Scenario A: K convergence

with an objective value of units.

We then perform a similar experiment for scenario B, where we have removed an additional set of communication links. Here, the number of non-zero elements of is . We need to gather at least data samples; therefore, we perform around s of exploration with a s time step. The structured control learned for this scenario is:


with an objective value of units. The state trajectories for scenario B during exploration and control implementation are given in Fig. 5. The convergence of the least-squares iterates for and is shown in Figs. 6-7, where we can see that convergence is reached after iterations. The damping of the eigenvalues is also improved by the control, with the closed-loop eigenvalues placed at .

Figure 5: Scenario B: State trajectories during exploration and control implementation
Figure 6: Scenario B: P convergence
Figure 7: Scenario B: K convergence

5 Conclusions and path forward

This paper discussed model-free, learning-based computation of stable and close-to-optimal feedback gains that are constrained to a specific yet general structure. We first formulated a model-based condition to compute the structured feedback control, which resulted in a modified algebraic Riccati equation (ARE). Techniques from dynamic programming and Lyapunov-based stability criteria ensured stability and convergence. Thereafter, we embedded the model-based condition into an RL algorithm, where trajectories of states and controls are used to compute the gains. The algorithm uses interleaved policy evaluation and policy improvement steps. We also analyzed the sub-optimality of the structured optimal control solution in comparison with the unstructured optimal solution. Simulations on a multi-agent network with constrained communication infrastructure validated the theoretical results. Our future work will investigate embedding a prescribed degree of stability margin along with the structural constraint, as well as necessary and sufficient conditions for the existence of the optimal structured control.


  • [1] L. Bakule (2008) Decentralized control: an overview. Annual reviews in control 32, pp. 87–98. Cited by: §1.
  • [2] D. P. Bertsekas (2012) Dynamic programming and optimal control: approximate dynamic programming, 4th ed. Cited by: §1.
  • [3] F. Deroo (2016) Control of interconnected systems with distributed model knowledge. PhD thesis, TU Munchen, Germany. Cited by: §1.
  • [4] M. Fardad, F. Lin, and M. R. Jovanović (2011) Sparsity-promoting optimal control for a class of distributed systems. In Proceedings of the 2011 American Control Conference, Vol. , pp. 2050–2055. Cited by: §1.
  • [5] L. Furieri, Y. Zheng, and M. Kamgarpour (2020) Learning the globally optimal distributed LQ regulator. arXiv:1912.08774v3. Cited by: §1.
  • [6] J. C. Geromel (1984) Structural constrained controllers for linear discrete dynamic systems. IFAC Proceedings Volumes 17 (2), pp. 435 – 440. Note: 9th IFAC World Congress Cited by: §1, §3, §3.
  • [7] P. Ioannou and B. Fidan (2006) Adaptive control tutorial. Cited by: §1.
  • [8] Y. Jiang and Z. Jiang (2012-10) Robust adaptive dynamic programming for large-scale systems with an application to multimachine power systems. IEEE Transactions on Circuits and Systems II: Express Briefs 59 (10), pp. 693–697. External Links: Document, ISSN 1549-7747 Cited by: §1.
  • [9] Y. Jiang and Z.-P. Jiang (2012) Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 48, pp. 2699–2704. Cited by: §1.
  • [10] Y. Jiang and Z.-P. Jiang (2017) Robust adaptive dynamic programming. Cited by: §1, §3, §3.
  • [11] B. Kiumarsi, K. G. Vamvoudakis, H. Modares, and F. L. Lewis (2018) Optimal and autonomous control using reinforcement learning: a survey. IEEE Trans. on Neural Networks and Learning Systems. Cited by: §1.
  • [12] D. Kleinman (1968) On an iterative technique for riccati equation computations. IEEE Trans. on Automatic Control 13 (1), pp. 114–115. Cited by: §3.
  • [13] C. Langbort, R. S. Chandra, and R. D’Andrea (2004) Distributed control design for systems interconnected over an arbitrary graph. IEEE Transactions on Automatic Control 49 (9), pp. 1502–1519. Cited by: §1.
  • [14] L. Lessard and S. Lall (2011) Quadratic invariance is necessary and sufficient for convexity. In Proceedings of the 2011 American Control Conference, Vol. , pp. 5360–5362. Cited by: §1.
  • [15] F. L. Lewis (1995) Optimal control. Cited by: §1.
  • [16] P. Massioni and M. Verhaegen (2009) Distributed control for identical dynamically coupled systems: a decomposition approach. IEEE Transactions on Automatic Control 54 (1), pp. 124–135. Cited by: §1.
  • [17] S. Mukherjee, H. Bai, and A. Chakrabortty Block-decentralized model-free reinforcement learning control of two time-scale networks. In American Control Conference 2019, Philadelphia, USA. Cited by: §1.
  • [18] S. Mukherjee, H. Bai, and A. Chakrabortty On model-free reinforcement learning of reduced-order optimal control for singularly perturbed systems. In IEEE Conference on Decision and Control 2018, Miami, FL, USA. Cited by: §1.
  • [19] S. Mukherjee, H. Bai, and A. Chakrabortty (2020) Reduced-dimensional reinforcement learning control using singular perturbation approximations. arXiv 2004.14501. Cited by: §1, §1.
  • [20] W.B. Powell (2007) Approximate dynamic programming. Cited by: §1.
  • [21] M. Rotkowitz and S. Lall (2006) A characterization of convex problems in decentralized control. IEEE Transactions on Automatic Control 51 (2), pp. 274–286. Cited by: §1.
  • [22] J. Sun (1998) Perturbation theory for algebraic riccati equations. SIAM Journal on Matrix Analysis and Applications 19, pp. 39–65. Cited by: §3.
  • [23] R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. Cited by: §1.
  • [24] K.G. Vamvoudakis (2017) Q-learning for continuous-time linear systems: a model-free infinite horizon optimal control approach. Systems and Control Letters 100, pp. 14–20. Cited by: §1.
  • [25] D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F.L. Lewis (2009) Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 45, pp. 477–484. Cited by: §1.
  • [26] C. Watkins (1989) Learning from delayed rewards. PhD thesis, King's College, Cambridge. Cited by: §1.