1 Introduction
System identification is a problem of great interest in many fields such as econometrics, robotics, aeronautics, mechanical engineering or reinforcement learning
[12, 17, 5, 6, 16]. The task consists in estimating the parameters of an unknown system by sampling trajectories from it as fast as possible. To this end, inputs must be chosen so as to yield maximally informative trajectories. We focus on linear timeinvariant (LTI) systems. Let
and be two matrices; we consider the following discretetime dynamics:(1)  
where is the state,
is a normally distributed isotropic noise with known variance
and the control variables are chosen by the controller with the following power constraint:(2) 
The system parameters () are unknown initially and are to be estimated from observed trajectories . The goal of system identification is to choose the inputs so as to drive the system to the most informative states for the estimation of . It may happen that the controller knows , in which case and .
System identification is a primary field in control theory. It has been widely studied in the field of optimal design of experiments [2, 11]. For LTI dynamic systems, classical optimal design approaches provided results for singleinput singleoutput (SISO) systems [5, 10]
or multiinput multioutput (MIMO) systems in the frequency domain or with randomized timedomain inputs
[15]. More recently, system identification received considerable attention in the machine learning community, with the aim of obtaining finitetime bounds on the estimation error for
[8, 9, 20]. In [24] and [25], the inputs are optimized in the frequency domain to maximize an optimal design objective, with theoretical estimation rate guarantees. In our approach, we directly optimize deterministic inputs in the time domain for MIMO LTI systems. An important aspect of system identification is the quantity of computational resource and the number of observations needed to reach a certain performance. We study the computational complexity of our algorithms and compare their performance against each other and against an oracle, both on average and on reallife dynamic systems.1.1 Notations
In the rest of this work, we note the unknown parameter underlying the dynamics. We suppose that the pair is controllable: the matrix has rank . Adopting the notations of [25], we define a policy as a mapping from the past trajectory to future input. The set of policies meeting the power constraint (2) is noted . We note a trajectory, and we extend this notation to when the trajectory is obtained using a policy up to time . We denote by the average for a dynamical system given by (1) (where the randomness comes from the noise and possibly from the policy inducing the control ).
1.2 Adaptive identification
Fix an estimator , yielding an estimate of the parameters from a given trajectory. Our objective is to play the inputs of a policy so that the resulting trajectory gives a good estimation for . We measure this performance by the mean squared error:
(3) 
Of course, this quantity depends on the true parameter of the system which is unknown. A natural way of proceeding is to estimate sequentially, as follows.
Definition 1 (Adaptive system identification).
Given an estimate of , the policy for the next sequence of inputs can be chosen so as to minimize a cost function approximating the MSE (3), using as an approximation of . Then, these inputs are played and is reestimated with the resulting trajectory, and so on. We call planning the process of minimizing .
This approach is summarized in Algorithm 1, which takes as inputs a first guess for the parameters to estimate and a policy , the problem parameters and , a schedule , a cost functional and an estimator .
An adaptive identification algorithm is hence determined by a triplet . A natural estimator is the least squares estimator which we define in Section 2.1. In the rest of this work, we set .
Example 1 (Random policy).
A naive strategy for system identification consists in playing random inputs with maximal energy at each time step. This corresponds to the choice and returning .
Example 2 (Taskoptimal pure exploration).
In [25], the authors propose the following cost function
(4) 
where is defined in equation (8) below. As we will see in Section 2.2, this corresponds to Aoptimal experimental design. The authors show that this cost function approximates the MSE in the long time limit at an optimal rate when . In their identification algorithm, they set
for some initial epoch
.Example 3 (Oracle).
An oracle is a controller who is assumed to choose their policy with the knowledge of the true parameter . It can hence perform one single, offline optimization of over . By definition, the inputs played by the oracle are the optimal inputs for our problem of mean squared error system identification.
1.3 Contributions
In practice, systems have complex dynamics and can only be approximated locally by linear systems. We still believe that in order to understand complex systems, we need to understand identification of linear systems as on short time scales, we can approximate the complex system with a linear one. In order to be practical, our identification algorithm needs to interact as little as possible with the true system and to take decisions as fast as possible. With previous notations, we are interested in cases where is small (to ensure that in practice the dynamics remains timeinvariant and linear) and where the estimation and planning steps need to be very fast in order to run the algorithm online.
In this work, we explore a setting for linear system identification with hard constraints on the number of interactions with the real system and on the computing resources used for planning and estimation. To the best of our knowledge, finitetime system identification guarantees are only available in the large limit which makes the hypothesis of linear dynamic quite unlikely. Using a framework based on experimental design, we propose a greedy online algorithm requiring minimal computing resources. The resulting policy gives a control that maximizes the amount of information collected at the next step. We show empirically that for short interactions with the system, this simple approach can actually outperforms more sophisticated gradientbased methods. We also propose a method to compute an oracle optimal control, against which we can compare the different identification algorithms.
1.4 Related work
System identification has been studied extensively in the last decades [3, 12]. The question of choosing the maximally informative input can be tackled in the framework of classical experimental design [2, 18]. Several methods have been proposed for the particular case of dynamic systems [14, 10, 15] A comprehensive study can be found in [5], with a focus on SISO systems.
In the machine learning community, the last few years have seen an increasing interest in finitetime system identification [20, 9, 19, 22]
. These works typically derive theoretical error rates for linear dynamic system estimation and produce high probability bounds guaranteeing that the estimation is smaller than
with probability greater than after a certain number of samples. The question of designing optimal inputs is tackled in [24, 25]. The authors derive an asymptotically optimal algorithm by computing the control in the frequency domain. In [13], an approach to control partially nonlinear systems is proposed.2 Background
It is convenient to describe the structure of the state as a function of the inputs and the noise. By integrating the dynamics (1), we obtain the following result.
Proposition 1.
The state can be expressed as with
(5) 
Note that that solves the deterministic dynamics and has zero mean and is independent of the control. The two terms and depend linearly on the and the respectively.
The datagenerating distribution knowing the parameter
can be computed using the probability chain rule with the dynamics (
1):(6) 
We define the loglikelihood (up to a constant):
(7)  
where we have noted and the observations and the covariates associated to the parameter . If , then , . If , then and . We also note the input matrix and
the state matrix. We define the moment matrix
and the Gramians of the system at time :(8) 
and . Note that
2.1 Ordinary least squares
Given a trajectory, a natural estimator for the matrix is the least squares estimator. The theory of least squares provides us with a formula for the mean squared error with respect to the ground truth, which can be used as a measure of the quality of a control.
Proposition 2 (Ordinary least squares estimator).
Given inputs and noise
, the ordinary least squares (OLS) estimator associated to the resulting trajectory
is(9) 
and its difference to is given by
(10)  
where denotes the pseudoinverse of . Noting the least squares estimator obtained from the trajectory up to time , we recall the recursive update formula
(11) 
Proof.
The least squares estimator minimizes the quadratic loss , which writes
(12) 
with the th column of and the th row of . The terms of the sum can be minimized independently, with each minimizing the least squares of the vectorial relation . The solution for is equal to (see e.g. [1]). By concatenating the columns, we obtain that , which proves (9). Substituting yields (10). Note here that the controllability assumption on ensures that can be made full rank, and hence that the moment matrix is invertible. ∎
Definition 2 (OLS mean squared error).
For a given trajectory generated with a matrix and noise , the Euclidean mean squared error (MSE) is
(13)  
If the noise and the covariates were independent, then the expected error would reduce to the Aoptimal design objective . It is not the case in our framework since is generated with .
2.2 Classical optimal design
The correlation between and makes the derivation of a tractable expression for the expectation of (13) complicated. In this section, we show how a more tractable objective can be computed by applying theory of optimal experimental design [2, 21]. In the classical theory of optimal design, the informativeness of an experiment is measured by the size of the expected Fisher information.
Definition 3 (Fisher information matrix).
Let denote the loglikelihood of the datagenerating distribution knowing the parameter . The Fisher information matrix is defined as
(14) 
Proposition 3.
Proof.
The loglikelihood (7) can be separated into a sum over the as in (12). The quadratic term in is and the other terms are constant or linear. Differentiating twice and taking the expectation gives , which yields the desired result after dividing by . Following the decomposition (5), Taking the expectation, we obtain . Summing over yields the result. Note that the first term is deterministic and depends on the control whereas the second term depends on the noise and not on the control. Therefore, the expected moment matrix is the sum of a noise term and of a deterministic control term. ∎
Definition 4.
In classical optimal design, the size of the information matrix is measured by some criterion
, which is a functional of its eigenvalues
. The quantity represents the amount of information brought by the experiment and should be maximized.Example 4.
Some of the usual criteria are presented in Table 1.
The criteria are required to have properties such as homogeneity, monotonicity and concavity in the sense of the Loewner ordering, which can be interpreted in terms of information theory: monotonicity means that a larger information matrix brings a greater amount of information, concavity means that information cannot be increased by interpolation between experiments. We refer to
[18] for more details.Optimality  

Aoptimality  
Doptimality  
Eoptimality 
The theory of classical optimal design leads to the definition of the following optimal design informativeness functional.
Definition 5 (Optimal design functional).
Let denote an optimal design criterion. Then the associated cost is defined as
(17)  
where the depend on the inputs through (5).
2.3 Small noise regime
The optimal design functional (17) can be related to the MSE in the small noise regime .
Proof.
We introduce the rescaled variables and which are of order . Extending the notations of equation (5), , where the first term is of order and the second is of order . Therefore, , or equivalently . By Proposition 7, is differentiable at so . Taking the squared norm and using CauchySchwartz inequality, we obtain
(19) 
Furthermore,
(20)  
Gathering (19) and (20), we obtain
(21) 
∎
Remark 2.
In classical least squares regression, the covariates are independent of the noise . As a consequence, the minimziation of the mean squared estimation error leads to the classical Aoptimality criterion. This does not hold in general in our framework because the signal and the noise are coupled by the dynamics (1). However, Proposition 18 shows that this criterion does hold in the small noise regime at first order in . Indeed, when the contribution of the noise to the signal is negligible because the deterministic part of the signal is of order .
Remark 3.
From Proposition 4 and the definition of Aoptimality, we see that the MSE approximately scales like when the number of observations increases. This is confirmed by experiments.
3 Online greedy identification
3.1 Onestepahead objective
A simple, natural approach for system identification consists in choosing a decision sequentially at each time step. At each time , the control is chosen with energy so as to maximize a onestepahead objective. Then, a new observation is collected and the process repeats. Following Section 2.2, can be chosen to maximize the value of at . This corresponds to the choice of functional and to the onestep schedule .
Upon choosing , the policy should select so as to maximize the design criterion applied on the onestep ahead, dependent information matrix, the past trajectory being fixed. The onestepahead information matrix is , with when is estimated (because then then next dependent covariate is ) and if is known, because then the next dependent covariate is . Therefore, onestep ahead planning yields the following optimization problem:
(22)  
such that 
with
(23) 
and
(24) 
Remark 4.
With this greedy policy, the energy constraint imposed for one input ensures that the global power constraint (2) is met.
The corresponding identification process is detailed in Algorithm 2. We will see in Section 3.2 that problem (22) can be solved accurately and at a cheap cost. Moreover, Algorithm 2 offers the advantage of improving the knowledge of at each time step using all the available information on the parameter to plan at each time step. This way, the bias affecting planning due to the uncertainty about is minimized. When planning is performed over larger time sequences, a large bias could impair the identification of the system.
3.2 Solving the onestep optimal design problem
We show that the onestep ahead planning for online system identification is equivalent to a convex quadratic program which can be solved efficiently.
Proposition 5.
For Doptimality and Aoptimality, there exists a symmetric matrix and the problem (22) is equivalent to
(25)  
such that 
Proof.
From Proposition 8, we find that
(26)  
Similarly, from Corollary 1 ,
(27)  
Maximizing these quantities with respect to amounts to maximizing . The matrix is symmetric because the and the are symmetric, and so are its diagonal submatrices. Given the affine dependence of in and the (possible) block structure of and , is of the form , up to a constant. We provide an explicit formula for and in the case where in Remark 5 ∎
We now characterize the minimizers of Problem (25). If a minimizer can be found in the interior of the constraining sphere, then is positive semidefinite and the problem can be tackled using unconstrainted optimization. We thus consider the equality constrained problem
(28)  
such that 
Proposition 6.
Note the eigenvalues of , and and the coordinates of and in a corresponding orthonormal basis. Then a minimizer satisfies the following equations for some nonzero scalar :
(29) 
Proof.
By the Lagrange multiplier theorem there exists a nonzero scalar such that , where can be scaled such that is nonsingular. Inverting the optimal condition and expanding the equality constraint gives the two conditions. ∎
Problem (25) can hence be solved at the cost of a scalar rootfinding and an eigenvalue decomposition. In [7], bounds are provided so as to initialize the rootfinding method efficiently.
Remark 5.
In the case where (i.e. ), and have the following expressions:
(30) 
4 Gradientbased identification
In this section, we propose a gradientbased approach to planning. In a sequential identification scheme of Algorithm 1, the cost functions (3) and (17) can be optimized by projected gradient descent. This builds on the following remark.
Remark 6 (Differentiability of the functionals).
The functionals (3) and (17) are differentiable functions of the output. Indeed, is an affine function of the inputs as shown in Proposition 1, and the controllability of guarantees that is positive definite. Furthermore, the operations of pseudoinverse (see Proposition 7) and the optimal design criteria of Table 1 are differentiable over the set of positive definite matrices.
The gradients with respect to can either be derived analytically (see [5], section 6 for the derivation of an adjoint equation) or automatically in an automatic differentiation framework. We rescale at each step to ensure the power constraint is met. The are chosen arbitrarily. The computational complexity of the algorithm is linear in
: each gradient step backpropagates through the planning time interval.
4.1 Gradientbased optimal design
We propose a gradientbased method to optimize by performing gradient descent directly on in functional (17). Note that we optimize the inputs directly in the time domain, whereas other approaches such as [25] perform optimization in the frequency domain by restricting the control to periodic inputs.
4.2 Gradient through the oracle MSE
Given the true parameters , the optimal control for the MSE minimizes the MSE cost (3), as explained Example 3. However, the dependency between and makes this functional complicated to evaluate and to minimize with respect to the inputs, even when the true parameters are known. We propose a numerical method to minimize (3) using automatic differentiation an MonteCarlo sampling. Given one realization of the noise and inputs , the gradient of the squared error (13) can be computed automatically in an automatic differentiation framework. Then, one can sample a batch of noise matrices and approximate the gradient of (3) by
(31) 
Although we do not have convergence guarantees due to the lack of structure of the objective function, the gradient descent does converge in practice, to a control that outperforms the adaptive controls.
5 Performance study
5.1 Complexity analysis
Definition 6 (Performance).
Let denote the estimation produced by the learning algorithm at the end of identification. The performance of the policy is measured by the average error over the experiments on the true system: . We study the performance of our algorithms as a function of the number of observations and the computational cost. We also introduce the computational rate .
Algorithm 2 and the gradient identification algorithm have linear time complexity. Hence, we define and for a given number of gradient iterations. In practice, we find that , where is the computational rate needed for the gradient descent to converge. As pointed out in Remark 3, the squared error essentially scales like . This is verified experimentally. Given the previous observations, we postulate that the performance of our algorithms takes the form
(32) 
We build an experimental diagram where we plot the average estimation error for as a function of the two types of resource and for the gradient algorithm. Increasing allows for more gradient steps. We run trials with random matrices of size , with . We set , , . The gradient algorithm optimizes the Aoptimality functional (17) with a batch size of and . The obtained performances are to be compared with those of the greedy algorithm (with the Aoptimality cost function), which has a fixed, small computational rate . Our diagrams are plotted on Fig. 1.
Our diagrams show that the greedy algorithm is preferable in a phase of low computational rate: , as suggested by (32). The phase separation corresponds to a relatively high number of gradient steps. Indeed, the isoperformance along this line are almost vertical, meaning that the gradient descent has almost converged. Furthermore, the maximum performance gain of the gradient algorithm relatively to the greedy algorithm is of 10%.
5.2 Average estimation error
We now test the performances of our algorithms on random matrices, with the same settings as in the previous experiment. For the gradient algorithm, the minimal number of gradient iterations to reach maximum performance for was found to be . For each matrix , we also compute an oracle optimal control using Algorithm 3 with a batch size of , and run a random input baseline (see Example 1), and the TOPLE algorithm of [25].
Both the gradient algorithm and the greedy algorithm closely approach the oracle. The former performs slightly better than the latter in average. However, the computational cost of the gradient algorithm is far larger, as Table 2 shows. Indeed, the number of gradient steps to reach convergence in this setting is found to be of order . Note that the number of subgradient steps for the TOPLE algorithm is found to be , and so .
Random  TOPLE[24]  Gradient  Greedy  

5.3 Identification of an aircraft system
We now study a more realistic setting from the field of aeronautics: we apply system identification to an aircraft system. We use the numerical values issued in a report from the NASA [6]. The lateral motion of a Lockheed Jet star is described by the slideslip and roll angles and the roll and yaw rates . The control variables are the aileron and rudder angles . The linear dynamics for an aircraft flying at 573.7 meters/sec at 6.096 meters are given by the following matrix, obtained after discretization and normalization of the continuoustime system [6]:
(33) 
(34) 
and , deg.
Random  TOPLE [24]  MSE gradient  Greedy  

Error  
Time  1  55.7  25  1.13 
We apply our algorithms on this LTI system. Our results are summarized in Table 3.
As we can see, the greedy algorithm outperforms the gradientbased algorithms, both in performance and in computational cost. This could be explained by the fact that the signaltonoise ratio in this system is of order 1, hence the estimation bias in planning is large and it is more effective to plan onestepahead than to do planning over large epochs. We obtain similar results for the longitudinal system of a C8 Buffalo aircraft
[6].6 Conclusion
In this work, we explore a setting for linear system identification with hard constraints on the number of interactions with the real system and on the computing resources used for planning and estimation. We introduce a greedy online algorithm requiring minimal computing resources and show empirically that for small values of interactions with the system, it can actually outperform more sophisticated gradientbased methods. Extension of this approach to optimal control for the LQR is an interesting direction of future research.
7 Matrix calculus
Proposition 7.
On a domain where has linearly independent columns, is a differentiable function of and
(35) 
Proof.
See [4]. ∎
Lemma 1.
Let and . Then
(36) 
Proposition 8.
Let be a nonsingular matrix and . Then
(37) 
Proof.
Proof.
See [23]. ∎
Proposition 9.
Let be positive definite matrices of , and . Then
(40) 
Proof.
Proposition 8 admits the following generalization.
Proposition 10.
Let be a nonsingular matrix and let . Then
(43)  
Proof.
See [23]. ∎
Proposition 11.
Let be a nonsingular matrix and . Then is nonsingular and
(44) 
Corollary 1.
Let be a nonsingular matrix and . Then
(45) 
References

[1]
(2018)
Introduction to applied linear algebra: vectors, matrices, and least squares
. Cambridge university press. Cited by: §2.1.  [2] (1972) Theory of optimal experiments. Cellular Neurobiology, Academic Press. External Links: ISBN 9780122507502, LCCN 76182610, Link Cited by: §1.4, §1, §2.2.
 [3] (2011) Optimal Experiment Design for Open and Closedloop System Identification. Communications in Information and Systems (CIS) 11 (3), pp. 197–224. Note: Special Issue Dedicated to Brian Anderson on the Occasion of His 70th Birthday: Part II External Links: Link Cited by: §1.4.
 [4] (1973) The differentiation of pseudoinverses and nonlinear least squares problems whose variables separate. SIAM Journal on numerical analysis 10 (2), pp. 413–432. Cited by: §7.
 [5] (1977) Dynamic system identification: experiment design and data analysis. Developmental Psychology Series, Academic Press. External Links: ISBN 9780122897504, LCCN lc76050396, Link Cited by: §1.4, §1, §1, §4.
 [6] (1976) Application of optimal input synthesis to aircraft parameter identification. Cited by: §1, §5.3, §5.3.
 [7] (2001) Minimizing a quadratic over a sphere. SIAM Journal on Optimization 12 (1), pp. 188–208. Cited by: §3.2.
 [8] (2020) Finitetime identification of stable linear systems optimality of the leastsquares estimator. In 2020 59th IEEE Conference on Decision and Control (CDC), pp. 996–1001. Cited by: §1.
 [9] (2020) Finitetime identification of stable linear systems: optimality of the leastsquares estimator. External Links: 2003.07937 Cited by: §1.4, §1.
 [10] (1975) "Design of experiments" for the identification of linear dynamic systems. Technometrics 17 (3), pp. 303–308. External Links: ISSN 00401706, Link Cited by: §1.4, §1.
 [11] (2001) Identification for control: adaptive input design using convex optimization. In Proceedings of the 40th IEEE Conference on Decision and Control (Cat. No.01CH37228), Vol. 5, pp. 4326–4331 vol.5. External Links: Document Cited by: §1.
 [12] (1998) System identification. In Signal analysis and prediction, pp. 163–173. Cited by: §1.4, §1.
 [13] (2020) Active learning for nonlinear system identification with guarantees. External Links: 2006.10277 Cited by: §1.4.
 [14] (1974) Optimal inputs for linear system identification. IEEE Transactions on Automatic Control 19 (3), pp. 192–200. External Links: Document Cited by: §1.4.
 [15] (1976) Synthesis of optimal inputs for multiinputmultioutput (mimo) systems with process noise part i: frequenc ydomain synthesis part ii: timedomain synthesis. In System Identification Advances and Case Studies, R. K. Mehra and D. G. Lainiotis (Eds.), Mathematics in Science and Engineering, Vol. 126, pp. 211–249. External Links: Document, ISSN 00765392, Link Cited by: §1.4, §1.
 [16] (2021) Modelbased reinforcement learning: a survey. External Links: 2006.16712 Cited by: §1.
 [17] (1992) System identification: torsten söderström and petre stoica. Automatica 28 (5), pp. 1069–1071. External Links: ISSN 00051098, Document, Link Cited by: §1.
 [18] (2006) Optimal design of experiments. SIAM. Cited by: §1.4, §2.2.
 [19] (2019) Near optimal finite time identification of arbitrary linear dynamical systems. External Links: 1812.01251 Cited by: §1.4.
 [20] (2018) Learning without mixing: towards a sharp analysis of linear system identification. External Links: 1802.08334 Cited by: §1.4, §1.
 [21] (1984) Experimental design: review and comment. Technometrics 26 (2), pp. 71–97. Cited by: §2.2.
 [22] (2019) Finite sample analysis of stochastic system identification. External Links: 1903.09122 Cited by: §1.4.
 [23] (2016) A note on the matrix determinant lemma. International Journal of Pure and Applied Mathematics 111 (4), pp. 643–646. Cited by: §7, §7.
 [24] (2020) Active learning for identification of linear dynamical systems. External Links: 2002.00495 Cited by: §1.4, §1, Table 2, Table 3.
 [25] (2021) Taskoptimal exploration in linear dynamical systems. External Links: 2102.05214 Cited by: §1.1, §1.4, §1, §4.1, §5.2, Example 2.