Many modern control design techniques rely on the existence of a fairly accurate state-space model of the plant to be controlled. Although in some cases a model can be obtained from first principles, there are many situations in which a model should be learned from input/output data. Classical results in system identification provide asymptotic convergence guarantees for learning models from data [19, 29]. However, finite sample complexity properties have been rarely discussed in system identification literature ; and earlier results are conservative .
There is recent interest from the machine learning community in data-driven control and non-asymptotic analysis. Putting aside the reinforcement learning literature and restricting our attention to linear state-space models, the work in this area can be divided into two categories: (i) directly learning the control inputs to optimize a control objective or analyzing the predictive power of the learned representation[6, 14, 8], (ii) learning the parameters of the system model from limited data [21, 13, 4, 3, 25, 2]. For the former problem, the focus has been on exploration/exploitation type formulations and regret analysis. Since the goal is to learn how to control the system to achieve a specific task, the system is not necessarily fully learned. On the other hand, the latter problem aims to learn a general purpose model that can be used in different control tasks, for instance, by combining it with robust control techniques [4, 27, 3]. The focus for the latter work has been to analyze data–accuracy trade-offs.
In this paper we focus on learning a realization for an LTI system from a single input/output trajectory. This setting is significantly more challenging than earlier studies that assume that (multiple independent) state trajectories are available [4, 25]. One of our main contributions is to derive sample complexity results in learning the Markov parameters, to be precisely defined later, of the system using a least squares algorithm . Markov parameters play a central role in system identification  and they can also be directly used in control design when the system model itself is not available [26, 12, 24]. In Section 4, we show that using few Markov parameter estimates and leveraging stability assumption, one can approximate system’s Hankel operator with near optimal sample size. When only input/output data is available, it is well known that the system matrices can be identified only up to a similarity transformation even in the noise-free case but Markov parameters are identifiable. Therefore, we focus on obtaining a realization. One classical technique to derive a realization from the Markov parameters is the Ho-Kalman (a.k.a., eigensystem realization algorithm – ERA) algorithm . The Ho-Kalman algorithm constructs a balanced realization222Balanced realizations give a representation of the system in a basis that orders the states in terms of their effect on the input/output behavior. This is relevant for determining the system order and for model reduction .
for the system from the singular value decomposition of the Hankel matrix of the Markov parameters. By proving a stability result for the Ho-Kalman algorithm and combining it with the sample complexity results, we show how much data is needed to learn a balanced realization of the system up to a desired accuracy with high probability.
2 Problem Setup
We first introduce the basic notation. Spectral norm
returns the largest singular value of a matrix. Multivariate normal distribution with meanand covariance matrix is denoted by . denotes the transpose of a matrix . returns the Moore–Penrose inverse of the matrix
. Covariance matrix of a random vectoris denoted by . returns the trace of a matrix. stands for absolute constants.
Suppose we have an observable and controllable linear system characterized by the system matrices and this system evolves according to
Our goal is to learn the characteristics of this system and to provide finite sample bounds on the estimation accuracy. Given a horizon , we will learn the first Markov parameters of the system. The first Markov parameter is the matrix , and the remaining parameters are the set of matrices . As it will be discussed later on, by learning these parameters,
we can provide bounds on how well can be estimated for a future time ,
we can identify the state-space matrices (up to a similarity transformation).
Problem setup: We assume that are vectors that are independent of each other with distributions , , and 333While we assume diagonal covariance throughout the paper, we believe our proof strategy can be adapted to arbitrary covariance matrices.. is the input vector which is known to us. and are the process and measurement noise vectors respectively. We also assume that the initial condition of the hidden state is . Observe that Markov parameters can be found if we have access to cross correlations . In particular, we have the identities
Hence, if we had access to infinitely many independent pairs, our task could be accomplished by a simple averaging. In this work, we will show that, one can robustly learn these matrices from a small amount of data generated from a single realization of the system trajectory. The challenge is efficiently using finite and dependent data points to perform reliable estimation. Observe that, our problem is identical to learning the concatenated matrix defined as
Next section describes our input and output data. Based on this, we formulate a least-squares procedure that estimates . The estimate will play a critical role in the identification of the system matrices.
2.1 Least-Squares Procedure
To describe the estimation procedure, we start by explaining the data collection process. Given a single input/output trajectory , we generate subsequences of length , where and . To ease representation, we organize the data and the noise into length chunks denoted by the following vectors,
In a similar fashion to define the matrix,
To establish an explicit connection to Markov parameters, can be expanded recursively until to relate the output to the input and Markov parameter matrix as follows,
where, corresponds to the error due to the effect of the state at time . With this relation, we will use as inputs and outputs of our regression problem. We treat , , and as additive noise and attempt to estimate from covariates . Note that, the noise terms are zero-mean including since we assumed . With these in mind, we form the following least-squares problem,
Defining our label matrix and input data matrix as,
we obtain the minimization . Hence, the least-squares solution is given by
where is the left pseudo-inverse of . Ideally, we would like the estimation error to be small. Our main result bounds the norm of the error as a function of the sample size and noise levels and .
3 Results on Learning Markov Parameters
denote the spectral radius of a matrix which is the largest absolute value of its eigenvalues. Our results in this section apply to stable systems where. Additionally we need a related quantity involving which is the spectral norm to spectral radius ratio of its exponents defined as . We will assume which is a mild condition: For instance, if is diagonalizable,
is a function of its eigenvector matrix and is finite. Another important parameter is the steady state covariance matrix ofwhich is given by
It is rather trivial to show that for all , . We will use to bound the error due to the unknown state at time . Following the definition of , we have that . We characterize the impact of
by its “effective standard deviation”that is obtained by scaling the bound on by an additional factor which yields,
Our first result is a simplified version of Theorem 3.2 and captures the problem dependencies in terms of the total standard deviations and the total dimensions .
Suppose and where . Given observations of a single trajectory until time , with high probability444Precise statement on the probability of success is provided in the proof, the least-square estimator of the Markov parameter matrix obeys
Remark: Our result is stated in terms of the spectral norm error . One can deduce the following Frobenius norm bound by naively bounding terms and swapping term by (following (A.2), (A.3)). This yields, .
Our bound individually accounts for the the process noise sequence , measurement noise , and the contribution of the unknown state . Setting and to , we end up with the unknown state component . has a multiplier inside hence larger implies smaller . On the other hand, larger increases the size of the matrix as its dimensions are . This dependence is contained inside the term which grows proportional to (ignoring terms). corresponds to the minimum observation period since there are unknowns and we get to observe measurements at each timestamp. Hence, ignoring logarithmic terms, our result requires and estimation error decays as
. This behavior is similar to what we would get from solving a linear regression problem with independent noise and independent covariates. This highlights the fact that our analysis successfully overcomes the dependencies of covariates and noise terms.
Our main theorem is a slightly improved version of Theorem 3.1 and is stated below. Theorem 3.1 is operational in the regime . In practical applications, hidden state dimension can be much larger than number of sensors and input dimension . On the other hand, the input data matrix becomes tall as soon as hence ideally (2.7) should work as soon as . Our main result shows that reliable estimation is indeed possible in this more challenging regime. It also carefully quantifies the contribution of each term to the overall estimation error.
Suppose system is stable (i.e. ) and . We observe a trajectory until time . Then, with high probability, the least-square estimator of the Markov parameter matrix obeys
where are given by
Here are absolute constants and where .
One advantage of Theorem 3.2 is that it works in the regime . Additionally, Theorem 3.2 provides tighter individual error bounds for the terms and explicitly characterizes the dependence on inside the term.
Theorem 3.2 can be improved in a few directions. Some of the log factors that appear in our sample size might be spurious. These terms are arising from a theorem borrowed from Krahmer et al. ; which actually has a stronger implication than what we need in this work. We also believe (3.1) is overestimating the correct dependence by a factor of .
3.1 Estimating the Output via Markov Parameters
The following lemma illustrates how learning Markov parameters helps us bound the prediction error.
Lemma 3.3 (Estimating )
Suppose and , , for as described in Section 2. Assume, we have an estimate of that is independent of these variables and we employ the estimator
Proof Following from the input/output identity (2.5), the key observation is that for a fixed , are all independent of each other and their prediction errors are uncorrelated. Since , . Same argument applies to and which obeys . Observe that th largest eigenvalue of is upper bounded by via Min-Max principle  hence .
4 Markov Parameters to Hankel Matrix:
Low Order Approximation of Stable Systems
So far our attention has focused on estimating the impulse response for a particular horizon . Clearly, we are also interested in understanding how well we learn the overall behavior of the system by learning a finite impulse approximation. In this section, we will apply our earlier results to approximate the overall system by using as few samples as possible. A useful idea towards this goal is taking advantage of the stability of the system. The Markov parameters decay exponentially fast if the system is stable i.e. . This means that, most of the Markov parameters will be very small after a while and not learning them might not be a big loss for learning the overall behavior. In particular, ’th Markov parameter obeys
This implies that, the impact of the impulse response terms we don’t learn can be upper bounded. For instance, the total spectral norm of the tail terms obey
To proceed fix a finite horizon that will later be allowed to go infinity. Represent the estimate as where corresponds to the noisy estimate of . Now, let us consider the estimated and true order Markov parameters
Similarly we define the associated block Hankel matrices of size as follows
The following theorem merges results of this section with a specific choice of to give approximation bounds for the infinite Markov operator and Hankel operator . For notational simplicity, we shall assume that there is no process noise.
Suppose the spectral radius obeys . Fix a number and suppose process noise obeys . Assume sample size and estimation horizon satisfies555Exact form of the bounds depend on and is provided in the proof.
Then, given observations of a single trajectory until time and estimating first Markov parameters via least-squares estimator (2.7), with high probability, the following bounds hold on the infinite impulse response and Hankel matrix of the system.
In essence, the above theorem is a corollary of Theorem 3.2. However, it further simplifies the bounds and also provides approximation to systems overall behavior (e.g. infinite Hankel matrix). In particular, these bounds exploit stability of the system and allows us to treat the system as if it has a logarithmic order. Observe that (4.3) only logarithmically depends on the critical problem variables such as precision and spectral radius. In essence, the effective system order is dictated by the eigen-decay and equal to hence stability allows us to treat the system as if it has a logarithmically small order. Ignoring logarithmic terms except , using and picking
Remarkably, sample size is independent of the state dimension and only linearly grows with . Indeed, one needs at least samples to estimate a single Markov parameter and we need only logarithmically more than this minimum (i.e. ) to estimate the infinite Hankel matrix.
5 Non-Asymptotic System Identification via Ho-Kalman
In this section, we first describe the Ho-Kalman algorithm  that generates from the Markov parameter matrix . We also show that the algorithm is stable to perturbations in and the output of Ho-Kalman gracefully degrades as a function of . Combining this with Theorem 3.1 implies guaranteed non-asymptotic identification of multi-input-multi-output systems from a single trajectory. We remark that results of this section do not assume stability and applies to arbitrary, possibly unstable, systems. We will use the following Hankel matrix definition to introduce the algorithms.
Definition 5.1 (Clipped Hankel matrix)
Given a block matrix and integers satisfying , define the associated Hankel matrix to be the block matrix with size blocks where th block is given by
Note that, does not contain , which shall correspond to the (or ) matrix for our purposes. This is solely for notational convenience as the first Markov parameter in is ; however are identified from the remaining Markov parameters of type .
5.1 System Identification Algorithm
Given a noisy estimate of , we wish to learn good system matrices from up to trivial ambiguities. This will be achieved by using Algorithm 1 which admits the matrix , system order and Hankel dimensions as inputs. Throughout this section, we make the following two assumptions to ensure that the system we wish to learn is order- and our system identification problem is well-conditioned.
the system is observable and controllable; hence is the order of the system.
Hankel matrix formed from is rank-. This can be ensured by choosing sufficiently large . In particular is guaranteed to work by the first assumption above.
Learning state-space representations is a non-trivial, inherently non-convex problem. Observe that there are multiple state-space realizations that yields the same system and Markov matrix
yields the same system and Markov matrix. In particular, for any nonsingular matrix ,
is a valid realization and yields the same system. Hence, similarity transformations of generate a class of solutions. Note that is already estimated as part of . Since is a submatrix of , we clearly have
Hence, we focus our attention on learning . Suppose we have access to the true Markov parameters and the corresponding Hankel matrix . In this case, is a rank- matrix and th block of is equal to . Defining (extended) controllability and observability matrices and , we have . However, it is not clear how to find .
The Ho-Kalman algorithm accomplishes this task by finding a balanced realization and returning some matrices from possibly noisy Markov parameter matrix . Let the input to the algorithm be where corresponds to the noisy estimate of . We construct the Hankel matrix as described above so that th block of is equal to . Let be the submatrix of after discarding the rightmost block and be the best rank- approximation of obtained by setting its all but top singular values to zero. Let be the submatrix after discarding the left-most block. Note that both have size . Take the singular value decomposition (SVD) of the rank- matrix as (with ) and write
If was equal to the ground truth , then would correspond to the order observability matrix and the order controllability matrix of the actual balanced realization based on noiseless SVD. Here, matrices are not necessarily equal to , however they yield the same system. Note that, the columns of are the scaled versions of the left and right singular vectors of respectively. The Ho-Kalman algorithm finds as follows.
is the first submatrix of .
is the first submatrix of .
This procedure (Ho-Kalman) returns the true balanced realization of the system when Markov parameters are known i.e. . Our goal is to show that even with noisy Markov parameters, this procedure returns good estimates of the true balanced realization. We remark that there are variations of this procedure; however the core idea is the same and they are equivalent when the true Markov parameters are used as input. For instance, when constructing , one can attempt to improve the noise robustness of the algorithm by picking balanced dimensions .
5.2 Robustness of the Ho-Kalman Algorithm
Observe that of Algorithm 1 are functions of the input matrix . For the subsequent discussion, we let
be the matrices corresponding to ground truth .
be the matrices corresponding to the estimate .
Furthermore, let be the actual balanced realization associated with and let be the Ho-Kalman output associated with . Note that since is already rank . We now provide a lemma relating the estimation error of to that of and .
and satisfies the following perturbation bounds,
Let us denote the th largest singular value of via . Note that is the smallest nonzero singular value of since . A useful implication of Theorem 3.1 (in light of Lemma 5.2) is that if is large enough, the true system order can be non-asymptotically estimated from the noisy Markov parameter estimates via singular value thresholding.
Our next result shows the robustness of the Ho-Kalman algorithm to possibly adversarial perturbations on the Markov parameter matrix .
Suppose and be the Hankel matrices derived from and respectively per Definition 5.1. Let be the state-space realization corresponding to the output of Ho-Kalman with input and be the state-space realization corresponding to output of Ho-Kalman with input . Suppose the system is observable and controllable and let and be order- controllability/observability matrices associated with and respectively. Suppose and perturbation obeys
Then, there exists a unitary matrix
Then, there exists a unitary matrixsuch that,
Furthermore, hidden state matrices satisfy