1 Introduction
We investigate the canonical problem of identifying Linear Time-Invariant (LTI) systems of the form:
(1) $x_{t+1} = A x_t + \eta_t, \quad t \ge 0,$
where $x_t \in \mathbb{R}^d$ denotes the state at time $t$, $A \in \mathbb{R}^{d \times d}$ is initially unknown but stable (i.e., its spectral radius satisfies $\rho(A) < 1$), and the $\eta_t$'s are i.i.d. subgaussian zero-mean noise vectors with covariance matrix equal to the identity (for simplicity). The objective is to estimate the matrix $A$ from an observed trajectory of the system. Most work on this topic is concerned with the asymptotic convergence properties of specific estimation methods (e.g., ordinary least squares (OLS), the ML estimator) [3, 4, 5]. Recently, however, there have been intense research efforts towards understanding the finite-time behavior of classical estimation methods [6, 7, 1, 8, 9, 10]. The results therein aim at deriving bounds on the sample complexity of existing estimation methods (mostly the OLS), namely at finding upper bounds on the number of observations sufficient to identify $A$ with prescribed levels of accuracy $\varepsilon$ and confidence $\delta$ (throughout the paper, $\|\cdot\|$ denotes the Euclidean norm for vectors, and the operator norm for matrices): $\mathbb{P}(\|\hat{A}_T - A\| \ge \varepsilon) \le \delta$, if $\hat{A}_T$ denotes the estimator of $A$ after $T$ observations. Lower bounds on the sample complexity (valid for any estimator) have also been derived in [1, 2]. Sample complexity upper bounds appearing in the aforementioned papers are often hard to interpret and to compare. The main difficulty behind deriving such bounds stems from the fact that the data used to estimate $A$ is correlated (we observe a single trajectory of the system). In turn, existing analyses rely on involved concentration results for random matrices [11, 12] and self-normalized processes [13]. The most technical and often tedious part of these analyses deals with deriving concentration results for the covariates matrix, defined as $\sum_{t=0}^{T-1} x_t x_t^\top$, and the tightness of these results directly impacts that of the sample complexity upper bound (refer to §2 for more details). In this paper, we present a novel and simple analysis of the error of the OLS estimator. In this analysis, we derive tight concentration results for the entire spectrum of the covariates matrix. To this aim, we first show that this spectrum can be controlled when the norm of an appropriately normalized version of the state matrix is upper bounded, where the normalization involves the finite-time controllability gramian of the system, $\Gamma_k = \sum_{s=0}^{k-1} A^s (A^s)^\top$.
We then derive a concentration inequality for this norm by (i) expressing the quantity as the supremum of a chaos process [14]; and (ii) applying the Hanson-Wright inequality [15] and an $\varepsilon$-net argument to upper bound this supremum.
Our main result is simple and easy to interpret. For any accuracy $\varepsilon$ and confidence level $\delta$, we establish that the OLS estimator is $(\varepsilon, \delta)$-PAC after $T$ observations, i.e., $\mathbb{P}(\|\hat{A}_T - A\| \ge \varepsilon) \le \delta$, provided that:
(2) 
where $\lambda_{\min}$ denotes the smallest eigenvalue of the finite-time controllability gramian, $c$ is a universal constant, and $C$ depends only on $\rho(A)$ and is finite when $\rho(A) < 1$.
In [1], the authors have shown that an estimator can be an $(\varepsilon, \delta)$-PAC estimator after $T$ observations, uniformly over the set of scaled orthogonal matrices $A = \rho O$ ($O$ orthogonal) for some fixed $\rho$, only if the following necessary condition holds for sufficiently small $\varepsilon$:
(3) 
for some universal constant $c$. Hence our sufficient condition (2) cannot be improved when the accuracy level $\varepsilon$ is sufficiently small. This result was actually conjectured in [1]. More recently, in [2], we also derived, for an arbitrary matrix $A$, a necessary condition for the existence of an $(\varepsilon, \delta)$-PAC estimator after $T$ observations, but we believe that this sample complexity lower bound can be improved to (3) (for an arbitrary matrix $A$, not only for scaled orthogonal matrices). In any case, when $\varepsilon$ and $\delta$ approach 0, the necessary condition derived in [2] and the sufficient condition (2) are identical. They would also be identical for any $(\varepsilon, \delta)$ if we managed to show that the sample complexity lower bound (3) holds for any matrix $A$.
Another way of presenting our results is to consider the rate at which the estimation error decays with the number of observations $T$. We obtain: with probability at least $1 - \delta$,
(4) 
One can readily check that $\Gamma_T \succeq I$, so that $\lambda_{\min}(\Gamma_T) \ge 1$; hence the error decays as $1/\sqrt{T}$, which is the best possible decay rate. In §2, we compare our result to those of existing analyses of the OLS estimator. Our result is tighter than state-of-the-art results, and it is derived using much simpler arguments.
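The $1/\sqrt{T}$ decay can be eyeballed numerically; in this sketch (our own toy experiment with a stable diagonal $A$), quadrupling the trajectory length roughly halves the average OLS error:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
A = np.diag([0.3, 0.5, 0.7])     # arbitrary stable system (ours)

def avg_ols_error(T, runs=20):
    errs = []
    for _ in range(runs):
        x = np.zeros(d)
        S1, S0 = np.zeros((d, d)), np.zeros((d, d))
        for _ in range(T):
            x_next = A @ x + rng.standard_normal(d)
            S1 += np.outer(x_next, x)   # sum of x_{t+1} x_t^T
            S0 += np.outer(x, x)        # covariates matrix
            x = x_next
        errs.append(np.linalg.norm(S1 @ np.linalg.pinv(S0) - A, 2))
    return np.mean(errs)

e_short, e_long = avg_ols_error(500), avg_ols_error(2000)
print(e_short / e_long)          # close to sqrt(2000 / 500) = 2
```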
Notations. Throughout the paper, $\|M\|$ denotes the operator norm of the matrix $M$, and the Euclidean norm of a vector $x$ is denoted either by $\|x\|$ or $\|x\|_2$. The unit sphere in $\mathbb{R}^d$ is denoted by $\mathbb{S}^{d-1}$. The singular values of any real matrix $M$ are denoted by $s_i(M)$, arranged in non-increasing order. For any square matrix $M$, $M^\dagger$ represents its Moore-Penrose pseudo-inverse. The vectorization operation $\mathrm{vec}(\cdot)$ acts on a matrix by stacking its columns into one long column vector. Next, we recall the definition of subgaussian random variables and vectors. A real-valued r.v. is said to be subgaussian if its $\psi_2$-norm is finite. A random vector is subgaussian if its one-dimensional marginals are subgaussian r.v.'s; its $\psi_2$-norm is then defined as the supremum of the $\psi_2$-norms of its marginals along unit directions. Note that a gaussian vector with covariance matrix equal to the identity is subgaussian, and its $\psi_2$-norm is 1. A random vector $Y$ is isotropic if $\mathbb{E}[\langle Y, x \rangle^2] = \|x\|^2$ for all $x$, or equivalently if $\mathbb{E}[Y Y^\top] = I$. Finally, we introduce notations specifically related to our linear system. Let $X$ and $E$ be the matrices defined by $X = (x_0, x_1, \dots, x_{T-1})$ and $E = (\eta_0, \eta_1, \dots, \eta_{T-1})$. We further define the noise vector $\xi = \mathrm{vec}(E)$.
2 Related work
The finite-time analysis of the OLS estimator has received a lot of attention recently; see [16], [1], [17], and references therein.
In [16], the authors prove that the OLS estimator is $(\varepsilon, \delta)$-PAC if the observed trajectory is long enough, but the constant in their sample complexity upper bound depends on the problem parameters in a complicated manner. The bound does not exhibit the right scaling in the accuracy and confidence parameters. It also has a worse scaling with the dimension $d$ than our bound.
The main result in [1] (Theorem 2.1) states that the OLS estimator is $(\varepsilon, \delta)$-PAC after $T$ observations under the following condition:
for some free parameter satisfying
This result is difficult to interpret, and choosing the free parameter to optimize the bound seems involved. The authors of [1] present a simplified result in Corollary 2.2, removing the dependence on this parameter. However, the result requires an additional condition, and we can show that one of the quantities involved hides a dependence that the authors do not express. Corollary A.2, presented in the appendix of [1], is more explicit, and states that the OLS is $(\varepsilon, \delta)$-PAC if
where the $k_i$'s are the block sizes of the Jordan decomposition of $A$, and $P$ is the diagonalizing matrix in this decomposition. Note that the term involving $P$ may be large. Compared to our sample complexity upper bound (2), the above bound has a worse dependence on the dimension $d$. It has the advantage of not featuring one of the terms appearing in (2), but this advantage disappears in some regimes. The analysis of [1] relies on a decomposition of the estimation error obtained from a singular value decomposition. Upper bounding one of the resulting terms is then the most involved step of the analysis. To this aim, the authors adapt the so-called small-ball method [18] (which typically requires independence), and introduce the Block Martingale Small Ball (BMSB) condition (which indeed requires the introduction of the free parameter in the result). The authors of [17] use the same decomposition of the estimation error as ours. The first term of this decomposition corresponds to a self-normalized process, and can be analyzed using the related theory [13]. The analysis of the second term again requires controlling the singular values of the covariates matrix. In turn, the analysis presented in [17] is involved, and unfortunately leads to bounds that are not precise in the system parameters and the dimension $d$. Overall, the dependence of the sample complexity upper bound proposed in [17] on some parameters is not explicit, and that on others is unknown and hidden in constants.
3 Main result
Consider the linear system (1), and assume that the noise vectors $(\eta_t)_{t \ge 0}$ are i.i.d. isotropic vectors with independent coordinates whose $\psi_2$-norms are uniformly upper bounded. We observe a single trajectory $(x_0, x_1, \dots, x_T)$ of the system, and build from these observations the OLS estimator $\hat{A}_T$, defined as a minimizer of $\sum_{t=0}^{T-1} \|x_{t+1} - A' x_t\|^2$ over matrices $A'$. It enjoys the following closed-form expression:
(5) $\hat{A}_T = \left( \sum_{t=0}^{T-1} x_{t+1} x_t^\top \right) \left( \sum_{t=0}^{T-1} x_t x_t^\top \right)^\dagger$
The estimation error is also explicit:
(6) $\hat{A}_T - A = \left( \sum_{t=0}^{T-1} \eta_t x_t^\top \right) \left( \sum_{t=0}^{T-1} x_t x_t^\top \right)^\dagger$
which we can rewrite as (using the notation introduced in the introduction)
(7) $\hat{A}_T - A = E X^\top (X X^\top)^\dagger$
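This identity is easy to verify numerically. The sketch below (our own, with an arbitrary matrix and Gaussian noise) simulates a trajectory started at $x_0 = 0$ and checks that $\hat{A}_T - A = E X^\top (X X^\top)^\dagger$, where the columns of $X$ and $E$ are the states $x_0, \dots, x_{T-1}$ and the noises $\eta_0, \dots, \eta_{T-1}$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 4, 50
A = 0.4 * rng.standard_normal((d, d))  # an arbitrary system matrix (ours)

# One trajectory started at x_0 = 0; collect the covariates X = (x_0 ... x_{T-1})
# and the noise matrix E = (eta_0 ... eta_{T-1}).
x = np.zeros(d)
X_cols, E_cols = [], []
for _ in range(T):
    eta = rng.standard_normal(d)
    X_cols.append(x.copy())
    E_cols.append(eta)
    x = A @ x + eta
X = np.array(X_cols).T
E = np.array(E_cols).T

# OLS from the (x_t, x_{t+1}) pairs, and the claimed error identity.
Y = A @ X + E                              # columns x_1, ..., x_T
A_hat = Y @ X.T @ np.linalg.pinv(X @ X.T)
lhs = A_hat - A
rhs = E @ X.T @ np.linalg.pinv(X @ X.T)
print(np.abs(lhs - rhs).max())             # numerically zero
```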
Before stating our main result, we define a quantity that arises naturally in our analysis. We define the truncated block Toeplitz matrix $\tilde{A}_T \in \mathbb{R}^{Td \times Td}$, whose $(i, j)$-th $d \times d$ block equals $A^{i-j}$ if $i \ge j$ and $0$ otherwise, as
(8) $\tilde{A}_T = \begin{pmatrix} I & 0 & \cdots & 0 \\ A & I & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ A^{T-1} & \cdots & A & I \end{pmatrix}.$
When $\rho(A) < 1$, we can upper bound the operator norm of this truncated block Toeplitz matrix by a quantity independent of $T$, as shown in Lemma 4 presented in the appendix:
(9) 
Our main result is the following Theorem.
Theorem 1 (OLS performance).
Let $\hat{A}_T$ denote the OLS estimator. For any $\delta \in (0, 1)$ and any $\varepsilon > 0$, we have $\mathbb{P}(\|\hat{A}_T - A\| \ge \varepsilon) \le \delta$ as long as the following condition holds:
(10) 
where the quantities involved are as defined above and $c$ is a universal constant.
4 Spectrum of the Covariates Matrix
In this section, we analyze the spectrum of the covariates matrix $\sum_{t=0}^{T-1} x_t x_t^\top$. Such an analysis is at the core of the proof of Theorem 1: relating the estimation error to the spectrum of this matrix is the key step towards the theorem. We are actually able to control the entire spectrum of the covariates matrix with high probability, and prove:
Theorem 2.
Let . Let . Then:
holds with probability at least
for some universal constants .
Theorem 2 is a direct consequence of the following two lemmas. Lemma 1 is a corollary of [19, Ch. 4, Lemma 4.5.6], and is proved in the appendix. Informally, in this lemma, we may think of one factor as a tall random matrix and of the other as a deterministic normalizing matrix. If the latter is chosen appropriately, then one would hope that their product is an approximate isometry (a mapping is said to be an isometry if it preserves norms), in the sense of Lemma 1, which in turn provides a two-sided bound on the spectrum of the covariates matrix.
Lemma 1 (Approximate isometries).
Let , and let be a full rank matrix. Let and assume that
(11) 
Then the following holds
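The phenomenon behind Lemma 1 can be illustrated numerically (our own example, in which the random matrix has i.i.d. standard Gaussian entries and the correct normalization is simply $1/\sqrt{T}$): all singular values of the normalized matrix are close to 1, which yields a two-sided bound on the spectrum of the associated Gram matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 5, 4000
S = rng.standard_normal((T, d))   # tall random matrix with isotropic rows

# After normalization by 1/sqrt(T), S acts as an approximate isometry:
# every singular value of S / sqrt(T) is close to 1, hence the spectrum of
# S^T S / T is sandwiched between (1 - eps)^2 and (1 + eps)^2 for small eps.
sv = np.linalg.svd(S / np.sqrt(T), compute_uv=False)
print(sv.min(), sv.max())         # both close to 1
```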
Lemma 2 is the main ingredient of the proof of Theorem 2, and also of that of Theorem 1. The lemma is established by expressing the quantity of interest as the supremum of a chaos process, which we control by combining the Hanson-Wright inequality and a classical $\varepsilon$-net argument. In the next two subsections, we formally introduce chaos processes and provide related concentration inequalities; we also present the $\varepsilon$-net argument used to complete the proof of Lemma 2.
Lemma 2.
Let .
holds with probability at most
for some positive absolute constants .
4.1 Chaos Processes and the Hanson-Wright Inequality
In this section, we introduce chaos processes, and present the Hanson-Wright inequality, an instrumental concentration result related to these processes. A chaos in probability theory refers to a quadratic form $w^\top M w$, where $M$ is a deterministic matrix and $w$ is a random vector with independent coordinates. If the vector $w$ is isotropic, we simply have $\mathbb{E}[w^\top M w] = \mathrm{tr}(M)$. The process obtained by letting $M$ range over a set of matrices is referred to as a chaos process. In our analysis, we use the following concentration result on chaoses, due to Hanson and Wright [20, 21]; see [15] (Theorem 2.1).
Theorem 3 (Concentration of anisotropic random vectors).
Let $M$ be a fixed $m \times n$ matrix, and let $w = (w_1, \dots, w_n)$
be a random vector with zero-mean, unit-variance, subgaussian independent coordinates. Then, for all $t \ge 0$,
$\mathbb{P}\left( \left| \|M w\|_2 - \|M\|_F \right| > t \right) \le 2 \exp\left( - \frac{c t^2}{K^4 \|M\|^2} \right),$
where $c$ is an absolute positive constant and $K = \max_i \|w_i\|_{\psi_2}$.
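An empirical illustration of the concentration that Theorem 3 quantifies (our own, with Gaussian coordinates and an arbitrary fixed matrix): $\|M w\|_2$ concentrates around $\|M\|_F$, with deviations on the scale of the operator norm $\|M\|$.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 50, 200
M = rng.standard_normal((m, n)) / np.sqrt(n)   # arbitrary fixed matrix (ours)

frob = np.linalg.norm(M, 'fro')
op = np.linalg.norm(M, 2)

# For w with independent zero-mean unit-variance subgaussian coordinates,
# ||M w||_2 concentrates around ||M||_F with subgaussian tails governed by ||M||.
devs = np.array([np.linalg.norm(M @ rng.standard_normal(n)) - frob
                 for _ in range(2000)])
print(np.mean(np.abs(devs) > 4 * op))          # large deviations are rare
```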
4.2 The net argument
The $\varepsilon$-net argument is a simple, yet powerful tool in the non-asymptotic theory of random matrices and high-dimensional probability [22, 19, 23]. Here, we provide an instance of this argument.
Lemma 3.
Let $M$ be an $m \times n$ random matrix, and let $\varepsilon \in (0, 1/2)$. Furthermore, let $\mathcal{N}_\varepsilon$ be an $\varepsilon$-net of the unit sphere $\mathbb{S}^{n-1}$ with minimal cardinality. Then for all $t \ge 0$, we have
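In dimension 2, the net argument can be carried out explicitly. The following sketch (our own; it uses the simple vector form of the bound, $\|M\| \le (1-\varepsilon)^{-1} \max_{v \in \mathcal{N}_\varepsilon} \|M v\|$, rather than the exact statement of Lemma 3) builds an $\varepsilon$-net of the unit circle and compares the supremum over the net with the true operator norm:

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((6, 2))   # random matrix acting on R^2
eps = 0.1

# Points spaced eps radians apart on the circle: any unit vector is within
# Euclidean distance eps of some net point (chord length <= arc length).
angles = np.arange(0.0, 2 * np.pi, eps)
net = np.stack([np.cos(angles), np.sin(angles)], axis=1)

sup_net = max(np.linalg.norm(M @ v) for v in net)
op = np.linalg.norm(M, 2)
print(op, sup_net)   # sup_net <= op <= sup_net / (1 - eps)
```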
4.3 Proof of Lemma 2
Step 1 ( as the supremum of a chaos process)
Note that we can write .
It follows from the fact that is isotropic that
Noting that the normalizing matrix is symmetric positive definite, we may define the inverse of its square root, which is itself a symmetric positive definite matrix. We think of this inverse square root as a normalization. Now, we have:
where in the second equality we used the variational form of the operator norm, and in the third equality the matrix is defined as
and the last equality follows from the fact that the noise vector is isotropic. We have thus proved that the quantity of interest is the supremum of a chaos process, where the parametrizing matrix is
Step 2 (Uniform bound on the chaos). Fix a unit vector and a deviation level. Again, recalling that the noise vector is a zero-mean, subgaussian random vector with independent coordinates, we deduce from the Hanson-Wright inequality (see Theorem 3) that:
holds with probability at most
for some positive universal constant. Using simple relations between the norms involved, an upper bound on the above probability is
Step 3 ($\varepsilon$-net argument). Using an $\varepsilon$-net argument (see Lemma 3), we obtain that
holds with probability at most
By choosing the deviation level appropriately, we obtain that
holds with probability at most
for some positive absolute constants . This completes the proof of Lemma 2.
5 Proof of Theorem 1
We first decompose the estimation error as:
We introduce the matrix , and the event defined as:
Observe that:
5.1 Upper bound of
5.2 Upper bound of
We derive an upper bound on the above probability using a technique similar to that of [17]. We first use the event introduced above to simplify the condition, until we obtain a quantity that can be analyzed using concentration results for self-normalized processes.
When the event occurs, we have:
or equivalently
Define and . Note that when event occurs,
Thus and we obtain:
Next, again when the event occurs, we deduce that:
Hence,
Furthermore, note that under , we have , which implies that for all
Now consider the condition
Since , the above condition is equivalent to
(13) 
We deduce that, under the above condition,
We are now ready to apply Corollary 1 for self-normalized processes, presented in the appendix, and conclude that the event
occurs with probability at most . This implies that:
as long as condition (13) holds. Hence
when (13) holds.
5.3 Concluding steps
References
 [1] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht, “Learning without mixing: Towards a sharp analysis of linear system identification,” in Conference on Learning Theory, 2018, pp. 439–473.
 [2] Y. Jedra and A. Proutiere, “Sample complexity lower bounds for linear system identification,” in IEEE Conference on Decision and Control, 2019.
 [3] G. C. Goodwin and R. L. Payne, Dynamic system identification: experiment design and data analysis. Academic press New York, 1977, vol. 136.
 [4] L. Ljung, “Consistency of the least-squares identification method,” IEEE Transactions on Automatic Control, vol. 21, no. 5, pp. 779–781, October 1976.
 [5] L. Ljung, “On the consistency of prediction error identification methods,” in System Identification Advances and Case Studies, ser. Mathematics in Science and Engineering, R. K. Mehra and D. G. Lainiotis, Eds. Elsevier, 1976, vol. 126, pp. 121 – 164. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0076539208608711
 [6] A. Rantzer, “Concentration bounds for single parameter adaptive control,” in 2018 Annual American Control Conference (ACC). IEEE, 2018, pp. 1862–1866.
 [7] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Finite time identification in unstable linear systems,” Automatica, vol. 96, pp. 342–353, 2018.
 [8] T. Sarkar and A. Rakhlin, “How fast can linear dynamical systems be learned?” CoRR, vol. abs/1812.01251, 2018. [Online]. Available: http://arxiv.org/abs/1812.01251
 [9] S. Oymak and N. Ozay, “Non-asymptotic identification of LTI systems from a single trajectory,” CoRR, vol. abs/1806.05722, 2018. [Online]. Available: http://arxiv.org/abs/1806.05722
 [10] T. Sarkar, A. Rakhlin, and M. A. Dahleh, “Finite-time system identification for partially observed LTI systems of unknown order,” CoRR, vol. abs/1902.01848, 2019. [Online]. Available: http://arxiv.org/abs/1902.01848

 [11] S. Mendelson, “Learning without concentration,” in Proceedings of The 27th Conference on Learning Theory, ser. Proceedings of Machine Learning Research, M. F. Balcan, V. Feldman, and C. Szepesvári, Eds., vol. 35. Barcelona, Spain: PMLR, 13–15 Jun 2014, pp. 25–39. [Online]. Available: http://proceedings.mlr.press/v35/mendelson14.html
 [12] R. Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” in Compressed Sensing: Theory and Applications. Cambridge University Press, 2012, pp. 210–268.
 [13] V. H. Peña, T. L. Lai, and Q.-M. Shao, Self-normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media, 2008.
 [14] F. Krahmer, S. Mendelson, and H. Rauhut, “Suprema of chaos processes and the restricted isometry property,” Communications on Pure and Applied Mathematics, vol. 67, no. 11, pp. 1877–1904, 2014.
 [15] M. Rudelson and R. Vershynin, “Hanson-Wright inequality and sub-gaussian concentration,” Electronic Communications in Probability, vol. 18, 2013.
 [16] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Finite time identification in unstable linear systems,” Automatica, vol. 96, pp. 342 – 353, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0005109818303546
 [17] T. Sarkar and A. Rakhlin, “Near optimal finite time identification of arbitrary linear dynamical systems,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 5610–5618. [Online]. Available: http://proceedings.mlr.press/v97/sarkar19a.html
 [18] S. Mendelson, “Learning without concentration,” in Conference on Learning Theory, 2014, pp. 25–39.

 [19] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
 [20] D. L. Hanson and F. T. Wright, “A bound on tail probabilities for quadratic forms in independent random variables,” The Annals of Mathematical Statistics, vol. 42, no. 3, pp. 1079–1083, 1971.
 [21] F. T. Wright, “A bound on tail probabilities for quadratic forms in independent random variables whose distributions are not necessarily symmetric,” Ann. Probab., vol. 1, no. 6, pp. 1068–1070, 12 1973. [Online]. Available: https://doi.org/10.1214/aop/1176996815
 [22] A. Litvak, A. Pajor, M. Rudelson, and N. Tomczak-Jaegermann, “Smallest singular value of random matrices and geometry of random polytopes,” Advances in Mathematics, vol. 195, no. 2, pp. 491–523, 2005. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0001870804002750
 [23] T. Tao, Topics in random matrix theory. American Mathematical Soc., 2012, vol. 132.
 [24] A. Böttcher and B. Silbermann, Introduction to large truncated Toeplitz matrices. Springer Science & Business Media, 2012.
 [25] N. Young, “The rate of convergence of a matrix power series,” Linear Algebra and its Applications, vol. 35, pp. 261 – 278, 1981. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0024379581902780
 [26] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” in Advances in Neural Information Processing Systems, 2011, pp. 2312–2320.
 [27] V. H. Peña, T. L. Lai, and Q.-M. Shao, Self-normalized Processes: Limit Theory and Statistical Applications. Springer Science & Business Media, 2008.
Appendix A Proof of Lemma 1
Proof.
Observe that
Using the fact that for all , we obtain for all
By the monotonicity of for , we deduce that:
from which we conclude
It follows immediately that:
The proof is completed observing that:
∎
Appendix B Properties of the truncated block Toeplitz matrix
Lemma 4.
Consider the block Toeplitz matrix as defined earlier, and assume that $\rho(A) < 1$. Then, there exists a positive constant that depends only on $A$ such that, for all $T$:
Proof.
From the theory of Toeplitz matrices (see [24]), we have
The argument behind the above inequality consists of noting that the truncated matrix is a submatrix of the infinite block Toeplitz operator defined as
thus its operator norm is upper bounded by that of this infinite operator. It follows immediately from the results in [24, Chapter 6] that
Next, note that:
Classic results on the convergence of matrix power series [25] ensure convergence
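A numerical companion to this convergence claim (our own, with an arbitrary stable matrix): when $\rho(A) < 1$, the norms $\|A^s\|$ decay geometrically after a transient, so the partial sums of $\sum_s \|A^s\|$ stabilize quickly, which is what makes $T$-independent bounds possible.

```python
import numpy as np

A = np.array([[0.7, 0.5],
              [0.0, 0.6]])        # arbitrary stable matrix (ours), rho(A) = 0.7

# ||A^s|| decays geometrically when rho(A) < 1, so sum_s ||A^s|| converges;
# the partial sums stabilize after a short transient.
norms = [np.linalg.norm(np.linalg.matrix_power(A, s), 2) for s in range(60)]
partial = np.cumsum(norms)
print(partial[-1], partial[-1] - partial[-10])   # tail contribution is tiny
```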