 # Finite-time Identification of Stable Linear Systems: Optimality of the Least-Squares Estimator

We provide a new finite-time analysis of the estimation error of stable linear time-invariant systems under the Ordinary Least Squares (OLS) estimator. Specifically, we characterize the sufficient number of observed samples (the length of the observed trajectory) so that the OLS is (ε,δ)-PAC, i.e. yields an estimation error less than ε with probability at least 1-δ. We show that this number matches existing sample complexity lower bound [1,2] up to universal multiplicative factors (independent of (ε,δ), of the system and of the dimension). This paper hence establishes the optimality of the OLS estimator for stable systems, a result conjectured in . Our analysis of the performance of the OLS estimator is simpler, sharper, and easier to interpret than existing analyses, but is restricted to stable systems. It relies on new concentration results for the covariates matrix.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We investigate the canonical problem of identifying Linear Time Invariant (LTI) systems of the form:

 x0=0    and    ∀t≥0, xt+1=Axt+ηt+1. (1)

denotes the state at time , is initially unknown but stable (i.e., its spectral radius satisfies ), and

are i.i.d. sub-gaussian zero-mean noise vectors with covariance matrix

(for simplicity). The objective is to estimate the matrix from an observed trajectory of the system. Most work on this topic is concerned with the convergence properties of some specific estimation methods (e.g. ordinary least squares (OLS), ML estimator) [3, 4, 5]. Recently however, there have been intense research efforts towards understanding the finite-time behavior of classical estimation methods [6, 7, 1, 8, 9, 10]. The results therein aim at deriving bounds on the sample complexity of existing estimation methods (mostly the OLS), namely at finding upper bounds on the number of observations sufficient to identify with prescribed levels of accuracy and confidence111Throughout the paper, denotes the euclidian norm for vectors, and the operator norm for matrices.: for any if denotes the estimator of after observations. Lower bounds of the sample complexity (valid for any estimator) have been also derived in [1, 2]. Sample complexity upper bounds appearing in the aforementioned papers are often hard to interpret and to compare. The main difficulty behind deriving such bounds stems from the fact that the data used to estimate is correlated (we observe a single trajectory of the system). In turn, existing analyses rely on involved concentration results for random matrices [11, 12] and self-normalized processes . The most technical and often tedious part of these analyses deals with deriving concentration results for the covariates matrix defined as , and the tightness of these results directly impacts that of the sample complexity upper bound (refer to §2 for more details).

In this paper, we present a novel and simple analysis of the error of the OLS estimator. In this analysis, we derive tight concentration results for the entire spectrum of the covariates matrix . To this aim, we first show that the spectrum of can be controlled when is upper bounded, where and is the finite-time controllability gramian of the system. We then derive a concentration inequality for by (i) expressing this quantity as the supremum of a chaos process ; and (ii) applying Hanson-Wright inequality  and an -net argument to upper bound this supremum.

Our main result is simple and easy to interpret. For any , we establish that the OLS estimator is -PAC after observations, i.e., , provided that:

 λmin(t−1∑s=0Γs(A))≥cmax{1ε2,J(A)2}(log(1/δ)+d), (2)

where

denotes the smallest eigenvalue of

, is a universal constant, and depends on only and is finite when ( if ).

In , the authors have shown that an estimator can be -PAC estimator after observations uniformly over the set of scaled-orthogonal matrices for some fixed only if the following necessary condition holds: for ,

 λmin(t−1∑s=0Γs(A))≥c1ε2(log(1/δ)+d), (3)

for some universal constant . Hence our sufficient condition (2) cannot be improved when the accuracy level is sufficiently low, i.e., when . This result was actually conjectured in . More recently in , we also proved that for any arbitrary , a necessary condition for the existence of a -PAC estimator after observations is , but we believe that this sample complexity lower bound can be improved to (3) (for arbitrary matrix , not only for orthogonal matrices). Anyway, when approach 0, and more precisely when and , the necessary condition derived in  and the sufficient condition (2) are identical. They would be also identical for any , if we manage to show that the sample complexity lower bound (2) holds for any matrix.

An other way of presenting our results is to consider the rate at which the estimation error decays with the number of observations . We obtain: with probability at least ,

 ∥At−A∥≤c ⎷log(1/δ)+dλmin(∑t−1s=0Γs(A)). (4)

One can readily check that , hence, decays as , which is the best possible decay rate. In §2, we compare our result to those of existing analyses of the OLS estimator. Our result is tighter than state-of-the-art results, and it is derived using much simpler arguments.

Notations. Throughout the paper, denotes the operator norm of the matrix , and the Euclidian norm of a vector is denoted either by or . The unit sphere in is denoted by

. The singular values of any real matrix

with are denoted by for , arranged in a non-increasing order. For any square matrix , represents its Moore-Penrose pseudo-inverse. The vectorization operation

acts on matrices by stacking its columns into one long column vector. Next, we recall the definition of sub-gaussian random variables and vectors. The

-norm of a r.v. is defined as . is said sub-gaussian if . Now a random vector is sub-gaussian if its one-dimensional marginals are sub-gaussian r.v.’s for all . Its -norm is then defined as Note that a gaussian vector with covariance matrix equal to the identity is sub-gaussian, and its -norm is 1. A vector is isotropic if for all or equivalently if .

Finally, we introduce notations specifically related to our linear system. Let and be the matrices defined by and . We further define the noise vector by .

## 2 Related work

The finite-time analysis of the OLS estimator has received a lot of attention recently, see , ,  and references therein.

In , the authors prove that the OLS estimator is -PAC if the observed trajectory is longer than . In this upper bound, the constant depends on in a complicated manner. The bound does not exhibit the right scaling in and . It also has a worse scaling with the dimension than our bound.

The main result in  (Theorem 2.1) states that the OLS estimator is -PAC after observations under the following condition:

 t≥1ε2λmin(Γk(A))(dlog(d/δ)+logdet(Γt(A)Γk(A)−1)),

for some satisfying

 tk≥c′(dlog(d/δ)+logdet(Γt(A)Γk(A)−1)).

This result is difficult to interpret, and choosing to optimize the bound seems involved. The authors of  present a simplified result in Corollary 2.2, removing the dependence in . However, the result requires that , and we can show that actually depends on (the authors do not express this dependence). Corollary A.2 presented in the appendix of  is more explicit, and states that OLS is -PAC if

 t≥c(dlog(dcond(S)t/δ)+∑lb2llogt),

where are the block sizes of the Jordan decomposition of , and is the diagonalizing matrix in this decomposition. Note that the term may be of order . Compared to our sample complexity upper bound (2), the above bound has a worse dependence in the dimension . It has the advantage not to have the term , but this advantage disappears when . The analysis of  relies on the decomposition of the estimation error , where

is obtained from the singular value decomposition

. Deriving an upper bound of is then the most involved step of the analysis. To this aim, the authors adapt the so-called small ball method  (which typically requires independence), and introduce the Block Martingale Small Ball condition (BMSB) (which indeed requires the introduction of the term in the result).

The authors of  use the same decomposition of as ours. This decomposition is . The first term corresponds to a self-normalized process, and can be analyzed using the related theory . The analysis of the second term again requires to control the singular values of . In turn, the analysis of presented in  is involved, and unfortunately leads to bounds that are not precise in the system parameters and the dimension . Overall, the upper bound of the sample complexity proposed in  is . The dependence in is not explicit, and that in is unknown and hidden in constants.

## 3 Main result

Consider the linear system (1), and assume that the noise vectors are i.i.d. isotropic vectors with independent coordinates of -norm upper bounded by . We observe a single trajectory of the system , and builds from these observations the OLS estimator: . enjoys the following closed form expression:

 At=(t∑s=0xs+1x⊤s)(t∑s=0xsx⊤s)† (5)

The estimation error is also explicit:

 At−A=(t∑s=1ηs+1x⊤s)(t∑s=1xsx⊤s)†, (6)

which we can re-write as (using the notation introduced in the introduction)

 At−A=E⊤X(X⊤X)†. (7)

Before stating our main result, we define a quantity that will arise naturally in our analysis. We define the truncated block Toeplitz matrix as

 Γ=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣IdAIdO⋱At−2…AIdAt−1……AId⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦. (8)

When , we can upper bound by a quantity , independent of , as shown in Lemma 4 presented in the appendix:

 ∥Γ∥≤J(A)=∑s≥0∥As∥. (9)

Our main result is the following Theorem.

###### Theorem 1 (OLS performance).

Let denote the OLS estimator. For any , and any , we have: as long as the following condition holds

 λmin(t−1∑s=0Γs(A))≥cmax{1ε2,∥Γ∥2}(log(1δ)+d), (10)

where and is a universal constant.

## 4 Spectrum of the Covariates Matrix

In this section, we analyze the spectrum of the covariates matrix . Such an analysis is at the core of proof of Theorem 1. Indeed, relating to the spectrum of the matrix is close to what Theorem 1 states. We are actually able to control the entire spectrum of with high probability, and prove:

###### Theorem 2.

Let . Let . Then:

 1∥M∥(1−K2ε)≤sd(X)≤⋯≤s1(X)≤(1+K2ε)1sd(M)

holds with probability at least

 1−2exp(−c1ε21∥M∥2∥Γ∥2+c2d),

for some universal constants .

Theorem 2 is a direct consequence of the following two lemmas. Lemma 1 is a corollary of [19, Ch. 4, Lemma 4.5.6]), and is proved in the appendix. Informally, in this lemma, we may think of

as a tall random matrix and

as a deterministic normalizing matrix. If the latter is chosen appropriately, then one would hope that is an approximate isometry222 A mapping is said to be an isometry if for all . in the sense of Lemma 1, which in turn will provide a two-sided bound on the spectrum of .

###### Lemma 1 (Approximate isometries).

Let , and let be a full rank matrix. Let and assume that

 ∥(XM)⊤XM−Id∥≤max(ε,ε2). (11)

Then the following holds

 1s1(M)(1−ε)≤sd(X)≤⋯≤s1(X)≤(1+ε)1sd(M).

Lemma 2 is the main ingredient of the proof of Theorem 2, but also of that of Theorem 1. The lemma is established by expressing as the supremum of a chaos process, which we can control combining Hanson-Wright inequality and a classical -net argument. In the next two subsections, we formally introduce chaos processes and provide related concentration inequalities; we also present the -net argument used to complete the proof of Lemma 2.

###### Lemma 2.

Let .

 ∥(XM)⊤XM−Id∥>max(ε,ε2)K2

holds with probability at most

 2exp(−c1ε21∥M∥2∥Γ∥2+c2d)

for some positive absolute constants .

### 4.1 Chaos Processes and Hanson-Wright Inequality

In this section, we introduce chaos processes, and provide Hanson-Wright inequality, an instrumental concentration result related to these processes. A chaos

in probability theory refers to a quadratic form

, where is a deterministic matrix in and is a random vector with independent coordinates. If the vector is isotropic, we simply have . The process indexed by a set of matrices is referred to as a chaos process.

In our analysis, we use the following concentration result on chaos, due to Hanson-Wright [20, 21], see  (Theorems 2.1).

###### Theorem 3 (Concentration of anisotropic random vectors).

Let , and

be a random vector with zero-mean, unit-variance, sub-gaussian independent coordinates. Then for all

,

where is an absolute positive constant and .

### 4.2 The ϵ-net argument

The epsilon net argument is a simple, yet powerful tool in the non-asymptotic theory of random matrices and high dimensional probability [22, 19, 23]. Here, we provide an instance of this argument.

###### Lemma 3.

Let be an random matrix, and . Furthermore, let be an -net of with minimal cardinality. Then for all , we have

 P(∥W∥>ρ) ≤(2ϵ+1)dmaxx∈NP(∥Wx∥2>(1−ϵ)ρ)

Lemma 3 exploits a variational form of the operator norm, namely . The proof of this standard lemma can be found for example in [19, Chapter 4].

### 4.3 Proof of Lemma 2

Step 1 ( as the supremum of a chaos process)
Note that we can write . It follows from the fact that is isotropic that

 EX⊤X=Et∑s=1xsx⊤s=t∑s=1s−1∑k=0Ak(Ak)⊤=t−1∑s=0Γs(A).

Noting that is a symmetric positive definite matrix, we may define the inverse of its square root matrix which itself is a symmetric positive definite matrix. We think of the inverse of this matrix as a normalization of . Now, we have:

 ∥ (XM)⊤XM−Id∥=∥(XM)⊤XM−E(XM)⊤XM∥ =supu∈Sd−1∣∣u⊤(XM)⊤XMu−Eu⊤(XM)⊤XMu∣∣ =supu∈Sd−1∣∣∥XMu∥22−E∥XMu∥22∣∣. =supu∈Sd−1∣∣∥σMu⊤Γξ∥22−E∥σMu⊤Γξ∥22∣∣ =supu∈Sd−1∣∣∥σMu⊤Γξ∥22−∥σMu⊤Γ∥2F∣∣

where in the second equality, we used the variational form of the operator norm, where in third equality, the matrix is defined as

 σMu=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣MuOMu⋱OMu⎤⎥ ⎥ ⎥ ⎥ ⎥⎦,

and the last equality follows from the fact that is isotropic. In fact, by definition of , we have for all . We have proved that is the supremum of a chaos process , where the parametrizing matrix is

 W=Γ⊤σMuσMu⊤Γ.

Step 2 (Uniform bound on the chaos) Let , and . Again, recalling that is a zero-mean, subgaussian random vector with independent coordinates, from Hanson-Wright inequality (see Theorem 3), we deduce that :

 ∣∣∥σ⊤MuΓξ∥22−∥σ⊤MuΓ∥2F∣∣>ρ∥σMu⊤Γ∥2F

holds with probability at most

 2exp(−cmin(ρ2K4,ρK2)∥σMu⊤Γ∥2F∥σMu⊤Γ∥2)

for some positive universal constant . Noting that , and , an upper bound on the above probability is

 2exp(−cmin(ρ2K4,ρK2)1∥M∥2∥Γ∥2).

Step 3 (-net argument) Using an -net argument with (see Lemma 3), we obtain that

 ∥(XM)⊤XM−Id∥>ρ

holds with probability at most

 2⋅5dexp(−cmin(ρ2K4,ρK2)14∥M∥2∥Γ∥2).

By choosing , which is equivalent to , we obtain that

 ∥(XM)⊤XM−Id∥>max(ε,ε2)K2

holds with probability at most

 2exp(−c1ε21∥M∥2∥Γ∥2+c2d),

for some positive absolute constants . This completes the proof of Lemma 2.

## 5 Proof of Theorem 1

We first decompose the estimation error as:

 ∥At−A∥≤∥E⊤X((X⊤X)†)12∥∥((X⊤X)†)12∥.

We introduce the matrix , and the event defined as:

 E={∥(XM)⊤XM−Id∥≤12}.

Observe that:

### 5.1 Upper bound of P(¯¯¯E)

We use Lemma 2. Let , which is equivalent to writing . Then choose . With this choice, Lemma 2 implies that

 ∥(XM)⊤XM−Id∥≤12

holds with probability at most

 2exp(−c1min(12K2,14K4)1∥M∥2∥Γ∥2+c2d).

Since is sub-gaussian and isotropic, we have . We conclude that when

 1∥M∥2≥16K4∥Γ∥2c(log(4δ)+nlog(5)). (12)

### 5.2 Upper bound of P({∥E⊤X((X⊤X)†)12∥∥((X⊤X)†)12∥>ε}∩E)

We derive an upper bound on the above probability using a similar technique as in . We first use the event to simplify the condition until we get a quantity that can be analyzed using concentration results on self-normalized processes.

When the event occurs, we have:

 32Id⪰(XM)⊤XM⪰12Id

or equivalently

 32t−1∑s=0Γs(A)=32M−2⪰X⊤X⪰12M−2=12t−1∑s=0Γs(A).

Define and . Note that when event occurs,

 sd(X)≥β= ⎷12λmin(t−1∑s=0Γs(A)).

Thus and we obtain:

 {∥E⊤X((X⊤X)†)12∥∥((X⊤X)†)12∥>ε}∩E⊆{∥E⊤X(X⊤X)−12∥>εβ}∩E.

Next , again when occurs, we have , and thus . We deduce that:

 ∥E⊤X(X⊤X)−12∥≤√2∥E⊤X(X⊤X+S)−12∥.

Hence,

 {∥E⊤X(X⊤X)−12∥>ε}∩E⊆{√2∥E⊤X(X⊤X+S)−12∥>εβ}∩E.

Furthermore, note that under , we have , which implies that for all

 1≥  ⎷log⎛⎜⎝2⋅5d(det(X⊤X+S)S−1)12δ⎞⎟⎠log⎛⎜⎝2⋅5d(det(3S+S)S−1)12δ⎞⎟⎠−1

Now consider the condition

 ε ≥4√2√cKβ  ⎷log⎛⎜⎝2⋅5d(det(3S+S)S−1)12δ⎞⎟⎠−1 =4√2√cKβ√(log(2δ)+dlog(10))−1.

Since , the above condition is equivalent to

 1∥M∥2≥16cK2ε2(log(2δ)+dlog(10)) (13)

We deduce that, under the above condition,

 {√2∥E⊤X(X⊤X+S)−12∥>εβ}∩E⊆⎧⎪ ⎪⎨⎪ ⎪⎩∥E⊤X(X⊤X+S)−12∥>4√cK  ⎷log⎛⎜⎝2⋅5d(det(X⊤X+S)S−1)12δ⎞⎟⎠⎫⎪ ⎪⎬⎪ ⎪⎭.

We are now ready to apply the results presented in the appendix in Corollary 1 for self-normalized processes, and conclude that the event

 ⎧⎪ ⎪⎨⎪ ⎪⎩∥E⊤X(X⊤X+S)−12∥>4√cK  ⎷log⎛⎜⎝2⋅5d(det(X⊤X+S)S−1)12δ⎞⎟⎠⎫⎪ ⎪⎬⎪ ⎪⎭

occurs with probability at most . This implies that:

 P({√2∥E⊤X(X⊤X+S)−12∥>εβ}∩E)≤δ2

as long as condition holds. Hence

 P({∥E⊤X((X⊤X)†)12∥∥((X⊤X)†)12∥>ε}∩E)≤δ2

when (13) holds.

### 5.3 Concluding steps

Introducing

 τ1 =16K4∥Γ∥2c2(log(4δ)+nlog(5)), τ2 =16c1K2ε2(log(2δ)+dlog(10)),

we may re-write Condition (12) as and Condition (13) as . Thus, we have

 P(∥At+1−A∥>ε) ≤P({∥E⊤X((X⊤X)†)12∥∥((X⊤X)†)12∥>ε}∩E)+P(¯¯¯E) ≤δ2+δ2

provided that Finally observe that:

 max(τ1,τ2) ≤16K2max(c1ε2,K2c2∥Γ∥2)(log(4δ)+nlog(10)) ≤Cmax(1ε2,∥Γ∥2)(log(4δ)+nlog(10))

where . We conclude that

 P(∥At+1−A∥>ε)≤δ

holds as long as

 λmin(t∑s=0Γs(A))=1∥M∥2≥Cmax(1ε2,∥Γ∥2)(log(4δ)+nlog(10)).

This completes the proof of Theorem 1.

## References

•  M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht, “Learning without mixing: Towards a sharp analysis of linear system identification,” in Conference On Learning Theory, 2018, pp. 439–473.
•  Y. Jedra and A. Proutiere, “Sample complexity lower bounds for linear system identification,” in IEEE Conference on Decision and Control, 2019.
•  G. C. Goodwin and R. L. Payne, Dynamic system identification: experiment design and data analysis.   Academic press New York, 1977, vol. 136.
•  L. Ljung, “Consistency of the least-squares identification method,” IEEE Transactions on Automatic Control, vol. 21, no. 5, pp. 779–781, October 1976.
•  L. Ljung, “On the consistency of prediction error identification methods,” in System Identification Advances and Case Studies, ser. Mathematics in Science and Engineering, R. K. Mehra and D. G. Lainiotis, Eds.   Elsevier, 1976, vol. 126, pp. 121 – 164. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0076539208608711
•  A. Rantzer, “Concentration bounds for single parameter adaptive control,” in 2018 Annual American Control Conference (ACC).   IEEE, 2018, pp. 1862–1866.
•  M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Finite time identification in unstable linear systems,” Automatica, vol. 96, pp. 342–353, 2018.
•  T. Sarkar and A. Rakhlin, “How fast can linear dynamical systems be learned?” CoRR, vol. abs/1812.01251, 2018. [Online]. Available: http://arxiv.org/abs/1812.01251
•  S. Oymak and N. Ozay, “Non-asymptotic identification of LTI systems from a single trajectory,” CoRR, vol. abs/1806.05722, 2018. [Online]. Available: http://arxiv.org/abs/1806.05722
•  T. Sarkar, A. Rakhlin, and M. A. Dahleh, “Finite-time system identification for partially observed LTI systems of unknown order,” CoRR, vol. abs/1902.01848, 2019. [Online]. Available: http://arxiv.org/abs/1902.01848
•  S. Mendelson, “Learning without concentration,” in Proceedings of The 27th Conference on Learning Theory

, ser. Proceedings of Machine Learning Research, M. F. Balcan, V. Feldman, and C. Szepesvári, Eds., vol. 35.   Barcelona, Spain: PMLR, 13–15 Jun 2014, pp. 25–39. [Online]. Available:

http://proceedings.mlr.press/v35/mendelson14.html
•  R. Vershynin,

Introduction to the non-asymptotic analysis of random matrices

.   Cambridge University Press, 2012, p. 210–268.
•  V. H. Peña, T. L. Lai, and Q.-M. Shao, Self-normalized processes: Limit theory and Statistical Applications.   Springer Science & Business Media, 2008.
•  F. Krahmer, S. Mendelson, and H. Rauhut, “Suprema of chaos processes and the restricted isometry property,” Communications on Pure and Applied Mathematics, vol. 67, no. 11, pp. 1877–1904, 2014.
•  M. Rudelson, R. Vershynin et al., “Hanson-wright inequality and sub-gaussian concentration,” Electronic Communications in Probability, vol. 18, 2013.
•  M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Finite time identification in unstable linear systems,” Automatica, vol. 96, pp. 342 – 353, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0005109818303546
•  T. Sarkar and A. Rakhlin, “Near optimal finite time identification of arbitrary linear dynamical systems,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97.   Long Beach, California, USA: PMLR, 09–15 Jun 2019, pp. 5610–5618. [Online]. Available: http://proceedings.mlr.press/v97/sarkar19a.html
•  S. Mendelson, “Learning without concentration,” in Conference on Learning Theory, 2014, pp. 25–39.
•  R. Vershynin,

High-Dimensional Probability: An Introduction with Applications in Data Science

, ser. Cambridge Series in Statistical and Probabilistic Mathematics.   Cambridge University Press, 2018.
•  D. L. Hanson and F. T. Wright, “A bound on tail probabilities for quadratic forms in independent random variables,” The Annals of Mathematical Statistics, vol. 42, no. 3, pp. 1079–1083, 1971.
•  F. T. Wright, “A bound on tail probabilities for quadratic forms in independent random variables whose distributions are not necessarily symmetric,” Ann. Probab., vol. 1, no. 6, pp. 1068–1070, 12 1973. [Online]. Available: https://doi.org/10.1214/aop/1176996815
•  A. Litvak, A. Pajor, M. Rudelson, and N. Tomczak-Jaegermann, “Smallest singular value of random matrices and geometry of random polytopes,” Advances in Mathematics, vol. 195, no. 2, pp. 491 – 523, 2005. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0001870804002750
•  T. Tao, Topics in random matrix theory.   American Mathematical Soc., 2012, vol. 132.
•  A. Böttcher and B. Silbermann, Introduction to large truncated Toeplitz matrices.   Springer Science & Business Media, 2012.
•  N. Young, “The rate of convergence of a matrix power series,” Linear Algebra and its Applications, vol. 35, pp. 261 – 278, 1981. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0024379581902780
•  Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” in Advances in Neural Information Processing Systems, 2011, pp. 2312–2320.
•  V. H. Peña, T. L. Lai, and Q.-M. Shao, Self-normalized processes: Limit theory and Statistical Applications.   Springer Science & Business Media, 2008.

## Appendix A Proof of Lemma 1

###### Proof.

Observe that

 ∥(XM)⊤XM−Id∥=supu∈Sd−1∣∣∥XMu∥22−1∣∣.

Using the fact that for all , we obtain for all

 max(|∥XMu∥2−1|,|∥XMu∥2−1|2)≤max(ε,ε2).

By the monotonicity of for , we deduce that:

 |∥XMu∥2−1|≤ε,

from which we conclude

 1−ε≤∥XMu∥2≤1+ε.

It follows immediately that:

 1−ε≤sd(XM)≤⋯≤s1(XM)≤1+ε.

The proof is completed observing that:

 sd(XM)≤sd(X)s1(M)ands1(X)sd(M)≤s1(XM).

## Appendix B Properties of Γ

###### Lemma 4.

Consider the block Toeplitz matrix as defined earlier, and assume that . Then, there exists a positive constant that only depends on , such that for all

 ∥Γ∥≤J(A).
###### Proof.

From the theory of Toeplitz matrices (see ), we have

 ∥Γ∥≤supx∈[0,1]∥∥ ∥∥t−1∑s=0Ase2πisx∥∥ ∥∥.

The argument behind the above inequality consists of noting that is a submatrix of the infinite banded block Toeplitz matrix defined as

 T(A)=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣IdA⋱⋮⋱IdAt−1A⋱⋱⋮⋱At−1⋱⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦,

thus . It follows immediately from the results in [24, chapter 6] that

 ∥T(A)∥=supx∈[0,1]∥∥ ∥∥t−1∑s=0Ase2πisx∥∥ ∥∥.

Next, note that:

 supx∈[0,1]∥∥ ∥∥t−1∑s=0Ase2πisx∥∥ ∥∥ ≤supx∈[0,1]t−1∑s=0∥∥Ase2πisx∥∥ ≤t−1∑s=0∥As∥

Classic results on convergence of Matrix power series ensure convergence