Finite Sample Analysis of Stochastic System Identification

by   Anastasios Tsiamis, et al.
University of Pennsylvania

In this paper, we analyze the finite sample complexity of stochastic system identification using modern tools from machine learning and statistics. An unknown discrete-time linear system evolves over time under Gaussian noise without external inputs. The objective is to recover the system parameters as well as the Kalman filter gain, given a single trajectory of output measurements over a finite horizon of length N. Based on a subspace identification algorithm and a finite number of N output samples, we provide non-asymptotic high-probability upper bounds for the system parameter estimation errors. Our analysis uses recent results from random matrix theory, self-normalized martingales and SVD robustness, in order to show that with high probability the estimation errors decrease with a rate of 1/√(N). Our non-asymptotic bounds not only agree with classical asymptotic results, but are also valid even when the system is marginally stable.


page 1

page 2

page 3

page 4


Finite sample performance of linear least squares estimation

Linear Least Squares is a very well known technique for parameter estima...

Non-asymptotic Identification of LTI Systems from a Single Trajectory

We consider the problem of learning a realization for a linear time-inva...

Parameter Estimation Bounds Based on the Theory of Spectral Lines

Recent methods in the machine learning literature have proposed a Gaussi...

Sample Complexity of Kalman Filtering for Unknown Systems

In this paper, we consider the task of designing a Kalman Filter (KF) fo...

Finite sample deviation and variance bounds for first order autoregressive processes

In this paper, we study finite-sample properties of the least squares es...

Efficient learning of hidden state LTI state space models of unknown order

The aim of this paper is to address two related estimation problems aris...

Finite Time Adaptive Stabilization of LQ Systems

Stabilization of linear systems with unknown dynamics is a canonical pro...

1 Introduction

Identifying predictive models from data has been a fundamental problem across several fields, from classical control theory to economics and modern machine learning. System identification, in particular, has a long history of studying this problem from a control theoretic perspective [1]. Identifying linear state-space models:


from input-output data has been the focus of time-domain identification. In fact, some identification algorithms can not only learn the system matrices in (1) but also the Kalman filter required for state estimation [2].

Most identification methods for linear systems either follow the prediction error approach [3] or the subspace method [2, 4]. The prediction error approach is usually non-convex and directly searches over the system parameters

by minimizing a prediction error cost. The subspace approach is a convex one; first, Hankel matrices of the system are estimated, then, the parameters are realized via steps involving singular value decomposition (SVD). Methods inspired by machine learning have also also been employed 

[5]. In this paper, we focus on the subspace identification approach–see [6] for an overview. The convergence properties of subspace algorithms have been studied before in [7, 8, 9, 10, 11, 12]; the analysis relies on the assumption of asymptotic stability (spectral radius ) and is limited to asymptotic results. In [7, 8] it is shown that the identification error can decrease as fast as up to logarithmic factors, as the number of output data grows to infinity. While asymptotic results have been established, a finite sample analysis of subspace algorithms remains an open problem [2]. Another open question is whether the asymptotic stability condition can be relaxed to marginal stability .

From a machine learning perspective, finite sample analysis has been a standard tool for comparing algorithms in the non-asymptotic regime. Early work in finite sample analysis of system identification can be found in [13, 14, 15]. A series of recent papers [16, 17, 18] studied the finite sample properties of system identification from a single trajectory, when the system state is fully observed (). Finite sample results for partially observed systems (), which is a more challenging problem, appeared recently in [19, 20, 21]. These papers provide a non-asymptotic convergence rate of for the recovery of matrices up to a similarity transformation. The results rely on the assumption that that the system can be driven by external inputs, i.e. . In [19], the analysis of the classical Ho-Kalman realization algorithm was explored. In [20], it was shown that with a prefiltering step, consistency can be achieved even for marginally stable systems where . In [21], identification of a system of unknown order is considered. Finite sample properties of system identification algorithms have also been used to design controllers [22]. The dual problem of Kalman filtering has not been studied yet in this context.

In this paper, we perform the first finite sample analysis of system (1) in the case , when we have no inputs, also known as stochastic system identification (SSI) [2]. This problem is more challenging than the case , since the system can only be driven through noise and establishing persistence of excitation is harder. We provide the first non-asymptotic guarantees for the estimation of matrices as well as the Kalman filter gain of (1). Similar to [18, 19], the analysis is based on new tools from machine learning and statistics [23, 24, 25]. In summary the main contributions of this paper are:

  • To the best of our knowledge, our paper provides the first finite sample upper bounds in the case of stochastic system identification, where we have no inputs and the system is only driven by noise. We also provide the first finite sample guarantees for the estimation error of the Kalman filter gain.

  • We prove that the outputs of the system satisfy persistence of excitation in finite time with high probability. This result is fundamental for the analysis of most subspace identification algorithms, which use outputs as regressors.

  • We show that we can achieve a learning rate of up to logarithmic factors for marginally stable systems. To the best of our knowledge, the classical subspace identification results do not offer guarantees in the case of marginal stability where . For stable systems (), this rate is consistent with classical asymptotic results [7].

All proofs are included in the Appendix.

2 Problem formulation

Consider the standard state space representation (1) with , where is the system state, is the output, is the system matrix, is the output matrix, is the process noise and is the measurement noise. The noises are assumed to be i.i.d. zero mean Gaussian, with covariance matrices and respectively, and independent of each other. The initial state is also assumed to be zero mean Gaussian, independent of the noises, with covariance . Matrices , , , , are initially unknown. However, the following assumption holds throughout the paper.

Assumption 1.

The order of the system is known. System is marginally stable: , where denotes the spectral radius. The pair is observable, is controllable and is strictly positive definite.

We leave the case of unknown for future work 111The results of Section 4 do not depend on the order being known.. The assumption is more general than the stricter condition found in previous works, see [7, 8, 9, 10, 11, 12]. The remaining conditions in Assumption 1 are standard for the stochastic system identification problem to be well posed and the Kalman filter to converge.

The steady-state Kalman filter of system (1) is :


where is the steady-state Kalman gain


and is the positive definite solution of the Riccati equation:

A byproduct of Assumption 1 is that the closed-loop matrix

has all the eigenvalues strictly inside the unit circle 

[26]. The state of the Kalman filter is the minimum mean square error prediction:


We denote its covariance matrix by:


The innovation error sequence has covariance


Since the original errors are Gaussian i.i.d., by the orthogonality principle the innovation error sequence is also Gaussian and i.i.d. The later property is true since we also assumed that the Kalman filter is in steady-state.

Assumption 2.

We assume that , so that the Kalman filter (2) has converged to its steady-state.

Since the Kalman filter converges exponentially fast to the steady-state gain, this assumption is reasonable in many situations; it is also standard [7, 12]. Nonetheless, we leave the case of general for future work.

In the classical stochastic subspace identification problem, the main goal is to identify the Kalman filter parameters from output samples –see for example Chapter 3 of [2]. The problem is ill-posed in general since the outputs are invariant under any similarity transformation , , . Thus, we can only estimate up to a similarity transformation.

In this paper, we will analyze the finite sample properties of a subspace identification algorithm, which is based on least squares.

Problem 1 (Finite Sample Analysis of SSI).

Consider a finite number of output samples , which follow model (1) with , and an algorithm , which returns estimates of the true parameters. Given a confidence level provide upper bounds , ,

and an invertible matrix

such that with probability at least :


where denotes the spectral norm. The bounds can also depend on the model parameters as well as the identification algorithm used.

3 Subspace Identification Algorithm

The procedure of estimating the parameters is based on a least square approach, see for example [7, 12]. It involves two stages. First, we regress future outputs to past outputs to obtain a Hankel-like matrix, which is a product of an observability and a controllability matrix. Second, we perform a balanced realization step, similar to the Ho-Kalman algorithm, to obtain estimates for .

Before describing the algorithm, we need some definitions. Let , with be two design parameters that define the horizons of the past and the future respectively. Assume that the total number of output samples is . Then, the future outputs and past outputs at time are defined as follows:


By stacking the outputs for all sample sequences, over all times , we form the batch outputs:

The past and future noises are defined similarly. Finally, define the batch states:

The (extended) observability matrix and the reversed (extended) controllability matrix associated to system (2) are defined as:


respectively. We denote the Hankel(-like) matrix by:


Finally, for any , define block-Toeplitz matrix:


3.1 Regression for Hankel Matrix Estimation

First, we establish a linear regression between the future and past outputs. This step is common in most (stochastic) subspace identification algorithms. From (

2), for every :

Meanwhile, from (2), the state prediction can be expressed in terms of the past outputs:

After some algebra, we derive the linear regression:


where the regressors and the residuals are independent column-wise. The term introduces a bias due to the Kalman filter truncation, where we use only past outputs instead of all of them. Based on (13), we compute the least squares estimate


The Hankel matrix can be interpreted as a (truncated) Kalman filter which predicts future outputs directly from past outputs, independently of the internal state-space representation [2]. In this sense, the estimate is a “data-driven" Kalman filter. Notice that persistence of excitation of the outputs (invertibility of ) is required in order to compute the least squares estimate .

3.2 Balanced Realization

This step determines a balanced realization of the state-space, which is only one of the possibly infinite state-space representations–see Section 6 for comparison with other subspace methods. First, we compute a rank- factorization of the full rank matrix . Let the SVD of be:


where contains the largest singular values. Then, a standard realization of , is:


This step assumed knowing the order of the system, see Assumption 1. In addition, matrix should have full rank . This is equivalent to the pair being controllable. Otherwise, will have rank less than making it impossible to accurately estimate .

Assumption 3.

The pair is controllable.

The above assumption is standard–see for example [12].

Based on the estimated observability/controllability matrices, we can approximate the system parameters as follows:

where the notation means we pick the first rows and all columns. The notation for has similar interpretation. For simplicity, define

which includes the “upper" rows of matrix . Similarly, we define the lower part . For matrix we exploit the structure of the extended observability matrix and solve in the least squares sense by computing

where denotes the pseudoinverse.

Now that we have a stochastic system identification algorithm, our goal is to perform finite sample analysis, which is divided in two parts. First, in Section 4, we provide high probability upper bounds for the error in the regression step. Then, in Section 5, we analyze the robustness of the balanced realization step.

4 Finite Sample Analysis of Regression

In this section, we provide the finite sample analysis of the linear regression step of the identification algorithm. We provide high-probability upper bounds for estimation error of the Hankel-like matrix . Before we state the main result, let us introduce some notation. Recall the definition of the innovation covariance matrix in (6). We denote the past noises’ weighted covariance by:


The least singular value of the above matrix is denoted by:


In Lemma B.2 in the Appendix, we show that .

Theorem 1 (Regression Step Analysis).

Consider system (2) under the Assumptions 123. Let be the estimate (14) of the subspace identification algorithm given an output trajectory and let be as in (11). Fix a confidence and define:


There exist such that if , (see definitions (B.1), (B.5), (C.3) in the Appendix), then with probability at least :




over-approximates the condition number of and


are system-dependent constants.

The final result in (20) has been simplified for compactness–see (C.4) for the full expression in the Appendix.

Remark 1 (Result interpretation).

From (13), (14) the estimation error consists of two terms:


The first term in (20) corresponds to the cross-term error, while the second term corresponds to the Kalman filter truncation bias term. To obtain consistency for , we have to let the term go to zero with . Recall that the matrix has spectral radius less than one, thus, the second term decreases exponentially with . By selecting , for some , we can force the Kalman truncation error term to decrease at least as fast as the first one, see for example [7]. In this sense, the dominant term is the first one, i.e. the cross-term. Notice, that can be kept bounded as long as it is larger than . Notice that the norm remains bounded by as increases–see Lemma E.3 in Appendix.

Remark 2 (Statistical rates).

For marginally stable systems () and , we have , since depend at most polynomially on –see Corollary E.1 in the Appendix. In this case,  (20) results in a rate of:

To the best of our knowledge, there have not been any bounds for the performance of subspace algorithms in the case of marginally stable systems.

In the case of asymptotically stable systems (), we have , since are now –see Corollary E.1 in the Appendix. Hence, if , we obtain a rate of:

As a result, our finite sample bound (20) is consistent with the asymptotic bound in equation (14) of [7] as grows to infinity.

In the absence of inputs (), the noise both helps and obstructs identification. Larger noise leads to better excitation of the outputs, but also worsens the convergence of the least squares estimator. To see how our finite sample bounds capture that, observe that larger noise leads to bigger but also bigger . This trade-off is captured by .

If is sufficiently large (condition ), the outputs are guaranteed to be persistently exciting in finite time; more details can be found in Section 4.1 and the Appendix. Meanwhile, condition is not necessary; it just leads to a simplified expression for the bound of the Kalman filter truncation error–see Section 4.3 and Appendix. The definitions of can be found in (B.1), (B.5), (C.3). Their existence is guaranteed even if varies slowly with , i.e. logarithmically.

Obtaining the bound on the error in (20) of Theorem 1 requires the following three steps:

  1. Proving persistence of excitation (PE) for the past outputs, i.e. invertibility of .

  2. Establishing bounds for the cross-term error in (23).

  3. Establishing bounds for the the Kalman truncation error in (23).

In the following subsections, we sketch the proof steps.

4.1 Persistence of Excitation in Finite Time

The next theorem shows that with high probability the past outputs and noises are persistently exciting in finite time. The result is of independent interest and is fundamental since most subspace algorithms use past outputs as regressors.

Theorem 2 (Persistence of Excitation).

Consider the conditions of Theorem 1 and , as in (B.1), (B.5). If , then with probability at least both of the following events occur:


where denotes comparison in the positive semidefinite cone. Hence, with probability at least the outputs satisfy the PE condition:

where is defined in (18).

The above result implies that if the past noises satisfy a PE condition, then PE for the outputs is also guaranteed; the noises are the only way to excite the system in the absence of control inputs. The prove PE, from (2), the past outputs satisfy:

Thus, their correlations are


We can first show PE for the noise correlations , i.e. show that the event occurs with high probability when is sufficiently large (condition ). This behavior is due to the fact that and the sequence is component-wise i.i.d. To prove this step, we use Lemma C.2 from [19]–see Lemma B.1 in the Appendix.

Meanwhile, the cross terms are much smaller and their norm increases with a rate of at most up to logarithmic terms. This is since and the product has martingale structure (see Appendix and Theorem 3 below). Eventually, if the number of samples is large enough (condition ), the cross-terms will be dominated by the noise and state correlations with high probability, which establishes output PE.

4.2 Cross-term error

To bound the cross-term error, we express it as a product of and , as in [18]. The second term of the product can be bounded by applying Theorem 2. The first term is self-normalized and has martingale structure component-wise. In particular, the product is equal to:

where every sum above is a martingale. To bound it, we apply the next theorem, which generalizes Theorem 1 in [24] and Proposition 8.2 in [18].

Theorem 3 (Cross terms).

Let be a filtration. Let , be -measurable, independent of . Suppose also that has independent components , which are sub-Gaussian:

Let , be measurable. Assume that is a positive definite matrix. For any , define:


for some integer . Then, for any , with probability at least , for all

The above theorem along with a Markov upper bound on (see Lemma B.3 in the Appendix) are used to bound .

4.3 Kalman truncation error

For the Kalman truncation error term, we need to bound the term , which is . Using the identities , and , we derive the following equality:


From Theorem 2, we obtain . The last term in (27) can be treated like the cross-term in Section 4.2, by applying Theorems 23 and Lemma B.3. It decreases with a rate of up to logarithmic terms, so it is much smaller than the other terms in (27). To keep the final bound simple, we select such that


with high probability–see also (C.3) for the definition of .

5 Robustness of Balanced Realization

In this section, we analyze the robustness of the balanced realization. In particular, we upper bound the estimation errors of matrices in terms of the estimation error obtained by Theorem 1.

Assume that we knew exactly. Then, the SVD of the true , would be:

for some . Hence, if we knew exactly, the output of the balanced realization would be:


The respective matrices are defined similarly, based on , as described in Section 3. The system matrices are equivalent to the original matrices up to a similarity transformation , , for some invertible . For simplicity, we will quantify the estimation errors in terms of the similar instead of the original .

The next result follows the steps of [19] and relies on Lemma 5.14 of [25] and Theorem 4.1 of [27]. Let denote the th largest singular value.

Theorem 4 (Realization robustness).

Consider the true Hankel-like matrix defined in (11) and the noisy estimate defined in (14). Let , be the output of the balanced realization algorithm based on . Let , be the output of the balanced realization algorithm based on the true . If has rank and the following robustness condition is satisfied:


there exists an orthonormal matrix such that:


The notation refers to the upper part of the respective matrix (first rows)–see Section 3.2.

Remark 3.

The result states that if the error of the regression step is small enough, then the realization is robust. The singular value can be quite small. Hence, the robustness condition (30) can be quite restrictive in practice. However, such a condition is a fundamental limitation of the SVD procedure. The gap between the singular values of and the singular values coming from the noise

should be large enough; this guarantees that the singular vectors related to small singular values of

are separated from the singular vectors coming from the noise , which can be arbitrary. See also Wedin’s theorem [28]. Such robustness conditions have also appeared in model reduction theory [29].

The term which appears in the bound of is . The value of is random since it depends on . However, we could replace it by a deterministic bound. From:

will eventually be lower bounded by if the error is small enough. The norm is bounded as increases, since is asymptotically stable and is finite.

Remark 4 (Total bounds).

The final upper bounds for the estimation of the system parameters , as stated in Problem 1, can be found by combining the finite sample guarantees of the regression step (Theorem 1) with the robustness analysis of the realization step (Theorem 4). All matrix estimation errors depend linearly on the Hankel matrix estimation error . As a result, all matrix errors have the same statistical rate as the error of , i.e. their estimation error decreases at least as fast as up to logarithmic factors.

6 Discussion and Future Work

One of the main differences between the subspace algorithm considered in this paper and other stochastic subspace identification algorithms is the SVD step. The other algorithms perform SVD on instead of , where are full rank weighting matrices, possibly data dependent [30, 3, 2]. From this point of view, the results of Section 4 (upper bound for in Theorem 1 and persistence of excitation in Theorem 2) are fundamental for understanding the finite sample properties of other subspace identification algorithms. Here, we studied the case , which is not the standard choice [12]. It is subject of future work to explore how the choice of affects the realization step, especially the robustness condition of the SVD step.

We could also provide finite sample bounds for the estimation of the closed-loop matrix . This can be done in two ways. One option is to form matrix from the estimates . Alternatively, we could estimate directly from in the same way that we estimated . However, we do not have finite sample guarantees for the stability of the closed-loop estimates , . It would be interesting to address this in future work.

Another direction for future work is repeating the analysis when the Kalman filter has not reached steady-state, i.e. relax Assumption 2. Finally, in this work, we only considered upper bounds. It would be interesting to study lower bounds as well to evaluate the tightness of our upper bounds. In any case, from lower bounds for fully observed systems [17], the factor of is tight.


  • [1] L. Ljung, “Perspectives on System identification,” Annual Reviews in Control, vol. 34, no. 1, pp. 1–12, 2010.
  • [2] P. Van Overschee and B. De Moor, Subspace identification for linear systems: Theory–Implementation–Applications.   Springer Science & Business Media, 2012.
  • [3] L. Ljung, System Identification: Theory for the User.   Prentice Hall, 1999.
  • [4] M. Verhaegen and V. Verdult, Filtering and system identification: a least squares approach.   Cambridge university press, 2007.
  • [5] A. Chiuso and G. Pillonetto, “System identification: A machine learning perspective,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, no. 1, 2019.
  • [6] S. J. Qin, “An overview of subspace identification,” Computers & chemical engineering, vol. 30, no. 10-12, pp. 1502–1513, 2006.
  • [7] M. Deistler, K. Peternell, and W. Scherrer, “Consistency and relative efficiency of subspace methods,” Automatica, vol. 31, no. 12, pp. 1865–1875, 1995.
  • [8] K. Peternell, W. Scherrer, and M. Deistler, “Statistical analysis of novel subspace identification methods,” Signal Processing, vol. 52, no. 2, pp. 161–177, 1996.
  • [9] M. Viberg, B. Wahlberg, and B. Ottersten, “Analysis of state space system identification methods based on instrumental variables and subspace fitting,” Automatica, vol. 33, no. 9, pp. 1603–1616, 1997.
  • [10] M. Jansson and B. Wahlberg, “On consistency of subspace methods for system identification,” Automatica, vol. 34, no. 12, pp. 1507–1519, 1998.
  • [11] D. Bauer, M. Deistler, and W. Scherrer, “Consistency and asymptotic normality of some subspace algorithms for systems without observed inputs,” Automatica, vol. 35, no. 7, pp. 1243–1254, 1999.
  • [12] T. Knudsen, “Consistency analysis of subspace identification methods based on a linear regression approach,” Automatica, vol. 37, no. 1, pp. 81–89, 2001.
  • [13] E. Weyer, R. C. Williamson, and I. M. Mareels, “Finite sample properties of linear model identification,” IEEE Transactions on Automatic Control, vol. 44, no. 7, pp. 1370–1383, 1999.
  • [14] M. C. Campi and E. Weyer, “Finite sample properties of system identification methods,” IEEE Transactions on Automatic Control, vol. 47, no. 8, pp. 1329–1334, 2002.
  • [15] M. Vidyasagar and R. L. Karandikar, “A learning theory approach to system identification and stochastic adaptive control,” Journal of Process Control, vol. 18, no. 3-4, pp. 421–430, 2008.
  • [16] M. K. S. Faradonbeh, A. Tewari, and G. Michailidis, “Finite time identification in unstable linear systems,” Automatica, vol. 96, pp. 342–353, 2018.
  • [17] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht, “Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification,” arXiv preprint arXiv:1802.08334, 2018.
  • [18] T. Sarkar and A. Rakhlin, “Near optimal finite time identification of arbitrary linear dynamical systems,” arXiv preprint arXiv:1812.01251, 2018.
  • [19] S. Oymak and N. Ozay, “Non-asymptotic Identification of LTI Systems from a Single Trajectory,” arXiv preprint arXiv:1806.05722, 2018.
  • [20] M. Simchowitz, R. Boczar, and B. Recht, “Learning Linear Dynamical Systems with Semi-Parametric Least Squares,” arXiv preprint arXiv:1902.00768, 2019.
  • [21] T. Sarkar, A. Rakhlin, and M. A. Dahleh, “Finite-Time System Identification for Partially Observed LTI Systems of Unknown Order,” arXiv preprint arXiv:1902.01848, 2019.
  • [22] S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,” arXiv preprint arXiv:1710.01688, 2017.
  • [23] R. Vershynin,

    High-dimensional probability: An introduction with applications in data science

    .   Cambridge University Press, 2018, vol. 47.
  • [24] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” in Advances in Neural Information Processing Systems, 2011, pp. 2312–2320.
  • [25] S. Tu, R. Boczar, M. Simchowitz, M. Soltanolkotabi, and B. Recht, “Low-rank solutions of linear matrix equations via procrustes flow,” in International Conference on Machine Learning, 2016, pp. 964–973.
  • [26] B. Anderson and J. Moore, Optimal Filtering.   Dover Publications, 2005.
  • [27] P.-Å. Wedin, “Perturbation theory for pseudo-inverses,” BIT Numerical Mathematics, vol. 13, no. 2, pp. 217–232, 1973.
  • [28] ——, “Perturbation bounds in connection with singular value decomposition,” BIT Numerical Mathematics, vol. 12, no. 1, pp. 99–111, 1972.
  • [29] L. Pernebo and L. Silverman, “Model reduction via balanced state space representations,” IEEE Transactions on Automatic Control, vol. 27, no. 2, pp. 382–387, 1982.
  • [30] P. Van Overschee and B. De Moor, “A Unifying Theorem for Three Subspace System Identification Algorithms,” Automatica, vol. 31, no. 12, pp. 1853–1864, 1995.
  • [31] F. Krahmer, S. Mendelson, and H. Rauhut, “Suprema of chaos processes and the restricted isometry property,” Communications on Pure and Applied Mathematics, vol. 67, no. 11, pp. 1877–1904, 2014.
  • [32] R. A. Horn and C. R. Johnson, Matrix analysis, 2nd ed.   Cambridge University Press, 2012.
  • [33] ——, Topics in Matrix Analysis.   Cambridge University Press, 1994.

Appendix A Proof of Theorem 3

Let us first state a result which follows from the arguments of Chapter 4 of [23]. See also Proposition 8.1 of [18].

Proposition A.1 ([23]).

Consider a matrix . Let denote the unit sphere. Then for any :

Second, we state Theorem 1 of [24], which upper bounds self-normalized martingales.

Theorem A.1 (Theorem 1 in [24]).

Let be a filtration. Let , be real-valued, -measurable, and conditionally sub-Gaussian:


Let , be vector valued and measurable. Assume that is a positive definite matrix. For any , define:

Then, for any , with probability at least , for all


Finally, we state a standard linear algebra result. We include the proof for completeness.

Lemma A.1 (Block Matrix Norm).


for matrices of appropriate dimensions. Then:


Consider a vector such that is defined. Then, from triangle inequality, the definition of matrix norm and Cauchy-Schwartz:

where we used that . ∎

Now, we can prove Theorem 3. Notice that:

We can analyze each component separately and apply a union bound afterwards, since by Lemma A.1:


Now, fix a and let

Consider an arbitrary element of the unit sphere . The scalar is conditionally -sub-Gaussian and satisfies the conditions of Theorem A.1. Thus, with probability at least :

Now, we apply Proposition A.1 for :

Finally, by (A.3) and a union bound over all components: