Finite sample deviation and variance bounds for first order autoregressive processes

10/17/2019 · by Rodrigo A. González, et al.

In this paper, we study finite-sample properties of the least squares estimator in first order autoregressive processes. By leveraging a result from decoupling theory, we derive upper bounds on the probability that the estimate deviates by at least a positive ε from its true value. Our results consider both stable and unstable processes. Afterwards, we obtain problem-dependent non-asymptotic bounds on the variance of this estimator, valid for sample sizes greater than or equal to seven. Via simulations we analyze the conservatism of our bounds, and show that they reliably capture the true behavior of the quantities of interest.


1 Introduction

Identification of autoregressive (AR) processes is a well-developed research topic, which is deeply connected with the areas of probability and statistics, time series analysis, econometrics, system identification, signal processing, machine learning and control.

In AR processes, estimation of parameters given data is commonly done by the least squares method, which is known to have good statistical performance and is related to maximum likelihood in a Gaussian noise framework. Consequently, the asymptotic properties of this method for AR processes have been studied thoroughly, dating back at least to [1]. In general regression models, a detailed consistency analysis of the least squares estimate was given in [2].

Despite the popularity of least squares, non-asymptotic performance bounds for this method are rare in the literature. Finite-time bounds are important for computing the number of samples needed to achieve a specified accuracy, deriving finite-time confidence sets, and designing robust control schemes. Most of the difficulties in obtaining these bounds lie in the limitations of classical statistical techniques, which are better suited for asymptotic analyses.

Recently, results in statistical learning theory, self-normalized processes [3], and high-dimensional probability [4] have motivated the control, signal processing, and machine learning communities to explore finite-time properties of least squares estimates of linear systems. Among topics of interest, we can find sample complexity bounds [5], probability bounds on parameter errors [6], confidence bounds [7, Chap. 20], Cramér-Rao lower bounds [8], and bounds on quadratic criterion cost deviation [9, 10]. General autoregressive structures often appear in these works, but the powerful statistical machinery used usually leads to a loss of insight in scalar cases, where specialized theory could serve better. A problem-dependent sharp analysis of AR(1) processes is therefore needed: it is a natural stepping stone towards studying finite-time properties of more general model structures such as ARX and ARMAX models, and can also be useful for analyzing finite-time properties of two-stage ARMA estimation algorithms [11].

In summary, the main results of this paper are:

  • For both stable and unstable single-parameter scalar autoregressive processes, we derive upper bounds on the deviation probability of the least-squares estimate;

  • We provide interpretable upper bounds on the variance of this estimated parameter for both aforementioned scenarios;

  • We show that our variance bound for the stable case matches the well-known asymptotic $1/N$ decay; and for unstable systems, the variance bound decays exponentially in the number of data points, with rate of decay proportional to the magnitude of the unstable pole;

  • We illustrate the applicability of these bounds via Monte Carlo simulations, and show that they manifest the correct qualitative behavior of deviation probability and variance.

The rest of this paper is organized as follows. We put our work into context in Section 2, and explicitly formulate the problem in Section 3. Preliminaries from decoupling theory are introduced in Section 4. After this, we state and prove our non-asymptotic bounds for stable and unstable processes in Sections 5 and 6 respectively. Simulations and conclusions are presented in Sections 7 and 8.

2 Relation to Prior Work

In signal processing, Cramér-Rao lower bounds for a finite number of measurements were studied in [8]. In a time-series context, [12, 13] studied large deviation rates of the least squares estimator in an AR(1) process. In adaptive control, a bound on the estimation error of the parameter of a scalar first order system was given in [14], together with an upper bound on the variance of the estimate. Although similar to our work, these bounds do not depend on the true system parameter. This leads to a) conservatism, as usually the worst-case scenario is taken into account, and b) lack of insight into the effect of stability margins on the accuracy of the estimate. A finite-time deviation bound that is also problem-independent was given in [15].

In a broader context, one of the first non-asymptotic results in system identification was presented in [10], where a uniform bound for the difference between empirical and theoretical identification costs was obtained. More recently, works that have analyzed finite-time identification of stochastic processes include [6, 16, 17, 18]. These papers have focused on vector autoregressive models described by matrix parameters in the form of state-space realizations, which do not reduce to interpretable and tight bounds for scalar cases. Only [16] studied a scalar system separately, and derived probability error guarantees depending on the true parameter value. However, as opposed to our bounds, these results are rather difficult to interpret, and no variance bound is obtained.

3 Problem formulation

Consider the AR(1) process

$$y_t = \theta y_{t-1} + w_t, \qquad t = 1, 2, \dots, \tag{1}$$

where $\{w_t\}$ is a Gaussian white noise process of variance $\sigma^2$, and $\theta$ is unknown. Given data points $y_0, y_1, \dots, y_N$, the least squares estimate of $\theta$ is

$$\hat{\theta}_N = \frac{\sum_{t=1}^{N} y_{t-1} y_t}{\sum_{t=1}^{N} y_{t-1}^2}. \tag{2}$$
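For concreteness, the following minimal Python sketch (ours, not from the paper; names and parameter values are illustrative) simulates (1) and evaluates the estimator (2):

```python
import numpy as np

def simulate_ar1(theta, sigma, N, rng, stationary=False):
    """Simulate y_0, ..., y_N from y_t = theta * y_{t-1} + w_t."""
    y = np.zeros(N + 1)
    if stationary and abs(theta) < 1:
        # stationary initialization, only meaningful for |theta| < 1
        y[0] = rng.normal(0.0, sigma / np.sqrt(1 - theta**2))
    w = rng.normal(0.0, sigma, size=N)
    for t in range(1, N + 1):
        y[t] = theta * y[t - 1] + w[t - 1]
    return y

def ls_estimate(y):
    """Least squares estimate (2): sum_t y_{t-1} y_t / sum_t y_{t-1}^2."""
    return np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])

rng = np.random.default_rng(0)
y = simulate_ar1(theta=0.5, sigma=1.0, N=500, rng=rng, stationary=True)
print(ls_estimate(y))  # close to 0.5 at this sample size
```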

It is challenging to derive an analytical expression for the distribution of the random variable $\hat{\theta}_N - \theta$. Therefore, instead of exact expressions, it is of interest to obtain an upper bound on the deviation of $\hat{\theta}_N$ around the true value $\theta$ for finite samples. Due to symmetry, our interest lies in upper bounding

$$\Pr\left(\hat{\theta}_N - \theta \geq \varepsilon\right)$$

for any positive and fixed $\varepsilon$. Provided this probability bound, the next natural step is to compute an upper bound on the variance of the least squares estimate for finite samples in this scenario, which is defined as

$$\operatorname{var}(\hat{\theta}_N) = \mathbb{E}\left[\left(\hat{\theta}_N - \mathbb{E}[\hat{\theta}_N]\right)^2\right].$$

The goal of this paper is to compute both bounds under two different regimes: stable ($|\theta| < 1$) and strictly unstable ($|\theta| > 1$). For this purpose, we make use of an exponential inequality for martingales, which we detail next.

4 Preliminaries: Theory of decoupling

In this section we give a brief summary of decoupling theory, mostly based on [19]. Decoupling theory provides an approach to handling complex problems where dependent random variables are involved. By breaking the dependence structure, it is possible to tackle inequalities with greater ease. Before presenting the result we use, we state two important definitions.

Definition 1 ($\{\mathcal{F}_t\}$-tangent sequence).

Let $\{d_t\}$ and $\{e_t\}$ be two sequences of random variables adapted to an increasing sequence of $\sigma$-fields $\{\mathcal{F}_t\}$, and assume that $\mathcal{F}_0$ is the trivial $\sigma$-field. Then $\{e_t\}$ is said to be $\{\mathcal{F}_t\}$-tangent to $\{d_t\}$ if for all $t \geq 1$,

$$\mathcal{L}(d_t \mid \mathcal{F}_{t-1}) = \mathcal{L}(e_t \mid \mathcal{F}_{t-1}),$$

where $\mathcal{L}(\,\cdot \mid \mathcal{F}_{t-1})$ denotes the conditional probability law given $\mathcal{F}_{t-1}$.

Definition 2 (Condition CI).

A sequence of random variables $\{e_t\}$ adapted to an increasing sequence of $\sigma$-fields $\{\mathcal{F}_t\}$ contained in $\mathcal{F}$ is said to satisfy condition CI if there exists a $\sigma$-algebra $\mathcal{G}$ contained in $\mathcal{F}$ such that $\{e_t\}$ is a sequence of conditionally independent random variables given $\mathcal{G}$ and

$$\mathcal{L}(e_t \mid \mathcal{F}_{t-1}) = \mathcal{L}(e_t \mid \mathcal{G})$$

for all $t \geq 1$. The sequence $\{e_t\}$ is then said to be decoupled.

It is known [20] that, for any sequence $\{d_t\}$ adapted to $\{\mathcal{F}_t\}$, there always exists a tangent sequence $\{e_t\}$ of conditionally independent random variables given a master $\sigma$-algebra $\mathcal{G}$ (where usually $\mathcal{G} = \sigma(\{d_t\})$) that has the same individual conditional marginals as the original sequence. In particular, $e_t$ has the same conditional distribution as $d_t$ given $\mathcal{F}_{t-1}$, for all $t$, since $\{e_t\}$ can be seen as drawn from independent repetitions of the same random mechanism. This construction can be made sequentially [19].

We are now ready to state the key proposition used in this work.

Proposition 1.

[19, Corollary 3.1] Let $\{d_t\}$ and $\{e_t\}$ be $\{\mathcal{F}_t\}$-tangent. Assume that $\{e_t\}$ is decoupled (i.e., condition CI is satisfied). Let $g$ be any random variable measurable with respect to $\sigma(\{d_t\}_{t=1}^{n})$. Then for all finite $\lambda$,

$$\mathbb{E}\left[\exp\left(\lambda\left(g + \sum_{t=1}^{n} d_t\right)\right)\right] \leq \left(\mathbb{E}\left[\exp\left(2\lambda\left(g + \sum_{t=1}^{n} e_t\right)\right)\right]\right)^{1/2}.$$
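As the display above is our reconstruction of [19, Corollary 3.1], a Monte Carlo sanity check is worthwhile. The following Python sketch (our construction, with $g = 0$) takes the AR(1) martingale differences $d_t = y_{t-1}w_t$ and decoupled counterparts $e_t = y_{t-1}\tilde{w}_t$, where $\{\tilde{w}_t\}$ is an independent copy of the noise (so that, conditionally on the trajectory, the $e_t$ are independent), and compares both sides of the inequality:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, N, runs, lam = 0.5, 1.0, 50, 20_000, 0.05

lhs = np.empty(runs)
rhs = np.empty(runs)
for r in range(runs):
    w = rng.normal(0.0, sigma, N)        # driving noise w_t
    w_tilde = rng.normal(0.0, sigma, N)  # independent copy -> decoupled sequence
    y = np.zeros(N + 1)
    for t in range(N):
        y[t + 1] = theta * y[t] + w[t]
    d = y[:-1] * w        # tangent (dependent) differences d_t = y_{t-1} w_t
    e = y[:-1] * w_tilde  # decoupled differences e_t = y_{t-1} w~_t
    lhs[r] = np.exp(lam * d.sum())
    rhs[r] = np.exp(2 * lam * e.sum())

# E[exp(lam * sum d_t)] versus sqrt(E[exp(2 * lam * sum e_t)])
print(lhs.mean(), np.sqrt(rhs.mean()))
```

The first printed value should stay below the second, consistent with the square-root factor that reappears in the proofs below.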

5 Bounds for stable AR(1) processes

In this section, we derive the first key result of this paper, Theorem 1, which provides deviation probability and variance bounds for a stable first order autoregressive process with stationary data $\{y_t\}$. We do this by exploiting Proposition 1.

Theorem 1.

Consider the AR(1) process described by (1), and the least-squares estimate (2). Assume that $\{y_t\}$ is stationary and $|\theta| < 1$. Then, for all $\varepsilon > 0$ and sample size $N$,

(3)

Furthermore, for all $N \geq 7$,

(4)
Proof sketch.

Given $\varepsilon > 0$ and any $\lambda > 0$, we have that

$$\Pr\left(\hat{\theta}_N - \theta \geq \varepsilon\right) = \Pr\left(\sum_{t=1}^{N} y_{t-1} w_t - \varepsilon \sum_{t=1}^{N} y_{t-1}^2 \geq 0\right) \leq \mathbb{E}\left[\exp\left(\lambda \sum_{t=1}^{N} y_{t-1} w_t - \lambda\varepsilon \sum_{t=1}^{N} y_{t-1}^2\right)\right],$$

where we have used Chernoff’s inequality [4, Chap. 2]. Next, we need a bound on the right hand side. This can be derived by constructing a decoupled sequence and applying Proposition 1, as follows: let $\{\tilde{w}_t\}$ have the same $\mathcal{F}_{t-1}$-conditional distribution as $\{w_t\}$, and let the sequence $\{y_{t-1}\tilde{w}_t\}$ be conditionally independent given the master $\sigma$-field $\mathcal{G} = \sigma(\{y_t\}_{t=0}^{N-1})$. Given this decoupled sequence, from Proposition 1 we have that

$$\mathbb{E}\left[\exp\left(\lambda \sum_{t=1}^{N} y_{t-1} w_t - \lambda\varepsilon \sum_{t=1}^{N} y_{t-1}^2\right)\right] \leq \left(\mathbb{E}\left[\exp\left(\left(2\lambda^2 \sigma^2 - 2\lambda\varepsilon\right)\sum_{t=1}^{N} y_{t-1}^2\right)\right]\right)^{1/2},$$

where we have used the fact that $\sum_{t=1}^{N} y_{t-1}^2$ is $\mathcal{G}$-measurable. Now, $2\lambda^2\sigma^2 - 2\lambda\varepsilon$ is minimum at $\lambda = \varepsilon/(2\sigma^2)$, giving $-\varepsilon^2/(2\sigma^2)$. Thus, we have found the following bound, valid for stable and unstable systems:

$$\Pr\left(\hat{\theta}_N - \theta \geq \varepsilon\right) \leq \left(\mathbb{E}\left[\exp\left(-\frac{\varepsilon^2}{2\sigma^2}\sum_{t=1}^{N} y_{t-1}^2\right)\right]\right)^{1/2}. \tag{5}$$

To further compute this bound for the stable case, notice that $\sum_{t=1}^{N} y_{t-1}^2 = \mathbf{y}^\top \mathbf{y}$, where $\mathbf{y} = [y_0, \dots, y_{N-1}]^\top \sim \mathcal{N}(0, \Sigma)$ and

$$\Sigma_{ij} = \frac{\sigma^2 \theta^{|i-j|}}{1-\theta^2}$$

is a Toeplitz covariance matrix. By computing the moment generating function in (5) using [21, Theorem 3.2a.1], we find that

$$\Pr\left(\hat{\theta}_N - \theta \geq \varepsilon\right) \leq \left[\det\left(I_N + \frac{\varepsilon^2}{\sigma^2}\Sigma\right)\right]^{-1/4}. \tag{6}$$

From now on, we denote the symmetric Toeplitz matrix $I_N + \frac{\varepsilon^2}{\sigma^2}\Sigma$ as $T_N$. The final step consists in finding a lower bound for $\det(T_N)$, for which we rely on a result from estimation theory. Since the ratios $\det(T_{k+1})/\det(T_k)$ are minimum values of one-step prediction errors of predictors of increasing complexity [22], it is known that

$$\frac{\det(T_{k+1})}{\det(T_k)} \ \text{is non-increasing in } k,$$

and also

$$\lim_{k\to\infty} \frac{\det(T_{k+1})}{\det(T_k)} = \exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi} \log f(\omega)\, d\omega\right),$$

where in this case, $f(\omega) = 1 + \frac{\varepsilon^2}{1 - 2\theta\cos\omega + \theta^2}$. Hence, we can lower bound $\det(T_N)$ as follows:

$$\det(T_N) \geq \det(T_1)\,\exp\left(\frac{N-1}{2\pi}\int_{-\pi}^{\pi} \log f(\omega)\, d\omega\right). \tag{7}$$

Here, $\det(T_1)$ can be computed as $1 + \varepsilon^2/(1-\theta^2)$, while the integral in (7) can be evaluated using Szegő’s formula [22], according to which its exponential equals the minimum variance of the (infinite-past) prediction error for a process with spectrum $f(\omega)$. By this result and (6), we obtain (3).

Finally, the upper bound on the estimator’s variance for finite samples can be computed through the following inequality:

$$\operatorname{var}(\hat{\theta}_N) \leq \mathbb{E}\left[(\hat{\theta}_N - \theta)^2\right] = \int_0^\infty \Pr\left((\hat{\theta}_N - \theta)^2 \geq x\right) dx,$$

where the integrand is bounded using (3) and the symmetry of the deviation; integrating yields (4). ∎
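As a numerical sanity check on the determinant step of this proof (a sketch based on the quantities as reconstructed here: $\Sigma_{ij} = \sigma^2\theta^{|i-j|}/(1-\theta^2)$, $T_N = I_N + (\varepsilon^2/\sigma^2)\Sigma$, and the lower bound (7)):

```python
import numpy as np

theta, sigma, eps, N = 0.8, 1.0, 0.3, 40

# Stationary AR(1) covariance (Toeplitz) and T_N = I_N + (eps^2 / sigma^2) * Sigma
idx = np.arange(N)
Sigma = sigma**2 / (1 - theta**2) * theta ** np.abs(idx[:, None] - idx[None, :])
T = np.eye(N) + eps**2 / sigma**2 * Sigma

# Szego integral (1/2pi) int log f(w) dw, f(w) = 1 + eps^2 / (1 - 2 theta cos w + theta^2)
w = np.linspace(-np.pi, np.pi, 400_000, endpoint=False)
f = 1 + eps**2 / (1 - 2 * theta * np.cos(w) + theta**2)
szego = np.log(f).mean()  # mean over a uniform grid approximates the integral

log_lower_bound = np.log(1 + eps**2 / (1 - theta**2)) + (N - 1) * szego
_, logdetT = np.linalg.slogdet(T)
print(logdetT, log_lower_bound)  # log det(T_N) should exceed the bound
```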

Note that our probability bound does not depend on the noise variance, which is intuitive since the random variable $\hat{\theta}_N - \theta$ is unaffected by scale changes. Regarding variance, our bound captures the correct behavior of the least squares estimate for this scenario, since stable systems that are closer to the stability limit are in fact easier to learn, due to a greater signal-to-noise ratio, as argued in [16]. By comparing (4) to the well-known asymptotic Cramér–Rao variance lower bound for this case, which states that for large $N$,

$$\operatorname{var}(\hat{\theta}_N) \geq \frac{1-\theta^2}{N},$$

we see that the variance bound we have obtained has the same structure up to a positive constant factor, indicating that it is rate-optimal.

6 Bounds for unstable AR(1) processes

In the previous section, we provided a rigorous finite-sample analysis of a stable AR(1) process. We will now consider the unstable case, where $|\theta| > 1$. Contrary to the previous scenario, stationarity cannot be assumed. Despite this, a similar analysis still holds, and we are able to show that the variance bound decays exponentially in the number of samples, in contrast with the stable case. We state the result in Theorem 2.

Theorem 2.

Consider the AR(1) process described by (1) with $|\theta| > 1$ and $y_0 = 0$, and consider the least-squares estimate (2). Then, for all $\varepsilon > 0$ and sample size $N$,

(8)

where

(9)

Furthermore, for all $N \geq 7$, the variance $\operatorname{var}(\hat{\theta}_N)$ satisfies

(10)
Proof sketch.

By following the same lines as the stable case proof, we reach (5). Since $y_0 = 0$, we find that $\sum_{t=1}^{N} y_{t-1}^2 = \mathbf{y}^\top\mathbf{y}$ with $\mathbf{y} = [y_1, \dots, y_{N-1}]^\top \sim \mathcal{N}(0, \bar{\Sigma})$, where $\bar{\Sigma}$ is a covariance matrix with entries

$$\bar{\Sigma}_{ij} = \sigma^2 \sum_{k=1}^{\min(i,j)} \theta^{i+j-2k}.$$

This covariance matrix can be shown to satisfy

$$\bar{\Sigma}^{-1} = \frac{1}{\sigma^2}\left(A - \theta^2 e_n e_n^\top\right),$$

where $e_n$ denotes the $n$-th column of the identity matrix of size $n = N-1$, and $A$ is a non-singular tridiagonal Toeplitz matrix with tridiagonal formed by $(-\theta,\ 1+\theta^2,\ -\theta)$. Thus, by similar computations as in (6), we reach

$$\Pr\left(\hat{\theta}_N - \theta \geq \varepsilon\right) \leq \left[\det\left(I + \frac{\varepsilon^2}{\sigma^2}\bar{\Sigma}\right)\right]^{-1/4}.$$

Using Cauchy’s rank-one perturbation formula [23], together with the expression for $\bar{\Sigma}^{-1}$ above, we can write the previous inequality in terms of determinants of tridiagonal Toeplitz matrices. The rest of the proof consists in calculating these determinants, which can be done in a recursive manner. These computations lead to the desired probability bound.
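The recursive determinant computation alluded to here is the standard three-term recursion for tridiagonal matrices. A minimal Python sketch (our illustration, not code from the paper) for the tridiagonal Toeplitz matrix with diagonal $1+\theta^2$ and off-diagonals $-\theta$:

```python
import numpy as np

def tridiag_toeplitz(n, a, b):
    """Symmetric n x n tridiagonal Toeplitz matrix: diagonal a, off-diagonals b."""
    return (np.diag(np.full(n, a))
            + np.diag(np.full(n - 1, b), -1)
            + np.diag(np.full(n - 1, b), 1))

def tridiag_det(n, a, b):
    """Three-term recursion D_k = a D_{k-1} - b^2 D_{k-2}, with D_0 = 1, D_1 = a."""
    d_prev, d = 1.0, a
    for _ in range(2, n + 1):
        d_prev, d = d, a * d - b**2 * d_prev
    return d

theta, n = 1.2, 8
a, b = 1 + theta**2, -theta
print(tridiag_det(n, a, b), np.linalg.det(tridiag_toeplitz(n, a, b)))  # should agree
```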

In order to obtain a bound on the variance, we first determine another probability bound which is easier to integrate. By manipulating the roots in (9), we can relax the deviation bound to

(11)

which involves a fixed threshold value that we will set later. For the next calculations, we also take a fixed positive number, to be determined later, and split the tail integral of the variance accordingly. By substitution, we compute

(12)

Next, we must find appropriate values for these two constants. The idea behind our approach is to make them (approximately) coincide with the change of regime of the bound in (11). It can be shown that valid values exist in closed form; by replacing these quantities into (12) and optimizing the remaining free parameter, we obtain the best rate of decay, leading to the bound in (10). ∎

This bound provides the intuitive decay for unstable systems: as the sample size increases, the variance of the least squares estimate decreases exponentially, and the rate of decay is proportional to the degree of instability of the system.

7 Simulations

To validate our theoretical results, we now show how the bounds obtained in Theorems 1 and 2 perform in practice. First, we test the probability bounds given in (3) and (8). Using Monte Carlo runs, we have obtained sample probabilities of the deviation event over a grid of values of $\varepsilon$ and $N$. Both the simulated and upper-bound surfaces, for selected stable and unstable values of $\theta$, can be seen in Fig. 1. We have also obtained the sample variance from the same number of Monte Carlo runs, for a range of sample sizes $N$. Fig. 2 shows the results for the same values of $\theta$ studied previously.

Figure 1: True deviation probability (green) and its upper bound (yellow), for the values of $\theta$ considered.
Figure 2: True (solid) and upper bound (dashed) curves for $\operatorname{var}(\hat{\theta}_N)$. Left: stable values of $\theta$ (red, blue). Right: unstable values of $\theta$ (red, blue).
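A minimal Python sketch of this Monte Carlo experiment (our reconstruction; the run counts and parameter grids below are illustrative, not the ones used in the paper):

```python
import numpy as np

def ls_error(theta, sigma, N, rng):
    """One run: simulate (1) with y_0 = 0 and return hat(theta)_N - theta."""
    w = rng.normal(0.0, sigma, N)
    y = np.zeros(N + 1)
    for t in range(N):
        y[t + 1] = theta * y[t] + w[t]
    return np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1]) - theta

rng = np.random.default_rng(2)
sigma, runs, eps = 1.0, 5000, 0.1
for theta in (0.5, 0.9, 1.1):          # stable and unstable examples
    for N in (10, 20, 50, 100):
        err = np.array([ls_error(theta, sigma, N, rng) for _ in range(runs)])
        # sample deviation probability and sample variance
        print(theta, N, (err >= eps).mean(), err.var())
```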

For both stable and unstable cases, the bounds on deviation probability and variance are shown to have qualitative behavior similar to that of the simulated curves. Some conservatism was expected, due in part to Chernoff’s inequality and in part to the square root introduced by the application of Proposition 1. As predicted by the bounds, as the true parameter ranges from stable to unstable, the deviation probability flattens to zero more quickly. For fixed sample size, this naturally leads to a reduction of variance for more unstable systems, which is also accurately predicted, in qualitative terms, by our results.

8 Conclusions

In this paper we have obtained finite-sample upper bounds for the deviation probability and variance of the least squares estimator of the parameter of a scalar AR(1) process. Our results cover both stable and unstable cases, and reflect the correct qualitative behavior in both scenarios. To obtain these bounds we have used elements of decoupling theory, which provides a promising approach to deriving finite-sample bounds for dynamical systems.

Future work includes extending these results to general autoregressive processes with sub-Gaussian noise, and applying these bounds to perform a non-asymptotic analysis of more general system identification model structures.

References

  • [1] H. B. Mann and A. Wald, “On the statistical treatment of linear stochastic difference equations,” Econometrica, Journal of the Econometric Society, pp. 173–220, 1943.
  • [2] T. L. Lai and C. Z. Wei, “Asymptotic properties of general autoregressive models and strong consistency of least-squares estimates of their parameters,” Journal of Multivariate Analysis, vol. 13, no. 1, pp. 1–23, 1983.
  • [3] V. H. De la Peña, T. L. Lai, and Q.-M. Shao, Self-normalized processes: Limit theory and Statistical Applications, Springer, 2008.
  • [4] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Cambridge University Press, 2019.
  • [5] Y. Jedra and A. Proutiere, “Sample Complexity Lower Bounds for Linear System Identification,” arXiv preprint arXiv:1903.10343, 2019.
  • [6] T. Sarkar and A. Rakhlin, “Near optimal finite time identification of arbitrary linear dynamical systems,” in International Conference on Machine Learning, 2019, pp. 5610–5618.
  • [7] T. Lattimore and C. Szepesvári, Bandit algorithms, preprint, 2018.
  • [8] B. Friedlander and B. Porat, “The Exact Cramér-Rao Bound for Gaussian Autoregressive Processes,” IEEE Transactions on Aerospace and Electronic Systems, vol. 25, no. 1, pp. 3–7, 1989.
  • [9] E. Weyer, R. C. Williamson, and I. M. Mareels, “Finite sample properties of linear model identification,” IEEE Transactions on Automatic Control, vol. 44, no. 7, pp. 1370–1383, 1999.
  • [10] M. C. Campi and E. Weyer, “Finite sample properties of system identification methods,” IEEE Transactions on Automatic Control, vol. 47, no. 8, pp. 1329–1334, 2002.
  • [11] P. Stoica and R. L. Moses, Spectral Analysis of Signals, Prentice Hall, 2005.
  • [12] B. Bercu, F. Gamboa, and A. Rouault, “Large deviations for quadratic forms of stationary gaussian processes,” Stochastic Processes and their Applications, vol. 71, no. 1, pp. 75–90, 1997.
  • [13] B. Bercu, “On large deviations in the gaussian autoregressive process: stable, unstable and explosive cases,” Bernoulli, vol. 7, no. 2, pp. 299–316, 2001.
  • [14] A. Rantzer, “Concentration bounds for single parameter adaptive control,” in 2018 Annual American Control Conference (ACC), 2018, pp. 1862–1866.
  • [15] B. Bercu and A. Touati, “Exponential inequalities for self-normalized martingales with applications,” The Annals of Applied Probability, vol. 18, no. 5, pp. 1848–1869, 2008.
  • [16] M. Simchowitz, H. Mania, S. Tu, M. I. Jordan, and B. Recht, “Learning without mixing: Towards a sharp analysis of linear system identification,” in Conference On Learning Theory, 2018, pp. 439–473.
  • [17] M. Faradonbeh, A. Tewari, and G. Michailidis, “Finite time identification in unstable linear systems,” Automatica, vol. 96, pp. 342–353, 2018.
  • [18] A. Tsiamis and G. J. Pappas, “Finite sample analysis of stochastic system identification,” arXiv preprint arXiv:1903.09122, 2019.
  • [19] V. H. De la Peña, “A general class of exponential inequalities for martingales and ratios,” The Annals of Probability, vol. 27, no. 1, pp. 537–564, 1999.
  • [20] V. H. De la Peña and E. Giné, Decoupling: from Dependence to Independence, Springer, 2012.
  • [21] A. M. Mathai and S. B. Provost, Quadratic forms in random variables: theory and applications, Dekker, 1992.
  • [22] R. M. Gray, “Toeplitz and Circulant matrices: A review,” Foundations and Trends in Communications and Information Theory, vol. 2, no. 3, pp. 155–239, 2006.
  • [23] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed, Cambridge University Press, 2012.
  • [24] T. Kailath, A. H. Sayed, and B. Hassibi, Linear estimation, Prentice Hall, 2000.
  • [25] D. Kulkarni, D. Schmidt, and S.-K. Tsui, “Eigenvalues of tridiagonal pseudo-Toeplitz matrices,” Linear Algebra and its Applications, vol. 297, pp. 63–80, 1999.
  • [26] P. Shaman, “On the inverse of the covariance matrix of a first order moving average,” Biometrika, vol. 56, no. 3, pp. 595–600, 1969.
  • [27] S. M. Kay, Fundamentals of Statistical Signal Processing, Prentice Hall, 1993.

9 Supplementary material

In this section we provide details on parts of the proofs of Theorems 1 and 2.

Fact 1.

The following identity holds:

(13)
Proof.

By Szegő’s formula [24], the integral in question is equal to the logarithm of the constant factor of the spectral factorization of the spectrum involved. The numerator of this spectrum is a quadratic trigonometric polynomial, whose roots can be computed explicitly. With this spectral factorization, we obtain the constant factor, from which we conclude (13). ∎
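Under our reconstruction of the spectrum from Section 5, $f(\omega) = 1 + \varepsilon^2/(1 - 2\theta\cos\omega + \theta^2)$, this factorization argument can be checked numerically. Writing the numerator of $f$ as $c\,|1-\rho e^{-j\omega}|^2$ forces $c\rho = \theta$ and $c(1+\rho^2) = 1+\theta^2+\varepsilon^2$, so $c$ is the larger root of $c^2 - (1+\theta^2+\varepsilon^2)c + \theta^2 = 0$. The Python sketch below (ours) compares this $c$ with the exponential of the Szegő integral; for $|\theta| < 1$, the denominator contributes a factor of one:

```python
import numpy as np

theta, eps = 0.6, 0.4  # stable case, |theta| < 1

# f(w) = (1 + theta^2 + eps^2 - 2 theta cos w) / (1 + theta^2 - 2 theta cos w)
w = np.linspace(-np.pi, np.pi, 400_000, endpoint=False)
f = ((1 + theta**2 + eps**2 - 2 * theta * np.cos(w))
     / (1 + theta**2 - 2 * theta * np.cos(w)))
szego_constant = np.exp(np.log(f).mean())  # exp of (1/2pi) * int log f dw

# Constant factor of the factorization: larger root of c^2 - beta c + theta^2 = 0
beta = 1 + theta**2 + eps**2
c = (beta + np.sqrt(beta**2 - 4 * theta**2)) / 2

print(szego_constant, c)  # the two values should agree
```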

Fact 2.

The matrix $\bar{\Sigma}$ formed with entries

$$\bar{\Sigma}_{ij} = \sigma^2 \sum_{k=1}^{\min(i,j)} \theta^{i+j-2k}, \qquad i, j = 1, \dots, n,$$

has a tridiagonal matrix inverse of the form

(14)

Furthermore, the Toeplitz matrix $A = \sigma^2\bar{\Sigma}^{-1} + \theta^2 e_n e_n^\top$, where $e_n$ is the $n$-th column of the $n \times n$ identity matrix, is non-singular for $|\theta| \neq 1$.

Proof.

Denote the matrix product between $\bar{\Sigma}$ and the matrix in (14) as $P$. Computing the elements of $P$ case by case, according to the relative positions of the row and column indices, one finds that all off-diagonal entries vanish and all diagonal entries equal one, from which we conclude that $P = I$, i.e., the matrix in (14) is indeed the inverse of $\bar{\Sigma}$.

Regarding the invertibility of $A$, we use Geršgorin’s circle theorem [23] and the symmetry of this matrix to bound the spectrum of $A$. Every Geršgorin disc is centered at $1+\theta^2$ and has radius at most $2|\theta|$, which implies that each eigenvalue $\lambda$ of $A$ must satisfy

$$\lambda \geq 1 + \theta^2 - 2|\theta| = (1 - |\theta|)^2,$$

which is positive for $|\theta| \neq 1$. (In fact, an explicit expression for the eigenvalues exists: it can be shown (see e.g. [25]) that the eigenvalues of $A$ are given by $\lambda_k = 1 + \theta^2 - 2\theta\cos\left(\frac{k\pi}{n+1}\right)$, $k = 1, \dots, n$, which are positive in the unstable regime considered here.) We note that similar results regarding inverses of covariance matrices have been studied before; we refer to [26] for more details. ∎
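A quick numerical check of Fact 2 (a sketch using the covariance entries and the tridiagonal structure as reconstructed above):

```python
import numpy as np

theta, sigma, n = 1.3, 1.0, 6

# Covariance entries sigma^2 * sum_{k=1}^{min(i,j)} theta^(i+j-2k), for i, j = 1..n
S = np.array([[sigma**2 * sum(theta**(i + j - 2 * k) for k in range(1, min(i, j) + 1))
               for j in range(1, n + 1)] for i in range(1, n + 1)])

# Candidate inverse (1/sigma^2) (A - theta^2 e_n e_n^T), with A tridiagonal Toeplitz
A = (np.diag(np.full(n, 1 + theta**2))
     + np.diag(np.full(n - 1, -theta), -1)
     + np.diag(np.full(n - 1, -theta), 1))
e_n = np.eye(n)[:, -1]
S_inv = (A - theta**2 * np.outer(e_n, e_n)) / sigma**2

print(np.allclose(S @ S_inv, np.eye(n)))  # True

# Eigenvalues of A against the closed form 1 + theta^2 - 2 theta cos(k pi / (n + 1))
k = np.arange(1, n + 1)
closed_form = 1 + theta**2 - 2 * theta * np.cos(k * np.pi / (n + 1))
print(np.allclose(np.sort(np.linalg.eigvalsh(A)), np.sort(closed_form)))  # True
```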

Fact 3.

Denote by $A_n$ the tridiagonal Toeplitz matrix of size $n$ with non-zero entries on its main, sub- and super-diagonals. Then, the following determinant equalities hold:

where

Proof.

For the first equality, via Proposition 2 we know that the determinants of interest satisfy a three-term recursion, whose initial conditions can be computed directly; solving this recursion gives the stated expression. For the second equality, due to the tridiagonal structure, Proposition 2 applies again, and the corresponding determinants satisfy an analogous recursion with modified initial conditions. We are interested in