# Moment bounds for autocovariance matrices under dependence

The goal of this paper is to obtain expectation bounds for the deviation of large sample autocovariance matrices from their means under weak data dependence. While the accuracy of covariance matrix estimation for independent data is well understood, much less is known in the case of dependent data. We take a step towards filling this gap and establish deviation bounds that depend only on the parameters controlling the "intrinsic dimension" of the data, up to some logarithmic terms. Our results have immediate implications for high dimensional time series analysis, and we apply them to a high dimensional linear VAR(d) model, a vector-valued ARCH model, and a model used in Banna et al. (2016).


## 1 Introduction

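Consider a sequence of $p$-dimensional mean-zero random vectors $\{\boldsymbol{Y}_t\}_{t\in\mathbb{Z}}$ and a size-$n$ fraction $\{\boldsymbol{Y}_i\}_{i=1}^{n}$ of it. This paper aims to establish moment bounds for the spectral norm deviation of the lag-$m$ autocovariances of $\{\boldsymbol{Y}_i\}_{i=1}^{n}$, $\hat{\boldsymbol{\Sigma}}_m := (n-m)^{-1}\sum_{i=1}^{n-m}\boldsymbol{Y}_i\boldsymbol{Y}_{i+m}^{\mathsf{T}}$, from their mean values.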
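To make the object of study concrete, the following is a minimal sketch of the lag-$m$ sample autocovariance and its spectral norm deviation from a target matrix. It assumes the normalization by $n-m$ that appears in the bounds of Section 2 and no centering; the function names `sample_autocov` and `spectral_deviation` are illustrative and not from the paper.

```python
import numpy as np


def sample_autocov(Y: np.ndarray, m: int = 0) -> np.ndarray:
    """Lag-m sample autocovariance of the rows of Y (shape n x p).

    Assumed normalization: (n - m)^{-1} * sum_{i=1}^{n-m} Y_i Y_{i+m}^T,
    without centering (the vectors are treated as mean-zero).
    """
    n, _ = Y.shape
    if not 0 <= m < n:
        raise ValueError("lag m must satisfy 0 <= m < n")
    # Rows 0..n-m-1 are paired with rows m..n-1.
    return Y[: n - m].T @ Y[m:] / (n - m)


def spectral_deviation(Y: np.ndarray, m: int, target: np.ndarray) -> float:
    """Spectral norm of (sample lag-m autocovariance - target)."""
    return float(np.linalg.norm(sample_autocov(Y, m) - target, ord=2))
```

For i.i.d. standard Gaussian rows, `spectral_deviation(Y, 0, np.eye(p))` is one draw of the quantity whose expectation is bounded throughout the paper.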

A first result at the origin of such problems concerns product measures, with $m=0$ and $\boldsymbol{Y}_1,\ldots,\boldsymbol{Y}_n$ independent and identically distributed (i.i.d.). For this, Rudelson (1999) derived a bound on $\mathbb{E}\|\hat{\boldsymbol{\Sigma}}_0-\boldsymbol{\Sigma}_0\|$, where $\|\cdot\|$ represents the spectral norm for matrices. The technique is based on symmetrization, and the derived maximal inequality is a consequence of a concentration inequality for a “symmetrized” sum of symmetric, deterministic matrices (cf. Oliveira (2010)). That is, for any ,

 (1.1)

where the random signs are independent and take values $\pm 1$ with equal probability. The applicability of this technique then hinges on the assumption that the data are i.i.d.

Later, Vershynin (2012), Srivastava and Vershynin (2013), Mendelson and Paouris (2014), Lounici (2014), Bunea and Xiao (2015), Tikhomirov (2017), among many others, derived different types of deviation bounds for $\|\hat{\boldsymbol{\Sigma}}_0-\boldsymbol{\Sigma}_0\|$ under different distributional assumptions. For example, Lounici (2014) and Bunea and Xiao (2015) showed that, for subgaussian and i.i.d. $\boldsymbol{Y}_1,\ldots,\boldsymbol{Y}_n$,

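$$
\mathbb{E}\|\hat{\boldsymbol{\Sigma}}_0-\boldsymbol{\Sigma}_0\| \le C\,\|\boldsymbol{\Sigma}_0\|\left\{\sqrt{\frac{r(\boldsymbol{\Sigma}_0)\log(ep)}{n}}+\frac{r(\boldsymbol{\Sigma}_0)\log(ep)}{n}\right\}. \tag{1.2}
$$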

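Here $C$ is a universal constant, $\boldsymbol{\Sigma}_0 := \mathbb{E}\boldsymbol{Y}_1\boldsymbol{Y}_1^{\mathsf{T}}$, and $r(\boldsymbol{\Sigma}_0) := \mathrm{Tr}(\boldsymbol{\Sigma}_0)/\|\boldsymbol{\Sigma}_0\|$ is termed the “effective rank” (Vershynin, 2012), where $\mathrm{Tr}(\boldsymbol{X}) := \sum_{i=1}^{p} X_{i,i}$ for any real $p\times p$ matrix $\boldsymbol{X}$.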
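As a quick illustration of why the effective rank, rather than the ambient dimension, drives (1.2), the sketch below computes $r(\boldsymbol{\Sigma}) = \mathrm{Tr}(\boldsymbol{\Sigma})/\|\boldsymbol{\Sigma}\|$ and the right-hand side of (1.2). The universal constant is unknown, so it is set to 1 purely as a placeholder; the function names are illustrative.

```python
import numpy as np


def effective_rank(sigma: np.ndarray) -> float:
    """Effective rank Tr(sigma) / ||sigma|| (spectral norm)."""
    sigma = np.asarray(sigma, dtype=float)
    return float(np.trace(sigma) / np.linalg.norm(sigma, ord=2))


def iid_rate(sigma: np.ndarray, n: int, C: float = 1.0) -> float:
    """Right-hand side of (1.2); C = 1 is a placeholder, not the true constant."""
    sigma = np.asarray(sigma, dtype=float)
    p = sigma.shape[0]
    a = effective_rank(sigma) * np.log(np.e * p) / n
    return float(C * np.linalg.norm(sigma, ord=2) * (np.sqrt(a) + a))
```

For a covariance with a fast-decaying spectrum, for example `np.diag(1.0 / np.arange(1, p + 1) ** 2)`, the effective rank stays bounded as `p` grows, so the rate depends on `p` only through the logarithm.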

Statistically speaking, Inequality (1.2) has rich implications. For example, combining (1.2) with the Davis-Kahan inequality (Davis and Kahan, 1970) suggests that principal component analysis (PCA), a core statistical method whose aim is to recover the leading eigenvectors of $\boldsymbol{\Sigma}_0$, could still produce consistent estimators even if the dimension $p$ is much larger than the sample size $n$, as long as the “intrinsic dimension” of the data, quantified by $r(\boldsymbol{\Sigma}_0)$, is small enough. See Section 1 in Han and Liu (2018) for more discussion of the statistical performance of PCA in high dimensions.

The main goal of this paper is to extend the deviation inequality (1.2) to large autocovariance matrices, where the matrices are constructed from a high dimensional structural time series. Examples of such time series include the linear vector autoregressive model of lag $d$ (VAR($d$)), the vector-valued autoregressive conditionally heteroscedastic (ARCH) model, and a model used in Banna et al. (2016). The main result appears below as Theorem 2 and is nonasymptotic in nature. This result has important consequences in high dimensional time series analysis. For example, it immediately yields a new analysis for estimating large covariance matrices (Chen et al., 2013) and a new proof of consistency for Brillinger’s PCA in the frequency domain (cf. Chapter 9 in Brillinger (2001)), and we envision that it could facilitate a new proof of consistency for the PCA procedure proposed in Chang et al. (2018).

The rest of the paper is organized as follows. Section 2 characterizes the settings and gives the main concentration inequality for large autocovariance matrices. In Section 3, we present applications of our results to some specific time series models. Proofs of the main results are given in Section 4, with more relegated to an appendix.

## 2 Main results

We first introduce the notation that will be used in this paper. Without further specification, we use bold, italic lower case letters to denote vectors, e.g., as a -dimensional real vector, and as its vector norm. We use bold, upper case letters to denote matrices, e.g., as a real matrix, and as the identity matrix. Throughout the paper, let be generic universal constants, whose actual values may vary at different locations. For any two sequences of positive numbers , we denote if there exists a universal constant such that for all large enough. We write if both and hold.

Consider a time series $\{\boldsymbol{Y}_t\}_{t\in\mathbb{Z}}$ of $p$-dimensional real entries $\boldsymbol{Y}_t\in\mathbb{R}^p$, with $\mathbb{R}$ and $\mathbb{Z}$ denoting the sets of real and integer numbers respectively. In the sequel, the considered time series need not be stationary or centered, and we focus on a size-$n$ fraction of it. Without loss of generality, we denote this fraction by $\{\boldsymbol{Y}_i\}_{i=1}^{n}$.

As described in the introduction, the case of independent $\boldsymbol{Y}_1,\ldots,\boldsymbol{Y}_n$ has been discussed in depth in recent years. We are interested here in the time series setting, and our main emphasis will be to describe nontrivial but easy-to-verify cases for which Inequality (1.2) still holds. The following four assumptions are accordingly made, with the notation that

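$$
\mathbb{S}^{p-1} := \{\boldsymbol{x}\in\mathbb{R}^p : \|\boldsymbol{x}\|_2 = 1\}, \qquad \bar{\mathbb{S}}^{p-1} := \{\boldsymbol{x}\in\mathbb{R}^p : |x_1| = \cdots = |x_p| = 1\},
$$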

and

$$
\|X\|_{L(p)} := (\mathbb{E}|X|^p)^{1/p}, \qquad \|X\|_{\psi_2} := \inf\big\{k\in(0,\infty) : \mathbb{E}\big[\exp\{(|X|/k)^2\}-1\big] \le 1\big\}
$$

for any random variable $X$.
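As a sanity check on the Orlicz norm just defined, here is a short worked computation for a standard Gaussian variable. It is our own illustration, not part of the paper, and uses only the moment generating function of a chi-square variable. For $X\sim N(0,1)$ and $k^2>2$,

$$
\mathbb{E}\exp\{(X/k)^2\} = \Big(1-\frac{2}{k^2}\Big)^{-1/2},
\qquad\text{so}\qquad
\mathbb{E}\big[\exp\{(X/k)^2\}-1\big]\le 1
\iff 1-\frac{2}{k^2}\ge\frac14
\iff k^2\ge\frac83 .
$$

Hence $\|X\|_{\psi_2}=\sqrt{8/3}\approx 1.63$, and by scaling $\|\sigma X\|_{\psi_2}=\sigma\sqrt{8/3}$ for any $\sigma>0$. In particular, for a Gaussian vector the $\psi_2$-norm of a linear form $\boldsymbol{u}^{\mathsf{T}}\boldsymbol{Y}$ is proportional to its standard deviation.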

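1. Assume that

$$
\kappa_1 := \sup_{t\in\mathbb{Z}}\sup_{\boldsymbol{u}\in\mathbb{S}^{p-1}}\|\boldsymbol{u}^{\mathsf{T}}\boldsymbol{Y}_t\|_{\psi_2} < \infty, \qquad \kappa_* := \sup_{t\in\mathbb{Z}}\sup_{\boldsymbol{v}\in\bar{\mathbb{S}}^{p-1}}\|\boldsymbol{v}^{\mathsf{T}}\boldsymbol{Y}_t\|_{\psi_2} < \infty.
$$

2. Assume, for any integer $j$, there exists a sequence of random vectors $\{\tilde{\boldsymbol{Y}}_t\}_{t>j}$ which is independent of $\{\boldsymbol{Y}_t\}_{t\le j}$, identically distributed as $\{\boldsymbol{Y}_t\}_{t>j}$, and for any integer $k\ge j+1$,

$$
\big\|\,\|\boldsymbol{Y}_k-\tilde{\boldsymbol{Y}}_k\|_2\,\big\|_{L(1+\epsilon)} \le \gamma_1\kappa_1\exp\{-\gamma_2(k-j-1)\}
$$

for some positive constants $\gamma_1$, $\gamma_2$, and $\epsilon$.

3. Assume, for any integer $j$, there exists a sequence of random vectors $\{\tilde{\boldsymbol{Y}}_t\}_{t>j}$ which is independent of $\{\boldsymbol{Y}_t\}_{t\le j}$, identically distributed as $\{\boldsymbol{Y}_t\}_{t>j}$, and for any integer $k\ge j+1$,

$$
\sup_{\boldsymbol{u}\in\mathbb{S}^{p-1}}\big\|(\boldsymbol{Y}_k-\tilde{\boldsymbol{Y}}_k)^{\mathsf{T}}\boldsymbol{u}\big\|_{L(1+\epsilon)} \le \gamma_3\kappa_1\exp\{-\gamma_4(k-j-1)\}
$$

for some positive constants $\gamma_3$, $\gamma_4$, and $\epsilon$.

4. Assume there exists a universal constant such that, for all and for all , .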

Two observations are in order. We first reveal the relationship between and the effective rank highlighted in (1.2). As a matter of fact, it is easy to see, as , and scale at the same orders of and , and the same observation applies to all subgaussian distributions with the additional Assumption 4, which is identical to Assumption 1 in Lounici (2014). Hence, could be pictured as a generalized “effective rank”. In the following, this important ratio will be denoted as $r_*$.

Secondly, we note that Assumptions 2 and 3 characterize the intrinsic coupling property of the sequence. In practice, such couplings can often be constructed explicitly. Consider, for example, the following causal shift model,

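$$
\boldsymbol{Y}_t = H_t(\xi_t,\xi_{t-1},\xi_{t-2},\ldots),
$$

where $\{\xi_t\}_{t\in\mathbb{Z}}$ consists of independent elements with values in a measurable space and $H_t$ is a vector-valued function. Then it is natural to consider

$$
\tilde{\boldsymbol{Y}}_t = H_t(\xi_t,\ldots,\xi_{j+1},\tilde{\xi}_j,\tilde{\xi}_{j-1},\ldots)
$$

for $\{\tilde{\xi}_t\}_{t\in\mathbb{Z}}$ an independent copy of $\{\xi_t\}_{t\in\mathbb{Z}}$.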
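The coupling construction can be sketched numerically for the simplest causal shift model, a scalar-coefficient linear filter. The code below is an illustrative sketch under our own simplifying assumptions (Gaussian innovations, a finite past started from zero, a single decay parameter `rho`); it is not the general setting of the paper. Replacing the innovations up to time `j` by an independent copy yields a coupled sequence whose distance to the original decays geometrically in `k - j`, which is the type of behaviour Assumptions 2 and 3 quantify.

```python
import numpy as np


def causal_linear(xi: np.ndarray, rho: float) -> np.ndarray:
    """Y_t = sum_{s <= t} rho^(t - s) xi_s, started from zero before time 0."""
    Y = np.zeros_like(xi)
    prev = np.zeros(xi.shape[1])
    for t in range(xi.shape[0]):
        prev = rho * prev + xi[t]
        Y[t] = prev
    return Y


def coupled_pair(T: int, p: int, j: int, rho: float, seed: int = 0):
    """Return (Y, Y_tilde), where Y_tilde reuses xi_t for t > j only."""
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal((T, p))
    xi_tilde = xi.copy()
    # Independent copy of the innovations at times 0..j.
    xi_tilde[: j + 1] = rng.standard_normal((j + 1, p))
    return causal_linear(xi, rho), causal_linear(xi_tilde, rho)


Y, Y_tilde = coupled_pair(T=60, p=5, j=29, rho=0.5)
gaps = np.linalg.norm(Y - Y_tilde, axis=1)  # ||Y_k - Y_tilde_k||_2 for each k
```

In this toy model `gaps[k]` equals `rho ** (k - j) * gaps[j]` for every `k > j`, so the decay rate plays the role of $\exp(-\gamma_2)$.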

The following is the main result of this paper.

Let be a sequence of random vectors satisfying Assumptions 1-3 and recall . Assume and . Then, for any integer and , we have

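$$
\mathbb{E}\|\hat{\boldsymbol{\Sigma}}_m-\mathbb{E}\hat{\boldsymbol{\Sigma}}_m\| \le C\kappa_1^2\left\{\sqrt{\frac{r_*\log(ep)}{n-m}}+\frac{r_*\log(ep)\,(\log np)^3}{n-m}\right\} \tag{2.1}
$$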

for some constant only depending on . If, in addition, is a second-order stationary sequence of mean-zero random vectors and Assumption 4 holds, then

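$$
\mathbb{E}\|\hat{\boldsymbol{\Sigma}}_m-\mathbb{E}\hat{\boldsymbol{\Sigma}}_m\| \le C'\|\boldsymbol{\Sigma}_0\|\left\{\sqrt{\frac{r(\boldsymbol{\Sigma}_0)\log(ep)}{n-m}}+\frac{r(\boldsymbol{\Sigma}_0)\log(ep)\,(\log np)^3}{n-m}\right\}
$$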

for some constant only depending on .
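To get a feel for the two terms in (2.1), the following sketch evaluates them separately, without the unknown constant $C$ and without the factor $\kappa_1^2$. The function name `rate_terms` is illustrative, and `r_star` stands for the generalized effective rank $r_*$.

```python
import math


def rate_terms(r_star: float, n: int, m: int, p: int) -> tuple[float, float]:
    """Return (leading, remainder) terms of (2.1), up to C * kappa_1^2.

    leading   = sqrt(r_* log(ep) / (n - m))
    remainder = r_* log(ep) (log(np))^3 / (n - m)
    """
    if not 0 <= m < n:
        raise ValueError("lag m must satisfy 0 <= m < n")
    base = r_star * math.log(math.e * p) / (n - m)
    return math.sqrt(base), base * math.log(n * p) ** 3


leading, remainder = rate_terms(r_star=5.0, n=10_000, m=1, p=1_000)
```

The remainder is the leading term squared times $(\log np)^3$, so the square-root term dominates once $r_*\log(ep)(\log np)^6$ is small relative to $n-m$. This matches the case split used in the proof in Section 4.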

We first comment on the temporal dependence conditions, Assumptions 2 and 3. We note that they correspond exactly to the -measure of dependence introduced in Chapter 3 of Dedecker et al. (2007), for the sequence and respectively. In addition, as will be seen soon, our measure of dependence is also closely related to the $\tau$-measure introduced in Dedecker and Prieur (2004). In particular, ours is usually stronger than, but as , reduces to the $\tau$-measure. Lastly, our conditions are also closely connected to the functional dependence measure in Wu (2005), on which many moment inequalities in real space have been established (cf. Liu et al. (2013) and Wu and Wu (2016)). However, it is still unclear if a similar matrix Bernstein inequality could be developed under Wu’s functional dependence condition.

Secondly, one can readily verify that Inequality (2.1) gives exact control of the deviation from the mean. In fact, up to some logarithmic terms, Inequality (2.1) is nearly a strict extension of the results in Lounici (2014) and Bunea and Xiao (2015) to weak data dependence.

Admittedly, it is still unclear if Inequality (2.1) could be further improved under the given conditions. Recently, in a remarkable series of papers (Koltchinskii and Lounici, 2017a, b, c), Koltchinskii and Lounici showed that, for subgaussian independent data, the extra multiplicative term on the right-hand side of Inequality (2.1) could be removed. The proof rests on Talagrand’s majorizing measures (Talagrand, 2014) and a corresponding maximal inequality due to Mendelson (2010). In the most general case, to the authors’ knowledge, it is still unknown if Talagrand’s approach could be extended to weakly dependent data, although we conjecture that, under stronger temporal dependence (e.g., geometrically -mixing) conditions, it is possible to recover Koltchinskii and Lounici’s result without resorting to the matrix Bernstein inequality in the proof of Theorem 2.

Nevertheless, we make a first step towards eliminating these logarithmic terms via the following theorem. It shows, when assuming a Gaussian sequence is observed, one could further tighten the upper bound in Inequality (2.1) by removing all logarithm factors. The obtained bound is thus tight in view of Theorem 2 in Lounici (2014) and Theorem 4 in Koltchinskii and Lounici (2017a).

Let be a stationary mean-zero Gaussian sequence that satisfies Assumptions 2-3 with , , and . Then, for any integer and ,

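$$
\mathbb{E}\|\hat{\boldsymbol{\Sigma}}_m-\boldsymbol{\Sigma}_m\| \le C\|\boldsymbol{\Sigma}_0\|\left(\sqrt{\frac{r(\boldsymbol{\Sigma}_0)}{n-m}}+\frac{r(\boldsymbol{\Sigma}_0)}{n-m}\right)
$$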

for some constant only depending on .

In a related track of studies, Bai and Yin (1993), Srivastava and Vershynin (2013), Mendelson and Paouris (2014), and Tikhomirov (2017), among many others, explored the optimal scaling requirement in approximating a large covariance matrix for heavy-tailed data. For instance, for i.i.d. data and as is identity, Bai and Yin (1993) showed that will converge to zero in probability as long as and 4-th moments exist. Some recent developments further strengthen the moment requirement. These results cannot be compared to ours. In particular, our analysis is focused on characterizing the role of the “effective rank”, a quantity with strong statistical implications and a feature that cannot be captured using these alternative procedures.

## 3 Applications

In this section, we examine the validity of Assumptions 1-4 in Section 2 under three models, a stable VAR(d) model, a model proposed by Banna et al. (2016), and an ARCH-type model.

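We first consider $\{\boldsymbol{Y}_t\}_{t\in\mathbb{Z}}$, a random sequence generated from a VAR($d$) model, i.e.,

$$
\boldsymbol{Y}_t = \mathbf{A}_1\boldsymbol{Y}_{t-1}+\cdots+\mathbf{A}_d\boldsymbol{Y}_{t-d}+\boldsymbol{E}_t,
$$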

where is a sequence of independent vectors such that for all and , for some universal constant . In addition, assume for some universal positive constant , for all , and , where are some universal constants.

Under these conditions, we have the following theorem. The above satisfies Assumptions 1-4 with

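$$
\gamma_1 = C(\kappa_*/\kappa_1)\big(\|\bar{\mathbf{A}}\|/\rho_1\big)^{K},\quad \gamma_2 = \log(\rho_1^{-1}),\quad \gamma_3 = C'd\,\big(\|\bar{\mathbf{A}}\|/\rho_1\big)^{K},\quad \gamma_4 = \log(\rho_1^{-1}).
$$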

Here we denote

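$$
\bar{\mathbf{A}} := \begin{bmatrix}
a_1 & a_2 & \cdots & a_{d-1} & a_d\\
1 & 0 & \cdots & 0 & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & \cdots & 1 & 0
\end{bmatrix},
$$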

is a universal constant such that whose existence is guaranteed by the assumption that (cf. Lemma 4.4 in Section 4), is some constant only depending on , and are some constants only depending on .
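The companion-type matrix $\bar{\mathbf{A}}$ is easy to form and inspect numerically. The sketch below builds it from scalars $a_1,\ldots,a_d$ (which we read as bounds on the coefficient matrix norms; this reading is an assumption, since the condition is not fully legible in the text) and computes its spectral radius. Stability is often stated as a spectral radius below one, and we presume the stability condition here is of that type. Function names are illustrative.

```python
import numpy as np


def companion(a) -> np.ndarray:
    """d x d companion matrix: first row (a_1, ..., a_d), ones on the subdiagonal."""
    a = np.asarray(a, dtype=float)
    d = a.size
    A_bar = np.zeros((d, d))
    A_bar[0, :] = a
    A_bar[1:, :-1] = np.eye(d - 1)  # shifts the state down by one lag
    return A_bar


def spectral_radius(M: np.ndarray) -> float:
    """Largest eigenvalue modulus of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(M))))


A_bar = companion([0.5, 0.2])
rho = spectral_radius(A_bar)  # root of lambda^2 - 0.5 lambda - 0.2 = 0 with largest modulus
```

Any $\rho_1$ strictly between `rho` and 1 then gives a geometric rate $\log(\rho_1^{-1})$ of the form appearing in $\gamma_2$ and $\gamma_4$.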

We next consider the following time series generation scheme, whose corresponding matrix version has been considered by Banna, Merlevède, and Youssef (Banna et al., 2016). In detail, let $\{\boldsymbol{Y}_t\}_{t\in\mathbb{Z}}$ be a random sequence generated by

$$
\boldsymbol{Y}_t = W_t\boldsymbol{E}_t,
$$

where is a sequence of independent random vectors independent of such that for all and , for some universal constant . In addition, we assume

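$$
\sup_{t\in\mathbb{Z}}\sup_{\boldsymbol{u}\in\mathbb{S}^{p-1}}\|\boldsymbol{u}^{\mathsf{T}}\boldsymbol{E}_t\|_{\psi_2} \le \kappa_1' \quad\text{and}\quad \sup_{t\in\mathbb{Z}}\sup_{\boldsymbol{v}\in\bar{\mathbb{S}}^{p-1}}\|\boldsymbol{v}^{\mathsf{T}}\boldsymbol{E}_t\|_{\psi_2} \le \kappa_*'
$$

for some constants $\kappa_1'$ and $\kappa_*'$, $\{W_t\}_{t\in\mathbb{Z}}$ is a sequence of uniformly bounded $\tau$-mixing random variables such that , and

$$
\tau\big(k;\{W_t\}_{t\in\mathbb{Z}},|\cdot|\big) \le \kappa_W\gamma_5\exp\{-\gamma_6(k-1)\}
$$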

for some constants .

The above satisfies Assumptions 1-4 with

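$$
\gamma_1 = C\kappa_*'\kappa_W\gamma_5^{\frac{1}{1+\epsilon}}/\kappa_1,\quad \gamma_2 = \gamma_6/(1+\epsilon),\quad \gamma_3 = C'\kappa_1'\kappa_W\gamma_5^{\frac{1}{1+\epsilon}}/\kappa_1,\quad \gamma_4 = \gamma_6/(1+\epsilon)
$$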

for some constants only depending on .

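Lastly, we consider a vector-valued ARCH model with $\{\boldsymbol{Y}_t\}_{t\in\mathbb{Z}}$ being a random sequence generated by

$$
\boldsymbol{Y}_t = \mathbf{A}\boldsymbol{Y}_{t-1}+H(\boldsymbol{Y}_{t-1})\boldsymbol{E}_t,
$$

where $H(\cdot)$ is a matrix-valued function and $\{\boldsymbol{E}_t\}_{t\in\mathbb{Z}}$ is a sequence of independent random vectors such that

$$
\sup_{t\in\mathbb{Z}}\sup_{\boldsymbol{u}\in\mathbb{S}^{p-1}}\|\boldsymbol{u}^{\mathsf{T}}\boldsymbol{E}_t\|_{\psi_2} \le \kappa_1' \quad\text{and}\quad \sup_{t\in\mathbb{Z}}\sup_{\boldsymbol{v}\in\bar{\mathbb{S}}^{p-1}}\|\boldsymbol{v}^{\mathsf{T}}\boldsymbol{E}_t\|_{\psi_2} \le \kappa_*'
$$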

for some constants . Assume further that and the function satisfies

$$
\sup_{\boldsymbol{u},\boldsymbol{v}\in\mathbb{R}^p}\|H(\boldsymbol{u})-H(\boldsymbol{v})\| \le \frac{a_2}{\kappa_*'}\|\boldsymbol{u}-\boldsymbol{v}\|_2
$$

for some universal constant such that .

If the above satisfies Assumption 1, it satisfies Assumptions 2-3 with

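$$
\gamma_1 = C\kappa_*/\kappa_1,\quad \gamma_2 = -\log(a_1+a_2),\quad \gamma_3 = C'\max\Big(\frac{\kappa_*\kappa_1'}{\kappa_1\kappa_*'},\,1\Big),\quad \gamma_4 = \log\{(a_1+a_2)^{-1}\}
$$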

for some constants only depending on . If we further assume the above to be a stationary sequence and for some universal constant , then satisfies Assumption 1.

## 4 Proofs

### 4.1 Proof of Theorem 2

###### Proof of Theorem 2.

The proof depends mainly on the following tail probability bound of deviation of the sample covariance from its mean.

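Let be a sequence of random vectors satisfying Assumptions 1-3. For any integer , integer and real number , define

$$
M_\delta := C\max\Big\{\Big(\frac{\kappa_*}{\kappa_1}\Big)^2\log\frac{n-m}{\delta},\ \Big(\frac{\kappa_*}{\kappa_1}\Big)^2,\ \frac{2\kappa_*\gamma_1}{\kappa_1}\Big\}.
$$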

Then for any ,

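$$
\mathbb{P}\Big[\|\hat{\boldsymbol{\Sigma}}_m-\mathbb{E}\hat{\boldsymbol{\Sigma}}_m\| \ge \kappa_1^2\big\{x+\sqrt{\delta/(n-m)}\big\}\Big] \le 2p\exp\Big\{-\frac{C'(n-m)^2x^2}{A_1(n-m)+A_2M_\delta^2+A_3(n-m)xM_\delta}\Big\}+\delta,
$$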

with

 A1:={κ∗γ1/κ1+(κ∗/κ1)2(γ3+2m+1)+2m+1}1−exp{−min(5+ϵ6ϵ+10γ2,γ4)},A2:=4532γ2, A3:=2log(n−m)log2max{1,8m+48log(n−m)pγ2}

for some constants only depending on .

Without loss of generality, let . Taking , for some , , and in Proposition 4.1, we obtain

 P(\vvvert^Σ0−E^Σ0\vvvert≥C1κ21√r∗logepnt)≤2pexp[−C2(logep)t/{log(√r∗logepnt)}21+r∗(logn)2n+√r∗logepnt(lognp)3]+x−γ

for some constants only depending on .

If , we have

 E\vvvert^Σ0−E^Σ0\vvvert2(C1κ21√r∗logepn)2≤ 1+r∗(logn)2n+∫{1+r∗(logn)2n}2r∗logep(lognp)6n1+r∗(logn)2n2pexp[−C2(logep)t/{log(√r∗logepnt)}21+r∗(logn)2n]dt +∫∞{1+r∗(logn)2n}2r∗logep(lognp)6n2pexp[−C2(logep)√t/{log(√r∗logepnt)}2√r∗logep(lognp)6n]dt ≤ C3(1+r∗(logn)2n+r∗logep(lognp)6n).

This gives that

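$$
\mathbb{E}\|\hat{\boldsymbol{\Sigma}}_0-\mathbb{E}\hat{\boldsymbol{\Sigma}}_0\|^2 \le C_4\kappa_1^4\Big\{\frac{r_*\log(ep)}{n}+\frac{r_*^2(\log ep)^2(\log np)^6}{n^2}\Big\}.
$$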

On the other hand, if ,

 E\vvvert^Σ0−E^Σ0\vvvert2(C1κ21√r∗logepn)2≤ r∗logep(lognp)6n+∫∞r∗logep(lognp)6n2pexp[−C2(logep)√t/{log(√r∗logepnt)}2√r∗logep(lognp)6n]dt ≤ C5r∗logep(lognp)6n.

This renders

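$$
\mathbb{E}\|\hat{\boldsymbol{\Sigma}}_0-\mathbb{E}\hat{\boldsymbol{\Sigma}}_0\|^2 \le C_5\kappa_1^4\Big\{\frac{r_*^2(\log ep)^2(\log np)^6}{n^2}\Big\}.
$$

Combining the two cases gives us the final result by using the simple fact that . This completes the proof of the first part of Theorem 2.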

Notice that under Assumptions 1, 4, zero-mean, and stationarity, we have and . Thus plugging in Theorem 2 finishes the proof. ∎

Now we prove Proposition 4.1 under Assumptions 1-3. In the proof, the cases for covariance and autocovariance matrices are treated separately. In addition, the proof depends on a Bernstein-type inequality for $\tau$-mixing random matrices and some related lemmas, whose proofs are presented later.

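Given a sequence of random vectors $\{\boldsymbol{Y}_t\}_{t\in\mathbb{Z}}$, denote $\mathbf{X}_t := \boldsymbol{Y}_t\boldsymbol{Y}_t^{\mathsf{T}}$ for all $t$. Then for any constant $M>0$, we introduce the following “truncated” version of $\mathbf{X}_t$:

$$
\mathbf{X}_t^{M} := \frac{M\wedge\|\mathbf{X}_t\|}{\|\mathbf{X}_t\|}\,\mathbf{X}_t,
$$

where $a\wedge b := \min(a,b)$ for any two real numbers $a$ and $b$.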

For any integer , we denote for all . For the sake of clarification, the superscript “” is dropped when no confusion is possible. Then the truncated version is

$$
\mathbf{Z}_t^{M} := \frac{M\wedge\|\mathbf{Z}_t\|}{\|\mathbf{Z}_t\|}\,\mathbf{Z}_t
$$

for any .
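The truncation step can be written in a few lines. The sketch below rescales a matrix so that its spectral norm is at most a level `M`, leaving it unchanged when the norm is already below `M`; this is exactly the map $\mathbf{X}\mapsto (M\wedge\|\mathbf{X}\|)/\|\mathbf{X}\|\cdot\mathbf{X}$. The function name and the convention of returning a zero matrix unchanged are our own choices.

```python
import numpy as np


def truncate(X: np.ndarray, M: float) -> np.ndarray:
    """Return (min(M, ||X||) / ||X||) * X, with ||.|| the spectral norm."""
    X = np.asarray(X, dtype=float)
    norm = np.linalg.norm(X, ord=2)
    if norm <= M:  # nothing to truncate (this also covers X = 0)
        return X.copy()
    return (M / norm) * X


X = np.diag([3.0, 1.0])
X_trunc = truncate(X, 2.0)  # rescaled so that the spectral norm equals 2
```

Because the map only rescales, the truncated matrix stays symmetric and keeps the same eigenvectors as the original matrix.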

We further define the “variances” for $\{\mathbf{X}_t^{M}\}$ and $\{\mathbf{Z}_t^{M}\}$ as

 ν2\XbM ν2\ZbM :=supK⊆{1,…,n−m}1card(K)∥∥∥E(∑i∈K\ZbMi−E\ZbMi)2∥∥∥.

Here $\lambda_{\max}(\mathbf{X})$ and $\lambda_{\min}(\mathbf{X})$ denote the largest and smallest eigenvalues of $\mathbf{X}$, respectively.

###### Proof of Proposition 4.1.

We first assume . We consider two cases.

Case I: When , is a sequence of symmetric random matrices. We have,

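$$
\begin{aligned}
\mathbb{P}\Big\{\frac1n\Big\|\sum_{i=1}^{n}(\mathbf{X}_i-\mathbb{E}\mathbf{X}_i)\Big\| \ge x\Big\}
&= \mathbb{P}\Big\{\frac1n\Big\|\sum_{i=1}^{n}(\mathbf{X}_i-\mathbf{X}_i^{M}+\mathbf{X}_i^{M}-\mathbb{E}\mathbf{X}_i^{M}+\mathbb{E}\mathbf{X}_i^{M}-\mathbb{E}\mathbf{X}_i)\Big\| \ge x\Big\}\\
&\le \mathbb{P}\Big\{\Big\|\sum_{i=1}^{n}(\mathbf{X}_i^{M}-\mathbb{E}\mathbf{X}_i^{M})\Big\| \ge nx-\sum_{i=1}^{n}\|\mathbb{E}\mathbf{X}_i^{M}-\mathbb{E}\mathbf{X}_i\|\Big\}+\sum_{i=1}^{n}\mathbb{P}(\mathbf{X}_i\ne\mathbf{X}_i^{M})\\
&\le \mathbb{P}\Big[\lambda_{\max}\Big\{\sum_{i=1}^{n}(\mathbf{X}_i^{M}-\mathbb{E}\mathbf{X}_i^{M})\Big\} \ge nx-\sum_{i=1}^{n}\|\mathbb{E}\mathbf{X}_i^{M}-\mathbb{E}\mathbf{X}_i\|\Big]\\
&\quad+\mathbb{P}\Big[\lambda_{\min}\Big\{\sum_{i=1}^{n}(\mathbf{X}_i^{M}-\mathbb{E}\mathbf{X}_i^{M})\Big\} \le -nx+\sum_{i=1}^{n}\|\mathbb{E}\mathbf{X}_i^{M}-\mathbb{E}\mathbf{X}_i\|\Big]+\sum_{i=1}^{n}\mathbb{P}(\mathbf{X}_i\ne\mathbf{X}_i^{M}).
\end{aligned}
\tag{4.1}
$$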

We first show that the difference in expectation between the “truncated” and original one can be controlled with the chosen truncation level . For this, we need the following lemma.

Let be a sequence of -dimensional random vectors under Assumption 1. Then for all and for all ,

$$
\mathbb{P}\big\{\|\boldsymbol{Y}_t\|_2^2 \ge 2\kappa_*^2+8\kappa_*^2(x+\sqrt{x})\big\} \le \exp(-Cx)
$$

for some arbitrary constant .

By applying Lemma 4.1, we obtain that for all ,

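$$
\begin{aligned}
\|\mathbb{E}\mathbf{X}_i^{M_\delta}-\mathbb{E}\mathbf{X}_i\|
&= \Big\|\mathbb{E}\Big(1-\frac{M_\delta}{\|\mathbf{X}_i\|}\Big)\mathbf{X}_i\mathbb{1}\{\|\mathbf{X}_i\|>M_\delta\}\Big\|\\
&\le \sup_{\boldsymbol{u},\boldsymbol{v}\in\mathbb{S}^{p-1}}\mathbb{E}|\boldsymbol{u}^{\mathsf{T}}\mathbf{X}_i\boldsymbol{v}|\,\mathbb{1}\{\|\mathbf{X}_i\|>M_\delta\}\\
&\le \sup_{\boldsymbol{u},\boldsymbol{v}\in\mathbb{S}^{p-1}}\big\{\mathbb{E}(\boldsymbol{u}^{\mathsf{T}}\boldsymbol{Y}_i\boldsymbol{Y}_i^{\mathsf{T}}\boldsymbol{v})^2\big\}^{1/2}\big\{\mathbb{P}(\|\mathbf{X}_i\|>M_\delta)\big\}^{1/2}\\
&\le \sqrt{\delta/n},
\end{aligned}
$$

where the last line follows from Assumption 1, Lemma 4.1, and the chosen $M_\delta$.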

The second step depends heavily on a Bernstein-type inequality for $\tau$-mixing random matrices. The theorem slightly extends the main theorem of Banna et al. (2016), in which the random matrix sequence is assumed to be $\beta$-mixing. Its proof is relegated to the Appendix.

Consider a sequence of real, mean-zero, symmetric random matrices $\{\mathbf{X}_t\}_{t\in\mathbb{Z}}$ with $\|\mathbf{X}_t\|\le M$ for some positive constant $M$. In addition, assume that this sequence is $\tau$-mixing (see Appendix Section A.1 for a detailed introduction to the $\tau$-mixing coefficient) with geometric decay, i.e.,

$$
\tau\big(k;\{\mathbf{X}_t\}_{t\in\mathbb{Z}},\|\cdot\|\big) \le M\psi_1\exp\{-\psi_2(k-1)\}
$$

for some constants . Denote . Then for any and any integer , we have

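$$
\mathbb{P}\Big\{\lambda_{\max}\Big(\sum_{i=1}^{n}\mathbf{X}_i\Big) \ge x\Big\} \le p\exp\Big\{-\frac{x^2}{8(15^2n\nu^2+60^2M^2/\psi_2)+2xM\tilde{\psi}(\tilde{\psi}_1,\psi_2,n,p)}\Big\},
$$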

where

In order to apply Theorem 4.1, we need the following two lemmas. Lemma 4.1 shows that the sequence of “truncated” matrices under Assumptions 1-2 is a $\tau$-mixing random sequence with geometric decay. Lemma 4.1 calculates the upper bound for the $\nu^2$ term in Theorem 4.1 for .

Let be a sequence of random vectors under Assumptions 1-2. Then , , , and are all $\tau$-mixing random sequences. Moreover,

 τ(k;{\XbMt}t∈Z,\vvvert⋅\vvvert)≤Cγ1κ1κ