## 1 Introduction

The past decade has seen considerable progress on approaches to structured parameter estimation, especially in the linear regression setting, where one considers regularized estimation problems of the form:

(1) |

where , , such that and , is the training set of independently and identically distributed (i.i.d.) samples, is a regularization parameter, and denotes a suitable norm [32, 41, 38]. Specific choices of lead to certain types of structured parameters to be estimated. For example, the decomposable norm yields Lasso, estimating sparse parameters, gives Group Lasso, estimating group sparse parameters, and , the ordered weighted norm (OWL) [7], gives sorted -penalized estimator, clustering correlated regression parameters [12]. Non-decomposable norms, such as -support norm [2] or overlapping group sparsity norm [15] can be used to uncover more complicated model structures. Theoretical analysis of such models, including sample complexity and non-asymptotic bounds on the estimation error rely on the design matrix , usually assumed (sub)-Gaussian with independent rows, and the specific norm under consideration [25, 26]. Recent work has generalized such estimators to work with any norm [23, 4] with i.i.d. rows in .

The focus of the current paper is on structured estimation in vector auto-regressive (VAR) models [20], arguably the most widely used family of multivariate time series models. VAR models have been applied widely, ranging from describing the behavior of economic and financial time series [33] to modeling the dynamical systems [18] and estimating brain function connectivity [34], among others. A VAR model of order is defined as

(2) |

where denotes a multivariate time series, are the parameters of the model, and is the order of the model. In this work, we assume that the noise

follows a Gaussian distribution,

, with and , for . The VAR process is assumed to be stable and stationary [20], while the noise covariance matrixis assumed to be positive definite with bounded largest eigenvalue, i.e.,

and .In the current context, the parameters are assumed to be structured, in the sense of having low values according to a suitable norm . We consider a general setting where any norm can be applied to the rows of , allowing the possibility of different norms being applies to different rows of , and different norms for different parameter matrices . Choosing -norm for all rows and all parameter matrices is a simple special case of our setting. We discuss certain other choices in Section 3.1, and discuss related results in Section 5. In order to estimate the parameters, one can consider regularized estimators of the form (1), where and correspond to in the VAR setting. Unfortunately, unlike in (1), the are far from independent, having strong dependence across time and correlated across dimensions. As a result, existing results from the rich literature on regularized estimators for structured problems [39, 36, 21] cannot be directly applied to get sample complexities and estimation error bounds in VAR models.

The rest of the paper is organized as follows. In Section 2 we review the related work on structured VAR estimation. In Section 3 we present the estimation problem for the VAR model and in Section 4 we establish the main results of our analysis on the VAR estimation guarantees. We present experimental results in Section 5 and conclude in Section 6. The proofs and other detailed discussions can be found in Appendices A, B, C and D.

## 2 Related Work

In recent literature, the problem of estimating structured VAR models has been considered for the special case of norm. [14] analyzed a constrained estimator based on the Dantzig selector [8], and established the recovery results for the special case of norm. [28] considered a regularized VAR estimation problem under Lasso and Group Lasso penalties and derived oracle inequalities for the prediction error and estimation accuracy. However, their analysis is for the case when the dimensionality of the problem is fixed with respect to the sample size. Moreover, they employed an assumption on the dependency structure in the VAR, thus limiting the sample correlation issues mentioned earlier.

The work of [16] studied regularized Lasso-based estimator while allowing for problem dimensionality to grow with sample size, utilizing suitable martingale concentration inequalities to analyze dependency structure. [19] considered VAR estimation for first order models () assuming , and the analysis was not extended to the general case of . In recent work, [5] considered a VAR Lasso estimator and established the sample complexity and error bounds by building on the prior work of [19]. Their approach exploits the spectral properties of a general VAR model of order , providing insights on the dependency structure of the VAR process. However, in line with the existing literature, the analysis was tailored to the special case of norm, thus limiting its generality.

Compared to the existing literature, our results are substantially more general since the results and analysis apply to any norm . One may wonder—given the popularity of norm, why worry about other norms? Over the past decade, considerable effort has been devoted to generalize norm based results to other norms [23, 10, 4, 12]. Our work obviates the need for a similar exercise for VAR models. Further, some of these norms have found key niche in specific application areas e.g., [40, 37]. From a technical perspective, one may also wonder—once we have the result for norm, why should not the extension to other norms be straightforward? A key technical aspect of the estimation error analysis boils down to getting sharp concentration bounds for , where is the dual norm of , is the design matrix, and is the noise [4]. For the special case of , the dual norm is , and one can use union bound to get the required concentration. In fact, this is exactly how the analysis in [5] was done. For general norms, the union bound is inapplicable. Our analysis is based on a considerably more powerful tool, generic chaining [31], yielding an analysis applicable to any norm, and producing results in terms of geometric properties, such as Gaussian widths [17], of sets related to the norm. Results for specific norms can then be obtained by plugging in suitable bounds on the Gaussian widths [9, 11]. We illustrate the idea by recovering known bounds for Lasso and Group Lasso, and obtaining new results for Spare Group Lasso and OWL norms. Finally, in terms of the core technical analysis, the application of generic chaining to the VAR estimation setting is not straightforward. In the VAR setting, generic chaining has to consider a stochastic process derived from sub-exponential martingale difference sequence (MDS). We first generalize the classical Azuma-Hoeffding inequality applicable to sub-Gaussian MDSs to get an Azuma-Bernstein inequality for sub-exponential MDSs. Further, we use suitable representations of Talagrand’s -functions [31] in the context of generic chaining to obtain bounds on in terms of the Gaussian width of the unit norm ball . Our estimation error bounds in the VAR setting are exactly of the same order as Lasso-type models in the i.i.d. setting implying, surprisingly, that the strong temporal dependency in the VAR setting has no adverse effect on the estimation.

## 3 Structured VAR Model

In this section we formulate structured VAR estimation problem and discuss its properties, which are essential in characterizing sample complexity and error bounds.

### 3.1 Regularized Estimator

To estimate the parameters of the VAR model, we transform the model in (2) into the form suitable for regularized estimator (1). Let denote the samples generated by the stable VAR model in (2), then stacking them together we obtain

which can also be compactly written as

(3) |

where , , , and for . Vectorizing (column-wise) each matrix in (3), we get

where , , , , and is the Kronecker product. The covariance matrix of the noise is now . Consequently, the regularized estimator takes the form

(4) |

where can be any vector norm, separable along the rows of matrices . Specifically, if we denote and as the row of matrix for , then our assumption is equivalent to

(5) |

To reduce clutter and without loss of generality, we assume the norm to be the same for each row . Since the analysis decouples across rows, it is straightforward to extend our analysis to the case when a different norm is used for each row of , e.g., for row one, for row two, -support norm [2] for row three, etc. Observe that within a row, the norm need not be decomposable across columns.

The main difference between the estimation problem in (1) and the formulation in (4) is the strong dependence between the samples , violating the i.i.d. assumption on the data . In particular, this leads to the correlations between the rows and columns of matrix (and consequently of ). To deal with such dependencies, following [5], we utilize the spectral representation of the autocovariance of VAR models to control the dependencies in matrix .

### 3.2 Stability of VAR Model

Since VAR models are (linear) dynamical systems, for the analysis we need to establish conditions under which the VAR model (2) is stable, i.e., the time-series process does not diverge over time. For understanding stability, it is convenient to rewrite VAR model of order in (2) as an equivalent VAR model of order

(6) |

where . Therefore, VAR process is stable if all the eigenvalues of satisfy for , . Equivalently, if expressed in terms of original parameters , stability is satisfied if (see Appendix A for more details).

### 3.3 Properties of Data Matrix

In what follows, we analyze the covariance structure of matrix in (3) using spectral properties of VAR model (see Appendix B

for additional details). The results will then be used in establishing the high probability bounds for the estimation guarantees in problem (

4).Define any row of as , . Since we assumed that , it follows that each row is distributed as , where the covariance matrix is the same for all

(7) |

where . It turns out that since is a block-Toeplitz matrix, its eigenvalues can be bounded as (see [13])

(8) |

where denotes the -th eigenvalue of a matrix and for ,

, is the spectral density, i.e., a Fourier transform of the autocovariance matrix

. The advantage of utilizing spectral density is that it has a closed form expression (see Section 9.4 of [24])where denotes a Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound

(9) |

where we defined for

(10) |

see Appendix B.1 for additional details.

In establishing high probability bounds we will also need information about a vector for any , . Since each element , it follows that with a covariance matrix . It can be shown (see Appendix B.3 for more details) that can be written as

(11) |

where for which is obtained from matrix by stacking all the rows in a single vector, i.e, . In order to bound eigenvalues of (and consequently of ), observe that can be viewed as a vector obtained by stacking outputs from VAR model in (6). Similarly as in (8), if we denote the spectral density of the VAR process in (6) as , , where , then we can write

The closed form expression of spectral density is

where is the covariance matrix of a noise vector and are as defined in expression (6). Thus, an upper bound on can be obtained as , where we defined for

(12) |

Referring back to covariance matrix in (11), we get

(13) |

We note that for a general VAR model, there might not exist closed-form expressions for and . However, for some special cases there are results establishing the bounds on these quantities (e.g., see Proposition 2.2 in [5]).

## 4 Regularized Estimation Guarantees

Denote by the error between the solution of optimization problem (4) and , the true value of the parameter. The focus of our work is to determine conditions under which the optimization problem in (4) has guarantees on the accuracy of the obtained solution, i.e., the error term is bounded: for some known . To establish such conditions, we utilize the framework of [4]. Specifically, estimation error analysis is based on the following known results adapted to our settings. The first one characterizes the restricted error set , where the error belongs.

###### Lemma 4.1

Assume that

(14) |

for some constant , where is a dual form of the vector norm , which is defined as , for , where and . Then the error vector belongs to the set

(15) |

The second condition in [4] establishes the upper bound on the estimation error.

###### Lemma 4.2

Assume that the restricted eigenvalue (RE) condition holds

(16) |

for and some constant , where is a cone of the error set, then

(17) |

where is a norm compatibility constant, defined as .

Note that the above error bound is deterministic, i.e., if (14) and (16) hold, then the error satisfies the upper bound in (17). However, the results are defined in terms of the quantities, involving and , which are random. Therefore, in the following we establish high probability bounds on the regularization parameter in (14) and RE condition in (16).

### 4.1 High Probability Bounds

In this Section we present the main results of our work, followed by the discussion on their properties and illustrating some special cases based on popular Lasso and Group Lasso regularization norms. In Section 4.4 we will present the main ideas of our proof technique, with all the details delegated to the Appendices C and D.

To establish lower bound on the regularization parameter , we derive an upper bound on , for some , which will establish the required relationship .

###### Theorem 4.3

Let , and define to be a Gaussian width of set for . For any and with probability at least we can establish that

where , and are positive constants.

To establish restricted eigenvalue condition, we will show that , for some and then set .

###### Theorem 4.4

Let , where is a unit sphere. The error set is defined as , for , , and , for is of size , and , for . The set is a part of the decomposition in due to the assumption on the row-wise separability of norm in (5). Also define to be a Gaussian width of set for and . Then with probability at least , for any

where and , , are positive constants, and and are defined in (9) and (13).

### 4.2 Discussion

From Theorem 4.4, we can choose and set and since must be satisfied, we can establish a lower bound on the number of samples

(18) |

Examining this bound and using (9) and (13), we can conclude that the number of samples needed to satisfy the restricted eigenvalue condition is smaller if and are larger and and are smaller. In turn, this means that matrices and in (10) and (12) must be well conditioned and the VAR process is stable, with eigenvalues well inside the unit circle (see Section 3.2). Alternatively, we can also understand (18) as showing that large values of and small values of indicate stronger dependency in the data, thus requiring more samples for the RE conditions to hold with high probability.

Analyzing Theorems 4.3 and 4.4 we can interpret the established results as follows. As the size and dimensionality , and of the problem increase, we emphasize the scale of the results and use the order notations to denote the constants. Select a number of samples at least and let the regularization parameter satisfy . With high probability then the restricted eigenvalue condition for holds, so that is a positive constant. Moreover, the norm of the estimation error in optimization problem (4) is bounded by . Note that the norm compatibility constant is assumed to be the same for all , which follows from our assumption in (5).

Consider now Theorem 4.3 and the bound on the regularization parameter . As the dimensionality of the problem and grows and the number of samples increases, the first term will dominate the second one . This can be seen by computing for which the two terms become equal , which happens at . Therefore, we can rewrite our results as follows: once the restricted eigenvalue condition holds and , the error norm is upper-bounded by .

### 4.3 Special Cases

While the presented results are valid for any norm , separable along the rows of , it is instructive to specialize our analysis to a few popular regularization choices, such as and Group Lasso, Sparse Group Lasso and OWL norms.

#### 4.3.1 Lasso

To establish results for norm, we assume that the parameter is -sparse, which in our case is meant to represent the largest number of non-zero elements in any , , i.e., the combined -th rows of each , . Since is decomposable, it can be shown that . Next, since , then using Lemma in [4] and Gaussian width results in [9], we can establish that . Therefore, based on Theorem 4.3 and the discussion at the end of Section 4.2, the bound on the regularization parameter takes the form . Hence, the estimation error is bounded by as long as .

#### 4.3.2 Group Lasso

To establish results for Group norm, we assume that for each , the vector can be partitioned into a set of disjoint groups, , with the size of the largest group . Group Lasso norm is defined as . We assume that the parameter is -group-sparse, which means that the largest number of non-zero groups in any , is . Since Group norm is decomposable, as was established in [23], it can be shown that . Similarly as in the Lasso case, using Lemma in [4], we get . The bound on the takes the form . Combining these derivations, we obtain the bound for .

#### 4.3.3 Sparse Group Lasso

Similarly as in Section 4.3.2, we assume that we have disjoint groups of size at most . The Sparse Group Lasso norm enforces sparsity not only across but also within the groups and is defined as , where is a parameter which regulates a convex combination of Lasso and Group Lasso penalties. Note that since , it follows that . As a result, for , so that and thus , according to Section 4.3.2. Assuming is -sparse and -group-sparse and noting that the norm is decomposable, we get . Consequently, the error bound is .

#### 4.3.4 OWL norm

Ordered weighted norm is a recently introduced regularizer and is defined as , where is a predefined non-increasing sequence of weights and is the sequence of absolute values of , ranked in decreasing order. In [11] it was shown that , where is the average of and the norm compatibility constant is . Therefore, based on Theorem 4.3, we get and the estimation error is bounded by .

We note that the bound obtained for Lasso and Group Lasso is similar to the bound obtained in [28, 5, 16]. Moreover, this result is also similar to the works, which dealt with independent observations, e.g., [6, 23], with the difference being the constants, reflecting correlation between the samples, as we discussed in Section 4.2. The explicit bound for Sparse Group Lasso and OWL is a *novel* aspect of our work for the non-asymptotic recovery guarantees for the VAR estimation problem with norm regularization, being just a simple consequence from our more general framework.

### 4.4 Proof Sketch

In this Section we outline the steps of the proof for Theorem 4.3 and 4.4, all the details can be found in Appendix C and D.

#### 4.4.1 Bound on Regularization Parameter

Recall that our objective is to establish for a probabilistic statement that , where for and for in (3). We denote as a column of noise matrix and note that since , then using the row-wise separability assumption in (5) we can split the overall probability statement into parts, which are easier to work with. Thus, our objective would be to establish

(19) |

for , where and .

The overall strategy is to first show that the random variable

has sub-exponential tails. Based on the generic chaining argument, we then use Theorem 1.2.7 from [31] and bound the expectation . Finally, using Theorem 1.2.9 in [31] we establish the high probability bound on concentration of around its mean, i.e., derive the bound in (19).We note that the main difficulty of working with the term is the complicated dependency between and , which is due to the VAR generation process in (3). However, if we write , where and we can interpret this as a summation over martingale difference sequence [20]. This can be easily proven by showing . The latter is true since in the terms and are independent since is independent from for and (see (2)).

To show that has sub-exponential tails, recall that since in (2) is Gaussian, and are independent Gaussian random variables, whose product has sub-exponential tails. Moreover, the sum over sub-exponential martingale difference sequence can be shown to be itself sub-exponential using [27], based on Bernstein-type inequality [35].

#### 4.4.2 Restricted Eigenvalue Condition

To show for all , similarly as before, we split the problem into parts by using row-wise separability assumption of the norm in (5). In particular, denote , where is , then we can represent the original set as a Cartesian product of subsets , i.e., , implying that . Therefore, our objective would be to establish

(20) |

for each , where and we defined , since it will be easier to operate with unit-norm vectors. In the following, to reduce clutter, we drop the index from the notations.

## 5 Experimental Results

In this Section we present the experiments on simulated and real data to demonstrate the obtained theoretical results. In particular, for and Group , Sparse Group and OWL we investigate how error norm and regularization parameter scale as the problem size and change. Moreover, using flight data we also compare the performance of the regularizers in real world scenario.

### 5.1 Synthetic Data

Using synthetically generated datasets we evaluate the obtained theoretical bounds for estimation VAR under Lasso, Sparse Group Lasso, OWL and Group Lasso regularizations.

#### 5.1.1 Lasso

To evaluate the estimation problem with norm, we simulated a first-order VAR process for different values of , , and . Regularization parameter was varied in the range , where is the largest parameter, for which estimation problem (4) produces a zero solution. All the results are shown after averaging across runs.

The results for Lasso are shown in the top row of Figure 1. In particular, in Figure 1. we show for different and for fixed . When is small, the estimation error is large and the results cannot be trusted. However, once , the RE condition in Lemma 4.2 is satisfied and we see a fast decrease of errors for all ’s. In Figure 1. we plot against rescaled sample size . The errors are now closely aligned, confirming results of Section 4.3.1, i.e, .

Finally, in Figures 1. and 1. we show the dependence of optimal (for fixed and , we picked achieving the smallest estimation error) on and . It can be seen that as increases, grows (for fixed ) at the rate similar to . On the other hand, as increases, the selected decreases (for fixed ) at the rate similar to .

#### 5.1.2 Sparse Group Lasso

To evaluate the estimation problem with Sparse Group Lasso norm, we constructed first-order VAR process for the following set of problem sizes , , and . The parameter was set to . Results are shown in Figure 2, top row. Similarly as in main paper, we can see that the errors are scaled by . Moreover, the parameter is decreasing when number of samples increases. On the other hand, as the problem dimension increases, the selected grows at the rate similar to .

#### 5.1.3 Owl

To test the VAR estimation problem under OWL norm we constructed a first-order VAR process with , and . The vector of weights was set to be a monotonically decreasing sequence of numbers in the range . Figure 2, bottom row, shows the results. It can be seen from Figure 2-f that when the errors are plotted against , they become tightly aligned, confirming the bounds established in Section 3.3.4 in the main paper for the error norm. As shown in Figure 2-g,h the selected regularization parameter grows with the problem dimension and decreases with the number of samples

#### 5.1.4 Group Lasso

For Group Lasso the sparsity in rows of was generated in groups, whose number varied as . We set the largest number of non-zero groups in any row as . Results are shown in the bottom row of Figure 1, which have similar flavor as in Lasso case. The difference can be seen in Figure 1., where a close alignment of errors occurs when is now scaled as . Moreover, the selected regularization parameter increases with the number of groups and decreases with .

Lasso | OWL | Group Lasso | Sparse Group Lasso | Ridge |

32.3(6.5) | 32.2(6.6) | 32.7(6.5) | 32.2(6.4) | 33.5(6.1) |

32.7(7.9) | 44.5(15.6) | 75.3(8.4) | 38.4(9.6) | 99.9(0.2) |

for each method (darker dots - stronger dependencies, lighter dots - weaker dependencies). The values in parenthesis denote one standard deviation after averaging the results over 300 flights.

1 | Altitude |
---|---|

2 | Corrected angle of attack |

3 | Brake temperature |

4 | Computed airspeed |

5 | Drift angle |

6 | Engine temperature |

7 | Low rotor speed |

8 | High rotor speed |

9 | Engine oil pressure |

10 | Engine oil quantity |

11 | Engine oil temperature |

12 | Engine pre-cooler outlet temperature |

13 | Fuel mass flow rate |

14 | Lateral acceleration |

15 | Longitudinal acceleration |

16 | Normal acceleration |

17 | Glide slope deviation |

18 | Ground speed |

19 | Localization deviation |

20 | Magnetic heading |

21 | Burner pressure |

22 | Pitch angle |

23 | Roll angle |

24 | HPC exit temperature |

25 | Angle magnitude |

26 | Angle true |

27 | Total fuel quantity |

28 | True heading |

29 | Vertical speed |

Comments

There are no comments yet.