Despite the lack of theoretical investigations of the repeated sampling properties for BVAR models, Bayesian methodology can surely offer important contributions to the high-dimensional VAR model literature, beyond what could be developed in a frequentist framework. One notable such contribution is the construction of posterior distributions over the set of all relative model probabilities. This framework of posterior inference has been widely exploited over the last decade in the high-dimensional linear regression literature, and we anticipate it will see comparable success for high-dimensional VAR models in the near future.
Our constructed EAS methodology allows for such posterior-like inference of relative model probabilities for all graphs, and additionally we provide an algorithm which is self-tuning (i.e., no cross-validation is needed for calibration to data sets). Such Bayesian model selection approaches are very useful for learning important relationships among the various components (univariate time-series) in the VAR model. The EAS methodology is an entirely new perspective on model selection which was originally developed to effectively account for linear dependencies among subsets of covariates in the high-dimensional linear regression setting in Williams and Hannig (2019).
To the best of our knowledge, our established and model selection consistency results are the first of their kind in the BVAR literature. This type of result is sure to be followed by similar results in the high-dimensional BVAR literature, analogous to the emergence of model selection strong consistency results in the high-dimensional Bayesian linear regression literature such as Johnson and Rossell (2012); Narisetty and He (2014); Williams and Hannig (2019).
Further, we demonstrate how to construct an alternative framework for posterior-like inference in the VAR(1) model setting which eliminates prior choice and specification. We avoid the necessity of prior distributions altogether by implementing a generalized fiducial inference (GFI) approach (see Hannig et al. 2016
). And while our model selection consistency results derive from a Gaussian assumption on the VAR(1) model errors, they are actually the first ever results about a fiducial distribution under model misspecification. This is due to the fact that all of the supporting theorems and lemmas we contribute are non-asymptotic, and rely on a collection of explicit fourth moment bounds given in Section3.3. Consequently, as long as the VAR(1) model errors are independent within and across time and there exist bounded fourth moments, our generalized fiducial consistency results (which assume Gaussian data) still hold even if the true data is not Gaussian.
We validate our methods empirically in low and high-dimensional settings on both synthetic and real data, and provide Python code for implementing our algorithm. This code, and the workflow for reproducing all numerical results can be found at
Fiducial inference has a long history, but in the last decade there has been a renewed interest in the topic with a large number of authors contributing fundamental insights (Edlefsen, Liu and Dempster, 2009; Berger, Bernardo and Sun, 2009; Xie and Singh, 2013; Taraldsen and Lindqvist, 2013; Veronese and Melilli, 2015; Martin and Liu, 2015; Schweder and Hjort, 2016; Fraser, 2019). A gentle introduction to technical aspects of GFI is provided in Section 2.
Recent theoretical work on VAR models is largely comprised of considerations of regularized estimation procedures, most notably Basu et al. (2015). The Bayesian literature has not yet caught up. There do exist numerous papers on BVAR methodology, especially in the econometric literature, but on predominantly empirical investigations, see for example Bańbura, Giannone and Reichlin (2010); Korobilis (2013); Giannone, Lenza and Primiceri (2015); Ahelegbey, Billio and Casarin (2016). The primary tool of the BVAR literature has been implementations of the Minnesota (shrinkage) prior and its variants (Litterman, 1986).
It has been found that BVAR with shrinkage priors is effective for large VAR models of economic time-series, but little has been provided in the way of theoretical guarantees (a notable exception is Ghosh, Khare and Michailidis 2018) or even uncertainty quantification of competing model choices (a notable exception is Korobilis 2013). To the best of our knowledge, Ghosh, Khare and Michailidis (2018) is the first in the literature to establish posterior parameter estimation consistency in the “large p large n” BVAR setting with , where is the dimension of the VAR model and
is the number of observed time instances. While their consistency results are about the posterior behavior of the transition matrix coefficients under various prior specifications, our consistency results are about the posterior-like behavior of all relative model probabilities (akin to Bayes factors) under the prior-free GFI framework.
We loosely adopt notation for multivariate time-series from Lütkepohl (2005). The time-series is taken to denote data from a VAR(1) model with no serial correlation, and so is generated as
where and are matrices, is a matrix with for , is a matrix of coefficients, and . Assume is the -dimensional zero vector. Further, let be a set of indices denoting a graph of active components of , and take to be the matrix with active components corresponding to the graph (all other components are zero).
We extend the high-dimensional linear regression EAS methodology developed in Williams and Hannig (2019) to this VAR(1) setting. The idea behind the EAS procedure is to efficiently make inference on the set of graphs, , by discriminating on graphs which contain redundant active components. Our notion of redundancy is defined rigorously by the ‘-function’ given later in (4).
However, the basic intuition is to assign negligible posterior-like probability to all that can be closely approximated, predictively, by a graph containing fewer active components. This can occur for a variety of reasons, namely, correlated time-series in the VAR system of equations, and too small signal-to-noise coefficient magnitudes. For example, suppose with . Then the coefficient matrix is not - if, for instance, for some well-calibrated precision, ,
where is some measure of distance. In this case, predictions from the graph approximate that of within precision, and so is said to contain redundant information.
Note that in finite samples, and particularly high-dimensional, settings with highly-correlated data the EAS framework has the intuition that the oracle graph itself may not be -. In these settings, the EAS methodology re-defines the notion of the ‘true’ graph to be some non-redundant subgraph of the oracle graph, at least non-asymptotically. This idea is important because it suggests that to develop inherently scalable methodology the key may be to re-define the notion of what one should hope to recover from a ‘true’ data generating model in high-dimensional settings. Additional intuition for the EAS methodology is provided in Williams and Hannig (2019) in the context of linear regression.
The remainder of the paper is organized as follows. Section 2 defines the notion of -
as well as constructs the generalized fiducial distribution for the EAS approach, and describes the Markov chain Monte Carlo (MCMC)-based computations. The main theoretical results are presented in Section3, and numerical results are provided in Sections 4 and 5. The majority of the proofs are moved to the supplementary materials.
where , , , , and (as well as seen later) denotes the oracle graph. Here and throughout, the superscript-zero notation denotes the true fixed values of the corresponding quantities. The subscript notation, (or ), refers to the sub-matrix (or sub-vector) with columns (or components) corresponding to the active components given by the index set . The operator transforms an matrix into an vector by stacking columns in descending order, from left to right. For example, . This linear model representation is also more convenient for expressing the likelihood function,
which will be needed later on. For conciseness, the notation is used as shorthand for .
Additional notation used for the remainder of the paper includes the following. For a scalar-valued argument represents the absolute value, but for a set-valued argument it represents the cardinality. The norms and denote the vector and norms, respectively, while for a matrix , and represent the matrix spectral and Frobenius norms, respectively. Additionally, the quantities and
denote the minimum and maximum eigenvalues of a given matrix,, respectively. The notation and refer, respectively, to the probability measure and expectation with respect to the joint generalized fiducial distribution of and . Conversely, the notation and refer, respectively, to the probability measure and expectation associated with the uncertainty from the VAR(1) process, rather than the probability measure for the generalized fiducial distribution of the unknown parameters.
The centerpiece of the EAS model selection approach is a definition of model redundancy, as made rigorous by our notion of - and the -function, presented next. As described in Section 1, the main intuition is that is considered non-redundant, or -, if and only if there does not exist a close fitting graph with strictly fewer active components. However, there are also two additional constraints embedded in the -function for -.
Assume and . A given coefficient matrix , equivalently , for some graph is said to be - if and only if , where
where solves ,
and is the least squares estimator for graph .
To begin to understand the behavior of the -function, first note that
is analogous to a noiseless version of the Dantzig selector (Candes and Tao, 2007) where is the design matrix for the linear model representation (2). One reason to use versus simply is that the former is scale-invariant to the and invariant to orthogonal transformations of the data. Second, note that if contains linearly dependent columns, then for any coefficients , the linear prediction can be exactly recovered by (since ). This immediately implies that since is an matrix, for all with , by definition. For high-dimensional settings where , then by construction, considering only - graphs reduces the model selection problem from candidate graphs to only . This fact makes the EAS methodology inherently scalable.
The quantities , , and will now be described in alphabetical order. The component, , in the -function concentrates the distribution of to only allow for stable VAR(1) models with . In practice, since is typically not known the constraint is replaced by . The second component in the -function is the expression , where for is understood as the residual sum-of-squares (RSS) for the component of the VAR system. The basic idea is that the data-dependent quantity should be calibrated to which corresponds to the oracle graph, and so any graphs which have a better fit than the oracle will be excluded from consideration via the -function. Accordingly, this device is designed to eliminate graphs which over-fit the data, and is important for establishing our asymptotic consistency results. However, in practice can be set to a small value and left alone; more will be said about this in Section 4 with the numerical results.
For which have full column rank, the degree to which the features associated with graph are redundant depends on the correlations between the components of the VAR model, the distribution of the coefficients (i.e., the transition matrix ), scale matrix components , and the specified level of precision, . Our proposed default choice of , formulated from theoretical investigations (based on the Gaussian contemporaneous errors assumption), is for some ,
There are predominantly two components to ; the quantity is particularly calibrated to the observed data since it originates from a tight concentration inequality for the transition matrix , and the term is necessary asymptotically for managing the accumulating data and rapidly growing number of candidate graphs as . The basic idea is that will always contribute, and the remaining terms will contribute for sufficiently large or for which exceeds the number of active components in the oracle model. However, as is demonstrated in Section 4, for observed data is so well-calibrated that it suffices to set , and thus also eliminating the need for a tuning parameter. More details about are given in Section 3.2, particularly its expectation in (12).
With the EAS methodology now developed a framework of statistical inference is required for implementing it. A suitable such framework is GFI because it will allow us to construct posterior-like inference over the candidate graphs without having to specify any prior distributions. The intuition for GFI is to begin with a data generating equation such as (2) and invert the equation on the data to solve for the unknown parameters. The resulting quantity is defined as the generalized fiducial distribution of the unknown parameters. Precise details for the construction of this approach are provided in Hannig et al. (2016)
. The generalized fiducial probability density function for the parameters in the VAR(1) model (2) has the form
where the multiplication by the -function appears as an infusion of the EAS methodology into the GFI framework, and the Jacobian term,
with and denoting the data generating equation (2). The Jacobian term results from inverting the data generating equation on the unknown parameters. Note that the are also dependent on the the particular graph , but this dependence is suppressed in the notation for conciseness.
The likelihood function in (7) is given by (3), the -function is given by (4), and the derivation of the Jacobian term is presented in the supplementary material. From the generalized fiducial density of and , the generalized fiducial mass function for a graph is proportional to the normalizing constant in (7). In Bayesian theory, this constant of proportionality is understood as the marginal density of the data. Evaluating the integral in the denominator of (7) gives,
where is the set of active row indices of for column , and is a data-dependent and parameter-free matrix defined in the supplementary material as part of the Jacobian term. Note that the inner expectation is with respect to the N distribution, conditional on , and for each , is taken with respect to the inv-gamma distribution. To ensure that defines a proper probability mass function, the normalizing constant in (8) is scaled so that .
Lastly, the relative model probabilities (8) can be computed via psuedo-marginal MCMC algorithms. Traditional MCMC is not feasible because the expected value appearing in (8) is not available in closed form. We implement the grouped independence Metropolis-Hastings (GIMH) algorithm described in (Andrieu and Roberts, 2009), which replaces the expected value with the empirical mean of importance samples at each step of the MCMC algorithm. In the case of (8), efficient importance samples are easily drawn from the N and inv-gamma distributions for and , respectively. The GIMH algorithm we construct is a Markov chain on the set of graphs , and proposals are made by either adding, removing, or replacing a component index in the current iterate of in the chain.
A point of caution about the GIMH algorithm is that the mixing conditions are usually particularly sensitive to the number of importance samples taken to estimate an expectation at each step of the algorithm. However, the algorithm mixed well enough to yield very encouraging numerical results for the high-dimensional linear regression setting in (Williams and Hannig, 2019), and Sections 4 and 5, here, serve to demonstrate that the algorithm is not only computationally feasible but also favorable for graph selection in the VAR(1) model setting. Further discussion of the algorithm is provided in (Williams and Hannig, 2019), and a detailed pseudo-code description of the algorithm is provided at
3 Theoretical results
The problem of graphical selection is difficult because the number of candidate graphs to choose among grows super-exponentially in the dimension of the VAR(1) model, . Accordingly, the utility of the EAS procedure is its inherent ability to effectively manage a very large number of candidate graphs by assigning negligible posterior-like probability to redundant graphs. The meaning of this assertion is made precise in Theorem 3.12 which states that the generalized fiducial distribution obtained from the EAS methodology exhibits pairwise graph selection consistency as both and are taken to infinity, and as a corollary, strong selection consistency for fixed . The necessary mathematical conditions are discussed next.
The first two conditions presented are related to the identifiability of the true data generating graph, . We consider only a stable VAR(1) model for our theoretical investigation, and adopt the common notion of stability that for the true transition matrix for some . It is assumed throughout that a valid has been fixed a-priori.
Condition 3.1 arises in the proof of Lemma LABEL:VAR_JacobianLowBLemma which is a necessary result for Theorem 3.10. It guarantees that the Jacobian term for the oracle graph in (8) will be lower bounded away from zero in probability. The quantity represents an approximation to (via Lemma 3.15) which manages the uncertainty resulting from the minimum eigenvalue of the Jacobian matrix , where and . It is also assumed that a valid has been fixed a-priori.
The true transition matrix satisfies , is bounded from above by a fixed constant, and
where , and
Observe that this condition also implies that .
Note that Lemma 3.15 guarantees as , assuming the versus relationship given by Condition 3.4. Thus, the condition can reasonably be verified on real data by assuming is arbitrarily small and comparing the value of to , where is the obvious sample analogue to the population quantity considered in Condition 3.1. Since is unknown in practice, for the purposes of checking this condition on real data evaluate for the worst case with replaced by 1. We demonstrate on synthetic data in Section 4 that this verifiable condition is indeed meaningful for practical applications.
Condition 3.2, which originates from the proof of Theorem 3.10, is also well calibrated to real data. This condition states the maximum rate at which can be allowed to grow as a function of , and , whilst the oracle model remains identifiable (i.e., no faster than ). The fixed quantity represents the ‘gap’ between how fast must grow (stated in Condition 3.4) to effectively manage the set of all candidate graphs under consideration, and how slow it must grow to not eliminated the oracle graph from consideration. Namely, simultaneously satisfies Conditions 3.2 and 3.4 for any . It is assumed throughout that a valid has been fixed a-priori. The quantities on the left side of the inequality in Condition 3.2 are expected values of the corresponding quantities on the left side of the first constraint in the -function (4).
The oracle graph, , satisfies ,
where , solves subject to , and for some not depending on or .
Unless the oracle model is known, Condition 3.2 is not verifiable on real data, but in Section 4 we are able to demonstrate the varying performance of the EAS procedure on simulated data when this condition is and is not satisfied. Note, that the coefficient of is a constant more pertinent to asymptotic considerations (and our proof technique), and should be understood as closer to the value of (which appears in the -function).
The next condition is a component in the proof of Theorem 3.9 for guaranteeing that the -function will drive the EAS procedure to assign negligible posterior-like probability to non-- graphs, , via the mass function in (8).
For any with ,
where solves subject to , and for some not depending on or .
The intuition for Condition 3.3 is that for graphs containing redundant active components the central tendency of the least squares estimator can be closely approximated by a vector of fewer active components. Notice that is an approximation to . Since the least squares estimator is asymptotically well behaved for Gaussian VAR models, this condition is not particularly interesting and is easily satisfied in numerical experiments. Furthermore, it will hold trivially, for instance, if the columns are linearly dependent.
The final condition in this section is Condition 3.4, which simply states the asymptotic rate at which and from the definition of in (4) must increase as for our main result, Theorem 3.12, to be established. In fact, the previous three conditions were all for establishing non-asymptotic bounds of concentration.
For some fixed , . For the positive constant specified in (14), as or , satisfies
and , where with corresponding to the full model (i.e., all components active), and for some not depending on or .
A important attribute of Condition 3.4 is the requirement that while the dimension of the VAR(1) model, , can be taken to infinity, it must be exceeded polynomially by the number of observed time instances, . This is in contrast to the model selection consistency result established for the high-dimensional linear regression setting in (Williams and Hannig, 2019), where was allowed to grow sub-exponentially in . The primary difference here is that we derive model selection consistency results for the multivariate VAR model setting which are robust to model misspecification, namely the assumption of Gaussian VAR model errors. Such a robust generalized fiducial result requires (to the best of our understanding) non-asymptotic second moment concentration bounds. High-dimensional () consistency results require exponential tail bounds when establishing concentration of data-dependent quantities such as in Lemma 3.8 in the next section, and exponential tail bounds here are intimately related to the assumption of Gaussianity.
Note that no assumption of sparsity is made in any of the conditions. This section concludes with a definition of various quantities that will be referenced in the next section, and throughout the proofs.
is any positive constant such that implies
is any positive constant such that implies
Additionally, is defined as in (LABEL:VAR_TrueGraphLowBTheorem_final).
with . The alternate denotes with replaced by .
Our strategy for establishing graph selection consistency in Theorem 3.12 is largely composed of the contents of Lemmas 3.6 and 3.8 and Theorems 3.9 and 3.10. Lemmas 3.6 and 3.8 describe, respectively, the generalized fiducial concentration of the VAR(1) transition matrix around its least squares estimate and the concentration of the least squares estimate around an approximation to its expectation. The probability bounded in Lemma 3.6 is with respect to the joint generalized fiducial distribution of and . In contrast, Lemma 3.8 is a concentration inequality with respect to the data generating mechanism (2) which derives its distribution from the errors for . In what follows we chose for some not depending on or .
For any with ,
where , and .
Recall that , which comes from the proof of this lemma, is a key component of our suggested default in (6) and of Condition 3.4. This results from the fact that must control for in order to establish the well-behaved concentration of the generalized fiducial distribution of which is exhibited by this lemma. The plays the role of appropriately scaling the design matrix . Observe that for the full model ,
Thus, for a given graph , is a combined measure of the covariance or dependence among the univariate time-series in the VAR model, the contemporaneous error precision matrix, and the number of observed instances of the time-series. This is what makes effective as apart of in the -function for determining the - of a given . Lemma 3.7 gives a probabilistic bound on as a function of and , given the -function constraint that .
For any ,
Next, consider the concentration of the least squares estimate.
Materially, the three preceding lemmas are needed in the proofs of Theorems 3.9 and 3.10, presented next. These theorems are results about the behavior of the EAS methodology coupled with the generalized fiducial distribution (i.e., the Jacobian term); they are analogous to studying the behavior of given priors for a (Bayesian) posterior distribution. Theorem 3.9 is a non-asymptotic concentration inequality which yields an upper bound on the rate at which the expected value (w.r.t. the joint generalized fiducial distribution of and ) of the -function times the Jacobian term diverges for non-- graphs, .
Conversely, Theorem 3.10 is a non-asymptotic lower bound on the -function times the Jacobian term for the oracle graph, .
Before stating the main result of this paper one final condition, Condition 3.11, is needed. In its absence a less strong, yet still meaningful statement of posterior-like graphical consistency holds; we formulate this alternative statement as Corollary 3.13. The importance of Condition 3.11 is that it covers the gap left open in Theorem 3.9 since the theorem only bounds the generalized fiducial probability of non-- graphs (i.e., ).
For the positive constant specified in (13),
as or .
Recall from (5) that is the univariate RSS, corresponding to graph , for the component of the VAR(1) model. Hence, this condition is a statement that the product of the ratio of RSS components for the true graph over that of any strict sub-graph, taken to a power on the order of , will vanish at a rate of . This is not unreasonable to expect since for each , , , for , and an explicit condition about the oracle model being sufficiently better fitting than all sub-models is typical of model consistency results.
The main result of our paper, a statement of pairwise graphical selection consistency for the constructed EAS methodology, is now presented. This result demonstrates that the generalized fiducial probability of the oracle graph will asymptotically dominate that of all other graphs. Note that there is no assumption of sparsity.
If Condition 3.11 is violated, Corollary 3.13 demonstrates that the generalized fiducial mass function will concentrate asymptotically on the subset of graphs . In practice, for sufficiently large , this means that there will be a few graphs which the algorithm visits frequently, and the largest one (in cardinality) likely contains the greatest number of the oracle components.
Corollary 3.13 (pairwise selection consistency).
The additional corollary stated next demonstrates that the EAS methodology will concentrate all generalized fiducial mass on the true model, asymptotically, for fixed .
Note the following short remark about the meaning of the difference between and model selection consistency. The statement of graph selection consistency is essentially a statement that the true model will be assigned large probability and all other models will be assigned small probabilities. Conversely, the implication of graph selection consistency is that the probability assigned to the true model will be large relative to each of the other model probabilities, individually, but that all models (including the true model) may have small probabilities. Such a phenomenon is common for model selection paradigms in which the set of candidate models grows very fast with dimension (i.e., like in the case of a VAR(1) model).
The next subsection illustrates the additional attribute that our model selection consistency results are robust to model misspecification, namely, the assumption of Gaussian VAR model errors.
3.3 Standalone supporting results
This subsection provides five lemmas which were foundational to our proof techniques for establishing our theory for the EAS methodology. Non-asymptotic moment bounds on products of and (in the VAR model formulation (1)), with respect to and , are the building blocks for any theoretical pursuit of understanding high-dimensional, multivariate VAR models. These results are essentially a collection of second moment bounds of the quantities and cross-quantities in , and establish the notion that our preceding fiducial consistency results will remain true under model misspecification. This is due to the fact that as long as the VAR(1) model errors are independent within and across time and there exist bounded fourth moments (i.e., components appearing in ), the following collection of lemmas will remain true (up to some constants of proportionality). And as a consequence, our generalized fiducial consistency results (which assume Gaussian data) will still hold even if the true data is not Gaussian.
Assume . Then for all ,
where is as in (11),