Convergence Analysis of a Collapsed Gibbs Sampler for Bayesian Vector Autoregressions

07/06/2019
by   Karl Oskar Ekvall, et al.
0

We propose a collapsed Gibbs sampler for Bayesian vector autoregressions with predictors, or exogenous variables, and study the proposed algorithm's convergence properties. The Markov chain generated by our algorithm converges to its stationary distribution at least as fast as those of competing (non-collapsed) Gibbs samplers and is shown to be geometrically ergodic regardless of whether the number of observations in the underlying vector autoregression is small or large in comparison to the order and dimension of it. We also give conditions for when the geometric ergodicity is asymptotically stable as the number of observations tends to infinity. Specifically, the geometric convergence rate is shown to be bounded away from unity asymptotically, either almost surely or with probability tending to one, depending on what is assumed about the data generating process. Our results are among the first of their kind for practically relevant Markov chain Monte Carlo algorithms.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

10/27/2021

On the convergence rate of the "out-of-order" block Gibbs sampler

It is shown that a seemingly harmless reordering of the steps in a block...
03/10/2021

Dimension free convergence rates for Gibbs samplers for Bayesian linear mixed models

The emergence of big data has led to a growing interest in so-called con...
08/27/2018

A Hybrid Alternative to Gibbs Sampling for Bayesian Latent Variable Models

Gibbs sampling is a widely popular Markov chain Monte Carlo algorithm wh...
11/30/2018

Markov chain Monte Carlo Methods For Lattice Gaussian Sampling:Convergence Analysis and Enhancement

Sampling from lattice Gaussian distribution has emerged as an important ...
04/29/2020

On the convergence complexity of Gibbs samplers for a family of simple Bayesian random effects models

The emergence of big data has led to so-called convergence complexity an...
01/11/2017

Fast mixing for Latent Dirichlet allocation

Markov chain Monte Carlo (MCMC) algorithms are ubiquitous in probability...
07/05/2021

Analyzing Relevance Vector Machines using a single penalty approach

Relevance vector machine (RVM) is a popular sparse Bayesian learning mod...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Markov chain Monte Carlo (MCMC) is often used to explore the posterior distribution of a vector of parameters given data . To ensure the reliability an analysis using MCMC it is essential to understand the convergence properties of the chain in use [6, 7, 9, 10, 22, 28, 49] and, accordingly, there are numerous articles establishing such properties for different MCMC algorithms [e.g. 1, 2, 15, 17, 19, 23, 37, 41, 47]. It has been common in this literature to treat the data as fixed, or realized. Thus, the model for how the data are generated has typically been important only insofar as it determines the likelihood function based on an arbitrary realization—the stochastic properties of the data prescribed by that model have not been emphasized. This is natural since the target distribution, i.e. the posterior distribution, treats the data as fixed. On the other hand, due to the rapid growth of data available in applications, it is also desirable to understand how performance is affected as the number of observations increases. When this happens, the data are more naturally thought of as stochastic; each time the sample size increases by one, the additional observation is randomly generated. The study of how convergence properties of MCMC algorithms are affected by changes in the data is known as convergence complexity analysis [39] and it has attracted increasing attention recently [18, 36, 37, 50, 51].

Accounting for randomness in the data and letting the sample size grow leads to a more complicated analysis than when the data are fixed. In fact, to date, convergence complexity analysis has only been successfully carried out for a few practically relevant MCMC algorithms, maybe even one [37]. We propose and study a MCMC algorithm for a fundamental model in time series analysis: a Bayesian vector autoregression with predictors (VARX), or exogenous variables. Briefly, the VARX we consider assumes that and satisfy, for ,

(1)

with

independent and multivariate normally distributed with mean zero and common covariance matrix

, , , and . The target distribution of our algorithm is the posterior distribution of given . More details on the model specification, priors, and resulting posterior distribution are given in Section 2. We will consider both fixed and growing data and refer to the two settings as the small- and large- setting, respectively. By being small we mean that it is fixed and possibly small in comparison to and , but throughout. Many large VARs in the literature [3, 11, 26] are covered by this setting. By being large we mean that it is increasing and that the data are stochastic.

The algorithm we propose is a collapsed Gibbs sampler. It exploits the structure in the VARX to generate a Markov chain that converges to its stationary distribution at least as fast as those generated by competing (non-collapsed) Gibbs samplers. To discuss the more precise convergence results we establish, we require some more notation.

Let denote the VARX posterior distribution having density with support on for some and let () be the -step transition kernel for a Markov chain with state space , started at a point . We assume throughout that all discussed Markov chains are irreducible, aperiodic, and Harris recurrent [34], and that sets on which measures are defined are equipped with their Borel -algebra. Our analysis is focused on convergence rates in total variation distance, by which we mean the rate at which approaches zero as tends to infinity, where denotes the total variation norm. If this convergence happens at a geometric (or exponential) rate, meaning there exist a and an such that for every and

(2)

then the Markov chain, or the kernel , is said to be geometrically ergodic. The geometric convergence rate is the infimum of the set of such that (2) holds [37]. Since all probability measures have unit total variation norm, is always in , and is geometrically ergodic if and only if

. A substantial part of the literature on convergence of MCMC algorithms is centered around geometric ergodicity, for good reasons: under moment conditions, a central limit theorem holds for functionals of geometrically ergodic Markov chains

[5, 20]

and the variance in the asymptotic distribution given by that CLT can be consistently estimated

[8, 21, 48], allowing principled methods for ensuring reliability of the results [42, 49]. Our main result in the small- setting gives conditions that ensure when the data are fixed and is the kernel corresponding to our proposed algorithm. Due to a well known correspondence [15, 24]

between the likelihoods of the VARX and the multivariate linear regression model when data are fixed, our small-

results also apply to certain versions of the latter.

Notice that, although it is suppressed in the notation, , , , and, hence, typically depend on . In the large- setting, we are no longer considering a single dataset, but a sequence of datasets , where here denotes a dataset with observations. Consequently, for every there is a posterior distribution and a Markov chain with kernel that is used to explore it. To each kernel there also corresponds a geometric convergence rate . Since depends on , the sequence

is now one of random variables, ignoring possible issues with measurability. We are interested in bounding

away from unity asymptotically, in either one of two senses: first, if there exists a sequence of random variables such that for every and almost surely, then we say that is asymptotically geometrically ergodic almost surely, or the geometric ergodicity is asymptotically stable almost surely. Secondly, if instead of the upper limit being less than unity almost surely it holds that , then we say that is asymptotically geometrically ergodic in probability, or that the geometric ergodicity is asymptotically stable in probability. Our main results in the large- setting give conditions for asymptotically stable geometric ergodicity, in both of the two senses, of the Markov chain generated by our algorithm. An intuitive, albeit somewhat loose, interpretation of our main results is that the geometric ergodicity is asymptotically stable if the parameters of the VARX can be consistently estimated using maximum likelihood.

The rest of the paper is organized as follows. We begin in Section 2 by completing the specification of the model and priors. Because some of the priors may be improper we derive conditions which guarantee the posterior exists and is proper. In Section 3 we propose a collapsed Gibbs sampler for exploring the posterior. Conditions for geometric ergodicity for small are presented in Section 4 and conditions for asymptotically stable geometric ergodicity are given in Section 5. Some concluding remarks are given in Section 6.

2 Bayesian vector autoregression with predictors

Recall the definition of the VARX in (1). To complete the specification, we assume that the starting point is non-stochastic and known and that the predictors are strongly exogenous. By the latter we mean that is independent of and has a distribution that does not depend on the model parameters. With these assumptions the following lemma is straightforward. Its proof is provided in Appendix B for completeness. Let , , , , and . Let also and , where is the vectorization operator, stacking the columns of its matrix argument.

Lemma 2.1.

The joint density for observations in the VARX is with

where .

We defer a discussion of exactly how , , , and compare since what is needed depends on how the prior distributions are specified. Let denote the set of symmetric positive semi-definite (SPSD) matrices and, to define priors, let , , , and

be hyperparameters. The prior on

we consider is of the form , with

and

where means the determinant when applied to matrices. The flat prior on is standard in multivariate scale and location problems, including in particular the multivariate regression model which is recovered when in the VARX. The priors on and are common in macroeconomics [25] and the prior on includes the inverse Wishart () and Jeffreys prior () as special cases. The following result gives two different sets of conditions that lead to a proper posterior. Though we only consider proper normal or flat priors for in the rest of the paper, it may be relevant for other work to note that the proposition holds for any prior satisfying the conditions.

Proposition 2.2.

If either

  1. , has full column rank, , and is proper; or

  2. has full column rank, , and is bounded,

then the posterior distribution is proper and, with , the posterior density is characterized by

(3)
Proof.

Appendix B. ∎

The first set of conditions is relevant to the small -setting. It implies that if the prior on is a proper inverse Wishart density, so that and , then the posterior is proper if is proper and has full column rank. In particular, or can be arbitrarily large in comparison to . Thus, this setting is compatible with large VARs [3, 11, 26]. The second set of conditions allows for the use of improper priors also on and when is large in comparison to all of , , and . The full column rank of is natural in large- settings. In practice, one expects it to hold unless the least squares regression of on and gives residuals that are identically zero.

To the best of our knowledge, the combination of an improper prior for and a proper prior for is new. In previous work, and have sometimes been grouped as and a proper multivariate normal prior assigned to [27]. Treating and differently is appealing because, as indicated by point 1 in Proposition 2.2, one can then use the standard flat prior on while still allowing for large and . Moreover, even when is large enough that one could use a flat prior also on , a proper prior can be preferable: many time series, in particular in economics and finance, are known to be near non-stationary in the unit root sense, and if so that is proper, then can be chosen to reflect this. For a discussion of priors in Bayesian VARs more generally we refer the reader to Karlsson [25].

As alluded to in the introduction, for fixed data the density , and hence the posterior, is the same as in a multivariate linear regression with design matrix and coefficient matrix . Thus, our results with fixed data apply also to a multivariate regression model with partitioned design matrix—one part which has a flat prior for its coefficient and one which is possibly high-dimensional and has a proper prior for its coefficient. This configuration is unlike those in other work on similar models which typically assume either that has full rank or that the prior for is proper [1, 2, 11, 15, 46].

The literature on convergence properties of MCMC algorithms for Bayesian VAR(X)s is limited. An MCMC algorithm for a multivariate linear regression model has been proposed and its convergence rate studied [15]. By the preceding discussion, this includes the VARX as a special case, however, the (improper) prior used is which is not compatible with the large VARXs we allow for in the small- setting. In the large- setting, i.e. when doing convergence complexity analysis, the data are no longer considered fixed and, hence, the VARX is no longer a special case of multivariate linear regression. A two-component ( and ) Gibbs sampler for Bayesian vector autoregressions without predictors has been proposed [24]. However, the analysis of it is simulation-based and as such does not provide any theoretical guarantees. Our results address this since, as we will discuss in more detail below, the algorithm we propose simplifies to this Gibbs sampler when there are no predictors.

3 A collapsed Gibbs sampler

If , then the VARX posterior is a normal-(inverse) Wishart for which MCMC is unnecessary. However, when the posterior is analytically intractable and there are many potentially useful MCMC algorithms. For example, the full conditional distributions have familiar forms so it is straightforward to implement a three-component Gibbs sampler. Another sensible option is to group and and update them together. Here, we will instead make use of the particular structure the partitioned matrix offers and devise a collapsed Gibbs sampler [29]. Well known results [30] imply that our collapsed Gibbs sampler converges to its stationary distribution at least as fast as both the three-component Gibbs sampler and the two-component sampler that groups and . For the case but , i.e. there are no predictors in the model, a two-component Gibbs sampler has been proposed [25]. Our algorithm specializes to this two-component Gibbs sampler when and, as a consequence, our results apply almost verbatim. A formal description of one iteration of the collapsed Gibbs sampler is given in Algorithm 1.

1:Input: Current value
2:Draw from the distribution of
3:Draw from the distribution of
4:Draw from the distribution of
5:Set
Algorithm 1 Collapsed Gibbs sampler

We next derive the conditional distributions necessary for its implementation. Let denote the matrix normal distribution with mean and scale matrices and (see Definition A.1), and let denote the inverse Wishart distribution with scale matrix and degrees of freedom. For any real matrix , define to be the projection onto its column space and the projection onto the orthogonal complement of its column space. Let also denote the Kronecker product and define and .

Lemma 3.1.

If at least one of the two sets of conditions in Proposition 2.2 holds, then

Proof.

Appendix B. ∎

The collapsed Gibbs sampler in Algorithm 1 simulates a realization from a Markov chain having the following one-step transition kernel: for any measurable ,

where the subscript is short for collapsed. However, instead of working directly with we will use its structure to reduce the problem in a convenient way. Consider the sequence , , obtained by ignoring the component for in Algorithm 1. The sequence is generated as a two-component Gibbs sampler and its transition kernel is, for any measurable ,

A routine calculation shows that since , by construction, has invariant distribution , then has the VARX posterior as its invariant distribution.

The sequences and are also Markov chains. The transition kernel for the sequence is, for any measurable ,

(4)

The transition kernel, , for the sequence is constructed similarly. The kernel satisfies detailed balance with respect to the posterior marginal and similarly for and hence each has the respective posterior marginal as its invariant distribution. However, the kernels and do not satisfy detailed balance with respect to their invariant distributions.

In Sections 4 and 5 we will establish geometric ergodicity of and study its asymptotic stability, respectively. Our approach, which is motivated by the following lemma, will be to analyze in place of ; the lemma says we can analyze either of , or in place of . The proof of the lemma uses only well known results about de-initializing Markov chains [40] and can be found in Appendix B.

Lemma 3.2.

For any , and ,

The primary tool we will use for investigating both geometric ergodicity and asymptotic stability is the following well known result [43, Theorem 12], which has been specialized to the current setting. Note that the kernel acts to the left on measures, that is, for a measure , we define

Theorem 3.3.

Suppose is such that for some and some

(5)

Also suppose there exists , a measure , and some such that

(6)

Then is geometrically ergodic and, moreover, if

then, for any initial distribution ,

(7)

It is common for the initial value to be chosen deterministically, in which case (7) suggests choosing a starting value to minimize . Theorem 3.3 has been successfully employed to determine sufficient burn-in in the sense that the upper bound on the right-hand side of (7) is below some desired value [22, 23, 44], but, unfortunately, the upper bound is often so conservative as to be of little utility. However, our interest is twofold; it is easy to see that there is a such that and hence if satisfies the conditions, then it is geometrically ergodic and, as developed and exploited in other recent research [37], the geometric convergence rate is upper bounded by . Outside of toy examples, we know of no general state space Monte Carlo Markov chains for which is known.

Consider the setting where the number of observations tends to infinity; that is, there is a sequence of data sets and corresponding transition kernels with . If almost surely, then we say the drift (5) and minorization (6) are asymptotically unstable in the sense that, at least asymptotically, they provide no control over . On the other hand, because establishing that almost surely or that , leads to asymptotically stable geometric ergodicity as defined in the introduction.

Notice that depends on the drift function through , , and . Thus the choice of drift function which establishes geometric ergodicity for a fixed may not result in asymptotic stability as . Indeed, in Section 4 we use one to show that is geometrically ergodic under weak conditions when is fixed, while in Section 5 a different drift function and slightly stronger conditions are needed to achieve asymptotically stable geometric ergodicity of .

4 Geometric ergodicity

In this section we consider the small- setting. That is, is fixed and the data and observed, or realized, and hence treated as constant. Accordingly, we do not use a subscript for the sample size on the transition kernels. We next present some preliminary results that will lead to geometric ergodicity of , and hence and .

We fix some notation before stating the next result. Let denote the Euclidean norm when applied to vectors and the spectral (induced) norm when applied to matrices, denotes the Frobenius norm for matrices, and superscript denotes the Moore–Penrose pseudo-inverse. Least squares estimators of and are denoted by and , respectively, and .

Lemma 4.1.

Define by . If and at least one of the two sets of conditions in Proposition 2.2 holds, then for any and with

the kernel satisfies the drift condition

Proof.

Assume without loss of generality that , where denotes the identity matrix; the general case is recovered by replacing and by and everywhere. Using (4) and Fubini’s Theorem yields

Lemma 3.1 and standard expressions for the moments of the multivariate normal distribution [45, Theorem 10.18] gives for the inner integral that

The triangle inequality gives . We work separately on the last two summands. First, since is SPSD, we get by Lemma A.2.2 that

Secondly,

Now by Lemma A.3, with and taking the roles of what is there denoted and , we have for any generalized inverse (denoted by superscript ) that

is upper bounded by

(8)

Lemma A.4 says that is one such generalized inverse. Using that the Moore–Penrose pseudo-inverse distributes over the Kronecker product [33], the middle part of this generalized inverse can be written as . Thus, for this particular choice of generalized inverse (8) is equal to

Thus, using also that by Lemma A.2 since SPSD,

Since the right-hand side does not depend on , the proof is completed upon integrating both sides with respect to . ∎

Lemma 4.2.

If at least one of the two sets of conditions in Proposition 2.2 holds, then for any and such that , there exists a probability measure and

such that

Proof.

We will prove that there exists a function , depending on the data and hyperparameters, such that and for every such that , or, equivalently, . This suffices since if such a exists, then we may take and define the distribution by, for any Borel set ,

Let and so that can be written

To establish existence of a with the desired properties we will lower bound the first and third term in using two inequalities, namely

and, for every such that ,

We prove the former inequality first. Since , it suffices to prove that . For this, Lemma A.2.3 says it is enough to prove that is SPSD. But the Frisch–Waugh–Lovell theorem [32, Section 2.4] says , and therefore

which is clearly SPSD.

For the second inequality we get, using the triangle inequality, sub-multiplicativity, and that the Frobenius norm upper bounds the spectral norm,

Since the spectral norm for SPSD matrices is the maximum eigenvalue, we have shown that

is SPSD. Thus, is also SPSD and, hence, , which is what we wanted to show. We have thus established

Finally, the stated expression for , and that it is indeed positive, follows from that under the first set of conditions in Proposition 2.2 is SPD, and under the second set of conditions is SPD by Lemma A.1; in either case, both and are SPD and, consequently, is proportional to an inverse Wishart density with scale matrix and degrees of freedom. ∎

We are now ready for the main result of this section.

Theorem 4.3.

If and at least one of the two sets of conditions in Proposition 2.2 holds, then the transition kernels , , and are geometrically ergodic.

Proof.

By Lemma 3.2 it suffices to show it for . Lemma 4.1 establishes that a drift condition (5) holds for with and all , while Lemma 4.2 establishes a minorization condition (6) for . The claim now follows immediately from Theorem 3.3. ∎

5 Asymptotic stability

We consider asymptotically stable geometric ergodicity as . Motivated by Lemma 3.2, we focus on the sequence of kernels , where is the kernel with the dependence on the sample size made explicit; we continue to write when is arbitrary but fixed. Similar notation applies to the kernels and .

It is clear that as changes so do the data and . Treating and as fixed (observed) is not appropriate unless we only want to discuss asymptotic properties holding pointwise, i.e. for particular, or all, paths of the stochastic process , which is unnecessarily restrictive. Thus, to be clear, in what follows we assume that and are defined on an underlying probability space. We also assume that, for every

, the joint distribution of

and is as prescribed by the VARX, for some specific, “true” . Unless indicated otherwise, probability statements and expectations are with respect to the underlying probability space, or equivalently with respect to the distribution of , for the true .

Recall that Theorem 3.3 is instrumental to our strategy: if satisfies Theorem 3.3 with some , , , , and , then there exists a that upper bounds . We focus on the properties of those , , as tends to infinity. Throughout the section we assume that the priors, and in particular the hyperparameters, are the same for every .

Clearly, the choice of drift function is important for the upper bound one obtains. The drift function used for the small-

regime is not well suited for the asymptotic analysis in this section. Essentially, problems occur if

, , or so that the corresponding upper bounds satisfy almost surely [37, Proposition 2]. Consider Theorem 4.3. Since we can take for all , only or can lead to problems. In Appendix B.1 we show that as long as is consistent, then while if, almost surely, as ,

then almost surely. We expect this to occur in many relevant configurations of the VAR. Indeed, we expect the order of will often be at least that of . To see why, consider the case without predictors. Then , and is the maximum likelihood estimator of which is known to be consistent for stable VARs with i.i.d. Gaussian innovations [31]. For such VARs it also holds that converges in probability to some SPD limit [31], and hence .

The intuition as to why the drift function that works in the small- regime is not suitable for convergence complexity analysis is that the drift function should be centered (minimized) at a point the chain in question can be expected to visit often [37]. The function defined by is minimized when , but there is in general no reason to believe the -component of the chain will visit a neighborhood of the origin often. On the other hand, if the number of observations grows fast enough in comparison to other quantities, then we expect the marginal posterior density of to concentrate around the true , i.e. the according to which the data is generated. We also expect that for large the least squares and maximum likelihood estimator is close to the true . Thus, intuitively, the -component of the chain should visit the vicinity of often. Formalizing this intuition leads to the main result of the section.

Let us re-define by . The following lemma establishes a result that will lead to verification of the drift condition in (9) for all large enough and almost all sample paths of the VAR under appropriate conditions. Notice, however, that the given here need not be less than unity for a fixed or particular sample path of the VARX.

Lemma 5.1.

If has full column rank, , at least one of the two sets of conditions in Proposition 2.2 holds,

then

Proof.

Suppose first that and notice that has full column rank, and hence