Denote by a time series of equally spaced -variate observations. A stochastic process is said to be strictly stationary if its properties are unaffected by a shift in the time origin and weakly stationary if the mean remains constant over time and the matrix-valued autocovariance function depends only on the lag , for , with . For Gaussian processes, the two are equivalent, and we simply refer to a process as being stationary . As raw time series often exhibit periodic variation and systematic changes in the mean, stationarity is generally assumed only for the residuals of a “detrended” series, the variables of a differenced process, or those latent components of a state space model that are believed to be mean-reverting. In these cases, stationarity prevents the predictive variance of the transformed process from growing without bound as the forecast horizon increases, moving either forwards or backwards in time. This is a very reasonable assumption in many applications, for instance where the inferential objective is long-term forecasting or characterising the long-run behaviour of linear dynamic systems.
. As raw time series often exhibit periodic variation and systematic changes in the mean, stationarity is generally assumed only for the residuals of a “detrended” series, the variables of a differenced process, or those latent components of a state space model that are believed to be mean-reverting. In these cases, stationarity prevents the predictive variance of the transformed process from growing without bound as the forecast horizon increases, moving either forwards or backwards in time. This is a very reasonable assumption in many applications, for instance where the inferential objective is long-term forecasting or characterising the long-run behaviour of linear dynamic systems.
All stationary Gaussian processes can be approximated arbitrarily well by a finite-order, vector autoregressive moving average (VARMA) model; see, for example, Chapter 12 of Neusser (2016). Although we focus on the more commonly used subclass of vector autoregressive (VAR) models, we provide comments in Section 6 on the extension to the more general case. Without loss of generality, assume that the time series can be modelled as a zero-mean, order- vector autoregressive, or , process,
in which the errors form a sequence of uncorrelated, zero-mean multivariate normal random vectors, . The parameters of the model therefore comprise the autoregressive coefficient matrices for , and the error variance matrix , where denotes the set of matrices with entries in and denotes the set of symmetric, positive definite matrices. Henceforth, we refer to the collection of as .
In the time series literature, the process is often written as
where is the backshift operator, that is , is the identity matrix and , , is termed the characteristic polynomial. The process is stationary if and only if all the roots of lie outside the unit circle. We refer to this subset of as the stationary region, denoted by .
When and or , the stationary region is simple, with representing the interval , and representing a triangle in the -plane. However, increasing or further increases the complexity of the polynomial equation and hence the geometry of the stationary region. This causes a number of problems for Bayesian inference. First, because there are no standard distributions on , it is not generally possible to directly specify a prior over this space which encodes genuine beliefs. Notwithstanding this philosophical point, a second, computational challenge remains; designing efficient Markov chain Monte Carlo (MCMC) samplers with state space constrained to
and hence the geometry of the stationary region. This causes a number of problems for Bayesian inference. First, because there are no standard distributions on
, it is not generally possible to directly specify a prior over this space which encodes genuine beliefs. Notwithstanding this philosophical point, a second, computational challenge remains; designing efficient Markov chain Monte Carlo (MCMC) samplers with state space constrained tois very difficult. In the univariate () case, Chib (1993) and Chib and Greenberg (1994) address the latter difficulty by assigning a multivariate normal prior to , truncated to the region . For , the normalising constant cannot be evaluated in closed form and so the prior density is known only up to proportionality. An MCMC sampler is described which updates in a block with a multivariate normal proposal distribution, namely the full conditional distribution when the constraints are ignored; proposals that fall outside of are rejected in the Metropolis acceptance step. However, this will be efficient only if the proposal density is concentrated over . In the univariate case, Piccolo (1982) calculates the volume of the stationary region for the parameters of autoregressive (AR) models. In the special case where the error variance matrix is diagonal, a trivial corollary of this result is that the volume of is equal to , in which for even and for odd, where . This region becomes vanishingly small as increases and is likely to render an inferential scheme that is not tailored to the geometry of the problem highly inefficient. The natural way around this problem is to find a reparameterisation of the model which maps to a space with simpler constraints. Ideally, the new parameters should be interpretable to allow prior specification to be carried out in a meaningful way.
As a result of their ubiquity in time series analysis, there is a large literature on reparameterisations of (univariate) autoregressive models. Barndorff-Nielsen
and Schou (1973) establish a bijection between and the first partial autocorrelations of a stationary process, showing further that the mapping is both ways continuously differentiable. Monahan (1984) provides an alternative derivation of the mapping and explicit recursive formulae for its inverse. The new parameterisation has the advantage that the partial autocorrelations are interpretable and only constrained to lie in the Cartesian product space . Marriott et al. (1996) and Barnett
et al. (1996) present prior distributions for the partial autocorrelations and MCMC methods for computational inference. The latter suggest a uniform prior over . Owing to the analytic tractability of the Jacobian of the mapping from to , it can easily be shown that this is equivalent to
where denotes the floor operator (Jones, 1987). The former use slab and spike priors for the in which the “slab” is uniform over and the “spike” is an atom of probability at zero, which allows for uncertainty in the model order.
and the “spike” is an atom of probability at zero, which allows for uncertainty in the model order.
Marriott and Smith (1992) and Huerta and West (1999) both describe reparameterisations based on a representation of the characteristic equation in factorised form, that is, . In this case, the condition for stationarity reduces to for all . Huerta and West (1999) also allow uncertainty in the model order , and the balance of real and complex (reciprocal) roots , by placing priors on the real roots and the moduli of the pairs of complex roots with atoms of probability at zero.
Extensions of these reparameterisations of univariate AR models to the vector case, especially with a focus on prior specification, are surprisingly scarce in the literature. Morf
et al. (1978) generalise the results of Barndorff-Nielsen
and Schou (1973); for every error variance matrix , a bijection is established between and the first partial autocorrelation matrices of a stationary process. Denoting by the subset of matrices in whose singular values are all less than one, in terms of the new parameters, the stationary region reduces to a simple Cartesian product space is described along with a maximum likelihood estimation procedure. Although the reparameterisation, hereafter the
whose singular values are all less than one, in terms of the new parameters, the stationary region reduces to a simple Cartesian product space. Ansley and Kohn (1986) build on the earlier work in Ansley and Newbold (1979) by generalising the construction of Monahan (1984) and explicitly providing recursive formulae for the inverse mapping. A second bijective mapping between and
is described along with a maximum likelihood estimation procedure. Although the reparameterisation, hereafter theAK-parameterisation, gives a set of unconstrained parameters , it is not immediately amenable to Bayesian inference because the are difficult to interpret.
Roy et al. (2019) derive an alternative reparameterisation of stationary and invertible VARMA models, hereafter the RML-parameterisation. Focusing on the VAR subclass, we show in Section 3.3 that the new parameters represent differences in conditional variances, for , and orthogonal matrices , where arises from the polar decomposition of an affine transformation of the corresponding partial autocorrelation matrix , and denotes the set of orthogonal matrices. From an inferential perspective, the space of the new parameters is almost as problematic as the original space . Prior specification remains difficult because the orthogonal matrices are difficult to interpret. Computational inference also remains challenging because the complex constraints defining are notoriously troublesome for MCMC schemes. Although a second mapping from to Euclidean space is presented, the transformation is many-to-one, and hence the new parameters are not identifiable in the likelihood.
In this paper we propose a modification of the AK-parameterisation which allows various structural forms of the partial autocorrelation matrices to be preserved when they are mapped to . This, in turn, facilitates construction of prior distributions for that encourage shrinkage towards meaningful parametric structures. In particular, we describe a prior which is exchangeable with respect to the order of the elements in the observation vector. This is likely to be a useful representation of prior beliefs in a wide variety of applications where the modeller does not have information, a priori, to distinguish between the time series. We also show how a simple bijective transformation of the RML-parameters from to allows assignment of a “vague” prior and straightforward computational inference.
The remainder of this paper is organised as follows. In Section 2 we describe our reparameterisation of the parameters of a stationary model and the structural forms that are preserved under the mapping from partial autocorrelation matrices to unconstrained square matrices. Section 3 details some prior distributions over unconstrained Euclidean space, along with their properties, and outlines a number of examples representing different beliefs about the structure of the partial autocorrelations. In Section 4, calculation of the likelihood and posterior are discussed, along with the use of Hamiltonian Monte Carlo for computational inference. Section 5 describes an application to a macroeconomic time series and compares posterior predictive inference under the stationary prior to inference obtained under standard prior distributions from the literature where stationarity is not enforced. Finally, we offer some conclusions and avenues for further research in Section
describes an application to a macroeconomic time series and compares posterior predictive inference under the stationary prior to inference obtained under standard prior distributions from the literature where stationarity is not enforced. Finally, we offer some conclusions and avenues for further research in Section6.
2 An unconstrained reparameterisation over the stationary region
Our reparameterisation can be broken down into two parts. For every value of the error variance matrix, the first part maps the autoregressive coefficient matrices to a set of partial autocorrelation matrices, each of which lies in . Then the second part maps the partial autocorrelation matrices to a set of unconstrained square matrices. We describe each transformation in detail below, before examining some special cases of the second transformation in which the structural form of the matrix is preserved.
2.1 Part 1: partial autocorrelation matrices
Ansley and Kohn (1986) establish a one-to-one correspondence between the parameters of a stationary process and the parameter set in which denotes the -th partial autocorrelation matrix. In essence, the -th partial autocorrelation matrix is a conditional cross-correlation matrix between and given . More precisely, the matrices are defined as follows. For each let
in which the matrices and are the coefficients of the -th terms and , respectively, in the conditional expectations and . It follows that for and and are the values of the coefficients, say and , in the autoregression of on its predecessors or successors, respectively, that minimise the mean squared error or ; see, for example, Chapter 3 of Christensen (1991). We define the corresponding conditional variances as and for and let where is the -th autocovariance of . Now, express the conditional variance matrices through a matrix-square-root decomposition, and . In this paper, we take the symmetric matrix-square-root factorisation, so that and are symmetric and positive definite. A different reparameterisation can be defined using the Cholesky factorisation, in which case and are lower triangular. However, as we discuss in Section 3.1, we prefer the former factorisation as it facilitates construction of a prior which is closed under orthogonal transformation of the observation vectors.
Finally, let and be standardised versions of the forward and reverse time series and, for each , let and be standardised versions of the forward and reverse error series. We can now define the partial autocorrelation matrix as
Complete details of the mapping from to and its inverse are given in Appendix A. The former is taken from Ansley and Newbold (1979) and the latter is a generalised version of the algorithm presented in Ansley and Kohn (1986) which allows the conditional variance matrices to be decomposed into symmetric matrix-square-roots as an alternative to Cholesky factors. Original proofs are provided in the Supplementary Materials. In particular, we show that equation (3) can be expressed as . Using the symmetric matrix-square-root factorisation explicitly, this is
for . In the univariate case, note that this simplifies to .
2.2 Part 2: unconstrained square matrices
Although the constraints on the -fold Cartesian product space are substantially simpler than those on the stationary region , there are no standard distributions defined over . This makes it difficult to specify a prior that encodes genuine prior beliefs. Fortunately, Ansley and Kohn (1986) also define a one-to-one mapping from to . The forward mapping is defined as follows. Let
be a matrix-square-root factorisation of . Then write . Similarly, for the inverse mapping, let
be a matrix-square-root factorisation of , then write . Again, in this paper we take the symmetric matrix-square-root factorisation in each case.
It follows that we can reparameterise the model in terms of the parameter set over which a distribution can readily be specified. If we use symmetric matrix-square-roots in both parts of the reparameterisation, we refer to this as the symmetric AK-parameterisation.
2.3 Preservation of structural forms
If the symmetric matrix-square-root factorisation is used in (5) and (6), the transformation from to and its inverse preserve a number of structural forms for the square matrices. Specific examples are detailed below.
2.3.1 Scaled all-ones matrix
For a scaled all-ones matrix , where and is an -vector of 1s, taking is a necessary and sufficient condition for the singular values of to be less than one. It is straightforward to verify that (with ) if and only if where . Therefore is a scaled all-ones matrix if and only if the same is true of .
2.3.2 Diagonal matrix and scaled identity matrix
A necessary and sufficient condition for the diagonal matrix to have singular values less than one is . It is then straightforward to show that (with ) if and only if where for . Therefore is diagonal if and only if is diagonal and, in the case where , is a scaled identity matrix if and only if the same is true of .
2.3.3 Zero matrix
Denote by an matrix of zeros. We have immediately that if and only if . This is, of course, simply a special case of the scaled all-ones matrix or scaled identity matrix where . However, it is a particularly useful theoretical result because if and only if . The order of the vector autoregression is therefore if and only if but for . We return to this point in Section 6.
2.3.4 Two-parameter “exchangeable” matrix
The most general form for a square matrix which is invariant under a common permutation of the rows and columns is . We will call this a two-parameter exchangeable matrix. The necessary and sufficient conditions for the two-parameter exchangeable matrix to have all its singular values less than one are and or, equivalently, and where
It is straightforward to verify that (with and ) if and only if , with , where
for . Therefore, is a two-parameter exchangeable matrix if and only if the same is true of . Note that the scaled all-ones matrix and scaled identity matrix are special cases of this result when or , respectively.
3 Prior distributions over the unconstrained space
Let denote the vectorisation operator. Conditional on a set of unknown hyperparameters, we construct a prior distribution with joint density
denote the vectorisation operator. Conditional on a set of unknown hyperparameters, we construct a prior distribution with joint density
in which is assigned a distribution over and , , is assigned a multivariate normal distribution. Predominantly, our focus in this paper is specification of a prior for the latter.
In the analysis of multivariate stochastic processes, a desirable property for the prior is often invariance under permutation of the order of the components of the multivariate observations. In this section, we begin by considering the effect on the reparameterisation of an orthogonal transformation, such as a permutation, of the components of the -variate observation vector. We then focus on the construction of permutation-invariant prior distributions in Section 3.2.
3.1 Invariance to orthogonal transformation
The multivariate normal distribution is closed under linear transformation and determined completely by its first two moments. Consider a permutation, rotation or any other orthogonal transformation of the observation vectors, that is
The multivariate normal distribution is closed under linear transformation and determined completely by its first two moments. Consider a permutation, rotation or any other orthogonal transformation of the observation vectors, that is, where is an orthogonal matrix. It follows that will follow a stationary process characterised by parameters
in which and for . Moreover, because is orthogonal, the symmetric square-roots of and are similar, with . Correspondingly, in (1) and (2) for the forward and reverse autoregressions on terms, the coefficients in the conditional expectations for the transformed process are and for , . Similarly, the matrix-square-roots of the conditional variance matrices are and for . The matrix triples and are therefore simultaneously similar and hence the -th partial autocorrelation in (4) for the transformed process is
for . It then follows trivially from (5) that in the unconstrained parameterisation for . Hence the parameters (10) map to or, equivalently, where and for . It is important to note that if Cholesky factors are used in the first or second part of the reparameterisation, the partial autocorrelation matrices and their real-valued transformations are not similarity transformations of and . This is why we choose to use symmetric square-roots, rather than Cholesky factors, in our mappings.
Consider a prior of the form (9) in which and each have distributions from the same family as and , respectively, for any orthogonal matrix . It follows that the prior induced for over has the appealing property of being closed under orthogonal transformation of the observation vectors. Distributions for over possessing this closure property include the inverse Wishart and the distribution induced by taking the matrix logarithm to be multivariate normal (Leonard and Hsu, 1992). For , the multivariate normal distribution meets this requirement.
3.2 Structured prior distributions
3.2.1 Exchangeable prior distribution
It often happens that, a priori, we do not have information to distinguish between the components of . In this case it would be reasonable to adopt a prior for which is exchangeable with respect to the ordering of the elements in the observation vectors.
Section 3.1 described the properties which the prior (9) must have in order to induce a distribution for which is closed under orthogonal transformation of the observation vectors. To obtain an exchangeable prior, we restrict our attention to distributions for and which would be the same as the distributions of and , respectively, for any permutation matrix . That is, distributions for and which are invariant under a common permutation of the rows and columns.
Given certain choices of their hyperparameters, both the (conjugate) inverse Wishart distribution and the multivariate normal distribution for the matrix-logarithm can yield an exchangeable prior for . For instance, we could assign an inverse Wishart prior with scale matrix equal to a positive definite two-parameter exchangeable matrix.
Now, suppose we wish to assign an exchangeable prior to , . Given the potential for the model to contain a very large number of parameters, suppose further that we want to specify a prior that allows borrowing of strength between the diagonal elements and between the off-diagonal elements of each . To this end, we can adopt a prior in which the diagonal and off-diagonal elements are taken to be independent and then given hierarchical priors. At the top-level, we take
|independently for ,||(11)|
|independently for with .||(12)|
The mean and precision at the bottom level of the hierarchy can then be assigned priors on and , such as
Marginally, , (for ) and with similar expressions for the moments of the off-diagonal elements. Therefore, given specifications for the common diagonal elements, say , and off-diagonal elements, say , in , one can calculate corresponding values for the common diagonal and off-diagonal elements, say and , in through (7)–(8). These can be taken as values for and . Uncertainty in these central values, and the proportion of this which is shared among all the diagonal or all the off-diagonal elements, can be reflected through choices of the other hyperparameters. Clearly, specifications which make the marginal variances small and the marginal correlations large will shrink the posterior so that it is more concentrated over the space of two-parameter exchangeable structures. Guidance on the general problem of eliciting prior distributions for correlations can be found in Revie et al. (2010).
3.2.2 Prior distribution centred on a diagonal matrix
Let . For each , suppose it is believed a priori that once is known, is the only element in which provides further information about . This is tantamount to a conjecture that the partial autocorrelation matrix , and hence , is diagonal; see Section 2.3.2. To represent this belief we choose a prior in which the diagonal and off-diagonal elements are independent and then take for the diagonal elements. Alternatively, if there was nothing in our prior beliefs to distinguish among the diagonal elements, and we wanted to allow borrowing of strength between them, we might adopt the hierarchical prior defined by (11) and (13).
For the off-diagonal elements (), we can centre our prior around zero, whilst allowing the data to influence the degree of shrinkage towards zero, by adopting a special case of the hierarchical prior defined by (12) and (14) in which the distribution for is a point mass at zero. Alternatively, as we discuss further in Section 3.2.4, the off-diagonal elements could be assigned a sparsity-inducing prior.
3.2.3 Prior distribution centred on a scaled-all ones matrix
Suppose there was nothing in our prior beliefs to distinguish among the elements of and it is believed a priori that once is known, it is only the sum that is informative about . This is equivalent to the hypothesis that the partial autocorrelation matrix , and hence , is a scaled all-ones matrix; see Section 2.3.1. To represent this belief we could choose a hierarchical prior of the form (11)–(14), but with and . Alternatively, we might like to allow and to differ in order to allow different degrees of shrinkage towards the common mean amongst the diagonal and off-diagonal elements.
3.2.4 Sparse matrices
A zero-mean model is highly parameterised with parameters; more precisely, there are degrees of freedom. In some applications, the number of observations in the time series can be small compared to the vector dimension . For example, in macroeconomics, data sets typically involve monthly, quarterly or even annual measurements, which clearly limits the practicable length of the series that can be observed. Consequently, it is entirely plausible for there to be fewer observations in the data set than there are parameters in the model. In principle, this is not a problem in the Bayesian paradigm; a parameter that cannot be identified in the likelihood can still be identifiable in the posterior. More practically, practioners can construct a joint prior distribution that allows information to be shared between parameters. However, in the absence of a carefully elicited prior distribution, which is a practical (albeit imperfect) reality in most applications, the posterior distribution can be very diffuse, and this lack of precision can filter through to predictive distributions. Given that forecasting is often the main inferential objective in time series analysis, a diffuse posterior predictive distribution is clearly undesirable.
In general, this issue is often addressed by using sparsity-inducing priors based on a zero-mean scale-mixture of normals. The mixing distribution can either be discrete, as in spike-and-slab priors (Mitchell and Beauchamp, 1988; George and McCulloch, 1993), or continuous, as in the (regularized) horseshoe (Carvalho et al., 2010; Piironen and Vehtari, 2017); see O’Hara and Sillanpää (2009) or Hahn and Carvalho (2015) for a review. Particularly in the financial and econometric literature, such ideas have been employed with models, most often assigning sparsity-inducing priors for the individual elements of the (unconstrained) autoregressive coefficient matrices ; for example, see George et al. (2008), Lei et al. (2011) or Billio et al. (2019). In such cases the parameters in the can be directly interpreted as regression coefficients in a linear model, and so the problem can be viewed as one of variable selection. Further, when the error variance matrix is assumed to be diagonal, a zero in position of implies conditional independence between and given where . In our parameterisation, although it is true that if and only if , an individual zero in position of does not give an individual zero in position of and vice versa. Therefore an individual zero in does not have a clear structural interpretation like an individual zero in . Although this weakens the motivation for sparsity-inducing priors from an explanatory perspective, they can still be helpful in reducing the variance of predictive distributions in situations where the number of observations in the time series is small compared to the vector dimension . In this case, a zero-mean scale-mixture of normals would take the form with , independently for , and with , independently for , where . Here and are either discrete or continuous mixing distributions.
3.3 The RML-parameterisation
Roy et al. (2019) establish a bijective mapping between the parameters of a stationary process and the parameter set