Bayesian Dynamic Fused LASSO

05/29/2019 · by Kaoru Irie, et al.

A new class of Markov processes is proposed to realize flexible shrinkage effects in dynamic models. The transition density of the new process consists of two penalty functions, similar in functional form to the Bayesian fused LASSO, that shrink the current state variable toward its previous value and toward zero. The normalizing constant of the density, which cannot be ignored in the posterior computation, is shown to be essentially a log-geometric mixture of double-exponential densities and is treated as part of the likelihood. The dynamic regression model with this new process as a prior is conditionally Gaussian and linear in the state variables, so the posterior can be computed efficiently by forward filtering and backward sampling within a Gibbs sampler. The latent integer-valued parameter of the log-geometric distribution is interpreted as the amount of shrinkage toward zero realized in the posterior and can be used to detect the periods in which the corresponding predictor becomes inactive. With the hyperparameters and observational variance estimated in the Gibbs sampler, the new prior is compared with the standard double-exponential prior in estimation and prediction with dynamic linear models for illustration. It is also applied to time-varying vector autoregressive models for US macroeconomic data, where it performs as an alternative to dynamic models of the variable-selection type, such as latent threshold models.


1 Introduction

Univariate dynamic linear models (DLMs) in practice are frequently over-parametrized because of the massive number of predictors, most of which are believed to be noise. For example, time-varying vector autoregressive models are typically decomposed into multiple univariate sub-models for computational feasibility (e.g., Zhao et al. 2016 and Gruber and West 2016), but this approach results in an excessive number of predictors in those sub-models even for moderately dimensional observations. This research contributes to the appropriate modeling of sparsity in univariate DLMs with many predictors by defining a new shrinkage prior on the dynamic coefficients, and by applying it to the modeling of multivariate time series.

Specifically, for the univariate state variable x_t, which is a time-varying regression coefficient in the context of DLMs, we consider the new Markov process defined by its transition density,

p(x_t | x_{t-1}) ∝ exp{ −λ₀|x_t| − λ₁|x_t − x_{t-1}| }.    (1)

The two penalty functions in the exponential realize the conditional shrinkage effects. By using this process as the prior in DLMs, we shrink the state variable at time t toward zero by the first penalty function, while shrinking it toward the previous state variable at time t−1 by the second penalty to discourage excess dynamics. The technical difficulty is apparently the unknown normalizing constant abbreviated in equation (1), which involves the state variable x_{t-1} and is not ignorable in the posterior analysis. The objective of this research is to compute this normalizing constant explicitly and to provide the computational methodology for efficient posterior analysis by Markov chain Monte Carlo methods.

The prior in (1) is named the dynamic fused LASSO (DFL) prior for its similarity to Bayesian fused LASSO models (e.g., Kyung et al. 2010 and Betancourt et al. 2017). The Bayesian fused LASSO has rarely been applied to time series analysis, and the problem of the unknown normalizing constant has not been discussed. This is because, in the Bayesian fused LASSO, the state variables are modeled jointly, not conditionally, by exponentiating the various penalty functions to define the joint density of all the state variables. By modeling the joint density directly, the normalizing constant becomes free of the state variables, which simplifies Bayesian inference for the fused LASSO models and enables the scale mixture representation of the double-exponential priors, as in the standard Bayesian LASSO models (Park and Casella, 2008). From this viewpoint, our research clearly differs from the existing fused LASSO models in modeling the conditional distribution of the state variables to express our prior belief in dynamic modeling, which instead poses the problem of computing the normalizing constant that has been ignored (or has not needed to be computed) in the study of the Bayesian fused LASSO. The modeling of the conditional distribution is crucial in predictive analysis; the direct application of the existing fused LASSO to the joint distribution of time-varying parameters does not define the conditional distribution coherently and cannot be used for sequential posterior updating and forecasting.

Although the normalizing constant complicates the prior and posterior distributions of the DLMs with the DFL process, we prove the augmented model representation of the DFL prior as a conditionally dynamic linear model (CDLM), for which efficient posterior sampling of the state variables by forward filtering and backward sampling (FFBS, e.g., see West and Harrison 1997) is available. Facilitating the posterior computation by FFBS with the help of a CDLM representation is the standard strategy in the literature of econometrics and forecasting, where dynamic sparsity has been realized by hierarchical DLMs (e.g., Frühwirth-Schnatter and Wagner 2010, Belmonte et al. 2014 and Bitto and Frühwirth-Schnatter 2019). This hierarchical version of dynamic linear models is obtained by the natural extension of DLMs with another prior on the scale parameters in the state equation. While the posterior of the state variables is easily computed by FFBS, these priors penalize only the distance between the two consecutive state variables, x_{t-1} and x_t, which is understood as the special case of the prior of this study with λ₀ = 0 in equation (1). Our approach, in contrast, integrates the additional penalty for the shrinkage toward zero explicitly into the conditional distribution of the prior process. This modeling approach reflects our prior belief that the state variable is likely to be either zero or unchanged from its previous value. The additional shrinkage effect to zero in the DFL prior can also address the problem of the shrinkage effect being restricted to be uniform over time, pointed out by Uribe and Lopes (2017) as "horizontal shrinkage," by localizing the shrinkage effect at each time with customized latent parameters, in a different way from Kalli and Griffin (2014), Uribe and Lopes (2017) and Kowal et al. (2019).

The conditional normality and linearity of the DFL prior are based on the fact that the prior process is decomposed into two parts: the synthetic likelihood and the synthetic prior. These terminologies literally mean that the prior consists of two components, one of which is treated as (part of) the likelihood and the other of which serves as the prior. The synthetic prior is just the well-known scale mixture of the Gaussian autoregressive process, hence normal and linear in the state variables. The synthetic likelihood part is equivalent to observing an artificial datum of zero whose mean is the state variable, which provides the additional information that shrinks the state variable toward zero. The posterior of this CDLM is equivalent to that of the original model, which justifies the use of FFBS for the model with the synthetic likelihood. The idea of the synthetic likelihood and prior has been utilized in the study of optimal portfolios that are sparse and less switching (e.g., Kolm and Ritter 2015 and Irie and West 2019), and this research considers the same idea in the context of statistical modeling.

The conditional density and its unknown normalizing constant are analyzed via the well-known scale mixture representation of the double-exponential distributions (Andrews and Mallows 1974 and West 1987), which has been applied to the Bayesian LASSO models. A methodological alternative is the approach of Hans (2009, 2011) to the Bayesian LASSO and elastic net models, in which the conditional posterior density becomes a mixture of truncated normals. While that approach is computationally efficient in avoiding the introduction of new continuous latent parameters, the scale mixture approach is useful in analytically computing the DFL process in a simple way. By taking the scale-mixture approach, we find that the normalizing constant translates into the synthetic likelihoods, enabling the straightforward implementation of the Gibbs sampler. The truncated distributions in the high-dimensional parameter space are computationally costly and should be avoided, which also motivates the scale-mixture approach.

The rest of the paper is structured as follows. Section 2 introduces the class of new Markov processes in general form, including the DFL process as its special case. Section 3 focuses on the DFL prior in (1) and proves its CDLM representation, which helps in understanding the implied prior belief about shrinkage effects in this model. Section 4 provides the MCMC algorithm for the DLMs with the DFL prior by using the properties proven in the previous sections, in addition to discussing the estimation of hyperparameters and observational variances. In Section 5, the proposed model is applied to simulated data for illustration (Section 5.1) and to US macroeconomic time series for comparison with a model of the variable-selection type (Section 5.2). The paper is concluded in Section 6 with potential future research.

Notations:

The density of the univariate normal distribution with mean μ and variance σ², evaluated at x, is denoted by N(x|μ, σ²). The double-exponential density with parameter λ is denoted by DE(x|λ) = (λ/2) e^{−λ|x|}. The gamma distribution with shape a and rate b is written as G(a, b), with mean a/b. The generalized inverse Gaussian distribution is denoted by GIG(p, a, b), the density of which is proportional to x^{p−1} exp{ −(ax + b/x)/2 }.

2 A new Markov process

2.1 General form of the process

A new class of univariate Markov processes is defined by its conditional density of transition, p(x_t | x_{t-1}), or, equivalently, by the loss function tied to the density by ℓ(x_t, x_{t-1}) = −log p(x_t | x_{t-1}), ignoring the constant. This transformation of probabilistic models to loss functions has been studied for statistical analysis and decision-making problems (e.g., see Müller, 1999). The loss functions proposed in this research consist of two penalty terms in the additive form, ℓ(x_t, x_{t-1}) = z₁(x_t − x_{t-1}) + z₂(x_t), where the functions z_i for i = 1, 2 satisfy 0 ≤ z_i(x) < ∞ for any real x. The exponential transformation of this loss function defines the conditional probability density,

p(x_t | x_{t-1}) ∝ exp{ −z₁(x_t − x_{t-1}) − z₂(x_t) }.    (2)

This study focuses on the special case of this class of models that defines the loss functions by z₁(x) = λ₁|x| and z₂(x) = λ₀|x| for some nonnegative weights (λ₀, λ₁), which implies the process in equation (1). The two different types of shrinkage effects are clearly seen in this functional form; one is the shrinkage of x_t to the previous point x_{t-1} and the other is the shrinkage to zero. It is important to note that the normalizing constant, which is abbreviated in (2), does depend on x_{t-1}, which complicates the joint density of (x₁, …, x_T).
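As a concrete illustration of why this normalizing constant matters, the following minimal sketch evaluates the unnormalized kernel of (1) and its normalizing constant by quadrature; the penalty form follows equation (1) as reconstructed above, and the grid and weight values are arbitrary illustrative choices.

```python
import numpy as np

def unnormalized_kernel(x_t, x_prev, lam0, lam1):
    """Unnormalized DFL transition kernel exp(-lam0*|x_t| - lam1*|x_t - x_prev|)."""
    return np.exp(-lam0 * np.abs(x_t) - lam1 * np.abs(x_t - x_prev))

# normalizing constant k(x_prev) by quadrature on a wide grid
grid = np.linspace(-60.0, 60.0, 240001)
for x_prev in (0.0, 1.0, 3.0):
    k = np.trapz(unnormalized_kernel(grid, x_prev, lam0=0.5, lam1=1.0), grid)
    print(f"x_prev = {x_prev:3.1f}  ->  k(x_prev) = {k:.6f}")
```

The printed values of k(x_prev) differ across x_prev, confirming numerically that the constant involves the conditioning state and cannot be dropped from the posterior.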

The exponentiated loss functions are assumed to be integrable. Equivalently, they are proportional to some probability density functions denoted by

f₁(x) ∝ exp{−z₁(x)} and f₂(x) ∝ exp{−z₂(x)},

where f₁ and f₂ are probability density functions on the real line. Also, f₁ is assumed to be symmetric, i.e., f₁(x − y) = f₁(y − x) for any pair (x, y). In addition, the normalizing constant of the shifted density f₁(· − y) is independent of y, i.e., the integral of f₁(x − y) in x does not depend on y. The process in equation (2) is written as

p(x_t | x_{t-1}) = f₁(x_t − x_{t-1}) f₂(x_t) / k(x_{t-1}),    (3)

where k(·) is the normalizing function. The analytical expression of k for the double-exponential loss functions is discussed later in Proposition 3.1.

Denote the marginal distribution of x_t by p*(x_t). The condition for this process to be stationary is

∫ p(x_t | x_{t-1}) p*(x_{t-1}) dx_{t-1} = p*(x_t),

and one solution of this functional equation is p*(x) ∝ f₂(x) k(x), with which the process also becomes reversible. In the following, we assume stationarity and the marginal density of this form.
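The stationarity and reversibility claims can be checked numerically; below is a minimal sketch, assuming the double-exponential losses of equation (1) and the candidate stationary density p*(x) ∝ f₂(x)k(x) as stated above.

```python
import numpy as np

lam0, lam1 = 0.5, 1.0
grid = np.linspace(-30.0, 30.0, 4001)
dx = grid[1] - grid[0]

f1 = lambda x: 0.5 * lam1 * np.exp(-lam1 * np.abs(x))  # density for the difference penalty
f2 = lambda x: 0.5 * lam0 * np.exp(-lam0 * np.abs(x))  # density for the zero penalty

# k(y) = integral of f1(x - y) f2(x) dx, evaluated at every grid point y
k = np.array([np.sum(f1(grid - y) * f2(grid)) * dx for y in grid])
pstar = f2(grid) * k
pstar /= np.sum(pstar) * dx                            # candidate stationary density

# detailed balance: p(x|y) p*(y) should equal p(y|x) p*(x)
i, j = 2500, 1700                                      # two arbitrary grid points
lhs = f1(grid[i] - grid[j]) * f2(grid[i]) / k[j] * pstar[j]
rhs = f1(grid[j] - grid[i]) * f2(grid[j]) / k[i] * pstar[i]
print(lhs, rhs)                                        # agree up to float error
```

The two sides agree because p(x|y)p*(y) ∝ f₁(x − y)f₂(x)f₂(y), which is symmetric in (x, y) by the symmetry of f₁.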

Furthermore, assume that f₁ and f₂ are scale mixtures of normals,

f₁(x) = ∫ N(x|0, w) g₁(w) dw and f₂(x) = ∫ N(x|0, v) g₂(v) dv,

for some mixing densities g₁ and g₂. Many statistical models, including the double-exponential, horseshoe and other distributions, fall into this class of loss functions z_i, or densities f₁ and f₂. Using this scale-mixture representation, we have

f₁(x_t − x_{t-1}) f₂(x_t) = ∫∫ N(x_t | φ x_{t-1}, φw) N(x_{t-1} | 0, v + w) g₁(w) g₂(v) dw dv, with φ = v/(v + w),

hence the normalizing function is

k(x_{t-1}) = ∫∫ N(x_{t-1} | 0, v + w) g₁(w) g₂(v) dw dv,    (4)

i.e., the scale mixture of normals whose mixing distribution is the convolution of g₁ and g₂. The conditional density also has the following mixture representation,

p(x_t | x_{t-1}) = k(x_{t-1})⁻¹ ∫∫ N(x_t | φ x_{t-1}, φw) N(x_{t-1} | 0, v + w) g₁(w) g₂(v) dw dv,    (5)

i.e., the location-scale mixture of normals. Note that the integrand, the normal density of x_t, is the regression of x_t on x_{t-1} with coefficient φ = v/(v + w). Conditional on the latest state x_{t-1}, the current state x_t is shrunk toward φ x_{t-1}. The amount of this conditional shrinkage is determined by the balance of the two conflicting loss functions, z₁ and z₂.

2.2 Joint distribution and examples

Consider a Markov process whose transition density is given in equation (3), with the initial distribution p(x₁) ∝ f₂(x₁) k(x₁). The joint distribution of (x₁, …, x_T) can be decomposed into two parts – the synthetic likelihood and the synthetic prior – as

p(x₁, …, x_T) ∝ { ∏_{t=1}^{T} f₂(x_t) / ∏_{t=2}^{T−1} k(x_t) } × { ∏_{t=2}^{T} f₁(x_t − x_{t-1}) }.

If one ignores the “likelihood” above, the transition simply comprises f₁(x_t − x_{t-1}), where the explicit shrinkage of x_t is set only toward a single point, x_{t-1}, not toward both x_{t-1} and zero. The Markov process with such a “prior” is widely used in state space modeling, for the assumption that f₁ has the mixture representation simplifies the posterior computation. The “likelihood” involves the reciprocal of the normalizing function k and feeds the synthetic prior. Hence, the proposed Markov process is characterized as the “posterior” of this synthetic model.

Example 1: The Gaussian random walk is obtained by z₁(x) = x²/(2w) and z₂(x) = 0. The Gaussian AR(1) process for x_t is realized by z₁(x) = x²/(2w) and z₂(x) = x²/(2v), where the autoregressive coefficient is φ = v/(v + w) and the innovation variance is φw. These two processes are the typical examples of the prior in DLMs.

Example 2: The Bayesian elastic net combines the two types of penalties, e.g., z₁(x) = x²/(2w) and z₂(x) = λ₀|x|, so that the conditional density is of elastic net form as a function of x_t. This process is considered in Hans (2011) as the prior of static linear models, without the use of the scale mixture representation.

Example 3: DFL, or the two penalties z₁(x) = λ₁|x| and z₂(x) = λ₀|x|. The focus of this research is on the process of this type. In the probability representation, f₁ is the double-exponential density DE(·|λ₁) and f₂ is DE(·|λ₀), both of which are scale mixtures of the Gaussian densities in Example 1, where the scale parameter follows the gamma distribution with shape 1 and rate λ₁²/2 or λ₀²/2. We refer to this model as the dynamic fused LASSO, or the DFL prior/process. If λ₀ = 0, the model is greatly simplified and easily applied to state space modeling as the dynamic extension of the Bayesian LASSO.
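The scale-mixture fact used here is the classical Andrews–Mallows identity: a DE(λ) draw equals a normal draw whose variance follows the gamma distribution with shape 1 (i.e., exponential) and rate λ²/2. A quick Monte Carlo check:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n = 1.5, 200_000

# DE(lam) as a scale mixture: v ~ Gamma(shape=1, rate=lam^2/2), x | v ~ N(0, v)
v = rng.gamma(shape=1.0, scale=2.0 / lam**2, size=n)   # numpy uses scale = 1/rate
x = rng.normal(0.0, np.sqrt(v))

y = rng.laplace(0.0, 1.0 / lam, size=n)                # direct double-exponential draws
print(np.quantile(x, [0.1, 0.5, 0.9]))
print(np.quantile(y, [0.1, 0.5, 0.9]))                 # quantiles agree up to MC error
```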

Example 4: The horseshoe prior (Carvalho et al. 2009, 2010) can also be used in modeling this process. The normalizing density k then involves the upper incomplete gamma function. The horseshoe prior is also a scale mixture of normals; now the scale follows the half-Cauchy distribution (Gelman 2006) or, equivalently, a gamma mixture of gamma distributions. If only one of the two penalties is of this horseshoe type, the posterior computation is straightforward. A further extension can be considered as the gamma mixture with general shape parameters, known as the Bayesian bridge models (Polson et al., 2014).

3 Dynamic fused LASSO

3.1 Double-exponential models

In the rest of the paper, we focus on the DFL prior of Example 3 in Section 2.2, where the loss functions are z₁(x) = λ₁|x| and z₂(x) = λ₀|x|. These loss functions imply the double-exponential densities,

f₁(x) = (λ₁/2) e^{−λ₁|x|} and f₂(x) = (λ₀/2) e^{−λ₀|x|},

for the non-negative weights λ₀ and λ₁. Unless otherwise specified, we assume λ₀ < λ₁ to avoid an excessive shrinkage effect to zero. The other cases, λ₀ = λ₁ and λ₀ > λ₁, are also discussed for the completeness of the theory.

The computation of the normalizing function is straightforward.

Proposition 3.1

The normalizing density is, if λ₀ < λ₁,

k(x) = { λ₀λ₁ / (2(λ₁² − λ₀²)) } { λ₁ e^{−λ₀|x|} − λ₀ e^{−λ₁|x|} },    (6)

and, if λ₀ = λ₁ = λ,

k(x) = (λ/4)(1 + λ|x|) e^{−λ|x|}.

The density for λ₀ > λ₁ is obtained by exchanging λ₀ and λ₁ in equation (6).

Proof.  The following computation can be used for both λ₀ < λ₁ and λ₀ = λ₁. In the mixture representation of the normalizing constant in equation (4), apply the change of variables (v, w) → (u, φ) with u = v + w and φ = v/(v + w). Integrate out φ first, then u. The resulting function involves the Bessel function of the second kind with half-odd integer order, which is simplified into the form above.

Proposition 3.1 enables the evaluation of densities involving k, such as the conditional density p(x_t | x_{t-1}) and the stationary density p*(x) ∝ f₂(x)k(x). In Figure 1, the conditional density with x_{t-1} and λ₁ fixed is plotted for different choices of λ₀. As the weight λ₀ increases, the shrinkage effect to zero becomes visually clear. When λ₀ = λ₁, the two shrinkage effects are completely balanced, which is expressed in the functional form of the density as the plateau between x_{t-1} and zero. With this density as the prior, one does not discriminate among the states between these two points a priori. In practice, however, we do not want this extreme amount of shrinkage to zero when transitioning from the last state to the next and, for this reason, we assume λ₀ < λ₁ in application. Figure 2 shows the marginal, stationary density p*. In general, smaller λ₀ leads to fatter tails, avoiding the shrinkage effect on outliers. The spike around zero still exists even for the densities with small λ₀, so the marginal shrinkage effect to zero is preserved. For large λ₀, the density becomes more spiky, but it is clearly different from the double-exponential density. These characteristics of the priors are revisited from different viewpoints later in Section 3.2 for the better choice of hyperparameters.

Figure 1: Conditional prior density of the state x_t given x_{t-1}, with λ₁ fixed and various λ₀. If λ₀ = λ₁, the plateau appears between the two shrinkage points, which means that x_{t-1}, zero and the values between them are indifferent in the prior. As λ₀ becomes small, the shrinkage effect to x_{t-1} dominates the functional form of the density, while some probability mass still remains around zero.
Figure 2: Marginal prior density of x_t with λ₁ fixed and various λ₀. The smaller λ₀ is, the fatter the tails of the prior density become. The double-exponential prior density is also shown in the same figure. The density of the stationary distribution of the DFL process is clearly different from the double-exponential density for any λ₀.
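The shapes described in these captions are easy to reproduce on a grid; a small sketch follows, with x_{t-1} and λ₁ fixed at arbitrary illustrative values.

```python
import numpy as np

def conditional_density(x, x_prev, lam0, lam1):
    """DFL conditional density on a grid, normalized numerically."""
    dens = np.exp(-lam0 * np.abs(x) - lam1 * np.abs(x - x_prev))
    return dens / np.trapz(dens, x)

x = np.linspace(-4.0, 8.0, 4801)
x_prev = 3.0
for lam0 in (0.2, 0.5, 1.0):                     # lam1 is fixed at 1.0
    p = conditional_density(x, x_prev, lam0, lam1=1.0)
    idx = [np.searchsorted(x, v) for v in (0.0, 1.5, x_prev)]
    print(lam0, p[idx])   # at lam0 = lam1 the three heights coincide: the plateau
```

Between zero and x_{t-1} the two penalties sum to a constant when λ₀ = λ₁, which is exactly the plateau seen in Figure 1.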

In the expression of equation (6), note that ρ = λ₀/λ₁ < 1 because λ₀ < λ₁ by assumption. This guarantees the absolute convergence of the series expression of the reciprocal normalizing function,

1/k(x) ∝ e^{λ₀|x|} Σ_{n=0}^{∞} ρⁿ e^{−n(λ₁−λ₀)|x|}.

The reciprocal of the double-exponential density with parameter λ₀ seen here will be canceled out by another function in the synthetic likelihood. The rest is the mixture of two components: (i) a constant, for n = 0, which makes no contribution to the posterior as the synthetic likelihood, and (ii) the discrete mixture of double-exponential distributions with parameter n(λ₁ − λ₀) for n ≥ 1. In the second component, the mixture weight is proportional to the probability function of the log-geometric distribution with parameter ρ, g(n) ∝ ρⁿ/n, defined by the series representation of −log(1 − ρ). We summarize this observation as a proposition.
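The closed form of k in Proposition 3.1 and its geometric-series reciprocal can be cross-checked numerically. The constants below follow the reconstruction of equation (6) and the series above, so this should be read as a consistency check rather than as the paper's own code.

```python
import numpy as np

lam0, lam1, x = 0.5, 1.0, 1.3
rho = lam0 / lam1

# closed form of k(x) in Proposition 3.1 (the case lam0 < lam1)
k = lam0 * lam1 / (2 * (lam1**2 - lam0**2)) * (
    lam1 * np.exp(-lam0 * abs(x)) - lam0 * np.exp(-lam1 * abs(x)))

# series form of 1/k(x): proportional to e^{lam0|x|} * sum_n rho^n e^{-n(lam1-lam0)|x|}
n = np.arange(0, 200)
series = np.exp(lam0 * abs(x)) * np.sum(rho**n * np.exp(-n * (lam1 - lam0) * abs(x)))
series *= 2 * (lam1**2 - lam0**2) / (lam0 * lam1**2)   # proportionality constant
print(1.0 / k, series)                                 # the two values agree
```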

Proposition 3.2

If λ₀ < λ₁, the reciprocal of the normalizing function is written as

1/k(x) ∝ e^{λ₀|x|} m(x),

where m(x) is the mixture of a constant and the log-geometric mixture of double-exponential densities,

m(x) = w₀ + w₁ Σ_{n=1}^{∞} g(n) DE(x | n(λ₁ − λ₀)),    (7)

where ρ = λ₀/λ₁, g(n) = ρⁿ / { −n log(1 − ρ) } for n ≥ 1, and (w₀, w₁) ∝ (λ₁ − λ₀, −2 log(1 − ρ)).

In the posterior computation, the n-th component of the discrete mixture is interpreted as the double-exponential density with location x_t and parameter n(λ₁ − λ₀), i.e., DE(n(λ₁ − λ₀)), evaluated at zero. In other words, we separate this discrete mixture from the prior and interpret it as part of the likelihood, or the “synthetic” likelihood, where we generate n_t from the log-geometric distribution and, if n_t ≥ 1, we pretend to observe zero with the double-exponential likelihood above. Note that the resulting posterior distribution is equivalent to the original prior. This transformation of the prior implies the CDLM representation, where we can benefit from the efficient computation by FFBS and the known conjugate modeling of other model parameters, such as stochastic volatility.
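Sampling the latent count from the log-geometric distribution is elementary; below is a truncation-based sketch (the truncation point n_max is an arbitrary choice, justified by the geometric decay of the tail).

```python
import numpy as np

def sample_log_geometric(rho, size, rng, n_max=1000):
    """Draws from p(n) proportional to rho^n / n on n = 1, 2, ...

    The normalizer is -log(1 - rho), the series that names the distribution;
    n_max truncates the geometrically decaying tail."""
    n = np.arange(1, n_max + 1)
    pmf = rho**n / n
    pmf /= pmf.sum()
    return rng.choice(n, size=size, p=pmf)

rng = np.random.default_rng(1)
draws = sample_log_geometric(0.4, 10_000, rng)
# compare with the exact mean rho / {-(1 - rho) log(1 - rho)}
print(draws.mean(), 0.4 / (-(1 - 0.4) * np.log(1 - 0.4)))
```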

Proposition 3.3

If λ₀ < λ₁, the joint distribution of the DFL prior can be computed as the posterior of the following conditional dynamic linear model; the state evolution and the process of the associated latent scale parameter are, for t ≥ 2,

x_t = x_{t-1} + ω_t,  ω_t | w_t ~ N(0, w_t),  w_t ~ G(1, λ₁²/2),    (8)

with the initial distribution at t = 1,

x₁ | w₁ ~ N(0, w₁),  w₁ ~ G(1, λ₀²/2).

The synthetic observation is defined as, at t = T,

y*_T = 0,  y*_T | x_T, s_T ~ N(x_T, s_T),  s_T ~ G(1, λ₀²/2).

For 1 < t < T, the latent count n_t follows the discrete distribution

n_t ~ w₀ δ₀ + w₁ LG(ρ),    (9)

where δ₀ here is the point mass on n_t = 0 and LG(ρ) is the discrete distribution on positive integers whose probability function is g(n) ∝ ρⁿ/n. The quantities ρ, g, w₀ and w₁ are defined in Proposition 3.2. Conditional on n_t ≥ 1, we additionally have the observation equation defined by

y*_t = 0,  y*_t | x_t, s_t ~ N(x_t, s_t),  s_t ~ G(1, n_t²(λ₁ − λ₀)²/2).    (10)

For n_t = 0, we have no additional observation equation at time t; the synthetic observation is treated as missing, and the associated scale can follow any prior distribution.

In this augmented joint distribution of state variables, the shrinkage effects of the new Markov process gain a new interpretation. If n_t = 0 is sampled at time t, then the model has no synthetic observation at t, or y*_t is “missing.” If n_t ≥ 1 is sampled, then y*_t = 0 is observed and the state variable is affected by this additional information that encourages the shrinkage to zero at time t. This is exactly the “local/vertical” shrinkage effect, which is different from the global/horizontal shrinkage that is uniformly applied to the state variables at all time points, as pointed out by Uribe and Lopes (2017). The amount of this shrinkage is indirectly controlled by the weights λ₀ and λ₁; the smaller λ₀ is relative to λ₁, the more likely it is that n_t = 0, having less shrinkage effect to zero, which is consistent with the interpretation of the weights in the loss function.

Proof.  Propositions 3.1 and 3.2 prove the geometric series representation of the transition density as

p(x_t | x_{t-1}) ∝ f₁(x_t − x_{t-1}) f₂(x_t) e^{λ₀|x_{t-1}|} m(x_{t-1}),

where m is the discrete mixture given in equation (7). With the latent variable n_t in the mixture representation, the transition density is understood as the marginal over n_t. Marginally, n_t = 0 with probability w₀, and n_t ≥ 1 with probability w₁. Jointly, for n_t = 0, the mixture component is constant in the state and, for n_t ≥ 1, it is the double-exponential density with location x_t and parameter n_t(λ₁ − λ₀) evaluated at zero, denoted DE(0 | x_t, n_t(λ₁ − λ₀)). In the joint density of the states and the auxiliary variables, observe that the factor e^{λ₀|x_t|} appears both in the numerator (through the reciprocal normalizing functions) and in the denominator (through f₂(x_t)) for 1 < t < T, and the two cancel out each other, so that

p(x_{1:T}, n_{2:(T−1)}) ∝ { ∏_{t=2}^{T} f₁(x_t − x_{t-1}) } × { f₂(x₁) f₂(x_T) ∏′_t DE(0 | x_t, n_t(λ₁ − λ₀)) } × p(n_{2:(T−1)}),    (11)

where ∏′ means that the product is taken only over the t, 1 < t < T, that satisfy n_t ≥ 1, and p(n_{2:(T−1)}) is free of the state variables. Using the scale mixture representation of the double-exponential distributions, we can read off from equation (11) the conditional distribution of the states given Λ_T, the set of all the latent variables up to time T. The densities in the second pair of braces imply the linear and Gaussian (synthetic) likelihood, and those in the first pair imply the Gaussian random-walk state evolution prior. This is exactly a dynamic linear model conditional on all the latent variables.

As a whole, this is exactly the CDLM of the proposition.
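Given all the latent counts and scales, the CDLM above is a local-level Gaussian model with possibly missing synthetic observations, so a generic univariate FFBS applies. Below is a minimal sketch; the function and variable names are ours, not the paper's, s and w denote the conditional observation and evolution variances produced by the augmentation, and NaN marks a missing synthetic observation.

```python
import numpy as np

def ffbs(y, s, w, m0=0.0, C0=100.0, rng=None):
    """FFBS for the local-level CDLM: y_t = x_t + N(0, s_t), x_t = x_{t-1} + N(0, w_t).

    y_t = np.nan encodes a missing synthetic observation (no update at t);
    (m0, C0) parametrize the N(m0, C0) prior on the pre-sample state x_0."""
    rng = rng or np.random.default_rng()
    T = len(y)
    m, C = np.empty(T), np.empty(T)
    a, R = m0, C0
    for t in range(T):                       # forward filtering
        R = R + w[t]                         # one-step-ahead prior for x_t
        if np.isnan(y[t]):                   # missing observation: no update
            m[t], C[t] = a, R
        else:
            A = R / (R + s[t])               # adaptive (Kalman) coefficient
            m[t], C[t] = a + A * (y[t] - a), R - A * R
        a, R = m[t], C[t]
    x = np.empty(T)                          # backward sampling
    x[-1] = rng.normal(m[-1], np.sqrt(C[-1]))
    for t in range(T - 2, -1, -1):
        B = C[t] / (C[t] + w[t + 1])
        h, H = m[t] + B * (x[t + 1] - m[t]), C[t] - B * C[t]
        x[t] = rng.normal(h, np.sqrt(H))
    return x

T = 100
rng = np.random.default_rng(2)
y = np.where(rng.random(T) < 0.5, 0.0, np.nan)   # synthetic zeros, half of them missing
x_draw = ffbs(y, s=np.full(T, 1.0), w=np.full(T, 0.5), rng=rng)
print(x_draw[:5])
```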

In Section 4.2, the weight parameters are also estimated. For the computation of the conditional posterior of the weights, the state density in equation (11) is simplified as follows.

Proposition 3.4

As a function of the weights (λ₀, λ₁), the joint density of the states and latent counts is written as
Remark 3.1

If λ₀ > λ₁, the reciprocal of the normalizing function has an integral representation and is understood as a continuous mixture of double-exponential distributions with a continuous latent parameter. Although not discussed further in this paper, the corresponding synthetic model can be found even in this case and utilized for the posterior computation.

3.2 Subjective elicitation of weight parameters

The weights λ₀ and λ₁ determine the structure of sparsity in the model. As hyperparameters, these weights must be carefully chosen so that the prior appropriately reflects one's belief about the dynamics and sparsity of the state variables. One approach to the choice of hyperparameters is to specify the conditional mean of the state transition in equation (5). This density is the location-scale mixture of the Gaussian AR(1) process,

p(x_t | x_{t-1}) = ∫∫ N(x_t | φ x_{t-1}, φw) π(φ, w | x_{t-1}) dφ dw,    (12)

where the mixing distribution is

π(φ, w | x_{t-1}) ∝ N(x_{t-1} | 0, w/(1 − φ)) g₁(w) g₂(φw/(1 − φ)) w/(1 − φ)²,    (13)

which is obtained by the change of variables φ = v/(v + w), i.e., v = φw/(1 − φ). The conditional expectation is one of the useful summaries that statisticians can interpret as the “conditional shrinkage effect” and use to elicit their prior beliefs. The distribution of φ itself is analytically available as well and provides further information on one's prior belief.

Proposition 3.5

For the DFL prior with λ₀ < λ₁, the conditional transition density in equation (12) has the mean φ̄(x_{t-1}) x_{t-1}, where

φ̄(x_{t-1}) = ∫∫ φ π(φ, w | x_{t-1}) dφ dw.    (14)

The joint distribution in equation (13) is decomposed into the compositional form. The conditional distribution of w given (φ, x_{t-1}) is GIG(3/2, λ₁² + λ₀²φ/(1 − φ), (1 − φ)x_{t-1}²), and the marginal density of φ is

(15)

See the appendix for the computation that proves these results.

The conditional mean in equation (14) is plotted against x_{t-1} in Figure 3(a). When λ₀ is small, the conditional means are almost on the diagonal line, showing more shrinkage toward x_{t-1} as in the random walk models. In contrast, for large λ₀, the shrinkage effect to zero becomes strong, moving the conditional means off the diagonal line in the figure. The conditional mean is a non-linear function of x_{t-1}, as shown in equation (14), and this non-linearity is also visually confirmed in the figure. Figure 3(b) depicts the density of the shrinkage effect φ in equation (15), conditional on x_{t-1}, with w marginalized out. For all the values of λ₀ examined here, the prior mass concentrates around φ = 1, implying that the dominating conditional shrinkage effect is directed to x_{t-1}, not to zero. For larger λ₀, however, more probability mass is placed on the smaller values of φ as well.

These two figures show just one aspect of the prior structure; for different choices of the remaining quantities, the conditional prior mean and density look completely different, and it is difficult to summarize those differences in a simple manner. This is a potential difficulty in subjectively specifying the prior, which motivates the estimation of the hyperparameters. The automatic adjustment of hyperparameters via posterior analysis is discussed later in Section 4.2, but the choice of hyperparameters based on the discussion here still remains important.
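The conditional shrinkage mean φ̄(x_{t-1}) can also be approximated without the closed form, by self-normalized importance sampling over the mixing scales of the location-scale mixture (5): draw (v, w) from their exponential mixing densities and weight by the tilt N(x_{t-1} | 0, v + w). This is a sketch under the reconstructed mixture representation above, not the paper's algorithm.

```python
import numpy as np
from scipy.stats import norm

def conditional_shrinkage_mean(x_prev, lam0, lam1, n=500_000, seed=3):
    """Estimates E[x_t | x_{t-1}] = E[phi | x_{t-1}] * x_{t-1} by self-normalized
    importance sampling, reweighting prior draws of (v, w) by N(x_{t-1}|0, v+w)."""
    rng = np.random.default_rng(seed)
    v = rng.exponential(2.0 / lam0**2, size=n)   # mixing scale of the zero penalty
    w = rng.exponential(2.0 / lam1**2, size=n)   # mixing scale of the dynamics penalty
    weight = norm.pdf(x_prev, loc=0.0, scale=np.sqrt(v + w))
    phi = v / (v + w)                            # conditional shrinkage factor
    return x_prev * np.sum(weight * phi) / np.sum(weight)

for x_prev in (0.5, 1.0, 3.0):
    print(x_prev, conditional_shrinkage_mean(x_prev, lam0=0.5, lam1=1.0))
```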

(a) Conditional prior mean of state
(b) Prior density of conditional shrinkage
Figure 3: Left: Conditional prior mean with λ₁ fixed and various λ₀, as a function of x_{t-1}. The larger λ₀ is, the more shrinkage to zero is observed. Conversely, if λ₀ is small enough, then the conditional mean is close to x_{t-1}. The shrinkage effect also depends on x_{t-1} in a non-linear way. Right: Prior density of φ with λ₁ fixed and various λ₀, with w marginalized out. Small λ₀ implies φ ≈ 1 with high probability, having little shrinkage to zero. For large λ₀, it is still likely that φ is close to unity, but more probability mass is on smaller values of φ, resulting in the additional shrinkage to zero.

3.3 Simulation from the prior

Simulating the random variable x_t from the conditional distribution p(x_t | x_{t-1}) is an important step in the one-step and multi-step ahead predictions. The location-scale mixture representation in equation (12) enables the simulation from the normal distribution N(φ x_{t-1}, φw), given that (φ, w) is sampled from the mixing distribution in (13). Proposition 3.5 provides the compositional form of this joint density, in which the conditional of w given φ is the generalized inverse Gaussian distribution and relatively easy to sample from (e.g., Devroye 2014).

The remaining problem is the sampling of φ from the marginal in equation (15). The density of φ is transformed into a relatively simple form by taking another scale.

Proposition 3.6

The marginal density of φ, expressed in a transformed scale, is

(16)

and the corresponding distribution function is

(17)

The explicit formula of the distribution function helps in computing the inverse distribution function numerically, and thus allows inverse-CDF sampling. See the appendix for more details.
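A generic grid-based inverse-CDF sampler suffices once the distribution function in (17) can be evaluated. The sketch below takes an arbitrary increasing distribution function as input and is illustrated with a standard exponential, since (17) itself is not reproduced in this copy.

```python
import numpy as np

def inverse_cdf_sampler(cdf, lo, hi, size, rng, n_grid=10_000):
    """Samples by numerically inverting an increasing distribution function on [lo, hi]."""
    grid = np.linspace(lo, hi, n_grid)
    F = cdf(grid)
    F = (F - F[0]) / (F[-1] - F[0])     # normalize away truncation error at the ends
    u = rng.uniform(size=size)
    return np.interp(u, F, grid)        # piecewise-linear inverse CDF

rng = np.random.default_rng(4)
draws = inverse_cdf_sampler(lambda t: 1.0 - np.exp(-t), 0.0, 20.0, 10_000, rng)
print(draws.mean())                     # close to 1.0 for the standard exponential
```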

3.4 Interpretation of shrinkage effects

The DFL prior is characterized by the two conflicting shrinkage effects, toward the previous state and toward zero, that are not seen in the other shrinkage priors used in time series analysis. Proposition 3.3 gives these shrinkage effects a new interpretation; the shrinkage to x_{t-1} is based on the state equation (8), while the shrinkage to zero is achieved by the synthetic observations in (10). The former has been seen in the literature of DLMs with shrinkage priors, where the state transition is defined by

x_t = x_{t-1} + ω_t,  ω_t | w_t ~ N(0, w_t).

If w_t follows the gamma distribution with shape 1, then this is the special case of the DFL prior with λ₀ = 0. Another example is the case where the scale is constant over time and follows the half-Cauchy prior or its mixture (Frühwirth-Schnatter and Wagner 2010, Belmonte et al. 2014 and Bitto and Frühwirth-Schnatter 2019). As discussed already, the shrinkage effect of this prior is limited by its constant scale; this has been improved by Kalli and Griffin (2014), Uribe and Lopes (2017) and Kowal et al. (2019). In all the examples here, the prior consists of a single shrinkage effect toward the previous state. The DFL prior is clearly different from those in adding the new shrinkage effect to zero.

Another shrinkage model that should be considered for comparison with the DFL prior is the existing fused LASSO prior of Kyung et al. (2010) applied to the joint distribution of state variables (the joint fused LASSO prior, or JFL). The joint distribution of the state variables is modeled as

p(x₁, …, x_T) ∝ exp{ −λ₀ Σ_{t=1}^{T} |x_t| − λ₁ Σ_{t=2}^{T} |x_t − x_{t-1}| }.    (18)

This density has a similar, but simpler, synthetic model representation,

p(x₁, …, x_T) ∝ { ∏_{t=2}^{T} f₁(x_t − x_{t-1}) } { ∏_{t=1}^{T} f₂(x_t) },

where the synthetic observation y*_t = 0 with the DE(λ₀) likelihood is present for all t. The model becomes a CDLM with the scale mixture augmentation of the double-exponential distributions. However, the JFL model lacks several desired properties that the proposed DFL possesses. First, the flexibility of the JFL prior in its shrinkage effect is limited. This is indicated by the difference of the DFL/JFL priors in the synthetic likelihood; while the DFL prior allows for the possibility of missing synthetic observations, so that the shrinkage effect to zero can vary across time based on the value of n_t, the synthetic observation in the JFL model is always observed, and its shrinkage effects are equally controlled by the parameter λ₀ at all time points. In this sense, the flexibility of the JFL model in local shrinkage is limited. Second, the normalizing constant of the JFL prior in (18) is unknown. It does not involve the state variables, but it does involve the weight parameters (λ₀, λ₁), so the posterior analysis of the weights is infeasible at this point, unlike in the model with the DFL prior. Last, and most importantly, predictive analysis is not available with the JFL prior. The joint density of the JFL prior does not specify the evolution of the state variables, and the existence of state variables after time T is not assumed in the model. If one defines the joint distributions of (x₁, …, x_T) and (x₁, …, x_{T+1}) by (18) individually, then the latter is not consistent with the former as its marginal, and the conditional density of x_{T+1} given the past is not coherently defined. This concludes that the JFL prior is not suitable for formal sequential and predictive analysis.

3.5 Baseline of prior process

The marginal prior mean of the states has been set to zero so that the additional shrinkage is directed to zero. The point of shrinkage can be changed from zero to any value (or baseline) and, in fact, can be estimated with some prior. Denote this baseline by b, and modify the transition density of the DFL process as

p(x_t | x_{t-1}) ∝ exp{ −λ₀|x_t − b| − λ₁|x_t − x_{t-1}| }.

The same augmentation of the model can be applied to this case. The only difference is that the value of the synthetic observation is now the baseline, i.e., y*_t = b. The baseline is the location parameter in the synthetic likelihoods and in the initial distribution at t = 1. We use the normal prior for the baseline, which is conditionally conjugate in the synthetic model. Alternatively, one can choose the double-exponential/horseshoe prior to introduce a shrinkage effect on the baseline.

4 Application to state-space modeling

4.1 Estimation by Markov chain Monte Carlo

Consider the Gaussian and linear likelihood of the state space model given by

y_t = F_t′ x_t + ε_t,  ε_t ~ N(0, σ_t²),    (19)

where F_t is the vector of predictors known at time t, x_t = (x_{1,t}, …, x_{p,t})′ is the vector of state variables, and σ_t² is the observational variance parameter modeled later in Section 4.3. Each state variable independently follows the DFL process,

p(x_{j,t} | x_{j,t-1}) ∝ exp{ −λ_{0,j}|x_{j,t} − b_j| − λ_{1,j}|x_{j,t} − x_{j,t-1}| },    (20)

where the baseline and weights are customized for each predictor and denoted by b_j and (λ_{0,j}, λ_{1,j}). The CDLM representation of the prior given in Proposition 3.3 is now combined with the “real” likelihood in (19). Following the notation in Proposition 3.3, the set of all the latent variables introduced for the j-th state variable is denoted by Λ_j. The algorithm of the Gibbs sampler for the posterior inference can be derived easily from the CDLM representation.


Gibbs sampler for the Bayesian dynamic fused LASSO models

  1. Sampling the states x_{1:T} by forward filtering and backward sampling (FFBS).

    The conditional posterior of x_{1:T} is equivalent to the posterior of the conditionally dynamic linear model with the “real” likelihood in (19) and the following “synthetic” likelihoods and prior. For each j and 1 < t < T, if n_{j,t} ≥ 1, then the model has the synthetic likelihood,

    b_j = x_{j,t} + ε*_{j,t},  ε*_{j,t} ~ N(0, s_{j,t}).

    At t = T, the model always has the synthetic likelihood,

    b_j = x_{j,T} + ε*_{j,T},  ε*_{j,T} ~ N(0, s_{j,T}).

    The state evolution of the CDLM is defined by the synthetic prior; for t ≥ 2,

    x_{j,t} = x_{j,t-1} + ω_{j,t},  ω_{j,t} ~ N(0, w_{j,t}),

    and, for t = 1,

    x_{j,1} ~ N(b_j, w_{j,1}).

    By FFBS, one can sample from the full posterior of x_{1:T} in this CDLM (e.g., West and Harrison 1997, Chap. 4.8).

  2. Sampling the latent variables.

    They are independently sampled from