Bayesian Eigenvalue Regularization via Cumulative Shrinkage Process

06/11/2020 · by Masahiro Tanaka

This study proposes a novel hierarchical prior for inferring possibly low-rank matrices measured with noise. We consider three-component matrix factorization, as in singular value decomposition, and its fully Bayesian inference. The proposed prior is specified by a scale mixture of exponential distributions that has spike and slab components. The weights for the spike/slab parts are inferred using a special prior based on a cumulative shrinkage process. The proposed prior is designed to push less important, or essentially redundant, eigenvalues toward zero increasingly aggressively, leading to more accurate estimates of low-rank matrices. To ensure parameter identification, we simulate posterior draws from an approximated posterior, in which the constraints are slightly relaxed, using a No-U-Turn sampler. By means of a set of simulation studies, we show that our proposal is competitive with alternative prior specifications and that it does not incur a significant additional computational burden. We apply the proposed approach to sectoral industrial production in the United States to analyze the structural change during the Great Moderation period.


1 Introduction

The problem of low-rank matrix estimation has been receiving increasing attention in the natural and social sciences and other fields. In many applications, the data approximately lie in a low-dimensional linear subspace. Low-rank matrix factorization (decomposition) is often used to recover the low-rank structure, remove measurement noise, and complete missing entries (see, e.g., Candes and Plan, 2010; Shi et al., 2017 for surveys of the literature). Low-rank matrix factorization serves several purposes, such as dimension reduction, data imputation, and structural analysis. It also has close ties to more elaborate models for high-dimensional data, e.g., static factor models (Geweke and Zhou, 1996), reduced rank regression (Geweke, 1996), and factor regression (West, 2003). As discussed in Tanaka (2020), matrix completion can also be used for causal inference.

Assessment of the uncertainty in inference is an integral component of scientific research and many practical applications. Non-Bayesian approaches to matrix factorization struggle to quantify uncertainty (e.g., Chen et al., 2019). In contrast, fully Bayesian approaches can straightforwardly obtain credible sets as a byproduct of posterior simulation. Therefore, there is a good reason to employ a fully Bayesian approach in matrix factorization.

In this study, we consider fully Bayesian inference of three-component factorization of matrices, as in singular value decomposition. A noisy matrix $Y$ is modeled as
$$Y = U D V^{\top} + E,$$
where $E$ is a matrix of measurement errors. Moreover, some elements of $Y$ may be missing. Two types of constraints are needed for the exact identification of the matrix factorization. First, $U$ and $V$ are supposed to be unitary (or orthonormal) matrices, i.e., $U^{\top}U = V^{\top}V = I$. Second, $D$ is assumed to be diagonal, and its diagonal elements are nonnegative and arranged in order of size.

In the non-Bayesian literature, low-rank matrix estimation/completion using the nuclear norm penalty has been intensively studied (e.g., Keshavan et al., 2010; Rohde and Tsybakov, 2011; Negahban and Wainwright, 2011; Koltchinskii et al., 2011; Klopp, 2014; Gaïffas and Klopp, 2017). This strategy amounts to estimating a matrix while penalizing its rank, since the nuclear norm of a matrix is a convex relaxation of its rank (Fazel et al., 2001). An estimator using the nuclear norm penalty coincides with the posterior mode under two prior specifications. First, for three-component factorization, an exponential prior for the eigenvalues, i.e., the diagonal elements of $D$, corresponds to the nuclear norm penalty. Second, for two-component factorization, the use of independent normal priors is tantamount to the use of the nuclear norm penalty (e.g., Salakhutdinov and Mnih, 2008; Babacan et al., 2012; Fazayeli et al., 2014).
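Concretely, under an independent exponential prior with rate $\lambda$ on the eigenvalues and Gaussian noise with variance $\sigma^{2}$ (notation introduced here purely for illustration), the posterior mode solves a nuclear-norm-penalized least squares problem:
$$
\hat{X}_{\mathrm{MAP}}
= \arg\min_{X}\;\frac{1}{2\sigma^{2}}\lVert Y - X\rVert_{F}^{2} + \lambda \lVert X\rVert_{*},
\qquad
\lVert X\rVert_{*} = \sum_{k} d_{k},
$$
since the exponential prior contributes $-\lambda \sum_{k} d_{k}$ to the log posterior.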

The main contribution of this study is to propose a novel shrinkage prior for the diagonal elements of $D$, i.e., the eigenvalues. The proposed prior is specified by a scale mixture of exponential distributions with spike and slab components. The spike part is given by an exponential distribution with a large rate parameter; that is, both its mean and variance are close to zero. The slab part is specified as a scale mixture of exponential distributions whose mixing distribution is a gamma distribution; thus, the slab part is distributed according to a Lomax distribution. The distribution of the weights of the spike part is specified based on a cumulative shrinkage process (Legramanti et al., 2019). This prior specification of the weights is designed to increasingly shrink less important, or essentially redundant, eigenvalues toward zero.

By means of a set of Monte Carlo experiments, we demonstrate that, in terms of estimation accuracy, the proposed prior is competitive with alternative choices under various environments. Our proposal is especially beneficial in difficult situations, for example, when the true rank of a matrix is small relative to its size and many entries are missing. While shrinkage priors, including the proposed one, bias the major eigenvalues, this cost is outweighed by the gains from pushing minor eigenvalues toward zero. In this regard, the proposed cumulative shrinkage process exponential (CSPE) prior is the best among the alternatives we consider, as it aggressively shrinks insignificant eigenvalues.

To demonstrate the proposed prior, we apply it to sectoral industrial production in the United States. The purpose of this application is to analyze the Great Moderation period, which represented a decrease in the volatility of major macroeconomic indicators after the mid-1980s in the United States and other developed countries (e.g., McConnell and Perez-Quiros, 2000; Summers, 2005). Our application is intended to shed new light on this issue by inferring the eigenvalues of a panel of sectoral industrial production growth for different periods. We find that compared to that in the period before the Great Moderation, in the period during the Great Moderation, the relative contributions of the large eigenvalues to the observed variability are significantly smaller, while the relative contributions of the small eigenvalues and the measurement errors, or idiosyncratic shocks, are more pronounced.

The remainder of this paper is structured as follows. Section 2 introduces the proposed approach. Section 3 conducts a set of simulation studies to compare the proposed approach with other alternatives. Section 4 applies the proposed approach to real data. Finally, Section 5 concludes the paper.

2 Model and Inference

2.1 Model

We infer a $T$-by-$N$ possibly rank-deficient matrix $X$ observed with additive noise:
$$Y = X + E,$$
where $Y$ is a $T$-by-$N$ matrix of observations, $X$ is a $T$-by-$N$ matrix to be inferred, and $E = (e_{t,n})$ is a $T$-by-$N$ matrix of normally distributed measurement errors with precision $\phi$, $e_{t,n} \sim \mathcal{N}(0, \phi^{-1})$ (here, $(a_{t,n})$ denotes a matrix whose generic $(t,n)$-element is $a_{t,n}$). To induce $X$ to be (approximately) rank deficient, we decompose $X$ into three parts,
$$X = U D V^{\top},$$
where the sizes of $U$, $D$, and $V$ are $T$-by-$K$, $K$-by-$K$, and $N$-by-$K$, respectively. When $Y$ is incompletely observed, missing elements are completed via data augmentation (Tanner and Wong, 1987) or, more generally, Gibbs sampling: the unknown parameters and the missing entries in $Y$ are alternately simulated from their conditional posteriors.

Unlike previous studies, we ensure the identification of the unknown parameters $U$, $D$, and $V$. Indeed, exact identification is not necessary for posterior analysis of $X$, but in general, when unknown parameters are not identified, the posterior simulation is numerically inefficient or can even diverge in some situations (Tang et al. (2019) and Tanaka (2020) use posterior simulators that account for this problem, but they do not resolve it entirely).

For parameter identification, we impose three types of constraints. First, we choose $K$ such that the number of unknown parameters in the factorization does not exceed that in the original matrix $X$ (see Chan et al. (2018) for a related discussion):
$$TK + K + NK \le TN,$$
or equivalently,
$$K \le \frac{TN}{T + N + 1}.$$
Second, $U$ and $V$ are supposed to be unitary, $U^{\top}U = V^{\top}V = I_{K}$. While these constraints fix the magnitudes of the columns of $U$ and $V$, the column signs of $U$ and $V$ are not identified; we address this issue by modifying the posterior simulator, as discussed subsequently. Third, the diagonal elements of $D$ are assumed to be nonnegative and arranged in descending order, $d_{1} \ge d_{2} \ge \cdots \ge d_{K} \ge 0$. Let $d = (d_{1}, \dots, d_{K})^{\top}$. Then, we have the following matrix representation:
$$\begin{pmatrix} 1 & -1 & & \\ & \ddots & \ddots & \\ & & 1 & -1 \\ & & & 1 \end{pmatrix} d \ge \boldsymbol{0}_{K}. \qquad (1)$$
Let $\mathcal{D}$ denote the subset of $\mathbb{R}^{K}$ that satisfies the constraints in (1).
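As a concrete illustration of the model and the identification constraints, the following sketch (Python/NumPy; the dimensions and the way $U$, $V$, and $d$ are drawn are assumptions for illustration) generates a constrained triple and a noisy observation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K = 50, 30, 5           # matrix size and factorization dimension (illustrative)
phi = 4.0                      # error precision; noise variance = 1 / phi

# Draw U (T x K) and V (N x K) with orthonormal columns via QR decompositions.
U, _ = np.linalg.qr(rng.standard_normal((T, K)))
V, _ = np.linalg.qr(rng.standard_normal((N, K)))

# Eigenvalues: nonnegative and sorted in descending order (identification constraint).
d = np.sort(rng.gamma(shape=2.0, scale=5.0, size=K))[::-1]

X = U @ np.diag(d) @ V.T                               # low-rank signal
Y = X + rng.normal(scale=phi ** -0.5, size=(T, N))     # noisy observation

# Check the identification constraints numerically.
assert np.allclose(U.T @ U, np.eye(K), atol=1e-10)
assert np.allclose(V.T @ V, np.eye(K), atol=1e-10)
assert np.all(np.diff(d) <= 0) and np.all(d >= 0)
```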

2.2 Priors

The prior distribution of $d$ is specified by a scale mixture of exponential distributions that has spike and slab components. Following Legramanti et al. (2019), the weights for the spike part follow the stick-breaking construction of the Dirichlet process (e.g., Ishwaran and James, 2001). Ignoring the constraints (1), the prior of $d$ is given by the following hierarchy:
$$d_{k} \sim (1 - \pi_{k})\, \mathcal{E}(\lambda_{k}) + \pi_{k}\, \mathcal{E}(\lambda_{\mathrm{spike}}), \qquad \lambda_{k} \sim \mathcal{G}(\alpha_{\mathrm{slab}}, \beta_{\mathrm{slab}}),$$
$$\pi_{k} = \sum_{l=1}^{k} w_{l}, \qquad w_{l} = v_{l} \prod_{m=1}^{l-1} (1 - v_{m}), \qquad v_{l} \sim \mathcal{B}(1, \alpha), \qquad k, l = 1, \dots, K,$$
where $\lambda_{\mathrm{spike}}$, $\alpha_{\mathrm{slab}}$, $\beta_{\mathrm{slab}}$, and $\alpha$ are prefixed hyperparameters, $\mathcal{E}(\lambda)$ denotes an exponential distribution with rate parameter $\lambda$, $\mathcal{G}(a, b)$ denotes a gamma distribution with shape parameter $a$ and rate parameter $b$, and $\mathcal{B}(1, \alpha)$ denotes a beta distribution (of the first kind) with mean $1/(1 + \alpha)$. In the following, we call this hierarchical prior a cumulative shrinkage process exponential (CSPE) prior. The weights are arranged in ascending order, $\pi_{1} \le \pi_{2} \le \cdots \le \pi_{K}$, consistent with the ordering of $d$. Therefore, relatively less important eigenvalues are more likely to belong to the spike part.
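For concreteness, a minimal sketch of drawing eigenvalues from this prior (Python/NumPy; the function name, hyperparameter names, and default values are assumptions for illustration, and the descending order is imposed by sorting rather than by the constrained prior):

```python
import numpy as np

def draw_cspe_prior(K, lam_spike=10.0, a_slab=2.0, b_slab=20.0, alpha=1.0, rng=None):
    """Draw K eigenvalues from the CSPE prior as described in the text.

    Hyperparameter names and default values are illustrative assumptions."""
    rng = np.random.default_rng() if rng is None else rng

    # Stick-breaking construction: v_l ~ Beta(1, alpha),
    # w_l = v_l * prod_{m<l}(1 - v_m), spike weight pi_k = sum_{l<=k} w_l.
    v = rng.beta(1.0, alpha, size=K)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    pi = np.cumsum(w)                        # nondecreasing in k

    d = np.empty(K)
    for k in range(K):
        if rng.random() < pi[k]:
            # Spike: exponential with a large rate (mean 1 / lam_spike).
            d[k] = rng.exponential(scale=1.0 / lam_spike)
        else:
            # Slab: exponential with gamma-mixed rate, i.e., a Lomax(a_slab, b_slab) draw.
            lam_k = rng.gamma(shape=a_slab, scale=1.0 / b_slab)
            d[k] = rng.exponential(scale=1.0 / lam_k)

    return np.sort(d)[::-1], pi              # impose the descending order used in the paper
```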

The marginal prior of $d_{k}$ for the slab part is represented by a Lomax distribution; integrating out $\lambda_{k}$ yields
$$d_{k} \sim \mathcal{L}(\alpha_{\mathrm{slab}}, \beta_{\mathrm{slab}}) \quad \text{(slab part)},$$
where $\mathcal{L}(\alpha, \beta)$ is a Lomax distribution with shape $\alpha$, scale $\beta$, and probability density function (PDF)
$$p(x \mid \alpha, \beta) = \frac{\alpha}{\beta}\left(1 + \frac{x}{\beta}\right)^{-(\alpha + 1)}, \qquad x \ge 0.$$
The Lomax distribution is a Pareto distribution shifted to have nonnegative support (also known as the Pareto Type II distribution); thus, it has a heavier tail than the exponential distribution. The mean of the slab part is $\beta/(\alpha - 1)$ for $\alpha > 1$ and is undefined otherwise, and the variance is $\beta^{2}\alpha / \{(\alpha - 1)^{2}(\alpha - 2)\}$ for $\alpha > 2$, infinite for $1 < \alpha \le 2$, and undefined otherwise.

The rate parameter $\lambda_{\mathrm{spike}}$ for the spike part is set to a sufficiently large value such that both the mean and variance of the spike part are small, thereby pushing nonsignificant eigenvalues toward zero. Although $\lambda_{\mathrm{spike}}$ can be infinite, we can improve the computational efficiency of the posterior simulation by setting it to a finite value (Ishwaran and Rao, 2005). Our default choice is $\lambda_{\mathrm{spike}} = 10$, which implies that the mean and variance of the spike part are 0.1 and 0.01, respectively.

The hyperparameters for the slab part, $\alpha_{\mathrm{slab}}$ and $\beta_{\mathrm{slab}}$, are chosen to make this component sufficiently noninformative. Elicitation of $\alpha_{\mathrm{slab}}$ and $\beta_{\mathrm{slab}}$ is not trivial because, for most applications, we have no prior knowledge about $d$. In this paper, our default choice implies that the mean of the slab part is 20 and its variance is large. The PDFs of the spike and slab parts under the default choices are shown in Figure 1. By setting the scale $\beta_{\mathrm{slab}}$ to a large value, we can make the slab part arbitrarily noninformative, but at least for the simulation environments in the subsequent section, we find no significant gain from doing so. Thus, we recommend setting $\beta_{\mathrm{slab}}$ to a moderate value, thereby allowing the slab part to play a role in shrinkage.

Tuning $\alpha$ is also nontrivial. When some prior belief about the rank of $X$ is available, $\alpha$ can be chosen systematically. The expectation of $\pi_{h}$ is
$$\mathbb{E}[\pi_{h}] = 1 - \left(\frac{\alpha}{1 + \alpha}\right)^{h}.$$
We can tune $\alpha$ such that the prior probability of the $H$-th eigenvalue being insignificant (i.e., belonging to the spike part) equals a target value $p$,
$$1 - \left(\frac{\alpha}{1 + \alpha}\right)^{H} = p.$$
By solving this equation, we obtain
$$\alpha = \frac{(1 - p)^{1/H}}{1 - (1 - p)^{1/H}}.$$
Figure 2 depicts examples of the implied prior means of $\pi_{h}$. Panel (a) is an example of a conservative choice, where the prior probability of the $K$-th eigenvalue being insignificant is 0.5. In turn, Panel (b) is an example of an aggressive choice, where this probability is 0.9.
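Based on the reconstruction above, a small helper can back out $\alpha$ from a target spike probability $p$ at a given index $H$ (a sketch, assuming the stick-breaking construction with Beta(1, $\alpha$) sticks; the example values are illustrative):

```python
def elicit_alpha(H, p):
    """Return alpha such that the prior probability that the H-th eigenvalue
    is insignificant (i.e., E[pi_H]) equals the target p, assuming
    E[pi_h] = 1 - (alpha / (1 + alpha)) ** h."""
    c = (1.0 - p) ** (1.0 / H)       # required value of alpha / (1 + alpha)
    return c / (1.0 - c)

# Example with H = 10: conservative (p = 0.5) and aggressive (p = 0.9) choices.
alpha_cons = elicit_alpha(10, 0.5)   # ~ 13.9
alpha_aggr = elicit_alpha(10, 0.9)   # ~  3.9
```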

For $U$ and $V$, we use uniform Haar priors: the prior densities of $U$ and $V$ are constant on the Stiefel manifolds of $T$-by-$K$ and $N$-by-$K$ matrices with orthonormal columns, respectively. The error precision parameter $\phi$ is inferred using a gamma prior, $\phi \sim \mathcal{G}(a_{\phi}, b_{\phi})$, where $a_{\phi}$ and $b_{\phi}$ are shape and rate parameters, respectively.

2.3 Posterior simulation

We conduct a posterior simulation using a hybrid of two algorithms. We sample $U$, $V$, and $D$ using a No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014), which is an adaptive version of the Hamiltonian Monte Carlo method (Duane et al., 1987; Neal, 2011). The remaining parameters are simulated using a Gibbs sampler.

To address the inequality constraints (1), we reparameterize $d$ in terms of nonnegative increments, $\delta_{K} = d_{K}$ and $\delta_{k} = d_{k} - d_{k+1}$ for $k < K$, so that $d_{k} = \sum_{j=k}^{K} \delta_{j}$, and sample $\delta = (\delta_{1}, \dots, \delta_{K})^{\top}$ from the $K$-dimensional nonnegative real space $\mathbb{R}_{+}^{K}$. This transformation does not require any adjustment of the posterior estimates since the Jacobian determinant of the mapping is always equal to one. The prior density of $d$ is evaluated as a function of $\delta$ by substituting $d_{k} = \sum_{j=k}^{K} \delta_{j}$.

To simulate efficiently, the constraints are relaxed slightly (Duan et al., 2020). For $U$ and $V$, we define relaxation functions that penalize the Frobenius-norm distances of $U^{\top}U$ and $V^{\top}V$ from the identity matrix, where a hyperparameter $\nu$ controls how tightly the orthonormality constraints are enforced. The inequality (nonnegativity) constraints on $\delta$ are approximated using a sigmoid function with a steepness hyperparameter. Thus, the conditional posterior of $(U, V, \delta)$ is approximated by the exact conditional posterior multiplied by these relaxation terms.
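Since the exact relaxation functions could not be recovered here, the following sketch only illustrates one plausible implementation consistent with the verbal description (Frobenius-norm penalties with tightness $\nu$ and log-sigmoid terms with steepness $\kappa$; all names and functional forms are assumptions):

```python
import numpy as np

def log_relaxation_terms(U, V, delta, nu=1e4, kappa=1e3):
    """Log of the relaxed constraint terms added to the conditional posterior.

    A sketch following the verbal description in the text: the orthonormality
    constraints are replaced by a Frobenius-norm penalty with tightness nu,
    and the nonnegativity of the increments delta by log-sigmoid terms with
    steepness kappa. The exact functional forms used in the paper may differ."""
    K = U.shape[1]
    ortho_pen = -nu * (np.linalg.norm(U.T @ U - np.eye(K), "fro") ** 2
                       + np.linalg.norm(V.T @ V - np.eye(K), "fro") ** 2)
    # log sigmoid(kappa * delta): ~0 when delta >> 0, strongly negative when delta < 0.
    nonneg_pen = -np.sum(np.logaddexp(0.0, -kappa * delta))
    return ortho_pen + nonneg_pen

def d_from_delta(delta):
    """Map nonnegative increments to descending eigenvalues, d_k = sum_{j>=k} delta_j.

    The Jacobian of this linear map is unit triangular, so its determinant is one."""
    return np.cumsum(delta[::-1])[::-1]
```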

As the column signs of $U$ and $V$ are not identified, in each iteration, we flip the signs of the columns of $U$ and $V$ if the signs of the first elements of the columns do not coincide with those of the initial values. The gradient of the log approximate conditional posterior of $(U, V, \delta)$ can then be analytically evaluated (see the Appendix).

While $\nu$ and the sigmoid steepness are supposed to be large, we can improve the computational efficiency by relaxing the constraints at the beginning of the posterior simulation and then tightening them gradually to a prespecified level. For the $i$-th MCMC iteration, $\nu$ is set as
$$\nu_{i} = \bar{\nu}\left\{1 - \exp(-r i)\right\},$$
where $\bar{\nu}$ denotes a prespecified upper bound and the rate $r$ is a tuning parameter. This function amounts to the product of $\bar{\nu}$ and the cumulative distribution function of an exponential distribution, where the rate parameter is chosen such that $\nu_{i}$ (approximately) reaches $\bar{\nu}$ by a prespecified iteration. In this paper, $\bar{\nu}$ and the tuning parameters are fixed at moderate prespecified values.
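A minimal sketch of such a tightening schedule (the upper bound, target iteration, and target fraction below are illustrative assumptions):

```python
import numpy as np

def nu_schedule(i, nu_bar=1e4, s=2000, frac=0.99):
    """Tightness at MCMC iteration i: nu_bar times an exponential CDF, with the
    rate chosen so that nu reaches frac * nu_bar at iteration s.
    The values of nu_bar, s, and frac are illustrative assumptions."""
    rate = -np.log(1.0 - frac) / s
    return nu_bar * (1.0 - np.exp(-rate * i))
```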

The remaining parameters are updated via Gibbs steps. The conditional probability that $d_{k}$ belongs to the spike part is proportional to $\pi_{k}$ times the exponential (spike) PDF with rate $\lambda_{\mathrm{spike}}$ evaluated at $d_{k}$, and the probability that it belongs to the slab part is proportional to $(1 - \pi_{k})$ times the Lomax (slab) PDF evaluated at $d_{k}$. Given these component indicators, the stick-breaking variables $v_{l}$ and the slab rates $\lambda_{k}$ are simulated from their conditional posteriors. The conditional posteriors of $\phi$ and of the missing elements of $Y$ (if any) are a gamma distribution and independent normal distributions, respectively, where the set of indices corresponding to the missing elements determines which entries of $Y$ are imputed.
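A minimal sketch of these Gibbs steps (Python/NumPy; the function and hyperparameter names, default values, and the handling of missing entries via a boolean mask are assumptions for illustration):

```python
import numpy as np

def gibbs_indicator_and_impute(d, pi, Y, X, mask_missing, phi,
                               lam_spike=10.0, a_slab=2.0, b_slab=20.0,
                               a_phi=1e-2, b_phi=1e-2, rng=None):
    """One pass over the Gibbs steps described in the text (a sketch;
    hyperparameter names and values are assumptions).

    Returns spike/slab indicators, an imputed copy of Y, and a new phi draw."""
    rng = np.random.default_rng() if rng is None else rng

    # Spike/slab indicators: compare pi_k * Exp(d_k | lam_spike)
    # with (1 - pi_k) * Lomax(d_k | a_slab, b_slab) (slab with lambda_k integrated out).
    spike_dens = lam_spike * np.exp(-lam_spike * d)
    slab_dens = (a_slab / b_slab) * (1.0 + d / b_slab) ** (-(a_slab + 1.0))
    p_spike = pi * spike_dens / (pi * spike_dens + (1.0 - pi) * slab_dens)
    z = rng.random(d.shape) < p_spike            # True -> spike component

    # Missing entries of Y: drawn from N(x_{tn}, 1/phi) (data augmentation).
    Y_imp = Y.copy()
    Y_imp[mask_missing] = X[mask_missing] + rng.normal(scale=phi ** -0.5,
                                                       size=mask_missing.sum())

    # Error precision: conjugate gamma update based on observed entries only.
    resid = (Y - X)[~mask_missing]
    phi_new = rng.gamma(shape=a_phi + resid.size / 2.0,
                        scale=1.0 / (b_phi + 0.5 * np.sum(resid ** 2)))
    return z, Y_imp, phi_new
```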

2.4 Relation to the existing literature

The exponential prior is a special case of the proposed prior, obtained when the spike component vanishes and the slab rate is fixed rather than gamma-distributed. The exponential prior has close ties to the nuclear norm penalty, which has been extensively studied in the non-Bayesian literature on matrix estimation/completion (e.g., Keshavan et al., 2010; Rohde and Tsybakov, 2011; Negahban and Wainwright, 2011; Koltchinskii et al., 2011; Klopp, 2014; Gaïffas and Klopp, 2017). As shown in Fazel et al. (2001), the nuclear norm is a convex relaxation of the rank of a matrix; thus, the nuclear norm penalty acts as a penalty on the rank of a matrix. As discussed in the abovementioned studies, the nuclear norm penalty has favorable theoretical properties, but the choice of the tuning parameter, which corresponds to the rate parameter of the exponential prior, is not trivial. In a non-Bayesian framework, the tuning parameter is usually chosen via cross-validation, but such a strategy ignores the uncertainty about the choice of the tuning parameter.

Some studies consider Bayesian inference of two-component factorization (e.g., Salakhutdinov and Mnih, 2008; Babacan et al., 2012; Fazayeli et al., 2014). These studies assign independent normal priors to the elements of the unknown matrices. This strategy corresponds to using the nuclear norm penalty, or the exponential prior for the eigenvalues, i.e., the diagonal elements of $D$, in three-component factorization. In a similar vein to this study, several studies propose Bayesian methods for inferring three-component factorization (e.g., Hoff, 2007; Zhou et al., 2010; Ding et al., 2011). While Hoff (2007) uses a normal prior for $d$ (in our notation), Zhou et al. (2010) and Ding et al. (2011) model each eigenvalue as the product of two variables and use normal and beta-Bernoulli priors for the two factors, respectively. Therefore, their prior can be regarded as a spike-and-slab version of the exponential prior.

When the spike component is absent, the proposed prior coincides with the Lomax prior (Tang et al. (2019) propose a positive generalized double Pareto prior, which is mathematically identical to a Lomax prior). The hierarchical adaptive spectral penalty (HASP) (Todeschini et al., 2013) can be seen as a non-Bayesian counterpart of the Lomax prior. The HASP (Lomax prior) is more robust than the nuclear norm penalty (exponential prior) in that the former treats the rate parameter as unknown, assigning it a gamma distribution. As suggested by the simulation studies in the next section, in terms of estimation accuracy, a Lomax prior is not always better than an exponential prior.

Low-rank matrix models are closely related to static factor models (e.g., Geweke and Zhou, 1996; Arminger and Muthén, 1998) because in a factor model, a matrix of panel data is modeled as the product of matrices of latent factors and factor loadings, that is, a two-component factorization. As in the estimation of low-rank matrices, parameter identification is an important consideration in the estimation of factor models. While some recent studies propose approaches to address parameter identification in factor models (e.g., Aßmann et al., 2016; Chan et al., 2018; Frühwirth-Schnatter and Lopes, 2018), these approaches are limited to specific priors, and their prior specifications are not specifically aimed at regularizing the complexity of the model.

3 Simulation Study

We compare the CSPE prior with alternative prior specifications by means of a simulation study. The true matrix $X$ to be estimated is square, $T = N$, and its rank is $R$. $X$ is generated from the following static factor model:
$$X = F \Lambda^{\top},$$
where $F$ is a $T$-by-$R$ matrix composed of latent factors and $\Lambda$ is an $N$-by-$R$ matrix of factor loadings. $X$ is normalized to zero mean and unit variance. $X$ is observed with additive normal noise; thus, the matrix of observations is given by
$$Y = X + E.$$
The error variance is chosen such that the signal-to-noise ratio is 10. The dimension of the model, $K$, is fixed at the largest value satisfying the identification constraint in Section 2.1 ($\lfloor \cdot \rfloor$ denotes the floor function). Regarding missing entries, we consider three cases: no missing, 10% missing, and 90% missing. These cases represent estimation of complete matrices, imputation of missing observations (or inference of treatment effects with panel data), and matrix completion, respectively. Missing entries, if any, are chosen randomly.
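A sketch of the data-generating process used in the experiments (Python/NumPy; the matrix size, rank, and missing fraction shown are illustrative, not the exact values used in the paper):

```python
import numpy as np

def simulate_dataset(N=50, R=3, missing_frac=0.1, snr=10.0, rng=None):
    """Generate one simulated dataset following the design in this section
    (a sketch; the specific sizes are illustrative assumptions).

    X = F @ L.T with N x R factors F and loadings L, standardized, then
    observed with Gaussian noise whose variance gives a signal-to-noise
    ratio of `snr`; a random fraction of entries is set to missing."""
    rng = np.random.default_rng() if rng is None else rng

    F = rng.standard_normal((N, R))          # latent factors
    L = rng.standard_normal((N, R))          # factor loadings
    X = F @ L.T
    X = (X - X.mean()) / X.std()             # normalize to zero mean, unit variance

    noise_var = X.var() / snr                # signal-to-noise ratio of 10 by default
    Y = X + rng.normal(scale=np.sqrt(noise_var), size=X.shape)

    mask_missing = rng.random(X.shape) < missing_frac
    Y_obs = np.where(mask_missing, np.nan, Y)
    return Y_obs, X, mask_missing
```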

We compare the proposed prior with four alternative prior specifications. The first is a noninformative (flat) prior on the eigenvalues. The second is an independent exponential prior with a prefixed rate parameter. The third is an independent Lomax prior with prefixed shape and scale parameters. The PDFs of the alternatives are shown in Figure 3.

The fourth is a spike-and-slab exponential (SSE) prior. It is specified by the same spike and slab components as the CSPE prior, but the spike weights are drawn independently from a uniform distribution over the unit interval. While a standard spike-and-slab prior is defined by a mixture of two normal distributions with different variances (e.g., Ishwaran et al., 2005), this prior is structured as a mixture of two exponential distributions with different rates. For the SSE prior, we use the same hyperparameters, $\lambda_{\mathrm{spike}}$, $\alpha_{\mathrm{slab}}$, and $\beta_{\mathrm{slab}}$, as for the CSPE prior. The SSE prior differs from the CSPE prior only in that the weights are independently and uniformly distributed over a unit interval. Comparison between the CSPE and SSE priors therefore shows how the gains from the cumulative shrinkage process construction are distinct from those from the spike-and-slab structure.

For all the alternatives, we fix the shape and rate hyperparameters of the gamma prior for $\phi$ at common values. For each experiment, a total of 12,000 draws are simulated, and the last 10,000 draws are used for estimation. The performance of the alternative priors is evaluated in terms of the accuracy of the posterior mean estimates of $X$ and $d$, based on the sum of absolute errors (AE) and the sum of squared errors (SE). Each measure is computed as the mean of 80 experiments. (We wrote all the programs in Matlab R2019b (64-bit) and executed them on Ubuntu Desktop 18.04 LTS (64-bit), running on an AMD Ryzen Threadripper 1950X (4.2 GHz).)

Table 1 summarizes the results for the case with no missing values. The CSPE prior is likely to outperform all the alternatives in terms of the estimation errors in $X$ and the eigenvalues, especially when the true rank is smaller. As the SSE prior falls between the exponential/Lomax priors and the CSPE prior, the CSPE prior works well not only because of its spike-and-slab structure but also because of the cumulative shrinkage applied to the weights. Although, when the true rank is smaller, the improvement in the estimation of the eigenvalues is likely to lead to more accurate estimates of $X$, the CSPE prior is not significantly inferior to the others in terms of AE/SE for the eigenvalues. The results for the 10% missing case are shown in Table 2, which yields almost the same conclusions as Table 1.

Table 3 reports the results for the 90% missing case. As in the other cases, the CSPE prior is competitive with the alternatives. Compared to the noninformative prior, informative priors tend to have larger estimation errors for the eigenvalues, but they are likely to estimate $X$ more precisely. The CSPE prior outperforms, or is at least comparable to, the other alternatives, especially when the inferential problem is difficult, that is, when the true rank of $X$ is small and/or many data values are missing.

As evident from Tables 1-3, the relationship between the relative gains in the estimation of the eigenvalues and of $X$ is not simple. A reduction in the estimation errors of the eigenvalues does not always imply that the estimation of $X$ is improved, and vice versa. Although the reason for such results is not obvious, there is a possible explanation. As $U$ and $V$ are supposed to be unitary, the precision in the estimation of $X$ largely depends on the relative sizes of the eigenvalues. Therefore, when insignificant eigenvalues are overestimated, $X$ is inaccurately estimated even if the large eigenvalues are estimated relatively precisely. In other words, for the sake of precise estimation of the whole matrix $X$, it is more important to find the true model, that is, the rank of $X$, than to estimate the large eigenvalues with high accuracy.

To investigate the difference between the alternatives in further detail, we plot the relative accuracy of the posterior mean estimates of the eigenvalues in terms of AE for a representative matrix size with 10% of entries missing (Figure 4). The reported values are normalized such that those for the noninformative prior are 100. Though not reported here, similar patterns are observed for the other settings, and the following discussion is not altered if the performance of the alternative priors is measured based on SE.

As shown in Panel (a), for the exponential prior, irrespective of the true rank, the estimation errors of the large, significant eigenvalues are large, whereas the estimation errors of the nonsignificant eigenvalues are slightly smaller than those for the noninformative prior. Thus, a trade-off exists between the relative accuracy in the major and minor eigenvalues. As the true rank increases, the estimation errors for the large eigenvalues decrease, while those for the nonsignificant eigenvalues increase.

Panel (b) of Figure 4 reports the results for the Lomax prior. Again, the same trade-off is observed; however, compared to the results for the exponential prior, the trade-off is mitigated to a certain degree, as the estimation errors of the major eigenvalues are comparatively small.

The results for the SSE and CSPE priors are depicted in Panel (c). While the estimation errors of the large eigenvalues are comparable to those for the Lomax prior, the estimation errors of the insignificant eigenvalues are smaller than those for the Lomax prior. Comparison of the SSE and CSPE priors indicates that the estimation errors of the significant eigenvalues are similar for these specifications, but when the CSPE prior is used, the nonsignificant eigenvalues are likely to be estimated with higher precision. Therefore, the CSPE prior effectively pushes nonsignificant eigenvalues toward approximately zero while limiting the errors in the estimation of large, significant eigenvalues.

Last, we compare the computational cost of the alternatives. Table 4 summarizes the computation time in seconds per 1,000 iterations, measured using the tic/toc functions of Matlab. The computation times differ in a complicated manner across simulation environments because our posterior simulator is based on NUTS, in which the number of leapfrog steps in each iteration depends on the target kernel. When a prior density substantially complicates the conditional posterior of $(U, V, \delta)$, the number of leapfrog steps tends to be large. While the SSE and CSPE priors have a disadvantage in that they involve additional Gibbs steps for updating the hyperparameters, this does not appear to entail a considerable computational burden.

4 Application

For illustration purposes, we apply the proposed approach to sectoral industrial production in the United States (US). From the mid-1980s to the mid-2000s, the volatility of major macroeconomic indicators, such as real gross domestic product growth and the price inflation rate, decreased significantly in the US and other developed countries (McConnell and Perez-Quiros, 2000). Industrial production is no exception; as shown in Figure 5, the standard deviation of aggregate industrial production growth decreased in the mid-1980s. In macroeconomics, this phenomenon is called the Great Moderation, a term coined by Stock and Watson (2003) as a play on the Great Depression of 1929 to the mid-1930s. Many hypotheses for the phenomenon, such as improved inventory management, good monetary policy, and mere good luck, have been proposed (see, e.g., Summers (2005) and Davis and Kahn (2008) for overviews of the literature).

Although these hypotheses have been examined intensively in empirical studies, no consensus exists on the overall picture of the Great Moderation. One unsettled issue is what happened before and after the Great Moderation at a disaggregated level. In what follows, we investigate the changes in the data-generating process (DGP) of US industrial production during the Great Moderation by means of Bayesian noisy matrix factorization.

The data used in this section are downloaded from the Federal Reserve System's website (https://www.federalreserve.gov/releases/g17/download.htm). Sectors are defined based on the International Standard Industrial Classification Revision 4 (ISIC Rev. 4). The data matrices are three panels of 25 sectors over 11 years (132 months) each. The time periods are 1973:1–1983:12, 1984:1–1994:12, and 1995:1–2005:12. In what follows (including figures), when there is no fear of misunderstanding, we omit the months and denote the periods simply as 1973–1983, 1984–1994, and 1995–2005, respectively. The start date is determined by the availability of the data. Two notable early contributions, McConnell and Perez-Quiros (2000) and Kim and Nelson (1999), although using different statistical methods, both argue that the break date in the volatility of real GDP growth is best estimated as 1984:Q1, which motivates the first splitting point. The lengths of the second and third samples are chosen such that all three panels span the same length of 11 years (132 months). The data are standardized by sector for each subsample.

Figure 6 includes scatter plots of the volatilities of sectoral industrial production growth for different time periods. Plot (a) of Figure 6 compares 1973:1-1983:12 and 1984:1-1994:12. Although, as seen in Figure 5, the aggregate volatility decreased during these periods, the volatilities in 20 (of 25) sectors increased. In turn, plot (b) of Figure 6 compares 1984:1-1994:12 and 1995:1-2005:12. The number of sectors that experienced an increase in volatility during this period is 12. Therefore, sectoral data appear to offer a different picture of the Great Moderation than aggregate data.

We infer a three-component matrix factorization, $Y = U D V^{\top} + E$, of a matrix containing the sectoral industrial production growth for each period. With $T = 132$ and $N = 25$, $K$ is chosen as described in Section 2.1. The hyperparameters are set to the same values as in Section 3. In what follows, we report only the results for the conservative choice of $\alpha$; when the aggressive choice is used, the conclusions are virtually the same. A total of 420,000 posterior draws are simulated, and the last 400,000 draws are used for posterior analysis.

An estimated singular value decomposition has no explicit structural interpretation, but it conveys information about the underlying DGP. It has close ties to a static factor model: the $k$-th columns of $U$ and $V$ play the roles of the $k$-th latent factor and its loadings, and $d_{k}$ measures the magnitude of that factor. As $U$ and $V$ are supposed to be unitary, the $k$-th diagonal element $d_{k}$ can be interpreted as the magnitude of the $k$-th "latent factor", and its contribution to the overall variance is treated as independent from the other latent factors and the errors. The variance of the observed data can thus be decomposed into the contributions of the eigenvalues and of the measurement errors. Note that the data are standardized; thus, the total variability is fixed, and the contributions of $d_{k}$ and $\phi^{-1}$ to the variance of $Y$ are interpreted in relative terms. Our primary interest is to investigate the difference in the relative contributions of the eigenvalues and the errors across time periods. $d_{k}$ for small $k$ can be interpreted as the magnitude of shocks that affect broad sectors, while $d_{k}$ for large $k$ can be seen as the magnitude of shocks that affect only a portion of sectors. $\phi^{-1}$ can be viewed as the magnitude of idiosyncratic (sector-specific) shocks that do not spill over to other sectors.
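A plausible form of this decomposition, assuming orthonormal $U$ and $V$ and data standardized so that the total sum of squares is approximately $TN$, is
$$
\sum_{t=1}^{T}\sum_{n=1}^{N} y_{t,n}^{2}
\;\approx\; \sum_{k=1}^{K} d_{k}^{2} \;+\; TN\,\phi^{-1},
$$
so the relative contribution of the $k$-th eigenvalue is roughly $d_{k}^{2}/(TN)$ and that of the idiosyncratic errors is roughly $\phi^{-1}$.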

Several studies report that the decrease in the volatility of macroeconomic indicators is partly caused by decreases in comovement between sectors or firms, e.g., Comin and Philippon (2006); Irvine and Schuh (2007); Stiroh (2009); Burren and Neusser (2013). While these studies differ in terms of data and method, they all adopt non-Bayesian approaches; thus, they do not provide any uncertainty quantification.

Among the literature, Foerster et al. (2011) is the most closely related to our study in that they estimate a static approximate factor model (Chamberlain and Rothschild, 1983) for data on US sectoral industrial production. They report that nearly all the variability is associated with common factors; thus, the decrease in the volatility of common shocks largely explains the Great Moderation. They also argue that the relative importance of sector-specific shocks increased during the Great Moderation period. Their analysis relies on a non-Bayesian method and does not evaluate the uncertainty about the analysis. As their approximate factor model has only two factors, it is in danger of distorting the overall picture. Our framework can accommodate a large number of factors (eigenvalues) and can serve as a robustness check of the results of Foerster et al. (2011).

Figure 7 plots the posterior estimates of $d$. All the panels include the same lines representing the posterior mean estimates of $d_{k}$ for the different periods, and the shaded areas are the corresponding 95% credible sets for each period. There are two results worth noting. First, the $d_{k}$ with small $k$ (the largest eigenvalues) for the periods after 1984 are estimated to be smaller than those for the period before 1984, which implies that the variability of shocks that impact many sectors decreased during the Great Moderation period. Second, the posterior mean estimates of the $d_{k}$ with larger $k$ (less important eigenvalues) for the Great Moderation periods are larger than those for the period before the Great Moderation. Thus, shocks that affect only a portion of sectors became more important during the Great Moderation. These results are consistent with the abovementioned studies that report a decrease in comovement between sectors.

Figure 8 plots the posterior density of the error variance $\phi^{-1}$ for each period. The posterior density of $\phi^{-1}$ for 1973–1983 is concentrated on significantly smaller values than that for 1984–1994. The posterior means of the signal-to-noise ratio for 1973–1983, 1984–1994, and 1995–2005 are 14.3, 5.2, and 6.8, respectively, which indicates that the relative contribution of idiosyncratic shocks to the variability of sectoral industrial production increased during the Great Moderation.

Figure 9 shows the differences in the posterior mean estimates of $d$ and the error variance for consecutive periods. Panel (a) plots the differences between 1973–1983 and 1984–1994. The largest difference is found in $d_{1}$ (the largest eigenvalue), followed by that in the error variance. Our results are consistent with those of Foerster et al. (2011). Their approximate factor model has only two latent factors, and the remaining, less important factors are treated as idiosyncratic shocks. Our results suggest that although Foerster et al. (2011) overestimated the increase in the relative importance of idiosyncratic shocks, their conclusion is largely robust. Panel (b) of Figure 9 plots the differences in the posterior estimates of $d$ and the error variance between 1984–1994 and 1995–2005. Modest reversions are observed, but there is no salient difference between these periods.

5 Concluding Remarks

This study proposes a novel prior specification for low-rank matrix factorization called the cumulative shrinkage process exponential (CSPE) prior. The proposed prior is specified by a scale mixture of exponential distributions with spike and slab components. The weights for the spike/slab parts are inferred via a tailored prior based on a cumulative shrinkage process (Legramanti et al., 2019). The proposed prior increasingly aggressively shrinks less important eigenvalues toward zero. By means of a simulation study, we show that our proposal is competitive with other alternative priors. For illustration, we apply the proposed approach to the US sectoral industrial production to analyze the Great Moderation.

We note two research topics to be studied in the future. First, it would be desirable to develop a more systematic procedure to choose the hyperparameters for the CSPE prior. Although we describe an elicitation strategy, there is room for further development. Second, though not specific to matrix factorization, more efficient methods to sample from constrained spaces are sorely needed. In particular, sampling from a Stiefel manifold (more generally, orthogonal spaces) is challenging, and there are a number of applications that involve constraints of this kind.

Appendix

The gradient of the log approximate conditional posterior with respect to $(U, V, \delta)$ can be evaluated analytically.

References

  • Arminger and Muthén (1998) Arminger, G. and B. O. Muthén (1998) "A Bayesian Approach to Nonlinear Latent Variable Models Using the Gibbs Sampler and the Metropolis-Hastings Algorithm," Psychometrika, Vol. 63, No. 3, 271–300.
  • Aßmann et al. (2016) Aßmann, C., J. Boysen-Hogrefe, and M. Pape (2016) “Bayesian Analysis of Static and Dynamic Factor Models: An Ex-post Approach Towards the Rotation Problem,” Journal of Econometrics, Vol. 192, No. 1, 190–206.
  • Babacan et al. (2012) Babacan, S. D., M. Luessi, R. Molina, and A. K. Katsaggelos (2012) “Sparse Bayesian Methods for Low-rank Matrix Estimation,” IEEE Transactions on Signal Processing, Vol. 60, No. 8, 3964–3977.
  • Burren and Neusser (2013) Burren, D. and K. Neusser (2013) “The Role of Sectoral Shifts in the Decline of Real GDP Volatility,” Macroeconomic Dynamics, Vol. 17, No. 3, 477–500.
  • Candes and Plan (2010) Candes, E. J. and Y. Plan (2010) “Matrix Completion with Noise,” Proceedings of the IEEE, Vol. 98, No. 6, 925–936.
  • Chamberlain and Rothschild (1983) Chamberlain, G. and M. Rothschild (1983) “Arbitrage, Factor Structure, and Mean-variance Analysis on Large Asset Markets,” Econometrica, Vol. 51, No. 5, 1281–1304.
  • Chan et al. (2018) Chan, J., R. Leon-Gonzalez, and R. W. Strachan (2018) “Invariant Inference and Efficient Computation in the Static Factor Model,” Journal of the American Statistical Association, Vol. 113, No. 522, 819–828.
  • Chen et al. (2019) Chen, Y., J. Fan, C. Ma, and Y. Yan (2019) “Inference and Uncertainty Quantification for Noisy Matrix Completion,” Proceedings of the National Academy of Sciences, Vol. 116, No. 46, 22931–22937.
  • Comin and Philippon (2006) Comin, D. and T. Philippon (2006) “The Rise in Firm-level Volatility: Causes and Consequences,” NBER Macroeconomics Annual 2005, Vol. 20, 167–201.
  • Davis and Kahn (2008) Davis, S. J. and J. A. Kahn (2008) “Interpreting the Great Moderation: Changes in the Volatility of Economic Activity at the Macro and Micro Levels,” Journal of Economic Perspectives, Vol. 22, No. 4, 155–80.
  • Ding et al. (2011) Ding, X., L. He, and L. Carin (2011) "Bayesian Robust Principal Component Analysis," IEEE Transactions on Image Processing, Vol. 20, No. 12, 3419–3430.
  • Duan et al. (2020) Duan, L. L., A. L. Young, A. Nishimura, and D. B. Dunson (2020) “Bayesian Constraint Relaxation,” Biometrika, Vol. 107, No. 1, 191–204.
  • Duane et al. (1987) Duane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987) “Hybrid Monte Carlo,” Physics Letters B, Vol. 195, No. 2, 216–222.
  • Fazayeli et al. (2014) Fazayeli, F., A. Banerjee, J. Kattge, F. Schrodt, and P. B. Reich (2014) "Uncertainty Quantified Matrix Completion Using Bayesian Hierarchical Matrix Factorization," in 2014 13th International Conference on Machine Learning and Applications, 312–317, IEEE.
  • Fazel et al. (2001) Fazel, M., H. Hindi, and S. P. Boyd (2001) "A Rank Minimization Heuristic with Application to Minimum Order System Approximation," in Proceedings of the American Control Conference, Vol. 6, 4734–4739.
  • Foerster et al. (2011) Foerster, A. T., P.-D. G. Sarte, and M. W. Watson (2011) “Sectoral Versus Aggregate Shocks: A Structural Factor Analysis of Industrial Production,” Journal of Political Economy, Vol. 119, No. 1, 1–38.
  • Frühwirth-Schnatter and Lopes (2018) Frühwirth-Schnatter, S. and H. F. Lopes (2018) “Sparse Bayesian Factor Analysis When the Number of Factors Is Unknown,” arXiv preprint arXiv:1804.04231.
  • Gaïffas and Klopp (2017) Gaïffas, S. and O. Klopp (2017) “High Dimensional Matrix Estimation with Unknown Variance of the Noise,” Statistica Sinica, 115–145.
  • Geweke (1996) Geweke, J. (1996) “Bayesian Reduced Rank Regression in Econometrics,” Journal of Econometrics, Vol. 75, No. 1, 121–146.
  • Geweke and Zhou (1996) Geweke, J. and G. Zhou (1996) “Measuring the Pricing Error of the Arbitrage Pricing Theory,” Review of Financial Studies, Vol. 9, No. 2, 557–587.
  • Hoff (2007) Hoff, P. D. (2007) “Model Averaging and Dimension Selection for the Singular Value Decomposition,” Journal of the American Statistical Association, Vol. 102, No. 478, 674–685.
  • Hoffman and Gelman (2014) Hoffman, M. D. and A. Gelman (2014) "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo," Journal of Machine Learning Research, Vol. 15, No. 1, 1593–1623.
  • Irvine and Schuh (2007) Irvine, O. and S. Schuh (2007) "The Roles of Comovement and Inventory Investment in the Reduction of Output Volatility," Technical report, Federal Reserve Bank of San Francisco.
  • Ishwaran and James (2001) Ishwaran, H. and L. F. James (2001) “Gibbs Sampling Methods for Stick-breaking Priors,” Journal of the American Statistical Association, Vol. 96, No. 453, 161–173.
  • Ishwaran et al. (2005) Ishwaran, H., J. S. Rao et al. (2005) “Spike and Slab Variable Selection: Frequentist and Bayesian Strategies,” Annals of Statistics, Vol. 33, No. 2, 730–773.
  • Keshavan et al. (2010) Keshavan, R. H., A. Montanari, and S. Oh (2010) “Matrix Completion from Noisy Entries,” Journal of Machine Learning Research, Vol. 11, No. Jul, 2057–2078.
  • Kim and Nelson (1999) Kim, C.-J. and C. R. Nelson (1999) "Has the US Economy Become More Stable? A Bayesian Approach Based on a Markov-Switching Model of the Business Cycle," Review of Economics and Statistics, Vol. 81, No. 4, 608–616.
  • Klopp (2014) Klopp, O. (2014) “Noisy Low-rank Matrix Completion with General Sampling Distribution,” Bernoulli, Vol. 20, No. 1, 282–303.
  • Koltchinskii et al. (2011) Koltchinskii, V., K. Lounici, and A. B. Tsybakov (2011) “Nuclear-norm Penalization and Optimal Rates for Noisy Low-rank Matrix Completion,” Annals of Statistics, Vol. 39, No. 5, 2302–2329.
  • Legramanti et al. (2019) Legramanti, S., D. Durante, and D. B. Dunson (2019) “Bayesian Cumulative Shrinkage for Infinite Factorizations,”Technical report, arXiv:1902.04349.
  • McConnell and Perez-Quiros (2000) McConnell, M. M. and G. Perez-Quiros (2000) “Output Fluctuations in the United States: What Has Changed since the Early 1980’s?” American Economic Review, Vol. 90, No. 5, 1464–1476.
  • Neal (2011) Neal, R. M. (2011) "MCMC Using Hamiltonian Dynamics," in S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng eds., Handbook of Markov Chain Monte Carlo, Chapman & Hall/CRC, Chap. 5.
  • Negahban and Wainwright (2011) Negahban, S. and M. J. Wainwright (2011) “Estimation of (near) Low-rank Matrices with Noise and High-dimensional Scaling,” Annals of Statistics, 1069–1097.
  • Rohde and Tsybakov (2011) Rohde, A. and A. B. Tsybakov (2011) “Estimation of High-dimensional Low-rank Matrices,” Annals of Statistics, Vol. 39, No. 2, 887–930.
  • Salakhutdinov and Mnih (2008) Salakhutdinov, R. and A. Mnih (2008) “Bayesian Probabilistic Matrix Factorization Using Markov Chain Monte Carlo,” in Proceedings of the 25th international Conference on Machine learning, 880–887, ACM.
  • Shi et al. (2017) Shi, J., X. Zheng, and W. Yang (2017) “Survey on Probabilistic Models of Low-rank Matrix Factorizations,” Entropy, Vol. 19, No. 8, p. 424.
  • Stiroh (2009) Stiroh, K. J. (2009) “Volatility Accounting: A Production Perspective on Increased Economic Stability,” Journal of the European Economic Association, Vol. 7, No. 4, 671–696.
  • Stock and Watson (2003) Stock, J. H. and M. W. Watson (2003) “Has the Business Cycle Changed and Why?” NBER Macroeconomics Annual, Vol. 17, 159–218.
  • Summers (2005) Summers, P. M. (2005) “What Caused the Great Moderation? Some Cross-country Evidence,” Economic Review-Federal Reserve Bank of Kansas City, Vol. 90, No. 3, p. 5.
  • Tanaka (2020) Tanaka, M. (2020) “Bayesian Matrix Completion Approach to Causal Inference with Panel Data,” arXiv preprint arXiv:1911.01287.
  • Tang et al. (2019) Tang, K., Z. Su, J. Zhang, L. Cui, W. Jiang, X. Luo, and X. Sun (2019) “Bayesian Rank Penalization,” Neural Networks, Vol. 116, 246–256.
  • Tanner and Wong (1987) Tanner, M. A. and W. H. Wong (1987) “The Calculation of Posterior Distributions by Data Augmentation,” Journal of the American Statistical Association, Vol. 82, No. 398, 528–540.
  • Todeschini et al. (2013) Todeschini, A., F. Caron, and M. Chavent (2013) “Probabilistic Low-rank Matrix Completion with Adaptive Spectral Regularization Algorithms,” in Advances in Neural Information Processing Systems, 845–853.
  • West (2003) West, M. (2003) “Bayesian Factor Regression Models in the “Large P, Small N” Paradigm,” in J. Bernardo, M. Bayarri, J. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West eds. Bayesian Statistics 7: Oxford University Press, 733–742.
  • Zhou et al. (2010) Zhou, M., C. Wang, M. Chen, J. Paisley, D. Dunson, and L. Carin (2010) “Nonparametric Bayesian Matrix Completion,” in 2010 IEEE Sensor Array and Multichannel Signal Processing Workshop, 213–216, IEEE.