# Bayesian cumulative shrinkage for infinite factorizations

There are a variety of Bayesian models relying on representations in which the dimension of the parameter space is, itself, unknown. For example, in factor analysis the number of latent variables is, in general, not known and has to be inferred from the data. Although classical shrinkage priors are useful in these situations, incorporating cumulative shrinkage can provide a more effective option which progressively penalizes more complex expansions. A successful proposal within this setting is the multiplicative gamma process. However, such a process is limited in scope, and has some drawbacks in terms of shrinkage properties and interpretability. We overcome these issues via a novel class of convex mixtures of spike and slab distributions assigning increasing mass to the spike through an adaptive function which grows with model complexity. This prior has broader applicability, simple interpretation, parsimonious representation, and induces adaptive cumulative shrinkage of the terms associated with redundant, and potentially infinite, dimensions. Performance gains are illustrated in simulation studies.

• 8 publications
• 17 publications
• 85 publications
08/12/2020

### Variational Bayes for Gaussian Factor Models under the Cumulative Shrinkage Process

The cumulative shrinkage process is an increasing shrinkage prior that c...
06/11/2020

### Bayesian Eigenvalue Regularization via Cumulative Shrinkage Process

This study proposes a novel hierarchical prior for inferring possibly lo...
09/26/2017

### On the Model Shrinkage Effect of Gamma Process Edge Partition Models

The edge partition model (EPM) is a fundamental Bayesian nonparametric m...
12/06/2019

### Triple the gamma – A unifying shrinkage prior for variance and variable selection in sparse state space and TVP models

Time-varying parameter (TVP) models are very flexible in capturing gradu...
07/15/2019

### Bayesian Wavelet Shrinkage with Beta Priors

We present a Bayesian approach for wavelet shrinkage in the context of n...
03/03/2019

### Heavy Tailed Horseshoe Priors

Locally adaptive shrinkage in the Bayesian framework is achieved through...
05/18/2022

### Power Transformations of Relative Count Data as a Shrinkage Problem

Here we show an application of our recently proposed information-geometr...

## 1 Introduction

There has been a considerable interest in designing shrinkage priors for high dimensional parameters (Ishwaran and Rao, 2005; Carvalho et al., 2010; Polson and Scott, 2010)

but most of the focus has been on regression, where there is not a natural ordering in the coefficients. There are several settings, however, where an order is present or desirable. For example, in models relying on low-rank factorizations or basis expansions, such as factor models, probabilistic principal component analysis and tensor factorizations, it is natural to expect that additional terms play a progressively less important role in characterizing the data or model structure, and hence the associated parameters have a stochastically decreasing effect.

Motivated by the above discussion, Bhattacharya and Dunson (2011) proposed the multiplicative gamma process as a prior for the column-specific scale parameters of the loadings matrix in infinite factor models. In particular, they define zero-mean Gaussian priors for each factor loading

and incorporate a column-specific variance parameter

having prior

 θh=∏hl=1ϑ−1l(h=1,…,∞),ϑ1∼Ga(a1,1),ϑl∼Ga(a2,1)  (l=2,…,∞). (1)

As is clear from (1), for specific choices of the prior expectation of decreases exponentially with model dimension, thus favoring growing shrinkage of the factor loadings around zero, a priori. This choice also allows uncertainty in the number of latent factors since higher-order columns may have all their elements shrunk close to zero, effectively deleting the associated factors.

Although other strategies, such as point mass mixtures (Carvalho et al., 2008; Knowles and Ghahramani, 2011), continuous shrinkage priors (Zou et al., 2006; Bien and Tibshirani, 2011; Pati et al., 2014) and also more structured solutions (Zhao et al., 2016) have been proposed, the multiplicative gamma process has attracted a growing interest beyond classical factor models (Zhu et al., 2014; Durante et al., 2017; Sarkar et al., 2018) due to its ability of allowing the amount of shrinkage to increase with the model complexity. However, as discussed in Durante (2017), the increasing shrinkage induced by the multiplicative gamma process heavily depends on the hyper-parameters and there is a tendency to sometimes over-shrink, thereby affecting posterior inference in factorizations or expansions having a higher rank.

To overcome these issues, we propose a novel class of convex mixtures of spike and slab distributions assigning increasing mass to the spike through an adaptive function which grows with model dimension via the cumulative sum of stick-breaking probabilities

(Sethuraman, 1994; Ishwaran and James, 2001). This process shrinks increasingly with model complexity and adaptively learns both low and high rank factorizations. As discussed in §2, such a prior has improved interpretability and provides a general formulation which can also be considered in a variety of non-Gaussian expansions; see, for example, Dunson and Xing (2009), Rousseau and Mengersen (2011) and Gopalan et al. (2014).

## 2 Cumulative shrinkage process

We focus here on the general case in which the effect of the -th factor or basis function is controlled by a scalar parameter , so that the deletion of redundant terms can be obtained by progressively shrinking the countable sequence towards an appropriate value. Although this scenario can be easily relaxed, such a setting is commonly found in several factorizations. For example, in infinite factor models, may denote the variance of the loadings associated with the -th factor, for , and the goal is to define a prior on these parameters which stochastically shrinks redundant factors and flexibly adapts to the unknown rank. As discussed in Bhattacharya and Dunson (2011) and Rousseau and Mengersen (2011)

, combining infinite or over-complete representations with shrinkage priors is conceptually appealing in bypassing the estimation of the model dimension while avoiding over-fitting. Motivated by this reasoning, Definition

2.1 characterizes our proposed cumulative shrinkage process.

###### Definition 2.1

Let denote a countable sequence of parameters. We say that is distributed according to a cumulative shrinkage process with parameter , starting slab distribution and target spike if

 (θh∣πh)∼Ph=(1−πh)P0+πhδθ∞,πh=∑hl=1ωl,ωl=vl∏l−1m=1(1−vm), (2)

independently for every , where are independent variables and is a diffuse continuous distribution which can be specified depending on the type of factorization considered.

Equation (2) implies that the probability assigned to the spike increases with the model dimension , and that (Ishwaran and James, 2001). Hence, as complexity grows, increasingly concentrates around , which is specified to facilitate the deletion of redundant terms, while the slab allows flexible modeling of the active ones. Definition 2.1 can be naturally extended to sequences in and the delta mass can be replaced with a continuous distribution, without affecting the basic properties of the prior. Proposition 2.2 provides a formal interpretation for each . All the proofs can be found in the Appendix.

###### Proposition 2.2

Representation (2) defines a probability measure-valued stochastic process travelling between and . The probability can be interpreted as the proportion of the distance between and covered up to step , in the sense that , for , where denotes the total variation distance between two probability measures.

Hence, effectively controls the rate of increasing shrinkage. Using similar arguments, we can obtain analogous expressions for and , which represent the proportions of the total and the remaining path, respectively, travelled between steps and . Specifically, and for any . The expectations of these quantities are explicitly available as

 E(vh)=11+α,E(ωh)=αh−1(1+α)h,E(πh)=1−αh(1+α)h, (3)

for each . Combining (3) with Definition 2.1, the expectation of is, instead,

 E(θh)=E{E(θh∣πh)}={1−E(πh)}θ0+E(πh)θ∞=θ∞+{α(1+α)−1}h(θ0−θ∞), (4)

where denotes the expectation under the slab . Hence, as grows, the prior expectation of collapses exponentially towards the spike location . As stated in Lemma 2.3, this cumulative shrinkage property interestingly holds not only in expectation, but also in distribution, under our prior in (2).

###### Lemma 2.3

Let denote an -neighborhood around with radius , and define with the complement of . Then, for any and ,

 pr(|θh−θ∞|>ε)=P0{¯Bε(θ∞)}{α(1+α)−1}h, (5)

where is the probability assigned to by the slab . Hence, for any , and .

Equations (3)–(5) highlight how the rate of increasing shrinkage is controlled by . In particular, lower induces faster concentration around and, as a consequence, more rapid deletion of redundant terms. This control over the rate of increasing shrinkage via is separated from the specification of the slab , whose shape and location can be specified independently from , thereby allowing flexible modelling of active terms. As discussed in Durante (2017), this is not possible for the multiplicative gamma process in (1), whose hyper-parameters control both the rate of shrinkage and the prior for the active terms. This creates a trade-off between the need to maintain moderately diffuse priors for the active loadings and the attempt to adaptively shrink those corresponding to redundant factors. Such a trade-off is additionally complicated by the fact that the increasing shrinkage holds, in expectation, only for specific choices of , with these settings typically inducing over-shrinkage. As is clear from equations (3)–(5), although the cumulative shrinkage process in Definition 2.1 is similar to the multiplicative gamma process in inducing exponential shrinkage, such a prior achieves an improved flexibility by ensuring increasing shrinkage in distribution for any , without affecting the ability of the slab to model active terms.

Another advantage of our proposed process is that the shrinkage parameter coincides with the prior expectation of the total number of active terms in modelled via the diffuse slab . This can be shown after noticing that in (2) can be alternatively obtained by marginalizing out the augmented indicator in , for each . According to this representation, counts the number of active elements in , and its prior mean is

 E(H∗) =∑∞h=1E(ch) =∑∞h=1E{E(ch∣πh)} =∑∞h=1E(1−πh) =∑∞h=1{α(1+α)−1}h = α.

Hence, in prior specification, should coincide with the expected total number of active components modeled via the slab . This distribution should be, instead, sufficiently diffuse to capture active terms in the factorization considered, whereas has to be chosen to facilitate deletion of redundant terms.

Before concluding our analysis of the probabilistic properties of the prior in Definition 2.1, let us study how a finite version of (2) for a truncated sequence approximates the prior on the infinite sequence. Indeed, when performing posterior computation it is customary to rely on a finite-dimensional factorization which reasonably approximates the infinite representation through a conservative but finite number of components. Such a finite representation is typically implemented by substituting the complete sequence with the truncated sequence obtained by fixing to zero all the elements having index . For instance, in factor models, the deletion of factors beyond the -th one can be obtained by setting to zero the variances of the zero-mean Gaussian priors for the loadings for . Theorem 2.4

provides an upper bound for the prior probability of the truncated sequence

being arbitrarily far from the complete sequence , in sup-norm.

###### Theorem 2.4

If the infinite sequence has prior (2), then, for any truncation index and ,

 pr{d∞(θ,θ(H))>ε}=pr{sup(|θh|:h=H+1,…,∞)>ε}≤P0{¯Bε(0)}α{α(1+α)−1}H,

where is the sup-norm distance and is the complement of .

Hence, the prior probability of being close to converges to one at a rate which is exponential in , thus justifying posterior inference under finite sequences based on a conservative . Although the above bound holds for , in general is set close to zero and hence Theorem 2.4 is valid for small .

## 3 Application to Bayesian infinite factor models

### 3.1 Cumulative shrinkage process for infinite factor models

The prior in §2 can be considered in many factorizations under appropriate choices of and . To facilitate comparison with the multiplicative gamma process we first focus on infinite factor models. Indeed, Bhattacharya and Dunson (2011)

are motivated by Bayesian inference on the covariance matrix

of the data and rely on the expansion

which arises by marginalizing out the infinite-dimensional vector

of latent variables in the factor model , where and . The authors perform Bayesian inference by assuming priors for each , and consider independent priors for each factor loading . The local scales are assigned priors, whereas the global variances have the multiplicative gamma process prior in (1) to facilitate growing shrinkage towards low-rank factorizations.

Motivated by the issues with this approach discussed in §2, we instead let and, adapting (2) to this setting, we assume

 (θh∣πh)∼(1−πh)InvGa(aθ,bθ)+πhδθ∞,πh=∑hl=1ωl,ωl=vl∏l−1m=1(1−vm), (6)

independently for , where are independent variables. Although additional flexibility can be incorporated by assuming hyper-priors for , and , we leverage the results in §2 and consider fixed default values coherent with the interpretable role of these hyper-parameters in the cumulative shrinkage process. An additional result which can drive prior specification is that, integrating out from , the factor loadings have marginal priors for any and , where is a Student- distribution with degrees of freedom, location and scale . Hence, should be set close to zero to allow effective shrinkage of redundant factors, whereas should be specified to induce a moderately diffuse prior with scale for the active loadings. Exploiting these marginal priors, it also follows that , for each , , and , when . This guarantees cumulative shrinkage in distribution also for the factor loadings.

### 3.2 Posterior computation via Gibbs sampling

Posterior inference for the infinite factor model in §3.1, truncated at terms with and prior (6) for the factor loadings, proceeds via the Gibbs sampler in Algorithm 1. This routine incorporates a data augmentation step guaranteeing conjugate full-conditionals. In fact, a finite version of (6), with , can be also obtained by marginalizing out the independent indicators having probabilities , for each , in prior

 (θh∣zh)∼{1−1(zh≤h)}% InvGa(aθ,bθ)+1(zh≤h)δθ∞,

where if and otherwise. As is clear from Algorithm 1, conditioned on , it is possible to efficiently sample from the full-conditionals via classical conjugacy results, whereas the updating of the augmented data relies on the full-conditional distribution

 pr(zh=l∣−)∝{ωlNp(λh;0,θ∞Ip)if \ l≤hωlt2aθ{λh;0,(bθ/aθ)Ip}if \ l>h (7)

for , where and are densities of -variate Gaussian and Student- distributions, respectively, evaluated at .

### 3.3 Empirical studies

We consider illustrative simulations to understand whether our prior provides improved performance in recovering the true underlying covariance matrix , compared to the multiplicative gamma process. To address this goal, we focus on covariance matrices of dimension factorized as , with , where induces three different rank structures. The entries in are drawn independently from , whereas is set at . For each of the three values of , we generate 25 datasets made of observations from and, for each of the 25 replicates, we perform posterior inference on via the Bayesian factor model in §3.1, considering both prior (1) and (6). Motivated by the results in §2, we rely on a finite factor model with conservative truncation at , and facilitate collapsing of redundant dimensions along with flexible modeling of active loadings by setting , and . We considered these default choices in all simulations and found the results robust to several settings, although in practical applications additional care should be paid to the scale of the data, which can be normalized before inference, if necessary. This problem is, however, not specific to our cumulative shrinkage process, but common to any prior specification. For the multiplicative gamma process, we follow Durante (2017) by considering , and set as done by Bhattacharya and Dunson (2011) in their simulation studies. Finally, , which are common to both models, are fixed at as in Bhattacharya and Dunson (2011).

Bayesian inference for both models relies on samples from the posterior of produced by Algorithm 1 and by the Gibbs sampler in Bhattacharya and Dunson (2011), respectively, with a burn-in of and thinning every five samples. Based on the trace-plots of the chains for the entries in , these settings were sufficient for convergence and reasonable mixing. Leveraging the posterior samples of from both models, we compute for the 25 simulations within each scenario a Monte Carlo estimate of the averaged mean square error , calculated with respect to the posterior of . Table 1 displays the quartiles of these averaged mean square errors for each scenario under both models. As is clear from Table 1, our prior is comparable to the multiplicative gamma process in low-rank settings, and outperforms it in higher-rank factorizations. Such a result is consistent with the tendency of the multiplicative gamma process to induce over-shrinkage, thus failing to account for higher-rank factorizations. Exploiting the data augmentation described in §3.2, it is further possible to quantify uncertainty in the number of active factors via the posterior distribution of . Approximating this posterior based on the samples of the augmented data , we obtained concentration around the true value in each of the three scenarios, thus suggesting that may provide a useful proxy for learning the number of active factors. These empirical results motivate future theoretical studies on posterior consistency in recovering the true covariance matrix and its unknown rank , when grows with . A possibility to address this goal is to adapt recent results in Pati et al. (2014).

## 4 Application to Bayesian infinite Poisson factorization

### 4.1 Cumulative shrinkage process for infinite Poisson factorization

We now turn our attention to Bayesian infinite Poisson factorization (Gopalan et al., 2014). This model is particularly useful in recommendation systems based on count data measuring, for example, the total purchases of the item made by user . In this context, Poisson factorization assumes , independently for every pair , and expresses each rate as the quadratic combination between user-specific preferences and item-specific attributes . Setting to leads to an infinite factorization of the Poisson rates matrix via , where and are matrices having rows and , respectively. Although there are different alternatives for Bayesian learning of (Gopalan et al., 2014), a natural option, which exploits gamma-Poisson and gamma-gamma conjugacy, is to assume priors for each and and induce cumulative shrinkage over the columns of and using either the multiplicative gamma process (1) or the proposed cumulative shrinkage process (6) on the sequence of scales .

### 4.2 Posterior computation via Gibbs sampling

Posterior inference for the infinite Poisson factorization model in §4.1, truncated at terms with prior (6) on the column-specific scales , proceeds via the Gibbs sampler in Algorithm 2. This routine combines conjugate full-conditional results for Poisson factorization outlined in Gopalan et al. (2014) with the same data augmentation step employed in Algorithm 1 and described in §3.2. Here, the updating of the augmented data relies on the full-conditional distribution

 pr(zh=l∣−)∝⎧⎪⎨⎪⎩ωl∏Kk=1Ga(ξkh;a,θ−1∞)if \ l≤hωlΓ(aθ+Ka)Γ(aθ)Γ(a)Kbaθθ(∏Kk=1ξkh)a−1(bθ+∑Kk=1ξkh)−(aθ+Ka)if \ l>h (8)

for every , where and . The equation for the case in (8) can be easily obtained by marginalizing out

in the joint distribution

. Gibbs sampling under the multiplicative gamma process (1) requires a simple combination of steps 1-2 in Algorithm 2 and step 5 in Bhattacharya and Dunson (2011).

### 4.3 Empirical studies

As for the factor model, we consider illustrative simulation studies to understand whether our prior provides improved performance in recovering the true underlying Poisson rates matrix , compared to the multiplicative gamma process. Consistent with this goal, we focus on Poisson rate matrices of dimension factorized as , with and , where induces three different rank structures. The entries in and are drawn from . As done in §3.3, for each value of , we generate 25 data matrices with entries from a for every and . For each of the 25 replicates, we perform posterior inference on via the Bayesian Poisson factorization model described in §4.1, based on our cumulative shrinkage process (6) and on the multiplicative gamma process (1). Motivated by the results in §2, we consider a finite Poisson factorization based on a conservative truncation at . In specifying the hyper-parameters, we follow Gopalan et al. (2014) and define exponentially shaped Gamma priors for the elements in and by setting . The remaining hyper-parameters for the priors in (1) and (6) are specified as in §3.3, with the only exception of which is set at to obtain a marginal variance under the spike that is similar to the one for the factor loadings in §3.3. Moderate changes in did not substantially affect the final results, although excessively low values could lead to difficulties in attracting inactive components under the spike, thus making convergence of the algorithm slightly slower.

Bayesian inference for both priors relies on posterior samples of , with a burn-in of and thinning every five samples. Based on the trace-plots of the chains for the entries in , these settings were sufficient for convergence and mixing, although slightly unstable chains were found for the posterior under the multiplicative gamma process. Leveraging the posterior samples of from both models, we compute for each of the 25 simulations within every scenario a Monte Carlo estimate of the averaged mean square error , calculated with respect to the posterior of . The quartiles of these averaged mean square errors for each of the three scenarios under both models are displayed in Table 2. As for factor model, our prior is comparable to the multiplicative gamma process in low-rank settings, but achieves improved performance in higher-rank factorizations.

## Appendix

Proof of Proposition 2.2 The proof of Proposition 2.2 adapts the one of Theorem 1 in Canale et al. (2018). In fact, under the prior in Definition 2.1, the distance on the Borel -algebra in is equal to

 supA∈B(R)|P0(A)−Ph(A)|=supA∈B(R)|P0(A)−(1−πh)P0(A)−πhδθ∞(A)|= πhsupA∈B(R)|P0(A)−δθ∞(A)|.

Hence, , thus proving Proposition 2.2.

Proof of Lemma 2.3 To prove the Lemma, notice that . Hence, . Replacing with its expression in equation (3) leads to (5). To prove it suffices to note that .

Proof of Theorem 2.4 The proof follows after noting that and that for . Hence, adapting the proof of Lemma 2.3, we obtain

 pr{∪h>H(|θh|>ε)}≤∑∞h=H+1pr(|θh|>ε)=P0{¯Bε(0)}∑∞h=H+1{α(1+α)−1}h.

To conclude the proof, notice that .

## References

• Bhattacharya and Dunson (2011) Bhattacharya, A. and Dunson, D. B. (2011). Sparse Bayesian infinite factor models. Biometrika, 98:291–306.
• Bien and Tibshirani (2011) Bien, J. and Tibshirani, R. J. (2011). Sparse estimation of a covariance matrix. Biometrika, 98:807–820.
• Canale et al. (2018) Canale, A., Durante, D., and Dunson, D. B. (2018). Convex mixture regression for quantitative risk assessment. Biometrics, 74:1331–1340.
• Carvalho et al. (2008) Carvalho, C. M., Chang, J., Lucas, J. E., Nevins, J. R., Wang, Q., and West, M. (2008). High-dimensional sparse factor modeling: applications in gene expression genomics. J. Am. Statist. Assoc., 103:1438–1456.
• Carvalho et al. (2010) Carvalho, C. M., Polson, N. G., and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97:465–480.
• Dunson and Xing (2009) Dunson, D. B. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. J. Am. Statist. Assoc., 104:1042–1051.
• Durante (2017) Durante, D. (2017). A note on the multiplicative gamma process. Statist. Probabil. Lett., 122:198–204.
• Durante et al. (2017) Durante, D., Dunson, D. B., and Vogelstein, J. T. (2017). Nonparametric Bayes modeling of populations of networks. J. Am. Statist. Assoc., 112:1516–1530.
• Gopalan et al. (2014) Gopalan, P., Ruiz, F. J., Ranganath, R., and Blei, D. (2014). Bayesian nonparametric Poisson factorization for recommendation systems. J. Mach. Learn. Res. W&CP, 33:275–283.
• Ishwaran and James (2001) Ishwaran, H. and James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. J. Am. Statist. Assoc., 96:161–173.
• Ishwaran and Rao (2005) Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies. Ann. Statist., 33:730–773.
• Knowles and Ghahramani (2011) Knowles, D. and Ghahramani, Z. (2011). Nonparametric Bayesian sparse factor models with application to gene expression modeling. Ann. Appl. Statist., 5:1534–1552.
• Pati et al. (2014) Pati, D., Bhattacharya, A., Pillai, N. S., and Dunson, D. B. (2014). Posterior contraction in sparse Bayesian factor models for massive covariance matrices. Ann. Statist., 42:1102–1130.
• Polson and Scott (2010) Polson, N. G. and Scott, J. G. (2010). Shrink globally, act locally: Sparse Bayesian regularization and prediction. , 9:501–538.
• Rousseau and Mengersen (2011) Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distribution in overfitted mixture models. J. R. Statist. Soc. B, 73:689–710.
• Sarkar et al. (2018) Sarkar, A., Pati, D., Chakraborty, A., Mallick, B. K., and Carroll, R. J. (2018). Bayesian semiparametric multivariate density deconvolution. J. Am. Statist. Assoc., 113:401–416.
• Sethuraman (1994) Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica, 4:639–650.
• Zhao et al. (2016) Zhao, S., Gao, C., Mukherjee, S., and Engelhardt, B. E. (2016). Bayesian group factor analysis with structured sparsity. J. Mach. Learn. Res., 17:6868–6914.
• Zhu et al. (2014) Zhu, H., Khondker, Z., Lu, Z., and Ibrahim, J. G. (2014). Bayesian generalized low rank regression models for neuroimaging phenotypes and genetic markers. J. Am. Statist. Assoc., 109:977–990.
• Zou et al. (2006) Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis. J. Comp. Graph. Statist., 15:265–286.