 # Empirical Bayes Matrix Factorization

Matrix factorization methods - including Factor analysis (FA), and Principal Components Analysis (PCA) - are widely used for inferring and summarizing structure in multivariate data. Many matrix factorization methods exist, corresponding to different assumptions on the elements of the underlying matrix factors. For example, many recent methods use a penalty or prior distribution to achieve sparse representations ("Sparse FA/PCA"). Here we introduce a general Empirical Bayes approach to matrix factorization (EBMF), whose key feature is that it uses the observed data to estimate prior distributions on matrix elements. We derive a correspondingly-general variational fitting algorithm, which reduces fitting EBMF to solving a simpler problem - the so-called "normal means" problem. We implement this general algorithm, but focus particular attention on the use of sparsity-inducing priors that are uni-modal at 0. This yields a sparse EBMF approach - essentially a version of sparse FA/PCA - that automatically adapts the amount of sparsity to the data. We demonstrate the benefits of our approach through both numerical comparisons with competing methods and through analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons EBMF often provides more accurate inferences than other methods. In the GTEx data, EBMF identifies interpretable structure that concords with known relationships among human tissues. Software implementing our approach is available at https://github.com/stephenslab/flashr.

## Code Repositories

### flashr

R package for Empirical Bayes Factor Analysis.

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Matrix factorization methods are widely used for inferring and summarizing structure in multivariate data. In brief, these methods represent an observed data matrix as:

 Y=LTF+E (1.1)

where is an matrix, is a matrix, and is an matrix of residuals. Here we adopt the notation and terminology of factor analysis, and refer to as the “loadings” and as the “factors”.

The model (1.1) can be motivated in many ways, but one interpretation is that each row of can be approximated by a linear combination of underlying “factors” (rows of ), which – ideally – have some intuitive or scientific interpretation. For example, suppose represents the rating of a user for a movie . Each factor might represent a genre of movie (“comedy”, “drama”, “romance”, “horror” etc), and the ratings for a user could be written as a linear combination of these factors, with the weights (loadings) representing how much individual likes that genre. Or, suppose represents the expression of gene in sample . Each factor might represent a module of co-regulated genes, and the data for sample could be written as a linear combination of these factors, with the loadings representing how active each module is in each sample. Many other examples could be given across many fields, including psychometrics, economics, text modeling, and population genetics.

Many commonly-encountered statistical tools can be viewed as arising from (1.1) with different assumptions on and

. For example Principal Components Analysis (PCA) – or, more precisely, truncated Singular Value Decomposition (SVD) – can be interpreted as fitting (

1.1) by least squares, assuming that columns of are orthogonal and columns of are orthonormal (Eckart and Young, 1936). Non-negative matrix factorization (NMF) (Lee and Seung, 1999 Oct 21) assumes elements of and are non-negative. Grade of membership models (Erosheva, 2002), also known as admixture models (Pritchard et al., 2000), Latent Dirichlet Allocation (Blei et al., 2003), or topic models, also assume that columns of are non-negative, and additionally require that they sum to one. Simple cluster models can be interpreted as further requiring that exactly one element in each column of is 1. Classical factor analysis (FA) arises from assuming that the elements of

are independent standard normal and allowing different residual variances for each column of

(Rubin and Thayer, 1982). Bayesian variations on these ideas arise from placing prior distributions on and/or (Bishop, 1999; Attias, 1999; Salakhutdinov and Mnih, 2008). And many methods – both Bayesian and other – assume sparsity of and/or (West, 2003; Sabatti and James, 2005; Zou et al., 2006; Pournara and Wernisch, 2007; Carvalho et al., 2008; Witten et al., 2009; Engelhardt and Stephens, 2010; Knowles and Ghahramani, 2011; Mayrink et al., 2013; Yang et al., 2014; Gao et al., 2016; Hore et al., 2016).

It is hard to overstate the collective impact of these methods: many of the papers referenced above have hundreds or thousands of citations. Given the existence of many related methods that differ primarily in their assumptions on and , it seems natural to ask whether one can learn from the data which assumptions best fit a given dataset. Here we propose a general framework, Empirical Bayes Matrix Factorization (EBMF), to help address this problem. The key property of EBMF is that assumptions on both and are codified by prior distributions which are themselves estimated from the data. Different versions of EBMF arise from putting different restrictions on the prior distributions. Indeed, certain restrictions yield existing methods: for example, restricting priors to be normal with mean 0 yields the methods from Bishop (1999). However, our general formulation of EBMF allows much more flexible prior families, including sparse “spike-and-slab” distributions (Clyde and George, 2000; Johnstone et al., 2004), non-parametric unimodal (“adaptive shrinkage”) distributions (Stephens, 2017), or even entirely non-parametric distributions (Koenker and Mizera, 2014a). Furthermore, we provide a modular fitting procedure, based on a variational approximation (Bishop, 1999), that can handle all of these cases and more. This procedure fits EBMF by repeatedly solving a much simpler problem – the “normal means” problem. In essence our procedure turns any EB method for solving the normal means problem into a corresponding (iterative) method for fitting EBMF.

After outlining the general EBMF framework and fitting procedure, we narrow our focus to a case that we view as particularly useful in practice: the use of shrinkage-oriented priors. In this case EBMF yields flexible matrix factorization algorithms that allow for sparsity and/or shrinkage on both the factors and loadings, while not assuming it a priori. That is, the amount of sparsity and shrinkage for each factor and loading is learned from the data, and our methods can provide very sparse factors or loadings (like the sparse methods mentioned above), or denser factors and loadings (more like PCA or classical FA), depending on what the data support. Furthermore, unlike most existing approaches, our methods allow for any combination of dense and sparse loadings and/or factors: some may be dense whereas others may be sparse.

We have implemented these methods in software, flash (Factors and Loadings by Adaptive SHrinkage). We demonstrate the utility of these methods through both numerical comparisons with competing methods and through a scientific application: analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons flash often provides more accurate inferences than other methods, while remaining computationally tractable for moderate-sized matrices (millions of entries). In the GTEx data, flash highlight both effects that are shared across many tissues (“dense” factors) and effects that are specific to a small number of tissues (“sparse” factors). These sparse factors often highlight similarities between tissues that are known to be biologically related, providing external support for the reliability of the results.

## 2 The single-factor Empirical Bayes Matrix Factorization model

To simplify exposition we begin with a single factor (“rank 1”) EBMF model; see Section 3 for the extension to multiple factors.

We define the single factor EBMF model as:

 Y =lfT+E (2.1) l1,…,ln ∼\it iidgl,gl∈G (2.2) f1,…,fp ∼\it iidgf,gf∈G (2.3) Eij ∼N(0,1/τij) with τ:=(τij)∈T. (2.4)

Here is the data matrix, is an

is a -vector (the “factor”), is some specified (possibly non-parametric) family of distributions, and are unknown “prior” distributions that are to be estimated, is an matrix of independent error terms, and is an unknown matrix of precisions () which is assumed to lie in some space . (This allows structure to be imposed on , such as constant precision, , or column-specific precisions, , for example.) Our methods allow that some elements of may be “missing”, and can estimate the missing values (Section 4.1).

There are many possible choices of distributional family , and our formulation here is deliberately general. However, to give one concrete example, could be the family of non-parametric distributions that are unimodal at 0, as in Stephens (2017). See Section 4 for further examples.

The role of the prior distributions is to impose some kind of regularization – for example, shrinkage or sparsity – on the loadings and factors. A key feature of the EB approach is that these distributions are estimated from the data, and in this way the EB approach automatically adapts (“tunes”) itself to the data, learning an appropriate amount of sparsity for example. By allowing different distributions, for and for , we allow different amounts of regularization on compared with . This could be important if, for example, is sparse but is not. (It would be straightforward to allow and to belong to different families and , but we do not pursue this here.)

Here it seems helpful to compare and contrast EBMF with approaches based on penalized likelihood (e.g. Penalized Matrix Decomposition, Witten et al., 2009, fits (2.1) with an penalty on and/or ). The EBMF and penalized likelihood approaches share a common goal of regularizing and/or , but there are important differences. Most notably, in EBMF the appropriate amount of regularization is learned by solving an optimization problem (estimating ), and – as we shall see – this can be done for very flexible families resulting in correspondingly flexible regularization methods. In contrast, in penalization-based methods the appropriate amount of regularization must be tuned in other ways – usually cross validation, which tends to be computationally cumbersome, and practical for only a very small number of tuning parameters (typically 1-2).

### 2.1 Fitting the EBMF model

Fitting the EBMF model involves estimating all of . A standard EB approach would be to do this in two steps:

• Estimate and , by maximizing the likelihood:

 L(gl,gf,τ):=∫∫p(Y|l,f,τ)gl(dl1)…gl(dln)gf(df1)…gf(dfp) (2.5)

over and . (This optimum will typically not be unique because of identifiability issues; see Section 3.2.)

• Estimate and using their posterior distribution: .

However, both these two steps are difficult, even for very simple choices of , so we resort to variational approximations which can be thought of as approximating this approach. Variational approximations have often been used in the past for fitting related models (e.g. Bishop, 1999; Ghahramani and Beal, 2000; Stegle et al., 2012; Hore et al., 2016).

#### 2.1.1 A Variational Approximation

The variational approach (see Blei et al. (2016) for review) begins by writing the log of the likelihood (2.5) as:

 l(gl,gf,τ) :=logL(gl,gf,τ) (2.6) =F(q,gl,gf,τ)+DKL(q||p) (2.7)

where

 F(q,gl,gf,τ)=∫q(l,f)logp(Y,l,f|gl,gf,τ)q(l,f)dldf, (2.8)

and

 DKL(q||p)=−∫q(l,f)logp(l,f|Y,gl,gf,τ)q(l,f)dldf (2.9)

is the Kullback–Leibler divergence from

to . This identity holds for any distribution . Because is non-negative, it follows that is a lower bound for the log likelihood:

 l(gl,gf,τ)≥F(q,gl,gf,τ) (2.10)

with equality when .

In other words,

 l(gl,gf,τ)=maxqF(q,gl,gf,τ), (2.11)

where the maximization is over all possible distributions . Maximizing can thus be viewed as maximizing over . However, as noted above, this maximization is difficult. The variational approach simplifies the problem by maximizing but restricting the family of distributions for . Specifically, the most common variational approach – and the one we consider here – restricts to the family of distributions that “fully-factorize”:

 Q={q:q(l,f)=n∏i=1ql,i(li)p∏j=1qf,j(fj)}. (2.12)

The variational approach seeks to optimize over with the constraint . For we can write where and , and we can consider the problem as maximizing .

#### 2.1.2 Alternating optimization

We optimize by alternating between optimizing over variables related to [], over variables related to [], and over . Each of these steps is guaranteed to increase (or, more precisely, not decrease) , and convergence can be assessed by (for example) stopping when these optimization steps yield a very small increase in . Note that may be multi-modal, and there is no guarantee that the algorithm will converge to a global optimum. The approach is summarized in Algorithm 1.

The key steps in Algorithm 1 are the maximizations in Steps 4-6.

Step 4, the update of , involves computing the expected squared residuals:

 \widebarR2ij :=Eql,qf[(Yij−lifj)2] (2.13) =[Yij−Eql(li)Eqf(fj)]2−Eql(li)2Eqf(fj)2+E% ql(l2i)Eqf(f2j). (2.14)

This is straightforward provided the first and second moments of

and are available. (See Appendix A.1 for details of the update.)

Steps 5 and 6 are essentially identical except for switching the role of and . A key result is that each of these steps can be achieved by solving a simpler problem – the Empirical Bayes normal means (EBNM) problem. The next subsection (2.1.3) describes the EBNM problem, and the following subsection (2.1.4) details how this can be used to solve Steps 5 and 6.

#### 2.1.3 The EBNM problem

Suppose we have observations of underlying quantities

, with independent Gaussian errors with known standard deviations

. Suppose further that the elements of are assumed i.i.d. from some distribution, . That is,

 x|θ ∼Nn(θ,diag(s21,…,s2n)) (2.15) θ1,…,θn ∼\it iidg,g∈G, (2.16)

where denotes the

-dimensional normal distribution with mean

and covariance matrix .

By solving the EBNM problem we mean fitting the model (2.15)-(2.16) by the following two-step procedure:

1. Estimate by maximum (marginal) likelihood:

 ^g=argmaxg∈G∏j∫p(xj|θj,sj)g(dθj). (2.17)
2. Compute the posterior distribution for given ,

 p(θ|x,s,^g)∝∏j^g(θj)p(xj|θj,sj). (2.18)

Later in this paper we will have need for the posterior first and second moments, so we define them here for convenience:

 ¯θj :=E(θj|x,s,^g) (2.19) \widebarθ2j :=E(θ2j|x,s,^g). (2.20)

Formally, this procedure defines a mapping (which depends on the family ) from the known quantities , to , where are given in (2.17) and (2.18). We use EBNM to denote this mapping:

 \it EBNM(x,s)=(^g,p). (2.21)
###### Remark 1.

Solving the EBNM problem is central to all our algorithms, so it is worthwhile to spend some time to understand it. A key point is that the EBNM problem provides an attractive and flexible way to induce shrinkage and/or sparsity in estimates of . For example, if is truly sparse, with many elements at or near 0, then the estimate will typically have considerable mass near 0, and the posterior means (2.19) will be “shrunk” strongly toward 0 compared with the original observations. In this sense solving the EBNM problem can be thought of as a model-based analogue of thresholding-based methods, with the advantage that by estimating from the data the EBNM approach automatically adapts to provide an appropriate level of shrinkage. These ideas have been used in wavelet denoising (Clyde and George, 2000; Johnstone et al., 2004; Johnstone and Silverman, 2005a; Xing and Stephens, 2016), and false discovery rate estimation (Thomas et al., 1985; Stephens, 2017) for example. Here we apply them to matrix factorization problems.

#### 2.1.4 Connecting the EBMF and EBNM problems

The EBNM problem is well studied, and can be solved reasonably easily for many choices of (e.g. Johnstone and Silverman, 2005b; Koenker and Mizera, 2014a; Stephens, 2017). In Section 4 we give specific examples; for now our main point is that if one can solve the EBNM problem for a particular choice of then it can be used to implement Steps 5 and 6 in Algorithm 1 for the corresponding EBMF problem. The following Proposition formalizes this for Step 5 of Algorithm 1; a similar proposition holds for Step 6 (see also Appendix A).

###### Proposition 1.

Step 5 in Algorithm 1 is solved by solving an EBNM problem. Specifically

 argmaxql,glF(ql,qf,gl,gf,τ)=\it EBNM(^l(Y,\widebarf,\widebarf2,τ),sl(\widebarf2,τ)) (2.22)

where the functions and are given by

 ^l(Y,v,w,τ)i :=∑jτijYijvj∑jτijwj, (2.23) sl(w,τ)i :=(∑jτijwj)−0.5, (2.24)

and denote the vectors whose elements are the first and second moments of under :

 \widebarf :=(Eqf(fj)) (2.25) \widebarf2 :=(Eqf(f2j)). (2.26)
###### Proof.

See Appendix A. ∎

For intuition into where the EBNM in Proposition 1 comes from, consider estimating in (2.1) with and known. The model then becomes independent regressions of the rows of on , and the maximum likelihood estimate for has elements:

 ^li=∑jτijYijfj∑jτijf2j, (2.27)

with standard errors

 si=(∑jτijf2j)−0.5. (2.28)

Further, it is easy to show that

 ^li∼N(li,s2i). (2.29)

Combining (2.29) with the prior

 l1,…,ln∼\it iidgl,gl∈G (2.30)

yields an EBNM problem.

The EBNM in Proposition 1 is the same as the EBNM (2.29)-(2.30) , but with the terms and replaced with their expectations under . Thus, the update for in Algorithm 1, with fixed, is closely connected to solving the EBMF problem for “known ”.

### 2.2 Streamlined implementation using first and second moments

Although Algorithm 1, as written, optimizes over , in practice each step requires only the first and second moments of the distributions and . For example, the EBNM problem in Proposition 1 involves and and not . Consequently, we can simplify implementation by keeping track of only those moments. In particular, when solving the normal means problem, in (2.21), we need only return the posterior first and second moments (2.19) and (2.20). This results in a streamlined and intuitive implementation, summarized in Algorithm 2.

Algorithm 2 has a very intuitive form: it has the flavor of an alternating least squares algorithm, which alternates between estimating given (Step 6) and given (Step 8), but with the addition of the ebnm step (Steps 7 and 9), which can be thought of as regularizing or shrinking the estimates: see Remark 1. This viewpoint highlights connections with related algorithms. For example, the (rank 1 version of the) SSVD algorithm from Yang et al. (2014) has a similar form, but uses a thresholding function in place of the ebnm function to induce shrinkage and/or sparsity.

## 3 The K-factor EBMF model

We generalize the single factor EBMF model to allow for factors as follows:

 Y =K∑k=1lkfTk+E (3.1) lk1,…,lkn ∼\it iidglk,glk∈G (3.2) fk1,…,fkp ∼\it iidgfk,gfk∈G (3.3) Eij ∼N(0,1/τij) with τ:=(τij)∈T. (3.4)

A key feature of this model is that it has a separate “prior” distribution for each loading and factor. This makes the model very flexible, allowing it to adapt to any combination of sparse and dense loadings and factors. This flexibility is harder to achieve with penalization-based approaches that use CV to select tuning parameters: tuning (or even ) parameters by CV is difficult.

It is straightforward to extend the variational approach to fit this factor model. The details are in Appendix A. In brief, we introduce variational distributions for , and then optimize the objective function . Similar to the rank-1 model, this optimization can be done by iteratively updating parameters relating to a single loading or factor, keeping other parameters fixed. And again we simplify implementation by keeping track of only the first and second moments of the distributions and , which we denote . The updates to (and ) are essentially identical to those for fitting the rank 1 model above, but with replaced with the residuals obtained by removing the estimated effects of the other factors:

 Rkij:=Yij−∑k′≠k\widebarlk′i\widebarfk′j. (3.5)

Based on this approach we have implemented two algorithms for fitting the -factor model. First, a simple “greedy” algorithm, which starts by fitting the rank 1 model, and then adds factors , one at a time, optimizing over the new factor parameters before moving on to the next factor. Second, a “backfitting” algorithm (Breiman and Friedman, 1985), which iteratively refines the estimates for each factor given the estimates for the other factors. Both algorithms are detailed in Appendix A

### 3.1 Selecting K

An interesting feature of EBMF is that it can automatically select the number of factors . This is because the maximum likelihood solution to is sometimes a point mass on 0 (provided the family includes this distribution). Furthermore, the same is true of the solution to the variational approximation (see also Bishop, 1999; Stegle et al., 2012). This means that if is set sufficiently large then some loading/factor combinations will be optimized to be exactly 0. (Or, in the greedy approach, which adds one factor at a time, the algorithm will eventually add a factor that is exactly 0, at which point it terminates.)

Here we note that the variational approximation is expected to result in conservative estimation (i.e. underestimation) of compared with the (intractable) use of maximum likelihood to estimate . For simplicity we focus on the simplest case: comparing vs .

Let denote the degenerate distribution with all its mass at 0. Note that the rank-1 factor model (2.1), with (or ) is essentially a “rank-0” model. Now note that the variational lower bound, , is exactly equal to the log-likelihood when (or ). This is because if the prior is a point mass at 0 then the posterior is also a point mass, which trivially factorizes as a product of point masses, and so the variational family includes the true posterior in this case. Since is a lower bound to the log-likelihood we have the following simple lemma:

If then .

###### Proof.
 l(^gl,^gf,^τ)≥F(^q,^gl,^gf,^τ)>F(δ0,δ0,δ0,^τ0)=l(δ0,δ0,^τ0) (3.6)

Thus, if the variational approximation favors over the rank 0 model, then it is guaranteed that the likelihood would also favor over the rank 0 model. In other words, compared with the likelihood, the variational approximation is conservative in terms of preferring the rank 1 model to the rank 0 model. This conservatism is a double-edged sword. On the one hand it means that if the variational approximation finds structure it should be taken seriously. On the other hand it means that the variational approximation may miss subtle structure, and indeed we have sometimes seen this behavior in simulations (not shown).

In practice Algorithm 2 can converge to a local optimum of that is not as high as the trivial (rank 0) solution, . We can add a check for this at the end of Algorithm 2, and set and when this occurs.

### 3.2 Identifiability

In EBMF each loading and factor is identifiable, at best, only up to a multiplicative constant (provided is a scale family). Specifically, scaling the prior distributions and by and respectively results in the same marginal likelihood, and also results in a corresponding scaling of the posterior distribution on the factors and loadings (e.g. it scales the posterior first moments by and the second moments by ). However, this non-identifiability is not generally a problem, and if necessary it could be dealt with by re-scaling factor estimates to have norm 1.

## 4 Software implementation: flash

We have implemented Algorithms 2, 4 and 5 in an R package, flash (“factors and loadings via adaptive shrinkage”). These algorithms can fit the EBMF model for any choice of distributional family : the user must simply provide a function to solve the EBNM problem for .

One source of functions for solving the EBNM problem is the “adaptive shrinkage” (ashr) package, which implements methods from Stephens (2017). These methods solve the EBNM problem for several flexible choices of , including:

• , the set of all scale mixtures of zero-centered normals;

• , the set of all symmetric unimodal distributions, with mode at 0;

• , the set of all unimodal distributions, with mode at 0;

• , the set of all non-negative unimodal distributions, with mode at 0.

These methods are computationally stable and efficient, being based on convex optimization methods (Koenker and Mizera, 2014b) and analytic Bayesian posterior computations.

We have also implemented functions to solve the EBNM problem for additional choices of in the package ebnm (https://github.com/stephenslab/ebnm). These include being the “point-normal” family (i.e. distributions that are a mixture of a point mass at zero and a normal). This choice is less flexible than those in ashr, and involves non-convex optimizations, but can be faster.

### 4.1 Missing data

If some elements of are missing, then this is easily dealt with. For example, the sums over in (2.23) and (2.24) are simply computed using only the for which is not missing. This corresponds to an assumption that the missing elements of are “missing at random” (Rubin, 1976). In practice we implement this by setting whenever is missing (and filling in the missing entries of to an arbitrary number). This allows the implementation to exploit standard fast matrix multiplication routines, which cannot handle missing data. If many data points are missing then it may be helpful to exploit sparse matrix routines, but we have not yet implemented this.

### 4.2 Initialization

Both the rank 1 algorithm (Algorithm 2) and the greedy algorithm (Algorithm 4) require a rank 1 initialization procedure, init. Here, we use the softImpute function from the package softImpute Mazumder et al. (2010) (with penalty parameter ), which essentially performs SVD when is completely observed, but can also deal with missing values in .

The backfitting algorithm (5) also requires initialization. One option is to use the greedy algorithm to initialize, which we call “greedy+backfitting”.

## 5 Numerical Comparisons

We now compare our methods with several competing approaches. To keep these comparisons manageable in scope we focus attention on methods that aim to capture possible sparsity in and/or . For EBMF we present results for two different shrinkage-oriented prior families, : the scale mixture of normals (), and the point-normal family (see above). We denote these flash and flash_pn respectively when we need to distinguish. In addition we consider Sparse Factor Analysis (SFA) (Engelhardt and Stephens, 2010), SFAmix (Gao et al., 2013) , Nonparametric Bayesian Sparse Factor Analysis (NBSFA) (Knowles and Ghahramani, 2011), Penalized Matrix Decomposition (Witten et al., 2009) (PMD, implemented in the R package PMA), and Sparse SVD (Yang et al., 2014) (SSVD, implemented in R package ssvd). These represent a wide range of different approaches to inducing sparsity: SFA, SFAmix and NBSFA are three Bayesian approaches with quite different approaches to prior specification; PMD is based on a penalized likelihood with penalty on factors and/or loadings; SSVD is based on iterative thresholding of singular vectors. We also compare with softImpute (Mazumder et al., 2010), which does not explicitly model sparsity in and , but fits a regularized low-rank matrix using a nuclear-norm penalty. Finally, for comparison we use standard (truncated) SVD.

All of the Bayesian methods (flash, SFA, SFAmix and NBSFA) are “self-tuning”, at least to some extent, and we applied them here with default values. According to Yang et al. (2014) SSVD is robust to choice of tuning parameters, so we also ran SSVD with its default values, using the robust option (method="method"). The softImpute method has a single tuning parameter (, which controls the nuclear norm penalty), and we chose this penalty by orthogonal cross-validation (OCV; Appendix B). The PMD method can use two tuning parameters (one for and one for ) to allow different sparsity levels in vs . However, since tuning two parameters can be inconvenient it also has the option to use a single parameter for both and . We used OCV to tune parameters in both cases, referring to the methods as PMD.cv2 (2 tuning parameters) and PMD.cv1 (1 tuning parameter).

### 5.1 Simple Simulations

#### 5.1.1 A single factor example

We simulated data with under the single-factor model (2.1) with sparse loadings, and a non-sparse factor:

 li ∼π0δ0+(1−π0)5∑m=115N(0,σ2m) (5.1) fj ∼N(0,1) (5.2)

where denotes a point mass on 0, and . We simulated using three different levels of sparsity on the loadings, using . (We set the noise precision in these three cases to make each problem not too easy and not too hard.)

We applied all methods to this rank-1 problem, specifying the true value . (The NBSFA software does not provide the option to fix , so is omitted here.) We compare methods in their accuracy in estimating the true low-rank structure () using relative root mean squared error:

 RRMSE(^B,B):= ⎷∑i,j(^Bij−Bij)2∑i,jB2ij. (5.3)

Despite the simplicity of this simulation, the methods vary greatly in performance (Figure 1). flash (both versions) consistently outperforms all the other methods across all scenarios (although softImpute performs similarly in the non-sparse case). The next best performances come from softImpute (SI.cv), PMD.cv2 and SFA, whose relative performances depend on the scenario. All three consistently improve on, or do no worse than, SVD. PMD.cv1 performs similarly to SVD. The SFAmix method performs very variably, sometimes providing very poor estimates, possibly due to poor convergence of the MCMC algorithm (it is the only method here that uses MCMC). The SSVD method consistently performs worse than simple SVD, possibly because it is more adapted to both factors and loadings being sparse (and possibly because, following Yang et al. (2014), we did not use CV to tune its parameters). Inspection of individual results suggests that the poor performance of both SFAmix and SSVD is often due to over-shrinking of non-zero loadings to zero. Figure 1: Boxplots comparing accuracy of flash with several other methods in a simple rank-1 simulation. This simulation involves a single dense factor, and a loading that varies from strong sparsity (90% zeros, left) to no sparsity (right). Accuracy is measured by difference in each methods RRMSE from the flash RRMSE, with smaller values indicating highest accuracy. The y axis is plotted on a non-linear (square-root) scale to avoid the plots being dominated by poorer-performing methods.

#### 5.1.2 A sparse bi-cluster example (rank 3)

An important feature of our EBMF methods is that they estimate separate distributions for each factor and each loading, allowing them to adapt to any combination of sparsity in the factors and loadings. This flexibility is not easy to achieve in other ways. For example, methods that use CV are generally limited to one or two tuning parameters because of the computational difficulties of searching over a larger space.

To illustrate this flexibility we simulated data under the factor model (2.1) with , , and:

 l1,i ∼N(0,22)i=1,…,10 (5.4) l2,i ∼N(0,1)i=11,…,60 (5.5) l3,i ∼N(0,1/22)i=61,…,150 (5.6) f1,j ∼N(0,1/22)j=1,…,80 (5.7) f2,j ∼N(0,1)j=81,…,160 (5.8) f3,j ∼N(0,22)j=161,…,240, (5.9)

with all other elements of and set to zero for . This example has a sparse bi-cluster structure where distinct groups of samples are each loaded on only one factor (Figure 2a), and both the size of the groups and number of variables in each factor vary.

We applied flash, softImpute, SSVD and PMD to this example. (We excluded SFA and SFAmix since these methods do not model sparsity in both factors and loadings.) The results (Figure 2) show that again flash consistently outperforms the other methods, and again the next best is softImpute. On this example both SSVD and PMD outperform SVD. Although SSVD and PMD perform similarly on average, their qualitative behavior is different: PMD insufficiently shrink the 0 values, whereas SSVD shrinks the 0 values well but overshrinks some of the signal, essentially removing the smallest of the three loading/factor combinations (Figure 2b). (a) Left: Illustration of the true latent rank-3 block structure used in these simulations. Right boxplots comparing accuracy of flash with several other methods across 100 replicates. Accuracy is measured by the difference of each methods RRMSE from the flash RRMSE, so smaller is better.

### 5.2 Missing data imputation for real datasets

Here we compare methods in their ability to impute missing data using five real data sets. In each case we “hold out” (mask) some of the data points, and then apply the methods to obtain estimates of the missing values. The data sets are as follows:

MovieLens 100K data, an (incomplete) matrix of user-movie ratings (integers from 1 to 5) (Harper and Konstan, 2016). Most users do not rate most movies, so the matrix is sparsely observed (94% missing), and contains about 100K observed ratings. We hold out a fraction of the observed entries and assess accuracy of methods in estimating these. We centered and scaled the ratings for each user before analysis.

GTEx eQTL summary data, a matrix of scores computed testing association of genetic variants (rows) with gene expression in different human tissues (columns). These data come from the Genotype Tissue Expression (GTEx) project (Consortium et al., 2015), which assessed the effects of thousands of “eQTLs” across 44 human tissues. (An eQTL is a genetic variant that is associated with expression of a gene.) To identify eQTLs, the project tested for association between expression and every near-by genetic variant, each test yielding a score. The data used here are the scores for the most significant genetic variant for each gene (the “top” eQTL). See Section 5.3 for more detailed analyses of these data.

Brain Tumor data, a matrix of gene expression measurements on 4 different types of brain tumor (included in the denoiseR package, Josse et al., 2016). We centered each column before analysis.

Presidential address data, a matrix of word counts from the inaugural addresses of 13 US presidents (1940–2009) (also included in the denoiseR package, Josse et al., 2016). Since both row and column means vary greatly we centered and scaled both rows and columns before analysis, using the biScale function from softImpute.

Breast cancer data, a matrix of gene expression measurements from Carvalho et al. (2008), which were used as an example in the paper introducing NBSFA (Knowles and Ghahramani, 2011). Following Knowles and Ghahramani (2011) we centered each column (gene) before analysis. Figure 3: Comparison of the accuracy of different methods in imputing missing data. Each panel shows a boxplot of error rates (RMSE) for 20 simulations based on masking observed entries in a real data set.

Among the methods considered above, only flash, PMD and softImpute can handle missing data. We add NBSFA (Knowles and Ghahramani, 2011) to these comparisons. To emphasize the importance of parameter tuning we include results for PMD and softImpute with default settings (denoted PMD, SI) as well as using cross-validation (PMD.cv1, SI.cv).

For these real data the appropriate value of is, of course, unknown. Both flash and NBSFA automatically estimate . For PMD and softImpute we specified based on the values inferred by flash and NBSFA. (Specifically, we used respectively for the five datasets.)

We applied each method to all 5 data sets, using 10-fold OCV (Appendix B) to mask data points for imputation, repeated 20 times (with different random number seeds) for each dataset. We measure imputation accuracy using root mean squared error (RMSE):

 RMSE(^Y,Y;Ω)=√1|Ω|∑ij∈Ω(Yij−^Yij)2. (5.10)

where is the set of indices of the held-out data points.

The results are shown in Figure 3. Although the ranking of methods varies among datasets, flash, PMD.cv1 and SI.cv perform similarly on average, and consistently outperform NBSFA, which in turn typically outperforms (untuned) PMD and unpenalized softImpute. These results highlight the importance of appropriate tuning for the penalized methods, and also the effectiveness of the EB method in flash to provide automatic tuning.

In these comparisons, as in the simulations, the two flash methods typically performed similarly. The exception is the GTEx data, where the scale mixture of normals (

) performed worse. Detailed investigation revealed this to be due to a very small number of very large “outlier” imputed values, well outside the range of the observed data, which grossly inflated RMSE. These outliers were so extreme that it should be possible to implement a filter to avoid them. However, we did not do this here as it seems useful to highlight this unexpected behavior. (Note that this occurs only when data are missing, and even then only in one of the five datasets considered here.)

### 5.3 Sharing of genetic effects on gene expression among tissues

To illustrate flash in a scientific application, we applied it to the GTEx data described above, a matrix of scores, with reflecting the strength (and direction) of effect of eQTL in tissue . We applied flash with using the greedy+backfitting algorithm (i.e. the backfitting algorithm, initialized using the greedy algorithm).

The flash results yielded 26 factors (Figure 4-5) which summarize the main patterns of eQTL sharing among tissues (and, conversely, the main patterns of tissue-specificity). For example, the first factor has approximately equal weight for every tissue, and reflects the fact that many eQTLs show similar effects across all 44 tissues. The second factor has strong effects only in the 10 brain tissues, from which we infer that some eQTLs show much stronger effects in brain tissues than other tissues.

Subsequent factors tend to be sparser, and many have a strong effect in only one tissue, capturing “tissue-specific” effects. For example, the 3rd factor shows a strong effect only in whole blood, and captures eQTLs that have much stronger effects in whole blood than other tissues. (Two tissues, “Lung” and “Spleen”, show very small effects in this factor but with the same sign as blood. This is intriguing since the lung has recently been found to make blood cells (Lefrançais et al., 2017) and a key role of the spleen is storing of blood cells.) Similarly Factors 7, 11 and 14 capture effects specific to “Testis”, “Thyroid” and “Esophagus Mucosa” respectively.

A few other factors show strong effects in a small number of tissues that are known to be biologically related, providing support that the factors identified are scientifically meaningful. For example, factor 10 captures the two tissues related to the cerebellum, “Brain Cerebellar Hemisphere” and “Brain Cerebellum”. Factor 19 captures tissues related to female reproduction, “Ovary”, “Uterus” and “Vagina”. Factor 5 captures “Muscle Skeletal”, with small but concordant effects in the heart tissues (“Heart Atrial Appendage” and “Heart Left Ventricle”). Factor 4, captures the two skin tissues (“Skin Not Sun Exposed Suprapubic”, “Skin Sun Exposed Lower leg”) and also “Esophagus Mucosa”, possibly reflecting the sharing of squamous cells that are found in both the surface of the skin, and the lining of the digestive tract. In factor 24, “Colon Transverse” and “Small Intestine Terminal Ileum” show the strongest effects (and with same sign), reflecting some sharing of effects in these intestinal tissues. Among the 26 factors, only a few are difficult to interpret biologically (e.g. factor 8).

To highlight the benefits of sparsity, we contrast the flash results with those for softImpute, which was the best-performing method in the missing data assessments on these data, but which uses a nuclear norm penalty that does not explicitly reward sparse factors or loadings. The first eight softImpute factors are shown in Figure 6. The softImpute results – except for the first two factors – show little resemblance to the flash results, and in our view are harder to interpret. Figure 4: Results from running flash on GTEx data (factors 1 - 8). The pve (”Percentage Variance Explained”) for loading/factor k is defined as pvek:=sk/(∑ksk+∑ij1/τij) where sk:=∑ij(\widebarlki\widebarfkj)2. It is a measure of the amount of signal in the data captured by loading/factor k (but its naming as ”percentage variance explained” should be considered loose since the factors are not orthogonal). Figure 5: Results from running flash on GTEx data (factors 15 - 26) Figure 6: Results from running softImpute on GTEx data (factors 1-8). The factors are both less sparse and less interpretable than the flash results.

### 5.4 Computational demands

It is difficult to make general statements about computational demands of our methods, because both the number of factors and number of iterations per factor can vary considerably depending on the data. However, to give a specific example, running our current implementation of the greedy algorithm on the GTEx data (a 16,000 by 44 matrix) takes about 140s (wall time) for point-normal and 650s for (on a 2015 MacBook Air with a 2.2 GHz Intel Core i7 processor and 8Gb RAM). By comparison, a single run of softImpute without CV takes 2-3s, so a naive implementation of 5-fold CV with 10 different tuning parameters and 10 different values of would take over 1000s (although one could improve on this by use of warm starts for example).

## 6 Discussion

Here we discuss some potential extensions or modifications of our work.

### 6.1 Orthogonality constraint

Our formulation here does not require the factors or loadings to be orthogonal. In scientific applications we do not see any particular reason to expect underlying factors to be orthogonal. However, imposing such a constraint could have computational or mathematical advantages. Formally adding such a constraint to our objective function seems tricky, but it would be straightforward to modify our algorithms to include an orthogonalization step each update. This would effectively result in an EB version of the SSVD algorithms in Yang et al. (2014), and it seems likely to be computationally faster than our current approach. One disadvantage of this approach is that it is unclear what optimization problem such an algorithm would solve (but the same is true of SSVD, and our algorithms have the advantage that they deal with missing data.)

### 6.2 Non-negative matrix factorization

We focused here on the potential for EBMF to induce sparsity on loadings and factors. However, EBMF can also encode other assumptions. For example, to assume the loadings and factors are non-negative, simply restrict to be a family of non-negative-valued distributions, yielding “Empirical Bayes non-negative Matrix Factorization” (EBNMF). Indeed, the ashr software can already solve the EBNM problem for some such families , and so flash already implements EBNMF. In preliminary assessments we found that the greedy approach is problematic here: the non-negative constraint makes it harder for later factors to compensate for errors in earlier factors. However, it is straightforward to apply the backfitting algorithm to fit EBNMF, with initialization by any existing NMF method. The performance of this approach is an area for future investigation.

### 6.3 Tensor Factorization

It is also straightforward to extend EBMF to tensor factorization, specifically a CANDECOMP/PARAFAC decomposition

 Yijm =K∑k=1lkifkjhkm+Eijm (6.1) lk1,…,lkn ∼\it iidglk,glk∈G (6.2) fk1,…,fkp ∼\it iidgfk,gfk∈G (6.3) hk1,…,hkr ∼\it iidghk,ghk∈G (6.4) Eijm ∼\it iidN(0,1/τijm). (6.5)

The variational approach is easily extended to this case (a generalization of methods in Hore et al., 2016), and updates that increase the objective function can be constructed by solving an EBNM problem, similar to EBMF. It seems likely that issues of convergence to local optima, and the need for good initializations, will need some attention to obtain good practical performance. However, results in Hore et al. (2016) are promising, and the automatic-tuning feature of EB methods seems particularly attractive here. For example, extending PMD to this case – allowing for different sparsity levels in and – would require 3 penalty parameters even in the rank 1 case, making it difficult to tune by CV.

### 6.4 Non-Gaussian errors

It is also possible to extend the variational approximations used here to fit non-Gaussian models, such as binomial data, using ideas from Jaakkola and Jordan (2000). This extension is detailed in Wang (2017).

## Acknowledgements

We thank P. Carbonetto for computational assistance, and P. Carbonetto, D. Gerard, and A. Sarkar for helpful conversations and comments on a draft manuscript. Computing resources were provided by the University of Chicago Research Computing Center. This work was supported by NIH grant HG02585 and by a grant from the Gordon and Betty Moore Foundation (Grant GBMF #4559).

## Appendix A Variational EBMF with K factors

Here we describe in detail the variational approach to the factor model, including deriving updates that we use to optimize the variational objective. (These derivations naturally include the model as a special case, and our proof of Proposition 2 below includes Proposition 1 as a special case.)

 ql(l1,⋯,lK) =∏kqlk(lk) (A.1) qf(f1,…,fK) =∏kqfk(fk). (A.2)

The objective function (2.8) is thus a function of , , and , as well as the precision :

 F(ql,qf,gl,gf,τ) =∫∏kqlk(lk)qfk(fk)logp(Y,l,f;gl1,gf1,⋯,glK,gfK,τ)∏kqlk(lk)qfk(fk)dlkdfk, (A.3) =Eql,qflogp(Y|l,f;τ)+∑kEqlklogglk(lk)qlk(lk)+∑kEqfkloggfk(fk)qfk(fk). (A.4)

We optimize by iteratively updating parameters relating to , a single loading or factor , keeping other parameters fixed. We simplify implementation by keeping track of only the first and second moments of the distributions and , which we denote . We now describe each kind of update in turn.

### a.1 Updates for precision parameters

Here we derive updates to optimize over the precision parameters . Focusing on the parts of that depend on gives:

 F(τ) =EqlEqf∑ij0.5log(τij)−0.5τij(Yij−∑klkifkj)2+const (A.5) =0.5∑ij[log(τij)+τij\widebarR2ij]+const (A.6)

where is defined by:

 \widebarR2ij :=Eql,qf[(Yij−K∑k=1lkifkj)2] (A.7) =(Yij−∑k\widebarlki\widebarfkj)2−∑k(\widebarlki)2(\widebarfkj)2+∑k\widebarl2ki\widebarf2kj. (A.8)

If we constrain then we have

 ^τ=argmaxτ∈