Deterministic parallel analysis

11/11/2017 ∙ by Edgar Dobriban, et al. ∙ University of Pennsylvania

Factor analysis is widely used in many application areas. The first step, choosing the number of factors, remains a serious challenge. One of the most popular methods is parallel analysis (PA), which compares the observed factor strengths to simulated ones under a noise-only model. This paper presents a deterministic version of PA (DPA), which is faster and more reproducible than PA. We show that DPA selects large factors and does not select small factors just like [Dobriban, 2017] shows for PA. Both PA and DPA are prone to a shadowing phenomenon in which a strong factor makes it hard to detect smaller but more interesting factors. We develop a deflated version of DPA (DDPA) that counters shadowing. By raising the decision threshold in DDPA, a new method (DDPA+) also improves estimation accuracy. We illustrate our methods on data from the Human Genome Diversity Project (HGDP). There PA and DPA select seemingly too many factors, while DDPA+ selects only a few. A Matlab implementation is available.


1 Introduction

Factor analysis is widely used in many application areas (see e.g., Malinowski, 2002; Bai and Ng, 2008; Brown, 2014). Given data consisting of measurements made on objects or subjects, factor analysis summarizes them in terms of some number of latent factors, plus noise. Finding and interpreting those factors may be very useful. However, it is hard to choose the number of factors to include in the model.

There have been many proposed methods to choose the number of factors. Classical methods include using an a priori fixed singular value threshold (Kaiser, 1960) and the scree test, a subjective search for a gap separating small from large singular values (Cattell, 1966). The recent surge in data sets with the number of variables comparable to or even larger than the number of observations has been met with new methods in econometrics (Bai and Li, 2012; Ahn and Horenstein, 2013) and random matrix theory (Nadakuditi and Edelman, 2008), among many others. In this paper we focus exclusively on the method of parallel analysis (PA), first introduced by Horn (1965), and variations of it.

PA, especially the version by Buja and Eyuboglu (1992), is one of the most popular ways to pick the number of factors in factor analysis. It is used, for example, as a step in bioinformatics papers such as Leek and Storey (2008). It seems that PA is popular because users find it ‘works well in practice’. That property is hard to quantify on data with an unknown ground truth. We believe PA ‘works well’ because it selects factors above the largest noise singular value of the data, as shown by Dobriban (2017). Our goal in this paper is to provide an improved method that estimates this threshold faster and more reproducibly than PA does.

We will describe and survey parallel analysis in detail below. Here we explain how it is based on sequential testing, assuming some familiarity with factor analysis. If there were no factors underlying the data, then the variables would be independent. We can then compare the size of the apparent largest factor in the data matrix (the largest singular value) to what we would see in data with IID observations, each of which consists of independent but not identically distributed variables. We can simulate such data sets many times, either by generating new random variables or by permuting each column of the original data independently. If the first singular value is above some quantile of the simulated first singular values, then we conclude that k ≥ 1. If we decide that k ≥ 1, then we check whether the second factor (second singular value) has significantly large size compared to simulated second factors, and so on. If the (j+1)'st factor is not large compared to simulated ones, then we stop with k = j factors.

Our first improvement on PA is to derandomize it. For large sample sizes and dimensions, the size of the largest noise singular value can be predicted very well using tools from random matrix theory. The prediction depends very little on the distribution of the data values, a fact called universality. For example, random matrix theory developed for Gaussian data (Johnstone, 2001) has been seen to work very well on single nucleotide polymorphism (SNP) data that take values in {0, 1, 2} (Patterson et al., 2006). The recently developed Spectrode tool (Dobriban, 2015) computes the canonical distribution of noise singular values—the so-called Silverstein-Marchenko-Pastur distribution—without any simulations. We propose to use the upper edge of that distribution as a factor selection threshold.

We call the resulting method DPA for deterministic PA. A deterministic threshold has two immediate advantages over PA. It is reproducible, requiring no randomization. It is also much faster because we do not need to repeat the factor analysis on multiple randomly generated data sets.

DPA also frees the user from choosing an ad hoc significance level for singular values. If there are no factors in the data, then PA will mistakenly choose at least one factor with a probability corresponding to the quantile used in simulation, because the top sample singular value has the same distribution as the simulated ones. As a result, we need a huge number of simulations to get sufficiently extreme quantiles that we rarely make mistakes. By contrast, DPA can be adjusted so that the probability of false positives vanishes. For instance, the user can insist that a potential factor must be some fixed multiple larger than what we would see in uncorrelated noise.

Zhou et al. (2017) recently developed a related method to derandomize PA, also by employing the Spectrode algorithm of Dobriban (2015) to approximate the noise level. However, there are several crucial differences between our approaches. First, we use the variances of the columns as the population spectral distribution to be passed to Spectrode, while Zhou et al. (2017) fit a five-point mixture to the empirical eigenvalue spectrum and use the resulting upper edge. Second, we provide theoretical consistency results, while Zhou et al. (2017) evaluate the methods empirically on some genetic data.

In addition to randomness, PA has a problem with shadowing. If the first factor is very large then the simulations behind PA lump that factor into the noise, raising the threshold for the next factors. Then PA can miss some of the next factors. This was seen empirically by Owen and Wang (2016) and explained theoretically by Dobriban (2017) who coined the term ‘shadowing’. It is fairly common for real data sets to contain one ‘giant factor’ explaining a large proportion of the total variance. In finance, many stock returns move in parallel. In educational testing, the ‘overall ability’ factor of students is commonly large. Such giant factors are usually well known a priori and can shadow smaller factors that might otherwise have yielded interesting discoveries.

We counter shadowing by deflation. If the first factor is large enough to keep, then we estimate its contribution and subtract it from the data. We then look for additional factors within the residual. The resulting deflated DPA (DDPA) is less affected by shadowing.

Deflating one factor counters shadowing by lowering the threshold for the next. The result can be a cascade in which threshold reductions by DDPA trigger further increases in the selected number of factors, and vice versa. Recent work has shown that the threshold for detecting a factor is lower than the threshold for getting beneficial estimates of the corresponding eigenvectors (Perry, 2009; Gavish and Donoho, 2014). Because deflation involves subtracting an outer product of weighted eigenvectors, we consider raising the deflation threshold to only keep factors that improve estimation accuracy. We call this method DDPA+.

Vitale et al. (2017) proposed a method for selecting the number of components using permutations as in PA, including deflation, using a standardized eigenvalue test statistic, and adjusting the permuted residual by a certain projection method. Their first two steps are essentially deflated (but not derandomized) PA. Similarly to DDPA, this suffers from over-selection, and their remaining two steps deal with that problem. Their approach differs from ours in several important aspects, including that our final method is deterministic, and that we prove rigorous theoretical results supporting it.

An outline of this paper is as follows. Section 2 gives some context on factor analysis, random matrices, and parallel analysis. Section 3 presents our deterministic PA (DPA). We show that, similarly to PA, DPA selects perceptible factors and omits imperceptible factors. Section 4 presents our deflated version of DPA, called DDPA that counters the shadowing phenomenon by subtracting estimated factors. We raise the threshold in DDPA to improve estimation accuracy. We call this final version DDPA+. Section 5 presents numerical simulations where we know the ground truth. In our simulations, DPA is more stable than PA and much faster too. DDPA is better able to counter shadowing than DPA but the number of factors it chooses has more variability than with DPA. Raising the decision threshold in DDPA+ appears to reduce its variance. The results are reproducible with software available from github.com/dobriban/DPA. In Section 6 we compare the various methods on the Human Genome Diversity Project (HGDP) dataset (Li et al., 2008). We believe that PA, DPA, and DDPA choose too many factors in that data. The choice for DDPA+ seems to be more accurate but conservative. Section 7 gives a faster way to compute the DPA threshold we need by avoiding the full Spectrode computation as well as a way to compute the DDPA+ threshold. We close with a discussion of future work in Section 8. The proofs are presented in Section 9.

2 Background

In this section we introduce our notation and quote some results from random matrix theory. We also describe goals of factor analysis and survey some of the literature on parallel analysis. Space does not allow a full account of factor analysis.

2.1 Notation

The data are the n × p matrix X, containing p features of n samples. The rows of X are IID p-dimensional observations x_i, for i = 1, …, n. We will work with the standard factor model, where observations have the form x_i = Λ η_i + ε_i. Here Λ is a non-random p × k₀ matrix of factor loadings, η_i is the k₀-dimensional vector of factor scores of the i-th observation, and ε_i is the p-dimensional noise vector of the i-th observation.

In matrix form, this can also be written as

X = η Λ' + E.   (1)

Here, η is an n × k₀ random matrix containing the factor values, so η_i' is the i'th row of η. Also E is the n × p matrix containing the noise, so ε_i' is the i'th row of E.

Thus, the noise is modeled as ε_i = Φ^{1/2} z_i, where z_i' is the i'th row of Z, so that E = Z Φ^{1/2}. The matrix Z contains independent random variables with mean 0 and variance 1, and Φ is a diagonal matrix of idiosyncratic variances φ_1, …, φ_p. The covariance matrix of x_i is Σ = Λ Ψ Λ' + Φ, where Ψ is the covariance of η_i. We also define a scaled factor loading matrix, denoted L, by suitably rescaling the loadings. Note that we could reparametrize the problem by this matrix. The factor model has known identifiability problems, see for instance Bai and Li (2012) for a set of possible resolutions. However, in our case, we only want to estimate the number of large factors, which is asymptotically identifiable.

The model (1) contains k₀ factors, and we generally do not know k₀. We use k factors, and k is not necessarily the same as k₀.
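
To fix ideas, here is a minimal simulation sketch of the factor model (1) in Python/NumPy (the paper's own code is in Matlab). All sizes and parameter values below are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k0 = 500, 200, 2                      # illustrative sizes

Lambda = rng.normal(size=(p, k0))           # factor loadings (p x k0)
eta = rng.normal(size=(n, k0))              # factor scores (n x k0)
phi = rng.uniform(1.0, 2.0, size=p)         # idiosyncratic variances (diagonal of Phi)
Z = rng.normal(size=(n, p))                 # standardized noise
E = Z * np.sqrt(phi)                        # noise E = Z Phi^{1/2}
X = eta @ Lambda.T + E                      # data matrix, cf. equation (1)
```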

We will need several matrix and vector norms. For a vector v we use the usual ℓ_1, ℓ_2, and ℓ_∞ norms ||v||_1 = Σ_i |v_i|, ||v||_2 = (Σ_i v_i²)^{1/2}, and ||v||_∞ = max_i |v_i|. An unsubscripted vector norm denotes the ℓ_2 norm. For a matrix A we use the induced norms

||A||_q = sup_{v ≠ 0} ||Av||_q / ||v||_q.

The spectral—or operator—norm is ||A|| = ||A||_2 = σ_max(A), while ||A||_1 = max_j Σ_i |A_ij| and ||A||_∞ = max_i Σ_j |A_ij|. The Frobenius norm is given by ||A||_F = (Σ_{i,j} A_ij²)^{1/2}.

We consider S = η Λ' to be the signal matrix and E to be the noise. It is convenient to normalize the data, via

X̃ = n^{-1/2} X = S̃ + Ñ, with S̃ = n^{-1/2} S and Ñ = n^{-1/2} E.   (2)

The aspect ratio of X is γ_p = p/n. We consider limits with n, p → ∞ and γ_p → γ > 0. In our models, under some precise assumptions stated later, ||Ñ|| → b almost surely, for some b > 0. We call b the size of the noise.

For a positive semi-definite p × p matrix A, with eigenvalues λ_1(A), …, λ_p(A), its empirical spectral distribution is

F_A = p^{-1} Σ_{i=1}^p δ_{λ_i(A)},

where δ_x denotes a point mass distribution at x. In our asymptotic setting, F_A converges weakly to a limit distribution F, denoted F_A ⇒ F. For a general bounded probability distribution G, its upper edge U(G) is the essential supremum of its support,

U(G) = inf{x : G((x, ∞)) = 0}.

The sample covariance matrix is Σ̂ = n^{-1} X'X, and we let Φ̂ = diag(Σ̂). If there are no factors, then Φ̂ is a much better estimate of Φ than Σ̂ is. We call the empirical spectral distribution Ĥ_p of Φ̂ the variance distribution. The singular values of a matrix A are denoted by σ_k(A). The eigenvalues of A'A are σ_k(A)².
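
In code, the sample covariance, the variance distribution, and the eigenvalues can be computed as follows. This is a sketch; the 1/n normalization and the assumption that X is already centered reflect the conventions above.

```python
import numpy as np

def variance_distribution(X):
    """Sample covariance, its diagonal (the 'variance distribution'),
    and the eigenvalues of the sample covariance; assumes X is centered."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n                                  # sample covariance (1/n convention)
    phi_hat = np.diag(Sigma_hat)                             # column variances
    eig_vals = np.linalg.svd(X, compute_uv=False) ** 2 / n   # eigenvalues of Sigma_hat
    return Sigma_hat, phi_hat, eig_vals
```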

2.2 Random matrix theory

Here we outline some basic results needed from random matrix theory; see (Bai and Silverstein, 2010; Paul and Aue, 2014; Yao et al., 2015) for reference. We focus on the Marchenko-Pastur distribution and the Tracy-Widom distribution.

For the spherically symmetric case we have Φ = I_p, but the eigenvalues of the sample covariance matrix do not all converge to unity in our limit, when the sample size is only a constant times larger than the dimension. Instead, the empirical spectral distribution of Σ̂ converges to the Marchenko-Pastur (MP) distribution (Marchenko and Pastur, 1967). If n and p tend to infinity with γ_p = p/n → γ ≤ 1, then the MP distribution has probability density function

f_γ(x) = √((γ₊ − x)(x − γ₋)) / (2πγx),   γ₋ ≤ x ≤ γ₊,   (3)

where γ± = (1 ± √γ)².

If γ > 1, then the MP distribution is a mixture placing probability 1/γ on the above density and probability 1 − 1/γ on a point mass at 0. In either case, it is also known that the largest eigenvalue of the sample covariance matrix converges to the upper edge of the MP distribution, γ₊ = (1 + √γ)². If the components of x_i are IID with kurtosis κ, then the largest entry of Φ̂ is the largest of p IID random variables with mean 1 and variance (κ − 1)/n. Then the largest entry of Φ̂ will be much closer to 1 than the largest eigenvalue of Σ̂ is.
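
A quick numerical illustration of this convergence in the white-noise case, with arbitrary sizes chosen for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4000, 1000
gamma = p / n
Z = rng.normal(size=(n, p))                      # pure noise, Phi = I
lam_max = np.linalg.eigvalsh(Z.T @ Z / n).max()  # largest sample covariance eigenvalue
mp_edge = (1 + np.sqrt(gamma)) ** 2              # upper edge of the standard MP law
print(lam_max, mp_edge)                          # these should be close
```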

For a general distribution H of bounded nonnegative eigenvalues of Φ, there is a generalized Marchenko-Pastur distribution, which we denote MP(H, γ) (Bai and Silverstein, 2010; Yao et al., 2015). If γ_p → γ and the empirical spectral distribution of Φ converges to H, then under some moment conditions the empirical spectral distribution of Σ̂ converges to MP(H, γ).

We can use Spectrode (Dobriban, 2015) to compute MP(H, γ), and obtain its upper edge U(H, γ), for any H. Typically we will use the variance distribution Ĥ_p or the spectral distribution of Φ itself. It is convenient to use U(H) as a shorthand for U(H, γ). If the data are noise without any factors, we can estimate Φ by the diagonal matrix of variances Φ̂ and expect that the largest eigenvalue of the empirical noise covariance matrix should be close to U(Ĥ_p, γ_p).

The largest eigenvalue of the sample covariance matrix is distributed around the upper edge of MP(H, γ). The difference, appropriately scaled, has an asymptotic Tracy-Widom distribution. See Hachem et al. (2016) and Lee and Schnelli (2016) for recent results. In Gaussian white noise, Johnstone (2001) shows that the scale parameter is

s_{n,p} = n^{−2/3} (1 + √γ_p)(1 + 1/√γ_p)^{1/3},   (4)

so that fluctuations in the largest eigenvalue quickly become negligible.

2.3 Factor analysis

A common approach to factor analysis begins with a principal component analysis (PCA) of the data, and selects factors based on the size of the singular values of X. Let X have singular values σ_j(X) for j = 1, …, min(n, p). Now choosing k is equivalent to choosing a threshold τ and retaining all factors with σ_j(X) > τ. There are several different goals for factor analysis and different numbers of factors may be best for each of them. Here we list some goals.

In some cases the goal is to select interpretable factors. In many applications, there are known factors that we expect to see in the data. These are the giant factors mentioned above, such as in finance, where many stock returns move in parallel. In such cases, interest centers on new interpretable factors beyond the already known ones. For instance, we may look for previously unknown segments of the market. The value of the threshold is what matters here, however, users will typically also inspect and interpret the factor loadings.

Another goal is estimation of the signal matrix S, without necessarily interpreting all of the factors, as a way of denoising X. Then one can choose a threshold by calculating the effect of retained factors on the mean squared error. Methods for estimating the unobserved signal have been developed under various assumptions by Perry (2009), Gavish and Donoho (2014), Nadakuditi (2014) and Owen and Wang (2016). The optimal k for estimation can be much smaller than the true value k₀.

In bioinformatics, factor models are useful for decorrelating genomic data so that hypothesis test statistics are more nearly independent. See for example, Leek and Storey (2007), Sun et al. (2012), Gagnon-Bartsch et al. (2013), and Gerard and Stephens (2017). Test statistics that are more nearly independent are better suited to multiplicity adjustments such as the Benjamini-Hochberg procedure.

2.4 Parallel analysis

Parallel analysis was introduced by Horn (1965). The idea is to simulate independent data with the same variances as the original data but without any of the correlations that factors would induce. If the observed top eigenvalue of the sample covariance is significantly large, say above the simulated 95'th percentile, then we include that factor/eigenvector in the model. If the first j factors have been accepted, then we accept the (j+1)'st one if it is above the 95'th percentile of its simulated counterparts. Horn's original approach used the mean instead of the 95'th percentile. Horn was aware of the bias in the largest eigenvalue of sample covariance matrices even before Marchenko and Pastur (1967) characterized it theoretically.

There is a lot of empirical evidence in favor of PA. For instance, Zwick and Velicer (1986) compared five methods in some small simulated data sets. Their criterion was choosing a correct k, not signal recovery. They concluded that “The PA method was consistently accurate.” The errors of PA were typically overestimates of k (about 65% of the errors). In addition, Glorfeld (1995) wrote that “Numerous studies have consistently shown that Horn’s parallel analysis is the most nearly accurate methodology for determining the number of factors to retain in an exploratory factor analysis.” He also reported that the errors are mostly of overestimation.

The version of PA by Buja and Eyuboglu (1992) replaces Gaussian simulations by independent random permutations within the columns of the data matrix (See Algorithm 1). Permutations have the advantage of removing the parametric assumption. They are widely used for decorrelation in genomics to identify latent effects such as batch variation.

1:input: Data X, centered, containing p features of n samples. Number of permutations N (default N = 19). Percentile α (default α = 100).
2:Generate N independent random matrices X_π^(1), …, X_π^(N) in the following way. Permute the entries in each column of X (each feature) separately.
3:Initialize: k = 0.
4:S ← distribution of the (k+1)-th largest singular values of X_π^(1), …, X_π^(N).
5:if σ_{k+1}(X) is greater than the α-th percentile of S then
6:     k ← k + 1
7:     Return to step 4.
8:return: Selected number of factors k.
Algorithm 1 PA: Parallel Analysis
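
A compact Python/NumPy sketch of Algorithm 1 (the implementation accompanying the paper is in Matlab). The defaults mirror the listing above; the function and argument names are ours.

```python
import numpy as np

def parallel_analysis(X, n_perm=19, percentile=100, seed=None):
    """PA sketch: compare each singular value of X to the same-ranked
    singular values of column-permuted copies (cf. Algorithm 1)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    sv = np.linalg.svd(X, compute_uv=False)
    perm_sv = np.empty((n_perm, min(n, p)))
    for b in range(n_perm):
        Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
        perm_sv[b] = np.linalg.svd(Xp, compute_uv=False)
    k = 0
    while k < min(n, p) and sv[k] > np.percentile(perm_sv[:, k], percentile):
        k += 1
    return k
```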

PA was developed as a purely empirical method, without theoretical justification. Recently, Dobriban (2017) developed a theoretical understanding: PA selects certain perceptible factors in high-dimensional factor models, as described below. Dobriban (2017) clarified the limitations of PA, including the shadowing phenomenon mentioned above.

3 Derandomization

If there are no factors, so k₀ = 0 in (1), then X is a matrix of uncorrelated noise. We can estimate Φ by Φ̂ and compute the upper edge of the MP distribution induced by Ĥ_p with aspect ratio γ_p, i.e., U(Ĥ_p, γ_p). This is a deterministic approximation of the location of the largest noise eigenvalue of Σ̂.

Our new method of Derandomized Parallel Analysis (DPA) chooses k factors, where k satisfies

λ_k(Σ̂) > U(Ĥ_p, γ_p) ≥ λ_{k+1}(Σ̂).

The Spectrode method (Dobriban, 2015) computes MP(Ĥ_p, γ_p), from which we can obtain this upper edge. We give an even faster algorithm in Section 7.1 that computes U(Ĥ_p, γ_p) directly.

The threshold that DPA estimates is a little different from the PA one. For the k'th singular value, PA is estimating something like the 1 − k/p quantile of the generalized Marchenko-Pastur distribution, while DPA uses the upper edge. This difference is minor for small k.
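
A minimal sketch of the DPA rule. It assumes the convention above of comparing the eigenvalues of X'X/n to the upper edge U(Ĥ_p, γ_p), and it calls the mp_upper_edge helper sketched in Section 7.1 below; the optional multiplier mult corresponds to the raised threshold discussed later in this section and is our own parameter name.

```python
import numpy as np

def dpa(X, mult=1.0):
    """DPA sketch: count eigenvalues of X'X/n above (mult times) the
    Marchenko-Pastur upper edge computed from the estimated variances."""
    n, p = X.shape
    variances = X.var(axis=0)                               # ddof=0 matches the 1/n convention
    edge = mp_upper_edge(variances, np.full(p, 1.0 / p), p / n)
    eig_vals = np.linalg.svd(X, compute_uv=False) ** 2 / n  # eigenvalues of the sample covariance
    return int(np.sum(eig_vals > mult * edge))
```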

We say that factor k is perceptible if lim inf σ_k(X̃) ≥ (1 + c)b a.s. for some c > 0, where b is the almost sure limit of ||Ñ|| defined in Section 2. It is imperceptible if lim sup σ_k(X̃) ≤ b a.s. holds.

Perceptible factors are defined in terms of σ_k(X̃) and b, and not in terms of underlying factor signal size. This may seem a bit circular, or even tautological. However, there are two reasons for adopting this definition. First, as Dobriban (2017) has argued, in certain spiked covariance models one can show that a large underlying signal size leads to separated factors. Second, this is the “right” definition mathematically, leading to elegant proofs. Moreover, the definition is related to the BBP phase transition (Baik et al., 2005); and it reduces to the BBP phase transition for the usual spiked models. However, our definition also makes sense in other models where the number of spikes diverges to infinity. These are not included in usual spiked models. See Dobriban (2017), Section 5, for a detailed explanation.

Theorem 3 below shows that DPA selects all perceptible factors and no imperceptible factors, just like PA does. While the result is similar to those on PA from Dobriban (2017), the proof is different. Let Ψ^{1/2} be the symmetric square root of Ψ, and recall the scaled factor loading matrix L defined in Section 2.1.

[DPA selects perceptible factors] Let the centered data matrix follow the factor model (1) with IID , . Assume the following conditions (for some positive constants and ):

  1. Factors: The factors have the form , where has independent entries with mean zero, variance one, and bounded moments of order .

  2. Idiosyncratic terms: The idiosyncratic terms are , where is a diagonal matrix, and have independent entries with mean zero, variance one, and bounded moments of order .

  3. Asymptotics: , such that , and there is a bounded distribution with and .

  4. Number of factors: The number of factors is fixed, or grows such that is summable.

  5. Factor loadings: The scaled loadings are delocalized, in the sense that .

Then with probability tending to one, DPA selects all perceptible factors, and no imperceptible factors.

Proof.

See Section 9.1. ∎

This shows that the behavior of DPA is similar to that of PA (Dobriban, 2017). However, there are some important differences. In contrast to the result on PA, Theorem 3 allows both a growing number of factors, as well as growing factor strength. Specifically, the number of factors can grow, as long as is summable. For instance, if we assume that , which translates into the condition that the 8-th moments of the entries of are uniformly bounded, then can grow as fast as for any .

The factor strengths can grow as follows. Let the scaled factor loading matrix have columns for . These columns can have a diverging Euclidean norm so long as the sum of the max-norms tends to zero, . As a result, the Frobenius norm of the scaled loading matrix can grow, subject to . There is a tradeoff between the number of factors and their sizes. For instance, if , then the scaled factors can grow while subject to their Euclidean norm being .

DPA may yield false positives and false negatives stemming from factors that are not perceptible. In particular, there is a positive probability, even asymptotically, for DPA to pick k ≥ 1 from data with no factors. This may be why Glorfeld (1995) found PA’s errors were ones of overestimation.

We can remove false positives asymptotically by only taking factors above some multiple of the estimated upper edge. That is, we select a factor only if its eigenvalue exceeds a fixed multiple, greater than one, of U(Ĥ_p, γ_p). The choice of that multiple is problem dependent; we think a modest value is large enough to trim out noise in the bioinformatics problems we have in mind. A factor that is only a small multiple of what we would expect in pure noise is not likely to be worth interpreting.

For estimation of the number of factors, we might want to choose the threshold in a principled way using the Tracy-Widom distribution. We could add to U(Ĥ_p, γ_p) a constant multiple of the scale s_{n,p} given by (4), with the constant chosen as some fixed percentile of the Tracy-Widom distribution. However, this leads to only a tiny increase in the threshold, so we decided it is not worth making the method more complicated with this additional parameter. Therefore, in the following DDPA and DDPA+ algorithms and simulations, we use the unadjusted upper-edge threshold.

For estimation of the signal matrix S, we will discuss a different choice in Section 4.1.

4 Deflation

To address the shadowing of mid-sized factors, we will deflate X. Let the singular value decomposition (SVD) of X be X = Σ_i σ_i u_i v_i', with non-increasing σ_i. If we decide to keep the first factor, then we subtract σ_1 u_1 v_1' from X, recompute the upper edge of the MP distribution and repeat the process. This ensures that the very strong factors are removed, and thus they do not shadow mid-sized factors. We call this DDPA for deflated DPA.

1:input: Data X, centered, containing p features of n samples
2:Initialize: k = 0.
3:Compute variance distribution: Ĥ ← empirical distribution of the column variances of X.
4:if σ_1(X)²/n > U(Ĥ, γ_p) [by default the threshold multiplier is one] then
5:     k ← k + 1
6:     X ← X − σ_1(X) u_1 v_1' (from the SVD of X)
7:     Return to step 3.
8:return: Selected number of factors k.
Algorithm 2 DDPA: Deflated Deterministic Parallel Analysis

After k deflations, the current residual matrix is X − Σ_{i ≤ k} σ_i u_i v_i'. Thus, this algorithm requires only one SVD computation.
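
A sketch of Algorithm 2 in the same style. The way the variance distribution of the residual is recomputed (column variances of the deflated matrix, with the 1/n convention) is our reading of step 3; the mp_upper_edge helper is the one sketched in Section 7.1.

```python
import numpy as np

def ddpa(X, mult=1.0):
    """DDPA sketch: accept the top factor of the current residual if its
    eigenvalue exceeds the MP upper edge recomputed from the residual's
    column variances, then deflate and repeat (cf. Algorithm 2)."""
    n, p = X.shape
    _, sv, Vt = np.linalg.svd(X, full_matrices=False)    # a single SVD suffices
    k = 0
    while k < len(sv) - 1:
        # Column variances of the residual sum_{i>k} sv[i] u_i v_i' (X assumed centered);
        # they follow from orthonormality of the left singular vectors.
        variances = (Vt[k:] ** 2).T @ (sv[k:] ** 2) / n
        edge = mp_upper_edge(variances, np.full(p, 1.0 / p), p / n)
        if sv[k] ** 2 / n > mult * edge:
            k += 1
        else:
            break
    return k
```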

Deflation is a myopic algorithm. At each step we inspect the strongest apparent factor. If it is large compared to a null model, we include it. Otherwise we stop. There is no lookahead. There is a potential downside to myopic selection. If some large singular values are close to each other, but well separated from the bulk, the algorithm might stop without taking any of them, even though it could be better to take all of them. We think that this sort of shadowing from below is unlikely in practice.

Deflation must be analyzed carefully, as the errors in estimating singular values and vectors could propagate through the algorithm. To develop a theoretical understanding of deflation, our first step is to show that it works in a setting which parallels Theorem 3. Recall the scaled factor loading matrix L defined in Section 2.1. The statement of this theorem involves the Stieltjes transform, which is defined in more detail in the supplement.

[DDPA selects the perceptible factors] Consider factor models under the conditions 1 through 3 of Theorem 3, with modified conditions 4 and 5 as follows:

  1. Number of factors: The number of factors is fixed.

  2. Factor loadings: The vectors of scaled loadings are delocalized in the sense that, for . They are also delocalized with respect to , in the sense that

    (5)

    uniformly for , , and with fixed.

Then with probability tending to one, DDPA with the default threshold selects all perceptible factors, and no imperceptible factors.

Proof.

See the supplement. ∎

The proof requires one new key step, controlling the norm of the empirical spike eigenvectors in certain spiked models. The delocalization condition (5) is somewhat technical, and requires that the scaled loadings are “random-like” with respect to Φ. This condition holds if the loadings are random vectors with IID coordinates, albeit only in the special case where the entries of Φ have a vanishing variability, so that we are in a near-homoskedastic situation. For other matrices Φ, there can of course be other vectors that make this condition hold.

In practice, the advantage of deflation is that it works much better than plain DPA in the “explosive” setting with strong factors. We see this clearly in the simulations of Section 5. Our theoretical results for deflation are comparable to usual DPA, so this empirical advantage is not reflected in Theorem 2. Analyzing the strong-factor model theoretically seems to be beyond what the current proof techniques can accomplish.

As with DPA, we might want to increase the threshold in order to trim factors that are not perceptible, or too small to be interesting or useful.

[Slightly increasing the threshold retains the properties of DDPA] Consider the DDPA algorithm above, where in step 4 of Algorithm 2 we select a factor only if σ_1(X)²/n exceeds a multiple of U(Ĥ, γ_p), with the multiple tending to one. Then the resulting algorithm satisfies Theorem 2.

Proof.

The proof is immediate, by checking that the steps of the proof of Theorem 2 go through. ∎

4.1 DDPA+: Increasing the threshold to account for eigenvector estimation

Every time deflation removes a factor, the threshold for the next factor is decreased. For some data, DDPA will remove many factors one by one. It is worth considering a criterion more stringent than perceptibility.

Deflation involves subtracting σ_1(X) u_1 v_1' from the current residual X. This matrix is a noisy estimate of the corresponding singular space of the signal matrix S. A higher standard than perceptibility is to require that σ_1(X) u_1 v_1' be an accurate estimate of the corresponding quantity of the matrix S. For this purpose we use a higher threshold for the singular values that was obtained by Perry (2009) and also by Gavish and Donoho (2014). This PGD threshold was derived for a white noise model. In Section 7.2 we extend it to a more general factor analysis setup using ideas from the OptShrink algorithm (Nadakuditi, 2014). We call the resulting method DDPA+. We note that the Tracy-Widom scale factor of (4) is too small to use in DDPA+. We provide the detailed algorithm in Algorithm 3.

1:input: Data , centered, containing features of samples
2:Initialize: .
3:Compute singular values , and eigenvalues sorted in decreasing order. Let , .
4:Compute spectral functionals: Stieltjes transforms:
D-transform: , Population spike estimate: , Derivatives of Stieltjes transforms:
and derivative of D-transform .
5:Compute estimators of the squared cosines: , .
6:if then
7:     
8:      (from the SVD of )
9:     Return to step 3.
10:return: Selected number of factors .
Algorithm 3 DDPA+: Deflated Deterministic Parallel Analysis+

5 Numerical experiments

We perform numerical simulations to understand and compare the behavior of our proposed methods. We follow the simulation setup of Dobriban (2017).

(a) Effect of signal strength
(b) Running time
Figure 1: (a) Mean and SD of number of factors selected by PA and DPA as a function of signal strength. All SDs were zero. Small amount of jitter added for visual clarity. (b) Running time (in log-seconds) of PA and DPA.

5.1 DPA versus PA

First we compare DPA to PA. For PA, we use the most classical version, generating 19 permutations, and selecting the k-th factor if σ_k(X) is larger than all of the permuted k-th singular values.

We simulate from the factor model (1). We generate the noise as E = Z Φ^{1/2}, where Φ is a diagonal matrix of noise variances uniformly distributed on a grid between one and two. The factor loadings are generated by taking a random matrix with IID entries, normalizing its columns, and multiplying by a scalar corresponding to the factor strength. Also, the factor scores and the entries of Z are IID standard normal.

We use a one-factor model, so , and work with sample size and dimension . We report additional experiments with larger and non-Gaussian data in the Supplement. We take , for on a linear grid of points between and inclusive. This is the same simulation setup used in Dobriban (2017). We repeat the simulation times.

The results in Figure 1(a) show that PA and DPA both select the right number of factors as soon as the signal strength is larger than about . This agrees with our theoretical predictions, since both algorithms select the perceptible factors. The close match between PA and DPA confirms our view that PA is estimating the upper edge of the spectrum.

A key advantage of DPA is its speed. PA and DPA both perform a first SVD of X to compute the singular values σ_k(X). This takes O(np·min(n, p)) flops. Afterwards, PA generates N independent permutations X_π^(i), and computes their SVDs. This takes O(N·np·min(n, p)) flops. It would be reasonable to run both PA and DPA with a truncated SVD algorithm computing only the first K singular values, where K is a problem specific a priori upper bound on k. DPA would still have a cost advantage, as that truncated SVD would only need to be computed once, versus N + 1 times for PA.

After computing the SVD, DPA only needs to compute U(Ĥ_p, γ_p), the upper edge of the MP distribution. The fast method described in Section 7.1 has an approximate cost of O(p) per iteration. We do not know whether the number of iterations required to achieve a desired accuracy depends strongly on the problem size.

In conclusion, the cost of DPA should be lower by a factor of the order of N than that of PA. In empirical applications, the number of SVD computations is typically about 20, thus this represents a meaningful speedup.
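
To illustrate the truncated SVD option mentioned above, here is a short sketch; K is an illustrative a priori bound, not a value from the paper.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 1000))
K = 50                               # illustrative a priori upper bound on k
_, s, _ = svds(X, k=K)               # computes only the K largest singular values
s = np.sort(s)[::-1]                 # svds returns them in ascending order
```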

DPA is also much faster empirically. In Figure 1(b), we report the results of a simulation where we increase the dimension and sample size together, in steps of 500. We use N = 19 permutations in PA. We only obtain one MC sample for each parameter setting. We set the signal strength so that the problem is “easy” statistically. Both methods are able to select the correct number of factors, regardless of the sample size (data not shown due to space limitations). Figure 1(b) shows that DPA is much faster than PA. On a log-scale, we see that the improvement is between 10x and 15x. The expected improvement without the upper edge computation would be 20x, but the time needed for that computation reduces the improvement. However, DPA is still much faster than PA.

5.2 DDPA versus DPA

(a) 2-Factor model
(b) 3-Factor model
Figure 2: Mean and SD of the number of factors selected by DPA and DDPA as a function of the stronger factor value, with 2 and 3 factors.

Here we show that DPA is affected by shadowing and that DDPA counters it. We consider a two-factor and a three-factor model, with the same parameters as above, including heteroskedastic noise. For the two-factor model, we set the smaller signal strength to , which is a perceptible factor. We vary the larger factor strength on a point grid from to . This is the same simulation setup used in Dobriban (2017). The simulation was repeated times at each level. The results in Figure 2(a) show that DPA correctly finds both factors for small, but not for large, . DPA begins to suffer from shadowing at and shadowing is solidly in place by . In contrast, DDPA on average selects a constant number of factors, only slightly more than the true number, even when the larger factor is very strong.

We see more scatter in the choices of k by DDPA than by DPA. Both DPA and DDPA are deterministic functions of X, but DDPA subtracts a function of estimated singular vectors. As we mentioned in Section 4.1, those vectors are not always well estimated, which could contribute to the variance of DDPA. The SD for DPA is zero when all simulations picked the same k.

For the three-factor model, we set , and vary the largest factor on the same point grid between and . In Figure 2(b), we see the same behavior as for the two-factor model. For there is very little shadowing, by one factor is being shadowed and by we start to see a second factor getting shadowed. DDPA counters the shadowing, once again with more randomness than DPA has. In both cases SD is approximately for DDPA.

5.3 DDPA+ versus DDPA

(a) 1-Factor model
(b) 3-Factor model
Figure 3: Mean and SD of the number of factors selected by DDPA as a function of stronger factor value, with 1 and 3 factors. With an increased threshold (DDPA+), and without (DDPA).

Here we examine the behavior of DDPA+. We consider the same multifactor setting as in the previous section, now with a one-factor and a three-factor model.

The results in Figure 3(a) show that the increased threshold leads to some weak factors not being selected by DDPA+. The threshold at which a factor is included is increased compared to DDPA with the default threshold, as expected. For the three-factor model in Figure 3(b), increasing the threshold counters the variability that we saw for DDPA in Figure 2(b). In conclusion, increasing the threshold has the expected behavior.

Moreover, we see an increased variability for DDPA+ in the leftmost setting of Figure 3(b), where the varying factor strength equals that of the other factors. The problem is that the two population singular values are equal, and this is a problematic setting for choosing the number of factors. The problem disappears almost immediately when one strength is just barely larger than the other, so there is only a small range of signal strengths where this happens. We expect a very asymmetric distribution of the selected k there, so perhaps the SD does not depict it well. This is a type of “shadowing from below”.

6 Data analysis

We consider the Human Genome Diversity Project (HGDP) dataset (e.g., Cann et al., 2002; Li et al., 2008). The purpose of collecting this dataset was to evaluate the diversity in the patterns of genetic variation across the globe. We use the CEPH panel, in which SNP data was collected for 1043 samples representing 51 different populations from Africa, Europe, Asia, Oceania and the Americas. We obtained the data from www.hagsc.org/hgdp/data/hgdp.zip. We provide the data and processing pipeline on this paper’s GitHub page.

The data has n = 1043 samples, and we focus on the SNPs on chromosome 22. Thus we have an n × p data matrix X, where X_ij is the number of copies of the minor allele of SNP j in the genome of individual i. We standardize the data SNP-wise, centering each SNP by its mean, and dividing by its standard error. For this step, we ignore missing values. Then, we impute the missing values as zeroes, which are also equal to the mean of each SNP.

Figure 4: Singular value histogram of the HGDP data. Overlaid are the thresholds where selection of factors stops for the four methods.

In Figure 4, we show the singular values of the HGDP data, and thresholds for selection of our methods: PA, DPA, DDPA and DDPA+. For PA, we use the sequential version, where the observed singular value must exceed all permuted counterparts. PA selects 212 factors, DPA 122, DDPA 1042, and DDPA+ 4.

Because we have scaled the data, the generalized Marchenko-Pastur distribution here reduces to the usual one. The bulk edge of that distribution is exceeded by all 122 factors selected by DPA. PA selects even more factors. The PA threshold is a noisy estimate of a high quantile of the Marchenko-Pastur distribution. Though their thresholds are not very far apart, the density of sample singular values is high in that region, so PA and DPA pick quite different values of k.

Every time DDPA increases k by one, the threshold it uses decreases, and the cascade terminated with k = 1042 factors. This is the rank of our centered data matrix, and so the residual is zero. There are 51 populations represented in the data and it is plausible that a very large number of factors are present in the genome. They cannot all be relatively important and we certainly cannot estimate all those singular vectors accurately. Restricting to well estimated singular vectors, as DDPA+ does, leads to k = 4 factors. The DDPA+ threshold is well separated from the one that would lead to selecting a fifth factor.

Since the ground truth is not known, we cannot be sure which k is best. We can however make a graphical examination. Real factors commonly show some interesting structure, while we expect spurious factors or poorly determined ones to appear Gaussian. Of course, this is entirely heuristic. However, practitioners often attempt to interpret the PC estimators of the factors in exploratory FA, so this is a reasonable check.

Perry and Owen (2010) even base a test for factor structure on the non-Gaussianity of the singular vectors, using random rotation matrices.

Figures 5 and 6 show the left singular vectors of the data corresponding to the top 24 singular values. We see a clear clustering structure in at least the leading PCs. DDPA+ selects only 4 factors, which we interpret as being conservative about estimating the singular vectors beyond the 4'th. While it is visually apparent that there is some structure in a few eigenvectors beyond the 4'th, DDPA+ stopped because that structure could not be confidently estimated. There is much less structure in the later PCs. Thus, we believe that PA and DPA select many more factors than can be well estimated from this data even if that many factors exist. DDPA failed completely on this data, unlike in our simulations, where it was merely variable. Here, the histogram of the bottom singular values never looked like a generalized Marchenko-Pastur distribution. We think DDPA+ is more robust.

To be clear, we think that the main reason why DDPA does not work is that there is a lot of clustering in the data. There are samples from several different populations from Europe, Asia, and Africa. Within each continent, there are samples from different regions, and within those there are samples from different countries. The IID assumption is not accurate. Instead, it would be more accurate to use a hierarchical mixture model. Simulations in the appendix show how DDPA fails under such a model. However, developing methods for such models is beyond our current scope.

Figure 5: Scatterplots of the left singular vectors 1–12 of the HGDP data. Each point represents a sample.

Figure 6: Scatterplots of the left singular vectors 13–24 of the HGDP data.

7 Computing the thresholds

In this section we describe how our thresholds are computed. First we consider a fast method to compute the DPA threshold. Then we describe how to compute the DDPA+ threshold.

7.1 Fast computation of the upper edge

We describe a fast method to compute the upper edge of the Marchenko-Pastur distribution, given the population spectrum and the aspect ratio . Our method described here is, implicitly, one of the steps of the Spectrode method, which computes the density of the whole MP distribution (Dobriban, 2015). However, this subproblem was not considered separately in Dobriban (2015), and thus calling the Spectrode method can be inefficient when we only need the upper edge.

We are given a probability distribution H which is a mixture of point masses at t_1, …, t_J, with weights w_1, …, w_J ≥ 0 and Σ_j w_j = 1. We are also given the aspect ratio γ. As a consequence of the results in Silverstein and Choi (1995), see e.g., Bai and Silverstein (2010, Lemma 6.2), the upper edge has the following characterization. Let t_max = max_j t_j, and assume that t_max > 0, so that the interval below is well-defined. Consider the map

ψ(v) = −1/v + γ Σ_j w_j t_j / (1 + t_j v).

On the interval (−1/t_max, 0), ψ has a unique minimum v*. Then, U(H, γ) = ψ(v*). Moreover, ψ is strictly convex on that interval, and asymptotes to +∞ at both endpoints. This enables finding v* numerically by solving ψ′(v) = 0 using any one-dimensional root finding algorithm, such as bisection or Brent’s method (Brent, 1973).
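
A sketch of this computation in Python/SciPy (the paper's code is in Matlab). The function name mp_upper_edge is ours, and we minimize ψ directly with a bounded one-dimensional optimizer rather than root-finding on ψ′; both approaches use the characterization above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mp_upper_edge(t, w, gamma):
    """Upper edge of the Marchenko-Pastur distribution with population
    spectrum H = sum_j w[j] * delta_{t[j]} and aspect ratio gamma,
    via the Silverstein-Choi characterization: minimize psi on (-1/max(t), 0)."""
    t = np.asarray(t, dtype=float)
    w = np.asarray(w, dtype=float)
    lo = -1.0 / t.max()

    def psi(v):
        return -1.0 / v + gamma * np.sum(w * t / (1.0 + t * v))

    # psi is strictly convex on (lo, 0) and diverges at both endpoints, so a
    # bounded minimizer finds the unique interior minimum; nudge the bounds inward.
    res = minimize_scalar(psi, bounds=(lo * (1 - 1e-10), lo * 1e-12), method="bounded")
    return res.fun

# Sanity check against the classical case H = delta_1: the edge is (1 + sqrt(gamma))^2.
print(mp_upper_edge([1.0], [1.0], 0.5), (1 + np.sqrt(0.5)) ** 2)
```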

7.2 The DDPA+ threshold

For simplicity we consider discerning whether our residual matrix, from which k factors have already been subtracted, contains an additional (k+1)'th well estimated factor or not. Let Y = S + N, where S = θ u v' is the signal, with unit vectors u, v and θ ≥ 0, and N is the noise. Let Y = Σ_i σ̂_i û_i v̂_i', where σ̂_1 ≥ σ̂_2 ≥ ⋯, be the SVD of Y.

We compare two estimators of S: the first empirical singular matrix σ̂_1 û_1 v̂_1', and the zero matrix, which corresponds to a hard thresholding estimator on the first singular value. Now σ̂_1 û_1 v̂_1' is more accurate than the zero matrix in mean squared error, when

||σ̂_1 û_1 v̂_1' − S||_F < ||S||_F.

By expanding the square, isolating σ̂_1, and squaring, the criterion becomes

σ̂_1² < 4 θ² (u'û_1)² (v'v̂_1)².   (6)

The theory of spiked covariance models (e.g., Benaych-Georges and Nadakuditi, 2012) shows that as n, p → ∞, under the conditions of the theorems in our paper and for fixed factor strengths, the empirical spike singular value σ̂_1 and the empirical squared cosines (u'û_1)², (v'v̂_1)² have a.s. limits. Moreover, the OptShrink method (Nadakuditi, 2014) provides consistent estimators of the unknown population spike θ, and of the squared cosines. OptShrink relies on the empirical singular values σ̂_j, j = 1, …, min(n, p). To estimate the unknown population spike θ, it uses the singular values σ̂_j for j > r, for some prespecified rank r. Here we take r = 1.

The specific formulas can be found in Benaych-Georges and Nadakuditi (2012). They are implemented in the code at github.com/dobriban/DPA, and the full procedure is stated in Algorithm 3. In outline, from the remaining singular values one forms plug-in versions of the Stieltjes-transform functionals and of the D-transform, evaluates them at σ̂_1 to estimate the population spike, and then estimates the squared cosines between the true and empirical left and right singular vectors.
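
The sketch below spells out one way to implement these steps, under our reading of Benaych-Georges and Nadakuditi (2012) and Nadakuditi (2014): it forms plug-in Stieltjes-transform functionals and the D-transform from the remaining singular values, estimates the population spike and the squared cosines, and applies criterion (6). It assumes n ≤ p and r = 1, and the exact normalizations may differ from those in the paper's code.

```python
import numpy as np

def ddpa_plus_keep(Y, r=1):
    """Decide whether the r-th factor of the residual Y (n x p, n <= p assumed)
    should be kept, by estimating the population spike and the squared cosines
    from the remaining singular values and applying criterion (6)."""
    n, p = Y.shape
    assert n <= p, "transpose Y so that n <= p for this sketch"
    sv = np.linalg.svd(Y, compute_uv=False)      # n singular values, decreasing
    z = sv[r - 1]                                # candidate spike singular value
    s = sv[r:]                                   # remaining singular values (noise proxies)

    # Plug-in Stieltjes-transform functionals and D-transform, evaluated at z.
    phi = np.mean(z / (z ** 2 - s ** 2))                        # n-side transform
    dphi = np.mean(-(z ** 2 + s ** 2) / (z ** 2 - s ** 2) ** 2) # its derivative in z
    c = (n - r) / (p - r)
    phib = c * phi + (1 - c) / z                                # p-side transform (extra zeros)
    dphib = c * dphi - (1 - c) / z ** 2
    D, dD = phi * phib, dphi * phib + phi * dphib

    theta2 = 1.0 / D                                            # estimated squared spike
    cos_u2 = -2.0 * phi / (theta2 * dD)                         # left squared cosine estimate
    cos_v2 = -2.0 * phib / (theta2 * dD)                        # right squared cosine estimate

    # Criterion (6): keep the factor only if the rank-one estimate beats zero in MSE.
    return z ** 2 < 4.0 * theta2 * cos_u2 * cos_v2
```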

8 Discussion

In this paper we have developed a deterministic counterpart, DPA, to the popular PA method for selecting the number of factors. DPA is completely deterministic and hence reproducible. It is also much faster than PA. Both of these traits make it a suitable replacement for PA when factor analysis is used as one step in a lengthy data analysis workflow. Under standard factor analysis assumptions, DPA inherits PA’s vanishing probability of selecting false positives proved in Dobriban (2017).

PA and DPA have a shadowing problem that we can remove by deflation. The resulting DDPA algorithm counters shadowing but chooses a more random number of factors because it subtracts poorly estimated singular vectors. We counter this in our DDPA+ algorithm by raising the threshold to only include factors whose singular vectors are well determined.

Factor analysis is used in data analysis pipelines for multiple purposes, including factor interpretation, data denoising, and data decorrelation. The best choice of is probably different for each of these uses. Denoising has been well studied but we think more work is needed for the other goals.

Acknowledgments

This work was supported by the U.S. NSF under grants DMS-1521145 and DMS-1407397. The authors are grateful to Jeha Yang for pointing out an error in an earlier version of the manuscript.

9 Proof outlines

In this section we provide some proof outlines. The remaining proofs are in the Appendix in the Supplementary Material.

9.1 Proof of Theorem 3

9.1.1 General theory

Recall that X̃ = S̃ + Ñ for signal and noise matrices S̃ and Ñ, where ||Ñ|| → b a.s. We begin with a technical lemma that provides a simple condition for DPA to select the perceptible factors.

[DPA selects the perceptible factors] If U(Ĥ_p, γ_p) → b² almost surely, then DPA selects all perceptible factors and no imperceptible factors, almost surely.

The proof is immediate. In the remainder, we will provide concrete conditions under which this convergence holds. Recall that the noise has the form E = Z Φ^{1/2}, where the entries of Z are independent random variables of mean zero and variance one. Then the true (population) variances are the entries of the diagonal matrix Φ, and we let H_p denote the spectral distribution of Φ.

[Noise operator norm] Let , where the entries of are independent standardized random variables with bounded -th moment for some , and is a diagonal positive semi-definite matrix with . Let with , while and where the limiting distribution has . Then almost surely.

Proof.

This essentially follows from Corollary 6.6 of Bai and Silverstein (2010). A small modification is needed to deal with the non-IID-ness, as explained in Dobriban et al. (2017). ∎

Using Proposition 9.1.1, which identifies b² = U(H, γ), we can ensure the condition required in Lemma 9.1.1 by showing that U(Ĥ_p, γ_p) → U(H, γ). We now turn to this.

9.1.2 Upper edge of MP law

In this section, we will provide conditions under which U(Ĥ_p, γ_p) → U(H, γ). This statement requires a delicate new analysis of the MP law, going back to the first principles of the Marchenko-Pastur-Silverstein equation.

[Convergence of upper edge] Suppose that as , , , and . Then .

Proof.

See the supplement. ∎

Theorem 9.1.2 provides sufficient conditions guaranteeing the convergence of the MP upper edge. For normalized data with , as in the setting of Proposition 9.1.1, these conditions are satisfied provided that the signal columns of have vanishing norms.

Let the scaled data be with noise . Let and obey the conditions of Proposition 9.1.1 except that the moment condition on is increased from to bounded moments for some . Let have columns , and suppose that . Then and .

Proof.

See the supplement. ∎

Theorem 9.1.2 and Proposition 9.1.2 provide a concrete set of assumptions that can be used to prove our main result for factor models.

9.1.3 Finishing the proof of Theorem 3

It remains to check that the conditions of Propositions 9.1.1 and 9.1.2 hold. In matrix form, the factor model reads . We first normalize it to have operator norm of unit order: .

We need to verify conditions on signal and noise. The assumption about the idiosyncratic terms in Theorem 3 satisfies the conditions of Proposition 9.1.2 which are stricter than those of Proposition 9.1.1. So the noise conditions are satisfied by assumption.

Now we turn to the signal . It has columns . For Proposition 9.1.1, we need to show that . Let . Then

Now , where denotes the -th entry of . Thus

Next, by definition which approaches by condition 5 on factor loadings in Theorem 3. It now suffices to show that .

Each has IID entries with mean 0 and variance 1, so a.s. by the LLN. The problem is to guarantee that the maximum of such random variables also converges to 1. If is fixed, then this is clear. A more careful analysis allows to grow subject to the limits in Assumption 4 of the theorem, and using the bounded moments of order for the entries of given by assumption 1. Specifically, with we can write

for some . Thus is a.s. bounded, if the sequence is summable. Summability holds by the assumption that the sequence is summable. This finishes the proof.

10 Appendix

In this section, we prove the remaining theorems and some supporting results. Familiarity with random matrix theory is assumed. See for instance Yao et al. (2015). We also provide some additional numerical results, and algorithm details.

By [n] we mean the set {1, 2, …, n}.

10.1 Proof of Theorem 9.1.2

The Stieltjes transform of is

(7)

and we let be its companion Stieltjes transform. The Silverstein equation (Marchenko and Pastur, 1967; Silverstein and Choi, 1995) for shows that for , is the unique solution in to

(8)

Let be the function defined on the right hand side of (8). Then is the functional inverse of the map on its domain of definition.

From Bai and Silverstein (2010, Chapter 6) it follows that the supremum of the support of can be expressed in a simple form in the following way. Consider intervals for such that for all . It is easy to see that the set of such intervals is nonempty. Let be the infimum of such . Then,

From Bai and Silverstein (2010, Chapter 6), it also follows that the structure of is such that has a singularity at 0, and one or more singularities at a discrete set of values that does not have a largest accumulation point. Among these, let be the largest value. Then, , for all , as or . Moreover, , and for , for .

Based on the above structure, in order to show the convergence