A Matrix–free Likelihood Method for Exploratory Factor Analysis of High-dimensional Gaussian Data

07/27/2019, by Fan Dai, et al.

This paper proposes a novel profile likelihood method for estimating the covariance parameters in exploratory factor analysis of high-dimensional Gaussian datasets with fewer observations than variables. An implicitly restarted Lanczos algorithm and a limited-memory quasi-Newton method are implemented to develop a matrix-free framework for likelihood maximization. Simulation results show that our method is substantially faster than the expectation-maximization solution without sacrificing accuracy. Our method is applied to fit factor models on data from suicide attempters, suicide ideators and a control group.


I Introduction

Factor analysis [1] is a multivariate statistical technique that characterizes dependence among variables using a small number of latent factors. Suppose that we have a sample Y_1, Y_2, …, Y_n from the p-variate Gaussian distribution with mean vector μ and covariance matrix Σ. We assume that Σ = ΛΛ⊤ + Ψ, where Λ is a p × q matrix of rank q that describes the amount of variance shared among the p coordinates and Ψ is a diagonal matrix with positive diagonal entries representing the unique variance specific to each coordinate. Factor analysis of Gaussian data for n > p was first formalized by [2], with efficient maximum likelihood (ML) estimation methods proposed by [3, 4, 5, 1] and others. These methods, however, do not apply to datasets with n < p, which occur in applications such as the processing of sequencing data [6, 7, 8], the analysis of transcription factor activity profiles of gene regulatory networks using massive gene expression datasets [9], and portfolio analysis of stock returns [10]. In such cases, the available computer memory may be inadequate to store the sample covariance matrix or to make the multiple copies of the dataset needed during computation.

The expectation-maximization (EM) approach of [11] can be applied to datasets with n < p but is computationally slow. So, here we develop a profile likelihood method for high-dimensional Gaussian data. Our method allows us to compute the gradient of the profile likelihood function at negligible additional computational cost and to check first-order optimality, guaranteeing high accuracy. We develop a fast and sophisticated computational framework called FAD (Factor Analysis of Data) to compute ML estimates of Λ and Ψ. Our framework is implemented in an R [12] package called fad.

The remainder of this paper is organized as follows. Section II describes the factor model for Gaussian data and an ML solution using the EM algorithm, and then proposes the profile likelihood and FAD. The performance of FAD relative to EM is evaluated in Section III. Section IV applies our methodology to a functional magnetic resonance imaging (fMRI) dataset related to suicidal behavior [13]. Section V concludes with some discussion. An online supplement, with sections, tables and figures referenced here with the prefix “S”, is available.

II Methodology

II-A Background and Preliminaries

Let Y = (Y_1, Y_2, …, Y_n)⊤ be the n × p data matrix with Y_i⊤ as its ith row. Then, in the setup of Section I, the ML method profiles out μ using the sample mean vector Ȳ and then maximizes the log-likelihood,

ℓ(Λ, Ψ) = −(n/2) { p log(2π) + log|ΛΛ⊤ + Ψ| + tr[ (ΛΛ⊤ + Ψ)^{-1} S ] },   (1)

where S = n^{-1} Y_c⊤ Y_c, Y_c = Y − 1_n Ȳ⊤, and 1_n is the n-vector of 1s. The matrix S is almost surely singular and has rank n − 1 when n ≤ p. The factor model (1) is not identifiable because the matrices Λ and ΛQ give rise to the same likelihood for any q × q orthogonal matrix Q. So, additional constraints [1, 5] are imposed.
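Although not spelled out above, the log-likelihood (1) can be evaluated for large p without ever forming the p × p matrices ΛΛ⊤ + Ψ or S, by combining the matrix determinant lemma with the Woodbury identity. The following R sketch illustrates this point; the function name and arguments are ours and it is not code from the fad package.

# Log-likelihood (1) for the factor model, evaluated matrix-free.
# Uses the determinant lemma  log|LL' + Psi| = log|Psi| + log|I_q + L' Psi^{-1} L|
# and the Woodbury identity for tr{(LL' + Psi)^{-1} S}, so only n x p, n x q and
# p x q arrays are ever formed.
factor_loglik <- function(Yc, Lambda, psi) {
  n <- nrow(Yc); p <- ncol(Yc); q <- ncol(Lambda)
  Lp <- Lambda / psi                                 # Psi^{-1} Lambda  (p x q)
  M  <- diag(q) + crossprod(Lambda, Lp)              # I_q + L' Psi^{-1} L
  logdet <- sum(log(psi)) + as.numeric(determinant(M, logarithm = TRUE)$modulus)
  A   <- Yc %*% Lp                                   # Yc Psi^{-1} Lambda  (n x q)
  tr1 <- sum(sweep(Yc^2, 2, psi, "/")) / n           # tr(S Psi^{-1})
  tr2 <- sum(A * t(solve(M, t(A)))) / n              # tr(A M^{-1} A') / n
  -0.5 * n * (p * log(2 * pi) + logdet + tr1 - tr2)
}

With Yc the centered n × p data matrix, this evaluation costs O(npq) time and O(nq + pq) extra memory.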

II-A1 EM algorithms for parameter estimation

The EM algorithm [11, 14] exploits the structure of the factor covariance matrix by assuming q-variate standard normal latent factors Z_i and writing the factor model as Y_i = μ + ΛZ_i + ε_i, where the ε_i ~ N_p(0, Ψ) are i.i.d. and independent of the Z_i's. The EM algorithm is easily developed, with analytical expressions for both the expectation (E) and maximization (M) steps that can be speedily computed (see Section S1.1).
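As a concrete illustration, below is a minimal R sketch of one EM iteration in the standard Rubin–Thayer form cited above; the matrix-free arrangement (so that the p × p matrix S is never formed) and the variable names are ours, not the fad package's or the reference EM implementation.

# One EM iteration for the factor model (standard Rubin-Thayer updates),
# arranged so that the p x p matrix S is never formed: S %*% delta and diag(S)
# are obtained directly from the centered data Yc.
em_step <- function(Yc, Lambda, psi) {
  n <- nrow(Yc); q <- ncol(Lambda)
  Lp     <- Lambda / psi                                   # Psi^{-1} Lambda
  M      <- diag(q) + crossprod(Lambda, Lp)                # I_q + L' Psi^{-1} L
  delta  <- Lp - Lp %*% solve(M, crossprod(Lp, Lambda))    # (LL' + Psi)^{-1} L  (Woodbury)
  Sdelta <- crossprod(Yc, Yc %*% delta) / n                # S delta  (p x q)
  Czz    <- diag(q) - crossprod(Lambda, delta) + crossprod(delta, Sdelta)
  Lambda_new <- t(solve(Czz, t(Sdelta)))                   # S delta Czz^{-1}
  psi_new    <- colSums(Yc^2) / n - rowSums(Lambda_new * Sdelta)  # diag(S - Lambda_new delta' S)
  list(Lambda = Lambda_new, psi = psi_new)
}

Iterating em_step until the relative change in the log-likelihood (computed, e.g., with factor_loglik above) falls below a tolerance gives the EM estimates.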

Although EM algorithms are guaranteed to increase the likelihood at each iteration and converge to a local maximum, they are well known for their slow convergence. Further, the iterations run in the (pq + p)-dimensional parameter space, which can be slow for very large p. Accelerated variants [15, 16] show superior performance in low-dimensional problems but come with additional computational overhead that dominates the gain in rate of convergence in high dimensions. To enhance speed, EM algorithms also typically forgo checking first-order optimality, compromising numerical accuracy. So, we next develop a fast and accurate method for EFA that is applicable in high dimensions.

II-B Profile likelihood for parameter estimation

We start with the common and computationally useful identifiability restriction that constrains Λ⊤Ψ^{-1}Λ to be diagonal with decreasing diagonal entries. Under this scale-invariant constraint, Λ is completely determined except for changes in sign of its columns. Under this constraint, Λ can be profiled out for a given Ψ as described in the following lemma.

Lemma 1.

Suppose that Ψ is a given positive definite diagonal matrix, and that the q largest singular values of n^{-1/2} Y_c Ψ^{-1/2} are d_1 ≥ d_2 ≥ … ≥ d_q, with the corresponding p-dimensional right-singular vectors forming the columns of V. Then the function ℓ(Λ, Ψ) is maximized over Λ at Λ_Ψ = Ψ^{1/2} V Θ, where Θ is a diagonal matrix with ith diagonal entry √{max(d_i² − 1, 0)}. The profile log-likelihood equals

(2)

where c is a constant that depends only on n and p but not on Ψ. Furthermore, the gradient of the profile log-likelihood with respect to the diagonal entries of Ψ is available in closed form.

Proof.

See Section S1.2. ∎
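The explicit expression for (2) did not survive extraction. As a sketch, maximizing (1) over Λ for fixed Ψ under the setup above gives, in our own derivation (so constants and conventions may differ from the paper's equation (2)):

% Sketch of the profile log-likelihood (our reconstruction of (2)); d_1 >= ... >= d_q
% are the q largest singular values of n^{-1/2} Y_c Psi^{-1/2} and m_i = max(d_i^2, 1).
\ell^{*}(\Psi) \;=\; -\frac{n}{2}\left[\, p\log(2\pi) \;+\; \log\lvert\Psi\rvert
  \;+\; \operatorname{tr}\!\left(S\Psi^{-1}\right)
  \;+\; \sum_{i=1}^{q}\left\{\log m_i - m_i + 1\right\} \right].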

The profile log-likelihood in (2) depends on Ψ only through the q largest singular values of n^{-1/2} Y_c Ψ^{-1/2}. So, in order to compute the profile log-likelihood and its gradient, we need only the q largest singular values of this matrix and the corresponding right singular vectors. For q ≪ min(n, p), as is usually the case, these largest singular values and singular vectors can be computed very fast using the Lanczos algorithm.

Further, the profile log-likelihood is optimized over a p-dimensional space. In the absence of constraints on Ψ and analytically tractable optima, this is the largest possible reduction of the dimension of the parameter space fed into a numerical optimization method. Further constraints on Ψ (e.g., when all diagonal entries are equal) can easily be incorporated in the method if needed. Overall, this reduction in dimensionality further accelerates convergence. Also, the gradient is available in closed form, which enables us to check first-order optimality and to ensure high accuracy.

Finally, the profile log-likelihood is expressed in terms of the sample covariance matrix. However, ML estimators are scale-invariant, so we can estimate Λ and Ψ from the sample correlation matrix and scale the estimates back to the covariance scale. A particular advantage of using the sample correlation matrix is that Ψ then needs to be optimized over a fixed bounded rectangle that does not depend on the data, which is conceivably more robust numerically.
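To make the computation in this subsection concrete, here is a minimal R sketch that evaluates the profile log-likelihood of a given Ψ via a truncated SVD. It uses the reconstructed expression sketched after Lemma 1 and the irlba package for the partial SVD, so it should be read as our illustration rather than the fad implementation.

library(irlba)   # implicitly restarted Lanczos bidiagonalization (Baglama and Reichel)

# Profile log-likelihood l*(psi), using the q largest singular values of
# n^{-1/2} Yc Psi^{-1/2} and the reconstructed expression sketched after Lemma 1.
profile_loglik <- function(psi, Yc, q) {
  n <- nrow(Yc); p <- ncol(Yc)
  Z  <- sweep(Yc, 2, sqrt(psi), "/") / sqrt(n)       # n^{-1/2} Yc Psi^{-1/2}
  d2 <- irlba(Z, nv = q)$d^2                         # q largest squared singular values
  m  <- pmax(d2, 1)
  trSPsiInv <- sum(Z^2)                              # tr(S Psi^{-1}) = ||Z||_F^2
  -0.5 * n * (p * log(2 * pi) + sum(log(psi)) + trSPsiInv + sum(log(m) - m + 1))
}

On the correlation scale, Yc would be the column-standardized data and the entries of psi would be restricted to (0, 1].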

II-C Matrix–free computations

II-C1 A Lanczos algorithm for calculating partial singular values and vectors

In order to compute the profile likelihood and its gradient, we need the q largest singular values and right singular vectors of X = n^{-1/2} Y_c Ψ^{-1/2}. We use the Lanczos bidiagonalization algorithm [17, 18] with reorthogonalization and implicit restarts. Starting from a random unit vector v_1, each Lanczos step applies X and X⊤ once to extend orthonormal bases v_1, …, v_k (of dimension p) and u_1, …, u_k (of dimension n), reorthogonalizing every newly generated vector against the previously generated basis vectors.

After k steps, the projection of X onto these bases is a k × k bidiagonal matrix B_k whose diagonal and superdiagonal entries are the coefficients generated by the recursion. The singular values of B_k and the corresponding left and right singular vectors can be computed cheaply, for example via a Sturm sequencing algorithm [19]. Mapping the right singular vectors of B_k through the basis v_1, …, v_k yields approximations to the right singular vectors of X, and the error in the ith approximate singular value is bounded by a residual involving the last entry of the corresponding singular vector of B_k. The algorithm therefore stops once this residual falls below a prespecified tolerance times the (approximately) largest singular value for each i = 1, …, q, at which point the q largest singular values of X and the corresponding right singular vectors have been accurately approximated.

In finite precision, the reorthogonalized Lanczos recursion can suffer from numerical instability that slows down convergence. To resolve this instability, [17] proposed restarting the Lanczos algorithm, but instead of starting from scratch, the recursion is restarted with a basis augmented by the current approximate singular vectors. The resulting small projected matrix is no longer bidiagonal, but it is still of small dimension and its singular value decomposition can be calculated very fast. Convergence of the Lanczos algorithm can be checked as before, and the restart step is repeated until all of the q largest singular values converge.

The only way X enters this algorithm is through matrix–vector products of the forms Xv and X⊤u, both of which can be computed directly from Y_c and Ψ without explicitly storing X or any p × p matrix. Overall, this algorithm yields the q largest singular values and vectors at a cost dominated by O(np) operations per Lanczos step, using only O((n + p)k) additional memory for a basis size k slightly larger than q, resulting in substantial gains over the traditional methods [3, 4]. These traditional methods require a full eigenvalue decomposition of the p × p sample covariance (or correlation) matrix, which has O(p³) computational complexity and requires O(p²) memory.
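To make the matrix-free structure concrete, the following R sketch implements plain Golub–Kahan–Lanczos bidiagonalization with full reorthogonalization and no restarting, accessing the data only through the products Xv and X⊤u. It is a simplified teaching version of the scheme described above, not the restarted algorithm of [17] used in fad.

# k steps of Golub-Kahan-Lanczos bidiagonalization with full reorthogonalization.
# matvec(v) must return X v and tmatvec(u) must return X' u, so X itself is never
# stored. For k somewhat larger than q, svd(B)$d approximates the largest singular
# values of X and V %*% svd(B)$v the corresponding right singular vectors.
lanczos_bidiag <- function(matvec, tmatvec, p, k) {
  V <- matrix(0, p, k); U <- NULL
  alpha <- beta <- numeric(k)
  v <- rnorm(p); v <- v / sqrt(sum(v^2))
  for (j in 1:k) {
    V[, j] <- v
    u <- matvec(v)
    if (j > 1) {
      u <- u - beta[j - 1] * U[, j - 1]
      u <- u - U %*% crossprod(U, u)                  # reorthogonalize against previous u's
    }
    alpha[j] <- sqrt(sum(u^2)); u <- u / alpha[j]
    U <- cbind(U, u)
    w <- tmatvec(u) - alpha[j] * v
    w <- w - V[, 1:j, drop = FALSE] %*% crossprod(V[, 1:j, drop = FALSE], w)  # reorthogonalize
    beta[j] <- sqrt(sum(w^2)); v <- w / beta[j]
  }
  B <- diag(alpha, nrow = k, ncol = k)
  if (k > 1) B[cbind(1:(k - 1), 2:k)] <- beta[1:(k - 1)]  # upper bidiagonal
  list(B = B, U = U, V = V)
}

For the factor-analysis application, matvec and tmatvec would wrap v ↦ n^{-1/2} Y_c (Ψ^{-1/2} v) and u ↦ n^{-1/2} Ψ^{-1/2} (Y_c⊤ u); in practice a restarted implementation such as the irlba package is preferable for large problems.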

Having described a scalable algorithm for computing the profile log-likelihood and its gradient, we now detail a numerical algorithm for computing the ML estimators.

II-C2 Numerical optimization of the profile log-likelihood

On the correlation scale, the diagonal entries of Ψ lie between 0 and 1. Under this box constraint, the factanal function in R and the factoran function in MATLAB® employ the limited-memory Broyden–Fletcher–Goldfarb–Shanno quasi-Newton algorithm [20] with box constraints (L-BFGS-B) to obtain the ML estimator of Ψ. In high dimensions, the advantages of the L-BFGS-B algorithm are particularly prominent. Because Newton methods require the search direction −H⁻¹g, where H is the p × p Hessian matrix and g the gradient of the objective, they are computationally prohibitive in high dimensions in terms of both storage and numerical complexity. The quasi-Newton BFGS method replaces the exact search direction with an iterative approximation built from previously computed parameter values and gradients. The limited-memory implementation, moreover, uses only the last few (typically fewer than 10) of these values instead of all past iterates. Overall, L-BFGS-B reduces the storage cost from O(p²) to O(p) and the per-iteration computational complexity from O(p³) to O(p). Interested readers are referred to [20, 21] for more details on the L-BFGS-B algorithm.

The L-BFGS-B algorithm requires both the objective function and its gradient to be computed at each iteration. Because the gradient is available as a byproduct of computing the profile log-likelihood (see Sections II-B and II-C1), we modify the implementation to jointly compute both quantities with a single call to the Lanczos algorithm at each L-BFGS-B iteration. In comparison to R's default implementation (factanal), which calls the objective function and the gradient separately in its optimization routine, this tweak halves the computation time.
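A minimal sketch of this optimization step in base R, using the profile_loglik function sketched in Section II-B; for brevity it lets optim approximate the gradient numerically, whereas FAD supplies the closed-form gradient and evaluates it jointly with the objective, so this simplified version is ours and not the fad implementation.

# Maximize the profile log-likelihood over psi on the correlation scale with the
# box constraints 0 < psi_j <= 1 via L-BFGS-B. The gradient is left to optim's
# finite differences here for brevity; FAD uses the closed-form gradient instead.
fit_psi <- function(Yc, q, eps = 1e-4) {
  p <- ncol(Yc)
  negll <- function(psi) -profile_loglik(psi, Yc, q)
  optim(par = rep(0.5, p), fn = negll,
        method = "L-BFGS-B", lower = rep(eps, p), upper = rep(1, p),
        control = list(maxit = 500))
}

Supplying the analytic gradient through optim's gr argument, computed from the same partial SVD as the objective, is what yields the time savings described above.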

III Performance evaluations

III-A Experimental setup

The performance of FAD was compared with that of EM using 100 simulated datasets for each combination of the true number of factors and data dimensions (n, p). For each setting, we simulated the diagonal entries of Ψ, and the entries of μ and Λ, from fixed distributions. We also evaluated performance in a data-driven setting designed to match our application in Section IV: the true Λ and Ψ were set to the ML estimates obtained from that dataset.

For the EM algorithm, Λ was initialized using the first q principal components (PCs) of the scaled data matrix, computed via the Lanczos algorithm, while Ψ was given a fixed starting value. FAD requires only Ψ to be initialized, which was done in the same way as for EM. We stopped FAD when the relative increase in the log-likelihood fell below 100ε, where ε is the machine tolerance (approximately 2.2 × 10⁻¹⁶ in our case). The EM algorithm was terminated when the relative change in the log-likelihood fell below a comparable tolerance, or when the number of iterations reached 5000. Therefore, FAD and EM had comparable stopping criteria. For each simulated dataset, we fit models with 1–10 factors and chose the number of factors by the Bayesian information criterion (BIC) [22], computed from the maximum log-likelihood value with q factors and the corresponding number of free parameters. All experiments were done using R [12] on a workstation with an Intel E5-2640 v3 CPU clocked at 2.60 GHz and 64GB RAM.
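For completeness, a small R sketch of the BIC-based choice of the number of factors; the free-parameter count p(q + 1) − q(q − 1)/2 is the usual one for a q-factor model and is our assumption here, since the exact penalty used in the paper's BIC expression was not recoverable from the text.

# Choose the number of factors by BIC from the maximized log-likelihoods
# loglik[q], q = 1, ..., qmax. The free-parameter count p(q + 1) - q(q - 1)/2 is
# the usual one for a q-factor model and is an assumption here; the paper's exact
# penalty was not recoverable from the text.
choose_q_bic <- function(loglik, n, p) {
  q    <- seq_along(loglik)
  npar <- p * (q + 1) - q * (q - 1) / 2
  bic  <- -2 * loglik + npar * log(n)
  which.min(bic)
}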

III-B Results

Because BIC always correctly picked the true number of factors, we evaluated model fit for each method in terms of the relative Frobenius errors of the estimated correlation matrix and of the estimated signal matrix, computed with respect to the corresponding true matrices.

III-B1 CPU time

Fig. 3: Relative speed of FAD to EM on (left) randomly simulated and (right) data-driven cases.

Figure 3 presents the relative speed of FAD to EM. Specifically, FAD was 10 to 70 times faster than EM on the randomly simulated datasets. The speedup is even more pronounced (see Section S2.1) in the data-driven simulations, where p is much larger.

III-B2 Parameter estimation and model fit

The final log-likelihood values from FAD and EM are identical. Figure S5 shows that the two algorithms also yield essentially identical estimates of the correlation and signal matrices under the best fitted models.

III-C Additional experiments in high-noise scenarios

We conclude this section by evaluating performance in situations where weak factors are ostensibly hard to distinguish from high noise by SVD-based methods and where EM may be preferable [23]. We applied FAD and EM to the simulation setup of [23], in which the uniquenesses were sampled from three inverse-gamma distributions with unit means and different variances. Figure S9 shows that our algorithm was substantially faster while having accuracy similar to that of EM.

IV Suicide ideation study

We applied EFA to data [13] from an fMRI study in which 20 words connoting negative affect were shown to 9 suicide attempters, 8 suicide non-attempter ideators and 17 non-ideator control subjects. For each subject–word combination, [13] provided voxel-wise per cent changes in activation relative to the baseline in image volumes. Restricting attention to the 24547 in-brain voxels yields, for each of the attempter, ideator and control groups, a dataset with one 24547-dimensional observation per subject–word combination. We assumed each dataset to be a random sample from a multivariate normal distribution. Our interest was in determining whether the variation in the per cent relative change in activation for each subject type can be explained by a few latent factors, and whether these factors differ between the three groups of subjects.

For each dataset, we performed EFA with 1–10 factors using both FAD and EM. Table I demonstrates the computational benefits of using FAD over EM. We also used BIC to decide on the optimal number of factors and obtained 2-factor models for both suicide attempters and ideators, and a 4-factor model for the control subjects. Figure 12 provides voxel-wise displays of the factor

q                 1    2    3    4    5    6    7    8    9   10
Attempters  FAD   3    3    4    5    5    6    6    7    9    9
            EM  146  173  207  198  229  236  228  250  239  254
Ideators    FAD   4    4    5    6    6    6    6    9    9   10
            EM  118  197  207  200  222  244  241  226  258  258
Controls    FAD   5    5    8    7    8    8    9   10   12   13
            EM  300  451  456  407  426  461  483  438  566  519
TABLE I: CPU times (in seconds, rounded to the nearest second) for FAD and EM with q = 1, …, 10 factors on the suicide ideation study datasets.

loadings, obtained using the quartimax criterion [24], for each type of subject. All the factor loadings are non-negligible only in voxels around the ventral attention network (VAN), which is one of two sensory orienting systems that reorient attention towards salient stimuli and is closely related to involuntary actions [25]. However, the factor loadings differ within each group and also between the groups.

Fig. 12: Loading values of fitted factors for suicide attempters (a, b: first and second factors), ideators (c, d: first and second factors) and controls (e–h: first to fourth factors).

For instance, for the suicide attempters, each factor is a contrast between different areas of the VAN, but the contrasts themselves differ between the two factors. The first factor for the ideators is a weighted mean of the voxels, while the second factor is a contrast of the values at the VAN voxels. For the controls, the first three factors are different contrasts of the values at different voxels, while the fourth factor is more or less a mean of the values at these voxels. Further, the factor loadings in the control group are more attenuated than for either the suicide attempters or the ideators. While a detailed analysis of our results is outside the purview of this paper, we note that EFA has provided us with distinct factor loadings that potentially explain the variation in suicide attempters, non-attempter ideators and controls. However, our analysis assumed that the image volumes are independent and Gaussian: further approaches relaxing these assumptions may be appropriate.

V Discussion

In this paper, we propose a new ML-based EFA method called FAD that uses a sophisticated computational framework to achieve both high accuracy in parameter estimation and fast convergence via matrix-free algorithms. We implement a Lanczos method for computing partial singular values and vectors and a limited-memory quasi-Newton method for ML estimation. This implementation alleviates the computational limitations of current state-of-the-art algorithms and makes EFA feasible for datasets with many more variables than observations. Although not demonstrated in this paper, FAD is also well suited to distributed computing systems because it uses the data matrix only for computing matrix-vector products. FAD paves the way for fast methods for mixtures of factor analyzers and factor models for non-Gaussian data in high-dimensional clustering and classification problems.

References

  • [1] T. W. Anderson, An Introduction to Multivariate Statistical Analysis

    , ser. Wiley Series in Probability and Statistics.   Wiley, 2003. [Online]. Available:

    https://books.google.com/books?id=Cmm9QgAACAAJ
  • [2] D. N. Lawley, “The estimation of factor loadings by the method of maximum likelihood,” Proceedings of the Royal Society of Edinburgh, vol. 60, pp. 64–82, 01 1940.
  • [3] K. G. Jöreskog, “Some contributions to maximum likelihood factor analysis,” Psychometrika, vol. 32, pp. 443–482, 12 1967.
  • [4] D. N. Lawley and A. E. Maxwell, “Factor analysis as a statistical method,” Journal of the Royal Statistical Society. Series D (The Statistician), vol. 12, pp. 209–229, 1962.
  • [5] K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate analysis.   Elsevier Amsterdam, 2006.
  • [6] J. T. Leek and J. D. Storey, “Capturing heterogeneity in gene expression studies by surrogate variable analysis,” PLoS Genetics, vol. 3, no. 9, p. e161, 2007.
  • [7] J. T. Leek, “Svaseq: Removing batch effects and other unwanted noise from sequencing data,” Nucleic Acids Research, vol. 42, no. 21, p. e161, 2014.
  • [8] F. Buettner, N. Pratanwanich, D. J. McCarthy, J. C. Marioni, and O. Stegle, “f-scLVM: Scalable and versatile factor analysis for single-cell RNA-seq,” Genome Biology, vol. 18, no. 1, p. 212, 2017.
  • [9] I. Pournara and L. Wernisch, “Factor analysis for gene regulatory networks and transcription factor activity profiles,” BMC Bioinformatics, vol. 8, p. 61, 02 2007.
  • [10] C. T. Ng, C. Y. Yau, and N. H. Chan, “Likelihood inferences for high-dimensional factor analysis of time series with applications in finance,” Journal of Computational and Graphical Statistics, vol. 24, pp. 866–884, 11 2014.
  • [11] D. B. Rubin and D. T. Thayer, “EM algorithms for ML factor analysis,” Psychometrika, vol. 47, pp. 69–76, 02 1982.
  • [12] R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2019. [Online]. Available: https://www.R-project.org/
  • [13]

    M. A. Just, L. A. Pan, V. L. Cherkassky, D. L. McMakin, C. B. Cha, M. K. Nock, and D. P. Brent, “Machine learning of neural representations of suicide and emotion concepts identifies suicidal youth,”

    Nature Human Behaviour, vol. 1, pp. 911–919, 2017.
  • [14] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions, ser. Wiley Series in Probability and Statistics.   Wiley, 1996. [Online]. Available: https://books.google.com/books?id=iRSWQgAACAAJ
  • [15] C. Liu and D. B. Rubin, “Maximum likelihood estimation of factor analysis using the ECME algorithm with complete and incomplete data,” Statistica Sinica, vol. 8, pp. 729–747, 08 2002.
  • [16] R. Varadhan and C. Roland, “Simple and globally convergent methods for accelerating the convergence of any EM algorithm,” Scandinavian Journal of Statistics, vol. 35, pp. 335–353, 06 2008.
  • [17] J. Baglama and L. Reichel, “Augmented implicitly restarted Lanczos bidiagonalization methods,” SIAM Journal on Scientific Computing, vol. 27, no. 1, pp. 19–42, 2005.
  • [18] S. Dutta and D. Mondal, “An h-likelihood method for spatial mixed linear model based on intrinsic autoregressions,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 77, pp. 699–726, 09 2014.
  • [19]

    J. H. Wilkinson, “The calculation of the eigenvectors of codiagonal matrices,”

    Computer Journal, vol. 1, pp. 90–96, 02 1958.
  • [20] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, “A limited memory algorithm for bound constrained optimization,” SIAM Journal on Scientific Computing, vol. 16, pp. 1190–1208, 1995.
  • [21] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, “Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization,” ACM Transactions on Mathematical Software, vol. 23, pp. 550–560, 1994.
  • [22] G. E. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, pp. 461–464, 03 1978.
  • [23] A. B. Owen and J. Wang, “Bi-cross-validation for factor analysis,” Statistical Science, vol. 31, pp. 119–139, 03 2015.
  • [24] A. B. Costello and J. Osborne, “Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis,” Practical Assessment, Research & Evaluation, vol. 10, pp. 1–9, 01 2005.
  • [25] S. Vossel, J. J. Geng, and G. R. Fink, “Dorsal and ventral attention systems: Distinct neural circuits but collaborative roles,” The Neuroscientist, vol. 20, no. 2, pp. 150–159, 2014.

S1 Supplementary materials for Methodology

S1.1 The EM algorithm for factor analysis on Gaussian data

The complete-data log-likelihood function is

(S1)

where the additive constant does not depend on the parameters.

S1.1.1 E-step computations

Since the ML estimate of μ is Ȳ, at the current estimates of Λ and Ψ the expected complete-data log-likelihood (the so-called Q function) is given by

(S2)

The conditional moments of the latent factors required in (S2) are

(S3)

and

(S4)

S1.1.2 M-step computations

The parameters Λ and Ψ are obtained by maximizing the Q function in (S2). Specifically, given Ψ and the conditional moments from the E-step, the maximizer in Λ is given by

(S5)

By the Woodbury matrix identity (Henderson and Searle, 1981), the matrix inverse appearing in (S5) can be computed in q dimensions, so (S5) can be simplified as

(S6)

Next, given the updated Λ, the ML estimate of Ψ is given by

(S7)

Substituting (S5), we get

(S8)
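Since the explicit expressions (S2)–(S8) did not survive extraction, the following LaTeX sketch records the standard Rubin–Thayer E- and M-step formulas for this model in our notation; the paper's own expressions may differ in form.

% Standard Rubin-Thayer E- and M-steps for the factor model, in our notation.
% E-step, with \delta = (\Lambda\Lambda^\top + \Psi)^{-1}\Lambda
%               = \Psi^{-1}\Lambda\,(I_q + \Lambda^\top\Psi^{-1}\Lambda)^{-1} (Woodbury):
\mathbb{E}(Z_i \mid Y_i) = \delta^\top (Y_i - \bar{Y}), \qquad
\operatorname{Var}(Z_i \mid Y_i) = I_q - \Lambda^\top \delta .
% M-step, with S the sample covariance matrix:
\Lambda^{\text{new}} = S\delta\left(I_q - \Lambda^\top\delta + \delta^\top S\delta\right)^{-1},
\qquad
\Psi^{\text{new}} = \operatorname{diag}\!\left(S - \Lambda^{\text{new}}\,\delta^\top S\right).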

S1.2 Proof of Lemma 1

From equation (1), the ML estimates of Λ and Ψ are obtained by solving the score equations

(S9)

From the score equation for Λ, we have

(S10)

Suppose that the eigenvalues of Ψ^{-1/2} S Ψ^{-1/2} are arranged in decreasing order, with the q largest eigenvalues d_1² ≥ … ≥ d_q² and the corresponding eigenvectors forming the columns of V. Then (S10) shows that

(S11)

The square roots d_1, …, d_q are the q largest singular values of n^{-1/2} Y_c Ψ^{-1/2}, and the columns of V are the corresponding right-singular vectors. Hence, conditional on Ψ, the log-likelihood is maximized at Λ_Ψ = Ψ^{1/2} V Θ, where Θ is the diagonal matrix given in Lemma 1.

From the construction of Λ_Ψ and Θ, the determinant and trace terms in the log-likelihood simplify to

(S12)

and

(S13)

Based on (S12) and (S13), the profile log-likelihood in (2) follows:

(S14)

S2 Additional results for simulation studies

S2.1 Average CPU time

1 2 3 4 5 6
FAD 0.101 0.092 0.096 0.116 0.122 0.128
EM 0.170 0.152 0.546 0.136 0.178 0.208
FAD 0.639 0.514 0.486 0.841 0.966 1.025
EM 0.808 0.962 15.363 0.970 1.435 1.679
FAD 2.933 2.658 2.580 7.135 7.863 8.59
EM 57.052 57.527 81.463 49.689 48.508 49.504
TABLE S1: Average CPU time (in seconds) of FAD and EM applied with 1–10 factors for randomly simulated datasets.
1 2 3 4 5 6 7 8 9 10
FAD 0.102 0.095 0.096 0.096 0.094 0.119 0.124 0.128 0.134 0.137
EM 0.191 0.165 0.104 0.149 0.408 0.163 0.208 0.224 0.254 0.308
FAD 0.667 0.513 0.501 0.507 0.497 0.828 0.919 1.039 1.108 1.143
EM 0.614 0.591 0.610 0.654 11.534 1.301 1.489 1.559 1.927 2.027
FAD 2.937 2.687 2.553 2.583 2.590 7.114 8.167 9.238 10.426 11.157
EM 47.200 47.333 49.867 49.956 71.308 47.469 47.119 43.828 45.304 44.141
TABLE S2: Average CPU time (in seconds) of FAD and EM applied with 1–10 factors for randomly simulated datasets.
1 2 3 4 5 6 7 8
FAD 5.007 4.222 10.835 13.636
EM 253.021 304.909 311.916 303.712
FAD 4.927 4.104 10.411 12.058
EM 287.824 345.504 331.919 314.723
FAD 6.645 7.121 7.449 6.688 22.294 26.575 31.109 34.208
EM 648.759 734.226 745.902 735.263 767.010 789.614 802.502 748.395
TABLE S3: Average CPU time (in seconds) of FAD and EM applied to data-driven models.

S2.2 Evaluations of relative estimation errors

Fig. S5: Relative Frobenius errors of FAD and EM for (a) the correlation matrix and (b) the signal matrix on randomly simulated cases, and for (c) the correlation matrix and (d) the signal matrix on data-driven cases.

S2.3 Performance of FAD compared to EM for high-noise models

Fig. S9: (a) Relative speed of FAD to EM on datasets with high noise; relative Frobenius errors of FAD and EM for (b) the correlation matrix and (c) the signal matrix.

References

Henderson, H. V. and Searle, S. R. (1981), “On deriving the inverse of a sum of matrices,” SIAM Review, 23, 53–60.