Adaptive Randomized Dimension Reduction on Massive Data

The scalability of statistical estimators is of increasing importance in modern applications. One approach to implementing scalable algorithms is to compress data into a low dimensional latent space using dimension reduction methods. In this paper we develop an approach for dimension reduction that exploits the assumption of low rank structure in high dimensional data to gain both computational and statistical advantages. We adapt recent randomized low-rank approximation algorithms to provide an efficient solution to principal component analysis (PCA), and we use this efficient solver to improve parameter estimation in large-scale linear mixed models (LMM) for association mapping in statistical and quantitative genomics. A key observation in this paper is that randomization serves a dual role, improving both computational and statistical performance by implicitly regularizing the covariance matrix estimate of the random effect in a LMM. These statistical and computational advantages are highlighted in our experiments on simulated data and large-scale genomic studies.

1 Introduction

In the current era of information, large amounts of complex high dimensional data are routinely generated across science and engineering disciplines. Estimating and exploring the underlying low-dimensional structure in data and using this latent structure to study scientific problems is of fundamental importance in a variety of applications. As the size of the data sets increases, the problems of statistical inference and computational feasibility become inextricably linked. Dimension reduction is a natural approach to summarizing massive data and has historically played a central role in data analysis, visualization, and predictive modeling. It has had significant impact both on statistical inference (Adcock, 1878; Edegworth, 1884; Fisher, 1922; Hotelling, 1933; Young, 1941) and on numerical analysis research and applications (Golub, 1969; Golub and Van Loan, 1996; Gu and Eisenstat, 1996; Golub et al., 2000); for a recent review see (Mahoney, 2011). Historically, statisticians have focused on the study of theoretical properties of estimators, often in the context of an asymptotically large number of samples. Numerical analysts and computational mathematicians, on the other hand, have been instrumental in the development of useful and tractable algorithms with provable stability and convergence guarantees. Naturally, many of these algorithms have been successfully applied to compute estimators grounded on solid statistical foundations. A classic example of this interplay is principal component analysis (PCA) (Hotelling, 1933). In PCA, an objective function is defined based on statistical considerations about the sample variance in the data, which can then be efficiently computed using a variety of singular value decomposition (SVD) algorithms developed by the numerical analysis community.

In this paper, we consider the problem of dimension reduction, focusing on integrating statistical considerations of estimation accuracy and out-of-sample prediction error for matrices with latent low-rank structure with computational considerations of run time and numerical accuracy. The methodology that we develop builds on a classical approach to modeling large data, which first compresses the data, minimizing the loss of relevant information, and then applies statistical estimators appropriate for small-scale problems. In particular, we focus on dimension reduction via generalized eigendecomposition as the means for data compression, and on out-of-sample residual error as the measure of information loss. The scope of this work includes applications to a large number of dimension reduction methods, which can be implemented as solutions to truncated generalized eigendecomposition problems (Hotelling, 1933; Fisher, 1936; Li, 1991; Wu et al., 2010). Our first focus is on the increasing need to compute an SVD of massive data using randomized algorithms developed in the numerical analysis community (Drineas et al., 2006; Sarlos, 2006; Liberty et al., 2007; Boutsidis et al., 2009; Rokhlin et al., 2009; Halko et al., 2011), which simultaneously reduce the dimension and regularize, or control, the impact of independent random noise.

The second focus in this paper is to provide efficient solvers for the linear mixed models that arise in statistical and quantitative genomics. In high-throughput genomics experiments, a vast amount of sequencing data is collected—on the order of tens of millions of genetic variants. The goal of genome-wide association studies is to test for a statistical association at each variant (polymorphic position) to a response of interest (e.g., gene expression levels or disease status) in a sample cohort. However, as the size of these genomic data and sample sizes continue to increase, there is an urgent need to improve the statistical and computational performance of standard tests.

It is typical to collect several thousand individuals for one study. These individuals may come from several genetically heterogeneous populations. It has been recognized since 2001 (Pritchard and Donnelly, 2001) that the ancestry makeup of the individuals in the study has great potential to influence study results; in particular, spurious associations arise when genetic variants whose frequencies differ across populations appear to be associated with a response that also varies with the latent population structure.

The earliest methods (e.g., genomic control) accounted for population structure by using covariate estimates to correct for these confounding signals. More recently, linear mixed models have been used successfully to correct spurious results in the presence of population structure. LMMs have been shown to improve power in association studies while reducing false positives (Yang et al., 2014). However, mixed models incur a high computational cost when performing association studies because of the computational burden of computing a covariance matrix for the random effect controlling for population structure. Significant work has gone into mitigating such costs using spectral decompositions for efficient covariance estimation (Kang et al., 2008, 2010; Yang et al., 2011; Zhou and Stephens, 2012; Listgarten et al., 2012).

In this work, we show, using simulations of genomic data with latent population structure and real data from large-scale genomic studies, that the adaptive randomized SVD (ARSVD) developed here achieves both computational efficiency and numerical accuracy. Under certain settings, we find that the LMM using ARSVD outperforms current state-of-the-art approaches by implicitly performing regularization on the covariance matrix.

There are three key contributions of this paper:

  1. We develop an adaptive algorithm for randomized singular value decomposition (SVD) in which both the number of relevant singular vectors and the number of iterations of the algorithm are inferred from the data based on informative statistical criteria.

  2. We use our adaptive randomized SVD (ARSVD) algorithm to construct truncated generalized eigendecomposition estimators for PCA and linear mixed models (LMMs) (Listgarten et al., 2012; Zhou and Stephens, 2012).

  3. We demonstrate on simulated and real data examples that the randomized estimators provide a computationally efficient solution, and, furthermore, often improve statistical accuracy of the predictions. We show that in an over-parametrized situation this improvement in accuracy is due to implicit regularization imposed by the randomized approximation.

In Section 2, we describe the adaptive randomized SVD procedure we use for the various dimension reduction methods. In Section 2.5, we provide randomized estimators for linear mixed models used in quantitative genetics. In Section 3, we validate the proposed methodology on simulated and real data and compare our approach with state-of-the-art approaches. In particular, we show results from our approach for estimating low dimensional geographic structure in genomic data and for genetic association mapping applications.

2 Randomized algorithms for dimension reduction

In this section we develop algorithmic extensions for PCA. We state an algorithm that provides a numerically efficient and statistically robust estimate of the highest variance directions in the data using a randomized algorithm for singular value decomposition (Randomized SVD) (Rokhlin et al., 2009; Halko et al., 2011). In this problem, the objective is linear unsupervised dimension reduction with the low-dimensional subspace estimated via an eigendecomposition. Randomized SVD will serve as the core computational engine for the other estimators we develop in this paper.

2.1 Notation

Given positive integers $n$ and $p$, $\mathbb{R}^{n \times p}$ denotes the class of all real matrices of dimension $n \times p$, and $\mathbb{S}^{p}_{++}$ ($\mathbb{S}^{p}_{+}$) denotes the sub-class of symmetric positive definite (semi-definite) matrices. For a matrix $A$, span($A$) denotes the subspace spanned by the columns of $A$. A basis matrix for a subspace $\mathcal{S}$ is any full column rank matrix $B$ such that span($B$) $= \mathcal{S}$. In the case of sample data $X$, eigen-basis($X$) denotes the orthonormal left eigenvector basis. Denote the data matrix by $X = [x_1, \ldots, x_n]^{\top} \in \mathbb{R}^{n \times p}$, where each sample $x_i$ is assumed to be generated by a $p$-dimensional probability distribution. In the case of supervised dimension reduction, denote the response vector by $y \in \mathbb{R}^{n}$ (quantitative response) or $y \in \{1, \ldots, C\}^{n}$ (categorical response with $C$ categories). The data and the response are assumed to have a joint distribution.

2.2 Computational considerations

The main computational tool we use is a randomized algorithm for approximate eigendecomposition, which factorizes an $n \times p$ matrix of rank $d$ in roughly $O(npd)$ time rather than the $O(np\,\min(n, p))$ time required by approaches that do not take advantage of the special structure in the input. This is particularly relevant to statistical applications in which the data is high dimensional but reflects a highly constrained process, e.g., from biology or finance applications, which suggests it has low intrinsic dimensionality, i.e., $d \ll \min(n, p)$. An appealing characteristic of our randomized algorithm is the explicit control of the trade-off between estimation accuracy relative to the exact estimates and computational efficiency. Rapid convergence to the exact estimates was shown both empirically and in theory in (Rokhlin et al., 2009).

2.3 Statistical considerations

A central concept in this paper is that the randomized approximation algorithms that we use for statistical inference implicitly impose regularization constraints. This idea is best understood by putting the randomized algorithm in the context of a statistical model. The numerical analysis and theoretical computer science perspective is deterministic and focuses on optimizing the discrepancy between the randomized and the exact solution, which is estimated and evaluated on a fixed data set. However, in many practical applications, the data are noisy random samples from a population. Hence, when the interest is in dimension reduction, the relevant error comparison is between the true subspace that captures information about the population and the corresponding algorithmic estimates. The true subspace is typically unknown, so we validate the estimated subspace by quantifying its out-of-sample generalization performance. A key parameter of the randomized estimators, described in detail in Section 2.4, is the number of power iterations used to estimate the span of the data. Increasing values of this parameter provide intermediate solutions to the factorization problem, which converge to the exact answer (Rokhlin et al., 2009). Fewer iterations correspond to reduced run time but also to larger deviation from the sample estimates and hence stronger regularization. We show in Section 3 the regularization effect of randomization on both simulated and real data sets. We argue that, in certain contexts, fewer iterations may be justified by both computational and statistical considerations.

2.4 Adaptive randomized low-rank approximation

In this section we provide a brief description of a randomized estimator for the best low-rank matrix approximation, introduced by (Rokhlin et al., 2009; Halko et al., 2011), which combines random projections with numerically stable matrix factorization. We consider this numerical framework as implementing a computationally efficient shrinkage estimator for the subspace capturing the largest variance directions in the data, particularly appropriate when applied to an input matrix of (approximately) low rank. A detailed discussion of the estimation accuracy of Randomized SVD in the absence of noise is provided in (Rokhlin et al., 2009). The idea of random projection was first developed as a proof technique to study the distortion induced by low dimensional embedding of high-dimensional vectors in the work of (Johnson and Lindenstrauss, 1984), with much subsequent literature simplifying and sharpening the results (Frankl and Maehara, 1987; Indyk and Motwani, 1998; Dasgupta and Gupta, 2003; Achlioptas, 2001). More recently, the theoretical computer science and numerical analysis communities discovered that random projections can be used for efficient approximation algorithms in a variety of applications (Drineas et al., 2006; Sarlos, 2006; Liberty et al., 2007; Boutsidis et al., 2009; Rokhlin et al., 2009; Halko et al., 2011). We focus on one such approach, proposed by (Rokhlin et al., 2009; Halko et al., 2011), which targets the accurate low-rank approximation of a given large data matrix $X$. In particular, we extend the randomization methodology to the noise setting, in which the estimation error is due both to the approximation of the low-rank structure in $X$ and to the added noise. A simple working model capturing this scenario is $X = X_0 + E$, where $X_0$ with rank$(X_0) = d \ll \min(n, p)$ captures the low dimensional signal, while $E$ is independent additive noise.

Algorithm.

Given an upper bound $d$ on the target rank and an upper bound $t$ on the number of necessary power iterations (a small number is sufficient in most cases), the algorithm proceeds in two stages: (1) estimate a basis for the range of $X$, (2) project the data onto this basis and apply SVD:

Algorithm: Adaptive Randomized SVD($X$, $d$, $t$, $l$)

  1. Find an orthonormal basis for the range of $X$;

    1. Set the number of working directions: $k = d + l$;

    2. Generate a random projection matrix $\Omega \in \mathbb{R}^{p \times k}$ with $\Omega_{ij} \sim \mathcal{N}(0, 1)$;

    3. Construct blocks: $Y_j = (XX^{\top})^{j} X \Omega$ for $j = 1, \ldots, t$;

    4. Select the optimal block $Y_{\hat t}$ and rank estimate $\hat d$, using a stability criterion and Bi-Cross-Validation (see Section 2.4.1);

    5. Estimate a basis for the selected block using SVD: $Q = $ eigen-basis($Y_{\hat t}$);

  2. Project the data onto the range basis and compute the SVD;

    1. Project onto the basis: $B = Q^{\top} X$;

    2. Factorize: $B = \tilde{U} \Sigma V^{\top}$, where $\tilde{U}$, $\Sigma$, and $V$ are the SVD factors of the projected data;

    3. Rank-$\hat d$ approximation:


      $X \approx (Q \tilde{U}_{\hat d}) \, \Sigma_{\hat d} V_{\hat d}^{\top}$;

In stage (1) we set the number of working directions $k$ to be the sum of the upper bound $d$ on the rank of the data and a small oversampling parameter $l$, which ensures a more stable approximation of the top sample variance directions; the estimator tends to be robust to changes in $l$, so a small default value suffices. In step (1.iii) the random projection matrix is applied to powers of $XX^{\top}$ to randomly sample linear combinations of eigenvectors of the data weighted by powers of the eigenvalues:

$$Y_j = (XX^{\top})^{j} X \Omega = U S^{2j+1} V^{\top} \Omega.$$

The purpose of the power iterations is to increase the decay of the noise portion of the eigen-spectrum while leaving the eigenvectors unchanged. This is effectively a form of regularization, as we are shrinking eigenvalues. Note that each column of $Y_j$ corresponds to a draw from an $n$-dimensional Gaussian distribution $\mathcal{N}\!\big(0, (XX^{\top})^{2j+1}\big)$, with the covariance structure more strongly biased towards the higher directions of variation as $j$ increases. This type of regularization is analogous to the local shrinkage term developed in prior work (Polson and Scott, 2010). In step (1.iv), we select an optimal block $Y_{\hat t}$ and estimate an orthonormal basis for its column space. In previous work (Rokhlin et al., 2009), the authors assumed a fixed target rank and approximated $X$, rather than the low-rank signal $X_0$. They showed that the optimal strategy for this purpose is to use the largest available power, $\hat t = t$, which typically achieves excellent rank-$d$ approximation accuracy for $X$, even for relatively small values of $t$. In this work we focus on the noisy case, where $X = X_0 + E$, and propose to adaptively set both $d$ and $t$, aiming to optimize the generalization performance of the randomized estimator. The estimation strategy for $d$ and $t$ is described in detail in Section 2.4.1.

In stage (2) we rotate the orthogonal basis computed in stage (1) to the canonical eigenvector basis and scale according to the corresponding eigenvalues. In step (2.i) the data is projected onto the low dimensional orthogonal basis $Q$. Step (2.ii) computes the exact SVD in the projected space.
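As a concrete illustration of the two-stage procedure, the following NumPy sketch implements a fixed-rank randomized SVD with normalized power iterations. It omits the adaptive selection of $d$ and $t$ described in Section 2.4.1, and the function name, oversampling default, and seed handling are illustrative choices rather than part of the algorithm specification.

```python
import numpy as np

def randomized_svd(X, rank, t=2, oversample=10, seed=0):
    """Minimal sketch of a randomized SVD with normalized power iterations.

    X          : (n, p) data matrix with (approximately) low-rank structure.
    rank       : target rank (an upper bound on the signal rank).
    t          : number of power iterations; larger t shrinks the noisy part
                 of the spectrum more aggressively.
    oversample : small oversampling parameter for numerical stability.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = min(rank + oversample, min(n, p))        # number of working directions

    # Stage 1: estimate an orthonormal basis Q for the range of X.
    Omega = rng.standard_normal((p, k))          # random projection matrix
    Q, _ = np.linalg.qr(X @ Omega)
    for _ in range(t):                           # normalized power iterations
        Q, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Q)

    # Stage 2: project onto the basis and take the exact SVD of the small matrix.
    B = Q.T @ X                                  # (k, p) projected data
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                                   # rotate back to the canonical basis
    return U[:, :rank], S[:rank], Vt[:rank, :]

# Usage: approximate the top 5 singular triplets of a noisy low-rank matrix.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 2000)) \
    + 0.1 * rng.standard_normal((500, 2000))
U, S, Vt = randomized_svd(X, rank=5, t=2)
```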

Computational complexity.

The computational complexity of the randomization step is $O(npk)$ per power iteration, and the factorizations in the lower dimensional space have complexity $O\big((n+p)k^{2}\big)$. With $k$ small relative to $n$ and $p$, the run time in both steps is dominated by multiplication by the data matrix; in the case of sparse data, fast multiplication can further reduce the run time. We use a normalized version of the above algorithm that has the same run time complexity but is numerically more stable (Martinsson et al., 2010).

2.4.1 Adaptive method to estimate $d$ and $t$

We propose to use ideas of stability under random projections in combination with cross-validation to estimate the intrinsic dimensionality $d$ of the dimension reduction subspace as well as the optimal value of the eigenvalue shrinkage parameter $t$.

Stability-based estimation of $d$:

First, we assume the regularization parameter $t$ is fixed and describe the estimation of the rank parameter $d$, using a stability criterion based on random projections of the data. We start with a rough upper-bound estimate $d_{\max}$ for the dimension parameter. We then apply a small number of independent Gaussian random projections $\Omega_1, \ldots, \Omega_m$. Given the projections, we compute from each an estimate of the eigenvector basis of the column space of the projected data. We then denoise each estimate by raising all the eigenvalues to the power implied by the $t$ power iterations.

The $k$-th left singular vector estimate ($k = 1, \ldots, d_{\max}$) is assigned a stability score $s_k$ given by the Spearman rank correlation between its estimates obtained from different random projections. Eigenvector directions that are not dominated by independent noise are expected to have higher stability scores. When the data has approximately low rank, we expect a sharp transition in the eigenvector stability between the directions corresponding to signal and to noise. In order to estimate this change point, we apply a non-parametric location shift test (Wilcoxon rank-sum) to each of the stability score partitions of eigenvectors with larger versus smaller eigenvalues. The subset of principal eigenvectors that can be stably estimated from the data for the given value of $t$ is determined by the change point with the smallest p-value among all non-parametric tests:

$$\hat{d}(t) = \arg\min_{k} \; \text{p-value}(k, t),$$

where p-value$(k,t)$ is the p-value from the Wilcoxon rank-sum test applied to the scores $\{s_1, \ldots, s_k\}$ and $\{s_{k+1}, \ldots, s_{d_{\max}}\}$.
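A minimal sketch of this stability criterion is given below, assuming the score for each direction is the average absolute Spearman correlation of its estimates across pairs of projections (the exact form of the score is not fully specified above); the number of projections and the change-point search are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr, ranksums

def stability_rank_estimate(X, d_max, t=1, n_proj=5, seed=0):
    """Sketch of stability-based rank selection under random projections."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    bases = []
    for _ in range(n_proj):
        Omega = rng.standard_normal((p, d_max))
        # Power-iterated sketch of the column space, as in step (1.iii).
        Y = np.linalg.matrix_power(X @ X.T, t) @ (X @ Omega)
        Q, _ = np.linalg.qr(Y)
        U, _, _ = np.linalg.svd(Q.T @ X, full_matrices=False)
        bases.append(Q @ U[:, :d_max])        # left singular vector estimates

    # Stability score: mean |Spearman rho| of the k-th direction across pairs.
    scores = np.zeros(d_max)
    for k in range(d_max):
        rhos = []
        for i in range(n_proj):
            for j in range(i + 1, n_proj):
                rho, _ = spearmanr(bases[i][:, k], bases[j][:, k])
                rhos.append(abs(rho))          # abs() handles sign flips
        scores[k] = np.mean(rhos)

    # Change point: the split with the smallest Wilcoxon rank-sum p-value.
    pvals = [ranksums(scores[:k], scores[k:]).pvalue for k in range(1, d_max)]
    return int(np.argmin(pvals) + 1), scores
```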

Estimation of $t$:

In this section we describe a procedure for selecting an optimal value for $t$ based on the reconstruction accuracy under Bi-Cross-Validation for the SVD (Owen and Perry, 2009), using the generalized Gabriel holdout pattern (Gabriel, 2002). The rows and columns of the input matrix are randomly partitioned into row and column groups, resulting in a grid of sub-matrices with non-overlapping entries. We apply Adaptive Randomized SVD to factorize the training data from each combination of held-in row blocks and column blocks. In each case the submatrix block with the omitted rows and columns is approximated using its modified Schur complement in the training data. The cross-validation error for each data split corresponds to the Frobenius norm of the difference between the held-out submatrix and its training-data-based estimate. For each sub-matrix approximation we estimate the rank using the stability-based approach from the previous section. As suggested in previous work (Owen and Perry, 2009), we partition into two row groups and two column groups, in which case the Bi-Cross-Validation error corresponding to holding out the top left block $A$ of a matrix partitioned as $\begin{pmatrix} A & B \\ C & D \end{pmatrix}$ becomes $\lVert A - B \hat{D}^{+} C \rVert_F$. Here $\hat{D}^{+}$ is the Moore-Penrose pseudoinverse of the low-rank approximation of the training block $D$. For a fixed value of $t$ we estimate $\hat d(t)$ and factorize $D$ using Adaptive Randomized SVD, and denote the resulting Bi-Cross-Validation error by BCV$(t)$. The same process is repeated for the other three blocks $B$, $C$, and $D$. The final error and rank estimates are defined to be the medians across all blocks. We optimize over the full range of allowable values for $t$ to arrive at the final estimates $\hat t$ and $\hat d = \hat d(\hat t)$.
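The following sketch illustrates one fold of the 2 x 2 Gabriel-style holdout, reusing the `randomized_svd` sketch above to fit the training block; an exact truncated SVD of the training block would serve equally well for illustration, and the partition and grid of candidate values are hypothetical.

```python
import numpy as np

def bcv_error(X, d, t, seed=0):
    """One fold of SVD bi-cross-validation with a 2 x 2 Gabriel partition
    [[A, B], [C, D]]: hold out A and approximate it by B @ pinv(D_d) @ C,
    where D_d is a rank-d fit of D from the randomized_svd sketch above."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rows, cols = rng.permutation(n), rng.permutation(p)
    r1, c1 = rows[: n // 2], cols[: p // 2]        # held-out block indices
    r2, c2 = rows[n // 2:], cols[p // 2:]
    A = X[np.ix_(r1, c1)]
    B = X[np.ix_(r1, c2)]
    C = X[np.ix_(r2, c1)]
    D = X[np.ix_(r2, c2)]
    U, S, Vt = randomized_svd(D, rank=d, t=t)      # low-rank fit of the training block
    D_pinv = Vt.T @ np.diag(1.0 / S) @ U.T         # pseudoinverse of the rank-d fit
    return np.linalg.norm(A - B @ D_pinv @ C, "fro")

# Choose t by minimizing the median BCV error over a grid (d fixed here for clarity).
# errors = {t: np.median([bcv_error(X, d=5, t=t, seed=s) for s in range(4)])
#           for t in range(1, 6)}
```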

2.5 Fast linear mixed models

Multivariate linear mixed models (LMMs) have been a workhorse in statistical genetics and quantitative genetics because they allow for the regression of outcome variables on explanatory variables while modeling relatedness between samples (Henderson, 1984; Price et al., 2011; Krote et al., 2012). In the context of mapping traits, LMMs are used to correct for unobserved sources of statistical confounding in genetic association studies, particularly the presence of population structure amongst samples.

The linear mixed models in this paper take the form

$$y = X\beta + Zu + \epsilon,$$

where $y$ is the response, $X$ are the covariates, $\beta$ are the fixed effects, $Z$ is the random effects design matrix, and $u$ and $\epsilon$ capture unobserved covariates and noise, with $Zu \sim \mathcal{N}(0, \sigma^2_g K)$ and $\epsilon \sim \mathcal{N}(0, \sigma^2_e I)$. Here, $K$ is the estimated kinship or genetic relatedness matrix, and $\sigma^2_g / (\sigma^2_g + \sigma^2_e)$ is the proportion of variance in the phenotypes explained by genetic factors. In the standard application to genetic association studies, the response is the quantitative trait across $n$ individuals, and the tested covariate is the vector of genotypes of those individuals at each of many genomic locations. The statistical test then is to determine whether the coefficient for a particular genetic variant, $\beta_j$, is equal to zero, indicating no genetic association, or alternatively whether $\beta_j \neq 0$, indicating support for a possible genetic association at this genetic locus with the trait. The likelihood for this model follows from a multivariate normal distribution:

$$y \sim \mathcal{N}\!\big(X\beta, \; \sigma^2_g K + \sigma^2_e I\big).$$

The standard procedure to correct for sample (population) structure using an LMM proceeds in three main steps: (1) empirical estimation (construction) of the genetic relatedness matrix (GRM) amongst individuals based on held-out genetic data; (2) estimation of variance components to account for sample structure contributing to response; and (3) computation of an association statistic at each genotype position.
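To make these steps concrete, the sketch below implements an EMMAX-style test given a precomputed GRM: the variance ratio is estimated once on the null model by a simple grid search over a maximum-likelihood objective (the actual software uses REML and dedicated optimizers), and each variant is then tested by generalized least squares in the eigenbasis of the GRM. Function and variable names, the intercept-only null model, and the grid of variance ratios are illustrative.

```python
import numpy as np
from scipy import stats

def lmm_assoc(y, G, K, deltas=np.logspace(-3, 3, 25)):
    """Sketch of EMMAX-style association testing with a known GRM K.

    y : (n,) phenotype; G : (n, p) genotypes to test; K : (n, n) GRM.
    delta = sigma_e^2 / sigma_g^2 is estimated once on the null model,
    then each variant is tested by GLS in the rotated (whitened) space.
    """
    n = len(y)
    d_eig, U = np.linalg.eigh(K)                   # step 1: spectral decomposition of K
    yr = U.T @ y                                   # rotate the phenotype
    ones = U.T @ np.ones(n)                        # rotated intercept column

    def neg_loglik(delta):                         # profiled ML objective (up to constants)
        w = d_eig + delta                          # eigenvalues of K + delta * I
        beta0 = (ones / w) @ yr / ((ones / w) @ ones)
        r = yr - ones * beta0
        s2 = (r**2 / w).sum() / n
        return 0.5 * (n * np.log(s2) + np.log(w).sum())

    delta = min(deltas, key=neg_loglik)            # step 2: variance components (grid search)
    w = d_eig + delta
    pvals = []
    for j in range(G.shape[1]):                    # step 3: per-variant GLS test
        Xr = np.column_stack([ones, U.T @ G[:, j]]) / np.sqrt(w)[:, None]
        yrw = yr / np.sqrt(w)
        beta = np.linalg.lstsq(Xr, yrw, rcond=None)[0]
        sigma2 = ((yrw - Xr @ beta) ** 2).sum() / (n - 2)
        se = np.sqrt(sigma2 * np.linalg.inv(Xr.T @ Xr)[1, 1])
        pvals.append(2 * stats.t.sf(abs(beta[1] / se), df=n - 2))
    return np.array(pvals), delta
```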

Many methods have focused on reducing the complexity of one or more of these steps to scale LMMs to current study sizes. The EMMA method (Kang et al., 2008) uses a spectral decomposition to jointly estimate the likelihood of the response and random effects together instead of obtaining estimates of the random effects a priori, which would incur a greater computational cost. EMMAX (Kang et al., 2010) improves upon EMMA by estimating variance components only once (independent of the number of genetic variants tested), making the assumption that most variants have a small effect on the phenotype (the pylmm software used here, along with our adaptation, is based on EMMAX). FaST-LMM (Lippert et al., 2011), in addition to using a heuristic low-rank spectral decomposition to construct an orthogonal representation of the GRM, uses rotations of the genetic variants and response to perform computation of the likelihood function in linear time. Further methods (Lippert et al., 2013) use more systematic criteria for subsetting the genetic variants relevant to the phenotype of interest in order to produce accurate variance component estimates in an LMM.

Our contribution to accelerating parameter estimation in this model is to use ARSVD to address the complexity of decomposing the design matrix $Z$. In general, the GRM is computed as a sample-by-sample covariance of the design matrix ($K \propto ZZ^{\top}$), which has a computational complexity of $O(n^2 p)$ to build, where $p$ typically dominates $n$ in genomic studies. Decomposing this matrix into an exact orthogonal representation using an eigendecomposition has complexity $O(n^3)$ in the worst case. We forgo explicit computation of the GRM, and instead perform ARSVD on the original design matrix, with a computational complexity of $O(npd)$. By using ARSVD to decompose the design matrix, we automatically detect low-rank structure present in the GRM, and we avoid the need to manually or heuristically subset the data to achieve a low-rank representation. We note in the simulated data and in the genomic study results that this not only produces a substantial acceleration in computation, but the implicit regularization of the design matrix from our approach also improves the type I error substantially.
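The sketch below illustrates this substitution: the top eigenvectors and eigenvalues of the GRM are read off a randomized SVD of the standardized genotype matrix, without ever forming the $n \times n$ GRM. The standardization and scaling conventions are illustrative, and it reuses the `randomized_svd` sketch from Section 2.4.

```python
import numpy as np

def grm_spectrum_from_genotypes(Z, rank, t=2):
    """Top eigenvectors/eigenvalues of K ~ Z Z^T / p from an SVD of the
    (n, p) genotype matrix Z, without forming the n x n GRM."""
    Zs = (Z - Z.mean(axis=0)) / (Z.std(axis=0) + 1e-12)   # standardize variants
    U, S, _ = randomized_svd(Zs, rank=rank, t=t)           # sketch defined in Section 2.4
    eigvals = S**2 / Z.shape[1]                            # eigenvalues of Zs Zs^T / p
    return U, eigvals                                      # K is approx. U diag(eigvals) U^T

# The (U, eigvals) pair can replace the full eigendecomposition of K in the
# lmm_assoc sketch above, with the remaining n - rank eigenvalues set to zero.
```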

3 Results on simulated data

We use real and simulated data to demonstrate three major contributions of this work:

  1. In the presence of interesting low-rank structure in the data, the randomized algorithms tend to be much faster than the exact methods with minimal loss in approximation accuracy.

  2. The rank and the subspace containing the information in the data can be reliably estimated and used to provide efficient solutions to dimension reduction methods based on the truncated (generalized) eigendecomposition formulation.

  3. The randomized algorithms implicitly impose regularization, which can be adaptively controlled in a computationally efficient manner to produce improved out-of-sample performance.

3.1 Unsupervised dimension reduction

We begin with unsupervised dimension reduction of data with low-rank structure contaminated with Gaussian noise, and we focus on evaluating the application of Adaptive Randomized SVD for PCA (see Section 2.4). In particular, we demonstrate that the proposed method estimates the sample singular values with exponentially decreasing relative error in $t$. Then we show that achieving similar low-rank approximation accuracy to a state-of-the-art Lanczos method requires the same run time complexity, which scales linearly in both dimensions of the input matrix. This makes our proposed method applicable to large data matrices. Lastly, we demonstrate the ability to adaptively estimate the underlying rank of the data, given a coarse upper bound. For the evaluation of the randomized methods, based on all our simulated and real data, we use a fixed default value for the oversampling parameter $l$.

Simulation setup

The data matrix $X \in \mathbb{R}^{n \times p}$ is generated as $X = U S V^{\top} + E$, where $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{p \times d}$. The columns of $U$ and $V$ are drawn uniformly at random from the corresponding unit sphere, and the singular values in $S$ are randomly generated starting from a baseline value, which is a fraction of the maximum noise singular value, with exponential increments separating consecutive entries. The noise $E$ is iid Gaussian. The sample data matrix has an SVD decomposition with singular values indexed in decreasing order. The signal-to-noise relationship is controlled by the baseline fraction, large values of which correspond to increased separation between the signal and the noise.
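A sketch of this generative process is given below; the baseline fraction, exponential increment schedule, and noise-edge approximation are illustrative placeholders, since the exact values used in the simulations are not reproduced here.

```python
import numpy as np

def simulate_lowrank(n, p, d, base=0.5, rate=0.1, noise=1.0, seed=0):
    """Low-rank-plus-noise simulation: unit-norm factors, singular values
    growing exponentially from a baseline tied to the noise level, plus
    iid Gaussian noise.  `base` and `rate` are illustrative parameters."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n, d))
    U /= np.linalg.norm(U, axis=0)                   # columns on the unit sphere
    V = rng.standard_normal((p, d))
    V /= np.linalg.norm(V, axis=0)
    sigma_max_noise = noise * (np.sqrt(n) + np.sqrt(p))   # rough noise spectral edge
    s = base * sigma_max_noise * np.exp(rate * np.arange(d))
    E = noise * rng.standard_normal((n, p))
    return U @ np.diag(s) @ V.T + E
```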

Results

In our first simulation we assume the rank to be given and fixed. The main focus is on the effect of the regularization parameter $t$ controlling the singular value shrinkage, with larger values corresponding to a stronger preference for the higher versus the lower variance directions. Studying the estimates of the percent relative error of the singular values averaged over ten random data sets, we observed exponential convergence to the sample estimates with increasing $t$ (Table 1). This suggests that the variability in the sample data directions is approximated well for large data sets at the cost of a few data matrix multiplications.

t 1 2 3 4 5
% relative error [mean]
Table 1: Singular value estimation using Adaptive Randomized SVD. We report the mean relative error (standard error) averaged across ten random replicates with rank 50. A linear increase in the regularization parameter $t$ results in an exponential decrease in the percent relative error.

In addition to the small relative error compared with exact approaches, we further investigate the source of the remaining error (Figure 1). Data were generated as above with a fixed true rank. We note that our method is extremely precise for the largest singular values but shows a slightly faster decay of the singular values compared to exact methods, so that less overall variance is captured in the directions of the data that are more influenced by noise. Our method has larger standard error and tends to slightly underestimate the smallest singular values, which we interpret as providing regularization of the noise directions, as further explored in Section 3.2.

Figure 1: Singular value accuracy of ARSVD. Simulation results from ten random data sets. Singular values obtained from exact SVD are shown as point estimates in red. Singular value confidence intervals are twice the standard error across the ten ARSVD estimates.

Similar ideas have been proposed before in the absence of randomization. Perhaps the most well-established and computationally efficient methods are based on Lanczos Krylov subspace estimation, which also operate on the data matrix only through matrix multiplies (Saad, 1992; Lehoucq et al., 1998; Stewart, 2001; Baglama and Reichel, 2006). These methods are also iterative in nature and tend to converge quickly, with run time typically scaling as the product of the data dimensions and the target rank, with the rank small.

In order to further investigate the practical run time behavior of Adaptive Randomized SVD we used one such state-of-the-art low-rank approximation algorithm, Blocked Lanczos (Baglama and Reichel, 2006). Here we use the same simulation setting as before with the rank fixed, but we vary the regularization parameter $t$ to achieve comparable percent relative low-rank Frobenius norm reconstruction error (to 1 d.p.). In Table 2 we report the ratio of the run times of the two approaches based on ten random data sets. Notice that the relative run time remains approximately constant with simultaneous increase in both data dimensions, which suggests a similar order of complexity for both methods. In the presence of low-rank structure, Adaptive Randomized SVD is feasible for larger data problems where full-factorization methods (e.g., (Anderson et al., 1990)) are not; moreover, Adaptive Randomized SVD achieves good approximation accuracy for the top sample singular values and singular vectors.

n + p 6,000 7,500 9,000 10,500 12,000
relative time
Table 2: Run time ratio between Adaptive Randomized SVD and Lanczos. Results are across different data dimensions $n$ and $p$ with fixed rank. The Lanczos implementation is from the irlba CRAN package (Baglama and Reichel, 2006). The Adaptive Randomized SVD was run until the percent relative low-rank reconstruction error was equal to the Lanczos error to 1 d.p. For each consecutive simulation scenario, $n$ and $p$ are incremented so that $n + p$ grows by 1,500. We report the sample mean and standard error based on ten random replicate data sets.

In many modern applications the data have latent low-rank structure of unknown dimension. In Section 2.4.1, we address the issue of estimating the rank using a stability-based approach. Next, we studied the ability of our randomized method to perform rank estimation to identify the dimensionality in simulated low-rank data. For that purpose, we generated 50 random data sets with varying true rank. We set an initial rank upper bound estimate and used the adaptive method ARSVD (Section 2.4.1) to estimate both the optimal rank $\hat d$ and the corresponding $\hat t$. Figure 2 plots the estimated versus the true rank and the corresponding estimates of the regularization parameter for two different signal-to-noise scenarios. In both scenarios, the rank estimates showed good agreement with the true rank values. When the smallest “signal” direction has variance comparable to the largest-variance “noise” direction, our approach slightly underestimates the largest ranks. This is due to the fact that the few smallest-variance signal directions tend to be difficult to distinguish from the random noise and hence are less stable under random projections. We observe that our approach tends to select small values for $t$, especially when there is a clear separation between the signal and the noise (right panel of Figure 2). This results in a reduced number of matrix multiplications and hence fast computation.

Figure 2: Rank estimation of ARSVD. Simulation results from 50 random data sets with varying true rank. Panels A, C: true rank on the x-axis and estimated rank on the y-axis for two signal-to-noise settings. Panels B, D: estimates of $t$, with the maximum set to ten, for the same two signal-to-noise settings.
Figure 3: CPU time for three matrix decomposition methods with respect to the number of features. Simulation results from pseudo-random data sets with the number of features ranging from 2,000 up to 100,000 (rsvd), 80,000 (eig), and 40,000 (svd). The number of samples is one tenth the number of features. Run time is reported as CPU time (seconds) versus total number of features. Both axes are on a log scale. SVD is performed on the complete matrix. Eigendecomposition is performed on the sample covariance, that is, after multiplying the design matrix by its transpose and decomposing the resulting matrix. RSVD estimates the top 100 principal components.

3.2 Results on simulated data with latent population structure

To simulate a genomic association study with latent population structure, we generate genotypes and phenotypes that are conditionally independent given the true population structure, as described in (Mimno et al., 2014). Each individual’s genotype is admixed, containing a proportion of ancestry from each of several populations. The admixture proportions for each individual are drawn from a Dirichlet distribution. Since we only consider two alleles at each position in the genome, the frequency of each variant in each population is drawn from a uniform distribution over allele frequencies. Latent assignments of each variant copy to a population are generated from the individual's admixture proportions, and each individual's genotype at each variant is then drawn given the assigned population's allele frequency. To generate a phenotype (response) vector for the population, we draw values that depend on the admixture proportions but not on any individual variant.
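The following sketch reproduces the spirit of this null simulation; the Dirichlet concentration, allele frequency range, and phenotype construction are illustrative choices rather than the exact settings used in our experiments.

```python
import numpy as np

def simulate_admixed_study(n, p, n_pops=3, alpha=0.5, seed=0):
    """Null simulation with latent population structure: Dirichlet admixture
    proportions, uniform population allele frequencies, genotypes drawn per
    variant given ancestry, and a phenotype driven by ancestry alone."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(alpha * np.ones(n_pops), size=n)   # (n, n_pops) admixture
    freqs = rng.uniform(0.05, 0.95, size=(n_pops, p))        # per-population frequencies
    indiv_freq = theta @ freqs                                # expected frequency per person/variant
    genotypes = rng.binomial(2, indiv_freq)                   # diploid genotype counts
    phenotype = theta @ rng.standard_normal(n_pops) \
        + rng.standard_normal(n)                              # structure-driven trait, no true associations
    return genotypes, phenotype, theta
```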

Results are generated for four distinct scenarios representing data with differing structure. The simulations presented here have no true associations between genotypes (features) and phenotypes (response), but the two are dependent conditional on latent population structure. This situation mimics the commonly observed confounding effect in genomic data, and motivates the need for LMMs and regularization in genome-wide studies. Given that the simulations are entirely under the null hypothesis of no association between genotype and phenotype, it is expected that p-values follow a uniform distribution; we consider any result passing a given significance threshold a false positive, which under the null occurs at a rate equal to that threshold.

In each simulation, the p-value distribution from an LMM association without ARSVD is reported at the full data matrix rank (the largest rank on the x-axis). The top left simulation represents a situation with few samples (genetic variants) relative to the number of phenotype values. Our method performs some slight regularization beyond that of a standard LMM at ranks close to the full data rank (Figure 4). The next simulation (top right) is the most common case in genomics, with a large set of features compared to responses. Our method performs as well as current methods but at a fraction of the computational cost, as illustrated by the lowest-rank estimate that still yields uniform p-values and a low false positive rate. In the bottom left panel we generate a case of an over-parametrized model, for which our method produces an inflated number of false positives compared to a standard LMM; however, we note that if we know a priori that our model is over-parametrized, we can modify our decomposition to operate in the sample space instead of the feature space and achieve results comparable to a standard LMM. In the last setting (bottom right), our method again performs similarly to the standard method but with gains in computational efficiency. In general, we expect the tradeoff between computational efficiency and accuracy to depend on the data analyzed. In the case of structured genomic data, we report in these simulations and in Section 4 that massive computational savings are accompanied by numerical accuracy.

Figure 4: Regularization simulations. The x-axis is the latent rank used to project the original matrix into a lower subspace. The y-axis is the p-values across tests for association. Each box plot is colored by the fraction of false positives in the association results for a given rank. The four panels correspond to the four sample-size and feature-size settings described in the text.

4 Applications to large scale genomic data

4.1 Results on WTCCC association mapping

We test our approach that incorporates ARSVD in LMM association studies on the Wellcome Trust Case Control Consortium (WTCCC) data (Consortium, 2007), which includes 4,684 individuals (half case individuals with Crohn’s disease and half control individuals without Crohn’s disease) and 478,765 genetic variants across the 22 autosomal chromosomes in the genome. We compare our method, ARSVD incorporated into pylmm (the next-generation LMM software following EMMAX), to GEMMA. Performing ARSVD on the whole genome took 82.2 secs, while the traditional eigendecomposition of the covariance matrix in pylmm took 88 mins 23.9 sec. In order to most accurately calibrate the test statistic computed in an LMM and achieve maximal statistical power, it is suggested that a covariance matrix be constructed once per chromosome (22 in total), holding out the test chromosome and concatenating the remaining chromosomes (Yang et al., 2014). Our method performs the 22 decompositions in a total of 5 mins 4.8 secs, while the traditional method performs the same decompositions in 4 hrs 24 mins.

GEMMA is executed with the -lmm option to most closely resemble the analysis that pylmm performs. Despite having discrete phenotypes (disease status {0,1}), we note that estimating coefficients (effect sizes) with linear regression and with logistic regression has nearly indistinguishable impacts on the computed association statistics (Zhou et al., 2013). Estimates of the effect sizes from pylmm and GEMMA are highly correlated, yet GEMMA shows no enrichment for significant p-values. On the other hand, the distribution of p-values estimated from pylmm is closer to what we expect from an analysis with true associations, showing that our method, while providing some regularization, does not impact enrichment of true signal.

Figure 5: Genome-wide association study results for pylmm and GEMMA. Panel A: per-variant results for every analyzed variant in the genome, GEMMA versus pylmm. Panel B: histogram of p-values for each method.

We investigated the most significant genetic associations identified by our method and compared to previous results from a large-scale meta-analysis of Crohn’s disease. This meta-analysis had a much larger sample size than the WTCCC: it included 6,333 affected individuals (cases) and 15,056 controls and followed up the top association signals in 15,694 cases, 14,026 controls, and 414 parent-offspring trios. Table 3 reports the curated list of associated genetic variants reported in this meta-analysis through various sources (Listgarten et al., 2012). This includes genetic variants identified by meta-analysis (Franke et al., 2010), the original WTCCC study (Consortium, 2007), or those that were confirmed more recently via several methods and known to correspond to a region associated with Crohn’s disease risk (Listgarten et al., 2012). For each analysis type, we report the number of variants found significant using this analysis, the subset of those variants that lie within the top 0.5% of our results, and the subset of variants that reach significance at a local false discovery rate (lfdr) of 5% (Strimmer, 2008).

Furthermore, we identify several genetic variants associated with Crohn’s that were previously unidentified. In particular, there are a few genetic variants within the 3.6MB region that defines the MHC on chromosome 6 in the human genome (sequencing consortium, 1999) at an lfdr threshold of 10%. Notably, genetic variant rs9269186 passes below an lfdr threshold of 1%, and lies within 5KB of the start of the HLA-DRB5 gene, putatively acting to regulate transcription of this protein-coding gene that produces a membrane-bound class II molecule. The HLA-DRB5 protein is an antigen that has an important role in the human immune system, and thus may play a role in an autoimmune disorder such as Crohn’s disease.

source significant variants top 0.5% lfdr 5%
All 151 61 35
MA + WTCCC 93 48 30
MA 81 47 29
Table 3: Enrichment of genetic associations found in our results among related Crohn’s disease studies. Number of genetic variants associated with Crohn’s disease risk identified by previous studies that are present in the top 0.5% of the results obtained from our method. MA = meta-analysis. MHC = major histocompatibility complex, a region previously identified as associated with Crohn’s and autoimmune diseases more generally. WTCCC = original analysis performed on the WTCCC data. All = MA + MHC + WTCCC.

5 Discussion

Massive high dimensional data sets are ubiquitous in modern data analysis settings. In this paper we address the problem of tractably and accurately performing eigendecompositions when analyzing such data. The main computational tool we use is based on recent randomized algorithms developed by the numerical analysis community. Taking into account the presence of noise in the data, we provide an adaptive method for estimating both the rank and the number of Krylov iterations for the randomized approximate version of SVD. Using this adaptive estimator of low-rank structure, we implement efficient algorithms for PCA and linear mixed models. Perhaps the most interesting statistical observation is related to the regularization that randomization implicitly imposes on the resulting eigendecomposition estimates.

In simulated experiments we show high accuracy in recovering the true (latent) rank of matrices with low-rank substructure, without the need for many Krylov iterations. Additionally, we show in simulations that our method performs implicit regularization and improves quantitative properties of results under various settings of data structure. Furthermore, our results on modern genome-wide association studies show that our approach to using ARSVD in linear mixed models fills a critical need for methods with computational efficiency that do not sacrifice the desirable statistical properties of these tests, but instead have the potential to improve upon these properties. Some important open questions still remain:

  1. There is a need for a theoretical framework to quantify what generalization guarantees the randomization algorithm can provide on out-of-sample data, and the dependence of this bound on the noise and the structure in the data on one hand and on the parameter settings on the other.

  2. A probabilistic interpretation of the algorithm could contribute additional insights into the practical utility of the proposed approach under different assumptions. In particular it would be interesting to relate our work to a Bayesian model with posterior modes that correspond to the subspaces estimated by the randomized approach.

  3. The implicit regularization on latent factors imposed by ARSVD should be further explored with respect to the structure of the noise, and its impact on estimates of random effects in LMMs (Runcie and Mukherjee, 2013).

SM would like to acknowledge Lek-Heng Lim, Michael Mahoney, Qiang Wu, and Ankan Saha. SM is pleased to acknowledge support from grants NIH (Systems Biology): 5P50-GM081883, AFOSR: FA9550-10-1-0436, NSF CCF-1049290, and NSF-DMS-1209155. BEE is pleased to acknowledge support from grants NIH R00 HG006265 and NIH R01 MH101822. SG would like to acknowledge Uwe Ohler, Jonathan Pritchard and Ankan Saha.

Appendix A

A.1 Generalized eigendecomposition and dimension reduction

This appendix shows that a variety of dimension reduction methods (supervised, unsupervised, and nonlinear) can use the ARSVD engine to scale to massive data. The key requirement is a formulation as a truncated generalized eigendecomposition problem that can be solved by the Adaptive Randomized SVD from Section 2.4. The dimension reduction methods we focus on are sliced inverse regression (SIR) and localized sliced inverse regression (LSIR).

A.1.1 Problem formulation.

Assume we are given matrices $A, B \in \mathbb{R}^{p \times p}$ that characterize pairwise relationships in the data, and let $d$ be the "intrinsic dimensionality" of the information contained in the data. In the case of supervised dimension reduction methods this corresponds to the dimensionality of the linear subspace to which the joint distribution of the data and the response assigns non-zero probability mass. Our objective is to find a basis for that subspace. For SIR and LSIR this corresponds to the span of the generalized eigenvectors with largest eigenvalues:

$$A v_k = \lambda_k B v_k, \qquad k = 1, \ldots, d. \tag{1}$$

An important structural constraint we impose on $A$, which is applicable to a variety of high-dimensional data settings, is that it has low rank: rank$(A) = d \ll p$. It is this constraint that we will take advantage of in the randomized methods. In the case of PCA (the unsupervised case), $A$ is the sample covariance and $B$ is the identity.

A.1.2 Sufficient dimension reduction

Dimension reduction is often a first step in the statistical analysis of high-dimensional data and may be followed by data visualization or predictive modeling. If the ultimate goal is the latter, then the statistical quantity of interest is a low dimensional summary $R(X)$ which captures all the predictive information in $X$ relevant to the response $Y$:

$$Y \perp\!\!\!\perp X \mid R(X).$$

Sufficient dimension reduction (SDR) is one popular approach for estimating such summaries (Li, 1991; Cook and Weisberg, 1991; Li, 1992; Li et al., 2005; Nilsson et al., 2007; Sugiyama, 2007; Cook, 2007; Wu et al., 2010). In this appendix we focus on linear SDR, in which $R(X)$ is a linear projection of $X$, providing a prediction-optimal linear reduction of the data.

We will consider two specific dimension reduction methods: Sliced Inverse Regression (SIR) (Li, 1991) and Localized Sliced Inverse Regression (LSIR) (Wu et al., 2010). SIR is effective when the predictive structure in the data is global, i.e., there is a single predictive subspace over the support of the marginal distribution of $X$. In the case of local or manifold predictive structure in the data, LSIR can be used to compute a projection matrix that captures this non-linear (manifold) structure.

A.1.3 Efficient solutions and approximate SVD

SIR and LSIR reduce to solving the truncated generalized eigendecomposition problem in (1). Since we estimate the dimension reduction from sample data, we focus on the sample estimators $\hat A$ and $\hat B$, where $\hat A$ is symmetric and encodes the method-specific grouping of the samples based on the response $y$. In the classic statistical setting, when $n > p$, both $\hat A$ and $\hat B$ are positive definite almost surely. Then a typical solution proceeds by first sphering the data with respect to $\hat B$, e.g., using a Cholesky or SVD representation of $\hat B$. This is followed by an eigendecomposition in the sphered coordinates (Li, 1991; Wu et al., 2010) and a back-transformation of the top eigenvector directions to the canonical basis. The computational cost scales as the cube of $p$. When $p > n$, $\hat A$ and $\hat B$ are rank-deficient and a unique solution to problem (1) does not exist. One widely-used approach, which allows us to make progress in this problematic setting, is to restrict attention to the directions in the data with positive variance. Then we can proceed as before, using an orthogonal projection onto the span of the data; the total computational cost in this case scales as the cube of $n$. In many modern data analysis applications both $n$ and $p$ are very large, and hence cubic complexity could be prohibitive, rendering the above approaches unusable. We propose an approximate solution that explicitly recovers the low-rank structure in $\hat A$ using Adaptive Randomized SVD from Section 2.4. In particular, assume rank$(\hat A) = d$ (where $d$ is the dimensionality of the optimal dimension reduction subspace). Then $\hat A = Q \tilde A Q^{\top}$, where $Q \in \mathbb{R}^{p \times d}$ is a basis matrix for the column space of $\hat A$ and $\tilde A = Q^{\top} \hat A Q$. The solution of the generalized eigendecomposition problem (1) is then restricted to the subspace spanned by the columns of $Q$:

$$\tilde A w_k = \lambda_k \big(Q^{\top} \hat B Q\big) w_k, \qquad k = 1, \ldots, d. \tag{2}$$

The dimension reduction subspace is contained in span$(QW)$, where $W = [w_1, \ldots, w_d]$.
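A minimal sketch of this reduced problem, assuming an orthonormal basis matrix $Q$ (e.g., recovered by the ARSVD sketch in Section 2.4) and a positive definite projected $\hat B$, is given below.

```python
import numpy as np
from scipy.linalg import eigh

def truncated_generalized_eig(A_hat, B_hat, Q):
    """Solve A v = lambda B v restricted to the subspace with orthonormal
    basis Q (p, d); returns directions mapped back to the original p-dim
    coordinates, spanning the reduced problem's solution."""
    A_small = Q.T @ A_hat @ Q                 # project both matrices onto span(Q)
    B_small = Q.T @ B_hat @ Q                 # assumed positive definite
    evals, W = eigh(A_small, B_small)         # dense d x d generalized eigensolver
    order = np.argsort(evals)[::-1]           # largest generalized eigenvalues first
    return evals[order], Q @ W[:, order]      # back-map eigenvectors to R^p
```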


References

  • Achlioptas (2001) D. Achlioptas. Database-friendly random projections. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS ’01, pages 274–281, New York, NY, USA, 2001. ACM. ISBN 1-58113-361-8.
  • Adcock (1878) R.J. Adcock. A problem in least squares. The Analyst, 5:53–54, 1878.
  • Anderson et al. (1990) E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen. Lapack: a portable linear algebra library for high-performance computers. In Proceedings of the 1990 ACM/IEEE conference on Supercomputing, Supercomputing ’90, pages 2–11, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press. ISBN 0-89791-412-0.
  • Baglama and Reichel (2006) J. Baglama and L. Reichel. Restarted block Lanczos bidiagonalization methods. Numerical Algorithms, 43(3):251–272, 2006.
  • Boutsidis et al. (2009) C. Boutsidis, M.W. Mahoney, and P. Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’09, pages 968–977. Society for Industrial and Applied Mathematics, 2009.
  • Consortium (2007) The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678, 2007.
  • Cook (2007) R.D. Cook. Fisher Lecture: Dimension Reduction in Regression. Statistical Science, 22(1):1–26, 2007.
  • Cook and Weisberg (1991) R.D. Cook and S. Weisberg. Discussion of Li (1991). J. Amer. Statist. Assoc., 86:328–332, 1991.
  • Dasgupta and Gupta (2003) S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
  • Drineas et al. (2006) P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo Algorithms for Matrices II: Computing a Low-Rank Approximation to a Matrix. SIAM J. Comput., 36:158–183, July 2006.
  • Edegworth (1884) F.Y. Edegworth. On the reduction of observations. Philosophical Magazine, pages 135–141, 1884.
  • Fisher (1922) R.A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Statistical Society A, 222:309–368, 1922.
  • Fisher (1936) R.A. Fisher. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(2):179–188, 1936.
  • Franke et al. (2010) A. Franke, D.P.B. McGovern, J.C Barrett, K. Wang, G.L Radford-Smith, T. Ahmad, C.W Lees, T. Balschun, J. Lee, R. Roberts, C.A. Anderson, J.C. Bis, S. Bumpstead, D. Ellinghaus, E.M. Festen, M. Georges, T. Green, T. Haritunians, L. Jostins, A. Latiano, C.G. Mathew, G.W Montgomery, N.J. Prescott, S. Raychaudhuri, J.I. Rotter, P. Schumm, Y. Sharma, L.A. Simms, D. Wijmenga C. Taylor, K.D. Whiteman, R.N. Baldassano, M. Barclay, T.M. Bayless, S. Brand, C. Buning, A. Cohen, J-F. Colombel, M. Cottone, L. Stronati, T. Denson, M. De Vos, R. D’Inca, M. Dubinsky, C. Edwards, T. Florin, D. Franchimont, R. Gearry, J. Glas, A. Van Gossum, S.L. Guthery, J. Halfvarson, H.W. Verspaget, A. Hugot, J-P. Karban, D. Laukens, I. Lawrance, M. Lemann, A. Levine, C. Libioulle, E. Louis, C. Mowat, W. Newman, J. Panes, A. Phillips, D.D. Proctor, M. Regueiro, R. Russell, P. Rutgeerts, J. Sanderson, M. Sans, F. Seibold, A.H. Steinhart, P.C.F. Stokkers, L. Torkvist, G. Kullak-Ublick, D. Wilson, T. Walters, S.R. Targan, S.R. Brant, J.D. Rioux, M. D’Amato, R.K. Weersma, S. Kugathasan, A.M. Griffiths, J.C. Mansfield, S. Vermeire, R.H. Duerr, M.S. Silverberg, J. Satsangi, S. Schreiber, J.H. Cho, V. Annese, H. Hakonarson, M.J. Daly, and M. Parkes. Genome-wide meta-analysis increases to 71 the number of confirmed crohn’s disease susceptibility loci. Nature genetics, 42(12):1118–1125, 2010.
  • Frankl and Maehara (1987) P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. J. Comb. Theory Ser. A, 44:355–362, June 1987.
  • Gabriel (2002) K.R. Gabriel. Le biplot - outil d’exploration de données multidimensionnelles. Journal de la société française de statistique, 143(3-4):5–55, 2002.
  • Golub (1969) G.H. Golub. Matrix decompositions and statistical calculations. Technical report, Stanford, CA, USA, 1969.
  • Golub and Van Loan (1996) G.H. Golub and C.F. Van Loan. Matrix computations. Johns Hopkins University Press, Baltimore, MD, 3rd edition, 1996. ISBN 0-8018-5413-X.
  • Golub et al. (2000) G.H. Golub, K. Sløna, and P. Van Dooren. Computing the SVD of a General Matrix Product/Quotient. SIAM J. Matrix Anal. Appl, 22:1–19, 2000.
  • Gu and Eisenstat (1996) M. Gu and S.C. Eisenstat. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J. Sci. Comput., 17(4):848–869, July 1996.
  • Halko et al. (2011) N. Halko, P-G. Martinsson, and J.A. Tropp. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2):217–288, 2011.
  • Henderson (1984) C.R. Henderson. Applications of Linear Models in Animal Breeding. University of Guelph, 1984.
  • Hotelling (1933) H. Hotelling. Analysis of a complex of statistical variables in principal components. Journal of Educational Psychology, 24:417–441, 1933.
  • Indyk and Motwani (1998) P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, STOC ’98, pages 604–613, New York, NY, USA, 1998. ACM. ISBN 0-89791-962-9.
  • Johnson and Lindenstrauss (1984) W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemporary Mathematics, pages 189–206. American Mathematical Society, 1984.
  • Kang et al. (2008) H.M. Kang, N.A. Zaitlen, C.M. Wade, A. Kirby, D. Heckerman, MJ. Daly, and E. Eskin. Efficient control of population structure in model organism association mapping. Genetics, 178(3):1709–1723, 2008.
  • Kang et al. (2010) H.M. Kang, J.H. Sul, S.K. Service, N.A. Zaitlen, S. Kong, N.B. Freimer, C. Sabatti, and E. Eskin. Variance component model to account for sample structure in genome-wide association studies. Nature genetics, 42(4):348–354, 2010.
  • Krote et al. (2012) A. Krote, B.J. Vilhjálmsson, V. Segura, A. Platt, Q. Long, and M. Nordburg. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nature genetics, 44(9):1066–1071, 2012.
  • Lehoucq et al. (1998) R.R.B. Lehoucq, D.D.C. Sorensen, and C-C. Yang. Arpack User’s Guide: Solution of Large-Scale Eigenvalue Problems With Implicitly Restarted Arnoldi Methods, volume 6. SIAM, 1998.
  • Li et al. (2005) B. Li, H. Zha, and F. Chiaromonte. Contour Regression: A General Approach to Dimension Reduction. The Annals of Statistics, 33(4):1580–1616, 2005.
  • Li (1991) K.C. Li. Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc., 86:316–342, 1991.
  • Li (1992) K.C. Li. On principal Hessian directions for data visualization and dimension reduction: another application of Stein’s lemma. J. Amer. Statist. Assoc., 87:1025–1039, 1992.
  • Liberty et al. (2007) E. Liberty, F. Woolfe, P-G. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms for the low-rank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51):20167–20172, 2007.
  • Lippert et al. (2011) C. Lippert, J. Listgarten, Y. Liu, C.M. Kadie, R.I. Davidson, and D. Heckerman. Fast linear mixed models for genome-wide association studies. Nature Methods, 8(10):833–835, 2011.
  • Lippert et al. (2013) C. Lippert, G. Quon, E.Y. Kang, C.M. Kadie, J. Listgarten, and D. Heckerman. The benefits of selecting phenotype-specific variants for applications of mixed models in genomics. Scientific reports, 3, 2013.
  • Listgarten et al. (2012) J. Listgarten, C. Lippert, C.M. Kadie, R.I. Davidson, E. Eskin, and D. Heckerman. Improved linear mixed models for genome-wide association studies. Nature methods, 9(6):525–526, 2012.
  • Mahoney (2011) M.W. Mahoney. Randomized Algorithms for Matrices and Data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
  • Martinsson et al. (2010) P.-G. Martinsson, A. Szlam, and M. Tygert. Normalized power iterations for the computation of SVD. NIPS Workshop on Low-rank Methods for Large-scale Machine Learning, Vancouver, Canada, 2010.
  • Mimno et al. (2014) D. Mimno, D.M. Blei, and B.E. Engelhardt. Posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure. arXiv preprint arXiv:1407.0050, 2014.
  • Nilsson et al. (2007) J. Nilsson, F. Sha, and M.I. Jordan. Regression on Manifolds Using Kernel Dimension Reduction. In Proceedings of the 24th International Conference on Machine Learning, 2007.
  • Owen and Perry (2009) A.B. Owen and P.O. Perry. Bi-cross-validation of the SVD and the nonnegative matrix factorization. The Annals of Applied Statistics, 3:564–594, 2009. doi: 10.1214/08-AOAS227.
  • Polson and Scott (2010) N.G. Polson and J.G. Scott. Shrink Globally, Act Locally: Sparse Bayesian Regularization and Prediction. In Bayesian Statistics 9. Oxford University Press, 2010.
  • Price et al. (2011) A.L. Price, A. Helgason, G. Thorleifsson, S.A. McCarroll, A. Kong, and K. Stefansson. Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals. PLoS Genet, 7, 02 2011.
  • Pritchard and Donnelly (2001) J.K. Pritchard and P. Donnelly. Case–control studies of association in structured or admixed populations. Theoretical population biology, 60(3):227–237, 2001.
  • Rokhlin et al. (2009) V. Rokhlin, A. Szlam, and M. Tygert. A Randomized Algorithm for Principal Component Analysis. SIAM J. Matrix Anal. Appl., 31(3):1100–1124, August 2009.
  • Runcie and Mukherjee (2013) D.E. Runcie and S. Mukherjee. Dissecting high-dimensional phenotypes with bayesian sparse factor analysis of genetic covariance matrices. Genetics, 194(3):753–767, 2013.
  • Saad (1992) Y. Saad. Numerical methods for large eigenvalue problems, volume 158. SIAM, 1992.
  • Sarlos (2006) T. Sarlos. Improved approximation algorithms for large matrices via random projections. In Foundations of Computer Science, 2006. FOCS ’06. 47th Annual IEEE Symposium on, pages 143 –152, October 2006.
  • sequencing consortium (1999) The MHC sequencing consortium. Complete sequence and gene map of a human major histocompatibility complex. Nature, 401(6756):921–923, 1999.
  • Stewart (2001) G.W. Stewart. Matrix Algorithms: Volume 2, Eigensystems, volume 2. SIAM, 2001.
  • Strimmer (2008) K. Strimmer. fdrtool: a versatile r package for estimating local and tail area-based false discovery rates. Bioinformatics, 24(12):1461–1462, 2008.
  • Sugiyama (2007) M. Sugiyama. Dimension reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8:1027–1061, 2007.
  • Wu et al. (2010) Q. Wu, F. Liang, and S. Mukherjee. Localized Sliced Inverse Regression. Journal of Computational and Graphical Statistics, 19(4):843–860, 2010.
  • Yang et al. (2011) J. Yang, S.H. Lee, M.E. Goddard, and Peter M Visscher. GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1):76–82, 2011.
  • Yang et al. (2014) J. Yang, N.A. Zaitlen, M.E. Goddard, P.M. Visscher, and A.L. Price. Advantages and pitfalls in the application of mixed-model association methods. Nature genetics, 46(2):100–106, 2014.
  • Young (1941) G. Young. Maximum likelihood estimation and factor analysis. Psychometrika, 6:49–53, 1941.
  • Zhou and Stephens (2012) X. Zhou and M. Stephens. Genome-wide efficient mixed-model analysis for association studies. Nature genetics, 44(7):821–824, 2012.
  • Zhou et al. (2013) X. Zhou, P. Carbonetto, and M. Stephens. Polygenic modeling with Bayesian sparse linear mixed models. PLoS genetics, 9(2):e1003264, 2013.