1 Introduction
In the current era of information, large amounts of complex high-dimensional data are routinely generated across science and engineering disciplines. Estimating and exploring the underlying low-dimensional structure in data, and using this latent structure to study scientific problems, is of fundamental importance in a variety of applications. As the size of data sets increases, the problems of statistical inference and computational feasibility become inextricably linked. Dimension reduction is a natural approach to summarizing massive data and has historically played a central role in data analysis, visualization, and predictive modeling. It has had significant impact on statistical inference (Adcock, 1878; Edgeworth, 1884; Fisher, 1922; Hotelling, 1933; Young, 1941), as well as on numerical analysis research and applications (Golub, 1969; Golub and Van Loan, 1996; Gu and Eisenstat, 1996; Golub et al., 2000); for a recent review see (Mahoney, 2011). Historically, statisticians have focused on the theoretical properties of estimators, often in the asymptotic regime of a large number of samples. Numerical analysts and computational mathematicians, on the other hand, have been instrumental in developing useful and tractable algorithms with provable stability and convergence guarantees. Naturally, many of these algorithms have been successfully applied to compute estimators grounded on solid statistical foundations. A classic example of this interplay is principal component analysis (PCA) (Hotelling, 1933). In PCA, an objective function is defined based on statistical considerations about the sample variance in the data, and the resulting estimator can be computed efficiently using any of a variety of singular value decomposition (SVD) algorithms developed by the numerical analysis community.
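As a concrete illustration of this interplay, PCA of a centered data matrix can be read directly off its SVD. The following is a minimal sketch (function and variable names are ours, not from this paper):

```python
import numpy as np

def pca_via_svd(X, k):
    """Top-k PCA of X (samples in rows) computed from the SVD.

    A minimal sketch: center the columns, take the SVD of the centered
    matrix, and read off the leading right singular vectors as the
    principal axes.
    """
    Xc = X - X.mean(axis=0)                # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                    # principal axes (k x p)
    scores = U[:, :k] * s[:k]              # sample projections onto the axes
    explained_var = s[:k] ** 2 / (X.shape[0] - 1)
    return components, scores, explained_var
```

The sample-variance objective of PCA is maximized exactly by the leading singular vectors, which is why any SVD routine, exact or randomized, can serve as the computational engine.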
In this paper, we consider the problem of dimension reduction, integrating statistical considerations of estimation accuracy and out-of-sample prediction error for matrices with latent low-rank structure with computational considerations of run time and numerical accuracy. The methodology that we develop builds on a classical approach to modeling large data: first compress the data while minimizing the loss of relevant information, then apply statistical estimators appropriate for small-scale problems. In particular, we focus on dimension reduction via generalized eigendecomposition as the means of data compression, and on out-of-sample residual error as the measure of information loss. The scope of this work includes a large number of dimension reduction methods that can be implemented as solutions to truncated generalized eigendecomposition problems (Hotelling, 1933; Fisher, 1936; Li, 1991; Wu et al., 2010). Our first focus is the increasing need to compute an SVD of massive data; we use randomized algorithms developed in the numerical analysis community (Drineas et al., 2006; Sarlos, 2006; Liberty et al., 2007; Boutsidis et al., 2009; Rokhlin et al., 2009; Halko et al., 2011) to simultaneously reduce the dimension and regularize, or control, the impact of independent random noise.
The second focus of this paper is to provide efficient solvers for the linear mixed models that arise in statistical and quantitative genomics. In high-throughput genomics experiments, a vast amount of sequencing data is collected, on the order of tens of millions of genetic variants. The goal of genome-wide association studies is to test for a statistical association at each variant (polymorphic position) with a response of interest (e.g., gene expression levels or disease status) in a sample cohort. However, as the sizes of these genomic data and sample cohorts continue to increase, there is an urgent need to improve the statistical and computational performance of standard tests.
It is typical to collect several thousand individuals for one study, and these individuals may come from several genetically heterogeneous populations. It has been recognized since 2001 (Pritchard and Donnelly, 2001) that the ancestry makeup of the individuals in a study has great potential to influence the results; in particular, spurious associations arise when genetic variants whose frequencies differ across populations appear to be associated with the response variable through latent population structure.
The earliest methods (e.g., genomic control) accounted for population structure by using covariate estimates to correct for these confounding signals. More recently, linear mixed models (LMMs) have been used successfully to correct spurious results in the presence of population structure. LMMs have been shown to improve power in association studies while reducing false positives (Yang et al., 2014). However, mixed models incur a high computational cost when performing association studies because of the burden of computing a covariance matrix for the random effect controlling for population structure. Significant work has gone into mitigating such costs using spectral decompositions for efficient covariance estimation (Kang et al., 2008, 2010; Yang et al., 2011; Zhou and Stephens, 2012; Listgarten et al., 2012).
In this work, we show, using simulations of genomic data with latent population structure and real data from large-scale genomic studies, that the adaptive randomized SVD (ARSVD) developed here achieves both computational efficiency and numerical accuracy. Under certain settings, we find that the LMM using ARSVD outperforms current state-of-the-art approaches by implicitly performing regularization on the covariance matrix.
The key contributions of this paper are:

We develop an adaptive algorithm for randomized singular value decomposition (SVD) in which both the number of relevant singular vectors and the number of iterations of the algorithm are inferred from the data based on informative statistical criteria.

We demonstrate on simulated and real data examples that the randomized estimators provide a computationally efficient solution and, furthermore, often improve the statistical accuracy of the predictions. We show that in an over-parameterized situation this improvement in accuracy is due to implicit regularization imposed by the randomized approximation.
In Section 2, we describe the adaptive randomized SVD procedure we use for the various dimension reduction methods. In Section 2.5, we provide randomized estimators for linear mixed models used in quantitative genetics. In Section 3, we validate the proposed methodology on simulated and real data and compare our approach with state-of-the-art approaches. In particular, we show results from our approach for estimating low-dimensional geographic structure in genomic data and for genetic association mapping applications.
2 Randomized algorithms for dimension reduction
In this section we develop algorithmic extensions for PCA. We state an algorithm that provides a numerically efficient and statistically robust estimate of the highest-variance directions in the data using a randomized algorithm for singular value decomposition (randomized SVD) (Rokhlin et al., 2009; Halko et al., 2011). In this problem, the objective is linear unsupervised dimension reduction, with the low-dimensional subspace estimated via an eigendecomposition. Randomized SVD will serve as the core computational engine for the other estimators we develop in this paper.
2.1 Notation
Given positive integers n and p, R^{n x p} denotes the class of all real matrices of dimension n x p, and S_+^n (S_{+,0}^n) denotes the subclass of symmetric positive definite (semi-definite) matrices. For A in R^{n x p}, span(A) denotes the subspace of R^n spanned by the columns of A. A basis matrix for a subspace S is any full column rank matrix B such that span(B) = S. In the case of sample data X, eigenbasis(X) denotes the orthonormal left eigenvector basis. Denote the data matrix by X = (x_1, ..., x_n)^T in R^{n x p}, where each sample x_i is assumed to be generated by a p-dimensional probability distribution. In the case of supervised dimension reduction, denote the response vector by y in R^n (quantitative response) or y in {1, ..., C}^n (categorical response with C categories). The data and the response are assumed to have a joint distribution.
2.2 Computational considerations
The main computational tool we use is a randomized algorithm for approximate eigendecomposition, which factorizes a matrix of rank k in O(npk) time rather than the O(np min(n, p)) time required by approaches that do not take advantage of the special structure in the input. This is particularly relevant to statistical applications in which the data are high dimensional but reflect a highly constrained process, e.g., from biology or finance applications, which suggests low intrinsic dimensionality, i.e., k << min(n, p). An appealing characteristic of our randomized algorithm is the explicit control of the trade-off between estimation accuracy, relative to the exact estimates, and computational efficiency. Rapid convergence to the exact estimates was shown both empirically and in theory in (Rokhlin et al., 2009).
2.3 Statistical considerations
A central concept in this paper is that the randomized approximation algorithms that we use for statistical inference implicitly impose regularization constraints. This idea is best understood by putting the randomized algorithm in the context of a statistical model. The numerical analysis and theoretical computer science perspective is deterministic and focuses on optimizing the discrepancy between the randomized and the exact solution, which is estimated and evaluated on a fixed data set. However, in many practical applications, the data are noisy random samples from a population. Hence, when the interest is in dimension reduction, the relevant error comparison is between the true subspace that captures information about the population and the corresponding algorithmic estimates. The true subspace is typically unknown, so we validate the estimated subspace by quantifying the out-of-sample generalization performance. A key parameter of the randomized estimators, described in detail in Section 2.4, is the number of power iterations t used to estimate the span of the data. Increasing values of this parameter provide intermediate solutions to the factorization problem, which converge to the exact answer (Rokhlin et al., 2009). Fewer iterations correspond to reduced run time but also to larger deviation from the sample estimates and hence stronger regularization. We show in Section 3 the regularization effect of randomization on both simulated and real data sets. We argue that, in certain contexts, fewer iterations may be justified on both computational and statistical grounds.
2.4 Adaptive randomized low-rank approximation
In this section we provide a brief description of a randomized estimator for the best low-rank matrix approximation, introduced by (Rokhlin et al., 2009; Halko et al., 2011), which combines random projections with numerically stable matrix factorization. We consider this numerical framework as implementing a computationally efficient shrinkage estimator for the subspace capturing the largest variance directions in the data, particularly appropriate when applied to an input matrix of (approximately) low rank. A detailed discussion of the estimation accuracy of randomized SVD in the absence of noise is provided in (Rokhlin et al., 2009). The idea of random projection was first developed as a proof technique to study the distortion induced by low-dimensional embedding of high-dimensional vectors in the work of (Johnson and Lindenstrauss, 1984), with much subsequent literature simplifying and sharpening the results (Frankl and Maehara, 1987; Indyk and Motwani, 1998; Dasgupta and Gupta, 2003; Achlioptas, 2001). More recently, the theoretical computer science and numerical analysis communities discovered that random projections can be used for efficient approximation algorithms in a variety of applications (Drineas et al., 2006; Sarlos, 2006; Liberty et al., 2007; Boutsidis et al., 2009; Rokhlin et al., 2009; Halko et al., 2011). We focus on one such approach, proposed by (Rokhlin et al., 2009; Halko et al., 2011), which targets the accurate low-rank approximation of a given large data matrix X. In particular, we extend the randomization methodology to the noise setting, in which the estimation error is due both to the approximation of the low-rank structure in X and to the added noise. A simple working model capturing this scenario is X = M + E, where M in R^{n x p} with rank(M) = k captures the low-dimensional signal, while E is independent additive noise.
Algorithm.
Given an upper bound k_max on the target rank and an upper bound T on the number of necessary power iterations (a small number of iterations is sufficient in most cases), the algorithm proceeds in two stages: (1) estimate a basis for the range of X; (2) project the data onto this basis and apply SVD.

Algorithm: Adaptive Randomized SVD(X, k_max, T, l)

(1) Find an orthonormal basis for the range of X:
(1.i) Set the number of working directions: d = k_max + l;
(1.ii) Draw a Gaussian random projection matrix Omega in R^{p x d};
(1.iii) Construct the blocks H^(t) = (X X^T)^t X Omega for t = 0, ..., T;
(1.iv) Select the optimal block t* and rank estimate k(t*), using a stability criterion and Bi-Cross-Validation (see Section 2.4.1);
(1.v) Estimate a basis for the selected block using SVD: Q = eigenbasis(H^(t*));

(2) Project the data onto the range basis and compute the SVD:
(2.i) Project onto the basis: B = Q^T X;
(2.ii) Factorize: B = U_B S V^T, where U = Q U_B;
(2.iii) Rank-k(t*) approximation: X is approximated by U S V^T.
In stage (1) we set the number of working directions d to be the sum of the upper bound k_max on the rank of the data and a small oversampling parameter l, which ensures a more stable approximation of the top sample variance directions; the estimator tends to be robust to changes in l, so a small default value suffices. In step (1.iii) the random projection matrix Omega is applied to powers of X X^T to randomly sample linear combinations of eigenvectors of the data, weighted by powers of the eigenvalues. The purpose of the power iterations is to increase the decay of the noise portion of the eigenspectrum while leaving the eigenvectors unchanged. This is effectively a form of regularization, as we are shrinking eigenvalues. Note that each column of H^(t) corresponds to a draw from an n-dimensional Gaussian distribution N(0, (X X^T)^{2t+1}), with the covariance structure more strongly biased towards the higher directions of variation as t increases. This type of regularization is analogous to the local shrinkage term developed in prior work (Polson and Scott, 2010). In step (1.iv), we select an optimal block t* and estimate an orthonormal basis for its column space. In previous work (Rokhlin et al., 2009), the authors assumed a fixed target rank and approximated X, rather than M. They showed that this strategy typically achieves excellent rank approximation accuracy even for relatively small values of t. In this work we focus on the noisy case, where X = M + E, and propose to adaptively set both k and t, aiming to optimize the generalization performance of the randomized estimator. The estimation strategy for k and t is described in detail in Section 2.4.1. In stage (2) we rotate the orthogonal basis computed in stage (1) to the canonical eigenvector basis and scale according to the corresponding eigenvalues. In step (2.i) the data are projected onto the low-dimensional orthogonal basis Q. Step (2.ii) computes the exact SVD in the projected space.
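The two-stage procedure can be sketched as follows. This is a simplified version with a fixed rank k and iteration count t, omitting the adaptive selection of Section 2.4.1 and re-orthonormalizing after every multiplication for numerical stability; the function and variable names are ours:

```python
import numpy as np

def randomized_svd(X, k, t=2, oversample=10, seed=0):
    """Sketch of two-stage randomized SVD with power iterations.

    Stage 1: sample the range of X with a Gaussian test matrix, applying
    (X X^T)^t to sharpen the decay of the noise spectrum.
    Stage 2: project X onto the estimated range basis and take an exact
    SVD of the small projected matrix.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    d = k + oversample                      # number of working directions
    Omega = rng.normal(size=(p, d))         # Gaussian random projection
    Q, _ = np.linalg.qr(X @ Omega)          # initial range estimate
    for _ in range(t):                      # power iterations: (X X^T)^t X Omega
        Q, _ = np.linalg.qr(X.T @ Q)
        Q, _ = np.linalg.qr(X @ Q)
    B = Q.T @ X                             # project data onto the range basis
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                              # rotate back to the original space
    return U[:, :k], s[:k], Vt[:k]
```

The cost is dominated by a handful of multiplications against X; the SVD itself is performed only on the small d x p projected matrix.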
Computational complexity.
The computational complexity of the randomization step is O(npdT), and the factorizations in the lower-dimensional space have complexity O((n + p)d^2). With d small relative to n and p, the run time in both steps is dominated by multiplication by the data matrix; in the case of sparse data, fast multiplication can further reduce the run time. We use a normalized version of the above algorithm that has the same run time complexity but is numerically more stable (Martinsson et al., 2010).
2.4.1 Adaptive method to estimate k and t
We propose to use ideas of stability under random projections, in combination with cross-validation, to estimate the intrinsic dimensionality k of the dimension reduction subspace as well as the optimal value of the eigenvalue shrinkage parameter t.
Stability-based estimation of k:
First, we assume the regularization parameter t is fixed and describe the estimation of the rank parameter k, using a stability criterion based on random projections of the data. We start with a rough upper-bound estimate k_max for the dimension parameter. We then apply a small number m of independent Gaussian random projections Omega_j, j = 1, ..., m, to the data. Given the projections, we compute estimates of the eigenvector basis of the column space from the projected data, and denoise each estimate by raising all the eigenvalues to the power t. The i-th left singular vector estimate (i = 1, ..., k_max) is assigned a stability score defined as the average Spearman rank correlation between the estimates of the i-th principal eigenvector obtained from different random projections. Eigenvector directions that are not dominated by independent noise are expected to have higher stability scores. When the data have approximately low rank, we expect a sharp transition in eigenvector stability between the directions corresponding to signal and those corresponding to noise. In order to estimate this change point, we apply a non-parametric location shift test (Wilcoxon rank-sum) to each partition of the stability scores into eigenvectors with larger versus smaller eigenvalues. The subset of principal eigenvectors that can be stably estimated from the data for the given value of t is determined by the change point with the smallest p-value among all such tests:

k(t) = argmin_k p-value(k, t),

where p-value(k, t) is the p-value from the Wilcoxon rank-sum test applied to the stability scores of the top k directions versus the remaining directions.
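A minimal sketch of this stability criterion follows. The pairing of projections and the exact form of the score are our assumptions (we average absolute Spearman correlations over all projection pairs); the authors' implementation may differ in detail:

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr, ranksums

def stability_rank_estimate(X, k_max, t=2, n_proj=5, seed=0):
    """Estimate the rank via stability of singular directions under
    independent Gaussian random projections (sketch).

    Each projection yields an estimate of the top-k_max left singular
    vectors; direction i is scored by the mean absolute Spearman rank
    correlation of its estimates across projection pairs, and the rank
    is the split point with the smallest Wilcoxon rank-sum p-value.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    bases = []
    for _ in range(n_proj):
        Omega = rng.normal(size=(p, k_max))
        Q, _ = np.linalg.qr(X @ Omega)
        for _ in range(t):                   # power iterations (denoising)
            Q, _ = np.linalg.qr(X.T @ Q)
            Q, _ = np.linalg.qr(X @ Q)
        U, _, _ = np.linalg.svd(Q.T @ X, full_matrices=False)
        bases.append(Q @ U)                  # left singular vector estimates
    scores = np.zeros(k_max)
    for i in range(k_max):
        corrs = [abs(spearmanr(a[:, i], b[:, i])[0])
                 for a, b in combinations(bases, 2)]
        scores[i] = np.mean(corrs)
    # change point: the split with the smallest rank-sum p-value
    pvals = [ranksums(scores[:k], scores[k:]).pvalue for k in range(1, k_max)]
    return int(np.argmin(pvals)) + 1, scores
```

On data with a strong low-rank signal, the signal directions receive scores near one while noise directions score near zero, producing the sharp transition the test is designed to locate.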
Estimation of t:
In this section we describe a procedure for selecting an optimal value for t based on reconstruction accuracy under Bi-Cross-Validation for the SVD (Owen and Perry, 2009), using the generalized Gabriel hold-out pattern (Gabriel, 2002). The rows and columns of the input matrix are randomly partitioned into groups, resulting in a grid of sub-matrices with non-overlapping entries. We apply Adaptive Randomized SVD to factorize the training data from each combination of row blocks and column blocks. In each case the sub-matrix block with the omitted rows and columns is approximated using its modified Schur complement in the training data. The cross-validation error for each data split is the Frobenius norm of the difference between the held-out sub-matrix and its training-data-based estimate. For each sub-matrix approximation we estimate the rank using the stability-based approach from the previous section. As suggested in previous work (Owen and Perry, 2009), we use a 2 x 2 partition into blocks A, B, C, D, in which case the Bi-Cross-Validation error corresponding to holding out the top-left block A becomes the Frobenius norm of A - B D_k^+ C. Here D_k^+ is the Moore-Penrose pseudo-inverse of the rank-k truncation of the training block D. For a fixed value of t we estimate k(t) and factorize the training block using Adaptive Randomized SVD, and denote the resulting Bi-Cross-Validation error accordingly. The same process is repeated for the other three blocks B, C, and D. The final error and rank estimates are defined to be the medians across all held-out blocks. We optimize over the full range of allowable values for t to arrive at the final estimates.
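A minimal sketch of the 2 x 2 Gabriel-style hold-out described above (our own implementation; we substitute a plain truncated SVD of the training block for Adaptive Randomized SVD so the example is self-contained):

```python
import numpy as np

def bcv_error(X, k, seed=0):
    """2x2 bi-cross-validation error for a rank-k approximation (sketch).

    Rows and columns are randomly split in half, giving blocks
    [[A, B], [C, D]]; the held-out block A is predicted by B D_k^+ C,
    where D_k^+ is the pseudo-inverse of the rank-k truncation of D.
    The reported error is the median over the four held-out blocks.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rows, cols = rng.permutation(n), rng.permutation(p)
    r1, r2 = rows[: n // 2], rows[n // 2:]
    c1, c2 = cols[: p // 2], cols[p // 2:]
    errs = []
    for ra, rd in ((r1, r2), (r2, r1)):
        for ca, cd in ((c1, c2), (c2, c1)):
            A = X[np.ix_(ra, ca)]            # held-out block
            B = X[np.ix_(ra, cd)]
            C = X[np.ix_(rd, ca)]
            D = X[np.ix_(rd, cd)]            # training block
            U, s, Vt = np.linalg.svd(D, full_matrices=False)
            Dk_pinv = Vt[:k].T @ np.diag(1.0 / s[:k]) @ U[:, :k].T
            errs.append(np.linalg.norm(A - B @ Dk_pinv @ C, "fro"))
    return float(np.median(errs))
```

For an exactly rank-k matrix the hold-out prediction is exact at the true rank, so the error curve over k (or, in the adaptive procedure, over t) has a well-defined minimum.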
2.5 Fast linear mixed models
Multivariate linear mixed models (LMMs) have been a workhorse in statistical and quantitative genetics because they allow for the regression of outcome on explanatory variables while modeling relatedness between samples (Henderson, 1984; Price et al., 2011; Korte et al., 2012). In the context of trait mapping, LMMs are used to correct for unobserved sources of statistical confounding in genetic association studies, particularly the presence of population structure amongst samples.
The linear mixed models in this paper take the form

y = X beta + Z u + epsilon,

where y is the response, X are the covariates, beta are the fixed effects, Z is the random effects design matrix, and u and epsilon capture unobserved covariates and noise, with u ~ N(0, sigma_g^2 K) and epsilon ~ N(0, sigma_e^2 I). Here, K is the estimated kinship or genetic relatedness matrix, and sigma_g^2 / (sigma_g^2 + sigma_e^2) is the proportion of variance in the phenotypes explained by genetic factors. In the standard application to genetic association studies, the response y is the quantitative trait across individuals and each column of X is a vector of genotypes of those individuals at one of many genomic locations. The statistical test then is to determine whether the coefficient for a particular genetic variant, beta_j, is equal to zero, indicating no genetic association, or alternatively whether beta_j is non-zero, indicating support for a possible genetic association between this genetic locus and the trait. The likelihood for this model follows from a multivariate normal distribution:

y ~ N(X beta, sigma_g^2 Z K Z^T + sigma_e^2 I).
The standard procedure to correct for sample (population) structure using an LMM proceeds in three main steps: (1) empirical estimation (construction) of the genetic relatedness matrix (GRM) amongst individuals based on heldout genetic data; (2) estimation of variance components to account for sample structure contributing to response; and (3) computation of an association statistic at each genotype position.
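These three steps can be sketched for a single variant as follows. This is a standard EMMAX-style computation under our notation (taking the random-effect design to be the identity and choosing the variance ratio on a grid), not the pylmm code itself:

```python
import numpy as np
from scipy import stats

def lmm_assoc_pvalue(y, g, K, deltas=np.logspace(-3, 3, 50)):
    """LMM association test for one variant g (sketch).

    Model: y = mu + g*beta + u + eps, u ~ N(0, sg^2 K), eps ~ N(0, se^2 I).
    Eigendecompose K once, rotate the data so the covariance is diagonal,
    pick delta = se^2/sg^2 by maximum likelihood on a grid, then run a
    Wald test on beta via generalized least squares.
    """
    n = len(y)
    w, Q = np.linalg.eigh(K)                # K = Q diag(w) Q^T, done once
    yr = Q.T @ y                            # rotated response
    Xr = Q.T @ np.column_stack([np.ones(n), g])
    best = None
    for d in deltas:
        v = w + d                           # rotated covariance is sg^2 * diag(w + delta)
        W = Xr / v[:, None]                 # V^{-1} X in rotated coordinates
        XtVX = Xr.T @ W
        beta = np.linalg.solve(XtVX, W.T @ yr)
        r = yr - Xr @ beta
        sg2 = (r ** 2 / v).sum() / n        # ML estimate of sg^2 given delta
        ll = -0.5 * (n * np.log(2 * np.pi * sg2) + np.log(v).sum() + n)
        if best is None or ll > best[0]:
            se_beta = np.sqrt(sg2 * np.linalg.inv(XtVX)[1, 1])
            best = (ll, beta[1], se_beta)
    _, b, se = best
    return 2 * stats.norm.sf(abs(b / se))   # two-sided Wald p-value
```

The key point is that the eigendecomposition in step (1) is computed once; each per-variant test in step (3) then reduces to cheap diagonal-weighted least squares.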
Many methods have focused on reducing the complexity of one or more of these steps to scale LMMs to current study sizes. The EMMA method (Kang et al., 2008) uses spectral decomposition to jointly estimate the likelihood of the response and the random effects, instead of obtaining estimates of the random effects a priori, which incurs a greater computational cost. EMMAX (Kang et al., 2010) improves upon EMMA by estimating the variance components only once (independent of the number of genetic variants tested), under the assumption that most variants have a small effect on the phenotype (the pylmm software used here, along with our adaptation, is based on EMMAX). FaST-LMM (Lippert et al., 2011), in addition to using a heuristic low-rank spectral decomposition to construct an orthogonal representation of the GRM, uses rotations of the genetic variants and response to compute the likelihood function in linear time. Further methods (Lippert et al., 2013) use more systematic approaches for subsetting the genetic variants relevant to the phenotype of interest in order to produce accurate variance component estimates in an LMM.

Our contribution to accelerating parameter estimation in this model is to use ARSVD to address the complexity of decomposing the design matrix. In general, the GRM is computed as a sample-by-sample covariance of the design matrix, which has a computational complexity of O(n^2 p) to build, where p typically dominates in genomic studies. Decomposing this matrix into an exact orthogonal representation using an eigendecomposition has complexity O(n^3) in the worst case. We forgo explicit computation of the GRM and instead perform ARSVD on the original design matrix, with a computational complexity of O(npk). By using ARSVD to decompose the design matrix, we automatically detect low-rank structure present in the GRM, and we avoid the need to manually or heuristically subset the data to achieve a low-rank representation. We note in the simulated data and in the genomic study results that this yields both a substantial acceleration in computation and an implicit regularization of the design matrix that improves the type I error substantially.
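The idea of forgoing the explicit GRM can be sketched as follows. The function name and the standardization choice are our assumptions, and we substitute numpy's exact SVD for ARSVD to keep the sketch self-contained:

```python
import numpy as np

def grm_factors_via_svd(G, k):
    """Low-rank factorization of the GRM without forming it (sketch).

    For the standardized genotype matrix Z (n x p), the GRM is
    K = Z Z^T / p. Instead of the O(n^2 p) product followed by an
    O(n^3) eigendecomposition, take a rank-k SVD of Z directly: the
    left singular vectors and scaled squared singular values of Z are
    exactly the top eigenpairs of K.
    """
    p = G.shape[1]
    Z = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-12)  # standardize variants
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    eigvecs = U[:, :k]
    eigvals = s[:k] ** 2 / p                # eigenvalues of K = Z Z^T / p
    return eigvecs, eigvals
```

Because the LMM solvers above only need the eigenvectors and eigenvalues of K, this factorization can be plugged in directly, with ARSVD replacing the exact SVD at scale.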
3 Results on simulated data
We use real and simulated data to demonstrate three major contributions of this work:

In the presence of interesting low-rank structure in the data, the randomized algorithms tend to be much faster than the exact methods, with minimal loss in approximation accuracy.

The rank and the subspace containing the information in the data can be reliably estimated and used to provide efficient solutions to dimension reduction methods based on the truncated (generalized) eigendecomposition formulation.

The randomized algorithms implicitly impose regularization, which can be adaptively controlled in a computationally efficient manner to produce improved out-of-sample performance.
3.1 Unsupervised dimension reduction
We begin with unsupervised dimension reduction of data with low-rank structure contaminated with Gaussian noise, and we focus on evaluating the application of Adaptive Randomized SVD for PCA (see Section 2.4). In particular, we demonstrate that the proposed method estimates the sample singular values with relative error that decreases exponentially in the number of power iterations t. Then we show that achieving low-rank approximation accuracy similar to that of a state-of-the-art Lanczos method requires the same run time complexity, which scales linearly in both dimensions of the input matrix. This makes our proposed method applicable to large data matrices. Lastly, we demonstrate the ability to adaptively estimate the underlying rank of the data, given a coarse upper bound. For the evaluation of the randomized methods, on all our simulated and real data, we use the default value of the oversampling parameter l.
Simulation setup
The data matrix X in R^{n x p} is generated as follows: X = U S V^T + E. The columns of U and V are drawn uniformly at random from the corresponding unit spheres, and the singular values S = diag(s_1, ..., s_k) are randomly generated starting from a baseline value, which is a fraction of the maximum noise singular value, with exponential increments separating consecutive entries. The noise E is iid Gaussian with variance sigma^2. The sample data have an SVD with singular values in decreasing order. The signal-to-noise relationship is controlled by the baseline value, with large values corresponding to increased separation between the signal and the noise.
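A sketch of this generative setup under our choice of symbols (the baseline multiplier and the exponential increment below are illustrative values, not the paper's exact settings):

```python
import numpy as np

def simulate_lowrank_plus_noise(n, p, k, baseline_frac=2.0, sigma=1.0, seed=0):
    """Generate X = U S V^T + E as described above (sketch).

    Columns of U and V are normalized Gaussian draws (uniform on the
    unit sphere); singular values start at a baseline proportional to
    the largest expected noise singular value and grow with exponential
    increments; E has iid N(0, sigma^2) entries.
    """
    rng = np.random.default_rng(seed)
    U = rng.normal(size=(n, k))
    U /= np.linalg.norm(U, axis=0)
    V = rng.normal(size=(p, k))
    V /= np.linalg.norm(V, axis=0)
    noise_top = sigma * (np.sqrt(n) + np.sqrt(p))  # approx. largest noise singular value
    s = baseline_frac * noise_top * np.exp(0.5 * np.arange(k)[::-1])
    E = rng.normal(scale=sigma, size=(n, p))
    return (U * s) @ V.T + E
```

The factor sqrt(n) + sqrt(p) is the standard asymptotic bound on the top singular value of an n x p Gaussian noise matrix, which makes baseline_frac an interpretable signal-to-noise knob.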
Results
In our first simulation we assume the rank k to be given and fixed. The main focus is the effect of the regularization parameter t controlling the singular value shrinkage, with larger values corresponding to a stronger preference for the higher versus the lower variance directions. The simulation uses input matrices of fixed dimension. Studying the estimates of the percent relative error of the singular values, averaged over ten random data sets, we observed exponential convergence to the sample estimates with increasing t (Table 1). This suggests that the variability in the sample data directions is approximated well for large data sets at the cost of a few data matrix multiplications.
Table 1: mean percent relative error of the singular value estimates for t = 1, ..., 5.
In addition to the minor relative error compared with exact approaches, we further investigated the source of the remaining error (Figure 1). Data were generated as above with fixed true rank. We note that our method is extremely precise for the largest singular values but exhibits a slightly increased decay rate of the singular values compared to exact methods, so that less overall variance is captured in directions that are more influenced by noise. Our method has larger standard error and tends to slightly underestimate the smallest singular values, which we interpret as providing regularization of the noise directions, as further explored in Section 3.2.
Similar ideas have been proposed before in the absence of randomization. Perhaps the most well-established and computationally efficient methods are based on Lanczos Krylov subspace estimation, which also operates on the data matrix only through matrix multiplications (Saad, 1992; Lehoucq et al., 1998; Stewart, 2001; Baglama and Reichel, 2006). These methods are also iterative in nature and tend to converge quickly, with run time complexity typically scaling as the product of the data dimensions and a small rank parameter.
In order to further investigate the practical run time behavior of Adaptive Randomized SVD, we used one such state-of-the-art low-rank approximation algorithm, Blocked Lanczos (Baglama and Reichel, 2006). Here we use the same simulation setting as before with fixed rank, but we vary the regularization parameter t to achieve comparable percent relative low-rank Frobenius norm reconstruction error (to 1 d.p.). In Table 2 we report the ratio of the run times of the two approaches based on ten random data sets. Notice that the relative run time remains approximately constant under simultaneous increase of both data dimensions, which suggests a similar order of complexity for both methods. In the presence of low-rank structure, Adaptive Randomized SVD is feasible for larger data problems where full-factorization methods (e.g., (Anderson et al., 1990)) are not; moreover, Adaptive Randomized SVD achieves good approximation accuracy for the top sample singular values and singular vectors.
Table 2: relative run time of Adaptive Randomized SVD versus Blocked Lanczos for n + p = 6,000; 7,500; 9,000; 10,500; 12,000.
In many modern applications the data have latent low-rank structure of unknown dimension. In Section 2.4.1, we address the issue of estimating the rank using a stability-based approach. Next, we studied the ability of our randomized method to identify the dimensionality of simulated low-rank data. For that purpose, we generated random data sets with varying true rank. We set an initial rank upper-bound estimate k_max and used the adaptive method ARSVD (Section 2.4.1) to estimate both the optimal t and the corresponding k(t). Figure 2 plots the estimated versus the true rank, and the corresponding estimates of the regularization parameter t, for two different signal-to-noise scenarios. In both scenarios, the rank estimates showed good agreement with the true rank values. When the smallest "signal" direction has the same variance as the largest-variance "noise" direction, our approach slightly underestimates the largest ranks. This is due to the fact that the few smallest-variance signal directions tend to be difficult to distinguish from the random noise and hence are less stable under random projections. We observe that our approach tends to select small values of t, especially when there is a clear separation between the signal and the noise (right panel of Figure 2). This results in a reduced number of matrix multiplications and hence fast computation.
3.2 Results on simulated data with latent population structure
To simulate a genomic association study with latent population structure, we generate genotypes and phenotypes that are conditionally independent given the true population structure, as described in (Mimno et al., 2014). Each individual's genotype is admixed, containing a proportion of ancestry from each population. The admixture proportions for each individual are drawn from a Dirichlet distribution. Since we only consider two alleles at each position in the genome, the frequency of each variant in each population is drawn from a uniform distribution over allele frequencies. To generate latent assignments of each variant to a population, we draw variant-population assignments from the individual admixture proportions. The final step draws each genotype for each individual conditional on its variant-population assignment and the corresponding allele frequency. To generate a phenotype (response) vector for the population, we draw values that depend on the latent structure.
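A sketch of a standard admixture simulation of this kind follows. The specific Dirichlet/uniform/binomial choices and parameter values below are our assumptions for illustration; the paper follows the construction in (Mimno et al., 2014):

```python
import numpy as np

def simulate_admixed_genotypes(n, p, n_pops=3, alpha=0.5, seed=0):
    """Simulate admixed genotypes and a structure-confounded phenotype (sketch).

    Each individual has Dirichlet admixture proportions over n_pops
    populations, each variant has a uniform allele frequency per
    population, and genotypes in {0, 1, 2} are binomial draws from the
    individual-specific allele frequency. The phenotype depends only on
    the admixture proportions, so any genotype-phenotype association is
    driven purely by latent structure.
    """
    rng = np.random.default_rng(seed)
    Q = rng.dirichlet(alpha * np.ones(n_pops), size=n)   # n x n_pops ancestry
    F = rng.uniform(0.05, 0.95, size=(n_pops, p))        # allele freqs per population
    P = Q @ F                                            # individual-specific freqs
    G = rng.binomial(2, P)                               # genotypes, n x p
    effects = rng.normal(size=n_pops)
    y = Q @ effects + rng.normal(scale=0.5, size=n)      # structure-driven phenotype
    return G, y, Q
```

By construction there is no causal variant, so any p-value deviating from the uniform distribution in the downstream association test reflects confounding by the latent structure.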
Results are generated for four distinct scenarios representing data with differing structure. The simulations presented here have no true associations between genotypes (features) and phenotypes (response), but the two are dependent conditional on latent population structure. This situation mimics the commonly observed confounding effect in genomic data, and motivates the need for LMMs and regularization in genome-wide studies. Given that the simulations are entirely under the null hypothesis of no association between genotype and phenotype, the p-values are expected to follow a uniform distribution; we consider any result exceeding a significance threshold a false positive.

In each simulation, the p-value distribution from an LMM association without ARSVD is reported at the full data matrix rank (the largest rank on the x-axis). The top-left simulation represents a situation with few samples; here our method performs some slight regularization beyond that of a standard LMM at a rank close to the full data rank (Figure 4). The next simulation (top right) is the most common case in genomics, with a large set of features compared to responses. Our method performs equally well as current methods but at a fraction of the computational cost, as illustrated by the lowest-rank estimate that still yields uniform p-values and a low false positive rate. In the bottom-left panel we generate a case of an over-parameterized model, for which our method produces significant false positives compared to a standard LMM; however, we note that if we know a priori that our model is over-parameterized, we can modify our decomposition to operate in the sample space instead of the feature space and achieve results comparable to a standard LMM. In the last setting (bottom right) our method again performs similarly to the standard method but with gains in computational efficiency. In general, we expect the trade-off between computational efficiency and accuracy to depend on the data analyzed. In the case of structured genomic data, we report in these simulations and in Section 4 that massive computational savings are accompanied by numerical accuracy.

4 Applications to large-scale genomic data
4.1 Results on WTCCC association mapping
We test our approach that incorporates ARSVD in LMM association studies on the Wellcome Trust Case Control Consortium (WTCCC) data (Consortium, 2007), which includes 4,684 individuals (half cases with Crohn’s disease and half controls without) and 478,765 genetic variants across the 22 autosomal chromosomes in the genome. We compare ARSVD incorporated into pylmm, the next-generation LMM software following EMMAX, against GEMMA. Performing ARSVD on the whole genome took 82.2 seconds, while the traditional eigendecomposition of the covariance matrix in pylmm took 88 minutes 23.9 seconds. In order to most accurately control the test statistic computed in an LMM and achieve maximal statistical power, it is suggested that a covariance matrix be constructed once per chromosome (22 in total), holding out the test chromosome and concatenating the remaining chromosomes (Yang et al., 2014). Our method performs the 22 decompositions in a total of 5 minutes 4.8 seconds, while the traditional method performs the same decompositions in 4 hours 24 minutes.
GEMMA is executed with the lmm option to most closely resemble the analysis that pylmm performs. Despite the phenotypes being discrete (disease status {0,1}), we note that estimating coefficients (effect sizes) with linear regression and with logistic regression has nearly indistinguishable impact on the computed association statistics (Zhou et al., 2013). Estimates for pylmm and GEMMA have a high correlation, yet GEMMA shows no enrichment for significant p-values. On the other hand, the distribution of p-values estimated from pylmm is closer to what we expect from an analysis with true associations, showing that our method, while providing some regularization, does not reduce enrichment of true signal.
We investigated the most significant genetic associations identified by our method and compared them to previous results from a large-scale meta-analysis of Crohn’s disease. This meta-analysis had a much larger sample size than the WTCCC: it included 6,333 affected individuals (cases) and 15,056 controls, and followed up the top association signals in 15,694 cases, 14,026 controls, and 414 parent-offspring trios. Table 3 reports the curated list of associated genetic variants reported in this meta-analysis through various sources (Listgarten et al., 2012). This includes genetic variants identified by the meta-analysis (Franke et al., 2010), by the original WTCCC study (Consortium, 2007), or those confirmed more recently via several methods and known to correspond to a region associated with Crohn’s disease risk (Listgarten et al., 2012). For each analysis type, we report the number of variants found significant using this analysis, the subset of those variants that lie within the top 0.5% of our results, and the subset of variants that reach significance at a local false discovery rate (lfdr) of 5% (Strimmer, 2008).
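The two selection rules reported here (membership in the top 0.5% of results and significance at an error-rate cutoff) can be mimicked in a few lines. The paper computes local false discovery rates with the fdrtool R package (Strimmer, 2008); the Benjamini–Hochberg step-up below is only a stand-in for error-rate-controlled thresholding, and the p-values are made up for illustration:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of discoveries at FDR level alpha (BH step-up procedure)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True   # reject the k smallest p-values
    return mask

pvals = np.array([1e-6, 0.001, 0.02, 0.3, 0.7])        # hypothetical p-values
top_half_percent = pvals <= np.quantile(pvals, 0.005)  # "top 0.5%" rule
discoveries = benjamini_hochberg(pvals, alpha=0.05)
```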
Furthermore, we identify several genetic variants associated with Crohn’s disease that were previously unreported. In particular, a few genetic variants within the 3.6 Mb region that defines the MHC on chromosome 6 in the human genome (sequencing consortium, 1999) pass an lfdr threshold of 10%. Notably, genetic variant rs9269186 passes below an lfdr threshold of 1% and lies within 5 KB of the start of the HLA-DRB5 gene, putatively acting to regulate transcription of this protein-coding gene, which produces a membrane-bound class II molecule. The HLA-DRB5 protein is an antigen with an important role in the human immune system, and thus may play a role in an autoimmune disorder such as Crohn’s disease.
source       | significant variants | top 0.5% | lfdr 5%
All          | 151                  | 61       | 35
MA + WTCCC   | 93                   | 48       | 30
MA           | 81                   | 47       | 29
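The per-chromosome leave-one-out covariance construction described above (Yang et al., 2014) amounts to rebuilding the n x n relatedness matrix from all chromosomes except the one being tested. A sketch with hypothetical array sizes (the WTCCC analysis uses 22 chromosomes; four are shown here):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Hypothetical genotype blocks, one (n x variants) array per chromosome.
chroms = [rng.normal(size=(n, m)) for m in (30, 25, 40, 35)]

def loco_covariance(chroms, held_out):
    """Sample covariance across variants from all chromosomes except held_out."""
    X = np.hstack([c for i, c in enumerate(chroms) if i != held_out])
    Xc = X - X.mean(axis=0)        # center each variant
    return Xc @ Xc.T / X.shape[1]  # n x n relatedness estimate

# One covariance per held-out chromosome, as in the per-chromosome LMM scans.
K_list = [loco_covariance(chroms, i) for i in range(len(chroms))]
```

In the paper this is precisely where ARSVD replaces the exact eigendecomposition of each covariance matrix, turning 22 expensive decompositions into fast randomized ones.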
5 Discussion
Massive high-dimensional data sets are ubiquitous in modern data analysis settings. In this paper we address the problem of tractably and accurately performing eigendecompositions when analyzing such data. The main computational tool we use is based on recent randomized algorithms developed by the numerical analysis community. Taking into account the presence of noise in the data, we provide an adaptive method for estimating both the rank and the number of Krylov iterations for the randomized approximate version of SVD. Using this adaptive estimator of low-rank structure, we implement efficient algorithms for PCA and linear mixed models. Perhaps the most interesting statistical observation is related to the regularization that randomization implicitly imposes on the resulting eigendecomposition estimates.
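A minimal randomized SVD with power (Krylov) iterations, in the spirit of Halko et al. (2011), illustrates the computational tool; it is a generic sketch, not the paper’s adaptive rank and iteration selection:

```python
import numpy as np

def randomized_svd(A, k, n_iter=2, oversample=10, seed=0):
    """Approximate rank-k SVD: random projection plus power iterations."""
    rng = np.random.default_rng(seed)
    Omega = rng.normal(size=(A.shape[1], k + oversample))
    Y = A @ Omega
    for _ in range(n_iter):          # power iterations sharpen the spectrum
        Y, _ = np.linalg.qr(Y)
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)           # orthonormal basis for the range of A
    B = Q.T @ A                      # small (k + oversample) x n problem
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

# Low-rank-plus-noise matrix: the regime where the approximation shines.
rng = np.random.default_rng(1)
A = (rng.normal(size=(300, 5)) @ rng.normal(size=(5, 200))
     + 0.01 * rng.normal(size=(300, 200)))
U, s, Vt = randomized_svd(A, k=5)
```

The cost is dominated by a handful of matrix-vector products with A, rather than a full dense decomposition, which is the source of the timing gains reported in Section 4.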
In simulated experiments we show high accuracy in recovering the true (latent) rank of matrices with low-rank substructure, without the need for many Krylov iterations. Additionally, we show in simulations that our method performs implicit regularization and improves quantitative properties of results under various settings of data structure. Furthermore, our results on modern genome-wide association studies show that our approach to using ARSVD in linear mixed models fills a critical need for computationally efficient methods that do not sacrifice the desirable statistical properties of these tests, but instead have the potential to improve upon them. Some important open questions still remain:

There is a need for a theoretical framework to quantify what generalization guarantees the randomized algorithm can provide on out-of-sample data, and how this bound depends on the noise and the structure in the data on one hand, and on the parameter settings on the other.

A probabilistic interpretation of the algorithm could contribute additional insights into the practical utility of the proposed approach under different assumptions. In particular, it would be interesting to relate our work to a Bayesian model with posterior modes that correspond to the subspaces estimated by the randomized approach.

The implicit regularization on latent factors imposed by ARSVD should be further explored with respect to the structure of the noise, and its impact on estimates of random effects in LMMs (Runcie and Mukherjee, 2013).
SM would like to acknowledge Lek-Heng Lim, Michael Mahoney, Qiang Wu, and Ankan Saha. SM is pleased to acknowledge support from grants NIH (Systems Biology) 5P50-GM081883, AFOSR FA9550-10-1-0436, NSF CCF-1049290, and NSF DMS-1209155. BEE is pleased to acknowledge support from grants NIH R00 HG006265 and NIH R01 MH101822. SG would like to acknowledge Uwe Ohler, Jonathan Pritchard, and Ankan Saha.
Appendix A
A.1 Generalized eigendecomposition and dimension reduction
This appendix shows that a variety of dimension reduction methods (supervised, unsupervised, and nonlinear) can use the ARSVD engine to scale to massive data. The key requirement is a formulation as a truncated generalized eigendecomposition problem that can be solved by the Adaptive Randomized SVD from Section 2.4. The dimension reduction methods we focus on are sliced inverse regression (SIR) and localized sliced inverse regression (LSIR).
A.1.1 Problem Formulation.
Assume we are given matrices that characterize pairwise relationships in the data, and consider the “intrinsic dimensionality” of the information contained in the data. In the case of supervised dimension reduction methods, this corresponds to the dimensionality of the linear subspace to which the joint distribution of the predictors and response assigns nonzero probability mass. Our objective is to find a basis for that subspace. For SIR and LSIR this corresponds to the span of the generalized eigenvectors with the largest eigenvalues:
(1) 
An important structural constraint we impose, which is applicable to a variety of high-dimensional data settings, is that the matrix has low rank. It is this constraint that we take advantage of in the randomized methods. The same constraint applies in the unsupervised case.
A.1.2 Sufficient dimension reduction
Dimension reduction is often a first step in the statistical analysis of high-dimensional data and may be followed by data visualization or predictive modeling. If the ultimate goal is the latter, then the statistical quantity of interest is a low-dimensional summary that captures all the predictive information relevant to the response. Sufficient dimension reduction (SDR) is one popular framework for estimating such a summary (Li, 1991; Cook and Weisberg, 1991; Li, 1992; Li et al., 2005; Nilsson et al., 2007; Sugiyama, 2007; Cook, 2007; Wu et al., 2010). In this appendix we focus on linear SDRs, which provide a prediction-optimal linear reduction.
We will consider two specific dimension reduction methods: Sliced Inverse Regression (SIR) (Li, 1991) and Localized Sliced Inverse Regression (LSIR) (Wu et al., 2010). SIR is effective when the predictive structure in the data is global, i.e., when there is a single predictive subspace over the support of the marginal distribution of the predictors. In the case of local or manifold predictive structure in the data, LSIR can be used to compute a projection matrix that captures this nonlinear (manifold) structure.
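A compact SIR sketch makes the construction concrete: slice on the response, form the covariance of the slice means, and solve the generalized eigenproblem by whitening with the covariance of the predictors. This is a minimal textbook version, not the paper’s implementation:

```python
import numpy as np

def sir_directions(X, y, n_slices=10, d=1):
    """Sliced Inverse Regression (Li, 1991): top-d generalized eigendirections."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n                      # covariance of predictors
    M = np.zeros((p, p))                       # covariance of slice means
    for idx in np.array_split(np.argsort(y), n_slices):
        mu = Xc[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(mu, mu)
    # Solve M v = lambda Sigma v by whitening with Sigma^{-1/2}.
    wS, US = np.linalg.eigh(Sigma)
    S_inv_half = US @ np.diag(wS ** -0.5) @ US.T
    w, V = np.linalg.eigh(S_inv_half @ M @ S_inv_half)
    return S_inv_half @ V[:, ::-1][:, :d]      # back-transform top directions

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.1 * rng.normal(size=500)       # single-index model along e1
b = sir_directions(X, y, d=1).ravel()
```

On this single-index example the recovered direction aligns closely with the first coordinate axis, the true predictive subspace.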
A.1.3 Efficient solutions and approximate SVD
SIR and LSIR reduce to solving a truncated generalized eigendecomposition problem as in (1). Since we estimate the dimension reduction from sample data, we focus on the sample estimators, where the second matrix is symmetric and encodes the method-specific grouping of the samples based on the response. In the classic statistical setting, when the sample size exceeds the dimension, both matrices are positive definite almost surely. A typical solution then proceeds by first sphering the data, e.g., using a Cholesky or SVD representation of the covariance. This is followed by an eigendecomposition (Li, 1991; Wu et al., 2010) and back-transformation of the top eigenvector directions to the canonical basis. When the dimension exceeds the sample size, the matrices are rank-deficient and a unique solution to problem (1) does not exist. One widely used approach, which allows us to make progress in this problematic setting, is to restrict attention to the directions in the data with positive variance. We can then proceed as before, using an orthogonal projection onto the span of the data. In many modern data analysis applications both the sample size and the dimension are very large, and hence the algorithmic complexity of the above approaches can be prohibitive, rendering them unusable. We propose an approximate solution that explicitly recovers the low-rank structure using Adaptive Randomized SVD from Section 2.4. In particular, assume the rank is at least the dimensionality of the optimal dimension reduction subspace. The solution of the generalized eigendecomposition problem (1) then becomes restricted to the subspace spanned by the columns of the recovered basis:
(2) 
The dimension reduction subspace is then contained in the span of the back-transformed restricted eigenvectors.
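The restriction of the generalized eigenproblem to the span of a low-rank basis, as in (2), can be sketched directly. Here the basis V is computed with an exact SVD for clarity; in the paper’s method it would come from ARSVD. All matrix sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 100, 4
L = rng.normal(size=(p, k))
A = L @ L.T                                  # low-rank (rank-k) matrix
C = np.eye(p) + 0.1 * rng.normal(size=(p, p))
B = C @ C.T                                  # positive definite matrix

# Orthonormal basis for the column space of A (the role ARSVD plays).
V = np.linalg.svd(A, full_matrices=False)[0][:, :k]

# Restricted problem: (V^T A V) u = lambda (V^T B V) u, solved by whitening.
As, Bs = V.T @ A @ V, V.T @ B @ V
wB, UB = np.linalg.eigh(Bs)
Bs_inv_half = UB @ np.diag(wB ** -0.5) @ UB.T
w, U = np.linalg.eigh(Bs_inv_half @ As @ Bs_inv_half)
directions = V @ (Bs_inv_half @ U[:, ::-1])  # eigenvectors, original coordinates
```

The payoff is cost: after the basis is computed, the whitening and eigendecomposition act on k x k matrices instead of p x p ones, so the remaining work is essentially free.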
References
 Achlioptas (2001) D. Achlioptas. Database-friendly random projections. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS ’01, pages 274–281, New York, NY, USA, 2001. ACM. ISBN 1581133618.
 Adcock (1878) R.J. Adcock. A problem in least squares. The Analyst, 5:53–54, 1878.
 Anderson et al. (1990) E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammerling, J. Demmel, C. Bischof, and D. Sorensen. Lapack: a portable linear algebra library for highperformance computers. In Proceedings of the 1990 ACM/IEEE conference on Supercomputing, Supercomputing ’90, pages 2–11, Los Alamitos, CA, USA, 1990. IEEE Computer Society Press. ISBN 0897914120.
 Baglama and Reichel (2006) J. Baglama and L. Reichel. Restarted block Lanczos bidiagonalization methods. Numerical Algorithms, 43(3):251–272, 2006.
 Boutsidis et al. (2009) C. Boutsidis, M.W. Mahoney, and P. Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the twentieth Annual ACMSIAM Symposium on Discrete Algorithms, SODA ’09, pages 968–977. Society for Industrial and Applied Mathematics, 2009.
 Consortium (2007) The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678, 2007.
 Cook (2007) R.D. Cook. Fisher Lecture: Dimension Reduction in Regression. Statistical Science, 22(1):1–26, 2007.
 Cook and Weisberg (1991) R.D. Cook and S. Weisberg. Discussion of Li (1991). J. Amer. Statist. Assoc., 86:328–332, 1991.
 Dasgupta and Gupta (2003) S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.
 Drineas et al. (2006) P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo Algorithms for Matrices II: Computing a Low-Rank Approximation to a Matrix. SIAM J. Comput., 36:158–183, July 2006.
 Edegworth (1884) F.Y. Edegworth. On the reduction of observations. Philosophical Magazine, pages 135–141, 1884.
 Fisher (1922) R.A. Fisher. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Statistical Society A, 222:309–368, 1922.
 Fisher (1936) R.A. Fisher. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7(2):179–188, 1936.
 Franke et al. (2010) A. Franke, D.P.B. McGovern, J.C Barrett, K. Wang, G.L RadfordSmith, T. Ahmad, C.W Lees, T. Balschun, J. Lee, R. Roberts, C.A. Anderson, J.C. Bis, S. Bumpstead, D. Ellinghaus, E.M. Festen, M. Georges, T. Green, T. Haritunians, L. Jostins, A. Latiano, C.G. Mathew, G.W Montgomery, N.J. Prescott, S. Raychaudhuri, J.I. Rotter, P. Schumm, Y. Sharma, L.A. Simms, D. Wijmenga C. Taylor, K.D. Whiteman, R.N. Baldassano, M. Barclay, T.M. Bayless, S. Brand, C. Buning, A. Cohen, JF. Colombel, M. Cottone, L. Stronati, T. Denson, M. De Vos, R. D’Inca, M. Dubinsky, C. Edwards, T. Florin, D. Franchimont, R. Gearry, J. Glas, A. Van Gossum, S.L. Guthery, J. Halfvarson, H.W. Verspaget, A. Hugot, JP. Karban, D. Laukens, I. Lawrance, M. Lemann, A. Levine, C. Libioulle, E. Louis, C. Mowat, W. Newman, J. Panes, A. Phillips, D.D. Proctor, M. Regueiro, R. Russell, P. Rutgeerts, J. Sanderson, M. Sans, F. Seibold, A.H. Steinhart, P.C.F. Stokkers, L. Torkvist, G. KullakUblick, D. Wilson, T. Walters, S.R. Targan, S.R. Brant, J.D. Rioux, M. D’Amato, R.K. Weersma, S. Kugathasan, A.M. Griffiths, J.C. Mansfield, S. Vermeire, R.H. Duerr, M.S. Silverberg, J. Satsangi, S. Schreiber, J.H. Cho, V. Annese, H. Hakonarson, M.J. Daly, and M. Parkes. Genomewide metaanalysis increases to 71 the number of confirmed crohn’s disease susceptibility loci. Nature genetics, 42(12):1118–1125, 2010.
 Frankl and Maehara (1987) P. Frankl and H. Maehara. The Johnson-Lindenstrauss lemma and the sphericity of some graphs. J. Comb. Theory Ser. A, 44:355–362, June 1987.
 Gabriel (2002) K.R. Gabriel. Le biplot  outil d’exploration de données multidimensionnelles. Journal de la société française de statistique, 143(34):5–55, 2002.
 Golub (1969) G.H. Golub. Matrix decompositions and statistical calculations. Technical report, Stanford, CA, USA, 1969.
 Golub and Van Loan (1996) G.H. Golub and C.F. Van Loan. Matrix computations. John Hopkins University Press, Baltimore, MD, 3rd edition, 1996. ISBN 080185413X.
 Golub et al. (2000) G.H. Golub, K. Sløna, and P. Van Dooren. Computing the SVD of a General Matrix Product/Quotient. SIAM J. Matrix Anal. Appl, 22:1–19, 2000.
 Gu and Eisenstat (1996) M. Gu and S.C. Eisenstat. Efficient algorithms for computing a strong rankrevealing QR factorization. SIAM J. Sci. Comput., 17(4):848–869, July 1996.
 Halko et al. (2011) N. Halko, PG. Martinsson, and J.A. Tropp. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2):217–288, 2011.
 Henderson (1984) C.R. Henderson. Applications of Linear Models in Animal Breeding. University of Guelph, 1984.
 Hotelling (1933) H. Hotelling. Analysis of a complex of statistical variables in principal components. Journal of Educational Psychology, 24:417–441, 1933.

 Indyk and Motwani (1998) P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, STOC ’98, pages 604–613, New York, NY, USA, 1998. ACM. ISBN 0897919629.
 Johnson and Lindenstrauss (1984) W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. In Conference in modern analysis and probability (New Haven, Conn., 1982), volume 26 of Contemporary Mathematics, pages 189–206. American Mathematical Society, 1984.
 Kang et al. (2008) H.M. Kang, N.A. Zaitlen, C.M. Wade, A. Kirby, D. Heckerman, MJ. Daly, and E. Eskin. Efficient control of population structure in model organism association mapping. Genetics, 178(3):1709–1723, 2008.
 Kang et al. (2010) H.M. Kang, J.H. Sul, S.K. Service, N.A. Zaitlen, S. Kong, N.B. Freimer, C. Sabatti, and E. Eskin. Variance component model to account for sample structure in genomewide association studies. Nature genetics, 42(4):348–354, 2010.
 Korte et al. (2012) A. Korte, B.J. Vilhjálmsson, V. Segura, A. Platt, Q. Long, and M. Nordborg. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nature genetics, 44(9):1066–1071, 2012.
 Lehoucq et al. (1998) R.B. Lehoucq, D.C. Sorensen, and C. Yang. ARPACK Users’ Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods, volume 6. SIAM, 1998.
 Li et al. (2005) B. Li, H. Zha, and F. Chiaromonte. Contour Regression: A General Approach to Dimension Reduction. The Annals of Statistics, 33(4):1580–1616, 2005.
 Li (1991) K.C. Li. Sliced inverse regression for dimension reduction (with discussion). J. Amer. Statist. Assoc., 86:316–342, 1991.
 Li (1992) K.C. Li. On principal Hessian directions for data visulization and dimension reduction: another application of Stein’s lemma. J. Amer. Statist. Assoc., 87:1025–1039, 1992.
 Liberty et al. (2007) E. Liberty, F. Woolfe, PG. Martinsson, V. Rokhlin, and M. Tygert. Randomized algorithms for the lowrank approximation of matrices. Proceedings of the National Academy of Sciences, 104(51):20167–20172, 2007.
 Lippert et al. (2011) C. Lippert, J. Listgarten, Y. Liu, C.M. Kadie, R.I. Davidson, and D. Heckerman. Fast linear mixed models for genomewide association studies. Nature Methods, 8(10):833–835, 2011.
 Lippert et al. (2013) C. Lippert, G. Quon, E.Y. Kang, C.M. Kadie, J. Listgarten, and D. Heckerman. The benefits of selecting phenotypespecific variants for applications of mixed models in genomics. Scientific reports, 3, 2013.
 Listgarten et al. (2012) J. Listgarten, C. Lippert, C.M. Kadie, R.I. Davidson, E. Eskin, and D. Heckerman. Improved linear mixed models for genomewide association studies. Nature methods, 9(6):525–526, 2012.
 Mahoney (2011) M.W. Mahoney. Randomized Algorithms for Matrices and Data. Foundations and Trends in Machine Learning, 3(2):123–224, 2011.
 Martinsson et al. (2010) P.G. Martinsson, A. Szlam, and M. Tygert. Normalized power iterations for the computation of SVD. NIPS workshop on low-rank methods for large-scale machine learning, Vancouver, Canada, 2010.
 Mimno et al. (2014) D. Mimno, D.M. Blei, and B.E. Engelhardt. Posterior predictive checks to quantify lackoffit in admixture models of latent population structure. arXiv preprint arXiv:1407.0050, 2014.
 Nilsson et al. (2007) J. Nilsson, F. Sha, and M.I. Jordan. Regression on Manifolds Using Kernel Dimension Reduction. In Proceedings of the 24th International Conference on Machine Learning, 2007.
 Owen and Perry (2009) A.B. Owen and P.O. Perry. Bicrossvalidation of the SVD and the nonnegative matrix factorization. The Annals of Applied Statistics, 3:564–594, 2009. doi: 10.1214/08AOAS227.
 Polson and Scott (2010) N.G. Polson and J.G. Scott. Shrink Globally, Act Locally: Sparse Bayesian Regularization and Prediction. In Bayesian Statistics 9. Oxford University Press, 2010.
 Price et al. (2011) A.L. Price, A. Helgason, G. Thorleifsson, S.A. McCarroll, A. Kong, and K. Stefansson. Singletissue and crosstissue heritability of gene expression via identitybydescent in related or unrelated individuals. PLoS Genet, 7, 02 2011.
 Pritchard and Donnelly (2001) J.K. Pritchard and P. Donnelly. Case–control studies of association in structured or admixed populations. Theoretical population biology, 60(3):227–237, 2001.
 Rokhlin et al. (2009) V. Rokhlin, A. Szlam, and M. Tygert. A Randomized Algorithm for Principal Component Analysis. SIAM J. Matrix Anal. Appl., 31(3):1100–1124, August 2009.
 Runcie and Mukherjee (2013) D.E. Runcie and S. Mukherjee. Dissecting highdimensional phenotypes with bayesian sparse factor analysis of genetic covariance matrices. Genetics, 194(3):753–767, 2013.
 Saad (1992) Y. Saad. Numerical methods for large eigenvalue problems, volume 158. SIAM, 1992.
 Sarlos (2006) T. Sarlos. Improved approximation algorithms for large matrices via random projections. In Foundations of Computer Science, 2006. FOCS ’06. 47th Annual IEEE Symposium on, pages 143 –152, October 2006.
 sequencing consortium (1999) The MHC sequencing consortium. Complete sequence and gene map of a human major histocompatibility complex. Nature, 401(6756):921–923, 1999.
 Stewart (2001) G.W. Stewart. Matrix Algorithms: Volume 2, Eigensystems, volume 2. SIAM, 2001.
 Strimmer (2008) K. Strimmer. fdrtool: a versatile R package for estimating local and tail area-based false discovery rates. Bioinformatics, 24(12):1461–1462, 2008.
 Sugiyama (2007) M. Sugiyama. Dimension reduction of multimodal labeled data by local Fisher discriminant analysis. Journal of Machine Learning Research, 8:1027–1061, 2007.
 Wu et al. (2010) Q. Wu, F. Liang, and S. Mukherjee. Localized Sliced Inverse Regression. Journal of Computational and Graphical Statistics, 19(4):843–860, 2010.
 Yang et al. (2011) J. Yang, S.H. Lee, M.E. Goddard, and P.M. Visscher. GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1):76–82, 2011.
 Yang et al. (2014) J. Yang, N.A. Zaitlen, M.E. Goddard, P.M. Visscher, and A.L. Price. Advantages and pitfalls in the application of mixed-model association methods. Nature genetics, 46(2):100–106, 2014.
 Young (1941) G. Young. Maximum likelihood estimation and factor analysis. Psychometrika, 6:49–53, 1941.
 Zhou and Stephens (2012) X. Zhou and M. Stephens. Genome-wide efficient mixed-model analysis for association studies. Nature genetics, 44(7):821–824, 2012.
 Zhou et al. (2013) X. Zhou, P. Carbonetto, and M. Stephens. Polygenic modeling with Bayesian sparse linear mixed models. PLoS genetics, 9(2):e1003264, 2013.