1 Introduction
Principal component analysis is a fundamental tool for dimensionality reduction, clustering, classification, and many more learning tasks. It is a basic preprocessing step for learning, recognition, and estimation procedures. The core computational element of PCA is performing a (partial) singular value decomposition, and much work over the last half century has focused on efficient algorithms (e.g.,
[7] and references therein) and hence on computational complexity.The recent focus on understanding highdimensional data, where the dimensionality of the data scales together with the number of available sample points, has led to an exploration of the
sample complexity of covariance estimation. This direction was largely influenced by Johnstone’s spiked covariance model, where data samples are drawn from a distribution whose (population) covariance is a lowrank perturbation of the identity matrix
[11]. Work initiated there, and also work done in [19] (and references therein) has explored the power of batch PCA in thedimensional setting with subGaussian noise, and demonstrated that the singular value decomposition (SVD) of the empirical covariance matrix succeeds in recovering the principal components (extreme eigenvectors of the population covariance) with high probability, given
samples.This paper brings the focus on another critical quantity: memory/storage. This is relevant in the socalled streaming data model, where the samples are collected sequentially, and unless we store them, they are irretrievably gone.^{1}^{1}1This is similar to what is sometimes referred to as the single pass model. The only currently available algorithms with provable sample complexity guarantees either store all samples (note that for more than a single pass over the data, the samples must all be stored) or explicitly form the empirical (typically dense) covariance matrix. Either case requires at least storage. Despite the availability of massive local and distributed storage systems, for highdimensional applications (e.g., where data points are high resolution photographs, biometrics, video, etc.), could be on the order of , making
prohibitive, if not in fact impossible to manage. Indeed, at multiple computing scales, manipulating vectors of length
is possible, when storage of is not. A typical desktop may have 1020 GB of RAM, but will not have more than a few TB of total storage. A modern smartphone may have as much as a GB of RAM, but has a few GB, not TB, of storage.We consider the streaming data setting, where data points are generated sequentially, and are never stored. In the setting of the socalled spiked covariance model (and natural generalizations) we show that a simple algorithm requiring storage – the best possible – performs as well as batch algorithms (namely, SVD on the empirical covariance matrix), with sample complexity . To the best of our knowledge, this is the only algorithm with both storage complexity and sample complexity guarantees. We discuss the connection to past work in more detail in Section 2. We introduce the model with all related details in Section 3, and present the solution to the rank 1 case, the rank case, and the perturbedrank case in Sections 4.1, 4.2 and 4.3, respectively. In Section 5 we provide simulations that not only confirm the theoretical results, but demonstrate that our algorithm works well outside the assumptions of our main theorems.
2 Related Work
Memory and computationefficient algorithms that operate on streaming data are plentiful in the literature and many seem to do well in practice. However, there is no algorithm that provably recovers the principal components in the same noise and samplecomplexity regime as the batch PCA algorithm does and maintains a provably light memory footprint. Because of the practical relevance, there has been renewed interest recently in this problem, and the fact that this is an important unresolved issue has been pointed out in numerous places, e.g., [21, 1].
A large body of work has focused on the nonstatistical data paradigm that deals with a fixed pool of samples. This includes work on online PCA and lowrank matrix approximation in the streaming scenario, including sketching and dimensionalityreduction based techniques.
OnlinePCA for regret minimization has been considered in several papers, most recently in [21], where the multiplicative weights approach is adapted for this problem (now experts correspond to subspaces). The goal there is to control the regret, improving on the natural followtheleader algorithm that performs batchPCA at each step. However, the algorithm can require memory, in order to store the multiplicative weights. A memorylight variant described in [1] typically requires much less memory, but there are no guarantees for this, and moreover, for certain problem instances, its memory requirement is on the order of .
Subsampling, dimensionalityreduction and sketching form another family of lowcomplexity and lowmemory techniques, see, e.g., [5, 13, 8]. These save on memory and computation by performing SVD on the resulting smaller matrix. The results in this line of work provide worstcase guarantees over the pool of data, and typically require a rapidly decaying spectrum (which we do not have in our setting) to produce good bounds. More fundamentally, these approaches are not appropriate for data coming from a statistical model such as the spiked covariance model. It is clear that subsampling approaches, for instance, simply correspond to discarding most of the data, and for fundamental sample complexity reasons, cannot work. Sketching produces a similar effect: each column of the sketch is a random () sum of the data points. If the data points are, e.g., independent Gaussian vectors, then so will each element of the sketch, and thus this approach again runs against fundamental sample complexity constraints. Indeed, it is straightforward to check that the guarantees presented in ([5, 8]) are not strong enough to guarantee recovery of the spike. This is not because the results are weak; it is because they geared towards worstcase bounds.
Algorithms focused on sequential SVD (e.g., [4, 3], [6],[12] and more recently [2, 9]) seek to have the best subspace estimate at every time (i.e., each time a new data sample arrives) but without performing fullblown SVD at each step. While these algorithms indeed reduce both the computational and memory burden of batchPCA, there are no rigorous guarantees on the quality of the principal components or on the statistical performance of these methods.
In a Bayesian mindset, some researchers have come up with expectation maximization approaches
[16, 18], that can be used in an incremental fashion. The finite sample behavior is not known.Stochasticapproximationbased algorithms along the lines of [15] are also quite popular, because of their low computational and memory complexity, and excellent performance in practice. They go under a variety of names, including Incremental PCA (though the term Incremental has been used in the online setting as well [10]), Hebbian learning, and stochastic power method [1]. The basic algorithms are some version of the following: upon receiving data point at time , update the estimate of the top principal components via:
(1) 
where denotes the “projection” that takes the SVD of the argument, and sets the top singular values to and the rest to zero (see [1] for further discussion).
While empirically these algorithms perform well, to the best of our knowledge  and efforts  there does not exist any rigorous finite sample guarantee for these algorithms. The analytical challenge seems to be the high variance at each step, which makes direct analysis difficult.
In summary, while much work has focused on memoryconstrained PCA, there has as of yet been no work that simultaneously provides sample complexity guarantees competitive with batch algorithms, and also memory/storage complexity guarantees close to the minimal requirement of – the memory required to store only the output. We present an algorithm that provably does both.
3 Problem Formulation and Notation
We consider a streaming model, where at each time step , we receive a point . Furthermore, any vector that is not explicitly stored can never be revisited. Now, our goal is to compute the top principal components of the data: the dimensional subspace that offers the best squarederror estimate for the points. We assume a probabilistic generative model, from which the data is sampled at each step . Specifically, we assume,
(2) 
where is a fixed matrix,
is a multivariate normal random variable, i.e.,
and vector
is the “noise” vector and is also sampled from a multivariate normal distribution, i.e.,
Furthermore, we assume that all random vectors () are mutually independent.
In this regime, it is wellknown that batchPCA is asymptotically consistent (hence recovering up to unitary transformations) with number of samples scaling as [20]. It is interesting to note that in this highdimensional regime, the signaltonoise ratio quickly approaches zero, as the signal, or “elongation” of the major axis, , is , while the noise magnitude, , scales as . The central goal of this paper is to provide finite sample guarantees for a streaming algorithm that requires memory no more than and matches the consistency results of batch PCA in the sampling regime (possibly with additional log factors, or factors depending on and ).
We denote matrices by capital letters (e.g. ) and vectors by lowercase boldface letters (). denotes the norm of ; denotes the norm of . or denotes the spectral norm of while denotes the Frobenius norm of . Without loss of generality (WLOG), we assume that: , where denotes the spectral norm of . Finally, we write for the inner product between , . In proofs the constant is used loosely and its value may vary from line to line.
4 Algorithm and Guarantees
In this section, we present our proposed algorithm and its finite sample analysis. It is a blockwise stochastic variant of the classical powermethod. Stochastic versions of the power method are already popular in the literature and are known to have good empirical performance; see [1] for a nice review of such methods. However, the main impediment to the analysis of such stochastic algorithms (as in (1)) is the potentially large variance of each step, due primarily to the highdimensional regime we consider, and the vanishing SNR.
This motivated us to consider a modified stochastic power method algorithm, that has a variance reduction step built in. At a high level, our method updates only once in a “block” and within one block we average out noise to reduce the variance.
Below, we first illustrate the main ideas of our method as well as our sample complexity proof for the simpler rank case. The rank and rank algorithms are so similar, that we present them in the same panel. We provide the rank analysis in Section 4.2. We note that, while our algorithm describes as “input,” we mean this in the streaming sense: the data are nowhere stored, and can never be revisited unless the algorithm explicitly stores them.
4.1 RankOne Case
We first consider the rank case for which each sample is generated using: where is the principal component that we wish to recover. Our algorithm is a blockwise method where all the samples are divided in blocks (for simplicity we assume that is an integer). In the st block, we compute
(3) 
Then, the iterate is updated using . Note that, can be easily computed in an online manner where operations are required per step. Furthermore, storage requirements are also linear in .
4.1.1 Analysis
We now present the sample complexity analysis of our proposed method (Algorithm 1). We show that, using samples, Algorithm 1 obtains a solution of accuracy , i.e. .
Theorem 1.
Denote the data stream by , where is generated by (2). Set the total number of iterations and the block size . Then, with probability , , where is the th iterate of Algorithm 1. That is, Algorithm 1 obtains an accurate solution with number of samples () given by:
Note that in the total sample complexity, we use the notation to suppress the extra factor for clarity of exposition, as already appears in the expression linearly.
Proof.
The proof decomposes the current iterate into the component of the current iterate, , in the direction of the true principal component (the spike) , and the perpendicular component, showing that the former eventually dominates. Doing so hinges on three key components: (a) for large enough , the empirical covariance matrix is close to the true covariance matrix , i.e., is small. In the process, we obtain “tighter” bounds for for fixed ; (b) with probability (or any other constant probability), the initial point has a component of at least magnitude along the true direction ; (c) after iterations, the error in estimation is at most where is a constant.
There are several results that we use repeatedly, which we collect here, and prove individually in the appendix.
Lemmas 4, 5 and 6. Let , and the data stream be as defined in the theorem. Then:

(Lemma 4): With probability , for a universal constant, we have:
Step (a) is proved in Lemmas 4 and 5, while Lemma 6 provides the required result for the initial vector . Using these lemmas, we next complete the proof of the theorem. We note that both (a) and (b) follow from wellknown results; we provide them for completeness.
Let , where is the component of that is perpendicular to and is the magnitude of the component of along . Note that may well change at each iteration; we only wish to show .
Now, using Lemma 5, the following holds with probability at least :
(4) 
Next, we consider the component of that is perpendicular to :
where and is the error matrix: . Using Lemma 4, (w.p. ). Hence, w.p. :
(5) 
Now, since ,
(6) 
where, follows from (4) and follows from (5) along with the fact that is an increasing function in for .
Assuming and using (6) and bounding the failure probability with a union bound, we get (w.p. )
(7) 
where and is a global constant. Inequality follows from Lemma 6; to prove , we need one final result: the following lemma shows that the recursion given by (7) decreases at a fast rate. Interestingly, the rate of decrease in error initially (for small ) might be sublinear but for large enough the rate turns out to be linear. We defer the proof to the appendix.
Lemma 2.
If for any and , we have , then,
Hence, using the above equation after updates, with probability at least , . The result now follows by noting that . ∎
Remark: Note that in Theorem 1, the probability of accurate principal component recovery is a constant and does not decay with . One can correct this by either paying a price of in storage, or in sample complexity: for the former, we can run instances of Algorithm 1 in parallel; alternatively, we can run Algorithm 1 times on fresh data each time, using the next block of data to evaluate the old solutions, always keeping the best one. Either approach guarantees a success probability of at least .
4.2 General Rank Case
In this section, we consider the general rank PCA problem where each sample is assumed to be generated using the model of equation (2), where represents the principal components that need to be recovered. Let be the SVD of where , . The matrices and are orthogonal, i.e., , and is a diagonal matrix with diagonal elements . The goal is to recover the space spanned by , i.e., . Without loss of generality, we can assume that .
Similar to the rank problem, our algorithm for the rank problem can be viewed as a streaming variant of the classical orthogonal iteration used for SVD. But unlike the rank case, we require a more careful analysis as we need to bound spectral norms of various quantities in intermediate steps and simple, crude analysis can lead to significantly worse bounds. Interestingly, the analysis is entirely different from the standard analysis of the orthogonal iteration as there, the empirical estimate of the covariance matrix is fixed while in our case it varies with each block.
For the general rank problem, we use the largestprincipalanglebased distance function between any two given subspaces:
where and represent an orthogonal basis of the perpendicular subspace to and , respectively. For the spiked covariance model, it is straightforward to see that this is equivalent to the usual PCA figureofmerit, the expressed variance.
Theorem 3.
Consider a data stream, where for every is generated by (2), and the SVD of is given by . Let, wlog, . Let,
Then, after sizeblockupdates, w.p. , . Hence, the sufficient number of samples for accurate recovery of all the top principal components is:
Again, we use to suppress the extra factor.
The key part of the proof requires the following additional lemmas that bound the energy of the current iterate along the desired subspace and its perpendicular space (Lemmas 8 and 9), and Lemma 10, which controls the quality of the initialization.
Lemmas 8, 9 and 10. Let the data stream, , , and be as defined in Theorem 3, be the variance of noise, and be the th iterate of Algorithm 1.
We provide the proof of the lemmas and theorem in the appendix.
4.3 Perturbationtolerant Subspace Recovery
While our results thus far assume has rank exactly , and is known a priori, here we show that both these can be relaxed; hence our results hold in a quite broad setting.
Let be the th step sample, with and where is the true rank of which is unknown. However, we run Algorithm 1 with rank and the goal is to recover a subspace , s.t., is contained in .
We first observe that the largestprincipal angle based distance function that we use in the previous section can directly be used for our more general setting. That is, measures the component of “outside” the subspace and the goal is to show that component is .
Now, our analysis can be easily modified to handle this more general setting as crucially our distance function does not change. Naturally, now the number of samples we require increases according to . In particular, if
then . Furthermore, if we assume (or a large enough constant ) then the initialization step provides us better distance, i.e., rather than bound if . This initialization step enables us to give tighter sample complexity as the in the numerator above can be replaced by .
5 Experiments
(a)  (b) 
(c)  (d) 
) from the spiked covariance model, with noise standard deviation
and desired accuracy . (b) Fraction of trials in which Algorithm 1 successfully recovers the principal component () in the same model, with and samples, (c) Explained variance by Algorithm 1 compared to the optimal batch SVD, on the NIPS bagofwords dataset. (d) Explained variance by Algorithm 1 on the NY Times and PubMed datasets.In this section, we show that, as predicted by our theoretical results, our algorithm performs close to the optimal batch SVD. We provide the results from simulating the spiked covariance model, and demonstrate the phasetransition in the probability of successful recovery that is inherent to the statistical problem. Then we stray from the analyzed model and performance metric and test our algorithm on real world–and some very big–datasets, using the metric of explained variance.
In the experiments for Figures 1 (a)(b), we draw data from the generative model of (2). Our results are averaged over at least independent runs. Algorithm 1 uses the block size prescribed in Theorem 3, with the empirically tuned constant of . As expected, our algorithm exhibits linear scaling with respect to the ambient dimension – the same as the batch SVD. The missing point on batch SVD’s curve (Figure 1(a)), corresponds to . Performing SVD on a dense matrix, either fails or takes a very long time on most modern desktop computers; in contrast, our streaming algorithm easily runs on this size problem. The phase transition plot in Figure 1(b) shows the empirical sample complexity on a large class of problems and corroborates the scaling with respect to the noise variance we obtain theoretically.
Figures 1 (c)(d) complement our complete treatment of the spiked covariance model, with some outofmodel experiments. We used three bagofwords datasets from [14]. We evaluated our algorithm’s performance with respect to the fraction of explained variance metric: given the matrix output from the algorithm, and all the provided samples in matrix , the fraction of explained variance is defined as . To be consistent with our theory, for a dataset of samples of dimension , we set the number of blocks to be and the size of blocks to in our algorithm. The NIPS dataset is the smallest, with documents and K words and allowed us to compare our algorithm with the optimal, batch SVD. We had the two algorithms work on the document space () and report the results in Figure 1(c). The dashed line represents the optimal using samples. The figure is consistent with our theoretical result: our algorithm performs as well as the batch, with an added factor in the sample complexity.
Finally, in Figure 1 (d), we show our algorithm’s ability to tackle very large problems. Both the NY Times and PubMed datasets are of prohibitive size for traditional batch methods – the latter including million documents on a vocabulary of thousand words – so we just report the performance of Algorithm 1. It was able to extract the top components for each dataset in a few hours on a desktop computer. A second pass was made on the data to evaluate the results, and we saw 710 percent of the variance explained on spaces with .
References
 [1] R. Arora, A. Cotter, K. Livescu, and N. Srebro. Stochastic optimization for PCA and PLS. In 50th Allerton Conference on Communication, Control, and Computing, Monticello, IL, 2012.
 [2] L. Balzano, R. Nowak, and B. Recht. Online identification and tracking of subspaces from highly incomplete information. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, page 704–711, 2010.
 [3] M. Brand. Fast lowrank modifications of the thin singular value decomposition. Linear algebra and its applications, 415(1):20–30, 2006.
 [4] Matthew Brand. Incremental singular value decomposition of uncertain data with missing values. Computer Vision—ECCV 2002, page 707–720, 2002.

[5]
Kenneth L. Clarkson and David P. Woodruff.
Numerical linear algebra in the streaming model.
In
Proceedings of the 41st annual ACM symposium on Theory of computing
, page 205–214, 2009.  [6] P. Comon and G. H. Golub. Tracking a few extreme singular values and vectors in signal processing. Proceedings of the IEEE, 78(8):1327–1343, 1990.
 [7] Gene H. Golub and Charles F. Van Loan. Matrix computations, volume 3. JHUP, 2012.
 [8] Nathan Halko, PerGunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.
 [9] J. He, L. Balzano, and J. Lui. Online robust subspace tracking from partial information. arXiv preprint arXiv:1109.3827, 2011.

[10]
Mark Herbster and Manfred K. Warmuth.
Tracking the best linear predictor.
The Journal of Machine Learning Research
, 1:281–309, 2001. 
[11]
Iain M. Johnstone.
On the distribution of the largest eigenvalue in principal components analysis.(english.
Ann. Statist, 29(2):295–327, 2001.  [12] Y. Li. On incremental and robust subspace learning. Pattern recognition, 37(7):1509–1518, 2004.
 [13] Boaz Nadler. Finite sample approximation results for principal component analysis: a matrix perturbation approach. The Annals of Statistics, page 2791–2817, 2008.
 [14] Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. Fast collapsed gibbs sampling for latent dirichlet allocation. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, page 569–577, 2008.
 [15] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, page 400–407, 1951.
 [16] Sam Roweis. EM algorithms for PCA and SPCA. Advances in neural information processing systems, page 626–632, 1998.
 [17] Mark Rudelson and Roman Vershynin. Smallest singular value of a random rectangular matrix. Communications on Pure and Applied Mathematics, 62(12):1707–1739, 2009.
 [18] Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
 [19] R. Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, page 1–32, 2010.
 [20] Roman Vershynin. Introduction to the nonasymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
 [21] Manfred K. Warmuth and Dima Kuzmin. Randomized online PCA algorithms with regret bounds that are logarithmic in the dimension. Journal of Machine Learning Research, 9:2287–2320, 2008.
Appendix A Lemmas from Section 4.1
We first give the statement of all the Lemmas whose proofs we omitted in the body of the paper. Then we provide some results from the literature – what we call Preliminaries – and then we prove Theorem 3 and the supporting lemmas.
Lemma 4.
Let , and the data stream be as defined in Theorem 1. Then, w.p. we have:
Lemma 5.
Lemma 6.
Let be the initial guess for , given by Steps 1 and 2 of Algorithm 1. Then, w.p. : , where is a universal constant.
Lemma 7.
If for any and , we have , then,
Appendix B Lemmas from Section 4.2
Lemma 8.
Lemma 9.
Let , , , , be as defined in Lemma 8. Then, w.p. , .
Lemma 10.
Let be sampled uniformly at random from the set of all dimensional subspaces (see Initialization Steps of Algorithm 1). Then, w.p. at least : , where is a global constant.
Appendix C Preliminaries
Lemma 11 (Lemma 5.4 of [20]).
Let A be a symmetric matrix, and let be an net of for some . Then,
Lemma 12 (Proposition 2.1 of [19]).
Consider independent random vectors in ,
, which have subGaussian distribution with parameter
. Then for every with probability at least one has,Lemma 13 (Corollary 3.5 of [20]).
Let be an matrix whose entries are independent standard normal random variables. Then for every , with probability at least one has,
Lemma 14 (Theorem 1.2 of [17]).
Let be independent centered real random variables with variances at least
and subgaussian moments bounded by
. Let be an matrix whose rows are independent copies of the random vector . Then for every one haswhere and depend only on . Note that for the standard Gaussian variables.
Lemma 15.
Let be i.i.d. standard multivariate normal variables. Also, are also i.i.d. normal variables and are independent of . Then, w.p. ,
Proof.
Let and let . Then, the goal is to show that, the following holds w.p. : for all s.t. .
We prove the lemma by first showing that the above mentioned result holds for any fixed vector and then use standard epsilonnet argument to prove it for all .
Appendix D Proof of Theorem 3
Recall that our algorithm proceeds in a blockwise manner; for each block of samples, we compute
(11) 
where is the
th block iterate and is an orthogonal matrix, i.e.,
. Given , the next iterate, , is computed by the QRdecomposition of . That is,(12) 
where is an uppertriangular matrix.
Proof.
By using update for (see (11), (12)):
(13) 
where . That is,
(14) 
where is an orthogonal basis of the subspace orthogonal to . Now, let be the singular vector corresponding to the largest singular value, then:
(15) 
where . follows as is an orthogonal matrix and form a complete orthogonal basis; follows by using (13). The existence of follows using Lemma 8 along with the fact that , where is the singular vector of corresponding to its smallest singular value, .
Comments
There are no comments yet.