Principal component analysis (PCA) is one of the most fundamental tools for feature extraction in machine learning problems such as data compression, classification, clustering, image processing and visualization.
Given a data set with samples and features, PCA can be easily performed through eigen decomposition of the sample covariance matrix
or the singular value decomposition of the data matrix, and the theoretical guarantee is established by matrix Bernstein theory [22, 21]. Throughout the paper we assume that has mean zero for notation simplicity. For X with large and , we may not be able to store matrices of size or due to limited memories on small devices such as laptops and mobile phones, and it is often expensive to conduct multiple passes over the data. For these cases, streaming PCA algorithms [16, 7, 18, 17, 8, 13, 6, 5, 15, 14] come into play, with the nice property that they only need to use or memory to store the current sample or the current block of samples.
While these streaming PCA algorithms resolve the memory issues, they often suffer from slow convergence in practice. The main reason is that most streaming PCA algorithms in the past decades do not fully utilize information in previously streamed data, which came in sequentially and contain valuable information. This motivates us to develop a new algorithm that utilizes past information effectively without taking much memory space.
Specifically, our new algorithm can compute the top- principal components in the streaming setting with memory requirement, where is the block size and can be pretty small (e.g., or ). The convergence speed of the new algorithm is much faster than existing approaches mainly due to effective use of past information, and by setting the number of inner iterations to one, the memory cost of our algorithm can be reduced to . The resulting algorithm is similar to Oja’s algorithm but with the ability to compute the top- principal components altogether without the need to tune the step size as in Oja’s algorithm. We provide theoretical justification of our algorithm.
Experimental results show that our algorithm consistently outperforms existing algorithms in the streaming setting. Moreover, we test the ability of our algorithm to compute PCA on massive data sets. Surprisingly it outperforms existing streaming and non-streaming algorithms (including VR-PCA and power method) in terms of both the number of data passes and real run time.
The rest of the paper is outlined as follows. First, we discuss and compare current algorithms on streaming PCA. Then, we propose the History PCA algorithm with theoretical guarantees. Finally, we provide experiments on both simulated streaming data sets and real world data sets.
2 Related Work
There are many existing methods to estimate the topeigenvectors of the true covariance matrix. It was shown by Adamczak et al. 
that the sample covariance matrix can guarantee estimation of the true covariance matrix with a fixed error in the operator norm for sub-exponential distributions, if the sample size is
. Later, this statement is generalized to distributions with finite fourth moment and the true eigenvectors of the distribution can be recovered with high probability by the singular value decomposition of the sample covariance matrix formed bysamples .
2.1 Non-streaming PCA Algorithms
When the full data set is given in advance, the optimal way to estimate principal components is to conduct eigen decomposition on the sample covariance matrix . However, it is often impossible to form the sample covariance matrix for large-scale data sets. So we need to compute in the power method without explicitly forming the sample covariance matrix.
proposed the VR-PCA method that outperforms power method for PCA computation. The main idea is to reformulate PCA as a stochastic optimization problem and apply the variance reduction techniques. Although VR-PCA looks similar to streaming PCA, it cannot obtain the first update until the second pass. So it is not suitable in the streaming setting.
Since the entire data are often too large to be stored on the hard disk, algorithms that are specifically designed for streaming data have become a research focus in the past few decades.
2.2 Streaming PCA Algorithms
A classical algorithm for streaming PCA is Oja’s algorithm . It can be viewed as the traditional stochastic gradient decent method, in which each data point that come into memory can be viewed as a random sample drawn from an underlying distribution. In 2013, Balsubramani et al.  modified Oja’s algorithm with a special form of step size and provided the statistical analysis of this algorithm. Later on in 2015, Jin et al. 
proposed a faster algorithm to find the top eigenvector based on shift and invert framework with improved sample complexity. Although this algorithm seem to have good sample complexity, it requires the initialization vector be within certain distance to the true top eigenvector, which is costly to attain and hard to achieve when the dimensionis very high. More recently, Jain et al.  proved that Oja’s algorithm can achieve the optimal sample complexity for the first eigenvector under certain choices of step size, in the sense that it matches the matrix Bernstein inequality [22, 21]. However, the step sizes depend on some data-related constants that cannot be estimated beforehand. So in practice one still needs to test different step sizes and choose the best one. Under a similar assumption on the step size, Li et al.  recently proved that the streaming PCA algorithm matches the minimax information rate, and Allen-Zhu and Li  extended such optimality to the k-subspace version and proposed the Oja algorithm. Recently, Sa et al. 
showed the global convergence of Alecton algorithm, which is another variant of stochastic gradient descent algorithm applied to a non-convex low-rank factorized problem. In our experiments, we compare our algorithms with Oja’s algorithm and Ojaalgorithm based on various choices of step sizes and show that our algorithm outperforms all of them.
Mitliagkas et al.  introduced the block stochastic power method. This method uses each block of data to construct an estimator of the covariance matrix and applies power method in each iteration to update eigenvectors. Theoretical guarantees are provided under the setting of the spiked data model. More recently, Li et al.  extended the work of  by using blocks of variable size and progressively increasing the block size as the estimate becomes more accurate. Later on, Hardt and Price  gave a robust convergence analysis of a more general version of block stochastic power method for arbitrary distributions.
The aforementioned works are all in the setting of streaming i.i.d. data samples. There is another line of research for an arbitrary sequence of data. Most works [13, 6, 5, 15, 14] use the technique of computing a sketch of matrix to find an estimator of the top eigenvector.
Our algorithm is under the setting of streaming i.i.d. data samples and it differs from other streaming algorithms in the sense that our History PCA algorithm exploits key summaries in the historical data to achieve better performance.
3 Problem Setting and Background
Let be a sequence of -dimensional data that stream into memory. Assume all the ’s are sampled i.i.d from a distribution with zero-mean and a fixed covariance matrix . Let and
denote the eigenvalues and corresponding eigenvectors of.
In this paper, our goal is to recover the top eigenvectors. In the case of , our goal is to find a unit vector that is within the -neighborhood of the first true eigenvector . In the general case with , we use the the largest-principal-angle-based distance function as the metric to evaluate our algorithm, which is
for any . Here denotes the orthogonal basis of the perpendicular subspace to the subspace spanned by matrix .
Our algorithm is motivated by the block stochastic power method . While the classical power method keeps on multiplying the vector with a fixed sample covariance matrix at each iteration (), the block stochastic power method  updates the current estimate by
where is a sample covariance matrix formed by the block of data points that stream into memory at iteration . Thus, the block stochastic power method can be viewed as a stochastic variant of the power method, and using a larger block size at each iteration can reduce the potentially large variance by averaging out the noises.
The block stochastic power method has several limitations. First, the block size needs to be very large in order for this algorithm to work. That is mainly because the algorithm requires the sample covariance matrix formed by each block of data points to be very close to the true covariance matrix, which is difficult to achieve if the block size is not large enough. And this block size depends highly on the target accuracy . Secondly, the block stochastic power method cannot converge to the true eigenvectors of . If the block size is fixed, it can only converge to the -approximated solution, where .
Our algorithm aims at solving the above problems by using more past information. Instead of doing only one matrix-vector multiplication at each step after a block of data stream in, we do matrix-vector multiplication iteratively either until the algorithm converges or for a pre-specified number of times. This strategy gives us good results at very early stage. Moreover, with the use of past information, we could further reduce the variance of the estimates. Since intuitively there is no reason to favor a particular past block, we assign equal weights to all data samples (or blocks) in each iteration. Details are provided below.
4 Proposed Algorithms
Since data come into memory sequentially, we can form blocks of sample, each with size , as data streaming in, and denote the th data block by a matrix , where is a data point of dimension for .
4.1 Main algorithm: History PCA (Rank-)
We first provide the method and intuition of our History PCA in the rank- case.
At time , we have the first block of data . Since we have no past information, we can make use of the sample covariance matrix formed by the first block of data point, , to estimate the true covariance matrix and its eigenvectors. But instead of finding the eigenvectors of , we try to find the eigenvectors of in the first step, which is mainly for our convergence theory to go through. In the rank- case, we only need to save the first eigenvector of .
At time , we have the second block of data . If we could have both blocks of data in memory, then the best we can do is to form the sample covariance matrix . However, limited memory space prohibits us from storing the entire first block of data. Thus, we need to find an alternative scheme to extract key information from the first block so that the estimator at time could be better than just using the sample covariance matrix . For simplicity of illustration we assume . An intuitive idea is to form a rank- matrix and make use of it in our estimator. Intuitively, we should assign equal weights to the information we get from both blocks of data. Thus, the updated estimator is . Now we only need to save the first eigenvector of this estimator.
At time , we have the th block of data . Similarly, we could get a new estimator of the covariance matrix by exploiting the history from the past blocks of data. Intuitively, each block of data should have equal weights. Thus, we assign the weight and , respectively, to the sample covariance matrix formed by and the rank- matrix formed by the past blocks of data. Therefore, the new covariance estimator at time is
where is the eigenvector at time .
At the final time , we have seen in total blocks of data and we output . We summarize this algorithm in Algorithm 1.
4.2 Solving each subproblem approximately
In Algorithm 1, at each iteration we run the power method on the current covariance estimator, which is equation (3), to get the first eigenvector. This can be viewed as a subproblem. However, it will in theory require infinite number of iterations to get the exact eigenvector. Simulations show that using exact convergence have similar results as using power method with small iterations. So instead of solving the sub-problem exactly, we solve it approximately by performing matrix vector multiplication times in the for loops. Note that the space complexity of our algorithm is . However, when we set (only one power iteration at each step), Algorithm 1 will only require memory. This gives us the flexibility to reduce the memory size to if memory is not enough.
4.3 Extending to rank- History PCA
Our method can be generalized to find the first eigenvectors, as shown in Algorithm 2. In the rank- case, we can simply replace the -dimensional vector with a matrix
and then replace the normalization update by the QR decomposition.
In the rank- case, we make use of the history information of eigenvalues. So the new covariance estimator at time , which we utilize to update our estimate, is
where is a matrix of eigenvectors and is a diagonal matrix with corresponding eigenvalues as the diagonal elements at time .
Similarly, in the generalized History PCA algorithm, we can replace the exact solver by an approximate solver (with 1 or a fixed number of iterations ) to find . We can further reduce the memory complexity of our algorithm from to when we set .
4.4 Theoretical Analysis
Now we provide the convergence theorem for the proposed algorithm. Let be the sample covariance matrix formed by -th block of data, for , that satisfies the following assumptions:
, where is a symmetric PSD matrix.
with probability .
Let denote the eigenvectors of and denote the corresponding eigenvalues. Our goal is to compute an -approximation to which is a unit vector satisfying , where denotes the sin of the angle between and . In the following, we prove that our History PCA algorithm can achieve accuracy with
after seeing data blocks.
Jain et al.  proved the convergence of their algorithm by directly analyzing the convergence of Oja’s algorithm as an operator on the initialized vector rather than analyzing the error reduced after each update. Oja’s algorithm applies the matrix
to the random initial vector, say , and then output the normalized result
Inspired by their approach, we apply the following trick to prove the theoretical guarantees of our algorithm. If we set , our algorithm can also be viewed as applying on the initialized vector . In our algorithm, we have and for . We summarize the main convergence theorem for the History PCA algorithm here. The detailed proof can be found in the appendix.
Fix any . Let be the smallest integer such that Let . Under the assumption that is invertible, the History PCA algorithm (rank- with ) converges to an -accurate solution with
with probability at least .
To prove our main Theorem 1, we need to introduce the following lemma.
Assume is invertible for any positive integer . Let be a non-negative integer. Let be such that for any positive integer . Let be a unit vector, and let be a matrix whose columns form an orthonormal basis of the subspace orthogonal to . If is chosen uniformly at random from the surface of the unit sphere, then with probability at least ,
where is an absolute constant.
Define the step sizes for all positive integer . We can see that
Applying on gives us
where is some absolute constant. Therefore, Lemma 1 can be interpreted as a relationship between the error bound after iterations of Oja’s algorithm and the error bound of applying the last iterations on a randomly chosen initial vector if we totally dump the result from the first iterations. With Lemma 1 and a carefully selected constant integer , we can prove Theorem 2, which implies Theorem 1 directly.
Fix any . Let be the smallest integer such that Assume is invertible, the output of the History PCA algorithm (rank- with ) satisfies:
with probability at least .
5 Experimental Results
In this section we conduct numerical experiments to compare our algorithm with previous streaming PCA approaches using synthetic streaming data as well as real world data listed in Table 1.
5.1 Choice of for History PCA
In Figure 1, we compare the performances of our History PCA algorithm with different choices of (number of inner iterations) on NIPS data set with . Here the accuracy is the explained variance of the algorithm output. We can see that when , History PCA’s performances are equally well. And when , its performance is slightly worse. Thus, a small is already good enough for our algorithm. In the following experiments, we will just set in our algorithm to compare with other existing streaming PCA methods.
5.2 Simulated Streaming Data
In this section, we compare the proposed History PCA algorithm with state-of-the-art streaming PCA algorithms: block power method, DBPCA method, Oja’s algorithm and Oja algorithm.
We set Oja’s algorithm and Oja algorithm with step sizes , where is the number of iterations and is the tuning parameter. We obtain the results for but only include the best three values in Figure 2 and Figure 3. The same experimental setting in  is used in order to compare our methods with theirs. Here we tried
and varied the noise standard deviationfor the simulated data sets. Thus, , where is the true orthonormal matrix we are trying to recover.
We consider two scenarios: and . For each scenario, we try to recover the top eigenvectors of the data sets respectively. We also vary the block size . The performances of the algorithms are evaluated based on the largest-principal-angle metric, which measures the distances between the estimated principal vectors and the ground truth.
Figure 2 shows the results for and Figure 3 shows the results for . Both figures suggest that the History PCA method is the best for the case of recovering the top eigenvector, as well as for cases recovering additional top eigenvectors, regardless of the block size, noise level, and whether is large or small. What’s more, the error from the History PCA method continues to decrease as the sample size of streaming data increases, while some other streaming methods stop improving at a very early stage. From the simulations, we conclude that the performance of the block power method depends highly on the block size . The performance of the block power method improves as increases. But the History PCA method does not depend on the block size, which is an advantage over the block power method. DBPCA’s performance is sometimes much better than the block power method and outperforms the Oja’s algorithm and Oja algorithm in most scenarios. But it is less stable and less accurate than our History PCA method. The performance of Oja’s algorithm and Oja algorithm depends highly on the tuning of step size. There is no universal good step size for Oja’s algorithm and Oja algorithm. The History PCA method does not need to tune the step size but yields superior performance than those of the best tuned Oja’s algorithm and Oja algorithm in all scenarios.
5.3 Large-scale real data
|# samples||# features||# nonzeroes|
We compare our algorithm with Oja’s algorithm, VR-PCA and noisy power method [16, 7] on four real data sets. We do not include the Oja algorithm as we observe that its performance is almost identical to the Oja’s algorithm. NIPS and NYTimes data sets are downloaded from UCI data. RCV1 and KDDB data sets are downloaded from LIBSVM data sets. A summary of our data sets is provided in Table 1.
We first consider the streaming setting, where the algorithms can only go through the data once. For a fair comparison, we set Oja’s algorithm with step sizes , where is number of iterations and ranges from to in all the experiments. The best result of the Oja’s algorithm is presented in the plots. We also compare the algorithms with different block size and at different target rank . In practice we find Oja’s algorithm () often numerically unstable and cannot achieve the comparable performance with other approaches, so we do not include their result for . To evaluate the results, we use the widely used metric “explained variance”,
to measure the quality of the solution (where is the computed solution). Note that the value is always less or equal to 1, and will be maximized when is the leading eigenvectors of . In Figure 4, we observe that our algorithm (History PCA) is very stable and consistently better in all settings (big or small , big or small ).
5.4 Comparing streaming with non-streaming algorithms
In the final set of experiments, we go beyond the streaming setting and allow several passes of data. We want to test (1) In addition to having fewer number of data access, can our algorithm also have short running time in practice? (2) If we go beyond the first pass of the data, can our algorithm keep on improving the solution by making more passes? (3) In practice, if there is a large data set with millions or billions of data points and we want to compute PCA, should we use our algorithm instead of other batch PCA algorithms?
In order to answer these questions, we test our algorithm on two large data sets: RCV1 and KDDB (with 20 million samples and 30 million features). Also, we add two strong batch algorithms into comparison, including power method and VR-PCA , which are the state-of-the-art PCA solvers. In order to clearly show the convergence speed, we compute the “error” by the largest eigenvalue minus the unnormalized explained variance , so ideally the error will converge to 0 (we are doing this in order to show the error in log-scale in the y-axis). The results are presented in Figure 5. In Figure (a)a and (b)b, we show the number of data passes versus errors, and the vertical line means the end of the first data pass. We observe that even within the first data pass, the error of our algorithm can be bounded by and after the first pass our algorithm is still able to improve the solution. We also observe that VR-PCA achieves better performance with more passes of data and sufficient time while we argue that an error of order is accurate enough for practical use.
Finally we compare the real run time with all the other algorithms. In Figure (c)c, since the data set is not that large, we store the data set in memory and perform in-memory data access, while for KDDB data (Figure (d)d), since the data size is larger than memory, we load data block by block and compare the overall running time. The results show that our algorithm is still much better than other methods in terms of run time. The only real competitor, again, is VR-PCA, which will catch up our algorithm when reaching an error of order . In the future, it will be interesting to apply a variance reduction technique to speed up the convergence of our algorithm after the first pass of data.
We propose History PCA, a hyperparameter-free streaming PCA algorithm which utilizes past information effectively. We extend our algorithm to solve the rank-PCA problems with . What’s more, we provide the theoretical guarantees and convergence rate for our algorithm, and we show that the convergence speed is faster than existing algorithms on both synthetic and real data sets.
- Adamczak et al.  R. Adamczak, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann. 2010. Quantitative estimates of the convergence of the empirical covariance matrix in log-concave ensembles. Journal of the American Mathematical Society 234 (2010), 535–561.
- Allen-Zhu and Li  Z. Allen-Zhu and Y. Li. 2017. First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate. arXiv:1607.07837 (2017).
- Balsubramani et al.  A. Balsubramani, S. Dasgupta, and Y. Freund. 2013. The fast convergence of incremental pca. Advances in Neural Information Processing Systems (2013), 3174–3182.
Christos Boutsidis 
Peilin Zhong Christos Boutsidis, David
P. Woodruff. 2016.
Optimal Principal Component Analysis in Distributed
and Streaming Models.
in Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, 236–249.
- Christos Boutsidis  Zohar Karnin Edo Liberty Christos Boutsidis, Dan Garber. 2015. Online Principal Components Analysis. in Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, 887–901.
- Hardt and Price  M. Hardt and E. Price. 2014. The noisy power method: A meta algorithm with applications. In NIPS.
- Jain et al.  P. Jain, C. Jin, S. Kakade, P. Netrapali, and A. Sidford. 2016. Streaming PCA: Matching Matrix Bernstein and Near-Optimal Finite Sample Guarantees for Oja’s Algorithm. In COLT.
- Jin et al.  C. Jin, S. M. Kakade, C. Musco, P. Netrapalli, and A. Sidford. 2015. Robust shift-and-invert precondi- tioning: Faster and more sample efficient algorithms for eigenvector computation. arXiv:1510.08896 (2015).
- Johnson and Zhang  R. Johnson and T. Zhang. 2013. Accelerating stochastic gradient descent using predictive variance reduction. Conference on Neural Information Processing Systems (2013).
- Li et al.  C. Li, H. Lin, and C. Lu. 2015. Rivalry of Two Families of Algorithms for Memory-Restricted Streaming PCA. arXiv:1506.01490 (2015).
- Li et al.  C. Li, M. Wang, H. Liu, and T. Zhang. 2017. Near-Optimal Stochastic Approximation for Online Principal Component Estimation. arXiv:1603.05305 (2017).
- Liberty  Edo Liberty. 2013. Simple and Deterministic Matrix Sketching. in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 581–588.
- Mina Ghashami  Jeff M. Phillips Mina Ghashami. 2014. Relative Errors for Deterministic Low-Rank Matrix Approximations. in Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms (2014), 707–717,.
- Mina Ghashami  Jeff M. Phillips David P. Woodruff Mina Ghashami, Edo Liberty. 2015. Frequent Directions: Simple and Deterministic Matrix Sketching. SIAM J. Comput. 45, 5 (2015), 1762–1792.
- Mitliagkas et al.  I. Mitliagkas, C. Caramanis, and P. Jain. 2013. Memory Limited, Streaming PCA. In NIPS.
- Oja  E. Oja. 1982. On the distribution of the largest eigenvalue in principal components analysis. Journal of Mathematical Biology 15, 3 (1982), 267–273.
- Sa et al.  Christopher De Sa, Kunle Olukotun, and Christopher Re. 2015. Global convergence of stochastic gradient descent for some nonconvex matrix problems. In ICML.
- Shamir  O. Shamir. 2015. A stochastic PCA and SVD algorithm with an exponential convergence rate. In ICML.
- Shamir  O. Shamir. 2016. Fast Stochastic Algorithms for SVD and PCA: Convergence Properties and Convexity. In ICML.
- Tropp  Joel A Tropp. 2012. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics 12, 4 (2012), 389–434.
- Vershynin  Roman Vershynin. 2010. Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027 (2010).
Appendix A Proof of Lemma 2
Since with and , we know that By Lemma 6 of , we know that with probability at least ,
where is an absolute constant. Since
where is an absolute constant. ∎
Appendix B Proof of Theorem 3
Define the step sizes for all positive integer . Then we have With , we have Thus, satisfy the conditions in Theorem 3.1 of . Define . So, we have a bound for the error
Since , we have
Thus, we can conclude that .
Besides, we have Therefore,
for some constant . Since , we have