History PCA: A New Algorithm for Streaming PCA

02/15/2018 · Puyudi Yang et al. · University of California, Davis

In this paper we propose a new algorithm for streaming principal component analysis. With limited memory, small devices cannot store all the samples in the high-dimensional regime. Streaming principal component analysis aims to find the k-dimensional subspace that explains the most variation of the d-dimensional data points that come into memory sequentially. In order to deal with large d and large N (the number of samples), most streaming PCA algorithms update the current model using only the incoming sample and then discard it immediately to save memory. However, the information contained in previously streamed data can still be useful. Motivated by this idea, we develop a new streaming PCA algorithm, called History PCA, that exploits this information. Using O(Bd) memory, with B ≈ 10 being the block size, our algorithm converges much faster than existing streaming PCA algorithms. By changing the number of inner iterations, the memory usage can be further reduced to O(d) while maintaining a comparable convergence speed. We provide theoretical guarantees for the convergence of our algorithm along with its rate of convergence. We also demonstrate on synthetic and real-world data sets that our algorithm compares favorably with other state-of-the-art streaming PCA methods in terms of convergence speed and performance.


1 Introduction

Principal component analysis (PCA) is one of the most fundamental tools for feature extraction in machine learning problems such as data compression, classification, clustering, image processing and visualization.

Given a data matrix X ∈ R^{N×d} with N samples and d features, PCA can be performed through the eigen decomposition of the sample covariance matrix (1/N) X^T X or through the singular value decomposition of X, and the theoretical guarantee is established by matrix Bernstein theory [22, 21]. Throughout the paper we assume the data have mean zero for notational simplicity. When N and d are both large, we may not be able to store matrices of size N × d or d × d due to the limited memory on small devices such as laptops and mobile phones, and it is often expensive to conduct multiple passes over the data. For these cases, streaming PCA algorithms [16, 7, 18, 17, 8, 13, 6, 5, 15, 14] come into play, with the nice property that they only need O(d) or O(Bd) memory to store the current sample or the current block of samples.
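For reference, the batch computation described above can be sketched in a few lines of NumPy (an illustrative snippet with names of our choosing, not code from the paper); it makes explicit the d × d covariance matrix whose storage cost motivates the streaming setting.

```python
import numpy as np

def batch_top_k_pca(X, k):
    """Top-k principal components of a zero-mean N x d data matrix X.

    Baseline batch route: form the d x d sample covariance explicitly,
    which is exactly what becomes infeasible when d is large.
    """
    N, d = X.shape
    cov = X.T @ X / N                       # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order], eigvals[order]

# Equivalent route via the SVD of X (no covariance matrix formed):
#   U, S, Vt = np.linalg.svd(X, full_matrices=False)
#   the top-k components are the first k rows of Vt, with eigenvalues S[:k]**2 / N.
```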

While these streaming PCA algorithms resolve the memory issues, they often suffer from slow convergence in practice. The main reason is that most streaming PCA algorithms of the past decades do not fully utilize the information in previously streamed data, which arrived sequentially and contains valuable information. This motivates us to develop a new algorithm that utilizes past information effectively without taking up much memory space.

Specifically, our new algorithm can compute the top-k principal components in the streaming setting with an O(Bd) memory requirement, where B is the block size and can be fairly small (e.g., B ≈ 10). The convergence speed of the new algorithm is much faster than that of existing approaches, mainly due to the effective use of past information, and by setting the number of inner iterations to one, the memory cost of our algorithm can be reduced to O(d). The resulting algorithm is similar to Oja's algorithm, but it can compute the top-k principal components all at once without the need to tune a step size as in Oja's algorithm. We provide theoretical justification for our algorithm.

Experimental results show that our algorithm consistently outperforms existing algorithms in the streaming setting. Moreover, we test the ability of our algorithm to compute PCA on massive data sets. Surprisingly, it outperforms existing streaming and non-streaming algorithms (including VR-PCA and the power method) in terms of both the number of data passes and real run time.

The rest of the paper is outlined as follows. First, we discuss and compare current algorithms on streaming PCA. Then, we propose the History PCA algorithm with theoretical guarantees. Finally, we provide experiments on both simulated streaming data sets and real world data sets.

2 Related Work

There are many existing methods to estimate the top k eigenvectors of the true covariance matrix. It was shown by Adamczak et al. [2] that the sample covariance matrix can estimate the true covariance matrix with a fixed error in the operator norm for sub-exponential distributions, provided the sample size grows linearly in the dimension d. Later, this statement was generalized to distributions with a finite fourth moment, and the true eigenvectors of the distribution can be recovered with high probability by the singular value decomposition of a sample covariance matrix formed by O(d log d) samples [22].

2.1 Non-streaming PCA Algorithms

When the full data set is given in advance, the standard way to estimate principal components is to conduct an eigen decomposition of the sample covariance matrix A = (1/N) X^T X. However, it is often impossible to form this d × d matrix explicitly for large-scale data sets, so the power method has to compute the product A w without explicitly forming the sample covariance matrix, for example as (1/N) X^T (X w).
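As a concrete illustration of this point, the following sketch (ours, for illustration only) runs the power method using only products with X, so the d × d covariance matrix is never formed.

```python
import numpy as np

def power_method_top_eigvec(X, n_iters=100, rng=None):
    """Power method for the top eigenvector of A = (1/N) X^T X.

    Each iteration evaluates A @ w as X.T @ (X @ w) / N, which costs
    O(nnz(X)) and never materializes the d x d covariance matrix.
    """
    rng = np.random.default_rng(rng)
    N, d = X.shape
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for _ in range(n_iters):
        w = X.T @ (X @ w) / N
        w /= np.linalg.norm(w)
    return w
```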

Recently, Shamir [19, 20] proposed the VR-PCA method, which outperforms the power method for PCA computation. The main idea is to reformulate PCA as a stochastic optimization problem and apply variance reduction techniques [10]. Although VR-PCA looks similar to streaming PCA, it cannot produce its first update until the second pass over the data, so it is not suitable in the streaming setting.

Since the entire data set is often too large to be stored on disk, algorithms that are specifically designed for streaming data have become a research focus in the past few decades.

2.2 Streaming PCA Algorithms

A classical algorithm for streaming PCA is Oja's algorithm [17]. It can be viewed as traditional stochastic gradient descent, in which each data point that comes into memory is treated as a random sample drawn from an underlying distribution. In 2013, Balsubramani et al. [4] modified Oja's algorithm with a special form of step size and provided a statistical analysis of the resulting algorithm. Later, in 2015, Jin et al. [9] proposed a faster algorithm to find the top eigenvector based on the shift-and-invert framework with improved sample complexity. Although this algorithm has good sample complexity, it requires the initialization vector to be within a certain distance of the true top eigenvector, which is costly to attain and hard to achieve when the dimension d is very high. More recently, Jain et al. [8] proved that Oja's algorithm can achieve the optimal sample complexity for the first eigenvector under certain choices of step size, in the sense that it matches the matrix Bernstein inequality [22, 21]. However, the step sizes depend on data-related constants that cannot be estimated beforehand, so in practice one still needs to test different step sizes and choose the best one. Under a similar assumption on the step size, Li et al. [12] recently proved that the streaming PCA algorithm matches the minimax information rate, and Allen-Zhu and Li [3] extended such optimality to the k-subspace version and proposed the Oja++ algorithm. Recently, Sa et al. [18] showed the global convergence of the Alecton algorithm, another variant of stochastic gradient descent applied to a non-convex low-rank factorized problem. In our experiments, we compare our algorithm with Oja's algorithm and the Oja++ algorithm under various choices of step sizes and show that our algorithm outperforms all of them.
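For readers unfamiliar with Oja's update, the following is a minimal sketch of the rank-1 version with the commonly used c/t step-size schedule; the constant c is exactly the tuning parameter discussed above, and the helper name is ours.

```python
import numpy as np

def oja_rank1(stream, d, c=1.0, rng=None):
    """Rank-1 Oja's algorithm: w <- normalize(w + eta_t * x * (x^T w)).

    `stream` yields d-dimensional samples; eta_t = c / t is a typical
    step-size schedule whose constant c must be tuned.
    """
    rng = np.random.default_rng(rng)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for t, x in enumerate(stream, start=1):
        eta = c / t
        w = w + eta * x * (x @ w)   # stochastic gradient step
        w /= np.linalg.norm(w)      # project back onto the unit sphere
    return w
```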

Mitliagkas et al. [16] introduced the block stochastic power method. This method uses each block of data to construct an estimator of the covariance matrix and applies the power method in each iteration to update the eigenvectors. Theoretical guarantees are provided under the spiked data model. More recently, Li et al. [11] extended the work of [16] by using blocks of variable size, progressively increasing the block size as the estimate becomes more accurate. Later, Hardt and Price [7] gave a robust convergence analysis of a more general version of the block stochastic power method for arbitrary distributions.

The aforementioned works all assume streaming i.i.d. data samples. There is another line of research for an arbitrary sequence of data. Most of these works [13, 6, 5, 15, 14] compute a sketch of the data matrix to find an estimator of the top eigenvector.

Our algorithm operates in the setting of streaming i.i.d. data samples, and it differs from other streaming algorithms in that History PCA exploits key summaries of the historical data to achieve better performance.

3 Problem Setting and Background

Let x_1, x_2, … be a sequence of d-dimensional data points that stream into memory. Assume all the x_i's are sampled i.i.d. from a distribution with zero mean and a fixed covariance matrix Σ. Let λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d and v_1, v_2, …, v_d denote the eigenvalues and corresponding eigenvectors of Σ.

In this paper, our goal is to recover the top k eigenvectors. In the case of k = 1, the goal is to find a unit vector that is within an ε-neighborhood of the first true eigenvector v_1. In the general case with k > 1, we use the largest-principal-angle-based distance function as the metric to evaluate our algorithm:

dist(U, V) = ‖U_⊥^T V‖_2,   (1)

for any U, V ∈ R^{d×k} with orthonormal columns. Here U_⊥ denotes an orthonormal basis of the subspace perpendicular to the subspace spanned by the columns of U.
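As a quick reference, the metric in (1) can be evaluated as in the sketch below (an illustrative helper, not from the paper), using the identity ‖U_⊥^T V‖_2 = sin θ_max = sqrt(1 − σ_min(U^T V)²), which avoids forming U_⊥ explicitly.

```python
import numpy as np

def principal_angle_dist(U, V):
    """Largest-principal-angle distance ||U_perp^T V||_2 between the
    column spaces of U and V (both d x k with orthonormal columns)."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)  # cosines of the principal angles
    smin = np.clip(s.min(), -1.0, 1.0)
    return np.sqrt(max(0.0, 1.0 - smin ** 2))     # sine of the largest angle
```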

Our algorithm is motivated by the block stochastic power method [16]. While the classical power method keeps multiplying the current vector by a fixed sample covariance matrix A at each iteration (w_{t+1} = A w_t / ‖A w_t‖), the block stochastic power method [16] updates the current estimate by

w_{t+1} = A_t w_t / ‖A_t w_t‖,   (2)

where A_t = (1/B) X_t X_t^T is the sample covariance matrix formed by the block of B data points X_t that stream into memory at iteration t. Thus, the block stochastic power method can be viewed as a stochastic variant of the power method, and using a larger block size at each iteration reduces the potentially large variance by averaging out the noise.
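A compact sketch of update (2), generalized to rank k via QR re-orthonormalization, is given below (our illustrative rendering of this baseline; blocks are assumed to arrive as B × d arrays).

```python
import numpy as np

def block_stochastic_power_method(blocks, d, k=1, rng=None):
    """Block stochastic power method: apply each block's sample covariance
    to the current estimate, then re-orthonormalize via QR."""
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for X_t in blocks:              # X_t has shape (B, d)
        B = X_t.shape[0]
        S = X_t.T @ (X_t @ Q) / B   # A_t Q without forming the d x d matrix A_t
        Q, _ = np.linalg.qr(S)
    return Q
```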

The block stochastic power method has several limitations. First, the block size needs to be very large in order for the algorithm to work, mainly because the algorithm requires the sample covariance matrix formed by each block of data points to be very close to the true covariance matrix, which is difficult to achieve unless the block size is large; moreover, the required block size depends heavily on the target accuracy ε. Second, the block stochastic power method cannot converge to the true eigenvectors of Σ: if the block size B is fixed, it can only converge to an ε-approximate solution, where ε depends on B.

Our algorithm aims to solve the above problems by using more past information. Instead of doing only one matrix-vector multiplication at each step after a block of data streams in, we do matrix-vector multiplications iteratively, either until the algorithm converges or for a pre-specified number of times. This strategy gives us good results at a very early stage. Moreover, with the use of past information, we can further reduce the variance of the estimates. Since intuitively there is no reason to favor a particular past block, we assign equal weights to all data samples (or blocks) in each iteration. Details are provided below.

4 Proposed Algorithms

Since data come into memory sequentially, we form blocks of samples, each of size B, as the data stream in, and denote the t-th data block by a matrix X_t = [x_1^{(t)}, …, x_B^{(t)}], where each x_i^{(t)} is a data point of dimension d for i = 1, …, B.

4.1 Main algorithm: History PCA (Rank-1)

We first provide the method and intuition of our History PCA in the rank-1 case.

At time t = 1, we have the first block of data X_1. Since we have no past information, we make use of the sample covariance matrix formed by the first block of data points, A_1 = (1/B) X_1 X_1^T, to estimate the true covariance matrix and its eigenvectors. But instead of finding the eigenvectors of A_1, we find the eigenvectors of A_1^2 in the first step, which is mainly for our convergence theory to go through. In the rank-1 case, we only need to save the first eigenvector u_1 of this matrix.

At time t = 2, we have the second block of data X_2. If we could keep both blocks of data in memory, the best we could do would be to form the sample covariance matrix of all 2B samples. However, the limited memory space prohibits us from storing the entire first block of data. Thus, we need an alternative scheme that extracts key information from the first block, so that the estimator at time t = 2 is better than the sample covariance matrix A_2 = (1/B) X_2 X_2^T of the second block alone. For simplicity of illustration we assume k = 1. An intuitive idea is to form the rank-1 matrix u_1 u_1^T from the saved eigenvector and use it in our estimator. Intuitively, we should assign equal weights to the information from both blocks of data. Thus, the updated estimator is (1/2) u_1 u_1^T + (1/2) A_2. Now we only need to save the first eigenvector u_2 of this estimator.

At time t, we have the t-th block of data X_t. Similarly, we obtain a new estimator of the covariance matrix by exploiting the history from the past t − 1 blocks of data. Intuitively, each block of data should receive equal weight. Thus, we assign the weights 1/t and (t − 1)/t, respectively, to the sample covariance matrix A_t = (1/B) X_t X_t^T formed by X_t and the rank-1 matrix formed from the past t − 1 blocks of data. Therefore, the new covariance estimator at time t is

H_t = (1 − 1/t) u_{t−1} u_{t−1}^T + (1/t) A_t,   (3)

where u_{t−1} is the eigenvector estimate at time t − 1.

At the final time n, we have seen n blocks of data in total, and we output u_n. We summarize this procedure in Algorithm 1.

  Input: data blocks X_1, …, X_n (each d × B), block size B.
  u ← random unit vector in R^d.
  while not converged do
     u ← A_1 (A_1 u), where A_1 u is evaluated as (1/B) X_1 (X_1^T u)
     u ← u / ‖u‖
  end while
  u_1 ← u
  for t = 2, …, n do
     u ← u_{t−1}
     while not converged do
        u ← (1 − 1/t) u_{t−1} (u_{t−1}^T u) + (1/t)(1/B) X_t (X_t^T u)
        u ← u / ‖u‖
     end while
     u_t ← u
  end for
  Output: u_n
  Extension: the "while" loops can be run for a fixed number of iterations m.
Algorithm 1 History PCA: k = 1

4.2 Solving each subproblem approximately

In Algorithm 1, at each iteration we run the power method on the current covariance estimator in equation (3) to get its first eigenvector. This can be viewed as a subproblem. However, in theory it requires an infinite number of iterations to obtain the exact eigenvector. Simulations show that solving the subproblem to convergence gives results similar to running the power method for a small number of iterations. So instead of solving the subproblem exactly, we solve it approximately by performing the matrix-vector multiplication m times inside the for loop. Note that the space complexity of our algorithm is O(Bd). However, when we set m = 1 (only one power iteration at each step), Algorithm 1 only requires O(d) memory. This gives us the flexibility to reduce the memory footprint to O(d) when memory is scarce.
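For concreteness, a minimal NumPy sketch of the rank-1 procedure with m inner iterations is given below; it is our illustrative reading of Algorithm 1 rather than an optimized reference implementation, and it assumes blocks arrive as B × d arrays.

```python
import numpy as np

def history_pca_rank1(blocks, d, m=1, rng=None):
    """History PCA (rank 1) over an iterator of B x d data blocks.

    The current block gets weight 1/t and the rank-1 history term gets
    weight 1 - 1/t; m inner power iterations are run per block
    (m = 1 keeps the extra memory at O(d))."""
    rng = np.random.default_rng(rng)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    for t, X in enumerate(blocks, start=1):
        B = X.shape[0]
        u_prev = u.copy()
        for _ in range(m):
            if t == 1:
                # first block: power iteration on A_1^2, with A_1 = X^T X / B
                v = X.T @ (X @ u) / B
                u = X.T @ (X @ v) / B
            else:
                u = (1 - 1 / t) * u_prev * (u_prev @ u) + X.T @ (X @ u) / (t * B)
            u /= np.linalg.norm(u)
    return u
```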

4.3 Extending to rank-k History PCA

Our method can be generalized to find the first k eigenvectors, as shown in Algorithm 2. In the rank-k case, we simply replace the d-dimensional vector u with a d × k matrix U and replace the normalization step by a QR decomposition.

In the rank-k case, we also make use of the history information of the eigenvalues. The new covariance estimator at time t, which we use to update our estimate, is

H_t = (1 − 1/t) U_{t−1} Λ_{t−1} U_{t−1}^T + (1/t)(1/B) X_t X_t^T,   (4)

where U_{t−1} is the d × k matrix of eigenvector estimates and Λ_{t−1} is the diagonal matrix with the corresponding eigenvalue estimates on its diagonal at time t − 1.

Similarly, in the generalized History PCA algorithm, we can replace the exact solver by an approximate solver (with 1 or a fixed number m of iterations) to find U_t. We can further reduce the memory complexity of our algorithm from O(Bd) to O(kd) when we set m = 1.

  Input: data blocks X_1, …, X_n (each d × B), block size B, target rank k.
  U, R ← QR(G) for a random d × k matrix G. (QR decomposition)
  while not converged do
     S ← A_1 (A_1 U), where A_1 U is evaluated as (1/B) X_1 (X_1^T U)
     U, R ← QR(S) (QR decomposition)
  end while
  λ_j ← u_j^T ((1/B) X_1 (X_1^T u_j)) for j = 1, …, k.
  U_1 ← U, Λ_1 ← diag(λ_1, …, λ_k).
  for t = 2, …, n do
     U ← U_{t−1}
     while not converged do
        S ← (1 − 1/t) U_{t−1} Λ_{t−1} (U_{t−1}^T U) + (1/t)(1/B) X_t (X_t^T U)
        U, R ← QR(S) (QR decomposition)
     end while
     λ_j ← u_j^T H_t u_j for j = 1, …, k, with H_t as in (4).
     U_t ← U, Λ_t ← diag(λ_1, …, λ_k).
  end for
  Output: U_n
  Extension: the "while" loops can be run for a fixed number of iterations m.
Algorithm 2 History PCA: general k
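As with the rank-1 case, the rank-k procedure admits a compact sketch; the code below is our illustrative reading of Algorithm 2, with Rayleigh-quotient eigenvalue estimates as one natural choice.

```python
import numpy as np

def history_pca_rank_k(blocks, d, k, m=1, rng=None):
    """History PCA (rank k) over an iterator of B x d data blocks.

    Keeps a d x k orthonormal basis U and k eigenvalue estimates lam; the
    weighted history term (1 - 1/t) U diag(lam) U^T stands in for all
    previously seen blocks."""
    rng = np.random.default_rng(rng)
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))
    lam = np.zeros(k)
    for t, X in enumerate(blocks, start=1):
        B = X.shape[0]
        U_prev, lam_prev = U, lam

        def cov_mul(V):   # A_t V = (1/B) X^T (X V), never forms the d x d matrix
            return X.T @ (X @ V) / B

        def hist_mul(V):  # H_t V for t >= 2, as in equation (4)
            history = (1 - 1 / t) * (U_prev @ (lam_prev[:, None] * (U_prev.T @ V)))
            return history + cov_mul(V) / t

        for _ in range(m):
            S = cov_mul(cov_mul(U)) if t == 1 else hist_mul(U)
            U, _ = np.linalg.qr(S)
        HU = cov_mul(U) if t == 1 else hist_mul(U)
        lam = np.einsum('dj,dj->j', U, HU)   # Rayleigh-quotient eigenvalue estimates
    return U, lam
```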

4.4 Theoretical Analysis

Now we provide the convergence theorem for the proposed algorithm. Let A_t be the sample covariance matrix formed by the t-th block of data, for t = 1, 2, …, and assume the A_t satisfy the following assumptions:

  • E[A_t] = Σ, where Σ is a symmetric PSD matrix.

  • ‖A_t‖ ≤ M with probability 1.

  • max{‖E[(A_t − Σ)(A_t − Σ)^T]‖, ‖E[(A_t − Σ)^T(A_t − Σ)]‖} ≤ V.

Let v_1, …, v_d denote the eigenvectors of Σ and λ_1 ≥ ⋯ ≥ λ_d the corresponding eigenvalues. Our goal is to compute an ε-approximation to v_1, i.e., a unit vector u satisfying sin²(u, v_1) ≤ ε, where sin(u, v_1) denotes the sine of the angle between u and v_1. In the following, we prove that our History PCA algorithm can achieve ε accuracy after seeing

(5)

data blocks.

Jain et al. [8] proved the convergence of their algorithm by directly analyzing Oja's algorithm as an operator on the initial vector, rather than analyzing the error reduction after each individual update. Oja's algorithm applies the matrix

B_n = (I + η_n A_n)(I + η_{n−1} A_{n−1}) ⋯ (I + η_1 A_1)

to a random initial vector, say w_0, and then outputs the normalized result B_n w_0 / ‖B_n w_0‖. Inspired by their approach, we apply the following trick to prove the theoretical guarantees of our algorithm. If we set m = 1, our algorithm can also be viewed as applying such a product of operators to the initial vector w_0: the first block contributes the operator A_1², and for t ≥ 2 the update (1 − 1/t) u_{t−1} + (1/t) A_t u_{t−1} is proportional to (I + η_t A_t) u_{t−1} with η_t = 1/(t − 1). We summarize the main convergence theorem for the History PCA algorithm here; the detailed proof can be found in the appendix.

Theorem 1.

Fix any δ > 0, and let n_0 be the smallest integer satisfying the burn-in condition required by our analysis. Under the assumption that Σ is invertible, the History PCA algorithm (rank-1 with m = 1) converges to an ε-accurate solution with

(6)

with probability at least 1 − δ.

To prove our main Theorem 1, we need to introduce the following lemma.

Lemma 1.

Assume B_n is invertible for every positive integer n. Let n_0 be a non-negative integer, and let η_n, n ≥ 1, be a sequence of step sizes. Let v_1 be a unit vector, and let V_⊥ be a matrix whose columns form an orthonormal basis of the subspace orthogonal to v_1. If the initial vector w_0 is chosen uniformly at random from the surface of the unit sphere, then with probability at least 1 − δ,

where C is an absolute constant.

Define the step sizes η_t = 1/(t − 1) for all integers t ≥ 2. Applying the product of the last n − n_0 operators to a randomly chosen initial vector yields an error bound with some absolute constant C. Therefore, Lemma 1 can be interpreted as a relationship between the error bound after n iterations of Oja's algorithm and the error bound obtained by applying only the last n − n_0 iterations to a randomly chosen initial vector, i.e., by discarding the result of the first n_0 iterations entirely. With Lemma 1 and a carefully selected constant integer n_0, we can prove Theorem 2, which implies Theorem 1 directly.

Theorem 2.

Fix any δ > 0, and let n_0 be the smallest integer satisfying the burn-in condition required by our analysis. Assuming Σ is invertible, the output of the History PCA algorithm (rank-1 with m = 1) satisfies the following bound

with probability at least 1 − δ.

5 Experimental Results

In this section we conduct numerical experiments to compare our algorithm with previous streaming PCA approaches using synthetic streaming data as well as real world data listed in Table 1.

5.1 Choice of m for History PCA

In Figure 1, we compare the performance of our History PCA algorithm for different choices of m (the number of inner iterations) on the NIPS data set. Here the accuracy is the explained variance of the algorithm's output. We can see that History PCA performs almost equally well across the tested values of m, with only a slight degradation for the smallest value. Thus, a small m is already good enough for our algorithm. In the following experiments, we fix a small value of m in our algorithm when comparing with the other existing streaming PCA methods.

Figure 1: The choice of m in History PCA on the NIPS data set

5.2 Simulated Streaming Data

In this section, we compare the proposed History PCA algorithm with state-of-the-art streaming PCA algorithms: the block power method, DBPCA, Oja's algorithm and the Oja++ algorithm.

We set Oja’s algorithm and Oja algorithm with step sizes , where is the number of iterations and is the tuning parameter. We obtain the results for but only include the best three values in Figure 2 and Figure 3. The same experimental setting in [16] is used in order to compare our methods with theirs. Here we tried

and varied the noise standard deviation

for the simulated data sets. Thus, , where is the true orthonormal matrix we are trying to recover.

We consider two scenarios: and . For each scenario, we try to recover the top eigenvectors of the data sets respectively. We also vary the block size . The performances of the algorithms are evaluated based on the largest-principal-angle metric, which measures the distances between the estimated principal vectors and the ground truth.
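For reference, data from such a spiked model can be generated as in the following sketch; the Gaussian choices for the latent factors z and the noise n are an illustrative assumption on our part.

```python
import numpy as np

def make_spiked_blocks(d, k, sigma, n_blocks, block_size, rng=None):
    """Blocks of samples from the spiked model x = A z + sigma * n.

    A is a random d x k orthonormal ground truth; z and n are drawn i.i.d.
    standard normal here (an illustrative assumption)."""
    rng = np.random.default_rng(rng)
    A, _ = np.linalg.qr(rng.standard_normal((d, k)))
    blocks = []
    for _ in range(n_blocks):
        Z = rng.standard_normal((block_size, k))
        N = rng.standard_normal((block_size, d))
        blocks.append(Z @ A.T + sigma * N)   # each row is one sample x
    return A, blocks
```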

Figure 2 shows the results for d = 100 and Figure 3 shows the results for d = 1000. Both figures suggest that the History PCA method is the best for recovering the top eigenvector, as well as for recovering additional top eigenvectors, regardless of the block size, the noise level, and whether d is large or small. Moreover, the error of the History PCA method continues to decrease as the sample size of the streaming data increases, while some other streaming methods stop improving at a very early stage. From the simulations, we conclude that the performance of the block power method depends strongly on the block size B and improves as B increases. The History PCA method, in contrast, does not depend on the block size, which is an advantage over the block power method. DBPCA's performance is sometimes much better than that of the block power method and outperforms Oja's algorithm and the Oja++ algorithm in most scenarios, but it is less stable and less accurate than our History PCA method. The performance of Oja's algorithm and the Oja++ algorithm depends heavily on the tuning of the step size, and there is no universally good step size for either of them. The History PCA method does not need a tuned step size, yet it yields better performance than the best-tuned Oja's and Oja++ algorithms in all scenarios.

Figure 2: Comparison of streaming PCA algorithms on simulated data sets (d = 100).
Figure 3: Comparison of streaming PCA algorithms on simulated data sets (d = 1000).

5.3 Large-scale real data

Figure 4: Comparison of streaming PCA algorithms on real data sets (streaming setting). Panels (a)–(c): NIPS; panels (d)–(f): NYTimes; each panel uses a different setting of k and B.
Figure 5: Comparison of all batch and streaming PCA algorithms on large data sets with respect to the number of data accesses (a: RCV1, b: KDDB) and run time (c: RCV1, d: KDDB).
# samples # features # nonzeros
NIPS 1,500 12,419 746,316
NYTimes 300,000 102,660 69,679,427
RCV1 677,399 47,236 49,556,258
KDDB 19,264,097 29,890,095 566,345,888
Table 1: Data set statistics

We compare our algorithm with Oja's algorithm, VR-PCA and the noisy power method [16, 7] on four real data sets. We do not include the Oja++ algorithm because we observe that its performance is almost identical to that of Oja's algorithm. The NIPS and NYTimes data sets are downloaded from the UCI repository, and the RCV1 and KDDB data sets are downloaded from the LIBSVM data sets. A summary of the data sets is provided in Table 1.

We first consider the streaming setting, where the algorithms can go through the data only once. For a fair comparison, we run Oja's algorithm with step sizes of the form c/t, where t is the iteration counter and the constant c ranges over a wide grid in all the experiments; the best result of Oja's algorithm is presented in the plots. We also compare the algorithms at different block sizes B and different target ranks k. In practice, we find that Oja's algorithm is often numerically unstable in some of these settings and cannot achieve performance comparable to the other approaches, so we omit its results in those cases. To evaluate the results, we use the widely used "explained variance" metric, tr(W^T A W) normalized by its maximum attainable value, to measure the quality of the solution, where W is the computed solution and A is the sample covariance matrix. Note that this value is always less than or equal to 1 and is maximized when the columns of W span the k leading eigenvectors of A. In Figure 4, we observe that our algorithm (History PCA) is very stable and consistently better in all settings (large or small k, large or small B).
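For instance, the normalized explained variance of an orthonormal solution W on a data matrix X can be computed offline as in the sketch below; normalizing by the top-k eigenvalue mass is one common convention and an assumption here.

```python
import numpy as np

def explained_variance(X, W):
    """Normalized explained variance of an orthonormal d x k basis W.

    Computes tr(W^T A W) with A = (1/N) X^T X without forming A, then
    normalizes by the sum of the top-k eigenvalues of A so that the best
    possible value is 1."""
    N, _ = X.shape
    k = W.shape[1]
    XW = X @ W                                  # N x k
    captured = np.sum(XW * XW) / N              # tr(W^T A W)
    top_eigs = np.linalg.svd(X, compute_uv=False)[:k] ** 2 / N
    return captured / np.sum(top_eigs)
```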

5.4 Comparing streaming with non-streaming algorithms

In the final set of experiments, we go beyond the streaming setting and allow several passes over the data. We want to test: (1) in addition to requiring fewer data accesses, does our algorithm also have a short running time in practice? (2) If we go beyond the first pass over the data, can our algorithm keep improving the solution by making more passes? (3) In practice, if there is a large data set with millions or billions of data points and we want to compute PCA, should we use our algorithm instead of other batch PCA algorithms?

To answer these questions, we test our algorithm on two large data sets: RCV1 and KDDB (the latter with roughly 20 million samples and 30 million features). We also add two strong batch algorithms to the comparison, the power method and VR-PCA [19], which are state-of-the-art PCA solvers. In order to clearly show the convergence speed, we compute the "error" as the largest eigenvalue of the sample covariance matrix minus the unnormalized explained variance of the current solution, so ideally the error converges to 0 (we do this so the error can be shown on a log scale on the y-axis). The results are presented in Figure 5. In Figures 5(a) and 5(b), we show the number of data passes versus the error, and the vertical line marks the end of the first data pass. We observe that even within the first data pass the error of our algorithm already becomes small, and after the first pass our algorithm is still able to improve the solution. We also observe that VR-PCA achieves better accuracy given more passes over the data and sufficient time, while we argue that an error of this order is already accurate enough for practical use.

Finally, we compare the real run time of all the algorithms. In Figure 5(c), since the RCV1 data set is not that large, we store the data set in memory and perform in-memory data access, while for the KDDB data set (Figure 5(d)), whose size exceeds memory, we load the data block by block and compare the overall running time. The results show that our algorithm is still much better than the other methods in terms of run time. The only real competitor, again, is VR-PCA, which catches up with our algorithm only at very small error levels. In the future, it would be interesting to apply a variance reduction technique to speed up the convergence of our algorithm after the first pass over the data.

6 Conclusions

We propose History PCA, a hyperparameter-free streaming PCA algorithm that utilizes past information effectively. We extend our algorithm to solve the rank-k PCA problem with k > 1. Moreover, we provide theoretical guarantees and a convergence rate for our algorithm, and we show that it converges faster than existing algorithms on both synthetic and real data sets.

References

Appendix A Proof of Lemma 2

Proof.

Since the initial vector w_0 is a unit vector chosen uniformly at random from the unit sphere, by Lemma 6 of [8] we know that, with probability at least 1 − δ,

where C is an absolute constant. Since

we have

where C is an absolute constant. ∎

Appendix B Proof of Theorem 3

Proof.

Define the step sizes η_t = 1/(t − 1) for all integers t ≥ 2. These step sizes satisfy the conditions of Theorem 3.1 of [8]. Applying that theorem, we obtain a bound on the error

where the relevant quantities are as defined above. From this bound we have

and thus we can conclude the claimed rate. Moreover, we have

for some absolute constant C. Combining the bounds above yields the statement of the theorem. ∎