Modern problems give rise to increasingly massive data sets, and modern solutions require computationally efficient algorithms that make this data manageable and therefore meaningful. The main purpose of Principal Component Analysis (PCA) is to project a high-dimensional data set onto a low-dimensional subspace while retaining its fundamental properties. PCA is therefore most useful for data sets whose dimensions are so massive that they cannot be stored or manipulated in practice. Unfortunately, the prohibitive size of such data means that standard techniques for computing the principal components are also inefficient or impossible in practice.
This motivates the streaming setting for PCA, where the PCA algorithm iterates over the data set a few samples at a time to produce a basis for the desired lower-dimensional subspace of the data. Of particular interest are algorithms for single-pass streaming PCA, which require only a single pass over the data set to obtain the desired subspace. These algorithms are particularly important in the online setting, where data may only be read once.
Oja's method [22] has been the fundamental basis for most streaming PCA results since its proposal in 1982, largely because of its simplicity and asymptotic convergence guarantees under mild conditions. However, Oja's method is commonly implemented using learning rate (step size) schemes that scale with an unknown constant. This hyperparameter must be predetermined to obtain optimal convergence rates, often requiring multiple passes over the data and violating the online setting. Indeed, one of the fundamental problems with many streaming PCA algorithms is optimizing hyperparameters without violating the streaming or online settings. In response to this deficiency we propose AdaOja, a new learning rate scheme for Oja's method for streaming PCA. AdaOja uses an adaptive scheme that circumvents the need to select or test over any hyperparameters. It is simple to implement and provides excellent convergence results in practice.
In section 2 we explain the problem setting and give background for Oja's method. In section 3 we present the AdaOja algorithm. In section 4 we demonstrate that AdaOja performs as well as or better than multiple-pass learning rates for Oja's method on both synthetic and real-world data, providing compelling empirical evidence for AdaOja's efficacy as a new implementation of Oja's method. In section 5, we further demonstrate AdaOja's performance by comparing it to other state-of-the-art single-pass streaming algorithms across the same variety of data sets. Finally, we present future directions for our work.
2 Problem Setting
Let $X \in \mathbb{R}^{n \times d}$ be a data set consisting of samples $x_1, \dots, x_n \in \mathbb{R}^d$. We assume that these samples are i.i.d. with mean 0 and some unknown covariance $\Sigma = \mathbb{E}[x_i x_i^T]$. We want to find the $k$-dimensional, orthonormal subspace $W \in \mathbb{R}^{d \times k}$, $W^T W = I_k$, s.t. when the data is projected onto $W$ the variance is maximized. In other words, we want to solve
$$\max_{W^T W = I_k} \frac{1}{n} \sum_{i=1}^n \| x_i^T W \|_2^2.$$
Clearly, this objective is maximized by the top $k$ eigenvectors of $\Sigma$. Since the true covariance matrix is unknown, classical PCA computes the top $k$ eigenvectors of the sample covariance matrix $\Sigma_n = \frac{1}{n} \sum_{i=1}^n x_i x_i^T$. Using the eigenvectors of the sample covariance matrix to approximate the eigenvectors of $\Sigma$ achieves the information lower bound [17, 25]. As we mentioned before, however, computing the sample covariance and its corresponding eigenvectors directly using offline methods may be impossible for large $n$ and $d$.
We recall that $\Sigma = \mathbb{E}[x_t x_t^T]$. The gradient of the objective with respect to the subspace $W$ is proportional to $\Sigma W$. It follows that
$$G_t = x_t x_t^T W$$
is an unbiased stochastic estimate of the gradient of our problem.
2.1 Oja’s Method
The natural next step for this problem is to apply projected stochastic gradient descent (SGD), which is precisely Oja's method. In the $k = 1$ case we apply projected SGD as follows:

1. Initialize a vector $w_0 \in \mathbb{R}^d$ of unit norm.
2. For each $t$ in $1, \dots, n$:
   (a) Choose learning rate $\eta_t$.
   (b) Project $w_{t-1} + \eta_t x_t x_t^T w_{t-1}$ onto the unit sphere by taking $w_t = \frac{w_{t-1} + \eta_t x_t x_t^T w_{t-1}}{\| w_{t-1} + \eta_t x_t x_t^T w_{t-1} \|_2}$.
A simple extension to the $k > 1$ case yields the following steps: take a gradient step $\hat{W}_t = W_{t-1} + \eta_t x_t x_t^T W_{t-1}$, then orthonormalize the columns, e.g. by taking $W_t = \mathrm{QR}(\hat{W}_t)$.
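The $k > 1$ update above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the exact code used in our experiments; the function name `oja` and the default learning rate are ours:

```python
import numpy as np

def oja(X, k, eta=lambda t: 1.0 / t):
    """One pass of Oja's method over the rows of X, returning an orthonormal d x k basis."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    W, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal start
    for t in range(1, n + 1):
        x = X[t - 1]
        W = W + eta(t) * np.outer(x, x @ W)            # gradient step with x_t x_t^T W
        W, _ = np.linalg.qr(W)                         # project back to orthonormal columns
    return W
```

Note that the QR factorization plays the role that normalization plays in the $k = 1$ case.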
2.1.1 Learning Rates
In many implementations of Oja's algorithm, it is common practice to choose $\eta_t = \frac{c}{t}$ or $\eta_t = \frac{c}{\sqrt{t}}$, where $c$ is a constant chosen by running the algorithm multiple times over the data. Several prior works implement Oja's method with step size $\frac{c}{t}$, and others with step size $\frac{c}{\sqrt{t}}$. In these applications, typically many different possible constant values are tested on the data set, and only the best-case values are used for empirical results. This demonstrates the behavior of Oja's method in the best case, but this best-case multi-pass PCA is not feasible for relevant online applications. Note that, even if multiple learning rates are applied in parallel (which would increase the memory constraints), determining the best-case subspace to use requires taking and comparing accuracy readings for all of the final solutions. This would be prohibitive for relevant accuracy metrics. Hence our goal is to find a robust learning rate scheme for Oja's method that does not require multiple passes, a priori information, or wasteful parallelization.
A few papers suggest learning rate schemes distinct from the $\frac{c}{t}$ and $\frac{c}{\sqrt{t}}$ settings. For example, [18] establishes an adaptive learning rate for the single eigenvector case. However, this method still includes a hyperparameter, the paper includes limited empirical results, and subsequent work found that the method had poor performance. Other authors propose a complicated burn-in scheme followed by a probabilistically chosen step size scheme, but they have not yet published many empirical results to justify their work, and the majority of that paper is theoretical.
3 The AdaOja Algorithm
We wanted to design a simple, effective way to determine the step size parameter for Oja's method without the need to fine-tune hyperparameters or predetermine properties of the data set. This led us to consider common variants of stochastic gradient descent. In 2010, two independent works introduced the AdaGrad update step for stochastic gradient descent for a single vector. This method is widely used in practice. In [26] the authors develop theory for a global step size variant of the AdaGrad update step.
For this scheme, the learning rate is defined via
$$b_t^2 = b_{t-1}^2 + \| G_t \|_2^2, \qquad \eta_t = \frac{1}{b_t}.$$
Hence $\eta_t = \left( b_0^2 + \sum_{s=1}^t \| G_s \|_2^2 \right)^{-1/2}$. Here $G_t$ is the latest stochastic approximation to the gradient. Not only does this scheme work well when applied to SGD in practice, [26] develops novel theory for this AdaGrad setting in non-convex landscapes.
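As a concrete sketch, the global step size scheme above amounts to the following loop. The objective, data, and variable names here are illustrative assumptions, not taken from the original works:

```python
import numpy as np

def adagrad_norm_sgd(grad, w0, samples, b0=1e-3):
    """SGD with the global AdaGrad step size: b_t^2 = b_{t-1}^2 + ||G_t||^2, eta_t = 1/b_t."""
    w = np.asarray(w0, dtype=float).copy()
    b_sq = b0 ** 2
    for x in samples:
        g = grad(w, x)              # stochastic gradient G_t
        b_sq += np.sum(g * g)       # accumulate squared gradient norms
        w -= g / np.sqrt(b_sq)      # eta_t = 1 / b_t
    return w
```

Because the accumulator only grows, the step size decays automatically at a rate driven by the observed gradients rather than by a hand-tuned schedule.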
We also considered how to adapt and apply other common learning rate schemes for stochastic gradient descent, such as ADAM [15] and RMSProp [12]. Both of these algorithms use a momentum term to improve the convergence rates for stochastic gradient descent. However, prior work discusses why adding momentum naively to Oja's method fails in practice. Indeed, when we implemented these methods and applied them to Oja's method, they completely failed as expected. An interesting open problem would be to see whether a more sophisticated adaptation of ADAM or RMSProp could improve the results for Oja's method in practice.
Note that a simple adaptation of the streaming PCA setting updates $W$ with mini-batches of $B$ samples (where $B$ is small) rather than updating with a single sample $x_t$. That is, for a mini-batch $X_t \in \mathbb{R}^{B \times d}$ of samples, our stochastic approximation to the gradient would be $\frac{1}{B} X_t^T X_t W$ rather than $x_t x_t^T W$. When $B = 1$ this is consistent with the single sample case. Examples of this technique are found in several prior works. It is particularly relevant when the time cost is dominated by the QR decomposition of $\hat{W}_t$: manipulating the samples as a mini-batch reduces the number of orthonormalizations, and hence the time complexity, by a factor of $B$. We apply this mini-batch scheme to our empirical implementations of Oja's method as well as our new algorithm, AdaOja, in Section 3.2.
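In code, the mini-batch gradient estimate is a one-liner. This sketch (our notation) reduces exactly to the single-sample estimate when $B = 1$:

```python
import numpy as np

def batch_gradient(Xb, W):
    """Mini-batch stochastic gradient estimate (1/B) X_b^T (X_b W) for a B x d batch X_b."""
    return Xb.T @ (Xb @ W) / Xb.shape[0]
```

Grouping the two matrix products as `Xb.T @ (Xb @ W)` avoids ever forming the $d \times d$ matrix $X_b^T X_b$.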
We apply the adaptive learning rate method from Algorithm 2 in [26] to Oja's method to obtain the AdaOja algorithm. To the best of our knowledge, AdaGrad has never before been applied to Oja's method and the streaming PCA problem in this way.
Note that [26] establishes that AdaGrad (equations 2–4) is strongly robust to the initial choice of $b_0$. We found a small starting value of $b_0$ to be sufficient based on the empirical results of that work and our own.
We want to extend the $k = 1$ case to the $k > 1$ case. Yet the global step size AdaGrad algorithm from equations 2–4 assumes $k = 1$. The simplest extension of AdaGrad to the $k > 1$ case updates a single learning rate with the squared Frobenius norm of the entire gradient matrix $G_t$. A better extension, however, draws on the AdaGrad algorithm in its coordinate form. In the coordinate form, each of the $d$ components receives and updates its own learning rate, setting:
$$(b_t)_i^2 = (b_{t-1})_i^2 + (G_t)_i^2, \qquad (\eta_t)_i = \frac{1}{(b_t)_i}.$$
We can apply this principle to the $k > 1$ case by obtaining and updating unique learning rates for each of the $k$ columns of $W_t$. We have chosen not to use the full coordinate form of AdaGrad to avoid over-constraining the problem, and we find that this column-wise extension works well in practice. Algorithm 2 is the result.
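A minimal sketch of the resulting algorithm in NumPy follows. This is illustrative rather than our exact implementation; the initial accumulator value and the batching scheme here are assumptions:

```python
import numpy as np

def adaoja(X, k, b0=1e-2, batch=1):
    """AdaOja sketch: Oja's method with one AdaGrad step size per column of W."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    W, _ = np.linalg.qr(rng.standard_normal((d, k)))
    b_sq = np.full(k, b0 ** 2)                 # one accumulator per column of W
    for start in range(0, n, batch):
        Xb = X[start:start + batch]
        G = Xb.T @ (Xb @ W) / Xb.shape[0]      # mini-batch gradient estimate
        b_sq += np.sum(G * G, axis=0)          # column-wise squared gradient norms
        W = W + G / np.sqrt(b_sq)              # eta_i = 1/b_i for column i
        W, _ = np.linalg.qr(W)                 # orthonormalize
    return W
```

The only state beyond Oja's method is the length-$k$ accumulator `b_sq`, so the adaptive scheme adds negligible memory and time cost.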
4 Testing and Empirical Results
4.0.1 Accuracy Metric
The PCA problem seeks to find the directions of maximum variation for our data set. That is, when we project our data set onto a subspace of lower dimension, we want to capture the maximum amount of variance possible in the data set. Hence, an obvious metric for the accuracy of our model is the explained variance, defined (for $W \in \mathbb{R}^{d \times k}$ with orthonormal columns) to be:
$$\mathrm{expvar}(W) = \frac{\sum_{i=1}^n \| x_i^T W \|_2^2}{\sum_{i=1}^n \| x_i \|_2^2}.$$
This metric is essentially the percentage of the original variance of the data set captured by our new, lower-dimensional representation. Note that the explained variance is maximized by the top $k$ eigenvectors of the sample covariance $\Sigma_n$. The explained variance is particularly useful for several reasons.
First, the explained variance demonstrates how much of the original data set can be represented in a lower dimension. When $W$ is equal to the top $k$ eigenvectors of the sample covariance matrix, the explained variance is the maximum amount of variance we can recover given the data we have received. Second, the explained variance is directly connected to the Rayleigh quotient and subsequently the problem setting. Third, the explained variance is robust to situations where the gap between the top eigenvalues is small. That is, if the top two eigenvalues are the same (or nearly the same), then the eigenvector associated with either the first or the second eigenvalue would sufficiently maximize the variance in the problem setting. This is reflected in the explained variance. For error metrics that are concerned with retrieving the exact eigenvectors of $\Sigma_n$ (such as the commonly used principal angle based error metric), interchanging the eigenvectors would cause the algorithm to appear to fail to converge.
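The metric itself is straightforward to compute; a sketch in NumPy (our naming):

```python
import numpy as np

def explained_variance(X, W):
    """Fraction of the total variance of X (rows = samples) captured by the orthonormal basis W."""
    return np.sum((X @ W) ** 2) / np.sum(X ** 2)
```

Since $\sum_i \|x_i^T W\|^2 = \mathrm{tr}(W^T X^T X W)$, the value is maximized exactly by the top $k$ eigenvectors of $X^T X$, consistent with the discussion above.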
4.1 The Power of AdaOja Learning Rates
To test the efficacy of our algorithm, we test on both synthetic and real-world data. Our experiments on synthetic data are included in section 4.1.1, our experiments on sparse real-world data are included in section 4.1.2, and our experiments on dense real-world data are included in section 4.1.3. For each data set, we run AdaOja against Oja's method with learning rates $\frac{c}{t}$ and $\frac{c}{\sqrt{t}}$ respectively. In our plots, we show the final explained variance achieved by Oja's method for $c$ on a range of scales, and plot these against the final explained variance achieved by AdaOja in a single pass. For sufficiently small data sets, we also plot the final explained variance for the eigenvectors of the sample covariance computed explicitly in the offline setting. Both this final "svd"-derived value and the final AdaOja value are constant in $c$ and serve as a reference.
We find that across every data type, AdaOja outperforms Oja's method for the majority of scales $c$. It also achieves explained variance compellingly close to the "true" offline explained variance, particularly for our dense real-world data and low-noise synthetic data.
4.1.1 Spiked Covariance Data
For our experiments with synthetic data, we use the spiked covariance model. In our implementation of this model, we let
$$x_i = A z_i + \sigma w_i, \qquad A = U \Lambda,$$
where $U \in \mathbb{R}^{d \times k}$ is a set of $d$-dimensional orthonormal vectors, $z_i \sim \mathcal{N}(0, I_k)$, and $w_i \sim \mathcal{N}(0, I_d)$. We set $\Lambda$ to be the diagonal matrix with decreasing values $\lambda_1 \ge \cdots \ge \lambda_k > 0$. Here $\sigma$ is a noise parameter that augments the distribution. In our examples, we test several values of $\sigma$, $d$, and $k$.
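For concreteness, a sketch of a spiked covariance sampler follows. The specific spike strengths and random orthonormal $U$ here are illustrative choices, not the exact values used in our experiments:

```python
import numpy as np

def spiked_covariance(n, d, k, sigma, seed=0):
    """Draw n samples x_i = A z_i + sigma * w_i with A = U * diag(lambda),
    where U has orthonormal columns and lambda_1 >= ... >= lambda_k."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal spike directions
    lam = np.linspace(1.0, 0.5, k)                     # decreasing spike strengths
    Z = rng.standard_normal((n, k))                    # latent signal z_i
    noise = sigma * rng.standard_normal((n, d))        # isotropic noise w_i
    return Z * lam @ U.T + noise
```

The population covariance of these samples is $U \Lambda^2 U^T + \sigma^2 I_d$, so the top $k$ principal components are the columns of $U$ whenever $\lambda_k > 0$.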
For this data, we measure the final explained variance for a range of scaling constants $c$. We ran these tests with a fixed batch size for several choices of $d$ and $k$ to demonstrate the behavior over a range of values. Here we include the results for the lowest and highest noise levels to demonstrate the differences in the high and low noise cases; see figures 1 and 2. The remaining plots are included in Appendix A.
From our synthetic data sets, we observe certain trends. First, for instances of low noise, the final explained variance for Oja's method with both $\frac{c}{t}$ and $\frac{c}{\sqrt{t}}$ varies widely depending on the scaling constant $c$. As the noise increases, there is less variation in the final explained variance regardless of the scaling of $c$. We further notice that as $k$ increases, there is a greater difference between the maximum and minimum final explained variance achieved by Oja's method. This trend is consistent for every noise level $\sigma$.
We also note that the maximum explained variance achieved across $c$ values is approximately the same for both Oja's with learning rate $\frac{c}{t}$ and Oja's with learning rate $\frac{c}{\sqrt{t}}$. It is most significant, however, that for every instance of $\sigma$ and $k$, the final explained variance of AdaOja approximately achieves or outperforms this maximum explained variance without any hyperparameter optimization. Not only does AdaOja outperform Oja's with learning rates $\frac{c}{t}$ and $\frac{c}{\sqrt{t}}$ for the vast majority of scales, it achieves the best results possible almost every time without violating the single-pass streaming setting to determine learning rate hyperparameters. Note that for low noise, there is almost no difference between the explained variance achieved offline from the sample covariance and online via AdaOja. As the noise and $k$ increase, there is a marginal gap between this explicit value and the value achieved with AdaOja.
4.1.2 Sparse bag-of-words
We apply our algorithm to five real-world, sparse bag-of-words data sets: Kos, NIPS, Enron, NYTimes, and PubMed. All of these data sets are sparse, with densities ranging from 0.0004 to 0.04.
For our small bag-of-words data sets (Kos, NIPS, and Enron) we run Oja's method and AdaOja with a small batch size and seek to recover the top $k$ eigenvectors. These results are visualized in figure 3. For these data sets, AdaOja achieves greater explained variance than Oja's for both learning rates for every choice of $c$. We also notice a slight difference between the maximum explained variance achieved by the two learning rate schemes. Finally, Oja's with $\frac{c}{t}$ and $\frac{c}{\sqrt{t}}$ achieved their best performance for a very limited number of scales; without a priori information or multiple passes to determine $c$, it is unlikely that Oja's method would perform well in these settings.
For our larger NYTimes and PubMed data sets, we set a larger batch size and again seek to recover the top $k$ eigenvectors. These results are visualized in figure 4. In the case of these larger data sets, we note that there are slightly more values of $c$ for which Oja's method achieves best-case behavior, but the algorithm still falls short for the vast majority of values. As with our previous results, we note that AdaOja achieves greater than or equal to the best-case explained variance for Oja's method without any hyperparameter optimization.
4.1.3 CIFAR-10 Data set
CIFAR-10 is a subset of the tiny images data set. It contains 50,000 training and 10,000 testing color images in 10 mutually exclusive classes (6,000 images per class) and is frequently used for image classification. Note that because this data is dense, we first centered the data by subtracting the mean of each attribute (pixel) before applying our algorithm. Figure 5 exhibits the final explained variance after 50,000 samples for AdaOja vs. Oja's method with learning rates $\frac{c}{t}$ and $\frac{c}{\sqrt{t}}$. As before, we see that the final explained variance achieved by Oja's method is significantly lower than that achieved by AdaOja for the majority of scales. We also note that, as with our spiked covariance data, as $k$ increases the gap between the explained variance achieved by AdaOja and that achieved by Oja's method increases for the majority of the scales. Finally, AdaOja achieves almost exactly the same explained variance as the "true" principal components computed offline.
5 AdaOja vs. State of the Art
One of our objectives for this work is not only to demonstrate the ability of AdaOja to solve the learning rate problem for Oja's method, but also to test the performance of AdaOja against other state-of-the-art streaming solutions. In particular, we test the convergence of the explained variance for the AdaOja algorithm against two other algorithms: the Streaming Power Method (SPM) and History PCA (HPCA). The Streaming Power Method (the noisy power method applied to the streaming PCA problem) was first introduced for PCA in [21]. Further theory for this method in the PCA setting was developed in [10] and [11]. Yang, Hsieh, and Wang recently proposed a new algorithm for streaming PCA called History PCA (HPCA) [28], which performs PCA in the block streaming setting using an update step that combines the block power method and Oja's algorithm. This algorithm performs well in practice and does not require hyperparameter estimation for the learning rate. It is interesting to note that in the empirical results from [28], Oja's method was implemented with a decaying learning rate for a range of scaling constants (with the top results displayed). In those experiments, HPCA consistently, and sometimes significantly, outperformed Oja's method. However, our results demonstrate that the adaptive nature of AdaOja appears to compensate for some of these deficiencies.
In our experiments, we tested the explained variance for AdaOja, HPCA, and SPM on the data sets from section 4 for both small and large batch sizes $B$. Varying the batch size is important to demonstrate the behavior of SPM, which, as a stochastic variant of the standard power method, is highly dependent on the choice of $B$. For example, for the real-world, dense CIFAR-10 data set, AdaOja and HPCA achieve almost identical convergence in both the small- and large-batch settings. Yet SPM only achieves comparable convergence with these methods in the large-batch setting, and for small batches it still falls short of the near-optimal explained variance achieved by AdaOja and HPCA. Figure 6 demonstrates this result.
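For reference, SPM's dependence on the batch size is easy to see from its update, which performs one power-iteration step per mini-batch using the block sample covariance as a noisy matrix estimate. This is an illustrative sketch under those assumptions, not the reference implementation:

```python
import numpy as np

def streaming_power_method(X, k, batch):
    """SPM sketch: W <- QR((1/B) X_b^T X_b W) for each incoming mini-batch X_b."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    W, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for start in range(0, n, batch):
        Xb = X[start:start + batch]
        S_W = Xb.T @ (Xb @ W) / Xb.shape[0]   # noisy estimate of Sigma @ W
        W, _ = np.linalg.qr(S_W)              # power iteration step
    return W
```

Unlike Oja-type updates, each step discards $W$ entirely in favor of $\mathrm{QR}(\widehat{\Sigma}_b W)$, so a small batch (a noisy $\widehat{\Sigma}_b$) directly corrupts the iterate.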
AdaOja appears to perform comparably to HPCA, and in some instances outperforms it. For example, figure 7 compares the convergence of the three methods for our small bag-of-words data sets, on which HPCA either marginally outperforms AdaOja (as with the Kos and Nips data sets) or performs approximately the same (as with the Enron data set).
However, for our slightly larger bag-of-words data sets (see figure 8), AdaOja appears to marginally outperform HPCA. Hence the adaptive choice of step size enables Oja's method to legitimately compete with state-of-the-art methods on real-world data.
We notice some of the same trends on our synthetic spiked covariance data. In particular, the performance of AdaOja and HPCA appears to be fairly consistent across small and large batch sizes (with AdaOja sometimes achieving additional improvement), but SPM achieves far better explained variance in the large-batch case. For higher noise levels, however, SPM achieves worse and worse explained variance, with only marginal improvements in the large-batch case. We also note that across all choices of $B$, AdaOja tends to achieve better convergence rates and a higher explained variance than HPCA as the noise increases. The images for these results are contained in Appendix B.
6 Future Work
In this paper we introduced AdaOja, a new algorithm for streaming PCA based on Oja’s method and a global stepsize variant of the AdaGrad algorithm. This algorithm provides a simple solution to the hyperparameter optimization problem for Oja’s method that is easy to implement and works well in practice. We demonstrated on multiple different types of data that this algorithm approximately achieves or surpasses the optimal performance of Oja’s method against other commonly used learning rate schemes. We also showed that this algorithm performs comparably to or surpasses other state-of-the-art algorithms, and is robust to the choice of batch size (unlike SPM).
These compelling empirical results open intriguing new avenues for research. Several algorithms for streaming PCA incorporate Oja’s method into their update steps. For example, HPCA  uses an update step that combines Oja’s method with the block power method. One interesting area of further development would be to incorporate AdaOja–rather than Oja’s method–into this algorithm. Another algorithm, Oja++  suggests a gradual initialization period after which the algorithm proceeds exactly as Oja’s method. It may yield better results to incorporate AdaOja into this scheme.
Of course, given these compelling empirical convergence results we want to establish theoretical convergence guarantees for AdaOja as well. This is a particularly exciting field of research because there is an extraordinary gap between the methods analyzed in theory and those implemented in practice for such iterative methods. Our method is essentially a variant of stochastic gradient descent projected onto the (non-convex) unit sphere. However, the setting is non-convex, the learning rate is adaptive, and the result at each iteration is projected onto the space of orthogonal unit vectors. Theoretical results for AdaGrad in the stochastic, non-convex setting are only recently being developed and only guarantee convergence to a critical point [26, 29]. Theoretical results for Oja's method are also an open area of research, and current convergence rates have been largely derived from complex learning rate schemes that are neither practically usable nor applicable in the adaptive setting (see [13, 2] for some of the most recent work in this area). Hence, establishing convergence rate results for AdaOja could lead to novel theoretical improvements for both AdaGrad and Oja's method.
References

[1] (2018-07) An Acceleration Scheme for Memory Limited, Streaming PCA. arXiv e-prints.
[2] (2016-07) First Efficient Convergence for Streaming k-PCA: a Global, Gap-Free, and Near-Optimal Rate. arXiv e-prints.
[3] (2018-08) Streaming PCA and subspace tracking: the missing data case. Proceedings of the IEEE 106 (8), pp. 1293–1310.
[4] (2018) Online principal component analysis in high dimension: which algorithm to choose? International Statistical Review 86 (1), pp. 29–50.
[5] (2018) Dimensionality reduction for stationary time series via stochastic nonconvex optimization. CoRR abs/1803.02312.
[6] (2018-04) Average performance analysis of the stochastic gradient method for online PCA. arXiv e-prints, arXiv:1804.01071.
[7] (2017-07) Accelerated Stochastic Power Iteration. arXiv e-prints, arXiv:1707.02670.
[8] UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
[9] (2011-07) Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, pp. 2121–2159.
[10] (2016-02) An Improved Gap-Dependency Analysis of the Noisy Power Method. arXiv e-prints, arXiv:1602.07046.
[11] (2014) The noisy power method: a meta algorithm with applications. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, Cambridge, MA, USA, pp. 2861–2869.
[12] (2012) Lecture 6a: overview of mini-batch gradient descent. Coursera lecture slides. Available at: https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
[13] (2016) Matching matrix Bernstein with little memory: near-optimal finite sample guarantees for Oja's algorithm. CoRR abs/1602.06929.
[14] (2016-03) Near-Optimal Stochastic Approximation for Online Principal Component Estimation. arXiv e-prints, arXiv:1603.05305.
[15] (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
[16] (2009) Learning multiple layers of features from tiny images.
[17] (2018-01) Near-optimal stochastic approximation for online principal component estimation. Mathematical Programming 167 (1), pp. 75–97.
[18] (2006) Global convergence of Oja's PCA learning algorithm with a non-zero-approaching adaptive learning rate. Theoretical Computer Science 367 (3), pp. 286–307.
[19] (2018) Streaming principal component analysis in noisy setting. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, Stockholm, Sweden, pp. 3413–3422.
[20] (2010) Adaptive bound optimization for online convex optimization. CoRR abs/1002.4908.
[21] (2013-06) Memory Limited, Streaming PCA. arXiv e-prints.
[22] (1982) Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology 15 (3), pp. 267–273.
[23] (2015) A stochastic PCA and SVD algorithm with an exponential convergence rate. In ICML.
[24] (2016-07) Image alignment by online robust PCA via stochastic gradient descent. IEEE Transactions on Circuits and Systems for Video Technology 26 (7), pp. 1241–1250.
[25] (2013-12) Minimax sparse principal subspace estimation in high dimensions. Ann. Statist. 41 (6), pp. 2905–2947.
[26] (2018-06) AdaGrad stepsizes: sharp convergence over nonconvex landscapes, from any initialization. arXiv e-prints, arXiv:1806.01811.
[27] (2018) Accelerated stochastic power iteration. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 84, Playa Blanca, Lanzarote, Canary Islands, pp. 58–67.
[28] (2018-02) History PCA: A New Algorithm for Streaming PCA. arXiv e-prints.
[29] (2018) On the convergence of AdaGrad with momentum for training deep neural networks. CoRR abs/1808.03408.