Federated PCA with Adaptive Rank Estimation

07/18/2019 ∙ by Andreas Grammenos, et al. ∙ University of Cambridge

In many online machine learning and data science tasks such as data summarisation and feature compression, d-dimensional vectors are usually distributed across a large number of clients in a decentralised network and collected in a streaming fashion. This is increasingly common in modern applications due to the sheer volume of data generated and the clients' constrained resources. In this setting, some clients are required to compute an update to a centralised target model independently using local data while other clients aggregate these updates with a low-complexity merging algorithm. However, some clients with limited storage might not be able to store all of the data samples if d is large, nor compute procedures requiring at least Ω(d^2) storage-complexity such as Principal Component Analysis, Subspace Tracking, or general Feature Correlation. In this work, we present a novel federated algorithm for PCA that is able to adaptively estimate the rank r of the dataset and compute its r leading principal components when only O(dr) memory is available. This inherent adaptability implies that r does not have to be supplied as a fixed hyper-parameter which is beneficial when the underlying data distribution is not known in advance, such as in a streaming setting. Numerical simulations show that, while using limited memory, our algorithm exhibits state-of-the-art performance that closely matches or outperforms traditional non-federated algorithms, and in the absence of communication latency, it exhibits attractive horizontal scalability.


1 Introduction

In recent years, the advent of edge computing in smartphones, IoT, and cryptocurrencies has induced a paradigm shift in distributed model training and large-scale data analysis. Under this new paradigm, data is generated by commodity devices with hardware limitations and severe restrictions on data-sharing and communication, which makes centralising the data extremely difficult. This has brought new computational challenges, since algorithms not only have to cope with the sheer volume of data generated by networks of devices, but must also balance their data requirements, accuracy, and complexity against constraints on hardware capacity, data access, and device-to-device communication. In light of this, the need to analyse large-scale decentralised datasets and extract useful insights from them is more pressing than ever before. Seminal work in this domain exists, but mainly in the context of deep neural networks, see

McMahan et al. (2016); Konečnỳ et al. (2016). Specifically, in Konečnỳ et al. (2016) a federated method for training neural networks was proposed. In this setting, one assumes that each of a large number of independent clients can contribute to the training of a centralised model by computing local updates with their own data and sending them to the client holding the centralised model for aggregation. Ever since the publication of this seminal work, interest in federated algorithms for training neural networks has surged, see Smith et al. (2017); He et al. (2018); Geyer et al. (2017). Despite this, federated adaptations of classical data-analysis techniques are still missing. Out of the many techniques available, Principal Component Analysis (PCA) Pearson (1901); Jolliffe (2011) is arguably the most ubiquitous one for discovering linear structure or reducing dimensionality in data, and it has become an essential component of inference, machine-learning, and data-science pipelines. In a nutshell, given a matrix Y ∈ ℝ^{d×n} of n feature vectors of dimension d, PCA aims to build a low-dimensional subspace of ℝ^d that captures the directions of maximum variance in the data contained in Y. Apart from being a fundamental tool for data analysis, PCA is often used to reduce the dimensionality of the data in order to minimise the cost of computationally expensive operations, for instance before applying t-SNE Maaten and Hinton (2008) or UMAP McInnes et al. (2018). Hence, a federated algorithm for PCA is desirable not only when data ownership is to be preserved, but also from a computational viewpoint.

Herein, we propose a federated algorithm for PCA. The computation of PCA is closely related to the Singular Value Decomposition (SVD) Eckart and Young (1936); Mirsky (1966), which can decompose any matrix into a linear combination of orthonormal rank-1 matrices weighted by positive scalars. In the context of high-dimensional data, the main limitation stems from the fact that, in the absence of structure, performing SVD on a d × n matrix requires O(nd min(n, d)) computation time and Ω(d²) memory. This cubic computational complexity and quadratic storage dependency on d make the cost of computation prohibitive for high-dimensional data, though it can often be circumvented when the data is sparse or has some other type of exploitable structure. Moreover, in some decentralised applications the computation has to be done on commodity devices with limited storage, so a PCA algorithm with an O(dr) memory dependency is highly desirable. On this front, there have been numerous recent works in the streaming setting that try to tackle this problem, see Mitliagkas et al. (2014, 2013); Marinov et al. (2018); Arora et al. (2012, 2016); Boutsidis et al. (2015). However, most of these methods do not naturally scale well, nor can they be parallelised efficiently despite their widespread use, e.g. Bouwmans and Zahzah (2014); Boutsidis et al. (2015). To overcome these issues, a reliable, federated PCA scheme for large decentralised datasets is highly desirable. In this work, we exploit a known paradigm from the randomised linear algebra literature dubbed sketch-and-solve Woodruff et al. (2014). This paradigm aims to save computational cost by producing a summary, or reduced representation, of the data and approximately solving an instance of the original problem on that summary. In matrix-computation terms, this paradigm is particularly suitable when summaries can be computed incrementally as the data is discovered, such as in the streaming setting.

Summary of contributions: Our main contribution is a modular, federated algorithm for PCA designed for the combined setting of stochastic and streaming data. Our algorithm is composed of two distinct and independent components: (1) an algorithm for the incremental, decentralised computation of local updates to the centralised PCA model, and (2) a low-complexity merging procedure to aggregate these incremental updates together. The incremental component is based on a scheme for streaming linear dimensionality reduction proposed in Eftekhari et al. (2018), which belongs to the family of deterministic sketching algorithms alongside Ghashami et al. (2016); Liberty (2013); Ghashami and Phillips (2014). However, contrary to previous algorithms, our algorithm is able to adaptively estimate the rank r and adjust the number of principal components it retains by controlling the contribution of the least significant singular value to the total variance. The aggregation component in the streaming setting is the result of a corollary of Iwen and Ong (2016) (Lemma 1), which leads to an improved version of the algorithm proposed in Rehurek (2011). Moreover, our algorithm is designed to work under the assumption that we are only allowed one pass through each column of the data using an O(dr)-memory device.

2 Notation & Preliminaries

This section introduces the notational conventions used throughout the paper. For integers n ≤ m, we use the shorthand [n : m] = {n, …, m} and [m] for the special case [1 : m]. We use lowercase letters for scalars, bold lowercase letters for vectors, bold capitals for matrices, and calligraphic capitals for subspaces. If A is a matrix and S a set of column indices, then A_S is the block composed of the columns of A indexed by S; in particular, when S = [n : m] for some n ≤ m we write A_{[n:m]}. We reserve 0_{m×n} for the zero matrix in ℝ^{m×n} and I_m for the identity matrix in ℝ^{m×m}. Additionally, we use ‖·‖_F to denote the Frobenius norm and ‖·‖₂ to denote the ℓ₂ norm. If Y ∈ ℝ^{d×n}, we let Y = UΣV^⊤ be its full SVD, formed from unitary U ∈ ℝ^{d×d} and V ∈ ℝ^{n×n} and diagonal Σ ∈ ℝ^{d×n} having σ₁(Y) ≥ σ₂(Y) ≥ … ≥ 0 on its diagonal. The values σ_i(Y) are the singular values of Y. If r ≤ rank(Y), we let Y − Y_r be the residual of Y and Y_r be its best rank-r approximation, which is the solution to

min_{Z : rank(Z) ≤ r} ‖Y − Z‖_F.

We will often abuse notation and write SVD(Y) to denote the triplet (U, Σ, V), and SVD_r(Y) for its rank-r truncation. Both uses should be clear from the context. Finally, we let Y = QR be the QR factorisation of Y, and λ₁(A) ≥ … ≥ λ_d(A) be the eigenvalues of A when A ∈ ℝ^{d×d} is symmetric.
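As a small, self-contained illustration of these conventions, the following MATLAB snippet (illustrative only; the dimensions are arbitrary) computes the best rank-r approximation Y_r via the truncated SVD and verifies that the Frobenius norm of the residual Y − Y_r equals the energy of the discarded singular values.

% Illustrative MATLAB sketch: best rank-r approximation via the truncated SVD.
d = 50; n = 200; r = 5;
Y = randn(d, n);                                   % example data matrix
[U, S, V] = svd(Y, 'econ');                        % singular values in decreasing order
Yr = U(:, 1:r) * S(1:r, 1:r) * V(:, 1:r)';         % best rank-r approximation Y_r
residual = norm(Y - Yr, 'fro');                    % Frobenius norm of the residual
tail = sqrt(sum(diag(S(r+1:end, r+1:end)).^2));    % energy of the discarded singular values
fprintf('residual = %.4f, tail energy = %.4f\n', residual, tail);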

Streaming Model: A data stream is defined as a sequence of feature vectors {y_t}_{t∈ℕ} with the property that y_i is observed strictly before y_j for any i < j. In this work, we assume that data streams are vectors in ℝ^d indexed by the natural numbers, which at any time n can be arranged in a matrix Y_n = [y_1, …, y_n] ∈ ℝ^{d×n}.

Federated learning: Federated Learning Konečnỳ et al. (2016) is a machine-learning paradigm that considers how a large number of clients owning different data points can contribute to the training of a centralised model by locally computing updates with their own data and merging them into the centralised model without sharing data with each other. Our method resembles the distributed agglomerative summary model (DASM) Tanenbaum and Van Steen (2007), in which updates are aggregated in a “bottom-up” fashion following a tree structure. That is, the nodes are arranged in a tree-like hierarchy in such a way that, for any sub-tree, the leaves compute intermediate results and propagate them to their roots for merging or summarisation. In our model, the summaries only have to travel upwards once to be combined. This permits minimal synchronisation between the nodes, so, for the purposes of this work, we do not model any issues related to synchronisation.

3 Federated PCA

We consider a decentralised dataset scattered across M clients. The dataset can be stored in a matrix Y = [Y^1 | … | Y^M] ∈ ℝ^{d×n}, with n = n_1 + … + n_M and such that block Y^i ∈ ℝ^{d×n_i} is owned by client i. We assume that each Y^i is generated in a streaming fashion and that, due to resource limitations, it cannot be stored in full. Furthermore, under the DASM we assume that the clients in the network can be arranged in a tree-like structure with q levels and approximately b leaves per node. Without loss of generality, in this paper we assume that this tree is balanced. Our federated algorithm for PCA is given in Algorithm 1.

Data: Y = [Y^1 | … | Y^M] such that Y^i belongs to client i;
r, rank estimate; b, number of subspace aggregations per batch;
(α, β), bounds on the singular-value energy if local updates are computed with SAPCA;
Result: (U, Σ) such that U ∈ ℝ^{d×r} and Σ ∈ ℝ^{r×r};
(U^i, Σ^i) ← (0, 0) for each client i;
for each level ℓ of the tree do
        // Phase I: Estimate local updates
        for each client i in level ℓ do
               (U^i, Σ^i) ← SPCA(Y^i, r);
               // or (U^i, Σ^i) ← SAPCA(Y^i, r, α, β);
        end for
       // Phase II: Merging estimations
        Obtain (U, Σ) by merging the (U^i, Σ^i) recursively on batches of size b;
end for
return (U, Σ);
Algorithm 1 Federated PCA

Note that Algorithm 1 invokes two separate sets of procedures corresponding to the computation of local updates and the merging of the resulting subspaces. In the inner-most loop, each client collects or generates its data in a streaming fashion and computes an estimate of its local principal subspace using SPCA or SAPCA (Algorithm 3). Then, the estimates for level ℓ are aggregated by calling MergeSubspaces (Algorithm 2) recursively on subspace batches of size b. Due to the recursive nature of the procedure, only a small number of merging operations is required at each level. Detailed descriptions of Algorithms 2-3 are given in the following sections.
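To make the control flow of Algorithm 1 concrete, the following MATLAB sketch simulates a single level of the scheme: each client runs a plain batch rank-r SVD on its own block as a stand-in for SPCA/SAPCA, and the resulting subspaces are then merged pairwise using the concatenation rule (1). The sizes, the equal weighting (lambda = 1), and the use of a batch SVD in place of the streaming local update are simplifying assumptions made here for illustration, not the paper's reference implementation.

% Illustrative MATLAB sketch of the federated flow in Algorithm 1 (single level).
d = 100; n_i = 400; M = 8; r = 10; lambda = 1;     % hypothetical sizes and parameters
U = cell(M, 1); S = cell(M, 1);
for i = 1:M                                        % Phase I: local updates, one per client
    Yi = randn(d, n_i);                            % client i's local data block
    [Ui, Si, ~] = svd(Yi, 'econ');
    U{i} = Ui(:, 1:r); S{i} = Si(1:r, 1:r);        % keep only the r leading directions
end
while numel(U) > 1                                 % Phase II: merge the estimates pairwise
    Unew = {}; Snew = {};
    for i = 1:2:numel(U)
        if i < numel(U)                            % merge neighbouring subspaces via (1)
            [Um, Sm, ~] = svd([lambda * U{i} * S{i}, U{i+1} * S{i+1}], 'econ');
            Unew{end+1} = Um(:, 1:r); Snew{end+1} = Sm(1:r, 1:r);
        else
            Unew{end+1} = U{i}; Snew{end+1} = S{i};  % odd one out passes through unchanged
        end
    end
    U = Unew; S = Snew;
end
% U{1}, S{1} now hold the aggregated estimate of the r leading principal directions.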

3.1 Merging

Our algorithmic constructions are built upon the concept of subspace merging, in which two subspaces S_1 = span(U_1) and S_2 = span(U_2), with associated singular values Σ_1 and Σ_2, are merged together to produce a subspace S describing the combined principal directions of S_1 and S_2. One can merge two subspaces by computing a truncated SVD of a concatenation of their weighted bases. Namely,

[U, Σ, V] ← SVD_r([λ U_1 Σ_1, U_2 Σ_2]),   (1)

where λ is a forgetting factor that allocates less weight to the previous subspace S_1. An efficient version of (1) is presented in Algorithm 4 (Appendix). In Rehurek (2011), it is shown how (1) can be further improved when V is not required and we have knowledge that U_1 and U_2 are already orthonormal. This is done by building a basis for the residual of U_2 with respect to span(U_1) via the QR factorisation [Q′, R′] ← QR((I − U_1 U_1^⊤) U_2) and then computing the SVD of a small matrix such that

[λ U_1 Σ_1, U_2 Σ_2] = [U_1, Q′] [λΣ_1, U_1^⊤ U_2 Σ_2; 0, R′ Σ_2].   (2)

It is shown in (Rehurek, 2011, Chapter 3) that this yields an SVD of the form

[λ U_1 Σ_1, U_2 Σ_2] = ([U_1, Q′] U_R) Σ_R V_R^⊤,   (3)

where [U_R, Σ_R, V_R] ← SVD([λΣ_1, U_1^⊤ U_2 Σ_2; 0, R′ Σ_2]). This procedure is shown in Algorithm 2.

Data: (U_1, Σ_1), (U_2, Σ_2), subspaces to be merged;
r, rank estimate; λ, forgetting factor for (U_1, Σ_1); μ, enhancing factor for (U_2, Σ_2);
Result: (U, Σ), merged subspace;
[Q′, R′] ← QR((I − U_1 U_1^⊤) U_2);
[U_R, Σ_R, ∼] ← SVD([λΣ_1, U_1^⊤ U_2 (μΣ_2); 0, R′(μΣ_2)]);
U ← ([U_1, Q′] U_R)_{[1:r]};
Σ ← (Σ_R)_{[1:r],[1:r]};
Algorithm 2 MergeSubspaces (Fastest subspace-merging algorithm)
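For reference, a minimal MATLAB sketch of the merging step described by (2)-(3) is given below. As in Algorithm 2, it assumes that U1 and U2 are already orthonormal and that S1, S2 are diagonal; lambda and mu play the roles of the forgetting and enhancing factors, and the function name merge_subspaces is ours rather than the paper's.

function [U, S] = merge_subspaces(U1, S1, U2, S2, r, lambda, mu)
% Merge two subspace estimates without forming the underlying data matrices,
% following the QR-plus-small-SVD construction of (2)-(3). Illustrative sketch only.
Z = U2 - U1 * (U1' * U2);                          % residual of U2 with respect to span(U1)
[Qp, Rp] = qr(Z, 0);                               % orthonormal basis for the residual
M = [lambda * S1, (U1' * U2) * (mu * S2);
     zeros(size(U2, 2), size(S1, 2)), Rp * (mu * S2)];
[Ur, Sr, ~] = svd(M, 'econ');                      % SVD of the small (r1+r2)-by-(r1+r2) matrix
U = [U1, Qp] * Ur(:, 1:r);                         % rotate the combined basis, keep r directions
S = Sr(1:r, 1:r);
end

For instance, [U, S] = merge_subspaces(U1, S1, U2, S2, r, 1, 1) merges two local estimates with equal weighting, and the output can itself be merged again, which is exactly what the recursive batching in Algorithm 1 relies on.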

The merging procedure outlined above can be generalised to multiple subspaces. A proof of this was given by Iwen and Ong (2016). In Lemma 1 we extend this result to the case of streaming data. The proof is deferred to the Appendix.

Lemma 1 (Streaming partial uniqueness).

Let Y = [Y^1 | … | Y^M] ∈ ℝ^{d×n} with n = n_1 + … + n_M, and let Ŷ = [U^1Σ^1 | … | U^MΣ^M] be formed from the reduced SVDs Y^i = U^iΣ^i(V^i)^⊤. Assume streaming data and that at any given time only a bounded number of feature vectors can be stored in each client. Let Y = UΣV^⊤ and Ŷ = ÛΣ̂V̂^⊤ be the reduced SVDs of Y and Ŷ. Then Σ̂ = Σ and Û = UΦ, where Φ is a unitary block-diagonal matrix. If none of the nonzero singular values are repeated then Û = U.

We stress that the result proved in Iwen and Ong (2016) is incremental, but not streaming. This means that every local result has to be computed in full in order to be processed, merged, and propagated for the user to obtain a final result, so it is not fit for a federated computing approach.

3.2 Estimation of local updates: streaming, adaptive PCA

In this section we introduce SPCA and SAPCA, the streaming algorithms clients use to compute local updates to the centralised model. Consider a sequence of feature vectors y_1, y_2, … ∈ ℝ^d and let their concatenation at time n be

Y_n = [y_1, y_2, …, y_n] ∈ ℝ^{d×n}.   (4)

A block of size b is formed by taking b contiguous columns of Y_n. Hence, a matrix with n columns induces K = ⌈n/b⌉ blocks. For convenience, we assume that b divides n, so that K = n/b. In this case, block Y_k corresponds to the sub-matrix containing columns (k−1)b+1 through kb. It is assumed that all blocks are owned and observed exclusively by client i, but that due to resource limitations the client cannot store them all. Hence, once client i has observed block Y_k it uses it to update its estimate of the principal components of Y_n and then releases Y_k from memory. If Ŷ_0 is the empty matrix, the principal components of Y_n can be estimated by the iteration

Ŷ_k ← SVD_r([Ŷ_{k−1}, Y_k]),  k ∈ [K].   (5)

This algorithm is called MOSES and was proposed in Eftekhari et al. (2018). Its output after K iterations is Ŷ_K, which, as mentioned above, contains both an estimate of the r leading principal components of Y_n and the projection of Y_n onto this estimate. The local subspace estimation of our federated algorithm is inspired by an implementation of iteration (5) presented in Eftekhari et al. (2018). However, our algorithm is designed to achieve substantial computational savings by tracking only the left principal subspace and, contrary to other algorithms, the singular value estimates. Additionally, the nodes in our algorithm can adjust, independently of each other, their rank estimate based on the distribution of the data seen so far. This is convenient because we expect to have nodes which observe different distributions and which will likely need to adjust the number of principal components kept in order to accurately track the data distribution over time. While energy thresholding methods Al-Kandari and Jolliffe (2005); Papadimitriou et al. (2005) are a natural and established way to achieve this, these methods only work well when updates are performed for each feature vector. Moreover, according to our experiments, they also under-perform with block-based approaches. To this end, we propose a simple and efficient regularisation scheme based solely on the estimate of the current singular values and their contribution to the total approximation discovered so far. Specifically, letting σ_i^{(k)} be the i-th singular value estimate at block k, we use the contribution of the smallest retained singular value to control the lower bound on the total variance Σ_j (σ_j^{(k)})². We do this by letting α and β be the minimum and maximum contributions to the variance of the dataset and enforcing

α ≤ (σ_r^{(k)})² / Σ_{j=1}^{r} (σ_j^{(k)})² ≤ β.   (6)

That is, r is increased whenever the ratio in (6) exceeds β and decreased when it falls below α. This adjustment happens only once per block in order to prevent the addition of too many principal components in one go, but we point out that a number of variations to this strategy are possible. For example, a user could also implement a hold-off duration to prevent updates to r for a predefined number of blocks. Our algorithm is presented in Algorithm 3. We use the labels SPCA and SAPCA to distinguish the plain streaming variant from the variant with adaptive rank estimation.

Data: Feature vectors y_1, …, y_n ∈ ℝ^d, with n = Kb;
r, rank estimate; b, the desired block size;
(α, β), bounds on the singular-value energy (if the algorithm is SAPCA);
Result: (U_k, Σ_k), estimates of the principal subspace for the k-th block.
for k ∈ [K] do
        Y_k ← [y_{(k−1)b+1}, …, y_{kb}]; // form the k-th block
        if k = 1 then
               [U_k, Σ_k, ∼] ← SVD_r(Y_k);
        else
               Z ← U_{k−1}^⊤ Y_k;
               [Q_k, R_k] ← QR(Y_k − U_{k−1} Z);
               [U′, Σ_k, ∼] ← SVD_r([Σ_{k−1}, Z; 0, R_k]);
               U_k ← [U_{k−1}, Q_k] U′;
        end if
       if Algorithm is SAPCA then
               if (σ_r^{(k)})² / Σ_j (σ_j^{(k)})² > β then
                       r ← r + 1;
                       U_k ← [U_k, e_r], where e_r is the r-th canonical vector.
               else if (σ_r^{(k)})² / Σ_j (σ_j^{(k)})² < α then
                       r ← r − 1;
                       U_k ← (U_k)_{[1:r]}; Σ_k ← (Σ_k)_{[1:r],[1:r]};
               end if
        end if
end for
Algorithm 3 Streaming PCA / Streaming Adaptive PCA (SPCA/SAPCA)

Note that Algorithm 3 can, if desired, explicitly maintain the estimates of the principal components along with the projected data. Moreover, its storage and computational requirements are nearly optimal for the given objective since, at iteration k, it only requires O((r + b)d) memory and O((r + b)²d) flops. We point out that SAPCA inherits some theoretical guarantees which ensure that its output differs from that of the offline SVD by only a small polynomial factor, see Appendix B. Finally, in Lemma 2 we provide a simple bound on the error of SPCA together with a guarantee of time-order independence in the data, which is essential in a decentralised, federated system to guarantee robustness. The proofs of these statements are given in the Appendix.
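The following MATLAB sketch illustrates one block update of the local estimator: it folds a new block into the running (U, Σ) estimate with a merge of the kind used above and then applies the energy-based rank rule of (6). The thresholds alpha and beta, the helper name spca_block_update, and the exact truncation and padding choices are illustrative assumptions rather than the paper's reference implementation.

function [U, S, r] = spca_block_update(U, S, r, Yk, alpha, beta)
% One streaming block update (illustrative sketch). U (d-by-r) and S (r-by-r)
% hold the current estimate; Yk is the newly observed block, released afterwards.
if isempty(U)
    [Uk, Sk, ~] = svd(Yk, 'econ');                 % first block: plain truncated SVD
    U = Uk(:, 1:r); S = Sk(1:r, 1:r);
else
    P = U' * Yk;                                   % projection of the block onto span(U)
    [Qk, Rk] = qr(Yk - U * P, 0);                  % basis for the part of Yk outside span(U)
    [Ur, Sr, ~] = svd([S, P; zeros(size(Rk, 1), size(S, 2)), Rk], 'econ');
    U = [U, Qk] * Ur(:, 1:r); S = Sr(1:r, 1:r);    % truncate back to rank r
end
sigma2 = diag(S).^2;                               % squared singular-value estimates
ratio = sigma2(end) / sum(sigma2);                 % contribution of the smallest kept value
if ratio > beta                                    % tail still energetic: grow the rank
    r = r + 1;
    e = zeros(size(U, 1), 1); e(r) = 1;            % canonical direction, as in Algorithm 3
    U = [U, e]; S = blkdiag(S, 0);
elseif ratio < alpha && r > 1                      % tail negligible: shrink the rank
    r = r - 1;
    U = U(:, 1:r); S = S(1:r, 1:r);
end
end

A client would call this once per observed block, passing U = [] and S = [] on the first call, and release the block from memory immediately afterwards.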

Lemma 2 (SPCA properties).

Let Y ∈ ℝ^{d×n} be presented in blocks Y_1, …, Y_K, let Ŷ_k be the estimate produced by iteration (5), and let Π be a permutation matrix. Then,

  1. for every k, ‖[Ŷ_{k−1}, Y_k] − Ŷ_k‖_F² = Σ_{j>r} σ_j²([Ŷ_{k−1}, Y_k]) ≤ Σ_{j>r_min} σ_j²([Ŷ_{k−1}, Y_k]), where r_min is the minimum rank estimated by SAPCA.

  2. If Π is a row permutation of the identity, then the singular values and left principal subspace recovered from YΠ coincide with those recovered from Y. A similar result holds for SAPCA.

4 Experimental Evaluation

To validate our scheme, we evaluate the empirical performance of SAPCA by comparing it against competing algorithms using synthetic and real datasets (to foster reproducibility, both the code and the datasets used are made publicly available at https://github.com/andylamp/federated_pca). All our experiments were run on a server with dual Intel Xeon X5650 CPUs, ECC DDR3 RAM, and Matlab R2019a (build 9.6.0.1099231). The algorithms considered in this instance are SAPCA, GROUSE Balzano and Wright (2013), Frequent Directions (FD) Desai et al. (2016); Luo et al. (2017), the Power Method (PM) Mitliagkas et al. (2014), and a variant of Projection Approximation Subspace Tracking (PAST) Yang (1995) named SPIRIT (SP) Papadimitriou et al. (2005). As SAPCA is derived from MOSES Eftekhari et al. (2018) and returns the same subspace and singular values as MOSES given a fixed rank, a separate comparison with MOSES is not included. We assume a dataset of the form (4) and, for each algorithm and each dataset, compute an estimate together with the Frobenius-norm error and the Mean Squared Error (MSE) of the reconstruction.
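As a point of reference, the following MATLAB sketch computes the two reconstruction metrics we report, under the assumption that the Frobenius-norm error is taken relative to the norm of the data and that the MSE is the per-entry mean squared reconstruction error; the exact normalisation used in the figures may differ.

% Illustrative error metrics for a reconstruction Yhat of the data Y.
Y = randn(200, 1000);                              % placeholder data
[U, S, V] = svd(Y, 'econ'); r = 10;
Yhat = U(:, 1:r) * S(1:r, 1:r) * V(:, 1:r)';       % e.g. a rank-r reconstruction
fro_err = norm(Y - Yhat, 'fro') / norm(Y, 'fro');  % relative Frobenius-norm error
mse = mean((Y(:) - Yhat(:)).^2);                   % per-entry mean squared error
fprintf('Frobenius error: %.4f, MSE: %.4f\n', fro_err, mse);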

4.1 SAPCA performs favourably compared to other methods in synthetic and real datasets

Figures 1(a) and 1(b) show the results of our experiments on synthetic data, with columns generated independently from a zero-mean multivariate Gaussian with covariance matrix Σ = QΛQ^⊤, where Q is the orthogonalisation of a random Gaussian matrix and Λ is a diagonal matrix whose entries follow a power law. In the experiments, we let λ denote the forgetting factor of SP. Figures 1(a) and 1(b) compare SAPCA with SP for two settings of λ. While SAPCA exhibits relative stability in the number of estimated principal components in both cases, SP exhibits a monotonic increase in the number of principal components it estimates in one of the two settings. This behaviour is replicated in Figures 1(e) and 1(f), where the error is computed for SAPCA, SP, and rank-fixed variants of the competing methods with the rank set to the minimum (r_min) and maximum (r_max) ranks estimated by SAPCA in the simulation. Figures 1(c) and 1(d) benchmark SAPCA on real sensor-node humidity and light datasets obtained from Deshpande et al. (2004) (datasets on voltage and temperature readings are also available and used; see Appendix C.1 for a detailed description of these datasets and additional experiments; source of data: https://www.cs.cmu.edu/afs/cs/project/spirit-1/www/data/Motes.zip). The figures show that our method performs favourably against all competing algorithms. Additional experiments on synthetic and real datasets are given in Appendix C.1.

Figure 1: Performance of SAPCA on synthetic and real datasets. Panels (a)-(b): error and estimated PC count on synthetic data for the two SP forgetting-factor settings; panels (c)-(d): error on the humidity and light datasets; panels (e)-(f): error with the rank fixed to the minimum and maximum values estimated by SAPCA.

4.2 SAPCA’s memory efficiency is considerably better on average than competing methods

Table 1 reports the average and median memory-allocation profiles for each algorithm and each sensor-node dataset provided in Deshpande et al. (2004). The statistics were computed over a sample of ten runs. Table 1 shows that SAPCA's memory efficiency compares favourably against the other algorithms in the experiment. See Appendix C.3 for further details and an extended discussion.

Real Datasets
Method Humidity Light Voltage Temperature
SAPCA 138.11 Kb / 58.99 Kb 104.00 Kb / 76.03 Kb / 23.47 Kb 187.74 Kb / 113.28 Kb
PM Kb / Kb / Kb Kb / Kb Kb / Kb
GROUSE Kb / Kb Kb / Kb Kb / Kb Kb / Kb
FD Kb / Kb Kb / Kb 114.46 Kb / Kb Kb / Kb
SP Kb / Kb Kb / Kb Kb / Kb Kb / Kb
Table 1: Average / median memory allocations

4.3 The federated scheme shows graceful scaling when simulated on a multi-core CPU

To simulate a federated computation environment, we compare the average execution times required to compute Federated PCA on a dataset Y. To do this, we fix the dataset size and, for each number of clients considered, report the time required to compute each sub-problem and merge the results together under different assumptions on the number of computing nodes (a maximum of 32 in our simulation); the results are shown in Figure 2. The real-time scaling is shown in Figure 2(a), while the amortised scaling, corresponding to the federated scenario of having one processing core per client, can be seen in Figure 2(b). Figure 2(a) shows that a regression occurs once we exceed the number of available physical cores on our machine while keeping the problem size fixed; this behaviour is expected since, due to the lack of processing nodes, not all of the sub-problems can be executed concurrently. On the other hand, Figure 2(b) shows very graceful scaling of the average execution times in the presence of as many processing nodes as there are sub-problems to solve, across all problem sizes considered. We also assume that each node is independent and completely owns its subset of the (evolving) dataset. More details and discussion can be found in Appendix C.4.

Figure 2: Performance scaling for (a) real and (b) amortised execution time.

5 Discussion & Conclusions

We introduced a federated, streaming algorithm for PCA. Our federated algorithm is the result of two separate innovations: namely, an adaptive streaming algorithm for computing rank-r approximations (SAPCA), and a fast subspace-merging algorithm to aggregate these updates together (MergeSubspaces). We complement our algorithm with several theoretical results that guarantee bounded estimation errors as well as data-permutation invariance. In particular, we extended a result by Iwen and Ong (2016) to guarantee bounded memory in the streaming setting, which is crucial to enable horizontal scalability and bring PCA to the federated setting. Our numerical experiments show that SAPCA performs favourably against other methods in terms of convergence, bounded estimation errors, and low memory requirements, and also that Federated PCA scales gracefully when simulated on a multi-core CPU. While our algorithm comes with no guarantees on data privacy, we point out that it can readily implement the input-perturbation scheme described in Chaudhuri et al. (2013); we leave this as a potential avenue for future work. Another interesting avenue of future work is to devise a computationally efficient scheme for performing PCA in a federated setting in the presence of missing values.

References

  • Al-Kandari and Jolliffe (2005) Noriah M Al-Kandari and Ian T Jolliffe. 2005. Variable selection and interpretation in correlation principal components. Environmetrics: The official journal of the International Environmetrics Society 16, 6 (2005), 659–672.
  • Arora et al. (2012) Raman Arora, Andrew Cotter, Karen Livescu, and Nathan Srebro. 2012. Stochastic optimization for PCA and PLS. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on. IEEE, 861–868.
  • Arora et al. (2016) Raman Arora, Poorya Mianjy, and Teodor Marinov. 2016. Stochastic optimization for multiview representation learning using partial least squares. In International Conference on Machine Learning. 1786–1794.
  • Balzano and Wright (2013) L. Balzano and S. J Wright. 2013. On GROUSE and incremental SVD. In IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP). IEEE, 1–4.
  • Boutsidis et al. (2015) Christos Boutsidis, Dan Garber, Zohar Karnin, and Edo Liberty. 2015. Online principal components analysis. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 887–901.
  • Bouwmans and Zahzah (2014) Thierry Bouwmans and El Hadi Zahzah. 2014. Robust PCA via principal component pursuit: A review for a comparative evaluation in video surveillance. Computer Vision and Image Understanding 122 (2014), 22–34.
  • Chaudhuri et al. (2013) Kamalika Chaudhuri, Anand D Sarwate, and Kaushik Sinha. 2013. A near-optimal algorithm for differentially-private principal components. The Journal of Machine Learning Research 14, 1 (2013), 2905–2943.
  • Desai et al. (2016) Amey Desai, Mina Ghashami, and Jeff M Phillips. 2016. Improved practical matrix sketching with guarantees. IEEE Transactions on Knowledge and Data Engineering 28, 7 (2016), 1678–1690.
  • Deshpande et al. (2004) Amol Deshpande, Carlos Guestrin, Samuel R Madden, Joseph M Hellerstein, and Wei Hong. 2004. Model-driven data acquisition in sensor networks. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30. VLDB Endowment, 588–599.
  • Eckart and Young (1936) C. Eckart and G. Young. 1936. The approximation of one matrix by another of lower rank. Psychometrika 1 (1936), 211–218. https://doi.org/10.1007/BF02288367
  • Eftekhari et al. (2018) Armin Eftekhari, Raphael A Hauser, and Andreas Grammenos. 2018. MOSES: A Streaming Algorithm for Linear Dimensionality Reduction. arXiv preprint arXiv:1806.01304 (2018).
  • Geyer et al. (2017) Robin C Geyer, Tassilo Klein, and Moin Nabi. 2017. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557 (2017).
  • Ghashami et al. (2016) Mina Ghashami, Edo Liberty, Jeff M Phillips, and David P Woodruff. 2016. Frequent directions: Simple and deterministic matrix sketching. SIAM J. Comput. 45, 5 (2016), 1762–1792.
  • Ghashami and Phillips (2014) Mina Ghashami and Jeff M Phillips. 2014. Relative errors for deterministic low-rank matrix approximations. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms. SIAM, 707–717.
  • He et al. (2018) Lie He, An Bian, and Martin Jaggi. 2018. Cola: Decentralized linear learning. In Advances in Neural Information Processing Systems. 4536–4546.
  • Iwen and Ong (2016) MA Iwen and BW Ong. 2016. A distributed and incremental svd algorithm for agglomerative data analysis on large networks. SIAM J. Matrix Anal. Appl. 37, 4 (2016), 1699–1718.
  • Jolliffe (2011) Ian Jolliffe. 2011. Principal component analysis. In International encyclopedia of statistical science. Springer, 1094–1096.
  • Konečnỳ et al. (2016) Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
  • Liberty (2013) Edo Liberty. 2013. Simple and deterministic matrix sketching. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 581–588.
  • Luo et al. (2017) Luo Luo, Cheng Chen, Zhihua Zhang, Wu-Jun Li, and Tong Zhang. 2017. Robust Frequent Directions with Application in Online Learning. arXiv preprint arXiv:1705.05067 (2017).
  • Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.
  • Marinov et al. (2018) Teodor Vanislavov Marinov, Poorya Mianjy, and Raman Arora. 2018. Streaming Principal Component Analysis in Noisy Settings. In International Conference on Machine Learning. 3410–3419.
  • McInnes et al. (2018) Leland McInnes, John Healy, and James Melville. 2018. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
  • McMahan et al. (2016) H Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, et al. 2016. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629 (2016).
  • Mirsky (1966) L. Mirsky. 1966. Symmetric gauge functions and unitarily invariant norms. Quart. J. Math. Oxford (1966), 1156–1159.
  • Mitliagkas et al. (2013) Ioannis Mitliagkas, Constantine Caramanis, and Prateek Jain. 2013. Memory limited, streaming PCA. In Advances in Neural Information Processing Systems. 2886–2894.
  • Mitliagkas et al. (2014) I. Mitliagkas, C. Caramanis, and P. Jain. 2014. Streaming PCA with many missing entries. Preprint (2014).
  • Papadimitriou et al. (2005) Spiros Papadimitriou, Jimeng Sun, and Christos Faloutsos. 2005. Streaming pattern discovery in multiple time-series. In Proceedings of the 31st international conference on Very large data bases. VLDB Endowment, 697–708.
  • Pearson (1901) Karl Pearson. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 11 (1901), 559–572.
  • Rehurek (2011) Radim Rehurek. 2011. Subspace tracking for latent semantic analysis. In European Conference on Information Retrieval. Springer, 289–300.
  • Smith et al. (2017) Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. 2017. Federated multi-task learning. In Advances in Neural Information Processing Systems. 4424–4434.
  • Tanenbaum and Van Steen (2007) Andrew S Tanenbaum and Maarten Van Steen. 2007. Distributed systems: principles and paradigms. Prentice-Hall.
  • Woodruff et al. (2014) David P Woodruff et al. 2014. Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science 10, 1–2 (2014), 1–157.
  • Yang (1995) Bin Yang. 1995. Projection approximation subspace tracking. IEEE Transactions on Signal processing 43, 1 (1995), 95–107.

Appendix A Supplementary Material

We start by providing the full proof of Lemma 1.

Proof.

Let the singular values of Y be the positive square roots of the eigenvalues of YY^⊤; then, using the previously defined streaming block decomposition of the matrix, we have

YY^⊤ = Σ_{i=1}^{M} Y^i (Y^i)^⊤ = Σ_{i=1}^{M} U^i Σ^i (Σ^i)^⊤ (U^i)^⊤.

Equivalently, the singular values of Ŷ are similarly defined as the square roots of the eigenvalues of

ŶŶ^⊤ = Σ_{i=1}^{M} U^i Σ^i (Σ^i)^⊤ (U^i)^⊤.

Thus YY^⊤ = ŶŶ^⊤, hence the singular values of Ŷ must surely equal those of Y. Moreover, since the left singular vectors of Y and Ŷ are also eigenvectors of YY^⊤ and ŶŶ^⊤, respectively, the eigenspaces associated with each, possibly repeated, eigenvalue will also be equal, and thus Û = UΦ. The block-diagonal unitary matrix Φ has unitary blocks of size equal to the multiplicity of each repeated eigenvalue; this enables the singular vectors associated with each repeated singular value to be rotated into the desired matrix representation Û. ∎

We continue with the proofs for Lemma 2, starting with the permutation-invariance property of Lemma 2.2.

Proof.

If UΣV^⊤ is the SVD of Y, then YΠ = UΣ(Π^⊤V)^⊤. Since Π is orthogonal, UΣ(Π^⊤V)^⊤ is the SVD of YΠ. Hence, both Y and YΠ have the same singular values and left principal subspaces. ∎

We then finalise our proof of Lemma 2 by proving the error bound of Lemma 2.1.

Proof.

At iteration k, SPCA computes Ŷ_k, the best rank-r approximation of [Ŷ_{k−1}, Y_k], using iteration (5). Hence, for each k, the squared error of the approximation is given by Σ_{j>r} σ_j²([Ŷ_{k−1}, Y_k]). Let r_min and r_max be the minimum and maximum rank estimates encountered when running SAPCA. The result follows by noting that this tail sum is largest when the rank equals r_min. ∎

Now we present in Algorithm 4 the full implementation of the basic subspace-merging algorithm, which is a direct consequence of Lemma 1, with the only addition being the forgetting factor λ, which serves the purpose of giving more significance to the “newer” subspace.

Data: U_1, first subspace
Σ_1, first subspace singular values
U_2, second subspace
Σ_2, second subspace singular values
r, the desired rank
λ, forgetting factor
μ, enhancing factor
Result: U, merged subspace
Σ, merged singular values
[U′, Σ′, ∼] ← SVD([λ U_1 Σ_1, μ U_2 Σ_2]);
U ← U′_{[1:r]}; Σ ← Σ′_{[1:r],[1:r]};
Algorithm 4 Basic MergeSubspaces algorithm

Following this, we can improve upon Algorithm 4 by exploiting the fact that the input subspaces are already orthonormal, which transforms Algorithm 4 into Algorithm 5. The key intuition comes from the fact that we can update the merged basis incrementally starting from U_1. To do this we first need a subspace basis which spans both U_1 and U_2, namely [U_1, Q′]. This is done by performing the QR factorisation [Q′, R′] ← QR((I − U_1 U_1^⊤) U_2) and using [U_1, Q′] to perform an incremental update. However, the resulting middle matrix (as in (2)) is not diagonal, so a further SVD needs to be applied to it, which yields the required singular values and the rotation matrix to apply to [U_1, Q′] in order to represent the new subspace basis.

Data: U_1, first subspace
Σ_1, first subspace singular values
U_2, second subspace
Σ_2, second subspace singular values
r, the desired rank
λ, forgetting factor
μ, enhancing factor
Result: U, merged subspace
Σ, merged singular values
[Q′, R′] ← QR((I − U_1 U_1^⊤) U_2);
[U_R, Σ′, ∼] ← SVD([λΣ_1, U_1^⊤ U_2 (μΣ_2); 0, R′(μΣ_2)]);
U ← ([U_1, Q′] U_R)_{[1:r]}; Σ ← Σ′_{[1:r],[1:r]};
Algorithm 5 Faster MergeSubspaces algorithm

Appendix B SAPCA Inherited Guarantees

We note that SAPCA in Algorithm 3 inherits some theoretical guarantees from MOSES, presented in Eftekhari et al. (2018). Specifically, let μ be an unknown probability measure supported on ℝ^d with zero mean. The informal objective is to find an r-dimensional subspace that provides the best approximation with respect to the mass of μ. That is, provided that y is drawn from μ, the target is to find an r-dimensional subspace 𝒰 that minimises the population risk by solving

min_{𝒰 ∈ G(d, r)}  E_{y∼μ} ‖ y − P_𝒰 y ‖₂²,   (7)

where the Grassmannian G(d, r) is the manifold of all r-dimensional subspaces of ℝ^d and P_𝒰 is the orthogonal projection onto 𝒰. Unfortunately, the measure μ is unknown and cannot be used to solve (7) directly, but provided we have access to a block of samples y_1, …, y_n that are independently drawn from μ, then (7) can be reformulated using the empirical risk as

min_{𝒰 ∈ G(d, r)}  (1/n) Σ_{t=1}^{n} ‖ y_t − P_𝒰 y_t ‖₂².   (8)

Given Y_n = [y_1, …, y_n], it follows by the EYM Theorem Eckart and Young (1936); Mirsky (1966) that the best rank-r approximation to Y_n is given by its truncated SVD, so the empirical risk in (8) is minimised by the span of the r leading left singular vectors of Y_n; that is, the solution of (8) equals span(Y_{n,r}). For completeness, the full MOSES theorem is shown below.
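Before stating the theorem, a quick numerical sanity check of the equivalence above can be run in MATLAB: assuming the 1/n normalisation in (8), the empirical risk evaluated at the span of the r leading left singular vectors equals the normalised tail energy of the spectrum, exactly as the EYM theorem predicts. The snippet is purely illustrative.

% Empirical risk of the leading r-dimensional singular subspace vs. the SVD tail energy.
d = 40; n = 500; r = 5;
Y = randn(d, n);
[U, S, ~] = svd(Y, 'econ');
P = U(:, 1:r) * U(:, 1:r)';                        % orthogonal projection onto the subspace
emp_risk = sum(sum((Y - P * Y).^2)) / n;           % empirical risk (8) at this subspace
tail = sum(diag(S(r+1:end, r+1:end)).^2) / n;      % normalised tail energy of the spectrum
fprintf('empirical risk = %.6f, tail energy = %.6f\n', emp_risk, tail);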

Theorem 1 (MOSES, Eftekhari et al. (2018)).

Suppose that the vectors y_1, …, y_n are independently drawn from a zero-mean Gaussian distribution with covariance matrix Ξ, and form Y_n = [y_1, …, y_n]. Let λ_1 ≥ … ≥ λ_d be the eigenvalues of Ξ and let ρ_r² = Σ_{j>r} λ_j be its residual. Define

(9)

Let Ŷ_n be defined as in (5), and let the constants involved satisfy the conditions stated in Eftekhari et al. (2018). Then, for a sufficiently large block size, it holds, except with small probability, that

(10)
(11)

where the remaining quantities are as defined in (9). Notably, this last condition is only required in order to obtain a tidy bound and is not necessary in the general case. Moreover, when considering only the asymptotically dominant terms of Theorem 1, this implies that

(12)

Practically speaking, (12) says that, under the stated assumptions, the outputs of the offline truncated SVD and of MOSES Eftekhari et al. (2018) essentially coincide.

B.1 Interpretation of SPCA as a streaming, stochastic algorithm for PCA

It is easy to interpret SPCA as a streaming, stochastic algorithm for PCA. To see this, note that (7) is equivalent to maximising E_{y∼μ} ⟨yy^⊤, X⟩ over rank-r orthogonal projections X = P_𝒰. This restriction can be relaxed to 0 ≼ X ≼ I_d, where A ≽ B denotes that A − B is a positive semi-definite matrix. Using the Schur complement, we can reformulate this program as

(13)

Note that (13) has an objective function that is convex and a feasible set that is conic and convex. However, its gradient can only be computed when μ is known, since otherwise E_{y∼μ}[yy^⊤] is unknown. If only samples from μ are available, and an iterate X_t of the feasible form is at hand, we can instead draw a random vector y_t ∼ μ and move along the direction of y_t y_t^⊤, since its expectation coincides with the gradient; this step is then followed by back-projection onto the feasible set. Namely,

(14)

One can see that in (14) the projection maps onto the unit ball of the spectral norm by clipping at one all of the iterate's singular values that exceed one.

Appendix C Evaluation Details

C.1 Synthetic Datasets

For the tests on synthetic datasets, the vectors are drawn independently from a zero-mean Gaussian distribution with covariance matrix Σ = QΛQ^⊤, where Q is a generic basis obtained by orthogonalising a standard random Gaussian matrix. The entries of the diagonal matrix Λ (the eigenvalues of the covariance matrix Σ) are selected according to a power law, namely λ_i = i^{−α} for a positive α. To be more succinct, wherever possible we employ MATLAB's notation for specifying value ranges in this section.
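For illustration, a MATLAB sketch of this synthetic-data generator is given below; the specific values of d, n, and alpha are placeholders rather than the ones used in the experiments.

% Illustrative generator for the synthetic data described above.
d = 200; n = 4000; alpha = 1;                      % placeholder dimensions and power-law exponent
Q = orth(randn(d));                                % orthogonalise a standard random Gaussian matrix
lam = (1:d)' .^ (-alpha);                          % power-law eigenvalues of the covariance
Sigma = Q * diag(lam) * Q';                        % covariance matrix Sigma = Q * Lambda * Q'
Y = Q * diag(sqrt(lam)) * randn(d, n);             % zero-mean samples with covariance Sigma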

To assess the performance of