Principal component analysis is one of the most extensively studied methods for constructing linear low-dimensional representation of high-dimensional data. Modern applications such as privacy presevering distributed computations (Hardt and Roth (2013)), covariance estimtion of high-frequency data (Chang et al. (2018),Aït-Sahalia et al. (2010)), detecting power grid attacks (Bienstock and Shukla (2018), Escobar et al. (2018)
) etc. require design of sub-linear time algorithms with low storage overhead. Existing work on PCA has focused on design and analysis of single-pass (streaming) algorithms with near-optimal memory and storage complexity assuming stationarity of the underlying data-generating process. However, physical systems generating data for such applications undergo rapid evolution. For example, dynamic market behaviour leads to time-series data with volatile covariance matrices. Our understanding of such physical system crucially relies on accurate estimation of the data generating space. This motivates the design of memory and storage efficient algorithms for analyzing dynamic data. In this work, we consider the streaming PCA problem under the relaxed assumption of non-stationarity of the data generating process.
The streaming PCA problem is concerned with constructing a linear low-dimensional representation of the observed data , in a single pass over the dataset with the goal of recovering the best top-, , dimensional space representing the observations. We consider the streaming PCA problem when the observations are sampled from a spiked covariance model and relax the assumption of stationarity of the data generating distribution. Hence, the observations
are sampled from a Gaussian distribution with a time-varying covariancesuch that and processed in blocks of size where . Define the spectral gap as where, is the
-th largest singular value of matrix. Our goal is to propose a memory and storage efficient algorithm to recover the best -dimensional approximation of the last observed block of data. We define this problem as non-stationary streaming PCA. When , we recover the streaming PCA problem.
1.1 Our Contributions
For non-stationary streaming PCA we study the following:
Price of non-stationarity: When the observations are sampled from a non-stationary variant of the spiked covariance model (equation 2) we establish the fundamental performance limits of any algorithm for non-stationary streaming PCA. In section 2 we derive lower bounds on the expected recovery error, for any given values of (the number of collected samples), rate of rotation , the spectral gap and any algorithm . Theorem 2.1 establishes
. Unlike the stationary case, for the non-stationary case we observe a phase transition in the recovery error as we collect more samples. We show that when the observations are, the error decreases as the inverse of the square root of the number of observations so far. However, the recovery error beyond these observations stagnates to and doesn’t decrease any further upon collecting more samples. In Theorem 3.2, we show that the noisy power method (algorithm 1) can guarantee error when there are sufficient number of samples.
Optimal algorithm and analysis: The noisy power method (algorithm 1) is a memory efficient and storage optimal algorithm for the streaming PCA problem (Mitliagkas et al. (2013)). In section 3 we show that in addition to being memory and storage optimal, it also achieves the fundamental performance limit for the non-stationary streaming PCA problem. However, there are no simple adaptations of existing proof strategies for the non-stationary problem.
To recover the true subspace spanned by the top-
singular vectors offor the streaming PCA problem, one needs to mitigate the effect of sampling noise and ensure that each iteration improves the estimate of the true subspace. The sampling noise can be controlled using concentration of measure techniques and the estimates improve after every iteration since all the observations provide information about the same underlying space. Existing works such as Allen-Zhu and Li (2017) and Hardt and Price (2014) provide an in-depth analysis of the convergence behaviour of the noisy power method based on this general strategy.
In the non-stationary case in addition to bounding tails from time-varying distributions, there is also the challenge of tracking time-varying spaces which generate the observations. A straightforward generalization of conventional proof strategies to the non-stationary problem is not possible since existing methods require that the distance between the true subspace spanned by top- singular vectors and the subspace estimated every iteration decreases. Under the non-stationary model, however, the improvement in estimate after every iteration cannot be guaranteed, especially during the initial phase with the random initialization, since the data-generating space rotates and small improvements in estimates maybe negated by the larger rotations. To circumvent this issue, we introduce a new proof technique based on analyzing the singular value and singular vectors of product of matrices. Precisely, we study the singular values and the left singular vectors of the product
where is the true covariance matrix at the -th iteration and is the noise matrix because of the sampling from time-varying distributions.
We analyze the convergence behaviour of the noisy power method in theorem 3.2. Using lemmas 3.2 and 3.3, we bound singular vectors and singular values of the iterates of the noisy power method and the data-generating subspaces. It turns out that the top- left singular vectors of the matrix product are very close to the top- left singular vectors of and the -th largest singular value is much greater than the -th largest singular value, when is large enough compared to the spectral norm of the noise matrices and . We also recover the results for streaming PCA in corollary 3.1.
1.2 Comparison with existing work
This paper introduces non-stationary streaming PCA. To the best of the our knowledge, there are no known results for this problem. Past work has exclusively focused on the stationary variant of this problem. Corollary 3.1 guarantees -accuracy with samples (where is an intricate function of , see 2) for the streaming PCA problem. It should be noted that although we make assumptions about the spectral norm of the noise matrices, unlike previous work we do not make explicit assumptions on the amount of overlap between the noise matrix and the subspace to be recovered. We instead constraint the ratio of the and largest singular values to facilitate our proof technique.
When the observations are sampled from a generic distribution Hardt and Price (2014) guarantee accuracy with . For the special case of the spiked covariance model, they require samples. Balcan et al. (2016) improve upon the guarantees of Hardt and Price (2014) by reducing the spectral gap to difference between the -th largest singular value and -th largest singular value .
The best known convergence results for the streaming PCA problem are by Allen-Zhu and Li (2017). They establish fundamental performance limits and provide the first spectral-gap free global convergence guarantee for streaming PCA using Oja’s algorithm. Their lower bound for the case when observations are sampled from the spiked covariance model is as compared to our . For Oja’s algorithm, they show that samples are sufficient for -accurate recovery. They also provide the first spectral-gap free analysis for the streaming PCA problem. A spectral-gap free analysis for the non-stationary streaming PCA problem remains an open question.
1.3 Other Related Work
Existing literature on streaming PCA focusses on the stochastic gradient descent based methods or the power method. The power method with a randominitial matrix is well studied in Halko et al. (2011). Some variants of the power method e.g., Golub and Van Loan (2012) and Musco and Musco (2015), enhance the convergence speed for the same target error . Oja and Karhunen (1985)
was the first work to propose a stochastic approximation based approach for recovering the top eigenvector of a matrix. Recently,Li et al. (2018) analyze Oja’s algorithm for recovering the top eigenvector as a stochastic approximation iteration. For recovering the top eigenvector, Shamir (2016a) provide a gap-dependent analysis which was made gap-independent in Shamir (2016b). The iterative eigenvector computation schemes are instances of stochastic approximation based solution for this problem (Arora et al. (2012)). Most literature on stochastic approximation assumes stationarity of the objective function in the optimization problem. Besbes et al. (2015) were the first to consider a non-stationary variant of stochastic approximation methods. They focus on convex optimization problems and focus on providing statistical guarantees rather than designing efficient algorithms.
1.4 Notation and preliminaries
We fix notations and preliminaries used throughout the main body of the paper. Vector spaces are denoted with blackboard bold letters representing the -dimensional euclidean space and represents the -dimensional sphere. Matrices are denoted by bold upper case letters and vectors are denoted by bold lower case letters. For a matrix we use the following notations. Let be the transpose of the matrix and be the largest singular value. Let denote the column of , be the element of M and denote the sub-matrix of M consisting of elements from row to and columns to of the matrix
. The singular value decomposition ofis defined as , where, . is a diagonal matrix with its diagonal element given by . We assume without loss of generality that the singular values and respective singular vectors are ordered from largest to smallest. , and represent the left singular vectors, the diagonal matrix of singular values and the right singular vector of the matrix in context. represents the largest eigen value of the matrix . Let span() denote the range (column space) of and let represent an orthonormal basis of span() when . This can be defined as the column space of
or can be obtained through a QR decomposition of. The precise use of throughout the manuscript will be clear from the context. Let denote the projection matrix of the orthogonal complement of range of given by . The spectral gap of is defined as the difference between the and the largest singular value of denoted by . In absence of ambiguity we use, . We use . We define the distance between the column space of and as . is the projection distance on the Grassmannian manifold, (manifold of all -dimensional subspaces of ) (section 2.5 in Golub and Van Loan (2012) and lemma A.1). All constants appearing in our results are independent of the problem parameters.
The spiked covariance model assumes that observations are lifted from a low-dimensional space and corrupted with high-dimensional Gaussian noise. Let and be the rank of the low-dimensional space. According to a spiked covariance model, is generated as:
The vectors are independent and identically distributed, sampled from a multivariate Gaussian distribution with zero mean and identity covariance. The homoskedastic noise vectors are also independent and identically distributed with zero mean and covariance, . Samples drawn from a spiked covariance model follow a Gaussian distribution .
We incorporate non-stationarity by allowing the underlying subspace to rotate slowly with time. This subspace rotation characterized by parameter is incorporated by allowing the space spanned by its left singular components to rotate slowly. Hence, we assume that is generated as:
We focus on finite-sample recovery guarantees as the performance measure of our algorithms. We are interested in retrieving the column space spanned by the block of observations in the last iteration. Let represent the output of the noisy power method after iterations. The recovery error is expressed as the the distance between the column spaces of the underlying space and the output (theorem 2.6.1 in Golub and Van Loan (2012)). Therefore, after iterations our performance metric is , where are the left singular vectors of true covariance matrix of . Given desired tolerance , our goal is to establish conditions for .
2 Lower Bound
In this section, we establish fundamental performance limits of any algorithm for the non-stationary streaming PCA problem when the observations are sampled from (2) i.e., . To construct the lower bound we consider the family of sequence of matrices defined as:
Let be the top- left singular vectors of . Consider the class of algorithms which estimate from observations and for let the estimate be denoted by
. The probability measure induced over the space of observationsby the sequence is given by:
where, is a sequence of measurable sets in . The minimax risk associated with the problem of inferring from observations is:
where the expectation is taken with respect to .
We establish the lower bound on using a two hypotheses test. In order to establish the bound it is sufficient to lower bound the minimax probability of error (section 2.2 in Tsybakov (2009)) in recovering the top- singular vectors when the observations are generated from family of Gaussian distributions whose covariance are the sum of matrices in two suitably choosen sequences in and . We bound the minimax probability of error using lemma A.2 and the KL-divergence between the product distribution generated by these sequences. Consider the following two hypotheses for this purpose:
We note the following about the proposed hypotheses and :
The sequence and are in .
For , we have
For the sequences and , we have
We establish the lower bound for the non-stationary streaming PCA problem in theorem 2.1 using lemma A.2. In order to apply lemma A.2 we bound the KL-divergence between the product distributions of observations generated using and by a constant in lemma 2.1.
Theorem 2.1 (Lower Bound).
Given observations, rate of change and the spectral gap such that . For any algorithm the minimax estimation error between any two sequences of matrices and belonging to is given by:
The theorem follows from lemma A.2. We verify the necessary conditions for the generic reduction scheme outlined in section 2 (also see section 2.2 in Tsybakov (2009)). The application of lemma A.2 relies on applicability of the generic scheme. We then verify the conditions required for lemma A.2 and conclude the result. To this end we need to ensure the following:
The first two properties follow from the construction detailed in section 2. The third property is established in lemma 2.1 with . Putting this into lemma A.2, for minimax probability of error we have,
Lemma 2.1 (Bound on KL-divergence).
Proof sketch Using the properties of the Gaussian distribution and , the KL-divergence between and is . The sequence and is carefully constructed so that the distance between the column space of the sequence of matrices so that is bounded by a parameter-independent constant. The bound on in theorem 2.1 exhibits a phase transition phenomenon with the first term representing the effect of non-stationarity. Initially, decreases with the rate of . However, when , the first term of dominates the second term and becomes independent to the number of samples . This also validates our intuition that in a dynamic environment past information quickly becoming stale. For the streaming PCA problem, we recover the fundamental limit as ( in theorem 2.1).
3 Non-Stationary Streaming PCA
The noisy power method (alogrithm 1) is an iterative algorithm for computing the top- eigen vectors of a given matrix. It is initialized with a random -dimensional basis of , (line 3 in algorithm 1). Each iteration computes a representative orthogonal basis upon observing a new block of data, (line 7 in algorithm 1), where denote an orthonormal matrix whose columns span the column space of (e.g. it can be computed through the QR decomposition of ).
We begin by considering the problem of recovering the top- singular vectors of a symmetric matrix using the power method. Let . Starting with a random initial matrix , in the iteration the power method computes, . The column space of is equivalent to the column space of since,
where, is the matrix computed each time from the decomposition of . Hence, . We analyze the distance between the output of the power method and the space spanned by , from the singular values of . In theorem 3.1, we quantify the decrease in the error in estimation of the top- singular vectors of . After each iteration of the power method, the desired singular vector are amplified by at least where as the remaining are amplified by at most . Since the power iterations start with a random initialization , we have (Hardt and Price (2014)): Theorem 3.1 bounds the distance between the top- singular vectors of as:
Theorem 3.1 (Perturbation by multiplication).
Let and . Let () and . Assume, and . Then,
Proof sketch Bounding the distance between the column spaces of and the output is equivalent to bounding the largest singular value of the projection of the space orthogonal to onto . This projects onto the space spanned by the last right singular vectors . The projection of onto the right singular vectors is bounded using lemma A.5.
We now establish the convergence behaviour of the noisy power method (algorithm 1) in presence of noise and non-stationarity. We observe vectors in blocks of size , . For all let be the deviation of the empirical covariance from the expected empirical covariance of the vector at the end of every block of computation :
In presence of non-stationarity: (i) we need to bound for sampling noise from time-varying distributions and (ii) analyse the distance between the time-varying underlying subspaces and the output of the noisy power method. In lemma 3.1, we obtain a bound on the spectral norm of the noise matrix using concentration inequalities to bound the spectral norm of the noise matrix due to sampling from time-varying distributions.
Lemma 3.1 (Spectral norm of noise).
Given , spectral gap , observations generated according to 2 with , with probability ,
We now bound the distance between the output of the noisy power method and the time varying underlying subspaces. Let . The output of the noisy power method (algorithm 1) is equivalent to computing an orthonormal basis for the product . In order to analyse the convergence behaviour we bound the distance between the output of the noisy power method, and the subspace to be recovered, However, due to non-stationarity there doesn’t exist a fixed reference subspace with respect to which we can bound these distances. In lieu of a fixed reference subspace, we identify a sequence of and dimensional subspaces of , arising from the observed vectors , such that the sequence of -dimensional subspaces remains close to the underlying subspace every iteration and the first subspaces of the sequences are mutually orthogonal. We identify this sequence of and dimensional subspaces, through the column space of a sequence of matrices and such that is orthogonal to . We need to ensure that the the identified -dimensional subspaces are close to the true subspace and their distance from the subspace estimates of the noisy power method is not too large. We also need to ensure that the identified -dimensional subspaces are far from the true subspace and the estimates of the noisy power method. Then, using the sequence of identified subspaces as a reference we can relate the distance between the and . We formalize this idea in theorem 3.2 to establish the convergence behaviour of the noisy power method for non-stationary streaming PCA.
Before stating the construction of the sequence of subspaces and the convergence analysis we state our assumptions. Let be the singular value decomposition of , and . By equation 2 and triangle inequality we have that:
Let . Using Davis-Kahan sin theorem (lemma A.3), we have:
For all we have and . Note that when , for , with
with probability , for all the following holds:
Theorem 3.2 (Iteration).
Proof sketch We use lemma 3.2 to identify and iteratively construct the sequence to ensure that has small overlap generating the observations in the iteration and the space spanned by top- singular vectors generating those observations. Consequently, the column space of lies in the column space of and the largest singular value of the product of and is decreases exponentially i.e.
Similarly, using lemma 3.3 we identify and iteratively construct the sequence . We ensure that for all , the column space of is away from and the least singular value of is at least . Therefore, the distance between the true underlying space and is less than and the least singular value of the product increases exponentially i.e.
We bound the distance between and through . The construction ensures that the distance between and is at most . Due to the power iterations, the distance between and is identical to the distance between and . We can bound the distance between and by projecting onto space spanned by and its orthogonal complement and using properties 4 and 5. The sequences of subspaces and in theorem 3.2 are identified using lemma 3.2 and lemma 3.3 respectively. Beginning with , in order to establish property 4, it is necessary to ensure that in every iteration , and and . Lemma 3.2 we show that, if is almost orthogonal to , then has to be almost orthogonal to thereby facilitating the iterative construction of .
Lemma 3.2 (Orthogonal projection after perturbation).
Let be a positive definite matrix and SVD. Let and let be the set of matrices with orthonormal columns such that with . Under the conditions of theorem 3.2, for every given there exists a orthonormal matrix such that