An Improved Gap-Dependency Analysis of the Noisy Power Method

02/23/2016 · Maria-Florina Balcan et al. · Carnegie Mellon University

We consider the noisy power method algorithm, which has wide applications in machine learning and statistics, especially those related to principal component analysis (PCA) under resource (communication, memory or privacy) constraints. Existing analysis of the noisy power method shows an unsatisfactory dependency on the "consecutive" spectral gap (σ_k − σ_{k+1}) of an input data matrix, which could be very small and hence limits the algorithm's applicability. In this paper, we present a new analysis of the noisy power method that achieves improved gap dependency for both sample complexity and noise tolerance bounds. More specifically, we improve the dependency on (σ_k − σ_{k+1}) to a dependency on (σ_k − σ_{q+1}), where q is an intermediate algorithm parameter and could be much larger than the target rank k. Our proofs are built upon a novel characterization of proximity between two subspaces that differs from the canonical angle characterizations analyzed in previous works. Finally, we apply our improved bounds to distributed private PCA and memory-efficient streaming PCA and obtain bounds that are superior to existing results in the literature.


1 Introduction

Principal Component Analysis (PCA) is a fundamental problem in statistics and machine learning. The objective of PCA is to find a small number of orthogonal directions in the d-dimensional Euclidean space that have the highest variance of a given sample set. Mathematically speaking, given a positive semi-definite matrix of interest A ∈ R^{d×d} (A is usually the sample covariance matrix of data points x_1, …, x_n), one wishes to find the top-k eigen-space of A, where k is the number of principal directions of interest and is typically much smaller than the ambient dimension d. A popular algorithm for computing PCA is the matrix power method, which starts with a random matrix X_0 ∈ R^{d×p} (p ≥ k) with orthonormal columns and iteratively performs the following computation for t = 1, …, L:

  1. Subspace iteration: Y_t = A X_{t−1}.

  2. QR factorization: Y_t = X_t R_t, where X_t has orthonormal columns and R_t is an upper-triangular matrix.
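For concreteness, here is a minimal NumPy sketch of the exact (noiseless) version of these two steps; the function name and default arguments are illustrative, not from the paper.

```python
# Exact power method / subspace iteration with a per-iteration QR step.
import numpy as np

def power_method(A, p, n_iter=100, seed=0):
    """Return an orthonormal d x p basis approximating the top-p eigen-space of PSD A."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    X, _ = np.linalg.qr(rng.standard_normal((d, p)))   # random orthonormal start X_0
    for _ in range(n_iter):
        Y = A @ X                                      # 1. subspace iteration
        X, _ = np.linalg.qr(Y)                         # 2. re-orthonormalization via QR
    return X
```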

It is well-known that when the number of iterations L is sufficiently large, the span of the output X_L can be arbitrarily close to U_k, the top-k eigen-space of A; that is, ‖(I − X_L X_L^⊤) U_k‖ ≤ ε for arbitrarily small ε > 0. One particular drawback of the power method is that its rate of convergence depends on the consecutive eigengap (σ_k − σ_{k+1}) when p = k (i.e., when X_t has exactly the same number of columns as the target rank k). The consecutive eigengap could be very small for practical large-scale matrices. As a remedy, practitioners generally set p to be slightly larger than k for faster convergence and numerical stability [15]. [8] formally justifies this practice by proving that, under mild conditions, the dependency on (σ_k − σ_{k+1}) can be improved to the "larger" spectral gap (σ_k − σ_{q+1}) for some q between k and p, which may be significantly larger than the consecutive gap even if q is of the same order as k. (Sec. 2 provides such an example matrix with power-law decaying spectrum.) Despite the wide applicability and extensive analysis of the (exact) matrix power method, in practice it is sometimes desirable to analyze a noisy version of the power method, where each subspace iteration is corrupted with noise. Such noise could come from resource constraints such as machine precision or memory storage, or from artificially imposed constraints serving additional objectives such as data privacy preservation. In both cases, the noise model can be expressed as Y_t = A X_{t−1} + G_t, where G_t is a noise matrix for iteration t that can be either stochastic or deterministic (adversarial). Note that G_t could differ from iteration to iteration, but the QR factorization step is still assumed to be exact. The noisy power method has attracted increasing interest from both the machine learning and theoretical computer science communities due to its simplicity and broad applicability [10, 12, 15, 14]. In particular, [10] establishes both convergence guarantees and error tolerance (i.e., the largest magnitude of the noise matrices G_t under which the algorithm still produces consistent estimates of U_k) for the noisy power method. [10] also applied their results to PCA with resource (privacy, memory) constraints and obtained improved bounds over existing results.

1.1 Our contributions

Improved gap dependency analysis of the noisy power method

Our main contribution is a new analysis of the noisy power method with improved gap dependency. More specifically, we improve the prior (σ_k − σ_{k+1}) gap dependency to a (σ_k − σ_{q+1}) dependency, where q is an integer between the target rank k and the iteration rank p (the number of columns used in the subspace iteration). Our results partially solve an open question in [10], which conjectured that such an improvement in gap dependency should be possible if p is larger than k. To our knowledge, our bounds are the first to remove the dependency on the consecutive spectral gap for the noisy power method.

Gap-independent bounds

As a by-product of our improved gap dependency analysis, we apply techniques from a recent paper [15] to obtain gap-independent bounds for the approximation error ‖A − X_L X_L^⊤ A‖. This partially addresses another conjecture in [10] regarding gap-independent approximation error bounds, at the cost of slightly worse bounds on the magnitude of the noise matrices G_t.

Applications

The PCA problem has been previously considered under various resource constraints. Two particularly important directions are private PCA [11, 6, 4, 10], where privacy of the data matrix being analyzed is formally preserved, and distributed PCA [2, 3], where data matrices are stored separately on several machines and communication among machines is constrained. In this paper we propose a distributed private PCA problem that unifies these two settings. Our problem includes the entrywise private PCA setting in [11, 10] and the distributed PCA setting in [2] as special cases, and we demonstrate improved bounds over existing results for both problems.

We also apply our results to the memory-efficient streaming PCA problem considered in [10, 12, 14], where data points arrive in streams and the algorithm is only allowed to use memory proportional to the size of the final output. Building upon our new analysis of the noisy power method, we improve the state-of-the-art sample complexity bounds obtained in [10].

Proof techniques

The noisy power method poses unique challenges for an improved gap dependency analysis. Previous such analyses for the exact power method in [8, 1] consider a variant of the algorithm that computes a QR decomposition only after the last subspace iteration. This strategy is no longer valid for the noisy power method because, without per-iteration QR decomposition, the noise accumulates across iterations and eventually breaks the proximity between the final output X_L and the target top-k eigen-space U_k. In the analysis of [10], the largest principal angle between X_t and U_k is considered for every iteration t. However, such an analysis cannot possibly remove the dependency on (σ_k − σ_{k+1}), as we discuss in Sec. 2.1. To overcome these difficulties, we propose in Eq. (3) a novel characterization of proximity between a rank-p subspace and the rank-k target space through an intermediate subspace, which we name a rank-k perturbation. This quantity does not correspond to any principal angle between linear subspaces in general. Building upon the shrinkage behavior of the proposed quantity across iterations, we are able to obtain improved gap dependency for the noisy power method. We hope our proof can shed light on the analysis of an even broader family of numerical linear algebra algorithms that involve noisy power iterations.

1.2 Setup

For a positive semi-definite matrix A ∈ R^{d×d}, we denote A = UΣU^⊤ as its eigen-decomposition, where U is an orthogonal matrix and Σ = diag(σ_1, …, σ_d) is a diagonal matrix consisting of the eigenvalues of A, sorted in descending order: σ_1 ≥ σ_2 ≥ … ≥ σ_d ≥ 0. The spectral norm and Frobenius norm can then be expressed as ‖A‖_2 = σ_1 and ‖A‖_F = (σ_1^2 + … + σ_d^2)^{1/2}. For an integer k ≤ d, we define U_k ∈ R^{d×k} as the matrix with orthonormal columns whose column space corresponds to the top-k eigen-space of A; similarly, Σ_k corresponds to the top-k eigenvalues of A. Let A_k = U_k Σ_k U_k^⊤ be the optimal rank-k approximation of A. It is well-known that A_k is the optimal rank-k approximation of A in both spectral norm (‖A − A_k‖_2 = σ_{k+1}) and Frobenius norm (‖A − A_k‖_F = (σ_{k+1}^2 + … + σ_d^2)^{1/2}) [7].
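The following short NumPy snippet illustrates this notation on an arbitrary test matrix (the dimensions are illustrative choices): it forms the eigen-decomposition of a PSD matrix, the top-k eigen-space U_k, and the optimal rank-k approximation A_k, and checks the two optimality identities stated above.

```python
# Sketch of the setup notation: eigen-decomposition, top-k eigen-space, optimal rank-k approximation.
import numpy as np

rng = np.random.default_rng(0)
d, k = 50, 5
B = rng.standard_normal((d, d))
A = B @ B.T                                    # a positive semi-definite matrix

sigma, U = np.linalg.eigh(A)                   # eigenvalues in ascending order
sigma, U = sigma[::-1], U[:, ::-1]             # re-sort in descending order
U_k, Sigma_k = U[:, :k], np.diag(sigma[:k])
A_k = U_k @ Sigma_k @ U_k.T                    # optimal rank-k approximation

print(np.linalg.norm(A - A_k, 2), sigma[k])                              # equals sigma_{k+1}
print(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(sigma[k:] ** 2)))   # tail of the spectrum
```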

QR factorization is a process for obtaining an orthonormal column basis of a matrix. For a matrix Y ∈ R^{d×p}, QR factorization gives Y = QR, where Q ∈ R^{d×p} has orthonormal columns and R ∈ R^{p×p} is an upper triangular matrix [16].

2 An improved analysis of the noisy power method

  Input: positive semi-definite data matrix A ∈ R^{d×d}, target rank k, iteration rank p ≥ k, number of iterations L.
  Output: approximated eigen-space X_L ∈ R^{d×p}, with orthonormal columns.
  Initialization: orthonormal X_0 ∈ R^{d×p} obtained by QR decomposition of a random Gaussian matrix;
  for t = 1 to L do
      1. Observe Y_t = A X_{t−1} + G_t for some noise matrix G_t;
      2. QR factorization: Y_t = X_t R_t, where X_t consists of orthonormal columns;
  end for
Algorithm 1 The noisy matrix power method
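A minimal NumPy sketch of Algorithm 1 follows; the Gaussian noise used here is just one way to instantiate the noise matrices G_t, which the paper allows to be arbitrary (even adversarial).

```python
# Sketch of Algorithm 1 (the noisy power method) with a Gaussian instantiation of G_t.
import numpy as np

def noisy_power_method(A, p, L, noise_std=0.0, rng=None):
    rng = rng or np.random.default_rng(0)
    d = A.shape[0]
    X, _ = np.linalg.qr(rng.standard_normal((d, p)))       # orthonormal X_0
    for _ in range(L):
        G = noise_std * rng.standard_normal((d, p))         # noise matrix G_t
        Y = A @ X + G                                       # noisy subspace iteration
        X, _ = np.linalg.qr(Y)                              # exact QR step
    return X
```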

The noisy power method is described in Algorithm 1. [10] provides the first general-purpose analysis of the convergence rate and noise tolerance of Algorithm 1. We cite their main theoretical result below:

Theorem 2.1 ([10]).

Fix ε < 1/2 and let k ≤ p. Let U_k be the top-k eigenvectors of a positive semi-definite matrix A and let σ_1 ≥ … ≥ σ_d denote its eigenvalues. Suppose at every iteration t of the noisy power method the noise matrix G_t satisfies

‖G_t‖_2 ≤ c·ε(σ_k − σ_{k+1})   and   ‖U_k^⊤ G_t‖_2 ≤ c·(σ_k − σ_{k+1})(√p − √(k−1))/(τ√d)

for some fixed constant c and parameter τ. Assume in addition that the number of iterations L is lower bounded as

L = Ω((σ_k/(σ_k − σ_{k+1})) · log(dτ/ε)).

Then with probability at least

1 − τ^{−Ω(p+1−k)} − e^{−Ω(d)}

we have ‖(I − X_L X_L^⊤) U_k‖_2 ≤ ε.

Theorem 2.1 has one major drawback: both the noise tolerance and the convergence rate depend crucially on the "small" singular value gap σ_k − σ_{k+1}. This gap could be extremely small for most data matrices in practice since it concerns the difference between two consecutive singular values. We show in later paragraphs an example where such gap dependency leads to significant deterioration in both error tolerance and computation. A perhaps even more disappointing fact is that the dependency on σ_k − σ_{k+1} cannot be improved under the existing analytical framework by increasing p, the number of components maintained in X_t at each iteration. On the other hand, one expects the noisy power method to be more robust to per-iteration noise when p is much larger than k. This intuition has been formally established in [8] under the noiseless setting and was also articulated as a conjecture in [10]:

Conjecture 2.1 ([10]).

The noise tolerance terms in Theorem 2.1 can be improved to

(1)

In this section, we provide a more refined theoretical analysis of the noisy matrix power method presented in Algorithm 1. Our analysis significantly improves the gap dependency over existing results in Theorem 2.1 and partially solves Conjecture 2.1 up to additional constant-level dependencies:

Theorem 2.2 (Improved gap-dependent bounds for noisy power method).

Let k ≤ q ≤ p. Let U_k be the top-k eigenvectors of a positive semi-definite matrix A, let σ_1 ≥ … ≥ σ_d denote its eigenvalues, and fix ε. Suppose at every iteration t of the noisy power method the noise matrix G_t satisfies

for some constant c. Then after

iterations, with probability at least , we have

Furthermore, for the spectral norm or the Frobenius norm, the low-rank approximation error is upper bounded as

where A_k is the optimal rank-k approximation of A.

Discussion

Compared to the existing bounds in Theorem 2.1, both the noise tolerance and the convergence rate of the noisy power method are significantly improved in Theorem 2.2, where the main gap-dependent term is improved to (σ_k − σ_{q+1}) for an intermediate singular value σ_{q+1} with k ≤ q ≤ p. Since the singular values are non-increasing, setting a large value of q in Theorem 2.2 improves the bounds. However, q cannot be too close to p, due to the presence of a term that degrades as q approaches p. In addition, the convergence rate (i.e., the bound on the number of iterations L) specified in Theorem 2.2 reproduces the recent results of [8] for the power method in the noiseless setting (G_t = 0). There are three main differences between our theorem and the conjecture raised by [10]. First, the strength of the projected noise also depends on additional parameters; however, in many applications this requirement is implied by the assumption on ‖G_t‖. Second, our dependence on the problem parameters differs slightly from the conjectured one, although under mild conditions on p, q and k the corresponding term is of the same order as in the conjecture. Lastly, we notice that the second term of (1) is totally independent of the singular values and their gap, which seems to be either a typo or an unattainable result. Nonetheless, Theorem 2.2 is a significant improvement over Theorem 2.1.

To further shed light on the nature of our obtained results, we consider the following example to get a more interpretable comparison between Theorem 2.2 and 2.1:

Example: power-law decaying spectrum

We consider the example where the spectrum of the input data matrix has power-law decay; that is, σ_j ∝ j^{−α} for some parameter α > 0. Many data matrices that arise in practical data applications have such a spectral decay property [13]. The small eigengap σ_k − σ_{k+1} is on the order of Θ(k^{−(α+1)}), so the ratio σ_k/(σ_k − σ_{k+1}) scales linearly with k. As a result, the number of iterations must scale linearly with k (up to logarithmic factors), which implies a total running time that also grows linearly with k. On the other hand, by setting q = ck for some constant c > 1, the "large" spectral gap σ_k − σ_{q+1} is on the order of Θ(k^{−α}), so σ_k/(σ_k − σ_{q+1}) = Θ(1). Consequently, the number of iterations under the new theoretical analysis only needs to scale logarithmically with the problem parameters, and the total number of flops is smaller by a factor of k. This is an improvement over existing bounds for the noisy power method.
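The following short computation makes the gap comparison concrete for the illustrative values α = 1, k = 20 and q = 2k (an instance of the choice q = ck above):

```python
# Numeric illustration: consecutive gap sigma_k - sigma_{k+1} versus the larger gap
# sigma_k - sigma_{q+1} for a power-law spectrum sigma_j = j ** (-alpha).
import numpy as np

alpha, k = 1.0, 20
q = 2 * k                                     # illustrative choice q = 2k
sigma = lambda j: float(j) ** (-alpha)

consecutive_gap = sigma(k) - sigma(k + 1)     # Theta(k^{-(alpha + 1)})
larger_gap = sigma(k) - sigma(q + 1)          # Theta(k^{-alpha})
print(consecutive_gap, larger_gap)
print(sigma(k) / consecutive_gap, sigma(k) / larger_gap)   # iteration-count factors
```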

Apart from convergence rates, our new analysis also improves the noise tolerance (i.e., the bound on ‖G_t‖) in an explicit way when the data matrix is assumed to have power-law spectral decay. More specifically, the earlier results in [10] require the magnitude of the noise matrix to be upper bounded by O(ε·k^{−(α+1)}), while under the new analysis (Theorem 2.2) a bound of the form O(ε·k^{−α}) suffices, provided that q = ck for some constant c > 1 and ε is small. This is another improvement in terms of bounds on the maximum tolerable amount of per-iteration noise.

2.1 Proof of Theorem 2.2

Before presenting our proof of the main theorem (Theorem 2.2), we first review the arguments in [10] and explain why straightforward adaptations of their analysis cannot lead to improved gap dependency. [10] considered the tangent of the k-th principal angle between U_k and X_t:

(2)

where U_{k,⊥} is the orthogonal complement of the top-k eigen-space of A. It can then be shown that when both ‖G_t‖ and ‖U_k^⊤ G_t‖ are properly bounded, the angle shrinks geometrically after each power iteration; that is, the tangent contracts by a fixed factor strictly less than one at every step. However, as pointed out by [10], this geometric shrinkage might not hold with a larger level of noise.
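As a point of reference, the principal angles between span(U_k) and span(X_t) can be computed from the singular values of U_k^⊤ X_t; the following minimal sketch returns the largest such angle (the angle itself, rather than the tangent quantity of Eq. (2)).

```python
# Sketch: largest principal angle between span(X) and the top-k eigen-space span(U_k),
# obtained from the smallest singular value of U_k^T X (both inputs have orthonormal columns).
import numpy as np

def largest_principal_angle(U_k, X):
    s = np.linalg.svd(U_k.T @ X, compute_uv=False)    # cosines of the principal angles
    return np.arccos(np.clip(s.min(), -1.0, 1.0))     # largest angle = arccos of smallest cosine
```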

To overcome these difficulties, in our analysis we consider a different characterization of the proximity between X_t (or Y_t) and U_k at each iteration. Let U_k and U_q be the top-k and top-q eigenvectors of A, and let U_{q,⊥} denote the remaining eigenvectors. For an orthonormal matrix X, define the rank-k perturbation on X by U_{q,⊥} as

(3)

Intuitively, this quantity extracts a certain rank-k component from X. In the ideal (noiseless) case, the first k columns of X coincide with U_k and the corresponding block in Eq. (3) is the identity matrix; here we relax this goal in that we only test whether the first k columns of X are close to U_k. The quantity also differs from the tangent in Eq. (2), as its definition involves the intermediate rank-q subspace.

Lemma 2.1.

If the noise matrix at each iteration satisfies

for some sufficiently small absolute constant , define

we then have

for some sufficiently small global constant .

The following lemma bounds the rank-k perturbation quantity at initialization, when X_0 is obtained via QR decomposition of a random Gaussian matrix as described in Algorithm 1.

Lemma 2.2.

With all but probability, we have that

Finally, Lemma 2.3 shows that small values of the perturbation quantity imply small angles between X_t and U_k.

Lemma 2.3.

For any , if then .

The proofs of Lemmas 2.1, 2.2 and 2.3 involve some fairly technical matrix computations and are thus deferred to Appendix A. We are now ready to prove Theorem 2.2:

Proof of Theorem 2.2.

First, the chosen parameters ensure that Corollary A.1 in Appendix A holds; therefore, the noise conditions in Theorem 2.2 imply the noise conditions of Lemma 2.1 with high probability. As a result, the following holds for all t = 1, …, L:

(4)

where and is an absolute constant. Define . Eq. (4) is then equivalent to . In addition, Lemma 2.2 yields

with high probability. Consequently, with iterations we have . can then be bounded by

Subsequently, invoking Lemma 2.3 we get , where we adopt the definition of from [10]. By Theorem 9.1 of [9], we can also obtain the desired bound on the residual norm . The constant in can be absorbed into the bounds of and .

We next simplify the bound . We first upper bound the shrinkage parameter as follows:

where the last inequality holds because a weighted mean is no larger than the maximum of the two terms. Then we further have

where the last inequality results from . Subsequently, can be upper bounded as

where we use the fact that and the term is absorbed into . ∎

2.2 Gap-independent bounds

We digress slightly here to consider gap-independent bounds for the noisy power method, which follow as a straightforward application of the gap-dependent bounds derived in Theorem 2.2. It is clear that the angle between X_L and U_k cannot be bounded in a gap-free manner, because the top-k eigen-space is ill-defined when the spectral gap σ_k − σ_{k+1} (or σ_k − σ_{q+1}) is small. On the other hand, it is possible to derive gap-independent bounds for the approximation error ‖A − X_L X_L^⊤ A‖, because X_L does not need to be close to U_k in order to achieve a good approximation of the original data matrix A. This motivated Hardt and Price to state the following conjecture on gap-independent bounds for the noisy power method:
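To make this distinction concrete, the following self-contained sketch runs a noisy power iteration, extracts a rank-k approximation from the p-dimensional iterate via a standard post-processing step (our illustrative choice, not necessarily the paper's exact statement), and compares its error to the optimal value σ_{k+1}.

```python
# Sketch: gap-independent quality measure. Extract a rank-k approximation from the
# p-dimensional iterate and compare its error with the rank-k optimum sigma_{k+1}.
import numpy as np

rng = np.random.default_rng(1)
d, k, p, L = 100, 5, 10, 50
B = rng.standard_normal((d, d))
A = (B @ B.T) / d                                       # PSD test matrix

X, _ = np.linalg.qr(rng.standard_normal((d, p)))
for _ in range(L):
    X, _ = np.linalg.qr(A @ X + 1e-3 * rng.standard_normal((d, p)))   # noisy iteration

M = X.T @ A @ X                                         # p x p projection of A onto span(X)
w, V = np.linalg.eigh(M)
Z = X @ V[:, np.argsort(w)[::-1][:k]]                   # rank-k basis extracted from X

sigma = np.sort(np.linalg.eigvalsh(A))[::-1]
print(np.linalg.norm(A - Z @ Z.T @ A, 2), sigma[k])     # approximation error vs. sigma_{k+1}
```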

Conjecture 2.2 ([10]).
(We rephrase the original conjecture so that the noise bound does not scale with the singular values.)

Fix ε, and suppose the noise matrix G_t satisfies

(5)

for all iterations t. Then with high probability, after L iterations we have

Building upon the gap-dependent bound derived in the previous section and a recent technique introduced in [15] for the analysis of block Lanczos methods, we are able to prove the following theorem, which partially resolves Conjecture 2.2.

Theorem 2.3.

Fix ε and suppose the noise matrix G_t satisfies

for some constant . Then after

iterations, with probability at least , we have

The major difference between Theorem 2.3 and its target, Conjecture 2.2, is an extra term in the noise bounds on both ‖G_t‖ and ‖U^⊤ G_t‖. Whether this gap can be closed remains an important open question. The main idea of the proof is to find an appropriate intermediate rank and apply Theorem 2.2 with suitably chosen target and intermediate ranks. A complete proof is deferred to Appendix B.

3 Application to distributed private PCA

Our main result readily leads to improvements in several downstream applications, which we highlight in this section and the next. Specifically, we discuss the benefits for the distributed private PCA setting in this section, and for memory-efficient streaming PCA in the next.

3.1 The model

In our distributed private PCA model there are s computing nodes, each storing a positive semi-definite matrix A_i ∈ R^{d×d}; A_i can be viewed as the sample covariance matrix of the data points stored on node i. There is also a central computing node with no data stored. The objective is to approximately compute the top-k eigen-space of the aggregated data matrix A = A_1 + … + A_s without leaking information about the individual data matrices A_i. Each of the s computing nodes can communicate only with the central node, via a public channel where all bits communicated are visible to the other nodes as well as to any malicious party. We are interested in algorithms that meet the following formal guarantees:

Privacy guarantee

We adopt the concept of (ε, δ)-differential privacy proposed in [5]. Fix privacy parameters ε, δ > 0. Let M(A_1, …, A_s) denote all the bits communicated via the public channels between the computing nodes and the central node. For every (A_1, …, A_s) and every (A_1′, …, A_s′) that differs from (A_1, …, A_s) in at most one entry, with absolute difference at most 1, the following holds:

Pr[M(A_1, …, A_s) ∈ S] ≤ e^ε · Pr[M(A_1′, …, A_s′) ∈ S] + δ,

(6)

where the probability is over the randomness of the algorithm and S is any measurable set of communicated bits.

Utility guarantee

Suppose X is the output matrix with orthonormal columns. It is required that

with probability at least 0.9, where the parameter characterizes the error level and U_k is the top-k eigen-space of the aggregated data matrix A.

Communication guarantee

The total number of bits communicated between the computing nodes and the central node is constrained. More specifically, we assume that only a bounded number of real numbers can be communicated via the public channels.

The model we considered is very general and reduces to several existing models of private or communication constrained PCA as special cases. Below we give two such examples that were analyzed in prior literature.

Remark 3.1 (Reduction from private PCA).

Setting s = 1 in our distributed private PCA model, we obtain the private PCA model previously considered in [10, 11], where neighboring data matrices differ in one entry with bounded absolute difference. (This case is actually harder than the models considered in [10, 11], in that intermediate steps of the noisy power method are released to the public as well; however, this does not invalidate the analysis of private PCA algorithms based on the noisy power method, because of the privacy composition rule.)

Remark 3.2 (Reduction from distributed PCA).

Removing the privacy constraint (and hence the injected noise), we obtain the distributed PCA model previously considered in [2], where columns (data points) are split and stored separately on different computing nodes.

3.2 Algorithm and analysis

We say an algorithm solves the -distributed private PCA problem if it satisfies all three guarantees mentioned in Sec. 3.1 with corresponding parameters. Algorithm 2 describes the idea of executing the noisy power method with Gaussian noise in a distributed manner.

  Input: distributedly stored data matrices A_1, …, A_s, number of iterations L, target rank k, iteration rank p, privacy parameters ε, δ.
  Output: approximated eigen-space X_L ∈ R^{d×p}, with orthonormal columns.
  Initialization: orthonormal X_0 ∈ R^{d×p} obtained by QR decomposition of a random Gaussian matrix; noise variance parameter σ;
  for t = 1 to L do
      1. The central node broadcasts X_{t−1} to all computing nodes;
      2. Computing node i computes Y_t^{(i)} = A_i X_{t−1} + G_t^{(i)}, with Gaussian noise G_t^{(i)} of per-entry variance σ^2, and sends Y_t^{(i)} back to the central node;
      3. The central node computes Y_t = Y_t^{(1)} + … + Y_t^{(s)} and the QR factorization Y_t = X_t R_t.
  end for
Algorithm 2 Distributed private PCA via distributed noisy power method
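A minimal single-process simulation of Algorithm 2 is sketched below; the Gaussian noise calibration shown is the generic Gaussian-mechanism form and is our assumption for illustration, not necessarily the exact variance parameter used in the paper.

```python
# Sketch of Algorithm 2: each node perturbs its local product A_i @ X with Gaussian noise
# before sending it to the central node, which sums the contributions and re-orthonormalizes.
import numpy as np

def distributed_private_power_method(A_list, p, L, eps, delta, rng=None):
    rng = rng or np.random.default_rng(0)
    d = A_list[0].shape[0]
    noise_std = np.sqrt(2 * L * np.log(1.25 / delta)) / eps    # assumed calibration, for illustration
    X, _ = np.linalg.qr(rng.standard_normal((d, p)))           # shared initial iterate X_0
    for _ in range(L):
        Y = np.zeros((d, p))
        for A_i in A_list:                                     # one term per computing node
            Y += A_i @ X + noise_std * rng.standard_normal((d, p))
        X, _ = np.linalg.qr(Y)                                 # central node: aggregate + QR
    return X
```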

The following theorem shows that Algorithm 2 solves the -distributed private PCA problem with detailed characterization of the utility parameter and communication complexity . Its proof is deferred to Appendix C.

Theorem 3.1 (Distributed private PCA).

Let s be the number of nodes and A_1, …, A_s be the data matrices stored separately on the nodes. Fix target rank k, intermediate rank q and iteration rank p with k ≤ q ≤ p. Suppose the number of iterations L is set as in Theorem 2.2. Let ε, δ be privacy parameters. Then Algorithm 2 solves the distributed private PCA problem with

Here we assume the conditions in Theorem 2.2 are satisfied, and the incoherence [11] of the aggregated data matrix A appears in the bound; it is defined in terms of the eigen-decomposition A = UΣU^⊤ of A.

It is somewhat difficult to evaluate the results obtained in Theorem 3.1 because our work, to our knowledge, is the first to consider distributed private PCA with the public channel communication model. Nevertheless, for the two special cases of private PCA in Remark 3.1 and distributed PCA in Remark 3.2, our result does significantly improve existing analyses. More specifically, we have the following two corollaries based on Theorem 3.1 and Theorem 2.2.

Corollary 3.1 (Improved private PCA).

For the case of s = 1 (private PCA as in Remark 3.1), Algorithm 2 is (ε, δ)-differentially private and satisfies

with probability at least 0.9. Here U_k is the top-k eigen-space of the input data matrix.

Corollary 3.2 (Improved distributed PCA).

Fix an error tolerance parameter and set the parameters of Algorithm 2 accordingly. We then have, with high probability,

Here U_k is the top-k eigen-space of the aggregated matrix A.

The proofs of Corollaries 3.1 and 3.2 are simple and deferred to Appendix C. We now compare them with existing results in the literature. For private PCA, our bound has a better spectral-gap dependency than the bound obtained in [10]. For distributed PCA, our bound achieves an exponential improvement over the communication complexity bound obtained in [2]. (Lemma 8 of [2] gives a communication upper bound that depends on all singular values larger than a threshold; it is not obvious which bound is better, but in the worst case their bound is still linear in that quantity.)

4 Application to memory-efficient streaming PCA

In the streaming PCA setting, a computing machine receives a stream of samples z_1, z_2, … drawn i.i.d. from an unknown underlying distribution D over R^d. The objective is to compute the leading k eigenvectors of the population covariance matrix Σ = E_{z∼D}[zz^⊤] with memory space constrained to the order of the output size. [14] gave an algorithm for this problem based on the noisy power method; Algorithm 3 gives the details.

  Input: data stream z_1, z_2, …, target rank k, iteration rank p, iteration number L.
  Output: approximated eigen-space X_L ∈ R^{d×p}, with orthonormal columns.
  Initialization: uniformly sampled orthonormal matrix X_0 ∈ R^{d×p};
  for t = 1 to L do
      1. Power update: Y_t = (1/|B_t|) Σ_{i∈B_t} z_i z_i^⊤ X_{t−1}, where B_t is the block of samples processed at iteration t;
      2. QR factorization: Y_t = X_t R_t, where X_t consists of orthonormal columns.
  end for
Algorithm 3 Memory-efficient Streaming PCA [14]
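A minimal sketch of Algorithm 3 follows; the block size B and the sample-stream interface are our illustrative choices, and the exact batching in [14] may differ.

```python
# Sketch of memory-efficient streaming PCA: the multiplication A @ X is replaced by the
# empirical average (1/B) * sum_i z_i z_i^T X over the current block of samples, so only
# the d x p iterate and one sample at a time are kept in memory.
import numpy as np

def streaming_pca(sample_stream, d, p, L, block_size, rng=None):
    rng = rng or np.random.default_rng(0)
    X, _ = np.linalg.qr(rng.standard_normal((d, p)))     # uniformly random orthonormal X_0
    for _ in range(L):
        Y = np.zeros((d, p))
        for _ in range(block_size):
            z = next(sample_stream)                      # one d-dimensional sample z_i
            Y += np.outer(z, z @ X) / block_size         # accumulate z_i z_i^T X / B
        X, _ = np.linalg.qr(Y)                           # QR step at the end of each block
    return X

# Example usage with i.i.d. Gaussian samples (illustrative):
# stream = iter(np.random.default_rng(1).standard_normal((100000, 20)))
# X_hat = streaming_pca(stream, d=20, p=5, L=10, block_size=2000)
```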

[10] was among the first to analyze Algorithm 3 for a broad class of distributions, based on their analysis of the noisy power method. More specifically, [10] analyzed a family of distributions with fast tail decay and proved gap-dependent sample complexity bounds for the memory-efficient streaming PCA algorithm.

Definition 4.1 (-round distributions, [10]).

A distribution over R^d is round if, for every p-dimensional projection and all , we have that

Theorem 4.1 ([10]).

Suppose D is a round distribution over R^d in the sense of Definition 4.1. Let σ_1 ≥ σ_2 ≥ … ≥ σ_d be the singular values of the population covariance matrix Σ. If Algorithm 3 is run with iteration rank p and the number of samples n satisfies (in the Õ notation we omit poly-logarithmic terms)

then with probability at least 0.9 we have that , where U_k is the top-k eigen-space of Σ.

Recently, [12] proposed a modified power method that achieves a logarithmic sample complexity improvement with respect to the error parameter. Nevertheless, the bounds in both [10] and [12] depend on the consecutive spectral gap σ_k − σ_{k+1}, which could be very small for real-world data distributions. Building upon our analysis of the noisy power method, we obtain the following result for streaming PCA with improved gap dependencies:

Theorem 4.2.

Fix ε. Suppose D is a round distribution over R^d in the sense of Definition 4.1. Let σ_1 ≥ σ_2 ≥ … ≥ σ_d be the singular values of the population covariance matrix Σ. If Algorithm 3 is run with iteration rank p and the number of samples n satisfies

then with probability at least 0.9 we have that