Kernel PCA (Schölkopf et al., 1998) is a widely used nonlinear dimension reduction technique in machine learning for the purposes of redundancy removal and preprocessing before prediction, classification or clustering. The method is implemented by finding a low-rank approximation of the kernel-based Gram matrix determined by the data sample. To be concrete, let $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ be a data sample of size $n$ and dimension $d$, and let $\mathbf{K} \in \mathbb{R}^{n \times n}$ be the positive semidefinite kernel matrix determined by a predetermined kernel function $k(\cdot, \cdot)$ in terms of $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. Kernel PCA amounts to finding the leading $r$ eigenvectors and eigenvalues of $\mathbf{K}$. Here the number of principal components $r$ is assumed to be fixed and chosen by the user for practical convenience. By the well-known Eckart-Young Theorem (Golub and Van Loan, 2012), this is equivalent to
$\min_{\mathbf{R} \in \mathbb{R}^{n \times r}} \|\mathbf{K} - \mathbf{R}\mathbf{R}^\top\|_F^2.$
This factorization for low-rank approximation has been well-known in the literature (see, e.g., Burer and Monteiro (2003)).
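The equivalence can be checked numerically: the factor built from the top-$r$ eigenpairs of a positive semidefinite matrix attains the best rank-$r$ approximation. A minimal NumPy sketch (function names are ours, not from the paper):

```python
import numpy as np

def kernel_pca_topr(K, r):
    """Leading r eigenpairs of a PSD matrix K; by Eckart-Young,
    R = U_r * sqrt(Lambda_r) minimizes ||K - R R^T||_F over n x r factors."""
    vals, vecs = np.linalg.eigh(K)            # ascending eigenvalues
    vals = vals[::-1][:r]                     # top-r eigenvalues, descending
    vecs = vecs[:, ::-1][:, :r]               # corresponding eigenvectors
    R = vecs * np.sqrt(np.maximum(vals, 0.0)) # scale columns by sqrt(lambda_i)
    return R, vals, vecs
```

Here `R @ R.T` reproduces the best rank-$r$ approximation $\mathbf{K}_r$ exactly (up to numerical precision).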
However, when the sample size $n$ is large, the storage of the kernel matrix $\mathbf{K}$ itself becomes challenging. Consider the example where the dimension $d$ is in the thousands while the sample size $n$ is in the millions. The memory cost for the data matrix is $O(nd)$ and thus in the billions, while the memory cost for the kernel matrix is $O(n^2)$ and thus in the trillions! On the other hand, if $\mathbf{K}$ is not stored, the implementation of standard iterative algorithms for SVD will involve one pass of computing all entries of $\mathbf{K}$ in each iteration, usually with a formidable computational cost of $O(n^2 d)$. A natural question arises:
How to find low-rank approximations of $\mathbf{K}$ memory-efficiently?
The following two are among the most well-known memory-efficient Kernel PCA methods in the literature. One is the Nyström method (Williams and Seeger, 2001), which amounts to generating random partial columns of the kernel matrix and then finding a low-rank approximation based on these columns. In order to generate random partial columns, uniform sampling without replacement is employed in Williams and Seeger (2001), and different sampling strategies are proposed in the later literature (e.g., Drineas and Mahoney (2005)). The method is convenient to implement and efficient in both memory and computation, but unstable in terms of approximation errors. Another popular approach is stochastic approximation, e.g., the Kernel Hebbian Algorithm (KHA) (Kim et al., 2005), which is memory-efficient and approaches the exact principal component solution as the number of iterations goes to infinity with an appropriately chosen learning rate (Kim et al., 2005). However, redundant passes to compute the kernel entries are needed, so the computational cost is intrinsically large. In addition, the method requires careful parameter tuning in practice to converge. Consequently, we are particularly interested in finding more stable memory-efficient alternatives to the KHA and Nyström methods, ideally without the need for careful parameter tuning.
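For concreteness, the Nyström method with uniform column sampling can be sketched as follows in NumPy. This is a generic sketch, not the authors' implementation; the helper names and the rank-$r$ pseudo-inverse construction are ours.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """Radial (Gaussian) kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None] - B[None]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def nystrom(X, kernel, r, m, rng):
    """Rank-r Nystrom approximation of K from m uniformly sampled columns."""
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)  # uniform, without replacement
    C = kernel(X, X[idx])                       # n x m partial columns of K
    W = C[idx]                                  # m x m intersection block
    # rank-r pseudo-inverse of W via its top-r eigenpairs
    vals, vecs = np.linalg.eigh(W)
    vals, vecs = vals[-r:], vecs[:, -r:]
    W_r_pinv = vecs @ np.diag(1.0 / np.maximum(vals, 1e-12)) @ vecs.T
    return C @ W_r_pinv @ C.T                   # n x n approximation of K
```

When $m = n$ (all columns), this reduces to the exact best rank-$r$ approximation; for $m \ll n$ it only touches $O(nm)$ kernel entries, which is the source of both its efficiency and its instability.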
1.1 Partial matrix sampling followed by nonconvex optimization
In recent years, a series of papers have studied the problem of nonconvex matrix completion with comprehensive theoretical analyses (see, e.g., Rennie and Srebro (2005); Keshavan et al. (2010b, a); Jain et al. (2013); Zhao et al. (2015); Sun and Luo (2016); Chen and Wainwright (2015); Yi et al. (2016); Zheng and Lafferty (2016); Ge et al. (2016, 2017)). Compared to convex approaches for matrix completion (e.g., Candès and Recht (2009)), these nonconvex approaches are not only more computationally efficient, but also more convenient in storage. In particular, to further reduce the computational and memory costs, Yi et al. (2016) propose to sub-sample the observed entries in the data matrix, and then implement nonconvex algorithms to recover the underlying low-rank signal from either gross corruptions or missing data, or both.
Inspired by this idea in Yi et al. (2016), we propose the following memory-efficient method for Kernel PCA:
Step 1 (Partial kernel matrix sampling): Generate a random partial matrix of $\mathbf{K}$ with the symmetric index set denoted as $\Omega$, i.e., compute $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$ for all $(i, j) \in \Omega$. For convenience of discussion, we usually represent this subsampling as $\mathbf{K}_\Omega$ such that
$(\mathbf{K}_\Omega)_{ij} = \begin{cases} K_{ij}, & (i, j) \in \Omega, \\ 0, & \text{otherwise}. \end{cases}$
In practice, however, we do not store $\mathbf{K}_\Omega$ directly; we only store its entries on $\Omega$ and the corresponding indices. Throughout this paper, $\Omega$ is assumed to follow an Erdös-Rényi random graph with parameter $p$ specified later in Definition def:ber.
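Step 1 can be implemented without ever forming $\mathbf{K}$: sample a symmetric off-diagonal index set and evaluate the kernel only on the sampled pairs, storing the result in triplet form. A NumPy sketch (names ours; note that enumerating all pairs as done here costs $O(n^2)$ index memory, so for truly large $n$ one would instead sample pair indices directly):

```python
import numpy as np

def rbf_pairs(A, B, sigma=1.0):
    """Entry-wise radial kernel values k(a_i, b_i) for paired rows."""
    return np.exp(-((A - B) ** 2).sum(-1) / (2 * sigma ** 2))

def sample_partial_kernel(X, kernel_fn, p, rng):
    """Sample each pair i < j independently with probability p (Erdos-Renyi),
    mirror it to (j, i), and compute only those kernel entries.
    Returns (rows, cols, values) triplets representing K_Omega."""
    n = X.shape[0]
    iu, ju = np.triu_indices(n, k=1)         # all off-diagonal pairs i < j
    mask = rng.random(iu.size) < p           # Bernoulli(p) per pair
    i, j = iu[mask], ju[mask]
    vals = kernel_fn(X[i], X[j])             # kernel evaluated on sampled pairs only
    rows = np.concatenate([i, j])            # store both (i, j) and (j, i)
    cols = np.concatenate([j, i])            # so that Omega is symmetric
    data = np.concatenate([vals, vals])
    return rows, cols, data
```

The memory footprint is $O(|\Omega|)$ triplets rather than $O(n^2)$ entries.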
Step 2 (Nonconvex optimization): Solve the nonconvex optimization problem
$\min_{\mathbf{R} \in \mathbb{R}^{n \times r}} f(\mathbf{R}) := \frac{1}{2}\left\|(\mathbf{R}\mathbf{R}^\top - \mathbf{K})_\Omega\right\|_F^2 + \lambda \sum_{i=1}^{n}\left(\|\mathbf{e}_i^\top \mathbf{R}\|_2 - \alpha\right)_+^4, \quad (1.1)$
where $(x)_+ := \max(x, 0)$. This nonconvex optimization was previously employed in Sun and Luo (2016) and Ge et al. (2016, 2017) for the purpose of matrix completion. In this step, we follow the more recent framework of nonconvex optimization without initialization (Ge et al., 2016, 2017): the final approximate Kernel PCA is given by any local minimum of eq:obj_psd, denoted as $\hat{\mathbf{R}}$, from the perspective of the low-rank approximation $\hat{\mathbf{R}}\hat{\mathbf{R}}^\top$.
To store the sampled entries of $\mathbf{K}$ and their coordinates, the memory cost in Step 1 is $O(|\Omega|)$. The memory cost required for Step 2 is generally $O(nr + |\Omega|)$. For example, if $\Omega$ is symmetric and does not contain the diagonal entries as in Definition def:ber, the updating rule of gradient descent
$\mathbf{R}_{t+1} = \mathbf{R}_t - \eta\, \nabla f(\mathbf{R}_t)$
is equivalent to
$\mathbf{R}_{t+1} = \mathbf{R}_t - \eta\left[2(\mathbf{R}_t\mathbf{R}_t^\top - \mathbf{K})_\Omega \mathbf{R}_t + 4\lambda\, \mathrm{diag}\!\left(\frac{(\|\mathbf{e}_i^\top\mathbf{R}_t\|_2 - \alpha)_+^3}{\|\mathbf{e}_i^\top\mathbf{R}_t\|_2}\right)\mathbf{R}_t\right],$
where the memory cost is dominated by storing $\mathbf{R}_t$, $\mathbf{R}_{t+1}$, and $(\mathbf{R}_t\mathbf{R}_t^\top - \mathbf{K})_\Omega$ on $\Omega$. As long as $\Omega$ is generated, the typical computational cost in Step 1 is $O(|\Omega| d)$, particularly when the radial kernel or polynomial kernels are chosen. The computational cost in each iteration of Step 2 is typically $O(|\Omega| r + nr)$ for first order algorithms, as indicated in the above gradient descent updating rule. When $r \ll d$, the computational cost in each iteration of Step 2 is much smaller than that in Step 1.
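One gradient step touching only the sampled entries can be sketched as follows, assuming the regularized least-squares objective of the form used in Ge et al. (2016, 2017) as reconstructed in (1.1); the function and variable names are ours:

```python
import numpy as np
from scipy.sparse import csr_matrix

def grad_step(R, rows, cols, kvals, eta, lam, alpha):
    """One gradient descent step on
    f(R) = 0.5 * ||(R R^T - K)_Omega||_F^2 + lam * sum_i (||R_i|| - alpha)_+^4,
    using only the |Omega| sampled entries (rows, cols, kvals)."""
    n, r = R.shape
    # residual (R R^T - K) evaluated only on Omega: O(|Omega| r) work
    res = (R[rows] * R[cols]).sum(axis=1) - kvals
    S = csr_matrix((res, (rows, cols)), shape=(n, n))  # sparse symmetric residual
    grad = 2.0 * (S @ R)                               # gradient of the loss term
    # gradient of the row-norm regularizer
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    excess = np.maximum(norms - alpha, 0.0)
    grad += lam * 4.0 * excess ** 3 * R / np.maximum(norms, 1e-12)
    return R - eta * grad
```

Neither $\mathbf{K}$ nor $\mathbf{R}\mathbf{R}^\top$ is ever formed as a dense $n \times n$ matrix; each step costs $O(|\Omega| r + nr)$.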
Our work extends the idea of Yi et al. (2016) in both motivation and assumptions. In Yi et al. (2016), the idea of partial matrix sampling followed by nonconvex optimization is proposed for matrix completion and robust PCA, where the underlying signal matrix is assumed to be exactly low-rank and highly structured in terms of bounded condition numbers and eigen-space incoherence. In contrast, we apply similar ideas to model-free low-rank approximation, where no assumptions are made on $\mathbf{K}$ beyond positive semidefiniteness. From the perspective of applications, to the best of our knowledge, our work is the first to apply the idea of partial matrix sampling plus nonconvex optimization to memory-efficient Kernel PCA.
Moreover, accelerating low-rank approximation while saving memory via sub-sampling is also discussed in Achlioptas et al. (2002) and Achlioptas and McSherry (2007); in Achlioptas et al. (2002), this technique is used to speed up Kernel PCA. However, spectral methods were employed after sub-sampling, as opposed to nonconvex optimization. It is also worth mentioning the randomized one-pass algorithm discussed in, e.g., Halko et al. (2011), where the theoretical properties of a random-projection based low-rank approximation method were fully analyzed. However, although the one-pass algorithm does not require storage of the whole matrix $\mathbf{K}$, in Kernel PCA one still needs to compute every entry of $\mathbf{K}$, which typically requires $O(n^2 d)$ computational complexity for the kernel matrix. In contrast, the partial matrix sampling method proposed in the current paper only computes the selected entries.
1.2 Model-free local minimum analysis and related work
Local minimum analysis, in particular for nonconvex matrix factorization, has attracted increasing interest in recent years, covering regularized M-estimators (Loh and Wainwright, 2015), PCA/SVD (Baldi and Hornik, 1989; Li et al., 2016a; Jin et al., 2017; Zhu et al., 2017), phase retrieval (Sun et al., 2017), matrix completion (Ge et al., 2016, 2017), matrix sensing (Bhojanapalli et al., 2016; Li et al., 2016a; Zhu et al., 2017; Ge et al., 2017), low-rank optimization (Li et al., 2017), etc. Among these results, Ge et al. (2016) and Ge et al. (2017) are the most relevant to our current work, given that exactly the same nonconvex optimization problem (1.1) has already been studied therein, motivated by matrix completion from missing data. In particular, these papers show that any local minimum $\hat{\mathbf{R}}$ yields the exact recovery $\hat{\mathbf{R}}\hat{\mathbf{R}}^\top = \mathbf{K}$, as long as $\mathbf{K}$ is exactly rank-$r$, its condition number is well-bounded, and the sampling rate is greater than a function of these quantities. The case with additive stochastic noise has also been discussed in Ge et al. (2016).
However, these theoretical results are not suitable guidelines for memory-efficient Kernel PCA, since the kernel matrix is not exactly low-rank, and there is no evidence that these strong assumptions on eigenvalues and eigenvectors are valid for general kernel matrices. Instead, we will discuss the properties any local minimum $\hat{\mathbf{R}}$ has with no assumptions on $\mathbf{K}$. As a result, although our method is motivated by memory-efficient Kernel PCA, our theory is established for any positive semidefinite matrix $\mathbf{K}$.
To be concrete, there are actually two questions of interest: how close $\hat{\mathbf{R}}\hat{\mathbf{R}}^\top$ is to $\mathbf{K}$, and how close $\hat{\mathbf{R}}\hat{\mathbf{R}}^\top$ is to $\mathbf{K}_r$, where $\mathbf{K}_r$ is the best rank-$r$ approximation of $\mathbf{K}$ obtained by partial rank-$r$ SVD when all entries are computed. In comparison to Ge et al. (2016, 2017), our main results, to be introduced in the next section, make the following contributions:
In our main result thm:main_psd, which characterizes how close $\hat{\mathbf{R}}\hat{\mathbf{R}}^\top$ is to $\mathbf{K}$ or $\mathbf{K}_r$, we do not impose any assumptions on $\mathbf{K}$ regarding its rank, eigenvalues or eigenvectors. Moreover, we only assume the sampling rate satisfies $p \geq C \log n / n$ for some absolute constant $C$.
thm:main_psd shows how the approximation error depends on the spectrum of $\mathbf{K}$, the choice of $r$, the maximum entry of $\mathbf{K}$, some norms of the "residual" matrix $\mathbf{K} - \mathbf{K}_r$, and the sampling rate $p$.
Technically speaking, we benefit from Ge et al. (2016, 2017) in various aspects. In fact, in order to analyze the properties of any local minimum $\hat{\mathbf{R}}$, we follow the idea in Ge et al. (2017) of combining the first and second order conditions of local minima linearly to construct an auxiliary function, denoted as $g(\cdot)$ in our paper, so that all local minima satisfy the inequality $g(\hat{\mathbf{R}}) \geq 0$, as illustrated in Figure 1. If $\mathbf{K}$ is exactly rank-$r$ and well-conditioned, Ge et al. (2017) show that $g(\mathbf{R}) < 0$ for all $\mathbf{R}$ with $\mathbf{R}\mathbf{R}^\top \neq \mathbf{K}$ as long as the sampling rate is large enough, and this fact leads to the result that there are no spurious local minima. However, without any assumptions on $\mathbf{K}$, there might exist multiple local minima, and we instead focus on analyzing the inequality $g(\hat{\mathbf{R}}) \geq 0$ directly.
Our thm:main_psd cannot be obtained by simply modifying the analysis in Ge et al. (2017). In fact, several novel technical ideas are needed to control the terms in $g(\cdot)$ more carefully in a model-free setup. Here we want to highlight the deterministic inequality (Lemma 4.4), which might be of independent interest. By applying this inequality, we can control the difference between the function $g(\cdot)$ and its population version in a uniform and model-free framework. It also helps us improve the exact recovery results in Ge et al. (2017), as stated in Corollary 2.3.
As mentioned above, another line of research relevant to our work concerns nonconvex matrix completion with initialization, assuming the underlying low-rank matrix is well-conditioned (see, e.g., Keshavan et al. (2010b, a); Sun and Luo (2016); Chen and Wainwright (2015); Yi et al. (2016); Zheng and Lafferty (2016)). In particular, noisy nonconvex matrix completion is also discussed in Keshavan et al. (2010a) and Chen and Wainwright (2015). A comprehensive survey of this line of research is beyond the scope of our discussion, since our work focuses on model-free local minimum analysis of nonconvex matrix factorization without initialization. Interested readers are referred to Balcan et al. (2017), where various results on required sampling rates for nonconvex (and also convex) matrix completion are summarized in Table 1 therein.
1.3 Organization and notations
The remainder of the paper is organized as follows: our main theoretical results are stated in sec:theory; numerical simulations and applications to memory-efficient KPCA are given in sec:simulations; proofs are deferred to sec:proofs.
We use bold letters to denote matrices and vectors. For any vectors $\mathbf{x}$ and $\mathbf{y}$, $\|\mathbf{x}\|_2$ denotes the $\ell_2$ norm and $\langle \mathbf{x}, \mathbf{y} \rangle$ their inner product. For any matrix $\mathbf{A}$, $A_{ij}$ denotes its $(i, j)$-th entry, $\mathbf{A}_{i\cdot}$ its $i$-th row, and $\mathbf{A}_{\cdot j}$ its $j$-th column. Moreover, we use $\|\mathbf{A}\|$, $\|\mathbf{A}\|_*$, $\|\mathbf{A}\|_F$, $\|\mathbf{A}\|_\infty$, $\|\mathbf{A}\|_{2,\infty}$ to denote its operator norm, nuclear norm, Frobenius norm, entrywise $\ell_\infty$ norm and $\ell_{2,\infty}$ norm, respectively. The vectorization of $\mathbf{A}$ is represented by $\mathrm{vec}(\mathbf{A})$. For matrices $\mathbf{A}, \mathbf{B}$ of the same size, denote $\langle \mathbf{A}, \mathbf{B} \rangle := \sum_{i,j} A_{ij} B_{ij}$. Denote by $\nabla f$ and $\nabla^2 f$ the gradient and Hessian of $f$.
Denote $[n] := \{1, \ldots, n\}$. We use $\mathbf{J}$ to denote the matrix all of whose entries equal one. We use $C, c, c_0, c_1, \ldots$ to denote absolute constants, whose values may change from line to line.
2 Model-free Approximation theory
Although we have focused on Kernel PCA in both motivation and methodology, as mentioned above, our main theoretical result holds for any positive semidefinite matrix. Moreover, we will show how our main result improves over state-of-the-art exact recovery results in initialization-free nonconvex matrix completion.
2.1 Main results
To define the problem, we introduce the following sampling scheme:
Definition 1 (Off-diagonal symmetric independent model).
Assume the index set $\Omega$ consists only of off-diagonal entries that are sampled symmetrically and independently with probability $p$, i.e.,
$(i, i) \notin \Omega$ for all $i \in [n]$;
For all $i < j$, sample $(i, j) \in \Omega$ independently with probability $p$;
For all $i < j$, $(j, i) \in \Omega$ if and only if $(i, j) \in \Omega$.
Here we assume all diagonal entries are not in $\Omega$ for the generality of the formulation, although in many applications, such as radial kernel PCA, the diagonal entries of $\mathbf{K}$ are easily known. For any $\Omega$, we also define a corresponding symmetric 0-1 matrix $\mathbf{E}_\Omega$ such that $(\mathbf{E}_\Omega)_{ij} = 1$ if and only if $(i, j) \in \Omega$. Then we have the convention that $\mathbf{A}_\Omega = \mathbf{A} \circ \mathbf{E}_\Omega$ for any matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, where $\circ$ is the Hadamard product.
Assume that the positive semidefinite matrix $\mathbf{K}$ has the spectral decomposition
$\mathbf{K} = \sum_{i=1}^{n} \lambda_i \mathbf{u}_i \mathbf{u}_i^\top, \qquad \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0,$
where $\lambda_1, \ldots, \lambda_n$ are the spectrum and $\mathbf{u}_1, \ldots, \mathbf{u}_n$ are unit and mutually perpendicular eigenvectors. The matrix $\mathbf{K}_r := \sum_{i=1}^{r} \lambda_i \mathbf{u}_i \mathbf{u}_i^\top$ is the best rank-$r$ approximation of $\mathbf{K}$, and $\mathbf{K} - \mathbf{K}_r$ denotes the remaining part. In the case of multiple eigenvalues, the order in the eigenvalue decomposition (2.1) may not be unique, and then $\mathbf{K}_r$ may not be unique. In this case, we consider the problem for any fixed order in (2.1) with the corresponding fixed $\mathbf{K}_r$.
Let $\mathbf{K}$ be a positive semidefinite matrix with the spectral decomposition eq:spectral. Let $\Omega$ be sampled according to the off-diagonal symmetric model. Suppose the tuning parameters $\lambda$ and $\alpha$ are chosen appropriately, and assume $p \geq C_0 \log n / n$ with some absolute constant $C_0$. Then in an event with high probability, any local minimum $\hat{\mathbf{R}}$ of (1.1) satisfies an approximation error bound determined by the spectrum of $\mathbf{K}$, the maximum entry of $\mathbf{K}$, and the sampling rate $p$, with absolute constants.
Notice that in the literature there are also some assumption-free analyses for SVD-based estimators (e.g., Keshavan et al. (2010a); Chatterjee (2015)). For example, applying Keshavan et al. (2010a, Theorem 1.1) to our problem setup yields a corresponding error bound for the low-rank approximation obtained in Keshavan et al. (2010a, Theorem 1.1). However, this cannot imply any exact recovery results even when $\mathbf{K}$ is a highly structured low-rank matrix and the sampling rate is large enough. On the other hand, the USVT estimator introduced in Chatterjee (2015) does not necessarily return a rank-$r$ approximation and is thus not suitable for Kernel PCA.
In order to get a better interpretation of our main theorem, we introduce the following corollary:
This bound demonstrates a trade-off between the condition number of $\mathbf{K}_r$ and the size of the residual $\mathbf{K} - \mathbf{K}_r$ in choosing $r$. In other words, our approach leads to a desirable low-rank approximation as long as we can select an $r$ such that the condition number $\lambda_1 / \lambda_r$ is well-bounded while $\mathbf{K}_r$ is still close to $\mathbf{K}$.
2.2 Implications in exact matrix completion
Due to the generality of our main result, we can further assume the positive semidefinite matrix $\mathbf{K}$ is exactly rank-$r$, i.e., $\mathbf{K} = \mathbf{K}_r$, and that its condition number $\kappa := \lambda_1 / \lambda_r$ is well bounded, which is the standard assumption in the nonconvex matrix completion literature (e.g., Keshavan et al. (2010b); Sun and Luo (2016); Chen and Wainwright (2015); Zheng and Lafferty (2016); Ge et al. (2016); Yi et al. (2016); Ge et al. (2017)). It turns out our main theorem implies a variety of exact recovery results under spikiness or incoherence assumptions.
First, the following eigen-space incoherence condition has been proposed in Candès and Recht (2009).
Definition 2 (Candès and Recht 2009).
Let $\mathbf{K}$ be a rank-$r$ positive semidefinite matrix represented as in eq:spectral, and let $U := \mathrm{span}\{\mathbf{u}_1, \ldots, \mathbf{u}_r\}$ with incoherence $\mu := \mu(U)$.
Here, for any subspace $U$ of $\mathbb{R}^n$ of dimension $r$, we define
$\mu(U) := \frac{n}{r} \max_{1 \leq i \leq n} \|\mathcal{P}_U \mathbf{e}_i\|_2^2,$
where $\mathbf{e}_1, \ldots, \mathbf{e}_n$ represent the standard orthogonal basis of $\mathbb{R}^n$ and $\mathcal{P}_U$ denotes the orthogonal projection onto $U$.
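The incoherence of Definition 2 is easy to compute for an explicit orthonormal basis; the following NumPy sketch assumes the standard Candès-Recht definition $\mu(U) = \frac{n}{r}\max_i \|\mathcal{P}_U \mathbf{e}_i\|_2^2$ (function name ours):

```python
import numpy as np

def incoherence(U):
    """mu(U) = (n/r) * max_i ||P_U e_i||^2 for an n x r orthonormal basis U.
    For orthonormal U, ||P_U e_i||^2 equals the squared norm of the i-th row."""
    n, r = U.shape
    row_leverage = (U ** 2).sum(axis=1)   # leverage scores ||e_i^T U||^2
    return n / r * row_leverage.max()
```

The extremes are instructive: a maximally spread basis (e.g., constant columns) gives $\mu = 1$, while a basis aligned with coordinate axes gives the maximal value $\mu = n/r$.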
On the other hand, Ge et al. (2016) introduce the following quantity, which can also be considered a measure of spikiness. (Note that it is different from the spikiness defined in Negahban and Wainwright (2012).)
Definition 3 (Ge et al. 2016).
Let be a rank- positive semidefinite matrix represented as in eq:spectral. Let
By denoting $\mathbf{U} = [\mathbf{u}_1, \ldots, \mathbf{u}_r] \in \mathbb{R}^{n \times r}$, the incoherence can be represented as
$\mu = \frac{n}{r} \max_{1 \leq i \leq n} \|\mathbf{e}_i^\top \mathbf{U}\|_2^2,$
where $\mathbf{e}_i^\top \mathbf{U}$ denotes the $i$-th row of $\mathbf{U}$. At the same time, a direct comparison of the two definitions yields a relation between the incoherence $\mu$ and the spikiness quantity of eq:rong_spikiness.
In comparison, the required sampling rate for no spurious local minimum in Ge et al. (2017) is
Under the setup and assumptions of Theorem 2.1, if we further assume $\mathbf{K}$ is exactly rank-$r$ (i.e., $\mathbf{K} = \mathbf{K}_r$) and the sampling rate $p$ exceeds the stated threshold
with some absolute constant, then in an event with high probability, for any local minimum $\hat{\mathbf{R}}$ of the objective function defined in (1.1), we have $\hat{\mathbf{R}}\hat{\mathbf{R}}^\top = \mathbf{K}$.
By the definition eq:rong_spikiness, we have
Plugging this into (2.2) proves the corollary. ∎
3 Numerical experiments
We propose to solve (1.1) by random initialization with iid normal entries, followed by gradient descent as introduced in (1.2), where the step size is determined by Armijo's rule (Armijo, 1966). The gradient descent algorithm is implemented with sparse matrix storage in Section 3.1 for the purpose of memory-efficient KPCA, and with full matrix storage in Section 3.2 to test the performance of general low-rank approximation from missing data. In each experiment, the iterations are terminated when the gradient norm or the relative change between iterates falls below a prescribed tolerance, or when the number of iterations surpasses a prescribed maximum. All methods are implemented in MATLAB. The experiments are run on a virtual computer with Linux KVM, with 12 cores of a 2.00GHz Intel Xeon E5 processor and 16 GB of memory.
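The descent loop with Armijo's rule can be sketched as follows. This is a generic backtracking line search in Python, not the authors' MATLAB implementation; the default constants are placeholders:

```python
import numpy as np

def armijo_gd(f, grad, R0, beta=0.5, c=1e-4, eta0=1.0, tol=1e-4, max_iter=500):
    """Gradient descent with Armijo backtracking: accept step size eta once
    f(R - eta*g) <= f(R) - c * eta * ||g||_F^2, otherwise shrink eta by beta."""
    R = R0.copy()
    for _ in range(max_iter):
        g = grad(R)
        gnorm2 = (g ** 2).sum()
        if np.sqrt(gnorm2) < tol:           # gradient-norm stopping rule
            break
        eta, fR = eta0, f(R)
        while f(R - eta * g) > fR - c * eta * gnorm2:
            eta *= beta                      # backtrack until sufficient decrease
            if eta < 1e-12:
                break
        R = R - eta * g
    return R
```

With random iid normal initialization `R0`, this is the whole pipeline of the experiments: no spectral initialization is needed.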
3.1 Memory-efficient Kernel PCA
In this section we study the empirical performance of our memory-efficient Kernel PCA approach by applying it to the synthetic data set in Wang (2012). The data set consists of data points partitioned into two classes independently with equal probability. Points in the first class are distributed uniformly on a three-dimensional sphere with radius $r_1$, while points in the second class are distributed uniformly on a three-dimensional sphere with radius $r_2$. Every point is then perturbed independently by noise. The goal is to find an approximate Kernel PCA with rank $r$, the radial kernel $k(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|_2^2 / (2\sigma^2))$, and bandwidth $\sigma$.
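The two-sphere data can be generated as below. The radii `r1`, `r2` and the noise level are hypothetical placeholders; the exact values from Wang (2012) are not reproduced in this excerpt.

```python
import numpy as np

def two_spheres(n, r1, r2, noise, rng):
    """n points assigned to two classes independently with equal probability;
    each class lies uniformly on a 3-D sphere (radii r1, r2 are placeholders),
    then each point is perturbed by isotropic Gaussian noise."""
    labels = rng.integers(0, 2, size=n)
    g = rng.standard_normal((n, 3))
    g /= np.linalg.norm(g, axis=1, keepdims=True)      # uniform on unit sphere
    radius = np.where(labels == 0, r1, r2)[:, None]
    return radius * g + noise * rng.standard_normal((n, 3)), labels
```

Normalizing standard Gaussian vectors gives the uniform distribution on the sphere, which is why no rejection sampling is needed.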
For the Nyström method (Williams and Seeger, 2001), we uniformly choose $m$ columns (without replacement) to generate a rank-$r$ approximation of the kernel matrix $\mathbf{K}$; note that this corresponds to the sampling rate $m/n$. In comparison, to implement the nonconvex optimization (1.1), we set the tuning parameters $\lambda$ and $\alpha$, and choose the sampling rate $p$ so as to make sure the memory consumption is at the same level as the Nyström method. Notice that in addition to recording the selected entry values, the nonconvex method also needs to record the row and column indices of each selected entry, so it requires three times as much memory as the Nyström method for the same sampling complexity.
For a fixed randomly generated data set, we apply both the Nyström method and our approach 50 times. Figure 2 illustrates the relative errors of the two methods, where $\mathbf{K}_r$ represents the ground truth of the best rank-$r$ approximation of $\mathbf{K}$, and $\hat{\mathbf{K}}_r$ the corresponding rank-$r$ approximation obtained by the Nyström or our nonconvex approach. In Figure 2, the left and right panels compare the performances of the two methods in terms of the accuracy of approximating $\mathbf{K}$ and $\mathbf{K}_r$, respectively. One can see that our approach of partial matrix sampling plus nonconvex optimization has a comparable median performance to Nyström, while being much more stable.
3.2 Numerical simulations under various spectra
In this section, we conduct numerical tests under different settings of the spectrum. A positive semidefinite matrix $\mathbf{K}$ is generated randomly as follows: its eigenvectors are the same as the left singular vectors of a random matrix with iid standard normal entries, and its spectrum will be specified in each test. For each configuration of eigenvalues, we randomly generate $\Omega$ independently 50 times. To implement the gradient descent algorithm, we set the tuning parameters $\lambda$ and $\alpha$ as before, and in every single numerical experiment, we also run the spectral method discussed in Achlioptas et al. (2002) to obtain a low-rank approximation of $\mathbf{K}$ for the purpose of comparison.
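The test-matrix construction described above can be sketched as follows (our reading of the construction; function name ours):

```python
import numpy as np

def psd_with_spectrum(eigvals, rng):
    """Random PSD matrix whose eigenvectors are the left singular vectors of
    an iid standard normal matrix and whose spectrum is prescribed."""
    n = len(eigvals)
    G = rng.standard_normal((n, n))
    U, _, _ = np.linalg.svd(G)               # random orthogonal eigenbasis
    return U @ np.diag(eigvals) @ U.T        # K = U diag(eigvals) U^T
```

Prescribing the spectrum directly makes it easy to sweep decay rates, condition numbers, and exact ranks in the experiments that follow.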
3.2.1 Full rank case
In this case, we assume that $\mathbf{K}$ has full rank, and fix $n$, the selected rank $r$, and the sampling rate $p$. The eigenvalues are assumed to follow several prescribed decay profiles. Figure 3 shows scatter plots of the relative errors of the nonconvex and spectral methods in low-rank approximation for the different eigenvalue configurations. One can see that the errors for the nonconvex method (1.1) are well-bounded in all settings, and much smaller than those of the spectral method.
3.2.2 Low-rank matrix with extreme condition numbers
We now let $\mathbf{K}$ be exactly low-rank with different condition numbers. We fix $n$, the selected rank $r$, and the sampling rate $p$, and let the nonzero eigenvalues vary so that the condition number $\kappa$ takes on a sequence of increasing values. The performances of our nonconvex approach are demonstrated in Figure 4.
An interesting phenomenon is that the relative error of the proposed nonconvex method is always well-bounded and not sensitive to the condition number $\kappa$. This fact can possibly be explained by Corollary cor:balance, where a suitably chosen $\mathbf{K}_r$ strikes a good balance between its condition number and its closeness to $\mathbf{K}$.
3.2.3 Rank mismatching
In this section, we consider rank mismatching, i.e., $r \neq \mathrm{rank}(\mathbf{K})$, from two different perspectives. In the first case, we fix $\mathbf{K}$ (and hence $\mathrm{rank}(\mathbf{K})$) and run numerical tests with different selected ranks $r$; in the second case, we fix the selected rank $r$ and run numerical tests on different $\mathbf{K}$ with different ranks.
To be concrete, in the first scenario, we fix $n$, $\mathrm{rank}(\mathbf{K})$, and the sampling rate $p$, and test with a range of selected ranks $r$. Numerical results for the first scenario are demonstrated in Figure 5. In the second scenario, we fix the selected rank $r$ and the sampling rate, and randomly generate $\mathbf{K}$ over a range of ranks, with all nonzero eigenvalues equal. The corresponding numerical results are given in Figure 6.
One can see from these figures that if the selected rank $r$ is less than the actual rank $\mathrm{rank}(\mathbf{K})$, our nonconvex approach performs almost as well as the best rank-$r$ approximation in terms of approximating $\mathbf{K}$. On the other hand, if the selected rank is greater than the actual rank, the nonconvex method performs much better than the spectral method.
4 Proofs
In this section, we give a proof of the main theorem. In Section 4.1, we present some useful supporting lemmas; in Section 4.2, we present a proof of our main result Theorem 2.1; finally, in Section 4.3 we give proofs of the lemmas used in the former subsections. Our proof ideas benefit from those in Ge et al. (2017) as well as Zhu et al. (2017) and Jin et al. (2017).
4.1 Supporting lemmas
Here we give some useful supporting lemmas:
Lemma 4.1 (Vu 2017).
There is a constant $C$ such that the following holds. If $\Omega$ is sampled according to the off-diagonal symmetric model with probability $p$, and $p \geq C \log n / n$, then
Similar to Theorem 4.1 in Candès and Recht (2009), for the off-diagonal symmetric model, we also have:
Suppose $\Omega$ is sampled according to the off-diagonal symmetric model with probability $p$. Define the subspace
$T := \{\mathbf{U}\mathbf{A}^\top + \mathbf{B}\mathbf{U}^\top : \mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times r}\},$
where $\mathbf{U}$ is an orthonormal basis of a fixed $r$-dimensional subspace $U$ of $\mathbb{R}^n$. Let $\mathcal{P}_T$ be the Euclidean projection onto $T$. Then there is an absolute constant $C$ such that, for any $\epsilon > 0$, if $p \geq C \mu(U) r \log n / (\epsilon^2 n)$ with $\mu(U)$ defined in (2.4), then in an event with high probability, we have
$\left\|\mathcal{P}_T - p^{-1}\mathcal{P}_T \mathcal{P}_\Omega \mathcal{P}_T\right\| \leq \epsilon.$
For the first and second order optimality conditions of our objective function defined in (1.1), we have:
Lemma 4.3 (Ge et al. 2016, Proposition 4.1).
The first order optimality condition of the objective function (1.1) is
$\nabla f(\mathbf{R}) = \mathbf{0},$
and the second order optimality condition requires $\nabla^2 f(\mathbf{R}) \succeq 0$; i.e., for any $\mathbf{Z} \in \mathbb{R}^{n \times r}$, we have
$\mathrm{vec}(\mathbf{Z})^\top \nabla^2 f(\mathbf{R})\, \mathrm{vec}(\mathbf{Z}) \geq 0.$
Finally, we present our main lemma, which will be used multiple times. Before formally stating it, for simplicity of notation, for any matrix, any index set and any real number, we introduce the following notation:
Now we are well prepared to present our main lemma:
For any and corresponding , for all , we have
for all .
We will use this result multiple times later. Note that we do not impose any assumption on $\Omega$ here, so this is a deterministic result. The proof of this lemma can be considered a more sophisticated version of the proofs of the spectral lemmas in Bhojanapalli and Jain (2014) and Li et al. (2016b):
Suppose a matrix can be decomposed as the sum of two components, and let $\Omega$ be the set of revealed entries (which need not follow any specific distribution); then we have
Lemma 4.6 (Ge et al. 2016, Theorem D.1).
With high probability over the choice of $\Omega$, for any two rank-$r$ matrices $\mathbf{U}, \mathbf{V} \in \mathbb{R}^{n \times r}$, we have
In Sun and Luo (2016), Chen and Wainwright (2015) and Zheng and Lafferty (2016), the authors give an upper bound of the corresponding quantity for any low-rank factor. To be more precise, they assume $\Omega$ is sampled according to the iid Bernoulli model with probability $p$; then if $p \geq C \log n / n$ with the absolute constant $C$ sufficiently large,
holds with high probability.
4.2 A proof of Theorem 2.1
Here we give a proof of Theorem 2.1. The proof is mainly divided into two parts: in Section 4.2.1, we discuss the landscape of the objective function and then define the auxiliary function $g(\cdot)$, showing that the set of local minima of (1.1) is contained in the superlevel set $\{g(\cdot) \geq 0\}$; in Section 4.2.2, we give a uniform upper bound of $g(\cdot)$ and use it to characterize this superlevel set, which finishes the proof of our main result.
4.2.1 Landscape of objective function and the auxiliary function
Now denote $\mathbf{K}_r = \mathbf{Z}\mathbf{Z}^\top$ with $\mathbf{Z} \in \mathbb{R}^{n \times r}$. For a given $\mathbf{R}$, suppose $\mathbf{Z}^\top \mathbf{R}$ has SVD $\mathbf{A}\boldsymbol{\Sigma}\mathbf{B}^\top$; let $\tilde{\mathbf{Z}} := \mathbf{Z}\mathbf{A}\mathbf{B}^\top$ and $\boldsymbol{\Delta} := \mathbf{R} - \tilde{\mathbf{Z}}$, where $\mathbf{A}, \mathbf{B} \in \mathcal{O}_r$ and $\mathcal{O}_r$ denotes the set of $r \times r$ orthonormal matrices. Then one can simply verify that $\tilde{\mathbf{Z}}^\top \mathbf{R}$ is a positive semidefinite matrix. Notice that by definition we have $\tilde{\mathbf{Z}}\tilde{\mathbf{Z}}^\top = \mathbf{Z}\mathbf{Z}^\top = \mathbf{K}_r$. Moreover, it is proved in Chen and Wainwright (2015) that $\|\boldsymbol{\Delta}\boldsymbol{\Delta}^\top\|_F^2 \leq 2\|\mathbf{R}\mathbf{R}^\top - \mathbf{K}_r\|_F^2$.
The first and second order optimality conditions for any local minimum $\hat{\mathbf{R}}$ imply that $g(\hat{\mathbf{R}}) \geq 0$. In other words, the set of local minima is contained in the superlevel set $\{g(\cdot) \geq 0\}$.
To study the properties of the local minima of (1.1), we can therefore consider this superlevel set instead. In order to get a clear representation of $g(\cdot)$, one can plug in the first and second order conditions listed in Lemma 4.3. Actually, by repacking terms in Ge et al. (2017, Lemma 7) and noticing that by definition $\tilde{\mathbf{Z}}\tilde{\mathbf{Z}}^\top = \mathbf{K}_r$, we can decompose $g(\cdot)$ as follows:
Lemma 4.7 (Ge et al. 2017, Lemma 7).
Uniformly for all $\mathbf{R} \in \mathbb{R}^{n \times r}$ and the corresponding $\boldsymbol{\Delta}$ defined as before, we have