1 Introduction
Matrix eigendecomposition is among the core and longstanding topics in numerical computing [29]. It plays a fundamental role in various scientific and engineering computing problems (such as numerical computation [9, 22] and structural analysis [25]) as well as machine learning tasks (such as kernel approximation [6] and dimensionality reduction [14, 20]). Thus far, not many algorithms have been proposed for this problem. Pioneering ones include the method of power iteration [9] and the (block) Lanczos algorithm [21], while randomized SVD [10] and online learning of eigenvectors [8] have been proposed recently.
[8] are recently proposed. The problem can also be expressed as a quadratically constrained quadratic program (QCQP), and thus can be approached by various optimization methods, such as trace penalty minimization [27] and Riemannian optimization algorithms [7, 1, 28]. Most of these algorithms perform the batch learning, i.e., using the entire dataset to perform the update at each step. This could be well addressed by designing appropriate stochastic algorithms. However, the stateoftheart stochastic algorithm DSRGEIGS [2] requires the learning rate to repeatedly decay till vanishing in order to guarantee convergence, which results in a slow convergence of sublinear rate.We propose a new stochastic Riemannian algorithm that makes a significant breakthrough theoretically. It improves the stateoftheart sublinear convergence rate to an exponential convergence one. The algorithm is inspired by the stochastic variance reduced gradient (SVRG) optimization [12], which was originally developed to solve convex problems in the Euclidean space. We propose the general form of variance reduction, called SVRRG, in the framework of the stochastic Riemannian gradient (SRG) optimization [4], such that it is able to enjoy the convergence properties (e.g., almost sure local convergence) of the SRG framework. We then get it specialized to the Riemannian eigensolver (RGEIGS) problem so that it gives rise to our stochastic variance reduced Riemannian eigensolver, termed as SVRRGEIGS. Our theoretical analysis shows that SVRRGEIGS can use a constant learning rate, thus eliminating the need of using the decaying learning rate. Moreover, it not only possesses the global convergence in expectation compared to SRG [4], but also gains an accelerated convergence of exponential rate compared to DSRGEIGS. To the best of our knowledge, we are the first to propose and analyze the generalization of SVRG to Riemannian manifolds.
The rest of the paper is organized as follows. Section 2 briefly reviews some preliminary knowledge on matrix eigendecomposition, stochastic Riemannian gradient optimization and stochastic Riemannian eigensolver. Section 3 presents our stochastic variance reduced Riemannian eigensolver algorithm, starting from establishing the general form of variance reduction for the stochastic Riemannian gradient optimization. Theoretical analysis is conducted in Section 4, followed by the empirical study of our algorithm in Section 5. Section 6 discusses related works. Finally, Section 7 concludes the paper.
2 Preliminaries and Notations
2.1 Matrix Eigendecomposition
The eigendecomposition of a symmetric^{1}^{1}1The given matrix is assumed to be symmetric throughout the paper, i.e., $A = A^\top$. matrix $A \in \mathbb{R}^{n \times n}$ can be written as $A = U \Sigma U^\top$, where $U^\top U = U U^\top = I_n$ (identity matrix), and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$ is a diagonal matrix. The $i$-th column $u_i$ of $U$ is called the eigenvector corresponding to the eigenvalue $\sigma_i$ (the $i$-th diagonal element of $\Sigma$), i.e., $A u_i = \sigma_i u_i$. Assume that the eigenvalues are ordered as $\sigma_1 \ge \cdots \ge \sigma_n$ and that $\sigma_p > \sigma_{p+1}$ for the target dimension $p < n$. In practice, matrix eigendecomposition only aims at the set of top-$p$ eigenvectors $U_p = [u_1, \ldots, u_p]$. From the optimization perspective, this can be formulated as the following nonconvex QCQP problem:

$$\max_{X \in \mathbb{R}^{n \times p},\; X^\top X = I_p} \; f(X) = \mathrm{tr}(X^\top A X), \qquad (1)$$

where $\mathrm{tr}(\cdot)$ represents the trace of a square matrix, i.e., the sum of its diagonal elements. It can be easily verified that $X = U_p$ maximizes the trace, attaining the value $\sum_{i=1}^{p} \sigma_i$.
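The claim that the top-$p$ eigenvectors maximize (1) can be checked numerically. The following short Python sketch (illustrative, not from the paper; the sizes `n`, `p` and the random seed are our choices) verifies it with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                       # symmetric test matrix

vals, vecs = np.linalg.eigh(A)          # eigenvalues in ascending order
U_p = vecs[:, ::-1][:, :p]              # top-p eigenvectors

opt = np.trace(U_p.T @ A @ U_p)         # equals the sum of the top-p eigenvalues
assert np.isclose(opt, vals[::-1][:p].sum())

# any other feasible X (orthonormal columns) attains a trace no larger than opt
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
assert np.trace(X.T @ A @ X) <= opt + 1e-10
```

The second assertion is an instance of the Ky Fan inequality, which underlies the formulation (1).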
2.2 Stochastic Riemannian Gradient Optimization
Given a Riemannian manifold $\mathcal{M}$, the tangent space at a point $X \in \mathcal{M}$, denoted $T_X\mathcal{M}$, is a Euclidean space that locally linearizes $\mathcal{M}$ around $X$ [17]. One iterate of Riemannian gradient optimization on $\mathcal{M}$ takes a form similar to that of the Euclidean case [1]:

$$X_{t+1} = R_{X_t}(\alpha_t \xi_t), \qquad (2)$$

where $\xi_t \in T_{X_t}\mathcal{M}$ is a tangent vector of $\mathcal{M}$ at $X_t$ and represents the search direction at the $t$-th step, $\alpha_t > 0$ is the learning rate (i.e., step size), and $R_{X_t}$ represents the retraction at $X_t$ that maps a tangent vector to a point on $\mathcal{M}$. Tangent vectors that serve as search directions are generally gradient-related. The gradient of a function $f$ on $\mathcal{M}$, denoted $\mathrm{grad}\, f(X)$, depends on the Riemannian metric, which is a family of smoothly varying inner products on tangent spaces, i.e., $\langle \cdot, \cdot \rangle_X$ on $T_X\mathcal{M}$ for any $X \in \mathcal{M}$. The Riemannian gradient is the unique tangent vector that satisfies

$$\mathrm{D} f(X)[\xi] = \langle \mathrm{grad}\, f(X), \xi \rangle_X \qquad (3)$$

for any $\xi \in T_X\mathcal{M}$, where $\mathrm{D} f(X)[\xi]$ represents the directional derivative of $f$ at $X$ in the tangent direction $\xi$. Setting $\xi_t = \mathrm{grad}\, f(X_t)$ in (2) leads to the Riemannian gradient (RG) ascent method:

$$X_{t+1} = R_{X_t}\!\left(\alpha_t\, \mathrm{grad}\, f(X_t)\right). \qquad (4)$$
We can also set $\xi_t = \mathrm{grad}\, f(X_t; z_t)$ in (2), which induces the stochastic Riemannian gradient (SRG) ascent method [4]:

$$X_{t+1} = R_{X_t}\!\left(\alpha_t\, \mathrm{grad}\, f(X_t; z_t)\right), \qquad (5)$$

where $z_t$ is an observation of the random variable $z$ at the $t$-th step that follows some distribution and satisfies $\mathbb{E}_z[\mathrm{grad}\, f(X; z)] = \mathrm{grad}\, f(X)$, so that $\mathrm{grad}\, f(X_t; z_t)$ is an unbiased stochastic Riemannian gradient. According to [4], the SRG method possesses almost sure (local) convergence under certain conditions, including $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (the latter condition implies that $\alpha_t \to 0$ as $t \to \infty$).

2.3 Stochastic Riemannian Eigensolver
The constraint set in problem (1) constitutes a Stiefel manifold, $\mathrm{St}(p,n) = \{X \in \mathbb{R}^{n \times p} : X^\top X = I_p\}$, which turns (1) into a Riemannian optimization problem:

$$\max_{X \in \mathrm{St}(p,n)} f(X) = \mathrm{tr}(X^\top A X), \qquad (6)$$

Note that $\mathrm{St}(p,n)$ is an embedded Riemannian submanifold of the Euclidean space $\mathbb{R}^{n \times p}$ [1]. With the metric inherited from the embedding space, i.e., $\langle \xi, \eta \rangle_X = \mathrm{tr}(\xi^\top \eta)$, and using (3), we can get the Riemannian gradient^{2}^{2}2Due to the symmetry of $A$, the Riemannian gradients under the Euclidean metric and the canonical metric are the same [28]. However, since the orthogonal projector used in the sequel requires the metrics for the embedded Riemannian submanifold and the embedding space to be the same, we choose the Euclidean metric here. as:

$$\mathrm{grad}\, f(X) = 2\left(AX - X X^\top A X\right).$$
The orthogonal projection onto the tangent space $T_X\mathrm{St}(p,n)$ under this metric is given by:

$$P_X(Z) = Z - X\, \mathrm{sym}(X^\top Z), \qquad (7)$$

for any $Z \in \mathbb{R}^{n \times p}$, where $\mathrm{sym}(M) = (M + M^\top)/2$. In this paper, we use the retraction [1]

$$R_X(\xi) = (X + \xi)\left(I_p + \xi^\top \xi\right)^{-1/2}, \qquad (8)$$

for any $\xi \in T_X\mathrm{St}(p,n)$. Deploying (4) and (5) here generates the Riemannian eigensolver (denoted RG-EIGS) and the stochastic Riemannian eigensolver (denoted SRG-EIGS), respectively. To the best of our knowledge, no existing stochastic Riemannian eigensolver uses this retraction. The closest counterpart is DSRG-EIGS, which uses the Cayley-transformation-based retraction. However, based on the analysis of DSRG-EIGS, it can be shown that SRG-EIGS possesses the same theoretical properties as DSRG-EIGS, e.g., sublinear convergence to global solutions.
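The geometric ingredients just described can be sketched numerically. The following Python snippet (illustrative; the function names and the polar-style orthonormalization used for the retraction are our choices, not the paper's code) implements the tangent projection (7), a retraction, and the resulting Riemannian gradient, and checks that one RG-EIGS step stays on the Stiefel manifold:

```python
import numpy as np

def proj_tangent(X, Z):
    """Orthogonal projection (7) onto the tangent space at X:
    P_X(Z) = Z - X * sym(X^T Z)."""
    S = X.T @ Z
    return Z - X @ ((S + S.T) / 2)

def retract(X, xi):
    """Polar-type retraction: orthonormalize X + xi via the inverse
    square root of its (symmetric positive definite) Gram matrix."""
    Y = X + xi
    w, V = np.linalg.eigh(Y.T @ Y)
    return Y @ (V / np.sqrt(w)) @ V.T

def riemannian_grad(A, X):
    """Riemannian gradient of f(X) = tr(X^T A X) under the Euclidean
    metric: P_X(2 A X) = 2(AX - X X^T A X)."""
    return proj_tangent(X, 2 * A @ X)

# one RG-EIGS ascent step stays on the manifold
rng = np.random.default_rng(1)
n, p = 10, 2
B = rng.standard_normal((n, n))
A = (B + B.T) / 2
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
X_next = retract(X, 0.1 * riemannian_grad(A, X))
assert np.allclose(X_next.T @ X_next, np.eye(p), atol=1e-10)
```

The projection also certifies tangency: applying `proj_tangent` to `riemannian_grad(A, X)` leaves it unchanged, since the projector is idempotent.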
3 SVRRG-EIGS
In this section, we propose the stochastic variance reduced Riemannian gradient (SVRRG) and specialize it to the eigensolver problem.
3.1 SVRRG
Recall that the stochastic variance reduced gradient (SVRG) [13] is built on the vanilla stochastic gradient and achieves variance reduction by constructing control variates [26]. Control variates are stochastic and zero-mean, serving to augment and correct stochastic gradients towards the true gradients. Following [13], SVRG is encoded as

$$v_t = \nabla f(x_t; z_t) - \left(\nabla f(\tilde{x}; z_t) - \nabla f(\tilde{x})\right), \qquad (9)$$

where $\tilde{x}$ is a version of the estimate that is kept as a snapshot after every $m$ SGD steps, and $\nabla f(\tilde{x})$ is the full gradient at $\tilde{x}$.

Our task here is to develop the Riemannian counterpart of SVRG, which we call SVRRG. Denote the SVRRG direction as $\xi_t$. A naive adaptation of (9) to a Riemannian manifold reads

$$\xi_t = \mathrm{grad}\, f(X_t; z_t) - \left(\mathrm{grad}\, f(\tilde{X}; z_t) - \mathrm{grad}\, f(\tilde{X})\right),$$

where $\mathrm{grad}\, f(X_t; z_t) \in T_{X_t}\mathcal{M}$ while the control variate $\mathrm{grad}\, f(\tilde{X}; z_t) - \mathrm{grad}\, f(\tilde{X}) \in T_{\tilde{X}}\mathcal{M}$. However, this adaptation is not theoretically sound: the stochastic Riemannian gradient and the control variate reside in two different tangent spaces, which makes their difference ill-defined. We rectify this problem by parallel transport [1], which moves tangent vectors from one point to another (and accordingly from one tangent space to another) along geodesics in parallel. More specifically, we parallel transport the control variate from $T_{\tilde{X}}\mathcal{M}$ to $T_{X_t}\mathcal{M}$. For computational efficiency, a first-order approximation, called vector transport [1], is used.
The vector transport of a tangent vector $\eta \in T_X\mathcal{M}$ from point $X$ to point $Y$, denoted $\mathcal{T}_{X \to Y}(\eta)$, is a mapping from the tangent space $T_X\mathcal{M}$ to the tangent space $T_Y\mathcal{M}$. When $\mathcal{M}$ is an embedded Riemannian submanifold of a Euclidean space, the vector transport can simply be defined as [1]:

$$\mathcal{T}_{X \to Y}(\eta) = P_Y(\eta),$$

where $P_Y$ represents the orthogonal projector onto $T_Y\mathcal{M}$ for the embedding Euclidean space. With the vector transport, we obtain the well-defined SVRRG direction in $T_{X_t}\mathcal{M}$:

$$\xi_t = \mathrm{grad}\, f(X_t; z_t) - \mathcal{T}_{\tilde{X} \to X_t}\!\left(\mathrm{grad}\, f(\tilde{X}; z_t) - \mathrm{grad}\, f(\tilde{X})\right).$$

We then arrive at our SVRRG method:

$$X_{t+1} = R_{X_t}(\alpha\, \xi_t), \qquad (10)$$

obtained by plugging this $\xi_t$ into (2). Note that the SVRRG method (10) is naturally subsumed into the SRG method (5), and thus enjoys all the properties of SRG.
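The transported control variate can be sketched in a few lines of Python. The snippet below (illustrative; function names are ours) builds the SVRRG direction on the Stiefel manifold, where the vector transport reduces to the orthogonal projection, and checks that the result is a genuine tangent vector at the current iterate:

```python
import numpy as np

def proj_tangent(X, Z):
    """Orthogonal projection onto the Stiefel tangent space at X, which
    doubles as the vector transport for this embedded submanifold."""
    S = X.T @ Z
    return Z - X @ ((S + S.T) / 2)

def svrrg_direction(grad_t, grad_snap, full_grad_snap, X_t):
    """SVRRG direction (10): the current stochastic Riemannian gradient
    minus the control variate, vector-transported from the snapshot's
    tangent space into the current one."""
    return grad_t - proj_tangent(X_t, grad_snap - full_grad_snap)

# the resulting direction lies in the tangent space at X_t
rng = np.random.default_rng(2)
n, p = 6, 2
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
g_t = proj_tangent(X, rng.standard_normal((n, p)))   # tangent at X_t
cv = rng.standard_normal((n, p))                     # snapshot-space quantity
xi = svrrg_direction(g_t, cv, np.zeros((n, p)), X)
assert np.allclose(proj_tangent(X, xi), xi)          # xi is already tangent
```

The assertion works because the projector is idempotent: projecting the transported control variate a second time changes nothing.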
3.2 SVRRG-EIGS
With the SVRRG described above, we can now develop an effective eigensolver by specializing it to problem (6). This new eigensolver is named SVRRG-EIGS. The update can be written as

$$X_{t+1} = R_{X_t}(\alpha\, \xi_t), \qquad (11)$$

which can be decomposed into two sub-steps: the tangent step $\hat{X}_{t+1} = X_t + \alpha \xi_t$ and the retraction step $X_{t+1} = R_{X_t}(\alpha \xi_t)$. Intuitively, the first sub-step moves along the direction $\xi_t$ from the current point $X_t$ to the intermediate point $\hat{X}_{t+1}$ determined by the tangent space $T_{X_t}\mathrm{St}(p,n)$. The second sub-step then retracts the intermediate point back onto the Stiefel manifold to reach the next point $X_{t+1}$.
Let us delve into the first sub-step. Except for the vector transport inside $\xi_t$, it looks much like an SVRG step, since it works in the Euclidean tangent space. Assume that $A = \sum_{i=1}^{N} A_i$, that $z$ is a random variable distributed uniformly over $\{1, \ldots, N\}$, and that the stochastic gradient takes the form $\mathrm{grad}\, f(X; i) = P_X(2NA_iX)$ (i.e., sampling over the data $\{A_i\}$). We can then get the control variate as

$$\mathrm{grad}\, f(\tilde{X}; i) - \mathrm{grad}\, f(\tilde{X}) = P_{\tilde{X}}\!\left(2(NA_i - A)\tilde{X}\right).$$

By using the orthogonal projector in (7), the transported control variate can be written as

$$\mathcal{T}_{\tilde{X} \to X_t}\!\left(\mathrm{grad}\, f(\tilde{X}; i) - \mathrm{grad}\, f(\tilde{X})\right) = P_{X_t}\!\left(P_{\tilde{X}}\!\left(2(NA_i - A)\tilde{X}\right)\right).$$

Accordingly, we have the SVRRG direction expressed as

$$\xi_t = \mathrm{grad}\, f(X_t) + \epsilon_t,$$

where $\epsilon_t$ is a stochastic zero-mean term conditioned on $\mathcal{F}_t$. Note that the snapshot factor $\tilde{X}$ in the control variate might be theoretically harsh^{3}^{3}3It works well empirically., because an eigenspace can have distinct representations that are identical up to a $p \times p$ orthogonal matrix, and thus $X_t$ and $\tilde{X}$ are only expected to share the same column space at convergence. The ideal replacement would be a representative of $\mathrm{span}(\tilde{X})$ aligned with $X_t$, where $\mathrm{span}(\cdot)$ represents the column space. Numerically, this can be achieved by replacing $\tilde{X}$ with $\tilde{X}UV^\top$, where $USV^\top$ is the SVD of $\tilde{X}^\top X_t$ [24]. The first sub-step can now be rewritten as

$$\hat{X}_{t+1} = X_t + \alpha\left(\mathrm{grad}\, f(X_t) + \epsilon_t\right).$$
As a comparison, we can similarly decompose the update steps (4) and (5) of RG-EIGS and SRG-EIGS into two sub-steps. Compared to RG-EIGS, each step of both SRG-EIGS and SVRRG-EIGS amounts to taking one Riemannian gradient step in the tangent space, adding a stochastic zero-mean term in the tangent space, and then retracting back to the manifold. However, the stochastic zero-mean term in SRG-EIGS has a constant variance. It therefore needs the learning rate to decay to zero to reduce the variance and ensure convergence, which compromises the convergence rate. In contrast, SVRRG-EIGS keeps boosting the variance reduction of its stochastic zero-mean term during the iterations: the variance of that term is not constant but dominated by three quantities that measure the discrepancy between $X_t$ and $\tilde{X}$. These quantities decay repeatedly until vanishing in expectation, as $X_t$ and $\tilde{X}$ are expected to get closer and closer to each other. This induces a decaying variance without involving the learning rate. Therefore, SVRRG-EIGS is able to use a fixed learning rate and achieve a much faster convergence rate.
4 Theoretical Analysis
We give the main theoretical results in this section. The proofs are provided in the supplementary material.
Theorem 4.1.
Consider a symmetric matrix $A$ which can be written as $A = \sum_{i=1}^{N} A_i$ such that the components $A_i$ are bounded. The eigendecomposition of $A$ is as defined in Section 2.1, and the eigengap $\sigma_p - \sigma_{p+1}$ is positive. Then the top-$p$ eigenvectors can be approximated to arbitrary accuracy and with any confidence level by running enough epochs of our SVRRG-EIGS algorithm, in the sense that the potential function falls below the target accuracy with probability at least the prescribed confidence level, provided that the following conditions on the initial iterate $X_0$, the fixed learning rate $\alpha$, and the epoch length $m$ are simultaneously satisfied, where the constants involved are positive.
Note that assuming $A \succeq 0$ in the theorem incurs no loss of generality. In fact, if $A$ has negative eigenvalues bounded below by $-c$ for some $c > 0$ (which could be estimated by, e.g., the Gershgorin circle theorem), we could replace $A$ with $A + cI \succeq 0$ and arrive at the same eigenspace. Another way of addressing this generality is to adopt the idea of [24], that is, shifting the learning rate and the eigengap accordingly, with some of the constants in the theorem re-derived. The condition on the initial iterate $X_0$ is theoretically nontrivial. Empirically, however, it can be well satisfied by running other stochastic algorithms (e.g., SRG-EIGS or DSRG-EIGS) or a few steps of deterministic iterative algorithms (e.g., RG-EIGS), because they are good at finding suboptimal solutions. In our experiments, we use SRG-EIGS for this purpose, which makes the theorem amount to a convergence analysis of the later stage of the hybrid algorithm (i.e., starting from the warm-started iterate rather than a random one). The convergence rate of our algorithm can be roughly identified from the number of iterations required, which establishes an exponential global convergence rate. Compared to the sublinear rate of DSRG-EIGS [2], this is a significant improvement, since the complexity of a single iteration in the two algorithms differs only by constants. In summary, initialized by a low-precision eigensolver, our SVRRG-EIGS algorithm obtains a high-precision solution in a limited number of epochs (data passes), as theoretically guaranteed by Theorem 4.1.
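The shift argument above can be sanity-checked numerically: adding $cI$ moves every eigenvalue up by $c$ while leaving eigenvectors and eigengaps untouched. A short illustrative NumPy check (our own, not from the paper; the concrete shift is one valid choice, a Gershgorin bound would also work):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2
c = np.abs(np.linalg.eigvalsh(A)).max() + 1.0   # a valid shift making A + cI PSD

vals, vecs = np.linalg.eigh(A)
vals_s, vecs_s = np.linalg.eigh(A + c * np.eye(5))

assert np.allclose(vals_s, vals + c)                  # spectrum shifted by c
assert np.allclose(np.abs(vecs_s.T @ vecs), np.eye(5), atol=1e-6)  # same eigenvectors
assert np.allclose(np.diff(vals_s), np.diff(vals))    # eigengaps preserved
```

The eigenvector comparison uses absolute values because `eigh` may flip signs; for a generic (simple) spectrum, the two eigenbases agree column by column up to sign.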
We provide a full proof of Theorem 4.1 in the Appendix; though somewhat involved, for ease of exposition and understanding we decompose it into three steps in a way similar to [24]: the analysis of one iteration, one epoch, and one run of the algorithm. Among these, the first step (the one-iteration analysis) lies at the core of the main proof, where the techniques we use differ substantially from those in [3, 24] due to our new context of Riemannian manifolds, or more precisely, Stiefel manifolds. This inherently different context requires new techniques, which in turn yield an improved exponential global convergence and thereby further improvements over the sublinear rate of [2].
5 Experiments
In this section, we empirically verify the exponential convergence rate of our SVRRG-EIGS algorithm and demonstrate its capability of finding high-precision solutions when combined with other, low-precision algorithms. Specifically, we use SRG-EIGS to generate a low-precision solution for initializing SVRRG-EIGS, and compare against both RG-EIGS and SRG-EIGS. Among the various implementations of RG-EIGS with different choices of metric and retraction in (2), we choose the one with the canonical metric and the Cayley-transformation-based retraction [28], since its code is publicly available^{4}^{4}4optman.blogs.rice.edu/. This version of RG-EIGS uses a nonmonotone line search with the well-known Barzilai-Borwein step size, which significantly reduces the iteration number and performs well in practice. Both RG-EIGS and SRG-EIGS are fed with the same random initial value of $X_0$, where each entry is sampled from the standard normal distribution and then all entries as a whole are orthogonalized. SRG-EIGS uses a decaying learning rate whose base value will be tuned.

We verify the properties of our algorithm on a real symmetric matrix, Schenk^{5}^{5}5www.cise.ufl.edu/research/sparse/matrices/. We partition $A$ into column blocks so that we can write $A = \sum_i A_i$, with each $A_i$ retaining one column block of $A$ and setting all others to zero. For SVRRG-EIGS, we are able to use a fixed learning rate, chosen by a heuristic similar to that in [24]. Each epoch takes a fixed number of passes over $A$ (including one pass for computing the full gradient), and the epoch length of SRG-EIGS is set accordingly.

The performance of the different algorithms is evaluated using three quality measures: feasibility, relative error of the objective, and normalized potential function. The ground truths in these measures are obtained using Matlab's EIGS function for benchmarking. For each measure, lower values indicate higher quality.
Given a solution of low precision^{6}^{6}6This low precision could be problem dependent. at initialization, our SVRRG-EIGS targets double precision. Each algorithm terminates when the precision requirement is met or the maximum number of epochs is reached.
We report the convergence curves in terms of each measure, from which empirical convergence rates of the algorithms can be observed. Figure 1 reports the performance of the different algorithms. In terms of feasibility, both SRG-EIGS and SVRRG-EIGS perform well, while RG-EIGS produces much poorer results. This is because the Cayley-transformation-based retraction used therein relies heavily on the Sherman-Morrison-Woodbury formula, which suffers from numerical instability. From Figures 1(b) and 1(c), we observe similar convergence trends for each algorithm under the two measures. All three algorithms improve their solutions with more iterations, with several exceptions for RG-EIGS due to the nonmonotone step size used in its implementation. We also observe that SRG-EIGS exhibits an exponential convergence rate at an early stage, thanks to a relatively large learning rate. However, it subsequently enters a long period of sub-exponential convergence, making little progress towards the optimal solution. In contrast, our SVRRG-EIGS inherits the initial momentum from SRG-EIGS and maintains the exponential convergence rate throughout the entire process, which enables it to approach the optimal solution quickly. RG-EIGS shows a different trend: it converges sub-exponentially at the beginning and performs the worst. Though it converges fast at a later stage, it still needs more passes over the data than SVRRG-EIGS to achieve high precision.
6 Related Work
Existing methods for eigendecomposition include the power method [9], the (block) Lanczos algorithms [5], randomized SVD [10], Riemannian methods [25, 1], and so on. All these methods perform batch learning, while our focus in this paper is on stochastic algorithms. From this perspective, the few existing works include online learning of eigenvectors [8], which targets the leading eigenvector only (i.e., $p = 1$), and the doubly stochastic Riemannian method (DSRG-EIGS) [2], where the learning rate has to decay to zero. [8] provides a regret analysis without empirical verification of their method, while DSRG-EIGS corresponds to one of the implementations of SRG-EIGS in this paper, where the double stochasticity comes from sampling over both data and coordinates of Riemannian gradients. On the other hand, since the work of [13], variance reduction (SVRG) has become an appealing technique for stochastic optimization. Quite a few variants have been developed from different perspectives, such as practical SVRG [11], second-order SVRG [15], distributed or asynchronous SVRG [16, 12], and nonconvex SVRG [24, 23]. Our SVRRG belongs to nonconvex SVRG, but is developed from the Riemannian optimization perspective. The core techniques we use differ substantially from existing ones due to our new context.
7 Conclusion
In this paper, we proposed the generalization of SVRG to Riemannian manifolds and established the general framework of SVRG in this setting, SVRRG, which requires the key ingredient of vector transport to be well-defined. It is then deployed on the eigensolver problem, inducing the SVRRG-EIGS algorithm. We analyzed its theoretical properties in detail. As our theoretical results suggest, the proposed algorithm is guaranteed to find high-precision solutions at an exponential convergence rate. The theoretical implications are verified on a real dataset. For future work, we will explore the possibility of addressing the limitations of SVRRG-EIGS, e.g., the dependence on the eigengap and the nontrivial initialization. We may also conduct more empirical investigations into the performance of SVRRG-EIGS.
APPENDIX: Supplementary Material
Appendix A Useful Lemmas
In this section, some definitions, basic facts, and a group of useful lemmas are provided. All matrices are assumed to be real.
A.1 Definitions and Basics
A.1.1 Matrix facts: symmetry, positive semidefiniteness, trace, norm and orthogonality
The notation $A \succeq 0$ ($A \succ 0$) indicates that $A$ is symmetric and positive semidefinite (definite). The trace of a square matrix, $\mathrm{tr}(A)$, is the sum of the diagonal entries of $A$. A useful fact about the trace is its circular property, e.g., $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$ for matrices $A$, $B$, $C$ of compatible sizes. $\|\cdot\|_F$ and $\|\cdot\|_2$ represent the Frobenius norm and the spectral norm (i.e., the matrix 2-norm), respectively; $\lambda_{\max}(\cdot)$ represents the maximum eigenvalue of a symmetric matrix, and $\sigma_{\max}(\cdot)$ represents the maximum singular value of a matrix. Note that $AB$ and $BA$ have the same set of nonzero eigenvalues for any two matrices $A$ and $B$ of compatible sizes, so their spectral radii coincide. In this document, we always assume that the eigenvalues of a symmetric matrix are ordered non-increasingly. The spectral norm satisfies $\|A\|_2 = \sigma_{\max}(A)$, which implies $\|A\|_2 \le \|A\|_F$ for any matrix $A$. For any two matrices $A$ and $B$ that make $AB$ well-defined, the submultiplicative bound $\|AB\| \le \|A\|_2 \|B\|$ holds for both the Frobenius norm and the spectral norm, and the triangle inequality $\|A + B\| \le \|A\| + \|B\|$ holds. Furthermore, orthogonal invariance also holds for both norms, i.e., multiplying on the left (right) by a matrix with orthonormal columns (rows) leaves the norm unchanged. For a column-orthonormal matrix $X$, let $X_\perp$ represent its orthogonal complement, i.e., the column-orthonormal matrix satisfying $X^\top X_\perp = 0$, which implies that $[X, X_\perp]$ is orthogonal and $XX^\top + X_\perp X_\perp^\top = I$.

A.1.2 Martingale
A filtration, defined on a measurable probability space, is an increasing sequence of sub-sigma-algebras $\{\mathcal{F}_t\}$, meaning that $\mathcal{F}_s \subseteq \mathcal{F}_t$ for all $s \le t$. In our context, $\mathcal{F}_t$ encodes the set of all the random variables seen thus far (i.e., from step $0$ to step $t$). In this document, "conditioned on $X_t$" refers to "conditioned on $\mathcal{F}_t$" for brevity. Let $\{Y_t\}$ and $\{\mathcal{F}_t\}$ be a stochastic process and a filtration, respectively, on the same probability space. Then $\{Y_t\}$ is called a martingale (supermartingale) with respect to $\{\mathcal{F}_t\}$ if, for each $t$, $Y_t$ is $\mathcal{F}_t$-measurable, $\mathbb{E}[|Y_t|] < \infty$, and $\mathbb{E}[Y_{t+1} \mid \mathcal{F}_t] = Y_t$ ($\le Y_t$, respectively). Given a nonnegative random variable $Y$ and a constant $a > 0$, we have $\Pr(Y \ge a) \le \mathbb{E}[Y]/a$ (Markov inequality). Let $\{Y_t\}$ be a martingale or supermartingale with bounded differences $|Y_{t+1} - Y_t| \le c_t$, where $c_t$ is a deterministic function of $t$. Then for all $T$ and any $a > 0$, $\Pr(Y_T - Y_0 \ge a) \le \exp\!\left(-a^2 / \left(2 \sum_{t=0}^{T-1} c_t^2\right)\right)$ (Azuma-Hoeffding inequality) [19].
A.2 Lemmas
Lemma A.1.
For any , it holds that
Lemma A.2.
If and , then .
Lemma A.3.
Let be a square matrix, where are fixed and are stochastic zero-mean. Furthermore, suppose that for some fixed , it holds with probability that

For all ,


Then
Lemma A.4.
Let be a matrix with minimal singular value and . Then
Lemma A.5.
For any matrices with orthonormal columns, let . Then
where is the SVD of .
Lemma A.6.
Let and be as defined in Section 3.2 of the main paper. Assume and . Then for any matrix with orthonormal columns, it holds that