Matrix eigen-decomposition is among the core and long-standing topics in numerical computing . It plays fundamental roles in various scientific and engineering computing problems (such as numerical computation [9, 22] and structural analysis 
) as well as machine learning tasks (such as kernel approximation, dimensionality reduction 20]). Thus far, there hasn’t been many algorithms proposed for this problem. Pioneering ones include the method of power iteration  and the (block) Lanczos algorithm , while randomized SVD 
and online learning of eigenvectors are recently proposed. The problem can also be expressed as a quadratically constrained quadratic program (QCQP), and thus can be approached by various optimization methods, such as trace penalty minimization  and Riemannian optimization algorithms [7, 1, 28]. Most of these algorithms perform the batch learning, i.e., using the entire dataset to perform the update at each step. This could be well addressed by designing appropriate stochastic algorithms. However, the state-of-the-art stochastic algorithm DSRG-EIGS  requires the learning rate to repeatedly decay till vanishing in order to guarantee convergence, which results in a slow convergence of sub-linear rate.
We propose a new stochastic Riemannian algorithm that makes a significant breakthrough theoretically. It improves the state-of-the-art sub-linear convergence rate to an exponential convergence one. The algorithm is inspired by the stochastic variance reduced gradient (SVRG) optimization , which was originally developed to solve convex problems in the Euclidean space. We propose the general form of variance reduction, called SVRRG, in the framework of the stochastic Riemannian gradient (SRG) optimization , such that it is able to enjoy the convergence properties (e.g., almost sure local convergence) of the SRG framework. We then get it specialized to the Riemannian eigensolver (RG-EIGS) problem so that it gives rise to our stochastic variance reduced Riemannian eigensolver, termed as SVRRG-EIGS. Our theoretical analysis shows that SVRRG-EIGS can use a constant learning rate, thus eliminating the need of using the decaying learning rate. Moreover, it not only possesses the global convergence in expectation compared to SRG , but also gains an accelerated convergence of exponential rate compared to DSRG-EIGS. To the best of our knowledge, we are the first to propose and analyze the generalization of SVRG to Riemannian manifolds.
The rest of the paper is organized as follows. Section 2 briefly reviews some preliminary knowledge on matrix eigen-decomposition, stochastic Riemannian gradient optimization and stochastic Riemannian eigensolver. Section 3 presents our stochastic variance reduced Riemannian eigensolver algorithm, starting from establishing the general form of variance reduction for the stochastic Riemannian gradient optimization. Theoretical analysis is conducted in Section 4, followed by the empirical study of our algorithm in Section 5. Section 6 discusses related works. Finally, Section 7 concludes the paper.
2 Preliminaries and Notations
2.1 Matrix Eigen-decomposition
The eigen-decomposition of a symmetric111The given matrix is assumed to be symmetric throughout the paper, i.e., . matrix can be written as , where
(identity matrix), andis a diagonal matrix. The -th column of
is called the eigenvector corresponding to the eigenvalue(-th diagonal element of ), i.e., . Assume that , and , and . In practice, matrix eigen-decomposition only aims at the set of top eigenvectors . From the optimization perspective, this can be formulated as the following non-convex QCQP problem:
where and represents the trace of a square matrix, i.e., the sum of diagonal elements of a square matrix. It can be easily verified that maximizes the trace at .
2.2 Stochastic Riemannian Gradient Optimizaiton
Given a Riemmanian manifold , the tangent space at a point , denoted as , is a Euclidean space that locally linearizes around . One iterate of the Riemannian gradient optimization on takes the form similar to that of the Euclidean case :
is a tangent vector ofat and represents the search direction at the -th step, is the learning rate (i.e., step size), and represents the retraction at that maps a tangent vector to a point on . Tangent vectors that serve as search directions are generally gradient-related. The gradient of a function on , denoted as , depends on the Riemannian metric, which is a family of smoothly varying inner products on tangent spaces, i.e., , where for any . The Riemannian gradient is the unique tangent vector that satisfies
for any , where represents the directional derivative of in the tangent direction . Setting in (2) leads to the Riemannian gradient (RG) ascent method:
is an observation of the random variableat the -th step that follows some distribution and satisfies , and is the stochastic Riemannian gradient such that . According to , the SRG method possesses the almost sure (local) convergence under certain conditions, including and (the latter condition implies that as ).
2.3 Stochastic Riemannian Eigensolver
where . Note that is an embedded Riemannian sub-manifold of the Euclidean space . With the metric inherited from the embedding space , i.e., , and using (3), we can get the Riemannian gradient222Due to the symmetry of , the Riemannian gradients under Euclidean metric and canonical metric are the same . However, since the orthogonal projector used in the sequel requires the metrics for the embedded Riemannian sub-manifold and the embedding space to be the same, we choose the Euclidean metric here. as:
The orthogonal projection onto under this metric is given by:
for any , where . In this paper, we use the retraction 
for any . The deployment of (4) and (5) here will then generate the Riemannian eigensolver (denoted as RG-EIGS) and the stochastic Riemannian eigensolver (denoted as SRG-EIGS), respectively. To the best of our knowledge, there is no existing stochastic Riemannian eigensolver that uses this retraction. The closest counterpart is the DSRG-EIGS that uses the Cayley transformation based retraction. However, based on the work of DSRG-EIGS, it can be shown that SRG-EIGS possesses the same theoretical properties as DSRG-EIGS, e.g., sub-linear convergence to global solutions.
In this section, we propose the stochastic variance reduced Riemannian gradient (SVRRG) and specialize it to the eigensolver problem.
Recall that the stochastic variance reduced gradient (SVRG)  is built on the vanilla stochastic gradient and achieves variance reduction through constructing control variates . Control variates are stochastic and zero-mean, serving to augment and correct stochastic gradients towards the true gradients. Following , SVRG is encoded as
is a version of the estimatedthat is kept as a snapshot after every SGD steps, and is the full gradient at .
Our task here is to develop the Riemannian counterpart SVRRG of SVRG. Denote the SVRRG as . A naive adaptation of (9) to a Riemannian manifold reads
where and . However, this adaptation is not sound theoretically: the stochastic Riemannian gradient and the control variate reside in two different tangent spaces, and thus making their difference not well-defined. We rectify this problem by the parallel transport , which moves tangent vectors from one point to another (accordingly from one tangent space to another) along geodesics in parallel. More specifically, we parallel transport the control variate from to . For computational efficiency, the first-order approximation, called vector transport , is used.
Vector transport of a tangent vector from point to point , denoted as , is a mapping from tangent space to tangent space . When is an embedded Riemannian sub-manifold of a Euclidean space, vector transport can be simply defined as :
where represents the orthogonal projector onto for the embedding Euclidean space. With the vector transport, we obtain the well-defined SVRRG in :
We then arrive at our SVRRG method:
With the SVRRG described above, we can now proceed to develop an effective eigensolver by specializing (6). This new eigensolver is named SVRRG-EIGS. The update can be written as
which can be decomposed into two substeps: and . Intuitively, the first substep moves along the direction from the current point to the intermediate point in the tangent space . The second substep then gets the intermediate point retracted back onto the Stiefel manifold to reach the next point .
Let’s delve into the first substep. Except for the vector transport inside , it looks much like an SVRG step since it works in the Euclidean tangent space. Assume that we have , is a random variable taking values in , , and stochastic gradient takes the form (i.e., sampling over data ). We can get the control variate as
By using the orthogonal projector in (7), the transported control variate can be written as
Accordingly, we have the SVRRG expressed as
where is a stochastic zero-mean term conditioned on . Note that the factor in might be theoretically harsh333It works well empirically.
, because an eigenspace could have distinct representations which are the same up to aorthogonal matrix, and thus it is only expected that and have the same column space at convergence. The ideal replacement would be where represents the column space. Numerically it could be achieved by replacing with where and is the SVD of .
The first substep can now be rewritten as
Compared to that in RG-EIGS, each step in both SRG-EIGS and SVRRG-EIGS amounts to taking one Riemannian gradient step in the tangent space, adding a stochastic zero-mean term in the tangent space, and then retracting back to the manifold. However, the stochastic zero-mean term in SRG-EIGS has a constant variance. Therefore it needs the learning rate to decay to zero to reduce the variance and to ensure the convergence, and consequently compromises on the convergence rate. In contrast, SVRRG-EIGS keeps boosting the variance reduction of the stochastic zero-mean term during iterations. The variance of is not constant but dominated by three quantities , and . These quantities repeatedly decay till vanishing in expectation, as and are expected to get closer and closer to each other gradually. This induces a decaying variance without the learning rate involved. Therefore, SVRRG-EIGS is able to use a fixed learning rate and achieve a much faster convergence rate.
4 Theoretical Analysis
We give the main theoretical results in this section. The proofs are provided in the supplementary material.
Consider a symmetric matrix which can be written as such that . The eigen-decomposition of is as defined in Section 2.1. And the eigen-gap . Then the top eigenvectors can be approximated to arbitrary accuracy and with any confidence level by running epochs of our SVRRG-EIGS algorithm, in the sense that the potential function with probability at least
with probability at least, provided that the following conditions about initial iterate , fixed learning rate and epoch length , are simultaneously satisfied:
where the constants are positive and defined as
Note that we have no loss of generality from assuming that in the theorem. In fact, if with (which could be estimated by, e.g., Gershgorin circle theorem), we could replace with to get and arrive at the same eigen-space. Another way of addressing this generality is to adopt the idea of , that is, replacing the learning rate with and the eigen-gap with , with some of the constants in the theorem re-derived. The condition on the initial iterate, i.e., , is theoretically non-trivial. However, empirically this condition can be well satisfied by running other stochastic algorithms (e.g., SRG-EIGS or DSRG-EIGS) or a few steps of deterministic iterative algorithms (e.g., RG-EIGS), because they are good at finding sub-optimal solutions. In our experiments, we use SRG-EIGS for this purpose, which makes the theorem amount to a convergence analysis at a later stage of the hybrid algorithm (e.g., starting from instead of ). The convergence rate of our algorithm can be roughly identified by the iteration number which establishes an exponential global convergence rate. Compared to the sub-linear rate of DSRG-EIGS by , it achieves a significant improvement since the complexity of a single iteration in the two algorithms only differs by constants. In summary, initialized by a low-precision eigensolver, our SVRG-EIGS algorithm would obtain a high-precision solution in a limited number of epochs (data passes), which is theoretically guaranteed by Theorem 4.1.
We provide an elegant proof of Theorem 4.1 in Appendix, though it is a bit involved. For ease of exposition and understanding, we decompose this course into three steps in a way similar to , including the analysis on one iteration, one epoch and one run of the algorithm. Among them, the first step (i.e., one iteration analysis) lies at the core of the main proof, where the techniques we use are dramatically different from those in [3, 24] due to our new context of Rimannian manifolds, or more precisely, Stiefel manifolds. This inherently different context requires new techniques, which in turn yield an improved exponential global convergence and accordingly bring more improvements over the convergence of sub-linear rate .
In this section, we empirically verify the exponential convergence rate of our SVRRG-EIGS algorithm and demonstrate its capability of finding solutions of high precision when combined with other algorithms of low precision. Specifically, we use SRG-EIGS to generate a low-precision solution for initializing SVRRG-EIGS, and do the comparison with both RG-EIGS and SRG-EIGS. Among various implementations of RG-EIGS with different choices of metric and retraction in (2), we choose the one with canonical metric and Cayley transformation based retraction  since its code is publically available444optman.blogs.rice.edu/. This version of RG-EIGS uses the non-monotone line search with the well-known Barzilai-Borwein step size, which significantly reduces the iteration number, and performs well in practice. Both RG-EIGS and SRG-EIGS are fed with the same random initial value of
, where each entry is sampled from the standard normal distributionand then all entries as a whole are orthogonalized. SRG-EIGS uses the decaying learning rate where will be tuned.
We verify the properties of our algorithm on a real symmetric matrix, Schenk555www.cise.ufl.edu/research/sparse/matrices/, of size, with nonzero entries. We partition into column blocks with block size equal to so that we can write with and each having only one column block of and all others zero. We set . For SVRRG-EIGS, we are able to use a fixed learning rate
and adopt the heuristic( represents the matrix -norm), similar to that in . We set and epoch length , i.e., each epoch takes passes over (including one pass for computing the full gradient). Accordingly, the epoch length of SRG-EIGS is set to . In addition, we set .
The performance of different algorithms is evaluated using three quality measures: feasibility , relative error function , and normalized potential function . The ground truths in these measures, including both and that is set to , are obtained using Matlab’s EIGS function for benchmarking. For each measure, lower values indicate higher quality.
Given a solution of low precision666This low precision could be problem dependent. at , our SVRRG-EIGS targets a double precision, that is, or . Each algorithm terminates when the precision requirement is met or the maximum number of epoches (set as ) is reached.
We report the convergence curves in terms of each measure, on which empirical convergence rates of the algorithms can be observed. Figure 1 reports the performance of different algorithms. In terms of feasibility, both SRG-EIGS and SVRRG-EIGS perform well, while RG-EIGS produces much poorer results. This is because the Cayley transformation based retraction used therein relies heavily on the Sherman-Morrison-Woodbury formula, which suffers from the numerical instability. From Figures 1(b) and 1(c), we observe similar convergence trends for each algorithm under the two different measures. All three algorithms improve their solutions with more iteration. There are several exceptions in RG-EIGS. This is due to the non-monotone step size used in its implementation. We also observe that SRG-EIGS presents an exponential convergence rate at an early stage thanks to a relatively large learning rate. However, it subsequently steps into a long period of sub-exponential convergence, which leads to small progress towards the optimal solution. In contrast, our SVRRG-EIGS inherits the initial momentum from SRG-EIGS and keeps the exponential convergence rate throughout the entire process. This enables it to approach the optimal solution at a fast speed. RG-EIGS has a different trend. It converges sub-exponentially at the beginning and performs the worst. Though it converges fast at a later stage, it still needs more passes over data than SVRRG-EIGS in order to achieve a high precision.
6 Related Work
Existing methods on eigensolvers include the power method , the (block) Lanczos algorithms , Randomized SVD , Riemannian methods [25, 1], and so on. All these methods performs the batch learning, while our focus in this paper is on stochastic algorithms. From this perspective, few existing works include online learning of eigenvectors  which aims at the leading eigenvector, i.e., , and doubly stochastic Riemannian method (DSRG-EIGS)  where the learning rate has to decay to zero.  provides the regret analysis without empirical verification for their method, while DSRG-EIGS belongs to one of implementations of SRG-EIGS in this paper where the double stochasticity comes from sampling over both data and coordinates of Riemmanian gradients. On the other hand, since the work of , variance reduction (SVRG) has become an appealing technique to stochastic optimization. There are quite some variants developed from different perspectives, such as practical SVRG , second-order SVRG , distributed or asynchronous SVRG [16, 12], and non-convex SVRG [24, 23]. Our SVRRG belongs to non-convex SVRG, but is addressed from the Riemannian optimization perspective. The core techniques we use are dramatically different from existing ones due to our new context.
In this paper, we proposed the generalization of SVRG to Riemannian manifolds, and established the general framework of SVRG in this setting, SVRRG, which requires the key ingredient, vector transport, to make itself well-defined. It is then deployed to the eigensolver problem and induces the SVRRG-EIGS algorithm. We analyzed its theoretical properties in detail. As suggested by our theoretical results, the proposed algorithm is guaranteed to find high-precision solutions at an exponential convergence rate. The theoretical implications are verified on a real dataset. For future work, we will explore the possibility of addressing the limitations of SVRRG-EIGS, e.g., dependence on eigen-gap and non-trivial initialization. We may also conduct more empirical investigations on the performance of SVRRG-EIGS.
APPENDIX: Supplementary Material
Appendix A Useful Lemmas
In this section, some definition, basics, and a group of useful lemmas are provided. All the matrices are assumed to be real.
a.1 Definitions and Basics
a.1.1 Matrix facts: symmetry, positive semi-definiteness, trace, norm and orthogonality
The matrix () represents that is symmetric and positive semidefinite (definite), and if then as well. The trace of a square matrix, , is the sum of diagonal entries of . A useful fact about trace is the circular property, e.g., for matrices . and represents the Frobenious-norm and spectral norm (i.e., matrix -norm) of matrix , respectively. Here represents the maximum eigenvalue of an matrix,
represents the maximum singular value of anmatrix. Note that and have the same set of nonzero eigenvalues for two matrices and . Thus, . In this document, we always assume that the eigenvalues of an matrix takes the form . Thus and for any , where is called the spectral radius of a square matrix . And also , which in turn implies for any matrix . For any two matrices and that make well-defined, holds for both Frobenious-norm and spectral norm, and holds. Furthermore, the orthogonal invariance also holds for both Frobenious-norm and spectral norm, i.e., for column-orthonormal matrices and (i.e., and ). For , let represent its orthogonal complement, i.e., which implies and .
The filtration, defined on a measurable probability space, is an increasing sequence of sub-sigma algebras for , meaning that for all . In our context, encodes the set of all the random variables seen thus far (i.e., from to ). In this document, conditioned on refers to conditioned on for brevity. Let and be a stochastic process and a filtration, respectively, on the same probability space. Then is called a martingale (super-martingale) with respect to if for each , is -measurable, , and (). Given a random variable and a constant , the probability (Markov inequality). Let be a martingale or supermartingale such that (i.e., bounded difference) where is a deterministic function of . Then for all and any , the probability (Azuma-Hoeffding inequality) .
For any , it holds that
If and , then .
Let be square matrix, where are fixed and are stochastic zero-mean. Furthermore, suppose that for some fixed , it holds with probability that
For all ,
Let be a matrix with minimal singular value and . Then
For any matrices with orthonormal columns, let . Then
where is the SVD of .
Let and be as defined in Section 3.2 of the main paper. Assume and . Then for any matrix with orthonormal columns, it holds that