# Stochastic Variance Reduced Riemannian Eigensolver

We study the stochastic Riemannian gradient algorithm for matrix eigen-decomposition. The state-of-the-art stochastic Riemannian algorithm requires the learning rate to decay to zero and thus suffers from slow convergence and sub-optimal solutions. In this paper, we address this issue by deploying the variance reduction (VR) technique of stochastic gradient descent (SGD). The technique was originally developed to solve convex problems in the Euclidean space. We generalize it to Riemannian manifolds and realize it to solve the non-convex eigen-decomposition problem. We are the first to propose and analyze the generalization of SVRG to Riemannian manifolds. Specifically, we propose the general variance reduction form, SVRRG, in the framework of the stochastic Riemannian gradient optimization. It's then specialized to the problem with eigensolvers and induces the SVRRG-EIGS algorithm. We provide a novel and elegant theoretical analysis on this algorithm. The theory shows that a fixed learning rate can be used in the Riemannian setting with an exponential global convergence rate guaranteed. The theoretical results make a significant improvement over existing studies, with the effectiveness empirically verified.

## Authors

• 16 publications
• 3 publications
• ### Averaging Stochastic Gradient Descent on Riemannian Manifolds

We consider the minimization of a function defined on a Riemannian manif...
02/26/2018 ∙ by Nilesh Tripuraneni, et al. ∙ 0

• ### Optimal Design of Experiments on Riemannian Manifolds

Traditional optimal design of experiment theory is developed on Euclidea...
11/06/2019 ∙ by Hang Li, et al. ∙ 0

• ### An Online Riemannian PCA for Stochastic Canonical Correlation Analysis

We present an efficient stochastic algorithm (RSG+) for canonical correl...
06/08/2021 ∙ by Zihang Meng, et al. ∙ 0

• ### Variance reduction for Riemannian non-convex optimization with batch size adaptation

Variance reduction techniques are popular in accelerating gradient desce...
07/03/2020 ∙ by Andi Han, et al. ∙ 11

• ### Riemannian stochastic variance reduced gradient on Grassmann manifold

Stochastic variance reduction algorithms have recently become popular fo...
05/24/2016 ∙ by Hiroyuki Kasai, et al. ∙ 0

• ### R-SPIDER: A Fast Riemannian Stochastic Optimization Algorithm with Curvature Independent Rate

We study smooth stochastic optimization problems on Riemannian manifolds...
11/10/2018 ∙ by Jingzhao Zhang, et al. ∙ 0

• ### A Riemannian-Stein Kernel Method

This paper presents a theoretical analysis of numerical integration base...
10/11/2018 ∙ by Alessandro Barp, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Matrix eigen-decomposition is among the core and long-standing topics in numerical computing [29]. It plays fundamental roles in various scientific and engineering computing problems (such as numerical computation [9, 22] and structural analysis [25]

) as well as machine learning tasks (such as kernel approximation

[6], dimensionality reduction [14][20]). Thus far, there hasn’t been many algorithms proposed for this problem. Pioneering ones include the method of power iteration [9] and the (block) Lanczos algorithm [21], while randomized SVD [10]

and online learning of eigenvectors

[8] are recently proposed. The problem can also be expressed as a quadratically constrained quadratic program (QCQP), and thus can be approached by various optimization methods, such as trace penalty minimization [27] and Riemannian optimization algorithms [7, 1, 28]. Most of these algorithms perform the batch learning, i.e., using the entire dataset to perform the update at each step. This could be well addressed by designing appropriate stochastic algorithms. However, the state-of-the-art stochastic algorithm DSRG-EIGS [2] requires the learning rate to repeatedly decay till vanishing in order to guarantee convergence, which results in a slow convergence of sub-linear rate.

We propose a new stochastic Riemannian algorithm that makes a significant breakthrough theoretically. It improves the state-of-the-art sub-linear convergence rate to an exponential convergence one. The algorithm is inspired by the stochastic variance reduced gradient (SVRG) optimization [12], which was originally developed to solve convex problems in the Euclidean space. We propose the general form of variance reduction, called SVRRG, in the framework of the stochastic Riemannian gradient (SRG) optimization [4], such that it is able to enjoy the convergence properties (e.g., almost sure local convergence) of the SRG framework. We then get it specialized to the Riemannian eigensolver (RG-EIGS) problem so that it gives rise to our stochastic variance reduced Riemannian eigensolver, termed as SVRRG-EIGS. Our theoretical analysis shows that SVRRG-EIGS can use a constant learning rate, thus eliminating the need of using the decaying learning rate. Moreover, it not only possesses the global convergence in expectation compared to SRG [4], but also gains an accelerated convergence of exponential rate compared to DSRG-EIGS. To the best of our knowledge, we are the first to propose and analyze the generalization of SVRG to Riemannian manifolds.

The rest of the paper is organized as follows. Section 2 briefly reviews some preliminary knowledge on matrix eigen-decomposition, stochastic Riemannian gradient optimization and stochastic Riemannian eigensolver. Section 3 presents our stochastic variance reduced Riemannian eigensolver algorithm, starting from establishing the general form of variance reduction for the stochastic Riemannian gradient optimization. Theoretical analysis is conducted in Section 4, followed by the empirical study of our algorithm in Section 5. Section 6 discusses related works. Finally, Section 7 concludes the paper.

## 2 Preliminaries and Notations

### 2.1 Matrix Eigen-decomposition

The eigen-decomposition of a symmetric111The given matrix is assumed to be symmetric throughout the paper, i.e., . matrix can be written as , where

(identity matrix), and

is a diagonal matrix. The -th column of

is called the eigenvector corresponding to the eigenvalue

(-th diagonal element of ), i.e., . Assume that , and , and . In practice, matrix eigen-decomposition only aims at the set of top eigenvectors . From the optimization perspective, this can be formulated as the following non-convex QCQP problem:

 maxX∈Rn×k:X⊤X=I(1/2)tr(X⊤AX), (1)

where and represents the trace of a square matrix, i.e., the sum of diagonal elements of a square matrix. It can be easily verified that maximizes the trace at .

### 2.2 Stochastic Riemannian Gradient Optimizaiton

Given a Riemmanian manifold , the tangent space at a point , denoted as , is a Euclidean space that locally linearizes around [17]. One iterate of the Riemannian gradient optimization on takes the form similar to that of the Euclidean case [1]:

 X(t+1)=RX(t)(αt+1ξX(t)), (2)

where

is a tangent vector of

at and represents the search direction at the -th step, is the learning rate (i.e., step size), and represents the retraction at that maps a tangent vector to a point on . Tangent vectors that serve as search directions are generally gradient-related. The gradient of a function on , denoted as , depends on the Riemannian metric, which is a family of smoothly varying inner products on tangent spaces, i.e., , where for any . The Riemannian gradient is the unique tangent vector that satisfies

for any , where represents the directional derivative of in the tangent direction . Setting in (2) leads to the Riemannian gradient (RG) ascent method:

We can also set in (2) and induce the stochastic Riemannian gradient (SRG) ascent method [4]:

 X(t+1)=RX(t)(αt+1G(yt+1,X(t))), (5)

where

is an observation of the random variable

at the -th step that follows some distribution and satisfies , and is the stochastic Riemannian gradient such that . According to [4], the SRG method possesses the almost sure (local) convergence under certain conditions, including and (the latter condition implies that as ).

### 2.3 Stochastic Riemannian Eigensolver

The constraint set in problem (1) constitutes a Stiefel manifold, , which turns (1) into a Riemannian optimization problem:

 maxX∈St(n,k)f(X), (6)

where . Note that is an embedded Riemannian sub-manifold of the Euclidean space [1]. With the metric inherited from the embedding space , i.e., , and using (3), we can get the Riemannian gradient222Due to the symmetry of , the Riemannian gradients under Euclidean metric and canonical metric are the same [28]. However, since the orthogonal projector used in the sequel requires the metrics for the embedded Riemannian sub-manifold and the embedding space to be the same, we choose the Euclidean metric here. as:

The orthogonal projection onto under this metric is given by:

 (7)

for any , where . In this paper, we use the retraction [1]

 RX(ξ)=(X+ξ)(I+ξ⊤ξ)−1/2 (8)

for any . The deployment of (4) and (5) here will then generate the Riemannian eigensolver (denoted as RG-EIGS) and the stochastic Riemannian eigensolver (denoted as SRG-EIGS), respectively. To the best of our knowledge, there is no existing stochastic Riemannian eigensolver that uses this retraction. The closest counterpart is the DSRG-EIGS that uses the Cayley transformation based retraction. However, based on the work of DSRG-EIGS, it can be shown that SRG-EIGS possesses the same theoretical properties as DSRG-EIGS, e.g., sub-linear convergence to global solutions.

## 3 Svrrg-Eigs

In this section, we propose the stochastic variance reduced Riemannian gradient (SVRRG) and specialize it to the eigensolver problem.

### 3.1 Svrrg

Recall that the stochastic variance reduced gradient (SVRG) [13] is built on the vanilla stochastic gradient and achieves variance reduction through constructing control variates [26]. Control variates are stochastic and zero-mean, serving to augment and correct stochastic gradients towards the true gradients. Following [13], SVRG is encoded as

 gt(ξt,w(t−1))=∇ψit(w(t−1))−(∇ψit(~w)−∇P(~w)), (9)

where

is a version of the estimated

that is kept as a snapshot after every SGD steps, and is the full gradient at .

Our task here is to develop the Riemannian counterpart SVRRG of SVRG. Denote the SVRRG as . A naive adaptation of (9) to a Riemannian manifold reads

where and . However, this adaptation is not sound theoretically: the stochastic Riemannian gradient and the control variate reside in two different tangent spaces, and thus making their difference not well-defined. We rectify this problem by the parallel transport [1], which moves tangent vectors from one point to another (accordingly from one tangent space to another) along geodesics in parallel. More specifically, we parallel transport the control variate from to . For computational efficiency, the first-order approximation, called vector transport [1], is used.

Vector transport of a tangent vector from point to point , denoted as , is a mapping from tangent space to tangent space . When is an embedded Riemannian sub-manifold of a Euclidean space, vector transport can be simply defined as [1]:

 T~X→X(t)(ξ~X)=PX(t)(ξ~X),

where represents the orthogonal projector onto for the embedding Euclidean space. With the vector transport, we obtain the well-defined SVRRG in :

We then arrive at our SVRRG method:

 X(t+1)=RX(t)(αt+1~G(yt+1,X(t))), (10)

by setting in (2). Note that the SVRRG method (10) is naturally subsumed into the SRG method (5), and thus enjoys all the properties of SRG.

### 3.2 Svrrg-Eigs

With the SVRRG described above, we can now proceed to develop an effective eigensolver by specializing (6). This new eigensolver is named SVRRG-EIGS. The update can be written as

 X(t+1)=(X(t)+αt+1~G(yt+1,X(t)))(I+α2t+1~G⊤(yt+1,X(t))~G(yt+1,X(t)))−1/2, (11)

which can be decomposed into two substeps: and . Intuitively, the first substep moves along the direction from the current point to the intermediate point in the tangent space . The second substep then gets the intermediate point retracted back onto the Stiefel manifold to reach the next point .

Let’s delve into the first substep. Except for the vector transport inside , it looks much like an SVRG step since it works in the Euclidean tangent space. Assume that we have , is a random variable taking values in , , and stochastic gradient takes the form (i.e., sampling over data ). We can get the control variate as

By using the orthogonal projector in (7), the transported control variate can be written as

 T~X→X(t)(G(yt+1,~X)−Gradf(~X)) = (I−X(t)X(t)⊤)(I−~X~X⊤)(At+1−A)~X+X(t)skew(X(t)⊤(I−~X~X⊤)(At+1−A)~X) = (I−X(t)X(t)⊤)(At+1−A)~X−(I−X(t)X(t)⊤)~X~X⊤(At+1−A)~X+ X(t)skew(X(t)⊤(I−~X~X⊤)(At+1−A)~X).

Accordingly, we have the SVRRG expressed as

where is a stochastic zero-mean term conditioned on . Note that the factor in might be theoretically harsh333It works well empirically.

, because an eigenspace could have distinct representations which are the same up to a

orthogonal matrix, and thus it is only expected that and have the same column space at convergence. The ideal replacement would be where represents the column space. Numerically it could be achieved by replacing with where and is the SVD of [24].

The first substep can now be rewritten as

As a comparison, we can similarly decompose the update steps (4) and (5) of RG-EIGS and SRG-EIGS into two substeps and then have:

Compared to that in RG-EIGS, each step in both SRG-EIGS and SVRRG-EIGS amounts to taking one Riemannian gradient step in the tangent space, adding a stochastic zero-mean term in the tangent space, and then retracting back to the manifold. However, the stochastic zero-mean term in SRG-EIGS has a constant variance. Therefore it needs the learning rate to decay to zero to reduce the variance and to ensure the convergence, and consequently compromises on the convergence rate. In contrast, SVRRG-EIGS keeps boosting the variance reduction of the stochastic zero-mean term during iterations. The variance of is not constant but dominated by three quantities , and . These quantities repeatedly decay till vanishing in expectation, as and are expected to get closer and closer to each other gradually. This induces a decaying variance without the learning rate involved. Therefore, SVRRG-EIGS is able to use a fixed learning rate and achieve a much faster convergence rate.

## 4 Theoretical Analysis

We give the main theoretical results in this section. The proofs are provided in the supplementary material.

###### Theorem 4.1.

Consider a symmetric matrix which can be written as such that . The eigen-decomposition of is as defined in Section 2.1. And the eigen-gap . Then the top eigenvectors can be approximated to arbitrary accuracy and with any confidence level by running epochs of our SVRRG-EIGS algorithm, in the sense that the potential function

with probability at least

, provided that the following conditions about initial iterate , fixed learning rate and epoch length , are simultaneously satisfied:

 ~b0=k−∥V⊤~X(0)∥2F<12,α∈(0,min{c0τ,c18c2τφ2}), m≥3log(2/φ)c1ατ,c3kmα2+c5k√mα2log(2/φ)≤12−~b0,

where the constants are positive and defined as

 c0 = min{132√3kτ2,1c1τ2,−(118406+144k2)+√(118406+144k2)2+18τ(1+24k2)24τ(1+24k2)}, c1 = 2τ(18τ−2α(1+2α)(1+24k2)−1184003α),c2=96(k2(1+2α)+823), c3 = 4(1+2α)+192(k2(1+2α)+74009),c4=201−5c0τ+c0c3τ,c5=√2c4.

Note that we have no loss of generality from assuming that in the theorem. In fact, if with (which could be estimated by, e.g., Gershgorin circle theorem), we could replace with to get and arrive at the same eigen-space. Another way of addressing this generality is to adopt the idea of [24], that is, replacing the learning rate with and the eigen-gap with , with some of the constants in the theorem re-derived. The condition on the initial iterate, i.e., , is theoretically non-trivial. However, empirically this condition can be well satisfied by running other stochastic algorithms (e.g., SRG-EIGS or DSRG-EIGS) or a few steps of deterministic iterative algorithms (e.g., RG-EIGS), because they are good at finding sub-optimal solutions. In our experiments, we use SRG-EIGS for this purpose, which makes the theorem amount to a convergence analysis at a later stage of the hybrid algorithm (e.g., starting from instead of ). The convergence rate of our algorithm can be roughly identified by the iteration number which establishes an exponential global convergence rate. Compared to the sub-linear rate of DSRG-EIGS by [2], it achieves a significant improvement since the complexity of a single iteration in the two algorithms only differs by constants. In summary, initialized by a low-precision eigensolver, our SVRG-EIGS algorithm would obtain a high-precision solution in a limited number of epochs (data passes), which is theoretically guaranteed by Theorem 4.1.

We provide an elegant proof of Theorem 4.1 in Appendix, though it is a bit involved. For ease of exposition and understanding, we decompose this course into three steps in a way similar to [24], including the analysis on one iteration, one epoch and one run of the algorithm. Among them, the first step (i.e., one iteration analysis) lies at the core of the main proof, where the techniques we use are dramatically different from those in [3, 24] due to our new context of Rimannian manifolds, or more precisely, Stiefel manifolds. This inherently different context requires new techniques, which in turn yield an improved exponential global convergence and accordingly bring more improvements over the convergence of sub-linear rate [2].

## 5 Experiments

In this section, we empirically verify the exponential convergence rate of our SVRRG-EIGS algorithm and demonstrate its capability of finding solutions of high precision when combined with other algorithms of low precision. Specifically, we use SRG-EIGS to generate a low-precision solution for initializing SVRRG-EIGS, and do the comparison with both RG-EIGS and SRG-EIGS. Among various implementations of RG-EIGS with different choices of metric and retraction in (2), we choose the one with canonical metric and Cayley transformation based retraction [28] since its code is publically available. This version of RG-EIGS uses the non-monotone line search with the well-known Barzilai-Borwein step size, which significantly reduces the iteration number, and performs well in practice. Both RG-EIGS and SRG-EIGS are fed with the same random initial value of

, where each entry is sampled from the standard normal distribution

and then all entries as a whole are orthogonalized. SRG-EIGS uses the decaying learning rate where will be tuned.

We verify the properties of our algorithm on a real symmetric matrix, , of size, with nonzero entries. We partition into column blocks with block size equal to so that we can write with and each having only one column block of and all others zero. We set . For SVRRG-EIGS, we are able to use a fixed learning rate

( represents the matrix -norm), similar to that in [24]. We set and epoch length , i.e., each epoch takes passes over (including one pass for computing the full gradient). Accordingly, the epoch length of SRG-EIGS is set to . In addition, we set .

The performance of different algorithms is evaluated using three quality measures: feasibility , relative error function , and normalized potential function . The ground truths in these measures, including both and that is set to , are obtained using Matlab’s EIGS function for benchmarking. For each measure, lower values indicate higher quality.

Given a solution of low precision666This low precision could be problem dependent. at , our SVRRG-EIGS targets a double precision, that is, or . Each algorithm terminates when the precision requirement is met or the maximum number of epoches (set as ) is reached.

We report the convergence curves in terms of each measure, on which empirical convergence rates of the algorithms can be observed. Figure 1 reports the performance of different algorithms. In terms of feasibility, both SRG-EIGS and SVRRG-EIGS perform well, while RG-EIGS produces much poorer results. This is because the Cayley transformation based retraction used therein relies heavily on the Sherman-Morrison-Woodbury formula, which suffers from the numerical instability. From Figures 1(b) and 1(c), we observe similar convergence trends for each algorithm under the two different measures. All three algorithms improve their solutions with more iteration. There are several exceptions in RG-EIGS. This is due to the non-monotone step size used in its implementation. We also observe that SRG-EIGS presents an exponential convergence rate at an early stage thanks to a relatively large learning rate. However, it subsequently steps into a long period of sub-exponential convergence, which leads to small progress towards the optimal solution. In contrast, our SVRRG-EIGS inherits the initial momentum from SRG-EIGS and keeps the exponential convergence rate throughout the entire process. This enables it to approach the optimal solution at a fast speed. RG-EIGS has a different trend. It converges sub-exponentially at the beginning and performs the worst. Though it converges fast at a later stage, it still needs more passes over data than SVRRG-EIGS in order to achieve a high precision.

## 6 Related Work

Existing methods on eigensolvers include the power method [9], the (block) Lanczos algorithms [5], Randomized SVD [10], Riemannian methods [25, 1], and so on. All these methods performs the batch learning, while our focus in this paper is on stochastic algorithms. From this perspective, few existing works include online learning of eigenvectors [8] which aims at the leading eigenvector, i.e., , and doubly stochastic Riemannian method (DSRG-EIGS) [2] where the learning rate has to decay to zero. [8] provides the regret analysis without empirical verification for their method, while DSRG-EIGS belongs to one of implementations of SRG-EIGS in this paper where the double stochasticity comes from sampling over both data and coordinates of Riemmanian gradients. On the other hand, since the work of [13], variance reduction (SVRG) has become an appealing technique to stochastic optimization. There are quite some variants developed from different perspectives, such as practical SVRG [11], second-order SVRG [15], distributed or asynchronous SVRG [16, 12], and non-convex SVRG [24, 23]. Our SVRRG belongs to non-convex SVRG, but is addressed from the Riemannian optimization perspective. The core techniques we use are dramatically different from existing ones due to our new context.

## 7 Conclusion

In this paper, we proposed the generalization of SVRG to Riemannian manifolds, and established the general framework of SVRG in this setting, SVRRG, which requires the key ingredient, vector transport, to make itself well-defined. It is then deployed to the eigensolver problem and induces the SVRRG-EIGS algorithm. We analyzed its theoretical properties in detail. As suggested by our theoretical results, the proposed algorithm is guaranteed to find high-precision solutions at an exponential convergence rate. The theoretical implications are verified on a real dataset. For future work, we will explore the possibility of addressing the limitations of SVRRG-EIGS, e.g., dependence on eigen-gap and non-trivial initialization. We may also conduct more empirical investigations on the performance of SVRRG-EIGS.

## Appendix A Useful Lemmas

In this section, some definition, basics, and a group of useful lemmas are provided. All the matrices are assumed to be real.

### a.1 Definitions and Basics

#### a.1.1 Matrix facts: symmetry, positive semi-definiteness, trace, norm and orthogonality

The matrix () represents that is symmetric and positive semidefinite (definite), and if then as well. The trace of a square matrix, , is the sum of diagonal entries of . A useful fact about trace is the circular property, e.g., for matrices . and represents the Frobenious-norm and spectral norm (i.e., matrix -norm) of matrix , respectively. Here represents the maximum eigenvalue of an matrix,

represents the maximum singular value of an

matrix. Note that and have the same set of nonzero eigenvalues for two matrices and . Thus, . In this document, we always assume that the eigenvalues of an matrix takes the form . Thus and for any , where is called the spectral radius of a square matrix . And also , which in turn implies for any matrix . For any two matrices and that make well-defined, holds for both Frobenious-norm and spectral norm, and holds. Furthermore, the orthogonal invariance also holds for both Frobenious-norm and spectral norm, i.e., for column-orthonormal matrices and (i.e., and ). For , let represent its orthogonal complement, i.e., which implies and .

#### a.1.2 Martingale

The filtration, defined on a measurable probability space, is an increasing sequence of sub-sigma algebras for , meaning that for all . In our context, encodes the set of all the random variables seen thus far (i.e., from to ). In this document, conditioned on refers to conditioned on for brevity. Let and be a stochastic process and a filtration, respectively, on the same probability space. Then is called a martingale (super-martingale) with respect to if for each , is -measurable, , and (). Given a random variable and a constant , the probability (Markov inequality). Let be a martingale or supermartingale such that (i.e., bounded difference) where is a deterministic function of . Then for all and any , the probability (Azuma-Hoeffding inequality) [19].

### a.2 Lemmas

The proofs of Lemma A.1-A.5 can be found in [24].

###### Lemma A.1.

For any , it holds that

 tr(BC)≥tr(B(C−D))andtr(BC)≥tr((B−D)C).

If and , then .

###### Lemma A.3.

Let be square matrix, where are fixed and are stochastic zero-mean. Furthermore, suppose that for some fixed , it holds with probability that

• For all ,

Then

 E[tr((B1+Z1)(B2+Z2)−1)]≥tr(B1B−12)−β2(1+γ/δ)δ2.
###### Lemma A.4.

Let be a matrix with minimal singular value and . Then

 1−∥B⊤B∥2F∥B∥2F≥σ2k(k−∥B∥2F).
###### Lemma A.5.

For any matrices with orthonormal columns, let . Then

 B⋆=Q2Q⊤1,∥C−DB⋆∥2F≤∥C−DB∥2Fand∥C−DB⋆∥2F≤2(k−∥C⊤D∥2F),

where is the SVD of .

###### Lemma A.6.

Let and be as defined in Section 3.2 of the main paper. Assume and . Then for any matrix with orthonormal columns, it holds that

 ∣∣∥V⊤X(t+1)∥2F−∥V⊤X(t)∥2F∣∣≤20kα1−5α.
###### Proof.

Note that since and thus [1]. Based on the proof of Lemma 9 in [24], it suffices for us to show that and . In fact, from Section 3.2 of the main paper, we have

 N = ~G(yt+1,X(t)) = (I−X(t)X(t)⊤)At+1X(t)− (I−X(t)X