 # R-SPIDER: A Fast Riemannian Stochastic Optimization Algorithm with Curvature Independent Rate

We study smooth stochastic optimization problems on Riemannian manifolds. Via adapting the recently proposed SPIDER algorithm fang2018spider (a variance reduced stochastic method) to Riemannian manifold, we can achieve faster rate than known algorithms in both the finite sum and stochastic settings. Unlike previous works, by not resorting to bounding iterate distances, our analysis yields curvature independent convergence rates for both the nonconvex and strongly convex cases.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We analyze fast stochastic algorithms for the following optimization problem:

 minx∈M f(x) ≜ Eξ[f(x;ξ)], (1)

where is a Riemannian manifold equipped with the metric , and

is a random variable. We assume that for any

, the function is geodesically -smooth (see Section 2

). This class of functions includes as special cases important problems such as principal component analysis (PCA), independent component analysis (ICA), dictionary learning, mixture modeling, among others. Moreover, the finite-sum problem (

) is a special case in which finite number of component functions are chosen uniformly at random (e.g., in Empirical Risk Minimization).

When solving problems with parameters constrained to lie on a manifold, a naive approach is to alternate between optimizing the cost in a suitable ambient Euclidean space and “projecting” onto the manifold. For example, two well-known methods to compute the leading eigenvector of symmetric matrices, power iteration and Oja’s algorithm

(Oja, 1992), are in essence projected gradient and projected stochastic gradient algorithms. For certain manifolds (e.g., positive definite matrices), projections can be quite expensive to compute and possibly the Euclidean approach may have poor numerical conditioning (Yuan et al., 2017).

An effective alternative is to use Riemannian optimization, which directly operates on the manifold in question. This mode of operation allows Riemannian optimization to view the constrained optimization problem (1) as an unconstrained problem on a manifold, and thus, to be “projection-free.” More important is the conceptual viewpoint: by casting the problem in a Riemannian framework, one can discover insights into problem geometry that can translate into not only more precise mathematical analysis but also more efficient optimization algorithms.

The Euclidean version of (1) where and

is the Euclidean inner-product has been the subject of intense algorithmic development in machine learning and optimization, starting with the classical work of

Robbins and Monro (1951). However, both batch and stochastic gradient methods suffer from high computation load. For solving finite sum problems with components, the full-gradient method requires derivatives at each step; the stochastic method requires only one derivative, but at the expense of slower convergence to an

-accurate solution. These issues have motivated much of the progress on faster stochastic optimization in vector spaces by using variance reduction

(Schmidt et al., 2013; Johnson and Zhang, 2013; Defazio et al., 2014; Konečnỳ and Richtárik, 2013). Along with many recent works (see related work), these algorithms achieve faster convergence than the original gradient descent algorithms in multiple settings.

Riemannian counterparts of batch and stochastic optimization algorithms have witnessed growing interest recently. Zhang and Sra (2016) present the first global complexity analysis of batch and stochastic gradient methods for geodesically convex functions. Later work (Zhang et al., 2016; Kasai et al., 2016; Sato et al., 2017) improves the convergence rate for finite-sum problems by using variance reduction techniques. In this paper, we develop this line of work further, and improve rates based on a more careful control of variance analyzed in (Fang et al., 2018; Nguyen et al., 2017). Another important aspect of our work is that by pursuing a slightly different analysis, we are able to remove the assumption that all iterates remain in a compact subset of the Riemannian manifold. Such an assumption was crucial to most previous Riemannian methods, but was not always fully justified.

#### Contributions.

We summarize the key contributions of this paper below.

• [leftmargin=*]

• We introduce R-Spider, a Riemannian variance reduced stochastic gradient method based on the recent SPIDER algorithm (Fang et al., 2018). We analyze R-Spider

for optimizing geodesically smooth stochastic nonconvex functions. To our knowledge, we obtain the first rate faster than Riemannian stochastic gradient descent for general nonconvex stochastic Riemannian optimization.

• We specialize R-Spider to (Riemannian) nonconvex finite-sum problems. Our rate improves the best known rates and match the lower bound as in the Euclidean case.

• We propose two variations of R-Spider for geodesically strongly convex problems and for Riemannian gradient dominated costs. For these settings, we achieve the best known rates in terms of number of samples and the condition number .

• Importantly, we provide convergence guarantees that are independent of the Riemannian manifold’s diameter and its sectional curvature. This contribution is important in two main aspects. First, the best known theoretical upper bounds are improved. Second, the algorithm no longer assumes bounded diameter of the Riemannian manifold, which helps lift the assumption crucial for previous work that required all the iterates generated by the algorithm to remain in a compact set.

We briefly summarize the rates obtained in Table 1.

### 1.1 Related Work

#### Variance reduction in stochastic optimization.

Variance reduction techniques, such as control variates, are widely used in Monte Carlo simulations (Rubinstein and Kroese, 2011). In linear spaces, variance reduced methods for solving finite-sum problems have recently witnessed a huge surge of interest (e.g. Schmidt et al., 2013; Johnson and Zhang, 2013; Defazio et al., 2014; Bach and Moulines, 2013; Konečnỳ and Richtárik, 2013; Xiao and Zhang, 2014; Gong and Ye, 2014). They have been shown to accelerate finite sum optimization for strongly convex objectives and convex objectives. Later work by Lin et al. (2015); Allen-Zhu (2017a) further accelerates the rates in convex problems using techniques similar to Nesterov’s acceleration method (Nesterov, 2013). For nonconvex problems, Reddi et al. (2016); Allen-Zhu (2017b); Lei et al. (2017); Fang et al. (2018); Nguyen et al. (2017) also achieved faster rate than the vanilla (stochastic) gradient descent method in both finite sum settings and stochastic settings. Our analysis is inspired mainly by Fang et al. (2018); Zhang et al. (2016). The analysis can also be applied to (Wang et al., 2018) and achieve matching rate assuming access to proximal oracle on Riemannian manifold.

#### Riemannian optimization.

Earlier references can be found in Udriste (1994); Absil et al. (2009), where analysis is limited to asymptotic convergence (except (Udriste, 1994, Theorem 4.2)). Stochastic Riemannian optimization has been previously considered in Bonnabel (2013); Liu et al. (2004), though with only asymptotic convergence analysis, and without any rates. Many applications of Riemannian optimization are known, including matrix factorization on fixed-rank manifold (Vandereycken, 2013; Tan et al., 2014), dictionary learning (Cherian and Sra, 2015; Sun et al., 2015), optimization under orthogonality constraints (Edelman et al., 1998; Moakher, 2002)

, covariance estimation

(Wiesel, 2012), learning elliptical distributions (Zhang et al., 2013; Sra and Hosseini, 2013), Poincaré embeddings (Nickel and Kiela, 2017)(Hosseini and Sra, 2015). Zhang and Sra (2016) provide the first global complexity analysis for first-order Riemannian algorithms, but their analysis is restricted to geodesically convex problems with full or stochastic gradients. Boumal et al. (2016) analyzed iteration complexity of Riemannian trust-region methods, whereas Bento et al. (2017) studied non-asymptotic convergence of Riemannian gradient, subgradient and proximal point methods. Tripuraneni et al. (2018); Zhang and Sra (2018) analyzed aspects other than variance reduction to accelerate the convergence of first order optimization methods on Riemannian manifolds. Zhang et al. (2016); Sato et al. (2017) analyzed variance reduction techniques on Riemannian manifolds, and their rate has remain best-known up to our knowledge. In this work, we improve upon their results. Zhou et al. (2018) worked on the same problem in parallel and achieved the same rate. The difference between this work and Zhou et al. (2018) is that our algorithm uses a constant step size and adaptive sample size. This enables us to bound instead of . Hence our result is slightly stronger and further simplifies later proof for the convergence of gradient dominated functions.

## 2 Preliminaries

Before formally discussing Riemannian optimization, let us recall some foundational concepts of Riemannian geometry. For a thorough review one can refer to any classic text, e.g.,(Petersen, 2006).

A Riemannian manifold is a real smooth manifold equipped with a Riemannain metric . The metric induces an inner product structure in each tangent space associated with every . We denote the inner product of as ; and the norm of is defined as . The angle between is defined as . A geodesic is a constant speed curve that is locally distance minimizing. An exponential map maps in to on , such that there is a geodesic with and . If between any two points in there is a unique geodesic, the exponential map has an inverse and the geodesic is the unique shortest path with the geodesic distance between .

Parallel transport maps a vector to , while preserving norm, and roughly speaking, “direction,” analogous to translation in . A tangent vector of a geodesic remains tangent if parallel transported along . Parallel transport preserves inner products.

#### Function Classes.

We now define some key terms. A set is called geodesically convex if for any , there is a geodesic with and for . Throughout the paper, we assume that the function in (1) is defined on a Riemannian manifold .

In the following we do not explicitly write Riemannian metric or the index of tangent space to simplify notation, as they should be obvious from the context: inner product of is defined as ; norm of is defined as .

Based on the above notations, we define the following properties of the function in (1).

###### Definition 1 (Strong convexity).

A function is said to be geodesically -strongly convex if for any ,

 f(y)≥f(x)+⟨gx,Exp−1x(y)⟩x+μ2∥Exp−1x(y)∥2.
###### Definition 2 (Smoothness).

A differentiable function is said to be geodesically -smooth if its gradient is -Lipschitz, i.e. for any ,

 ∥gx−Γxygy∥≤L∥Exp−1x(y)∥,

where is the parallel transport from to .

Observe that compared to the Euclidean setup, the above definition requires a parallel transport operation to “transport” to . It can be proved that if is -smooth, then for any ,

 f(y)≤f(x)+⟨gx,Exp−1x(y)⟩x+L2∥Exp−1x(y)∥2. (2)
###### Definition 3 (PL inequality).

is -gradient dominated if is a global minimizer of and for every

 f(x)−f(x∗)≤τ∥∇f(x)∥2. (3)

As in the Euclidean case, gradient dominated is implied by strongly convex.

An Incremental First-order Oracle (IFO) (Agarwal and Bottou, 2015) in (1) takes in a point , and generates a random sample . The oracle then returns a pair . In finite-sum setting, takes values in and each random sample corresponds to one of component functions. We measure non-asymptotic complexity in terms of IFO calls.

## 3 Riemannian SPIDER

In this section we introduce the R-SPIDER algorithm. In particular, we propose one variant for nonconvex problems, and two for gradient-dominated problems. Each variation aims to optimize a particular dependency on function parameters. Our proposed algorithm differs from the Euclidean SPIDER (Fang et al., 2018) in two key aspects: the variance reduction step uses parallel transport to combine gradients from different tangent spaces; and the exponential map is used (instead of the update ).

We would like to point out that if retractions instead of exponential maps are used in the proposed algorithms, our analysis will still hold if such that , , where .

### 3.1 General smooth nonconvex functions

Our proposed algorithm for solving nonconvex Riemannian optimization problems is shown in Algorithm 1. denotes the unbiased gradient estimator obtained by averaging samples. We first analyze the global complexity for solving nonconvex stochastic Riemannian optimization problems. In particular, we make the following assumptions

###### Assumption 1 (Smoothness).

For any fixed is geodesically smooth in .

###### Assumption 2 (Bounded objective).

Function is bounded below. Define where

###### Assumption 3 (Bounded variance).

Under the above assumptions, we make the following choice of parameters for running Algorithm  1.

 S1=2σ2/ϵ2,ηk=12L,q=1/ϵ,T=4ML/ϵ2. (4)

We now state the following theorem for optimizing stochastic nonconvex functions.

###### Theorem 1 (Stochastic objective).

Under Assumptions 1, 2, 3 and the parameter choice in (4), Algorithm 1 terminates in iterations. The output satisfies

 E[∥∇f(x)∥2]≤10ϵ2.

Furthermore, the algorithm makes less than IFO calls in expectation.

###### Proof.

See Appendix A. The gist of our proof is to show that with sufficiently small variance of the gradient estimate, the algorithm either substantially decreases the objective function every iterations, or terminates because the gradient norm is small. ∎

Then we study the nonconvex problem under the finite-sum setting. In this setting, we assume . Hence we can write

 f(x)=1nn∑i=1fi(x). (5)

We further make the following choice of parameters:

 S1=n,ηk=12L,q=⌈n1/2⌉,T=4ML/ϵ2 (6)

Then we have the following gurrantee.

###### Theorem 2 (Finite-sum objective).

Under Assumptions 1, 2 and the parameter choice in (6), Algorithm 1 terminates in iterations. The output satisfies

 E[∥∇f(x)∥2]≤10ϵ2.

Furthermore, the algorithm makes less than IFO calls in expectation.

###### Proof.

See Appendix B. The proof is almost the same as the proof for Theorem 1. ∎

The proof of the two theorems in this section follows by carefully applying the variance reduction technique proposed in Fang et al. (2018) onto the Riemannian manifold using the tools in Zhang et al. (2016). Unlike the SVRG-like algorithms in Zhang et al. (2016); Kasai et al. (2016), we can avoid analyzing the term , where is the snapshot point. Consequently, we do not need to resort to the trigonometric distance bound (see Zhang and Sra (2016)) and the convergence rate doesn’t depend on the sectional curvature bound.

Further, the convergence rates in both cases match their Euclidean counterparts. Remarkably, the rate under the finite-sum setting meets the lower bound as proved by Fang et al. (2018).

In this section, we study the finite-sum problems with the following assumption.

###### Assumption 4.

We denote as the condition number. To solve such problems, we propose two algorithms. The first algorithm is shown in Algorithm 2. It follows the same idea as in Zhang et al. (2016); Reddi et al. (2016). We have the following theorem on its convergence rate.

###### Theorem 3.

Under Assumptions 1, 2, 4 and the parameter choice , after iterations, Algorithm 2 returns a solution that satisfies

 E[f(xK)−f(x∗)]≤2−KM0.

Further, we need number of IFO calls to achieve accuracy in expectation.

The proof of Theorem 3 follows from Theorem 2 and the gradient dominated property, as shown in Appendix C.

The second algorithm, shown in Algorithm 3, aims to achieve better complexity dependency on . With the following choice of parameters

 η=12L,q=⌈4Lτlog(4)⌉,M0≥M=f(x0)−f(x∗), (7)

we can make the following statement.

###### Theorem 4.

Under Assumptions 1, 2, 4 and the parameter choice in (7), after iterations, Algorithm 3 returns a solution that satisfies

 E[f(xT)−f(x∗)]≤2−KM0.

Further, the total expected number of IFO calls is . In other word, to achieve accuracy, we need number of IFO calls in expectation.

###### Proof.

See Appendix D. The algorithm adaptively choose sample sizes based on the distance of the last update. The expected number of samples queried can be bounded by total sum of squared distances, which can further be bounded by the change in the objective value. ∎

In summary, Algorithm 2 achieves IFO complexity , while Algorithm 3 achieves sample complexity . It is unclear to us whether there exists an algorithm that performs uniformly better than both of the proposed algorithms. Further, we wish to point out that, as strong convexity implies gradient dominance, the convergence rates for the above algorithms also apply to strongly g-convex functions.

## 4 Discussion

We introduce Riemannian SPIDER algorithms, a fast variance reduced stochastic gradient algorithm for Riemannian optimization. We analyzed the convergence rates of these algorithms for general smooth geodesically nonconvex functions under both finite-sum and stochastic settings, as well as for gradient dominated functions under the finite-sum setting. We showed that these algorithms improved the best known IFO complexity. We also removed the iteration complexity dependency on the curvature of the manifold.

There are a few open problems. First, the original SPIDER algorithm in Fang et al. (2018) and Algorithm  1 require very small stepsize. In practice, this usually results in very slow convergence rate. Even though the SPIDER-boost algorithm (Wang et al., 2018) and Algorithm 3 utilizes a constant large stepsize, the former one requires random termination of the algorithm, while the latter one requires very large sample size in each iteration. Therefore, none of these algorithms tend to perform well in practice if the implementation follows the theory exactly. Designing and testing practical algorithms with nice theoretical guarantees is left as future work.

Further, we approached the gradient-dominated functions with two different algorithms and got two different convergence rates. We suspect that it is not possible to achieve the best of both worlds at the same time. Proving such a lower bound or finding an algorithm that uniformly dominates both algorithms are both interesting research topics.

## Appendix A Proof of Theorem 1

The proof of Theorem 1, Theorem 2 and Theorem 4 all follows from three steps: bound the variance of the gradient estimator; prove that function value decrease in expectation per iteration; bound the number of iterations with Assumption 2.

First we bound the variance of the estimator at each step.

###### Lemma 1.

Under Assumptions1, 2, 3 and parameter setting in (4), , let . Then the iterates of Algorithm 1 satisfy

 E[∥vk−∇f(xk)∥2|Fk0]≤ϵ2
###### Proof.

Let be the sigma field generated by the random variable . Then forms a filtration. We can write the following equations

 E[∥vk−∇f(xk)∥2|Fk] (8) =E[∥Γxkxk−1[vk−1−∇f(xk−1)]∥2|Fk] +E[∥∇fS2(xk)−∇f(xk)+Γxkxk−1[∇f(xk−1)−∇fS2(xk−1)]∥2|Fk] +2E[⟨Γxkxk−1[vk−1−∇f(xk−1)],∇fS2(xk)−∇f(xk)+Γxkxk−1[∇f(xk−1)−∇fS2(xk−1)]⟩|Fk] (9) =E[∥vk−1−∇f(xk−1)∥2|Fk]+ E[∥∇fS2(xk)−∇f(xk)+Γxkxk−1[∇f(xk−1)−∇fS2(xk−1)]∥2|Fk]. (10)

The first equality follows by the identities

 vk=∇fS2(xk)−Γxkxk−1[∇fS2(xk−1)−vk−1], ∇f(xk)=∇f(xk−1)−∇f(xk−1)+∇f(xk).

The second equality follows by the fact that and

are unbiased estimators. Denote

where is a sampled function from the distribution of as defined in (1). Note that . Hence we have

 E[∥∇fS2(xk)−∇f(xk)+Γxkxk−1[∇f(xk−1)−∇fS2(xk−1)]∥2|Fk]=E[∥1S2S2∑i=1zi∥2|Fk] =1S2E[∥zi∥2|Fk] =1S2E[∥∇f(xk;ξ)−∇f(xk)−Γxkxk−1(∇f(xk−1;ξ)−∇f(xk−1)))∥2|Fk].

Substitue in (8) and we get that

 E[∥vk−∇f(xk)∥2|Fk]≤ E[∥vk−1−∇f(xk−1)∥2|Fk] + 1S2E[∥∇f(xk;ξ)−Γxkxk−1[∇f(xk−1;ξ)]∥2|Fk] ≤ E[∥vk−1−∇f(xk−1)∥2|Fk]+L2E[∥Exp−1xk−1(xk)∥2|Fk]/S2 ≤ E[∥vk−1−∇f(xk−1)∥2|Fk]+L2η2k−1∥vk−1∥2/S2, ≤ E[∥vk−1−∇f(xk−1)∥2|Fk]+ϵ22q

where the first inequality follows by . The last inequality follows by the value of . Apply the bound recursively and denote , we get

 E[∥vk−∇f(xk)∥2|Fk0]≤E[∥vk0−∇f(xk0)∥2|Fk0]+qϵ22q≤ϵ2.

Second, we show that the function value decreases.

###### Lemma 2.

Under Assumptions1, 2, 3 and parameter setting in (4), , let . Then

 E[f(xk+1)−f(xk)|Fk0]≤E[−∥vk∥28L+14L∥∇f(xk)−vk∥2|Fk0]
###### Proof.

By geodesically smoothness and (2), we have

 f(xk+1)−f(xk)≤⟨∇f(xk),Exp−1x(xk+1)⟩+L2∥Exp−1xk(xk+1)∥2 ≤−ηk⟨∇f(xk),vk⟩+L2η2k∥vk∥2 =(−ηk+Lη2k2)∥vk∥2−ηk⟨∇f(xk)−vk,vk⟩ ≤(−ηk/2+Lη2k2)∥vk∥2+ηk2∥∇f(xk)−vk∥2.

The second inequality follows from the update rule of in Algorithm 1. The last inequality follows from . After taking expectation, we get

The inequality follows by .

Finally we are ready to prove the theorem.

###### Proof of Theorem 1.

First, we rearrange and do a telescopic sum of the inequality in Lemma 4.

 T−1∑k=0E[∥vk∥28L]≤E[f(x0)−f(xT)+T−1∑k=014L∥∇f(xk)−vk∥2] (11)

Notice that

 1TT−1∑k=0E[∥∇f(xk)∥2]≤1TT−1∑k=0E[2∥vk∥2+2∥vk−∇f(xk)∥2]≤16LM/T+6ϵ2. (12)

Substitute in and we proved the theorem. The expected number of IFO calls can be computed as

 TS1/q+E[T−1∑k=0S2,k]≤8MLσ2/ϵ3+E[T−1∑k=0qL2∥Exp−1xk−1(xk)∥22ϵ2] (13) ≤8MLσ2/ϵ3+q8ϵ2(8ML+2Tϵ2)≤8ML(σ2+3)/ϵ3 (14)

## Appendix B Proof of Theorem 2

The proof techniques are exactly the same as those in Section A, with the only changes being that and we use (6) rather than (4) as the parameters. We state two corresponding lemmas and leave out the details to avoid repetition.

###### Lemma 3.

Under Assumptions 1, 2 and the parameter choice in (6), , let . Then the iterates of Algorithm 1 satisfy

 E[∥vk−∇f(xk)∥2|Fk0]≤ϵ2
###### Lemma 4.

Under Assumptions 1, 2 and the parameter choice in (6), , let . Then

 E[f(xk+1)−f(xk)|Fk0]≤E[−∥vk∥28L+14L∥∇f(xk)−vk∥2|Fk0]
###### Proof of Theorem 2.

The proof follows exactly the same arguments as the proof of Theorem 1. ∎

## Appendix C Proof of Theorem 3

###### Proof.

By Theorem 2, we know that when the R-SPIDER-nonconvex algorithm terminates in iteration , it returns satisfying

 E[∥∇f(xt)∥2]≤M02tτ.

By Assumption 4, we know that

 E[f(xt)−f(x∗)]≤E[∥∇f(xt)∥2]τ≤M02t, (15)

By theorem 2, in iteration the R-SPIDER-nonconvex algorithm makes less than

 n+8L(1+√n)(f(xt)−f∗)ϵ2t=n+2t80τL(1+√n)(f(xt)−f∗)M0

IFO calls in expectation. Take expectation and substitute in the bound in (15), we get that the expected number of IFO calls is less than . ∎

## Appendix D Proof of Theorem 4

###### Lemma 5.

Let . Under Assumptions 1, 2, 4 and the parameter choice in (7), let be intermediate values of Algorithm3. ,

###### Proof.

Based on the implementation of Algorithm 3, for , notice that . We denote for simplicity. Following the proof procedure in Lemma 1,

 E[∥vk−∇f(xk)∥2|Fk]≤E[∥vk−1−∇f(xk−1)∥2|Fk]+L2∥Exp−1xk−1(xk)∥2/Sk (16) ≤E[∥vk−1−∇f(xk−1)∥2|Fk]+δ/q. (17)

Apply this recursively and denote , we get

 E[∥vk−∇f(xk)∥2|Fk0]≤E[∥vk0−∇f(xk0)∥2|Fk0]+qδ/q≤δ. (18)

Then we prove that function value decreases in each epoch (

iterations).

###### Lemma 6.

Let , . Under Assumptions 1, 2, 4 and the parameter choice in (7), if we run Algorithm 3 for iteration, we have

 E[f(xk0+q)−f(x∗)|Fk0]≤max{(f(xk0)−f(x∗))/2,δk0τ}. (19)
###### Proof.

By geodesically smooth, we have

 f(xk+1)−f(xk)≤⟨∇f(xk),Exp−1xk(xk+1)⟩+L2∥Exp−1xk(xk+1)∥2 (20) =−η⟨∇f(xk),vk⟩+Lη22∥vk∥2 (21) =−12L⟨∇f(xk),vk⟩+18L∥vk∥2. (22)

After taking expectation, we get

 (23)

We used the fact that

 E[∥vk∥2|Fk0]=E[∥vk−∇f(xk)∥2|Fk0]+E[∥∇f(xk)∥2|Fk0]+E[2⟨vk−∇f(xk),∇f(xk)⟩|Fk0].

Let . By Assumption 4 and Lemma 5, we get

 E[Δk+1|Fk0]≤E[(1−14Lτ)Δk+14Lδk0|Fk0]. (24)

By choice of parameter defined in 7, we get for

 E[Δk+1|Fk]≤E[ρΔk+14Lδk0|Fk0]. (25)

Since , we know that after iterations, we have

 E[Δk0+q|Fk0]≤Δk0/4+δk0τlog(4)≤max{Δk0/2,2δk