# Averaging Stochastic Gradient Descent on Riemannian Manifolds

We consider the minimization of a function defined on a Riemannian manifold M accessible only through unbiased estimates of its gradients. We develop a geometric framework to transform a sequence of slowly converging iterates generated from stochastic gradient descent (SGD) on M to an averaged iterate sequence with a robust and fast O(1/n) convergence rate. We then present an application of our framework to geodesically-strongly-convex (and possibly Euclidean non-convex) problems. Finally, we demonstrate how these ideas apply to the case of streaming k-PCA, where we show how to accelerate the slow rate of the randomized power method (without requiring knowledge of the eigengap) into a robust algorithm achieving the optimal rate of convergence.


## 1 Introduction

We consider stochastic optimization of a (potentially non-convex) function $f$ defined on a Riemannian manifold $\mathcal{M}$, accessible only through unbiased estimates of its gradients. The framework is broad, encompassing fundamental problems such as principal components analysis (PCA) (Edelman et al., 1998), dictionary learning (Sun et al., 2017), low-rank matrix completion (Boumal and Absil, 2011), and tensor factorization (Ishteva et al., 2011).

The classical setting of stochastic approximation in $\mathbb{R}^d$, first appearing in the work of Robbins and Monro (1951), has been thoroughly explored in both the optimization and machine learning communities. A key step in the development of this theory was the discovery of Polyak-Ruppert averaging, a technique in which the iterates are averaged along the optimization path. Such averaging provably reduces the impact of noise on the problem solution and improves convergence rates in certain important settings (Polyak, 1990; Ruppert, 1988).

By contrast, the general setting of stochastic approximation on Riemannian manifolds has been far less studied. There are many open questions regarding achievable rates and the possibility of accelerating these rates with techniques such as averaging. The problems are twofold: it is not always clear how to extend a gradient-based algorithm to a setting which is missing the global vector-space structure of Euclidean space, and, equally importantly, classical analysis techniques often rely on the Euclidean structure and do not always carry over. In particular, Polyak-Ruppert averaging relies critically on the Euclidean structure of the configuration space, both in its design and its analysis. We therefore ask:

Can the classical technique of Polyak-Ruppert iterate averaging be adapted to the Riemannian setting? Moreover, do the advantages of iterate averaging, in terms of rate and robustness, carry over from the Euclidean setting to the Riemannian setting?

Indeed, in the traditional setting of Euclidean stochastic optimization, averaged optimization algorithms not only improve convergence rates, but also have the advantage of adapting to the hardness of the problem (Bach and Moulines, 2011). They provide a single, robust algorithm that achieves optimal rates with and without strong convexity, and they also achieve the statistically optimal asymptotic variance. In the presence of strong convexity, setting the step size proportional to $1/(\mu n)$, where $\mu$ is the strong-convexity parameter, is sufficient to achieve the optimal $O(1/n)$ rate. However, as highlighted by Nemirovski et al. (2008) and Bach and Moulines (2011), the convergence of such a scheme is highly sensitive to the choice of the constant prefactor in the step size; an improper choice can lead to an arbitrarily slow convergence rate. Moreover, since $\mu$ is rarely known, properly calibrating the step size is often impossible (unless an explicit regularizer is added to the cost function, which adds an extra hyperparameter).

In this paper, we provide a practical iterate-averaging scheme that enhances the robustness and speed of stochastic, gradient-based optimization algorithms, applicable to a wide range of Riemannian optimization problems—including those that are (Euclidean) non-convex. Principally, our framework extends the classical Polyak-Ruppert iterate-averaging scheme (and its inherent benefits) to the Riemannian setting. Moreover, our results hold in the general stochastic approximation setting and do not rely on any finite-sum structure of the objective.

Our main contributions are:

• The development of a geometric framework to transform a sequence of slowly converging iterates on $\mathcal{M}$, produced from SGD, into an iterate-averaged sequence with a robust, fast $O(1/n)$ rate.

• A general formulation of geometric iterate averaging for a class of locally smooth and geodesically-strongly-convex optimization problems.

• An application of our framework to the (non-convex) problem of streaming PCA, where we show how to transform the slow rate of the randomized power method (with no knowledge of the unknown eigengap) into an algorithm that achieves the optimal rate of convergence and which empirically outperforms existing algorithms.

### 1.1 Related Work

Stochastic Optimization: The literature on (Euclidean) stochastic optimization is vast, having been studied through the lens of machine learning (Bottou, 1998; Shalev-Shwartz et al., 2009), optimization (Nesterov and Vial, 2008), and stochastic approximation (Kushner and Yin, 2003). Polyak-Ruppert averaging first appeared in the works of Polyak (1990) and Ruppert (1988); Polyak and Juditsky (1992) then provided asymptotic normality results for the distribution of the averaged iterate sequence. Bach and Moulines (2011) later generalized these results, providing non-asymptotic guarantees for the rate of convergence of the averaged iterates. An important contribution of Bach and Moulines (2011) was to present a unified analysis showing that iterate averaging coupled with sufficiently slow learning rates could achieve the optimal convergence in all settings (i.e., with and without strong convexity).
Riemannian Optimization: Riemannian optimization has not been explored in the machine learning community until relatively recently. Udriste (1994) and Absil et al. (2009) provide comprehensive background on the topic. Most existing work has primarily focused on providing asymptotic convergence guarantees for non-stochastic algorithms (see, e.g., Absil et al., 2007; Ring and Wirth, 2012, who analyze the convergence of Riemannian trust-region and Riemannian L-BFGS methods, respectively).

Bonnabel (2013) provided the first asymptotic convergence proof of stochastic gradient descent (SGD) on Riemannian manifolds while highlighting diverse applications of the Riemannian framework to problems such as PCA. The first global complexity results for first-order Riemannian optimization, utilizing the notion of functional g-convexity, were obtained in the foundational work of Zhang and Sra (2016). The finite-sum, stochastic setting has been further investigated by Zhang et al. (2016) and Sato et al. (2017), who developed Riemannian SVRG methods. However, the potential utility of Polyak-Ruppert averaging in the Riemannian setting has been unexplored.

## 2 Results

We consider the optimization of a function $f$ over a compact, connected subset $\mathcal{X} \subset \mathcal{M}$,

$$\min_{x \in \mathcal{X} \subset \mathcal{M}} f(x),$$

with access to a (noisy) first-order oracle $\nabla f_n$. Given a sequence of iterates in $\mathcal{X}$ produced from the first-order optimization of $f$,

$$x_n = R_{x_{n-1}}\!\left(-\gamma_n \nabla f_n(x_{n-1})\right), \qquad (1)$$

that are converging to a strict local minimum of $f$, denoted by $x^\star$, we consider (and analyze the convergence of) a streaming average of iterates:

$$\tilde{x}_n = R_{\tilde{x}_{n-1}}\!\left(\tfrac{1}{n}\, R^{-1}_{\tilde{x}_{n-1}}(x_n)\right). \qquad (2)$$

Here we use $R_x$ to denote a retraction mapping (defined formally in Section 3), which provides a natural means of moving along a vector (such as the gradient) while restricting movement to the manifold. As an example, when $\mathcal{M} = \mathbb{R}^d$ we can take $R_x(v)$ to be vector addition, $R_x(v) = x + v$. In this setting, Eq. (1) reduces to the standard gradient update $x_n = x_{n-1} - \gamma_n \nabla f_n(x_{n-1})$ and Eq. (2) reduces to the ordinary average $\tilde{x}_n = \frac{1}{n}\sum_{i=1}^n x_i$. In the update in Eq. (1), we will always consider step-size sequences of the form $\gamma_n = C n^{-\alpha}$ for $C > 0$ and $\alpha \in (1/2, 1)$, which satisfy the usual stochastic approximation step-size rules $\sum_n \gamma_n = \infty$ and $\sum_n \gamma_n^2 < \infty$ (see, e.g., Benveniste et al., 1990).
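To make the Euclidean reduction concrete, here is a minimal numerical sketch (our own illustration with a hypothetical quadratic objective and Gaussian gradient noise, not an example from the paper) of Eq. (1) and Eq. (2) when $\mathcal{M} = \mathbb{R}^d$ and the retraction is vector addition:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x_star = np.ones(d)  # minimizer of the hypothetical objective f(x) = 0.5 * ||x - x_star||^2

def grad_estimate(x):
    # unbiased gradient estimate: grad f(x) plus zero-mean Gaussian noise
    return (x - x_star) + rng.normal(scale=0.5, size=d)

x = np.zeros(d)
x_avg = np.zeros(d)
for n in range(1, 20001):
    gamma = 0.5 * n ** -0.5                  # gamma_n = C * n^(-alpha)
    x = x + (-gamma * grad_estimate(x))      # Eq. (1): R_x(v) = x + v
    x_avg = x_avg + (x - x_avg) / n          # Eq. (2): ordinary running average

print(np.linalg.norm(x_avg - x_star))        # averaged iterate lands close to x_star
```

The last iterate `x` keeps fluctuating at the scale of the current step size, while the running average smooths the noise out.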

Intuitively, our main result states that if the iterates $x_n$ converge to $x^\star$ at a slow rate, their streaming Riemannian average will converge to $x^\star$ at the optimal $O(1/n)$ rate. This result requires several technical assumptions, which are standard generalizations of those appearing in the Riemannian optimization and stochastic approximation literatures (detailed in Section 4). The critical assumption we make is that all iterates remain bounded in $\mathcal{X}$, where the manifold behaves well and the algorithm is well-defined (Assumption 2). The notion of slow convergence to an optimum is formalized in the following assumption:

###### Assumption 1.

If $\Delta_n = R^{-1}_{x^\star}(x_n)$ for a sequence of iterates $x_n$ evolving according to Eq. (1), then

$$\mathbb{E}\left[\|\Delta_n\|^2\right] = O(\gamma_n).$$

Assumption 1 can be verified in a variety of optimization problems, and we provide such examples in Section 6. As $x^\star$ is unknown, $\Delta_n$ is not computable, but it is primarily a tool for our analysis. Importantly, $\Delta_n$ is a tangent vector in $T_{x^\star}\mathcal{M}$. Note also that the norm $\|\Delta_n\|$ is locally equivalent to the geodesic distance $d(x_n, x^\star)$ on $\mathcal{M}$ (see Section 3).

We use $\Sigma$ to denote the covariance of the noisy gradients at the optimum $x^\star$. Formally, our main convergence result regarding Polyak-Ruppert averaging in the manifold setting is as follows (where Assumptions 2 through 6 will be presented later):

###### Theorem 1.

Let Assumptions 1, 2, 3, 4, 5, and 6 hold for the iterates evolving according to Eq. (1) and Eq. (2). Then $\tilde{\Delta}_n = R^{-1}_{x^\star}(\tilde{x}_n)$ satisfies:

$$\sqrt{n}\,\tilde{\Delta}_n \xrightarrow{D} \mathcal{N}\!\left(0,\ \nabla^2 f(x^\star)^{-1}\, \Sigma\, \nabla^2 f(x^\star)^{-1}\right).$$

If we additionally assume a bound on the fourth moment of the iterates, of the form $\mathbb{E}[\|\Delta_n\|^4] = O(\gamma_n^2)$, then a non-asymptotic result holds:

$$\mathbb{E}\left[\|\tilde{\Delta}_n\|^2\right] \le \frac{1}{n}\operatorname{tr}\!\left[\nabla^2 f(x^\star)^{-1}\, \Sigma\, \nabla^2 f(x^\star)^{-1}\right] + O(n^{-2\alpha}) + O(n^{\alpha-2}).$$

We make several remarks regarding this theorem:

• The asymptotic result in Theorem 1 is a generalization of the classical asymptotic result of Polyak and Juditsky (1992). In particular, the leading term has covariance $\nabla^2 f(x^\star)^{-1}\Sigma\nabla^2 f(x^\star)^{-1}/n$ independently of the step-size choice $\gamma_n$. In the presence of strong convexity, SGD can achieve the $O(1/n)$ rate with a carefully chosen step size of the form $\gamma_n \propto 1/(\mu n)$. However, the result is fragile: too small a step-size constant can lead to an arbitrarily slow convergence rate, while too large a constant can lead to an “exploding,” non-convergent sequence (Nemirovski et al., 2008). In practice, determining the strong-convexity constant $\mu$ is often as difficult as the original problem itself.

• Theorem 1 implies that the distance (measured in $\|\cdot\|$) of the streaming average $\tilde{x}_n$ to the optimum $x^\star$ asymptotically saturates the Cramér-Rao bound on the manifold (Smith, 2005; Boumal, 2013), asymptotically achieving the statistically optimal covariance. (Note the estimator is only asymptotically unbiased, and hence the Cramér-Rao bound is only meaningful in the asymptotic limit. However, this result can also be understood as saturating the Hájek-Le Cam local asymptotic minimax lower bound; see Van der Vaart, 1998, Ch. 8.) SGD, even with a carefully calibrated step-size choice $\gamma_n \propto 1/(\mu n)$, does not achieve this optimal asymptotic variance (Nevelson and Hasminski, 1973).

We exhibit two applications of this general result in Section 6. Next, we introduce the relevant background and assumptions that are necessary to prove our theorem.

## 3 Preliminaries

We recall some important concepts from Riemannian geometry. Do Carmo (2016) provides a more thorough review, with Absil et al. (2009) providing a perspective particularly relevant to Riemannian optimization.

As a base space we consider a Riemannian manifold $(\mathcal{M}, \mathfrak{g})$: a smooth manifold equipped with a Riemannian metric $\mathfrak{g}$, containing a compact, connected subset $\mathcal{X}$. At each $x \in \mathcal{M}$, the metric induces a natural inner product on the tangent space $T_x\mathcal{M}$, denoted by $\langle \cdot, \cdot \rangle$; this inner product induces a norm $\|\cdot\|$ on each tangent space. The metric also provides a means of measuring the length of a parametrized curve from an interval to the manifold; a geodesic is a constant-speed curve $\gamma : [0, 1] \to \mathcal{M}$ that is locally distance-minimizing with respect to the distance $d$ induced by $\mathfrak{g}$.

When considering functions $f : \mathcal{M} \to \mathbb{R}$, we will use $\nabla f(x)$ to denote the Riemannian gradient of $f$ at $x$, and $\nabla^2 f(x)$ the Riemannian Hessian of $f$ at $x$. When considering functions between manifolds, $\phi : \mathcal{M} \to \mathcal{N}$, we will use $\mathrm{D}\phi(x)$ to denote the differential of the mapping at $x$ (its linearization) (see Absil et al., 2009, for more formal definitions of these objects).

The exponential map $\mathrm{Exp}_x$ maps $v \in T_x\mathcal{M}$ to a point $y \in \mathcal{M}$ such that there is a geodesic $\gamma$ with $\gamma(0) = x$, $\gamma(1) = y = \mathrm{Exp}_x(v)$, and $\dot{\gamma}(0) = v$; it may not be defined on the whole tangent space. If there is a unique geodesic connecting $x$ and $y$, the exponential map has a well-defined inverse $\mathrm{Exp}_x^{-1}(y)$, such that the length of the connecting geodesic is $d(x, y) = \|\mathrm{Exp}_x^{-1}(y)\|$. We also use $R_x$ and $R_x^{-1}$ to denote a retraction mapping and its inverse (when well defined), which are approximations to the exponential map and its inverse. $R_x$ is often computationally cheaper to compute than the exponential map. Formally, the map $R_x : T_x\mathcal{M} \to \mathcal{M}$ is a first-order retraction if $R_x(0) = x$ and $\mathrm{D}R_x(0) = \mathrm{id}_{T_x\mathcal{M}}$, so locally $R_x$ must move in the “direction” of its argument. The map is a second-order retraction if the curve $t \mapsto R_x(tv)$ also satisfies a “zero-acceleration” initial condition at $t = 0$ for all $v \in T_x\mathcal{M}$, where acceleration is measured by the covariant derivative of the curve's velocity (Absil et al., 2009, Sec. 5.4). Note that locally, for $y$ close to $x$, the retraction distance $\|R_x^{-1}(y)\|$ agrees with the geodesic distance $d(x, y)$ up to higher-order terms.

If we consider the example where the manifold is the sphere $\mathcal{S}^{d-1} = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$ (with the round metric), the exponential map along a tangent vector $v \in T_x\mathcal{S}^{d-1}$ generates a curve that is a great circle on the underlying sphere. A nontrivial example of a retraction on the sphere can be defined by first moving along the tangent vector in the embedded Euclidean space $\mathbb{R}^d$, and then projecting this point to the closest point on the sphere.
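The two maps can be compared numerically. The sketch below (our own illustration on $\mathcal{S}^2$ embedded in $\mathbb{R}^3$) implements the exact exponential map and the projection retraction, and checks that they agree to high order along small tangent vectors, as expected for a second-order retraction:

```python
import numpy as np

def retract_sphere(x, v):
    # projection retraction: step in the ambient space, then project back to the sphere
    y = x + v
    return y / np.linalg.norm(y)

def exp_sphere(x, v):
    # exact exponential map on the unit sphere (great-circle motion)
    nv = np.linalg.norm(v)
    if nv < 1e-16:
        return x.copy()
    return np.cos(nv) * x + np.sin(nv) * v / nv

rng = np.random.default_rng(1)
x = rng.normal(size=3); x /= np.linalg.norm(x)
u = rng.normal(size=3); u -= (u @ x) * x       # tangent vector at x

errs = [np.linalg.norm(retract_sphere(x, t * u) - exp_sphere(x, t * u))
        for t in (0.1, 0.01)]
print(errs[0] / errs[1])   # roughly cubic scaling: the maps agree up to O(t^3)
```

Shrinking the step by a factor of 10 shrinks the discrepancy by roughly 1000, the signature of second-order agreement.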

We further define the parallel translation $\Gamma_x^y : T_x\mathcal{M} \to T_y\mathcal{M}$ as the map transporting a vector $v \in T_x\mathcal{M}$ to $\Gamma_x^y v \in T_y\mathcal{M}$, along a path connecting $x$ to $y$, such that the vector stays “constant” by satisfying a zero-acceleration condition. This is illustrated in Figure 1. The map $\Gamma_x^y$ is an isometry. We also consider a different vector transport map, $\Lambda_x^y$, which is the differential of the retraction (Absil et al., 2009, Sec. 8.1).
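For concreteness, the following sketch (our own illustration on $\mathcal{S}^2$, using the standard closed-form transport along great circles) implements parallel translation along a geodesic and verifies the isometry property numerically:

```python
import numpy as np

def transport_sphere(x, y, u):
    # parallel translation of a tangent vector u from T_x S^2 to T_y S^2
    # along the connecting geodesic (x, y unit vectors, u tangent at x)
    c = np.clip(x @ y, -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return u.copy()
    w = y - c * x
    w /= np.linalg.norm(w)          # unit tangent at x pointing toward y
    a = u @ w                        # component of u along the geodesic direction
    # the along-geodesic component rotates; the orthogonal component is unchanged
    return u - a * w + a * (np.cos(theta) * w - np.sin(theta) * x)

rng = np.random.default_rng(5)
x = rng.normal(size=3); x /= np.linalg.norm(x)
y = rng.normal(size=3); y /= np.linalg.norm(y)
u = rng.normal(size=3); u -= (u @ x) * x       # tangent at x

v = transport_sphere(x, y, u)
print(np.linalg.norm(v), np.linalg.norm(u))    # equal norms: the transport is an isometry
print(abs(v @ y))                              # near zero: v is tangent at y
```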

Following Huang et al. (2015), we will call a function $f$ retraction convex on $\mathcal{X}$ (with respect to $R$) if, for all $x \in \mathcal{X}$ and all $v \in T_x\mathcal{M}$ satisfying $\|v\| = 1$, the function $t \mapsto f(R_x(tv))$ is convex for all $t$ such that $R_x(tv) \in \mathcal{X}$; similarly, $f$ is retraction strongly convex on $\mathcal{X}$ if $t \mapsto f(R_x(tv))$ is strongly convex under the same conditions. If $R$ is the exponential map, this reduces to the definition of geodesic (strong) convexity (see the work of Zhang and Sra, 2016, for further details).

## 4 Assumptions

We introduce several assumptions on the manifold $\mathcal{M}$, the function $f$, and the noise process that will be relevant throughout the paper.

### 4.1 Assumptions on M

First, we assume the iterates of the algorithm in Eq. (1) and Eq. (2) remain in $\mathcal{X}$, where the manifold “behaves well.” Formally,

###### Assumption 2.

For the sequence of iterates $x_n$ defined in Eq. (1), there exists a compact, connected subset $\mathcal{X} \subset \mathcal{M}$ such that $x_n \in \mathcal{X}$ for all $n$, and $x^\star \in \mathcal{X}$. Furthermore, $\mathcal{X}$ is totally retractive (with respect to the retraction $R$), and the function $x \mapsto \|R_x^{-1}(y)\|^2$ is retraction strongly convex on $\mathcal{X}$ for all $y \in \mathcal{X}$. Also, $R$ is a second-order retraction at $x^\star$.

Assumption 2 is restrictive, but standard in prior work on stochastic approximation on manifolds (e.g., Bonnabel, 2013; Zhang et al., 2016; Sato et al., 2017). As further detailed by Huang et al. (2015), a totally retractive neighborhood $\mathcal{X}$ is such that for all $x \in \mathcal{X}$ there exists $r > 0$ such that $\mathcal{X} \subset R_x(B(0, r))$, where $R_x$ is a diffeomorphism on $B(0, r) \subset T_x\mathcal{M}$. A totally retractive neighborhood is analogous to the concept of a totally normal neighborhood (see, e.g., Do Carmo, 2016, Chap. 3, Sec. 3). Principally, Assumption 2 ensures that the retraction map (and its respective inverse) are well-defined when applied to the iterates of our algorithm.

If $\mathcal{M}$ is a Hadamard manifold, the exponential map (and its inverse) is defined everywhere on $\mathcal{M}$, although this may not be true globally for a retraction $R$. Similarly, if $\mathcal{M}$ is a compact manifold, the first statement of Assumption 2 is always satisfied. Moreover, in the case of the exponential map, the squared geodesic distance $x \mapsto d^2(x, y)$ is strongly convex in a ball around $y$ whose radius depends on the curvature, as explained by Afsari (2011) and Sakai (1996, Ch. IV, Sec. 2, Lemma 2.9). For our present purpose, we also assume the retraction agrees with the Riemannian exponential map up to second order near $x^\star$. This assumption, that $R$ is a second-order retraction, is fairly general and is satisfied by projection-like retraction maps on matrix manifolds (see, e.g., Absil and Malick, 2012).

### 4.2 Assumptions on f

We now introduce some regularity assumptions on the function $f$, ensuring sufficient differentiability and strong convexity at $x^\star$:

###### Assumption 3.

The function $f$ is twice-continuously differentiable on $\mathcal{X}$. Further, the Hessian of $f$ at $x^\star$, $\nabla^2 f(x^\star)$, satisfies, for some $\mu > 0$ and all nonzero $v \in T_{x^\star}\mathcal{M}$,

$$\langle v, \nabla^2 f(x^\star)\, v\rangle \ge \mu\, \|v\|^2 > 0.$$

Continuity of the Hessian also ensures local retraction strong convexity in a neighborhood of $x^\star$ (Absil et al., 2009, Prop. 5.5.6). Moreover, since the function $f$ is twice-continuously differentiable on $\mathcal{X}$, its Hessian is Lipschitz on this compact set. We formalize this as follows:

###### Assumption 4.

There exists $M > 0$ such that the Hessian of the function $f$, $\nabla^2 f$, is $M$-Lipschitz at $x^\star$. That is, for all $y \in \mathcal{X}$,

$$\left\|\, \Gamma_y^{x^\star} \circ \nabla^2 f(y) \circ \Gamma_{x^\star}^{y} - \nabla^2 f(x^\star)\, \right\|_{\mathrm{op}} \le M \left\| R^{-1}_{x^\star}(y) \right\|.$$

Note that $\|R^{-1}_{x^\star}(y)\|$ is not necessarily symmetric under the exchange of $x^\star$ and $y$. This term could also be replaced with the geodesic distance $d(x^\star, y)$, since the two expressions are locally equivalent in a neighborhood of $x^\star$, but this would come at the cost of a less transparent analysis.

### 4.3 Assumptions on the noise

We state several assumptions on the noise process that will be relevant throughout our discussion. Let $(\mathcal{F}_n)_{n \ge 0}$ be an increasing sequence of sigma-fields. We will assume access to a sequence of noisy estimates $\nabla f_n$ of the true gradient of the function $f$:

###### Assumption 5.

The sequence of (random) vector fields $\nabla f_n$ is $\mathcal{F}_n$-measurable, square-integrable and unbiased:

$$\forall x \in \mathcal{X},\ \forall n \ge 1, \quad \mathbb{E}\left[\nabla f_n(x) \mid \mathcal{F}_{n-1}\right] = \nabla f(x).$$

This general framework subsumes two situations of interest.

• Statistical Learning (on Manifolds): minimizing the expectation of a loss function over $\mathcal{X}$, given a sequence of i.i.d. observations, with access only to noisy, unbiased estimates of the gradient (Aswani et al., 2011).

• Stochastic Approximation (on Manifolds): minimizing a function $f$ over $\mathcal{X}$, with access only to the (random) vector field $\nabla f_n = \nabla f + \xi_n$ at each iteration. Here the gradient vector field is perturbed by a square-integrable martingale-difference sequence $\xi_n$ (so $\mathbb{E}[\xi_n \mid \mathcal{F}_{n-1}] = 0$ for all $n$) (Bonnabel, 2013).

Lastly, we will assume the vector fields $\nabla f_n$ are individually Lipschitz and have bounded covariance at the optimum $x^\star$:

###### Assumption 6.

There exists $L > 0$ such that for all $x \in \mathcal{X}$ and all $n \ge 1$, the vector field $\nabla f_n$ satisfies

$$\mathbb{E}\left[\left\| \Gamma_x^{x^\star}\nabla f_n(x) - \nabla f_n(x^\star) \right\|^2 \,\Big|\, \mathcal{F}_{n-1}\right] \le L^2\, \left\| R^{-1}_{x^\star}(x) \right\|^2;$$

there exists $\sigma > 0$ such that $\|\nabla f_n(x^\star)\| \le \sigma$ almost surely for all $n$, and a symmetric positive-definite matrix $\Sigma$ such that

$$\mathbb{E}\left[\nabla f_n(x^\star) \otimes \nabla f_n(x^\star) \mid \mathcal{F}_{n-1}\right] = \Sigma \quad \text{a.s.}$$

These are natural generalizations of standard assumptions in the optimization literature (Fabian, 1968) to the setting of Riemannian manifolds. (Assuming bounded gradients at $x^\star$ does not contradict Assumption 3, since we are constrained to the compact set $\mathcal{X}$.) Note that the assumption of an exactly constant conditional covariance $\Sigma$ could be slightly relaxed (as detailed in Appendix C.2), but it allows us to state our main result more cleanly.

## 5 Proof Sketch

We provide an overview of the arguments that comprise the proof of Theorem 1 (full details are deferred to Appendix C). We highlight three key steps. First, since we assume the iterates produced from SGD converge to within $O(\sqrt{\gamma_n})$ of $x^\star$ (Assumption 1), we can perform a Taylor expansion of the recursion in Eq. (1) to relate the points $x_n$ on the manifold to the vectors $\Delta_n = R^{-1}_{x^\star}(x_n)$ in the tangent space $T_{x^\star}\mathcal{M}$. This generates a (perturbed) linear recursion governing the evolution of the vectors $\Delta_n$. Recall that as $x^\star$ is unknown, $\Delta_n$ is not accessible, but is primarily a tool for our analysis. Second, we can show a fast convergence rate for the averaged vectors $\bar{\Delta}_n = \frac{1}{n}\sum_{i=1}^{n}\Delta_i$, using techniques from the Euclidean setting. Finally, we once again use a local expansion of Eq. (2) to connect the averaged tangent vectors $\bar{\Delta}_n$ to the streaming, Riemannian average $\tilde{x}_n$, transferring the fast rate for the inaccessible vector $\bar{\Delta}_n$ to the computable point $\tilde{x}_n$. Throughout our analysis we extensively use Assumption 2, which restricts the iterates to the subset $\mathcal{X}$.

### 5.1 From $\mathcal{M}$ to $T_{x^\star}\mathcal{M}$

We begin by linearizing the progress of the SGD iterates in the tangent space at $x^\star$ by considering the evolution of $\Delta_n = R^{-1}_{x^\star}(x_n)$.

• First, as the $\Delta_n$ are all defined in the vector space $T_{x^\star}\mathcal{M}$, Taylor's theorem applied to $R^{-1}_{x^\star}$ along with Eq. (1) allows us to conclude that

$$\Delta_{n+1} = \Delta_n - \gamma_{n+1}\left[\Lambda_{x^\star}^{x_n}\right]^{-1}\!\left(\nabla f_{n+1}(x_n)\right) + O(\gamma_{n+1}^2).$$

See Lemma 4 for more details.

• Second, we use the manifold version of Taylor's theorem and appropriate Lipschitz conditions on the gradient to further expand the gradient term as

$$\Gamma_{x_n}^{x^\star}\nabla f_{n+1}(x_n) = \nabla^2 f(x^\star)\,\Delta_n + \nabla f_{n+1}(x^\star) + \xi_{n+1} + O(\|\Delta_n\|^2),$$

where $\xi_{n+1}$ is a martingale-difference noise term whose conditional second moment is controlled by $\|\Delta_n\|^2$. See Lemma 5 for more details.

• Finally, we argue that the operator $[\Lambda_{x^\star}^{x_n}]^{-1}\Gamma_{x^\star}^{x_n}$ is a local isometry up to second-order terms,

$$\left[\Lambda_{x^\star}^{x_n}\right]^{-1}\Gamma_{x^\star}^{x_n} = I + O(\|\Delta_n\|^2),$$

which crucially rests on the fact that $R$ is a second-order retraction. See Lemma 6 for more details.

• Assembling the aforementioned lemmas allows us to derive a (perturbed) linear recursion governing the tangent vectors $\Delta_n$:

$$\Delta_{n+1} = \Delta_n - \gamma_{n+1}\nabla^2 f(x^\star)\,\Delta_n - \gamma_{n+1}\nabla f_{n+1}(x^\star) - \gamma_{n+1}\xi_{n+1} + O\!\left(\|\Delta_n\|^2\gamma_{n+1} + \gamma_{n+1}^2\right). \qquad (3)$$

See Theorem 7 for more details.

### 5.2 Averaging in $T_{x^\star}\mathcal{M}$

Our next step is to prove both asymptotic and non-asymptotic convergence rates for a general, perturbed linear recursion (resembling Eq. (3)) of the form

$$\Delta_n = \Delta_{n-1} - \gamma_n \nabla^2 f(x^\star)\,\Delta_{n-1} + \gamma_n\left(\varepsilon_n + \xi_n + e_n\right), \qquad (4)$$

under appropriate assumptions on the error and noise sequences $\varepsilon_n$, $\xi_n$, and $e_n$, detailed in Appendix C.2. Under these assumptions we can derive an asymptotic rate for the average $\bar{\Delta}_n = \frac{1}{n}\sum_{i=1}^{n}\Delta_i$, under a first-moment condition on the error sequence:

$$\sqrt{n}\,\bar{\Delta}_n \xrightarrow{D} \mathcal{N}\!\left(0,\ \nabla^2 f(x^\star)^{-1}\,\Sigma\,\nabla^2 f(x^\star)^{-1}\right),$$

and, under a slightly stronger second-moment condition, we have:

$$\mathbb{E}\left[\|\bar{\Delta}_n\|^2\right] \le \frac{1}{n}\operatorname{tr}\!\left[\nabla^2 f(x^\star)^{-1}\,\Sigma\,\nabla^2 f(x^\star)^{-1}\right] + O(n^{-2\alpha}) + O(n^{\alpha-2}),$$

where $\Sigma$ denotes the asymptotic covariance of the noise at $x^\star$. The proof techniques are similar to those of Polyak and Juditsky (1992) and Bach and Moulines (2011), so we do not detail them here. See Theorems 8 and 9 for more details. Note that $\bar{\Delta}_n$ is not computable, but it admits an interesting interpretation in terms of the Riemannian center-of-mass of the iterates in $\mathcal{X}$ (see Section C.2.3 and Afsari, 2011, for more details).

### 5.3 From $T_{x^\star}\mathcal{M}$ back to $\mathcal{M}$

Using the previous arguments, we can conclude that the averaged vector $\bar{\Delta}_n$ obeys both asymptotic and non-asymptotic Polyak-Ruppert-type results. However, $\bar{\Delta}_n$ is not computable. Rather, $\tilde{\Delta}_n = R^{-1}_{x^\star}(\tilde{x}_n)$ corresponds to the computable, Riemannian streaming average $\tilde{x}_n$ defined in Eq. (2). In order to conclude our result, we argue that $\bar{\Delta}_n$ and $\tilde{\Delta}_n$ are close up to higher-order terms. The argument proceeds in two steps:

• Using the fact that the squared retraction distance is retraction convex (Assumption 2), we can conclude that $\mathbb{E}[\|\Delta_n\|^2] = O(\gamma_n)$ implies $\mathbb{E}[\|\tilde{\Delta}_n\|^2] = O(\gamma_n)$ as well. See Lemma 11 for more details.

• Then, we can locally expand Eq. (2) to find that

$$\tilde{\Delta}_{n+1} = \tilde{\Delta}_n + \frac{1}{n+1}\left(\Delta_{n+1} - \tilde{\Delta}_n\right) + \tilde{e}_n,$$

where $\tilde{e}_n$ is a higher-order error term. Rearranging and summing this recursion shows that $\tilde{\Delta}_n$ and $\bar{\Delta}_n$ agree up to terms that do not affect the leading-order rate, showing these terms are close. See Lemma 12 for details.
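As a sanity check on the relationship between the streaming average and the tangent-space average: in the Euclidean case, where the retraction and its inverse are trivial, the recursion of Eq. (2) reproduces the ordinary running mean exactly. A short sketch (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.normal(size=(50, 3))       # a stream of "iterates" in R^3

# Eq. (2) with R_x(v) = x + v and R_x^{-1}(y) = y - x
x_tilde = xs[0].copy()
for n in range(2, len(xs) + 1):
    x_tilde = x_tilde + (xs[n - 1] - x_tilde) / n

print(np.allclose(x_tilde, xs.mean(axis=0)))   # True: the recursion is the running mean
```

On a curved manifold, the same recursion only matches the tangent-space average up to the higher-order correction terms discussed above.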

## 6 Applications

We now introduce two applications of our Riemannian iterate-averaging framework.

### 6.1 Application to Geodesically-Strongly-Convex Functions

In this section, we assume that $f$ is globally geodesically convex over $\mathcal{X}$ and take the retraction to be the exponential map, which allows the derivation of global convergence rates. This function class encapsulates interesting problems such as the matrix Karcher mean problem (Bini and Iannazzo, 2013), which is non-convex in Euclidean space but geodesically strongly convex with an appropriate choice of metric on the space of positive-definite matrices.

Zhang and Sra (2016) show, for geodesically-convex $f$, that averaged SGD with step size proportional to $1/\sqrt{n}$ achieves the slow $O(1/\sqrt{n})$ convergence rate. If, in addition, $f$ is geodesically strongly convex on $\mathcal{X}$, they obtain the fast $O(1/n)$ rate. However, their result is not algorithmically robust, requiring a delicate specification of the step size in terms of the strong-convexity parameter $\mu$, which is often practically impossible due to a lack of knowledge of $\mu$. Assuming smoothness of $f$, our iterate-averaging framework provides a means of obtaining a robust and global convergence rate. First, we make the following assumption:

###### Assumption 7.

The function $f$ is $\mu$-geodesically-strongly-convex on $\mathcal{X}$ for some $\mu > 0$, and the set $\mathcal{X}$ is geodesically convex.

Then, using our main result in Theorem 1, we have:

###### Proposition 2.

Let Assumptions 2, 4, 5, 6, and 7 hold for the iterates evolving according to Eq. (1) and Eq. (2), and take the retraction to be the exponential map $\mathrm{Exp}$. Then,

$$\mathbb{E}\left[\|\tilde{\Delta}_n\|^2\right] \le \frac{1}{n}\operatorname{tr}\!\left[\nabla^2 f(x^\star)^{-1}\,\Sigma\,\nabla^2 f(x^\star)^{-1}\right] + O(n^{-2\alpha}) + O(n^{\alpha-2}).$$

We make several remarks.

• In order to show the result, we first derive a slow rate of convergence for SGD by verifying Assumption 1: geodesic strong convexity yields a contraction for $\mathbb{E}[\|\Delta_n\|^2]$ whose constants depend on $\mu$ and on a geometry-dependent constant (see Proposition 15 for more details). The result then follows by combining this with Theorem 1.

• As in Theorem 1 we also obtain convergence in law and the statistically optimal covariance.

• Importantly, taking the step size proportional to $1/\sqrt{n}$ provides a single, robust algorithm achieving both the slow $O(1/\sqrt{n})$ rate in the absence of strong convexity (by Zhang and Sra (2016)) and the fast $O(1/n)$ rate in the presence of strong convexity. Thus (Riemannian) averaged SGD automatically adapts to the strong convexity of the problem without any prior knowledge of its existence (i.e., the value of $\mu$).
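As a toy illustration of this setting (our own example, not an experiment from the paper), the sketch below runs Riemannian SGD with the exponential map and step size $\gamma_n = n^{-1/2}$, together with the streaming average of Eq. (2), to estimate the Fréchet (Karcher) mean of a cluster of points on the sphere $\mathcal{S}^2$. The objective $f(x) = \tfrac{1}{2}\mathbb{E}\,d^2(x, z)$ is geodesically strongly convex on small geodesic balls, and by symmetry its minimizer here is the pole:

```python
import numpy as np

def exp_s(x, v):
    # exponential map on the unit sphere
    nv = np.linalg.norm(v)
    if nv < 1e-16:
        return x.copy()
    return np.cos(nv) * x + np.sin(nv) * v / nv

def log_s(x, y):
    # inverse exponential map: tangent vector at x pointing toward y
    c = np.clip(x @ y, -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-16:
        return np.zeros_like(x)
    w = y - c * x
    return theta * w / np.linalg.norm(w)

rng = np.random.default_rng(3)
pole = np.array([0.0, 0.0, 1.0])

def sample():
    # observations clustered symmetrically around the pole
    v = rng.normal(scale=0.2, size=3)
    v -= (v @ pole) * pole              # project onto the tangent space at the pole
    return exp_s(pole, v)

x = sample()                            # initialize at a sample
x_tilde = x.copy()
for n in range(1, 5001):
    z = sample()
    g = -log_s(x, z)                                  # Riemannian gradient of 0.5 * d(x, z)^2
    x = exp_s(x, -(n ** -0.5) * g)                    # Eq. (1) with gamma_n = n^{-1/2}
    x_tilde = exp_s(x_tilde, log_s(x_tilde, x) / n)   # Eq. (2), streaming Riemannian average

dist = np.arccos(np.clip(x_tilde @ pole, -1.0, 1.0))
print(dist)   # geodesic distance of the averaged iterate to the true Frechet mean
```

No knowledge of the strong-convexity constant is used: the $n^{-1/2}$ step size plus averaging does the adaptation.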

### 6.2 Streaming Principal Component Analysis (PCA)

The framework of geometric optimization is far-reaching, containing even (Euclidean) non-convex problems such as PCA. Recall the classical formulation of streaming $k$-PCA: we are given a stream of i.i.d. symmetric positive-definite random matrices $H_n \in \mathbb{R}^{d \times d}$ with $\mathbb{E}[H_n] = H$, whose eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ are sorted in decreasing order, and we hope to approximate the subspace spanned by the top $k$ eigenvectors of $H$. Sharp convergence rates for streaming PCA (with $k = 1$) were first obtained by Jain et al. (2016) and Shamir (2016a) using the randomized power method. Shamir (2016b) and Allen-Zhu and Li (2017) later extended this work to the more general streaming $k$-PCA setting. These results are powerful, particularly because they provide global convergence guarantees.

For streaming $k$-PCA, a similar dichotomy to the convex setting exists: in the absence of an eigengap ($\lambda_k = \lambda_{k+1}$) one can only attain the slow $O(1/\sqrt{n})$ rate, while the fast $O(1/n)$ rate is achievable when the eigengap is positive ($\lambda_k > \lambda_{k+1}$). However, as before, a practically burdensome requirement of these fast, global-convergence guarantees is that the step sizes of their corresponding algorithms depend explicitly on the unknown eigengap of the matrix $H$ (in this example, the eigengap is analogous to the strong-convexity parameter $\mu$).

By viewing the $k$-PCA problem as minimizing the (negative) Rayleigh quotient over the Grassmann manifold, we show how to apply the Riemannian iterate-averaging framework developed here to derive a fast, robust algorithm,

$$X_n = R_{X_{n-1}}\!\left(\gamma_n H_n X_{n-1}\right) \quad \text{and} \quad \tilde{X}_n = R_{\tilde{X}_{n-1}}\!\left(\tfrac{1}{n}\, X_n X_n^\top \tilde{X}_{n-1}\right), \qquad (5)$$

for streaming $k$-PCA. Recall that the Grassmann manifold is the set of $k$-dimensional subspaces of a $d$-dimensional Euclidean space, which we equip with a projection-like, second-order retraction map. Observe that the randomized power method update $X_n = R_{X_{n-1}}(\gamma_n H_n X_{n-1})$ (Oja and Karhunen, 1985) in Eq. (5) is almost identical to the Riemannian SGD update in Eq. (1). The principal difference between the two is that the randomized power method uses the Euclidean gradient $H_n X_{n-1}$ instead of the Riemannian gradient. Similarly, the average $\tilde{X}_n$ considered in Eq. (5) closely resembles the (Riemannian) streaming average in Eq. (2) (see Appendix E.2).
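A minimal sketch of Eq. (5) (our own illustration; the dimensions, eigenvalue profile, and QR-based projection-like retraction are assumptions, not the paper's experimental setup) for streaming $k$-PCA with a rank-one Gaussian stream:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 10, 2
# hypothetical spectrum with a clear eigengap after the top-k eigenvalues
evals = np.array([1.0, 0.8, 0.1, 0.08, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01])
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random eigenvectors
Sigma_half = Q @ np.diag(np.sqrt(evals))       # so Cov(z) = Q diag(evals) Q^T

def retract(X, V):
    # projection-like retraction via QR factorization
    Qf, _ = np.linalg.qr(X + V)
    return Qf

X = np.linalg.qr(rng.normal(size=(d, k)))[0]   # random initial subspace
X_avg = X.copy()
for n in range(1, 20001):
    z = Sigma_half @ rng.normal(size=d)        # stream H_n = z z^T, E[H_n] = Cov(z)
    X = retract(X, (n ** -0.5) * np.outer(z, z @ X))       # randomized power method step
    X_avg = retract(X_avg, (1.0 / n) * X @ (X.T @ X_avg))  # streaming average, Eq. (5)

U = Q[:, :k]   # top-k eigenvectors of the true covariance
err = np.linalg.norm(X_avg @ X_avg.T - U @ U.T)
print(err)     # projector distance between the averaged subspace and the truth
```

The step size $n^{-1/2}$ requires no knowledge of the eigengap; comparing projectors $XX^\top$ makes the error measure invariant to the basis chosen within each subspace.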

In fact, we can argue that the randomized power method, Riemannian SGD, and the classic Oja iteration (the linearization of the randomized power method in $\gamma_n$) are all equivalent up to higher-order corrections (see Lemma 17). The average $\tilde{X}_n$ also admits the same linearization as the Riemannian streaming average, up to higher-order corrections (see Lemma 19).

Using results from Shamir (2016b) and Allen-Zhu and Li (2017), we can then argue that the randomized power method iterates satisfy a slow rate of convergence under suitable conditions on their initialization. Hence, the present framework is applicable and we can use geometric iterate averaging to obtain a local, robust, accelerated convergence rate. In the following, we will use $e_j$ to denote the standard basis vectors in $\mathbb{R}^d$.

###### Theorem 3.

Let Assumption 2 hold for a compact set $\mathcal{X}$ containing $X^\star$, where $X^\star$ minimizes the objective over the $k$-Grassmann manifold. Denote by $\lambda_1 \ge \cdots \ge \lambda_d$ the eigenvalues of $H = \mathbb{E}[H_n]$, with corresponding eigenvectors $v_1, \ldots, v_d$, and let $C_{ii'jj'}$ denote the associated 4th-order moment tensor of the stream. Further assume that the $H_n$ are bounded a.s. and that $\lambda_k > \lambda_{k+1}$. Then if $X_n$ and $\tilde{X}_n$ evolve according to Eq. (5), there exists a positive-definite matrix $C$ such that $\tilde{\Delta}_n = R^{-1}_{X^\star}(\tilde{X}_n)$ satisfies:

$$\sqrt{n}\,\tilde{\Delta}_n \xrightarrow{D} \mathcal{N}(0, C) \quad \text{with} \quad C = \sum_{j'=1}^{k}\sum_{i'=k+1}^{d}\sum_{j=1}^{k}\sum_{i=k+1}^{d} C_{ii'jj'}\,\frac{\sqrt{\lambda_i\lambda_j}\cdot\sqrt{\lambda_{i'}\lambda_{j'}}}{(\lambda_j-\lambda_i)\cdot(\lambda_{j'}-\lambda_{i'})}\,\left(v_i e_j^\top\right)\otimes\left(v_{i'} e_{j'}^\top\right).$$

We make the following observations:

• If the 4th-order tensor satisfies $C_{ii'jj'} = \kappa\,\delta_{ii'}\delta_{jj'}$ for a constant $\kappa$ (for example, this condition is satisfied when $H_n = z_n z_n^\top$ for a rank-one stream of Gaussian random vectors $z_n$; see the proof of Theorem 3 for more details), the aforementioned covariance structure simplifies to

$$C = \kappa\sum_{j=1}^{k}\sum_{i=k+1}^{d}\frac{\lambda_i\lambda_j}{(\lambda_j-\lambda_i)^2}\,\left(v_i e_j^\top\right)\otimes\left(v_i e_j^\top\right).$$

This asymptotic variance matches the result of Reiß and Wahl (2016), achieving the same statistical performance as the empirical risk minimizer and matching the lower bound of Cai et al. (2013) obtained for the (Gaussian) spiked covariance model.

• Empirically, even using a constant step size in Eq. (5) appears to yield convergence in a variety of situations; however, we present a numerical counterexample in Appendix F. We leave it as an open problem to understand the convergence of the iterate-averaged, constant step-size algorithm in the case of Gaussian noise (Bougerol and Lacroix, 1985).

• Assumption 2 could be relaxed by using a martingale concentration result showing that the iterates remain in $\mathcal{X}$ with high probability, similar to the work of Shamir (2016a) and Allen-Zhu and Li (2017).

Note that we could also derive an analogous result to Theorem 3 for the (averaged) Riemannian SGD algorithm in Eq. (1) and Eq. (2). However, we prefer to present the algorithm in Eq. (5) since it is simpler and directly averages the (commonly used) randomized power method.

## 7 Experiments

Here, we illustrate the practical utility of our results on a synthetic streaming $k$-PCA problem, using the SGD algorithm defined in Eq. (5). The stream of samples is normally distributed, with a covariance matrix having random eigenvectors and polynomially decaying eigenvalues (a large eigengap in the well-conditioned case, a small eigengap in the poorly conditioned case). All results are averaged over ten repetitions.

##### Robustness to Conditioning.

In Figure 2 we consider two covariance matrices with different conditioning, and we compare the behavior of SGD and averaged SGD for different step sizes: constant (cst), proportional to $n^{-1/2}$, and proportional to $n^{-1}$. When the covariance matrix is well-conditioned, with a large eigengap (left plot), SGD converges at a rate that depends on the step size, whereas averaged SGD converges at the $O(1/n)$ rate independently of the step-size choice. For poorly conditioned problems (right plot), the convergence rate deteriorates to $O(1/\sqrt{n})$ for non-averaged SGD with the $n^{-1/2}$ step size, and for averaged SGD with both the constant and $n^{-1/2}$ step sizes. The $n^{-1}$ step size performs poorly with and without averaging.

##### Robustness to Incorrect Step-Size.

In Figure 3 we consider a well-conditioned problem and compare the behavior of SGD and averaged SGD with step sizes proportional to $n^{-1/2}$ and $n^{-1}$, to investigate the robustness to the choice of the constant prefactor $C$ in $\gamma_n = Cn^{-\alpha}$. For both algorithms we take three different constant prefactors. For the step size proportional to $n^{-1/2}$ (left plot), both SGD and averaged SGD are robust to the choice of $C$: the SGD iterates eventually converge at a $O(1/\sqrt{n})$ rate, with a constant offset depending on $C$, while averaged SGD enjoys the fast $O(1/n)$ rate for all choices of $C$. For the step size proportional to $n^{-1}$ (right plot), if $C$ is too small, the rate of convergence is extremely slow for both SGD and averaged SGD.

## 8 Conclusions

We have constructed and analyzed a geometric framework on Riemannian manifolds that generalizes the classical Polyak-Ruppert iterate-averaging scheme. This framework accelerates a sequence of slowly converging iterates into an iterate-averaged sequence with a robust, fast rate. We have also presented two applications: to the class of geodesically-strongly-convex optimization problems and to streaming k-PCA. Note that our results only apply locally, requiring the iterates to be constrained to lie in a compact set. Considering a projected variant of our algorithm, as in Flammarion and Bach (2017), is a promising direction for further research that may allow us to remove this restriction. Another interesting direction is to provide a global-convergence result for the iterate-averaged PCA algorithm presented here.

### Acknowledgements

The authors thank Nicolas Boumal and John Duchi for helpful discussions. Francis Bach acknowledges support from the European Research Council (grant SEQUOIA 724063), and Michael Jordan acknowledges support from the Mathematical Data Science program of the Office of Naval Research under grant number N00014-15-1-2670.

## References

• Absil and Malick [2012] P.-A. Absil and J. Malick. Projection-like retractions on matrix manifolds. SIAM J. Optim., 22(1):135–158, 2012.
• Absil et al. [2004] P-A Absil, R. Mahony, and R. Sepulchre. Riemannian geometry of Grassmann manifolds with a view on algorithmic computation. Acta Applicandae Mathematicae, 80(2):199–220, 2004.
• Absil et al. [2007] P.-A. Absil, C.G. Baker, and K.A. Gallivan. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, Jul 2007.
• Absil et al. [2009] P-A Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
• Afsari [2011] B. Afsari. Riemannian center of mass: existence, uniqueness, and convexity. Proc. Amer. Math. Soc., 139(2):655–673, 2011.
• Allen-Zhu and Li [2017] Z. Allen-Zhu and Y. Li. First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate. In Proceedings of the 58th Symposium on Foundations of Computer Science, FOCS ’17, 2017.
• Aswani et al. [2011] A. Aswani, P. Bickel, and C. Tomlin. Regression on manifolds: estimation of the exterior derivative. Ann. Statist., 39(1):48–81, 2011.
• Bach and Moulines [2011] F. Bach and E. Moulines. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pages 451–459, 2011.
• Benveniste et al. [1990] A. Benveniste, P. Priouret, and M. Métivier. Adaptive Algorithms and Stochastic Approximations. Springer, 1990.
• Bini and Iannazzo [2013] D. A Bini and B. Iannazzo. Computing the Karcher mean of symmetric positive definite matrices. Linear Algebra and its Applications, 438(4):1700–1710, 2013.
• Bonnabel [2013] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
• Bottou [1998] L. Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.
• Bougerol and Lacroix [1985] P. Bougerol and J. Lacroix. Products of Random Matrices with Applications to Schrödinger Operators, volume 8 of Progress in Probability and Statistics. Birkhäuser, 1985.
• Boumal [2013] N. Boumal. On intrinsic Cramér-Rao bounds for Riemannian submanifolds and quotient manifolds. IEEE Trans. Signal Process., 61(7):1809–1821, 2013.
• Boumal and Absil [2011] N. Boumal and P.-A. Absil. RTRMC: A Riemannian trust-region method for low-rank matrix completion. In Advances in Neural Information Processing Systems 24, pages 406–414. 2011.
• Cai et al. [2013] T. T. Cai, Z. Ma, and Y. Wu. Sparse PCA: optimal rates and adaptive estimation. Ann. Statist., 41(6):3074–3110, 2013.
• Do Carmo [2016] M. P. Do Carmo. Differential Geometry of Curves and Surfaces. Courier Dover Publications, 2016.
• Edelman et al. [1998] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
• Fabian [1968] V. Fabian. On asymptotic normality in stochastic approximation. Ann. Math. Statist, 39:1327–1332, 1968.
• Flammarion and Bach [2017] N. Flammarion and F. Bach. Stochastic composite least-squares regression with convergence rate O(1/n). In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 831–875. PMLR, 07–10 Jul 2017.
• Horn and Johnson [1990] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1990.
• Hosseini and Sra [2015] R. Hosseini and S. Sra. Matrix manifold optimization for Gaussian mixtures. In Advances in Neural Information Processing Systems, pages 910–918, 2015.
• Huang et al. [2015] W. Huang, K. A Gallivan, and P-A Absil. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM Journal on Optimization, 25(3):1660–1685, 2015.
• Ishteva et al. [2011] M. Ishteva, P.-A. Absil, S. Van Huffel, and L. De Lathauwer. Best low multilinear rank approximation of higher-order tensors, based on the Riemannian trust-region scheme. SIAM J. Matrix Anal. Appl., 32(1):115–135, 2011.
• Jain et al. [2016] P. Jain, C. Jin, S. M. Kakade, P. Netrapalli, and A. Sidford. Streaming PCA: matching matrix Bernstein and near-optimal finite sample guarantees for Oja’s algorithm. In Conference on Learning Theory, pages 1147–1164, 2016.
• Kushner and Yin [2003] H. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer, 2003.
• Moakher [2002] M. Moakher. Means and averaging in the group of rotations. SIAM Journal on Matrix Analysis and Applications, 24(1):1–16, 2002.
• Nemirovski et al. [2008] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim., 19(4):1574–1609, 2008.
• Nesterov and Vial [2008] Y. Nesterov and J.-P. Vial. Confidence level solutions for stochastic programming. Automatica J. IFAC, 44(6):1559–1568, 2008.
• Nevelson and Hasminski [1973] M. B. Nevelson and R. Z. Hasminski. Stochastic Approximation and Recursive Estimation. American Mathematical Society, 1973.
• Oja [1982] E. Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, Nov 1982.
• Oja and Karhunen [1985] E. Oja and J. Karhunen. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. J. Math. Anal. Appl., 106(1):69–84, 1985.
• Polyak [1990] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i Telemekhanika, 51(7):98–107, 1990.
• Polyak and Juditsky [1992] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992.
• Reiß and Wahl [2016] M. Reiß and M. Wahl. Non-asymptotic upper bounds for the reconstruction error of PCA. arXiv preprint arXiv:1609.03779, 2016.
• Ring and Wirth [2012] W. Ring and B. Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596–627, 2012.
• Robbins and Monro [1951] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
• Ruppert [1988] D. Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University Operations Research and Industrial Engineering, 1988.
• Sakai [1996] T. Sakai. Riemannian Geometry, volume 149 of Translations of Mathematical Monographs. American Mathematical Society, 1996.
• Sato et al. [2017] H. Sato, H. Kasai, and B. Mishra. Riemannian stochastic variance reduced gradient. arXiv preprint arXiv:1702.05594, 2017.
• Shalev-Shwartz et al. [2009] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In Proceedings of the International Conference on Learning Theory (COLT), 2009.
• Shamir [2016a] O. Shamir. Convergence of stochastic gradient descent for PCA. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 257–265. PMLR, 20–22 Jun 2016a.
• Shamir [2016b] O. Shamir. Fast stochastic algorithms for SVD and PCA: convergence properties and convexity. In International Conference on Machine Learning, pages 248–256, 2016b.
• Smith [2005] S. T. Smith. Covariance, subspace, and intrinsic Cramér-Rao bounds. IEEE Trans. Signal Process., 53(5):1610–1630, 2005.
• Sun et al. [2017] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere II: recovery by Riemannian trust-region method. IEEE Trans. Inform. Theory, 63(2):885–914, 2017.
• Udriste [1994] C. Udriste. Convex Functions and Optimization Methods on Riemannian Manifolds, volume 297. Springer Science & Business Media, 1994.
• Van der Vaart [1998] A. W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 1998.
• Waldmann [2012] S. Waldmann. Geometric wave equations. arXiv preprint arXiv:1208.4706, 2012.
• Yang [1995] B. Yang. Projection approximation subspace tracking. Trans. Sig. Proc., 43(1):95–107, January 1995.
• Zhang and Sra [2016] H. Zhang and S. Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617–1638, 2016.
• Zhang et al. [2016] H. Zhang, S. J. Reddi, and S. Sra. Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 4592–4600, 2016.

## Appendix A Appendices

In Appendix B we provide the proof of Theorem 1. In Appendix C we prove the relevant lemmas, mirroring the proof sketch in Section 5. In Appendix D we provide proofs of the results for the application to geodesically-strongly-convex optimization discussed in Section 6.1. Appendix E contains background and proofs of results discussed in Section 6.2 regarding streaming k-PCA. Appendix F contains further experiments on synthetic PCA, showing a counterexample to the convergence of averaged, constant step-size SGD mentioned in Section 7 of the main text.

Throughout the appendices, for a sequence of vectors $a_n$ and scalar functions $b_n$, we write $a_n = O(b_n)$ if there exists a constant $C$ such that $\|a_n\| \le C\, b_n$ for all $n$, almost surely.

## Appendix B Proofs for Section 2

Here we provide the proof of Theorem 1. The first statement follows by combining Theorems 7 and 8, Lemma 12, and Slutsky's theorem. The second statement follows from Theorems 7 and 9 and Lemma 13.

## Appendix C Proofs for Section 5

Here we detail the proofs of the results necessary to conclude our main result sketched in Section 5.

### c.1 Proofs in Section 5.1

We begin with the proofs of the geometric lemmas detailed in Section 5.1, showing how to linearize the progress of the SGD iterates in the tangent space of $x_\star$ by considering the evolution of $\Delta_n = R_{x_\star}^{-1}(x_n)$. Note that since, by Assumption 2, $x_n \in \mathcal{X}$ for all $n$, the vectors $\Delta_n$ all belong to a compact set.

In the course of our argument it will be useful to consider the function $F_{x,y} = R_y^{-1} \circ R_x$ (which, crucially, is a map between the vector spaces $T_x\mathcal{M}$ and $T_y\mathcal{M}$) and further $\Lambda_x^y = DR_y(R_y^{-1}(x))$, the linearization of the retraction map. The first recursion we will study is that of $\Delta_n$:

###### Lemma 4.

Let Assumption 2 hold. If $\Delta_n = R_{x_\star}^{-1}(x_n)$ for a sequence of iterates evolving as in Eq. (1), then there exists a constant $M$ depending on $\mathcal{X}$ such that

$$\Delta_{n+1} = \Delta_n - \gamma_{n+1}\,[\Lambda_{x_n}^{x_\star}]^{-1}\big(\nabla f_{n+1}(x_n)\big) + \gamma_{n+1}\, g_n,$$

where $\|g_n\| \le M \gamma_{n+1} \|\nabla f_{n+1}(x_n)\|^2$ almost surely.

###### Proof.

Using the chain rule for the differential of a mapping on a manifold and the first-order properties of the retraction ($R_x(0_x) = x$ and $DR_x(0_x) = I_{T_x\mathcal{M}}$), we have:

$$DF_{x,y}(0_x) = D(R_y^{-1} \circ R_x)(0_x) = DR_y^{-1}(R_x(0_x)) \circ DR_x(0_x) = \big[DR_y\big(R_y^{-1}(R_x(0_x))\big)\big]^{-1} \circ I_{T_x\mathcal{M}} = \big[DR_y(R_y^{-1}(x))\big]^{-1} = [\Lambda_x^y]^{-1},$$

where the last equality follows by the inverse function theorem on the manifold. Smoothness of the retraction implies that the Hessian of $F_{x,y}$ is uniformly bounded in norm on the compact set $\mathcal{X}$; we use $M$ to denote a bound on its operator norm there. In the present situation we have $\Delta_{n+1} = F_{x_n,x_\star}\big(-\gamma_{n+1}\nabla f_{n+1}(x_n)\big)$ and $F_{x_n,x_\star}(0_{x_n}) = \Delta_n$. Since $F_{x_n,x_\star}$ is a function defined on vector spaces, the result follows from a Taylor expansion with second-order remainder, the expression above for the differential of $F_{x_n,x_\star}$, and the uniform bound on the second-order term: the remainder is bounded by $\tfrac{M}{2}\gamma_{n+1}^2\|\nabla f_{n+1}(x_n)\|^2$, which gives the stated bound on the error term $g_n$. ∎
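The two identities driving this proof — the first-order property of the retraction and $DF_{x,y}(0_x) = [\Lambda_x^y]^{-1}$ — can be checked numerically. The sketch below is an assumed concrete instance on the sphere with the projective retraction (in the spirit of Absil and Malick [2012]), verifying by finite differences that applying $\Lambda_x^y$ to $DF_{x,y}(0_x)u$ recovers $u$:

```python
import numpy as np

def retract(x, v):
    """Projective retraction on the sphere: R_x(v) = (x + v) / ||x + v||."""
    w = x + v
    return w / np.linalg.norm(w)

def retract_inv(y, z):
    """Inverse retraction: the v in T_y with R_y(v) = z (requires <y, z> > 0)."""
    return z / np.dot(y, z) - y

def F(x, y, v):
    """F_{x,y} = R_y^{-1} o R_x, a map between the vector spaces T_x and T_y."""
    return retract_inv(y, retract(x, v))

rng = np.random.default_rng(1)
x = rng.standard_normal(3); x /= np.linalg.norm(x)
u = rng.standard_normal(3); u -= (u @ x) * x; u /= np.linalg.norm(u)  # tangent at x
t = rng.standard_normal(3); t -= (t @ x) * x
y = retract(x, 0.3 * t)                                               # nearby point, <x, y> > 0

eps = 1e-6
DF_u = (F(x, y, eps * u) - F(x, y, 0.0 * u)) / eps        # ~ DF_{x,y}(0_x) u, lives in T_y
w = retract_inv(y, x)                                      # the point with R_y(w) = x
Lam_DF_u = (retract(y, w + eps * DF_u) - retract(y, w)) / eps  # ~ Lambda_x^y applied to DF_u
# By the chain-rule identity, Lambda_x^y [DF_{x,y}(0_x) u] should recover u.
print(np.linalg.norm(Lam_DF_u - u))
```

The printed residual is at the finite-difference noise level, consistent with the claim that the differential of $F_{x,y}$ at the origin is exactly the inverse of the retraction's linearization.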

We now further develop this recursion by considering an asymptotic expansion of the gradient term near the optimum $x_\star$.

###### Lemma 5.

Let Assumptions 4, 5, and 6 hold. If $\Delta_n = R_{x_\star}^{-1}(x_n)$ for a sequence of iterates evolving as in Eq. (1), then there exist error sequences such that

 Γx⋆xn∇fn+1(xn)=∇2f(x⋆)Δn+∇fn