Riemannian Adaptive Optimization Methods

10/01/2018 · Gary Bécigneul, et al. · ETH Zurich

Several first-order stochastic optimization methods commonly used in the Euclidean domain, such as stochastic gradient descent (SGD), accelerated gradient descent or variance-reduced methods, have already been adapted to certain Riemannian settings. However, some of the most popular of these optimization tools, namely Adam, Adagrad and the more recent Amsgrad, remain to be generalized to Riemannian manifolds. We discuss the difficulty of generalizing such adaptive schemes to the most agnostic Riemannian setting, and then provide algorithms and convergence proofs for geodesically convex objectives in the particular case of a product of Riemannian manifolds, in which adaptivity is implemented across manifolds in the Cartesian product. Our generalization is tight in the sense that choosing the Euclidean space as Riemannian manifold yields the same algorithms and regret bounds as those that were already known for the standard algorithms. Experimentally, we show faster convergence to a lower training loss for Riemannian adaptive methods over their corresponding baselines on the realistic task of embedding the WordNet taxonomy in the Poincaré ball.


1 Introduction

Developing powerful stochastic gradient-based optimization algorithms is of major importance for a variety of application domains. In particular, for computational efficiency, it is common to opt for a first-order method when the number of parameters to be optimized is sufficiently large. Such cases have recently become ubiquitous in engineering and computational sciences, from the optimization of deep neural networks to learning embeddings over large vocabularies.

This new need resulted in the development of empirically very successful first order methods such as Adagrad (Duchi et al., 2011), Adadelta (Zeiler, 2012), Adam (Kingma & Ba, 2015) or its recent update Amsgrad (Reddi et al., 2018).

Note that these algorithms are designed to optimize parameters living in a Euclidean space $\mathbb{R}^n$, which has often been considered as the default geometry for continuous variables. However, a recent line of work has been concerned with the optimization of parameters lying on a Riemannian manifold, a more general setting allowing non-Euclidean geometries. This family of algorithms has already found numerous applications, including for instance solving Lyapunov equations (Vandereycken & Vandewalle, 2010), matrix factorization (Tan et al., 2014), geometric programming (Sra & Hosseini, 2015), dictionary learning (Cherian & Sra, 2017) or hyperbolic taxonomy embedding (Nickel & Kiela, 2017; Ganea et al., 2018a; De Sa et al., 2018; Nickel & Kiela, 2018).

A few first-order stochastic methods have already been generalized to this setting (see section 6), the seminal one being Riemannian stochastic gradient descent (rsgd) (Bonnabel, 2013), along with new methods for their convergence analysis in the geodesically convex case (Zhang & Sra, 2016). However, the above-mentioned empirically successful adaptive methods, together with their convergence analysis, have yet to find their respective Riemannian counterparts.

Indeed, the adaptivity of these algorithms can be thought of as assigning one learning rate per coordinate of the parameter vector. However, on a Riemannian manifold, one is generally not given an intrinsic coordinate system, rendering meaningless the notions of sparsity or coordinate-wise update.

Our contributions.

In this work we (i) explain why generalizing these adaptive schemes to the most agnostic Riemannian setting in an intrinsic manner is compromised, and (ii) propose generalizations of the algorithms together with their convergence analysis in the particular case of a product of manifolds where each manifold represents one “coordinate” of the adaptive scheme. Finally, we (iii) empirically support our claims on the realistic task of hyperbolic taxonomy embedding.

Our initial motivation.

The particular application that motivated us in developing Riemannian versions of Adagrad and Adam was the learning of symbolic embeddings in non-Euclidean spaces. As an example, the GloVe algorithm (Pennington et al., 2014), an unsupervised method for learning Euclidean word embeddings capturing semantic/syntactic relationships, benefits significantly from optimizing with Adagrad compared to using Sgd, presumably because different words are sampled at different frequencies. Hence the absence of Riemannian adaptive algorithms could constitute a significant obstacle to the development of competitive optimization-based Riemannian embedding methods. In particular, we believe that the recent rise of embedding methods in hyperbolic spaces could benefit from such developments (Nickel & Kiela, 2017, 2018; Ganea et al., 2018a, b; De Sa et al., 2018; Vinh et al., 2018).

2 Preliminaries and notations

2.1 Differential geometry

We recall here some elementary notions of differential geometry. For more in-depth expositions, we refer the interested reader to Spivak (1979) and Robbin & Salamon (2011).

Manifold, tangent space, Riemannian metric.

A manifold $\mathcal{M}$ of dimension $n$ is a space that can locally be approximated by a Euclidean space $\mathbb{R}^n$, and which can be understood as a generalization to higher dimensions of the notion of surface. For instance, the sphere $\mathbb{S}^{n-1}$ embedded in $\mathbb{R}^n$ is an $(n-1)$-dimensional manifold. In particular, $\mathbb{R}^n$ is itself a very simple $n$-dimensional manifold, with zero curvature. At each point $x \in \mathcal{M}$, one can define the tangent space $T_x\mathcal{M}$, which is an $n$-dimensional vector space and can be seen as a first-order local approximation of $\mathcal{M}$ around $x$. A Riemannian metric $\rho = (\rho_x)_{x \in \mathcal{M}}$ is a collection of inner-products $\rho_x : T_x\mathcal{M} \times T_x\mathcal{M} \to \mathbb{R}$, varying smoothly with $x$. It defines the geometry locally on $\mathcal{M}$. For $u, v \in T_x\mathcal{M}$, we also write $\langle u, v \rangle_x := \rho_x(u, v)$ and $\|u\|_x := \sqrt{\rho_x(u, u)}$. A Riemannian manifold is a pair $(\mathcal{M}, \rho)$.

Induced distance function, geodesics.

Notice how a choice of Riemannian metric induces a natural global distance function $d$ on $\mathcal{M}$. Indeed, for $x, y \in \mathcal{M}$, we can set $d(x, y)$ to be equal to the infimum of the lengths of smooth paths $\gamma : [0, 1] \to \mathcal{M}$ between $x$ and $y$ in $\mathcal{M}$, where the length of a path $\gamma$ is given by integrating the size of its speed vector $\dot{\gamma}(t)$ in the corresponding tangent space: $\ell(\gamma) := \int_0^1 \|\dot{\gamma}(t)\|_{\gamma(t)}\, dt$. A geodesic in $\mathcal{M}$ is a smooth curve which locally has minimal length. In particular, a shortest path between two points in $\mathcal{M}$ is a geodesic.

Exponential and logarithmic maps.

Under some assumptions, one can define at point $x \in \mathcal{M}$ the exponential map $\exp_x : T_x\mathcal{M} \to \mathcal{M}$. Intuitively, this map folds the tangent space onto the manifold. Locally, if $v \in T_x\mathcal{M}$, then for small $t$, $\exp_x(tv)$ tells us how to move in $\mathcal{M}$ so as to take a shortest path from $x$ with initial direction $v$. In $\mathbb{R}^n$, $\exp_x(v) = x + v$. In some cases, one can also define the logarithmic map $\log_x : \mathcal{M} \to T_x\mathcal{M}$ as the inverse of $\exp_x$.

Parallel transport.

In the Euclidean space, if one wants to transport a vector $w$ from $x$ to $y$, one simply translates $w$ along the straight line from $x$ to $y$. In a Riemannian manifold, the resulting transported vector will depend on which path was taken from $x$ to $y$. The parallel transport of a vector $w \in T_x\mathcal{M}$ from a point $x$ in the direction $v \in T_x\mathcal{M}$ and in unit time gives a canonical way to transport $w$ with zero acceleration along the geodesic starting from $x$ with initial velocity $v$.

2.2 Riemannian optimization

Consider performing an sgd update of the form

$$x_{t+1} = x_t - \alpha\, g_t, \qquad (1)$$

where $g_t$ denotes the gradient of the objective $f_t$ (to be interpreted as the objective with the same parameters, evaluated at the minibatch taken at time $t$) and $\alpha > 0$ is the step-size. In a Riemannian manifold $(\mathcal{M}, \rho)$, for smooth $f : \mathcal{M} \to \mathbb{R}$, Bonnabel (2013) defines Riemannian sgd by the following update:

$$x_{t+1} = \exp_{x_t}\!\big(-\alpha\, \mathrm{grad} f_t(x_t)\big), \qquad (2)$$

where $\mathrm{grad} f_t(x_t) \in T_{x_t}\mathcal{M}$ denotes the Riemannian gradient of $f_t$ at $x_t$. Note that when $\mathcal{M}$ is the Euclidean space $\mathbb{R}^n$, these two updates coincide, since we then have $\exp_x(v) = x + v$.

Intuitively, applying the exponential map enables us to perform an update along the shortest path in the relevant direction in unit time, while remaining on the manifold.

In practice, when $\exp_x$ is not known in closed form, it is common to replace it by a retraction map $R_x : T_x\mathcal{M} \to \mathcal{M}$, most often chosen as $R_x(v) = x + v$, which is a first-order approximation of $\exp_x$.
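To make Eq. (2) concrete, here is a minimal sketch (ours, not the authors' code) of one rsgd step for a manifold object exposing `egrad2rgrad`, `exp` and `retr` callables; these interface names are assumptions made for the example, and the Euclidean class merely illustrates that rsgd reduces to sgd in $\mathbb{R}^n$.

```python
import numpy as np

def rsgd_step(x, egrad, manifold, lr=0.01, use_retraction=False):
    """One Riemannian SGD step (Eq. 2): x <- exp_x(-lr * grad f(x))."""
    rgrad = manifold.egrad2rgrad(x, egrad)   # convert the Euclidean to a Riemannian gradient
    step = -lr * rgrad
    return manifold.retr(x, step) if use_retraction else manifold.exp(x, step)

class Euclidean:
    """In R^n the exponential map is exp_x(v) = x + v, so rsgd reduces to sgd."""
    def egrad2rgrad(self, x, egrad):
        return egrad
    def exp(self, x, v):
        return x + v
    def retr(self, x, v):
        return x + v

# Example: minimize f(x) = ||x||^2 on R^3.
x = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    x = rsgd_step(x, egrad=2 * x, manifold=Euclidean(), lr=0.1)
print(x)  # approaches the origin
```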

2.3 Amsgrad, Adam, Adagrad

Let us recall here the main algorithms we are interested in.

Adagrad.

Introduced by Duchi et al. (2011), the standard form of its update step is defined as follows (a small $\varepsilon > 0$ is often added inside the square-root for numerical stability, omitted here for simplicity):

$$x_{t+1}^i = x_t^i - \alpha\, \frac{g_t^i}{\sqrt{\sum_{k=1}^t (g_k^i)^2}} \quad \text{for each coordinate } i. \qquad (3)$$

Such updates, rescaled coordinate-wise depending on the size of past gradients, can yield large improvements when gradients are sparse, or in deep networks where the size of a good update may depend on the layer. However, the accumulation of all past squared gradients can also slow down learning.
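For reference, a minimal NumPy sketch of the coordinate-wise update of Eq. (3), with the small stabilizing $\varepsilon$ mentioned above (illustrative only, not a reference implementation):

```python
import numpy as np

def adagrad_step(x, g, sum_sq, lr=0.1, eps=1e-8):
    """Adagrad (Eq. 3): each coordinate is rescaled by its accumulated squared gradients."""
    sum_sq += g ** 2                               # per-coordinate accumulation
    x = x - lr * g / (np.sqrt(sum_sq) + eps)
    return x, sum_sq

x, sum_sq = np.zeros(3), np.zeros(3)
g = np.array([1.0, 0.0, 0.1])                      # a sparse-ish gradient
x, sum_sq = adagrad_step(x, g, sum_sq)
```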

Adam.

Proposed by Kingma & Ba (2015), the Adam update rule is given by

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \qquad x_{t+1} = x_t - \alpha\, \frac{m_t}{\sqrt{v_t}}, \qquad (4)$$

where $m_t$ can be seen as a momentum term and $v_t$ is an adaptivity term. When $\beta_1 = 0$, one essentially recovers the unpublished method Rmsprop (Tieleman & Hinton, 2012), the only difference to Adagrad being that the sum is replaced by an exponential moving average, hence past gradients are forgotten over time in the adaptivity term $v_t$. This circumvents the issue of Adagrad that learning could stop too early when the sum of accumulated squared gradients is too significant. Let us also mention that the momentum term introduced by Adam for $\beta_1 > 0$ has often been observed to yield large empirical improvements.
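A corresponding sketch of the Adam update of Eq. (4) (including the bias corrections of Kingma & Ba (2015), which the displayed equation omits); again illustrative rather than a reference implementation:

```python
import numpy as np

def adam_step(x, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam (Eq. 4): momentum m_t and adaptivity v_t as exponential moving averages."""
    m = beta1 * m + (1 - beta1) * g          # momentum term
    v = beta2 * v + (1 - beta2) * g ** 2     # adaptivity term (EMA of squared gradients)
    m_hat = m / (1 - beta1 ** t)             # bias corrections, as in the original paper
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```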

Amsgrad.

More recently, Reddi et al. (2018) identified a mistake in the convergence proof of Adam. To fix it, they proposed to either modify the Adam algorithm with

$$\hat v_t = \max(\hat v_{t-1}, v_t), \qquad x_{t+1} = x_t - \alpha\, \frac{m_t}{\sqrt{\hat v_t}}, \qquad (5)$$

which they coin Amsgrad, or to choose an increasing (time-dependent) schedule for $\beta_2$, which they call AdamNc (for non-constant).
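The Amsgrad fix of Eq. (5) amounts to a single extra line keeping a running maximum of the adaptivity term; a sketch (without bias correction, as in Reddi et al. (2018)):

```python
import numpy as np

def amsgrad_step(x, g, m, v, v_hat, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Amsgrad (Eq. 5): identical to Adam except that the denominator uses a running
    maximum of the adaptivity term, restoring a non-increasing effective step size."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)             # the Amsgrad modification
    x = x - lr * m / (np.sqrt(v_hat) + eps)
    return x, m, v, v_hat
```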

3 Adaptive schemes in Riemannian manifolds

3.1 The difficulty of designing adaptive schemes in the general setting

Intrinsic updates.

It is easily understandable that writing any coordinate-wise update requires the choice of a coordinate system. However, on a Riemannian manifold $\mathcal{M}$, one is generally not provided with a canonical coordinate system. The formalism only allows one to work with certain local coordinate systems, also called charts, and several different charts can be defined around each point $x \in \mathcal{M}$. One usually says that a quantity defined using a chart is intrinsic to $\mathcal{M}$ if its definition does not depend on which chart was used. For instance, it is known that the Riemannian gradient of a smooth function can be defined intrinsically to $(\mathcal{M}, \rho)$, but its Hessian is only intrinsically defined at critical points. It is easily seen that the rsgd update of Eq. (2) is intrinsic, since it only involves the exponential map and the Riemannian gradient, which are objects intrinsic to $(\mathcal{M}, \rho)$. However, it is unclear whether it is possible at all to express any of Eqs. (3, 4, 5) in a coordinate-free or intrinsic manner.

A tempting solution.

Note that since an update is defined in a tangent space, one could be tempted to fix a canonical coordinate system in the tangent space at the initialization $x_0$, and to parallel-transport it along the optimization trajectory, adapting Eq. (3) to:

(6)

where the division and the square are taken coordinate-wise, these operations being understood relative to the transported coordinate system. In the Euclidean space, parallel transport between two points $x$ and $y$ does not depend on the path it is taken along, because the space has no curvature. However, in a general Riemannian manifold, not only does it depend on the chosen path, but curvature will also give parallel transport a rotational component (the rotational component of parallel transport inherited from curvature is called the holonomy), which will almost surely break the sparsity of the gradients and hence the benefit of adaptivity. Besides, the interpretation of adaptivity as optimizing different features (i.e. gradient coordinates) at different speeds is also completely lost here, since the coordinate system used to represent gradients depends on the optimization path. Finally, note that the techniques we used to prove our theorems would not apply to updates defined in the vein of Eq. (6).

3.2 Adaptivity is possible across manifolds in a product

From now on, we assume additional structure on $\mathcal{M}$, namely that it is the Cartesian product of $n$ Riemannian manifolds $\mathcal{M}_1, \dots, \mathcal{M}_n$, where $\rho$ is the induced product metric:

$$\mathcal{M} = \mathcal{M}_1 \times \cdots \times \mathcal{M}_n, \qquad \rho_x(u, v) = \sum_{i=1}^n \rho^i_{x^i}(u^i, v^i). \qquad (7)$$
Product notations.

The induced distance function $d$ on $\mathcal{M}$ is known to be given by $d(x, y)^2 = \sum_{i=1}^n d_i(x^i, y^i)^2$, where $d_i$ is the distance in $\mathcal{M}_i$. The tangent space at $x = (x^1, \dots, x^n)$ is given by $T_x\mathcal{M} = T_{x^1}\mathcal{M}_1 \times \cdots \times T_{x^n}\mathcal{M}_n$, and the Riemannian gradient of a smooth function $f : \mathcal{M} \to \mathbb{R}$ at point $x$ is simply the concatenation of the Riemannian gradients of each partial map. Similarly, the exponential map, the log map and the parallel transport in $\mathcal{M}$ are the concatenations of those in each $\mathcal{M}_i$.

Riemannian Adagrad.

We just saw in the above discussion that designing meaningful adaptive schemes, intuitively corresponding to one learning rate per coordinate, in a general Riemannian manifold is difficult, because of the absence of intrinsic coordinates. Here, we propose to see each component $x^i$ of $x \in \mathcal{M}$ as one “coordinate”, yielding a simple adaptation of Eq. (3):

$$x_{t+1}^i = \exp^i_{x_t^i}\!\left(-\alpha\, \frac{\mathrm{grad} f_t(x_t)^i}{\sqrt{\sum_{k=1}^t \|\mathrm{grad} f_k(x_k)^i\|_{x_k^i}^2}}\right). \qquad (8)$$
On the adaptivity term.

Note that we take (squared) Riemannian norms $\|\mathrm{grad} f_t(x_t)^i\|_{x_t^i}^2$ in the adaptivity term rescaling the gradient. In the Euclidean setting, this quantity is simply the scalar $(g_t^i)^2$, which is related to the size of an sgd update of the $i$-th coordinate, rescaled by the learning rate (see Eq. (1)): $|x_{t+1}^i - x_t^i| = \alpha\, |g_t^i|$. By analogy, note that the size of an rsgd update in $\mathcal{M}_i$ (see Eq. (2)) is given by $d_i(x_{t+1}^i, x_t^i) = \alpha\, \|\mathrm{grad} f_t(x_t)^i\|_{x_t^i}$, hence we also recover the gradient's Riemannian norm rescaled by the learning rate, which indeed suggests replacing the scalar $(g_t^i)^2$ by $\|\mathrm{grad} f_t(x_t)^i\|_{x_t^i}^2$ when transforming a coordinate-wise adaptive scheme into a manifold-wise adaptive one.
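Putting Eq. (8) and the remark above together, here is a minimal sketch of a Radagrad step over a product of manifolds, with one scalar accumulator per factor; the manifold interface (`egrad2rgrad`, `exp`, `norm`) is an assumption made for the example, and the toy Euclidean factor is only there to make the snippet runnable:

```python
import numpy as np

class EuclideanFactor:
    """Toy factor manifold (R^d); a real use would plug in e.g. a Poincare ball."""
    def egrad2rgrad(self, x, eg): return eg
    def exp(self, x, v): return x + v
    def norm(self, x, v): return float(np.linalg.norm(v))

def radagrad_step(xs, egrads, manifolds, acc, lr=0.1, eps=1e-8):
    """Riemannian Adagrad (Eq. 8): one scalar accumulator per factor manifold."""
    for i, (x, eg, man) in enumerate(zip(xs, egrads, manifolds)):
        g = man.egrad2rgrad(x, eg)
        acc[i] += man.norm(x, g) ** 2          # squared Riemannian norm replaces (g_t^i)^2
        xs[i] = man.exp(x, -lr * g / (np.sqrt(acc[i]) + eps))
    return xs, acc

# Two factor manifolds, each a copy of R^2.
xs, acc = [np.zeros(2), np.ones(2)], [0.0, 0.0]
egrads = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
xs, acc = radagrad_step(xs, egrads, [EuclideanFactor(), EuclideanFactor()], acc)
```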

4 Ramsgrad, RadamNc: convergence guarantees

In section 2, we briefly presented Adagrad, Adam and Amsgrad. Intuitively, Adam can be described as a combination of Adagrad with momentum (of parameter $\beta_1$), with the slight modification that the sum of the past squared gradients is replaced with an exponential moving average, of exponent $\beta_2$. Let us also recall that Amsgrad implements a slight modification of Adam, allowing to correct its convergence proof. Finally, AdamNc is simply Adam, but with a particular non-constant schedule for $\beta_1$ and $\beta_2$. On the other hand, it is interesting to note that the schedule initially proposed by Reddi et al. (2018) for $\beta_2$ in AdamNc, namely $\beta_{2t} = 1 - 1/t$, lets us recover the sum of squared gradients of Adagrad. Hence, AdamNc without momentum (i.e. $\beta_1 = 0$) yields Adagrad.
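For completeness, the one-line induction behind this last remark (our restatement): with $\beta_{2t} = 1 - 1/t$, the moving average becomes a plain average of squared gradients, so that combined with a step size $\alpha_t = \alpha/\sqrt{t}$ the effective update is exactly Adagrad's:

$$v_t = \Big(1 - \tfrac{1}{t}\Big) v_{t-1} + \tfrac{1}{t}\, g_t^2 = \frac{1}{t} \sum_{k=1}^{t} g_k^2 \qquad\Longrightarrow\qquad \frac{\alpha_t}{\sqrt{v_t}}\, g_t = \frac{\alpha}{\sqrt{\sum_{k=1}^{t} g_k^2}}\, g_t.$$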

Assumptions and notations.

For $1 \le i \le n$, we assume that $(\mathcal{M}_i, \rho^i)$ is a geodesically complete Riemannian manifold with sectional curvature lower bounded by $\kappa_i \le 0$. As written in Eq. (7), let $\mathcal{M}$ be the product manifold of the $\mathcal{M}_i$'s. For each $i$, let $\mathcal{X}_i \subseteq \mathcal{M}_i$ be a compact, geodesically convex set and define $\mathcal{X} := \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$, the set of feasible parameters. Define $\Pi_{\mathcal{X}_i} : \mathcal{M}_i \to \mathcal{X}_i$ to be the projection operator, i.e. $\Pi_{\mathcal{X}_i}(x)$ is the unique point of $\mathcal{X}_i$ minimizing the distance to $x$. Denote by $P^i$, $\exp^i$ and $\log^i$ the parallel transport, exponential and log maps in $\mathcal{M}_i$, respectively. For $x \in \mathcal{M}$ and $u, v \in T_x\mathcal{M}$, denote by $x^i$, $u^i$ and $v^i$ the corresponding components of $x$, $u$ and $v$. In the sequel, let $(f_t)_{1 \le t \le T}$ be a family of differentiable, geodesically convex functions from $\mathcal{X}$ to $\mathbb{R}$. Assume that each $\mathcal{X}_i$ has a diameter bounded by $D_\infty$ and that for all $1 \le i \le n$, $t \in [T]$ and $x \in \mathcal{X}$, $\|\mathrm{grad} f_t(x)^i\|_{x^i} \le G_\infty$. Finally, our convergence guarantees will bound the regret, defined at the end of $T$ rounds as $R_T := \sum_{t=1}^T f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^T f_t(x)$. Finally, denotes any isometry from to , for .

Following the discussion in section 3.2 and especially Eq. (8), we present Riemannian Amsgrad in Figure 1(a). For comparison, we show next to it the standard Amsgrad algorithm in Figure 1(b).

(a) Ramsgrad in $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$. (b) Amsgrad in $\mathbb{R}^n$.
Figure 1: Comparison of the Riemannian (Ramsgrad, panel a) and Euclidean (Amsgrad, panel b) versions of Amsgrad. [Algorithm listings omitted; see the sketch below.]
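Since the listing of Figure 1(a) is not reproduced here, the following sketch restates the Ramsgrad step as described in the surrounding text: per-manifold scalar adaptivity with a running max, a momentum term parallel-transported to the new iterate, and an exponential-map update followed by a projection onto the feasible set. The manifold interface (`egrad2rgrad`, `exp`, `norm`, `transp`, `proj`) is an assumption for illustration, not the authors' code; `ToyFactor` just extends the toy class of the Radagrad sketch with a transport and a (no-op) projection.

```python
import numpy as np

class ToyFactor:
    """Toy factor (R^d): exp is addition, transport is the identity, proj is a no-op."""
    def egrad2rgrad(self, x, eg): return eg
    def exp(self, x, v): return x + v
    def norm(self, x, v): return float(np.linalg.norm(v))
    def transp(self, x, y, v): return v
    def proj(self, x): return x

def ramsgrad_step(xs, egrads, manifolds, m, v, v_hat, t,
                  lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Ramsgrad step over a product of manifolds (sketch of Figure 1(a)).
    m[i] is a tangent vector at xs[i]; v[i] and v_hat[i] are per-manifold scalars."""
    alpha_t = lr / np.sqrt(t)                      # decreasing step size, as in the theorems
    for i, (x, eg, man) in enumerate(zip(xs, egrads, manifolds)):
        g = man.egrad2rgrad(x, eg)
        m[i] = beta1 * m[i] + (1 - beta1) * g      # m[i] was transported to x at the previous step
        v[i] = beta2 * v[i] + (1 - beta2) * man.norm(x, g) ** 2
        v_hat[i] = max(v_hat[i], v[i])             # Amsgrad-style running max, per manifold
        new_x = man.proj(man.exp(x, -alpha_t * m[i] / (np.sqrt(v_hat[i]) + eps)))
        m[i] = man.transp(x, new_x, m[i])          # parallel-transport the momentum to the new iterate
        xs[i] = new_x
    return xs, m, v, v_hat
```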

As a natural choice for transporting the momentum term from one iterate to the next, one could first parallel-transport it from $x_t^i$ to the unprojected point $\exp^i_{x_t^i}(\cdot)$ using $P^i$, and then from there to $x_{t+1}^i$ along a minimizing geodesic. (The idea of parallel-transporting the momentum from $x_t$ to $x_{t+1}$ previously appeared in Cho & Lee (2017).)

As can be seen, if $\mathcal{M}_i = \mathbb{R}$ for all $i$, Ramsgrad and Amsgrad coincide: we then have $\exp_x(v) = x + v$, $\log_x(y) = y - x$, $P_{x \to y}(v) = v$, $\|v\|_x = |v|$, $d_i(x, y) = |x - y|$, and the Riemannian gradient reduces to the usual gradient. From these algorithms, Radam and Adam are obtained simply by removing the $\max$ operations, i.e. replacing $\hat v_t$ with $v_t$. The convergence guarantee that we obtain for Ramsgrad is presented in Theorem 1, where the quantity $\zeta(\kappa, c)$ is defined by Zhang & Sra (2016) as

$$\zeta(\kappa, c) := \begin{cases} \dfrac{\sqrt{|\kappa|}\, c}{\tanh\!\big(\sqrt{|\kappa|}\, c\big)} & \text{if } \kappa < 0, \\[2mm] 1 & \text{if } \kappa \ge 0. \end{cases} \qquad (9)$$

For comparison, we also show the convergence guarantee of the original Amsgrad in appendix C. Note that when $\kappa_i = 0$ for all $i$, the convergence guarantees of Ramsgrad and Amsgrad coincide as well. Indeed, the curvature-dependent quantity in the Riemannian case then becomes equal to $1$, recovering the convergence theorem of Amsgrad. It is also interesting to understand at which speed the regret bound worsens when the curvature is small but non-zero: by a multiplicative factor governed by $\zeta(\kappa_i, D_\infty)$ (see Eq. (9)). Similar remarks hold for RadamNc, whose convergence guarantee is shown in Theorem 2. Finally, notice that setting $\beta_1 = 0$ in Theorem 2 yields a convergence proof for Radagrad, whose update rule we defined in Eq. (8).
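As a side calculation of ours (not taken from the paper), a Taylor expansion of Eq. (9) for small $|\kappa|$ makes this degradation explicit:

$$\zeta(\kappa, c) = \frac{\sqrt{|\kappa|}\, c}{\tanh\!\big(\sqrt{|\kappa|}\, c\big)} = 1 + \frac{|\kappa|\, c^2}{3} + O\!\big(\kappa^2 c^4\big), \qquad \kappa < 0,$$

so with $c = D_\infty$ the curvature term exceeds $1$ by roughly $|\kappa_i| D_\infty^2 / 3$.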

Theorem 1 (Convergence of Ramsgrad).

Let $(x_t)$ and $(v_t)$ be the sequences obtained from Algorithm 1(a), with $\alpha_t = \alpha/\sqrt{t}$, $\beta_1 = \beta_{11}$, $\beta_{1t} \le \beta_1$ for all $t$, and $\gamma := \beta_1/\sqrt{\beta_2} < 1$. We then have:

(10)
Proof.

See appendix A. ∎

Theorem 2 (Convergence of RadamNc).

Let $(x_t)$ and $(v_t)$ be the sequences obtained from RadamNc, with $\alpha_t = \alpha/\sqrt{t}$, $\beta_1 = \beta_{11}$, $\beta_{1t} = \beta_1 \lambda^{t-1}$ for some $\lambda < 1$, and $\beta_{2t} = 1 - 1/t$. We then have:

(11)
Proof.

See appendix B. ∎

The role of convexity.

Note how the notion of convexity in Theorem 5 got replaced by the notion of geodesic convexity in Theorem 1. Let us compare the two definitions: the differentiable functions $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathcal{M} \to \mathbb{R}$ are respectively convex and geodesically convex if, for all $x, y \in \mathbb{R}^n$ and $u, w \in \mathcal{M}$:

$$f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle, \qquad g(w) \ge g(u) + \langle \mathrm{grad}\, g(u),\, \log_u(w) \rangle_u. \qquad (12)$$

But how does this come into play in the proofs? Regret bounds for convex objectives are usually obtained by bounding $\sum_t \big(f_t(x_t) - f_t(x^\ast)\big)$ using Eq. (12) for any $x^\ast \in \mathcal{X}$, which boils down to bounding each $\langle \nabla f_t(x_t),\, x_t - x^\ast \rangle$. In the Riemannian case, this term becomes $\langle \mathrm{grad} f_t(x_t),\, -\log_{x_t}(x^\ast) \rangle_{x_t}$.

The role of the cosine law.

How does one obtain a bound on $\langle \nabla f_t(x_t),\, x_t - x^\ast \rangle$? For simplicity, let us look at the particular case of an sgd update, from Eq. (1). Using a cosine law, this yields

$$\langle g_t,\, x_t - x^\ast \rangle = \frac{1}{2\alpha_t}\Big(\|x_t - x^\ast\|^2 - \|x_{t+1} - x^\ast\|^2\Big) + \frac{\alpha_t}{2}\, \|g_t\|^2. \qquad (13)$$

One now has two terms to bound: (i) when summing over $t$, the first one simplifies as a telescoping sum; (ii) the second term will require a well-chosen decreasing schedule for $\alpha_t$. In Riemannian manifolds, this step is generalized using the analogous lemma 6 introduced by Zhang & Sra (2016), valid in all Alexandrov spaces, which include our setting of geodesically convex subsets of Riemannian manifolds with lower-bounded sectional curvature. The curvature-dependent quantity $\zeta$ of Eq. (10) appears from this lemma, letting us bound $\langle \mathrm{grad} f_t(x_t),\, -\log_{x_t}(x^\ast) \rangle_{x_t}$.
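To make step (i) explicit (our restatement of the standard argument), summing Eq. (13) over $t$ with a non-increasing step size and dropping the last negative telescoping term gives

$$\sum_{t=1}^{T} \langle g_t,\, x_t - x^\ast \rangle \;\le\; \frac{\|x_1 - x^\ast\|^2}{2\alpha_1} \;+\; \sum_{t=2}^{T} \|x_t - x^\ast\|^2 \Big(\frac{1}{2\alpha_t} - \frac{1}{2\alpha_{t-1}}\Big) \;+\; \sum_{t=1}^{T} \frac{\alpha_t}{2}\, \|g_t\|^2,$$

where the first two terms are controlled by the bounded diameter $D_\infty$ and the last one is $O(\sqrt{T})$ for $\alpha_t = \alpha/\sqrt{t}$.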

The benefit of adaptivity.

Let us also mention that the above bounds significantly improve for sparse (per-manifold) gradients. In practice, this could happen for instance for algorithms embedding each word (or node of a graph) in a manifold and when just a few words are updated at a time.

On the choice of $\beta_1$.

The fact that our convergence theorems (see lemma 3) do not require specifying $\beta_1$ suggests that the regret bounds could be improved by exploiting momentum/acceleration in the proofs for a particular choice of $\beta_1$. Note that this remark also applies to Amsgrad (Reddi et al., 2018).

5 Experiments

We empirically assess the quality of the proposed algorithms, Radam, Ramsgrad and Radagrad, compared to the non-adaptive Rsgd method (Eq. 2). For this, we follow (Nickel & Kiela, 2017) and embed the transitive closure of the WordNet noun hierarchy (Miller et al., 1990) in the 5-dimensional Poincaré model of hyperbolic geometry, which is well-known to be better suited to embed tree-like graphs than the Euclidean space (Gromov, 1987; De Sa et al., 2018). In this case, each word is embedded in the same space of constant curvature $-1$, and the full parameter space is thus the Cartesian product of one Poincaré ball per word. The choice of the Poincaré model is justified by the access to closed-form expressions for all the quantities used in Alg. 1(a), listed below (a numerical sketch of these operations follows the list):

  • Metric tensor: $g_x = \lambda_x^2\, I_d$, where $\lambda_x = \frac{2}{1 - \|x\|^2}$ is the conformal factor.

  • Riemannian gradients are rescaled Euclidean gradients: $\mathrm{grad} f(x) = \frac{(1 - \|x\|^2)^2}{4}\, \nabla f(x)$.

  • Distance function and geodesics: $d(x, y) = \cosh^{-1}\!\Big(1 + 2\, \frac{\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}\Big)$ (Nickel & Kiela, 2017; Ungar, 2008; Ganea et al., 2018b).

  • Exponential and logarithmic maps: $\exp_x(v) = x \oplus \Big(\tanh\!\big(\tfrac{\lambda_x \|v\|}{2}\big)\, \tfrac{v}{\|v\|}\Big)$, where $\oplus$ is the generalized Möbius addition (Ungar, 2008; Ganea et al., 2018b).

  • Parallel transport along the unique geodesic from $x$ to $y$: $P_{x \to y}(v) = \frac{\lambda_x}{\lambda_y}\, \mathrm{gyr}[y, -x]\, v$. This formula was derived from (Ungar, 2008; Ganea et al., 2018b), gyr being given in closed form in (Ungar, 2008, Eq. (1.27)).
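The following NumPy sketch transcribes these closed-form expressions for the Poincaré ball of curvature $-1$ (our own transcription, not the authors' code); the final lines check numerically that parallel transport preserves the Riemannian norm.

```python
import numpy as np

def lam(x):
    """Conformal factor lambda_x = 2 / (1 - ||x||^2)."""
    return 2.0 / (1.0 - np.dot(x, x))

def mobius_add(x, y):
    """Generalized Mobius addition x (+) y on the Poincare ball."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    return num / (1 + 2 * xy + x2 * y2)

def egrad2rgrad(x, egrad):
    """Riemannian gradient = ((1 - ||x||^2)^2 / 4) * Euclidean gradient."""
    return ((1.0 - np.dot(x, x)) ** 2 / 4.0) * egrad

def dist(x, y):
    """d(x, y) = arccosh(1 + 2 ||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2)))."""
    num = 2 * np.dot(x - y, x - y)
    den = (1 - np.dot(x, x)) * (1 - np.dot(y, y))
    return np.arccosh(1 + num / den)

def expmap(x, v):
    """exp_x(v) = x (+) ( tanh(lambda_x ||v|| / 2) * v / ||v|| )."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return mobius_add(x, np.tanh(lam(x) * nv / 2.0) * v / nv)

def gyr(u, v, w):
    """Gyration gyr[u, v] w = -(u (+) v) (+) (u (+) (v (+) w))."""
    return mobius_add(-mobius_add(u, v), mobius_add(u, mobius_add(v, w)))

def ptransp(x, y, v):
    """Parallel transport along the geodesic from x to y: (lambda_x / lambda_y) gyr[y, -x] v."""
    return (lam(x) / lam(y)) * gyr(y, -x, v)

# Sanity check: transport preserves the Riemannian norm ||v||_x = lambda_x ||v||.
x, y, v = np.array([0.1, 0.2]), np.array([-0.3, 0.05]), np.array([0.4, -0.1])
print(lam(x) * np.linalg.norm(v), lam(y) * np.linalg.norm(ptransp(x, y, v)))
```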

Dataset & Model.

The transitive closure of the WordNet taxonomy graph consists of 82,115 nouns and 743,241 hypernymy Is-A relations (directed edges $(u, v)$). These words are embedded in the Poincaré ball such that the distance between words connected by an edge is minimized, while being maximized otherwise. We minimize the same loss function as (Nickel & Kiela, 2017), which is similar to a log-likelihood, but approximates the partition function using sampling of negative word pairs (non-edges), fixed to 10 in our case. Note that this loss does not use the direction of the edges in the graph (in a pair $(u, v)$, $v$ denotes the parent, i.e. $u$ entails $v$):

$$\mathcal{L}(\theta) = -\sum_{(u, v) \in \mathcal{D}} \log \frac{e^{-d(u, v)}}{\sum_{v' \in \mathcal{N}(u)} e^{-d(u, v')}}, \qquad (14)$$

where $\mathcal{D}$ is the set of training edges and $\mathcal{N}(u)$ is the set of sampled negatives for $u$ (together with $v$).
Metrics.

We report both the loss value and the mean average precision (MAP) (Nickel & Kiela, 2017): for each directed edge $(u, v)$, we rank its distance $d(u, v)$ among the distances to the full set of ground-truth negative examples of $u$. We use the same two settings as (Nickel & Kiela, 2017), namely: reconstruction (measuring representation capacity) and link prediction (measuring generalization). For link prediction we sample a validation set of edges from the set of transitive-closure edges that contain no leaf node or root. We only focused on 5-dimensional hyperbolic spaces.
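For illustration, a small sketch of this ranking-based evaluation in the simplified case where each edge is scored independently (so that average precision reduces to the reciprocal rank of the true neighbour); `dist` can be the Poincaré distance from the sketch above, and the data structures (`edges`, `negatives`, `emb`) are hypothetical:

```python
import numpy as np

def mean_average_precision(edges, negatives, emb, dist):
    """For each directed edge (u, v), rank d(u, v) against d(u, v') for all
    ground-truth negatives v' of u, and average the per-edge precision 1/rank."""
    precisions = []
    for u, v in edges:
        d_pos = dist(emb[u], emb[v])
        d_neg = np.array([dist(emb[u], emb[w]) for w in negatives[u]])
        rank = 1 + int(np.sum(d_neg < d_pos))     # 1-based rank of the true neighbour
        precisions.append(1.0 / rank)
    return float(np.mean(precisions))
```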

Training details.

For all methods we use the same “burn-in phase” described in (Nickel & Kiela, 2017) for 20 epochs, with a fixed learning rate of 0.03 and using Rsgd with retraction as explained in Sec. 2.2. Solely during this phase, we sampled negative words based on their graph degree raised to the power 0.75. This strategy improves all metrics. After that, when the different optimization methods start, we sample negatives uniformly.

Optimization methods.

Experimentally, we obtained slightly better results for Radam over Ramsgrad, so we mostly report the former. Moreover, we unexpectedly observed convergence to lower loss values when replacing the true exponential map with its first-order approximation, i.e. the retraction, in both Rsgd and in our adaptive methods from Alg. 1(a). One possible explanation is that retraction methods need fewer steps and smaller gradients to “escape” points sub-optimally collapsed on the border of the Poincaré ball, compared to fully Riemannian methods. As a consequence, we report “retraction”-based methods in a separate setting, as they are not directly comparable to their fully Riemannian analogues.

Figure 2: Results for methods doing updates with the exponential map. From left to right we report: training loss, MAP on the train set, MAP on the validation set.
Figure 3: Results for methods doing updates with the retraction. From left to right we report: training loss, MAP on the train set, MAP on the validation set.
Results.

We show in Figures 2 and 3 the results for “exponential”-based and “retraction”-based methods. We ran all our methods with different learning rates from the set {0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0}. For the Rsgd baseline we show in orange the best learning-rate setting, but we also show the previous lower (slower convergence, in blue) and the next higher (faster overfitting, in green) learning rates. For Radam and Ramsgrad we only show the best settings. We always use the same fixed values of $\beta_1$ and $\beta_2$ for these methods, as these achieved the lowest training loss. Radagrad was consistently worse, so we do not report it. As can be seen, Radam always achieves the lowest training loss. On the MAP metric for both reconstruction and link prediction settings, the same method also outperforms all the other methods in the fully Riemannian setting (i.e. Fig. 2). Interestingly, in the “retraction” setting, Radam reaches the lowest training loss value and is on par with Rsgd on the MAP evaluation for both reconstruction and link prediction settings. However, Ramsgrad is faster to converge in terms of MAP for the link prediction task, suggesting that this method has a better generalization capability.

6 Related work

After Riemannian sgd was introduced by Bonnabel (2013), a plethora of other first-order Riemannian methods arose, such as Riemannian svrg (Zhang et al., 2016), Riemannian Stein variational gradient descent (Liu & Zhu, 2017), Riemannian accelerated gradient descent (Liu et al., 2017; Zhang & Sra, 2018) and averaged rsgd (Tripuraneni et al., 2018), along with new methods for their convergence analysis in the geodesically convex case (Zhang & Sra, 2016). Stochastic gradient Langevin dynamics was generalized as well, to improve optimization on the probability simplex (Patterson & Teh, 2013).

Let us also mention that a first version of Riemannian Adam for the Grassmann manifold was previously introduced by Cho & Lee (2017), who proposed to transport the momentum term using parallel translation, an idea that we preserve. However, their algorithm completely removes the adaptive component, since the adaptivity term becomes a scalar. No adaptivity across manifolds is discussed, which is the main point of our work. Moreover, no convergence analysis is provided.

7 Conclusion

Driven by recent work in learning non-Euclidean embeddings for symbolic data, we propose to generalize popular adaptive optimization tools (e.g. Adam, Amsgrad, Adagrad) to Cartesian products of Riemannian manifolds in a principled and intrinsic manner. We derive convergence rates that are similar to those of the corresponding Euclidean algorithms. Experimentally, we show that our methods outperform popular non-adaptive methods such as Rsgd on the realistic task of hyperbolic word taxonomy embedding.

Acknowledgments

Gary Bécigneul is funded by the Max Planck ETH Center for Learning Systems. Octavian Ganea is funded by the Swiss National Science Foundation (SNSF) under grant agreement number 167176.

References

  • Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Claudio Gentile. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences, 64(1):48–75, 2002.
  • Bonnabel (2013) Silvere Bonnabel. Stochastic gradient descent on riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
  • Cherian & Sra (2017) Anoop Cherian and Suvrit Sra. Riemannian dictionary learning and sparse coding for positive definite matrices. IEEE transactions on neural networks and learning systems, 28(12):2859–2871, 2017.
  • Cho & Lee (2017) Minhyung Cho and Jaehyung Lee. Riemannian approach to batch normalization. In Advances in Neural Information Processing Systems, pp. 5225–5235, 2017.
  • De Sa et al. (2018) Christopher De Sa, Albert Gu, Christopher Ré, and Frederic Sala. Representation tradeoffs for hyperbolic embeddings. 2018. URL https://www.cs.cornell.edu/~cdesa/papers/arxiv2018_hyperbolic.pdf.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • Ganea et al. (2018a) Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In International Conference on Machine Learning, 2018a.
  • Ganea et al. (2018b) Octavian-Eugen Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic neural networks. In Advances in Neural Information Processing Systems, 2018b.
  • Gromov (1987) Mikhael Gromov. Hyperbolic groups. In Essays in group theory, pp. 75–263. Springer, 1987.
  • Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Liu & Zhu (2017) Chang Liu and Jun Zhu. Riemannian stein variational gradient descent for bayesian inference. arXiv preprint arXiv:1711.11216, 2017.
  • Liu et al. (2017) Yuanyuan Liu, Fanhua Shang, James Cheng, Hong Cheng, and Licheng Jiao. Accelerated first-order methods for geodesically convex optimization on riemannian manifolds. In Advances in Neural Information Processing Systems 30, pp. 4868–4877. 2017.
  • Miller et al. (1990) George A Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J Miller. Introduction to wordnet: An on-line lexical database. International journal of lexicography, 3(4):235–244, 1990.
  • Nickel & Kiela (2018) Maximilian Nickel and Douwe Kiela. Learning continuous hierarchies in the lorentz model of hyperbolic geometry. In International Conference on Machine Learning, 2018.
  • Nickel & Kiela (2017) Maximillian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6341–6350, 2017.
  • Patterson & Teh (2013) Sam Patterson and Yee Whye Teh. Stochastic gradient riemannian langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pp. 3102–3110, 2013.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In EMNLP, volume 14, pp. 1532–43, 2014.
  • Reddi et al. (2018) Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In ICLR, 2018.
  • Robbin & Salamon (2011) Joel W Robbin and Dietmar A Salamon. Introduction to differential geometry. ETH, Lecture Notes, preliminary version, January, 2011.
  • Spivak (1979) Michael Spivak. A comprehensive introduction to differential geometry, Volume Four, 1979.
  • Sra & Hosseini (2015) Suvrit Sra and Reshad Hosseini. Conic geometric optimization on the manifold of positive definite matrices. SIAM Journal on Optimization, 25(1):713–739, 2015.
  • Tan et al. (2014) Mingkui Tan, Ivor W Tsang, Li Wang, Bart Vandereycken, and Sinno Jialin Pan. Riemannian pursuit for big matrix recovery. In International Conference on Machine Learning, pp. 1539–1547, 2014.
  • Tieleman & Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
  • Tripuraneni et al. (2018) Nilesh Tripuraneni, Nicolas Flammarion, Francis Bach, and Michael I Jordan. Averaging stochastic gradient descent on riemannian manifolds. In Conference On Learning Theory, COLT 2018, Stockholm, Sweden, 6-9 July 2018., 2018.
  • Ungar (2008) Abraham Albert Ungar. A gyrovector space approach to hyperbolic geometry. Synthesis Lectures on Mathematics and Statistics, 1(1):1–194, 2008.
  • Vandereycken & Vandewalle (2010) Bart Vandereycken and Stefan Vandewalle. A riemannian optimization approach for computing low-rank solutions of lyapunov equations. SIAM Journal on Matrix Analysis and Applications, 31(5):2553–2579, 2010.
  • Vinh et al. (2018) Tran Dang Quang Vinh, Yi Tay, Shuai Zhang, Gao Cong, and Xiao-Li Li. Hyperbolic recommender systems. arXiv preprint arXiv:1809.01703, 2018.
  • Zeiler (2012) Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
  • Zhang & Sra (2016) Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pp. 1617–1638, 2016.
  • Zhang & Sra (2018) Hongyi Zhang and Suvrit Sra. Towards riemannian accelerated gradient methods. arXiv preprint arXiv:1806.02812, 2018.
  • Zhang et al. (2016) Hongyi Zhang, Sashank J Reddi, and Suvrit Sra. Riemannian svrg: Fast stochastic optimization on riemannian manifolds. In Advances in Neural Information Processing Systems, pp. 4592–4600, 2016.

Appendix A Proof of Theorem 1

Proof.

Denote by and consider the geodesic triangle defined by , and . Now let , , and . Combining the following formula (note that since each $\mathcal{X}_i$ is geodesically convex, logarithms are well-defined):

(15)

with the following inequality (given by lemma 6):

(16)

yields

(17)

where we use the notation for when it is clear which metric is used. By definition of , we can safely replace by in the above inequality. Plugging into Eq. (17) gives us

(18)

Now applying the Cauchy-Schwarz and Young inequalities to the last term yields

(19)

From the geodesic convexity of for , we have

(20)

Let’s look at the first term. Using and with a change of indices, we have

(21)
(22)