1 Introduction
Largescale machine learning applications are predominantly trained using stochastic gradient descent (SGD)
(Bottou et al., 2018) based algorithms today. A (sub)class of such algorithms that has become increasingly common lately adapts the learning rate of each coordinate of the gradient vector based on past iterates (McMahan & Streeter, 2010; Duchi et al., 2011). A key motivation for this is to have different learning rates for different coordinates (Zeiler, 2012; Pennington et al., 2014), a feature which vanilla SGD lacks. ADAM (Kingma & Ba, 2015), arguably the most popular adaptive gradient method, additionally employs a momentum term to modify the search direction as well. Adaptive gradient methods have enjoyed varying degrees of success in various applications (Pennington et al., 2014; Wilson et al., 2017; Zaheer et al., 2018; Shah et al., 2018).In this paper, we focus on adaptive stochastic gradient algorithms on Riemannian manifolds. Riemannian geometry is a generalization of the Euclidean geometry (Lee, 2003)
. It includes several nonEuclidean spaces such as set of symmetric positivedefinite matrices, and set of orthogonal matrices, to name a few. Numerous machine learning problems can be cast as an optimization problem on Riemannian manifolds. Examples include principal component analysis (PCA), matrix completion
(Mishra & Sepulchre, 2014; Boumal & Absil, 2015), learning taxonomy or latent hierarchies (Nickel & Kiela, 2017, 2018; Ganea et al., 2018), deep metric learning (Harandi et al., 2017a), applications in computer vision
(Harandi et al., 2017b), among others. Several Euclidean algorithms, e.g., steepest descent, conjugate gradients, and trustregions, have been generalized to Riemannian manifolds (Absil et al., 2008).Bonnabel (2013) first proposed the Riemannian SGD (RSGD) algorithm, which is a generalization of the (Euclidean) SGD algorithm to Riemannian manifolds. However, such a generalization of Euclidean adaptive gradient algorithm is relatively unexplored. A key difficulty in this regard is the absence of a canonical coordinate system on manifolds and the inherent nonlinear geometry of the manifold. Recent works in this direction compute Euclideanstyle adaptive weights in the Riemannian setting, ignoring the geometry of the underlying manifold. Roy et al. (2018)
propose a RMSPropstyle algorithm for manifolds, viewing the gradient as a vector and computing a corresponding adaptive weight vector. On the other hand,
Cho & Lee (2017); Bécigneul & Ganea (2019) adapt the step size by computing a scalar weight instead of directly adapting the gradient in a momentumbased Riemannian AMSGrad/ADAMstyle algorithm.We develop a novel approach of adapting the Riemannian gradient, which allows to exploit the structure of the underlying manifolds. In particular, we propose to adapt the row and column subspaces of the Riemannian stochastic gradient . Euclidean adaptive algorithms compute (positive) adaptive weight vectors for the gradient vectors. Our model computes left and right adaptive weight matrices for denoted by and , respectively. adapts the row subspace of and adapts the column subspace of . Both and are positive definite matrices and are computed using the row covariance and column covariance matrices of , respectively. For computational efficiency, we model and as diagonal matrices, taking cue from AdaGrad (Duchi et al., 2011). Overall, we propose computationally efficient Riemannian adaptive stochastic gradient algorithms, henceforth collectively termed as RASA.
Under a set of mild conditions, our analysis guarantees the convergence of our algorithms with a rate of convergence of order , where is the number of iterations. To the best of our knowledge, ours is the first Riemannian adaptive gradient algorithm to provide convergence analysis for nonconvex stochastic optimization setting. Among the existing works, Bécigneul & Ganea (2019) provide convergence analysis only for geodesically convex functions.
Empirically, we compare our algorithms with the existing Riemannian adaptive gradient algorithms. In applications such as principal components analysis and matrix completion, we observe that our algorithms perform better than the baselines in most experiments both on synthetic and realworld datasets.
The main contributions of this work are:
we propose a principled approach for modeling adaptive weights for Riemannian stochastic gradient. We model adaptive weight matrices for row and column subspaces exploiting the geometry of the manifold.
we develop efficient Riemannian adaptive stochastic gradient algorithms, based on the proposed modeling approach.
we provide convergence analysis of our algorithm, under a set of mild conditions. Our algorithms achieve a rate of convergence order , where is the number of iterations, for nonconvex stochastic optimization.
The paper is organized as follows. Section 2 discusses preliminaries of the Riemannian (stochastic) optimization framework and adaptive stochastic gradient algorithms in the Euclidean space. In Section 3, we present our Riemannian adaptive stochastic algorithms. Section 4 provides the convergence analysis of the proposed algorithms. Related works are discussed in Section 5, while empirical results are presented in Section 6. Section 7 concludes the paper.
2 Preliminaries
In this section, we briefly summarize various concepts of Riemannian geometry and optimization. We refer the interested readers to (Absil et al., 2008) for more details.
Notions on Riemannian manifolds
Informally, a manifold is a generalization of the notion of surface to higher dimensions. A manifold of dimension can be locally approximated by . For example, the Stiefel manifold is the set of orthonormal matrices and has dimension (Absil et al., 2008). The first order approximation of around a point is a dimensional vector space and is known as the tangent space . The tangent space contains all tangent vectors to at . Corresponding to each tangent space , we define an inner product , which varies smoothly with . This inner product is termed as the Riemannian metric in differential geometry. A Riemannian manifold comprises a manifold along with the collection of Riemannian metric .
It should be noted that induces a norm on the tangent space at : , where . This in turn induces a (local) distance function on , which is employed to compute length of a path between two points on . A geodesic between two points is a path of shortest length on the manifold, generalizing the notion of straight line in Euclidean space to manifolds.
Minimizing an objective function over the manifold requires the notion of gradients at every point . Correspondingly, the Riemannian gradient at is defined as an element of the tangent space .
Riemannian stochastic gradient update
The Riemannian stochastic gradient descent (RSGD) update (Bonnabel, 2013) is given by
(1) 
where is a (decaying) stepsize. The term represents a Riemannian stochastic gradient and is typically computed as the Riemannian gradient of the objective function on a batch of data , i.e., . The retraction, , maps tangent space onto with a local rigidity condition that preserves the gradients at (Absil et al., 2008, Sec. 4.1). The exponential mapping is an instance of the retraction.
It should be noted that when the manifold is the Euclidean space and the Riemannian metric is the standard Euclidean inner product, the RSGD update (1) simplifies to the (vanilla) stochastic gradient descent (SGD) update (Bottou et al., 2018):
(2) 
where is a stochastic gradient commonly computed as the gradient of the objective function on a batch of data , i.e., . The retraction function in the Euclidean space is given by .
Euclidean adaptive stochastic gradient updates
The adaptive stochastic gradient methods employ the past gradients to compute a local distance measure and subsequently rescale the learning rate of each coordinate of the (Euclidean) model parameter. The AdaGrad algorithm (Duchi et al., 2011) introduced the following update, when :
(3) 
where is a diagonal matrix corresponding to the vector , i.e., , and the symbol ‘’ denotes the entrywise product (Hadamard product). RMSProp (Tieleman & Hinton, 2012) proposed an exponentially moving average of the adaptive term as follows: , where is another hyperparameter.
Various stochastic gradient algorithms also employ an adaptive momentum term, ADAM (Kingma & Ba, 2015) being the most popular of them. In this paper, we restrict our discussion to algorithms without a momentum term because the calculation of the momentum term needs additional vector transport operation every iteration (Absil et al., 2008, Sec. 8.1). However, we include existing adaptive momentum based Riemannian stochastic gradient algorithms (Bécigneul & Ganea, 2019) as baselines in our experiments.
3 Riemannian adaptive stochastic gradient
In this section, we propose novel adaptive stochastic gradient algorithms in Riemannian setting. We first introduce a few notations, which will be useful in the rest of the manuscript. For given vectors and , or is elementwise square root, is elementwise square, denotes elementwise division, and denotes elementwise maximum. For a given matrix , denotes its vectorial representation. The term
represents the identity matrix of
dimension.We consider the following problem
(4) 
where elements of the manifold are represented as matrices of size . Such a geometry is true for most Riemannian manifolds employed in machine learning applications. Prominent examples of such manifolds include the Stiefel, Grassmann, spherical, symmetric positive definite, spectrahedron, and hyperbolic manifolds, to name a few.
In this paper, we are interested in a stochastic optimization setting where we assume that at each time step , the algorithm computes a feasible point given the current iterate . For simplicity, the Riemannian stochastic gradient at a given iterate is denoted by . We exploit the matrix structure of by proposing separate adaptive weight matrices corresponding to row and column subspaces with and , respectively. These weights are computed in an exponentially weighted manner as
(5) 
where is a hyperparameter. It should be noted that and correspond to row and column covariance matrices, respectively. We propose to adapt^{1}^{1}1For numerical stability while computing square root of and in (6), a small is added to their diagonal entries. the Riemannian gradient as
(6) 
We observe that and weight matrices suitably modify the Riemannian gradient while respecting the matrix structure of , rather than plainly viewing as a vector in . For instance in the Grassmann manifold, our right adaptive weight matrix essentially provides different learning rates to different column subspaces instead of giving different learning rates to the individual entries (coordinates) of . Our framework allows adapting only the row or the column space of as well by simply substituting or , respectively. In vectorial representation, the adaptive update (6) may be viewed as
(7) 
where .
It should be noted that does not lie on the tangent plane . Hence, the proposed adaptive Riemannian gradient is , where is a linear operator to project onto the tangent space . Overall, the proposed update is
(8) 
where is defined as in (6).
Diagonal adaptation and the proposed algorithm
The proposed full matrix update (8
) is computationally prohibitive for high dimensional data. Hence, taking cue from the Euclidean adaptive gradient algorithms
(Duchi et al., 2011), we propose diagonal adaptation of adaptive weight matrices. In particular, we model and as diagonal matrices corresponding to vectors and , which are computed as(9) 
where the function returns the diagonal vector of a square matrix. The diagonal adaptation significantly reduces the effective dimension of adaptive weights to and the computational complexity of adapting with (diagonal) weight matrices to .
The final algorithm is summarized in Algorithm 1 and is henceforth referred to as RASA (Riemannian Adaptive Stochastic gradient Algorithm). RASA generalizes Euclidean AdaGradtype and RMSProptype algorithms to Riemannian matrix manifolds.
We employ and in our final update (step , Algorithm 1). The nondecreasing sequence of the adaptive weights (along with nonincreasing sequence of step size ) ensures the convergence of RASA. Such a sequence for adaptive weights was introduced by Reddi et al. (2018) in ADAM to provide convergence guarantees. We present our convergence analysis of Algorithm 1 in Section 4.
We end this section noting that the effective dimension of our adaptive weights (vectors and ) is , which is less than the ambient dimension of the Riemannian gradient . In Euclidean algorithms such as ADAM, AMSGrad (Reddi et al., 2018), AdaGrad, or RMSProp, the effective dimension of the adaptive weights is same as the Euclidean gradient. To observe the effect of the proposed reduction in dimension of adaptive weights in Riemannian setting, we develop a baseline Riemannian adaptive gradient algorithm RASAvec, which follows RMSProp’s procedure of computing adaptive weights from the Riemannian gradient. Hence, for RASAvec, the Riemannian gradient is adapted as
(10) 
where the adaptive weights matrix is , , and . Figure 1 compares the performance of the proposed RASA against RASAvec and RSGD on principal component analysis and matrix completion problems. We develop three variants of RASA: RASAR (adapts only the column subspace), RASAL (adapts only the row subspace), and RASALR (adapts both row and column subspaces). The results illustrate the benefits of the proposed modeling approach. Our experiments confirm this observation across datasets.
4 Convergence rate analysis
For the purpose of analyzing the convergence rate for the proposed Algorithm 1, we view our proposed update, step 8 in Algorithm 1, as
(11) 
where represents the iteration number, , and is defined as
(12) 
where the symbol ‘’ represents the Kronecker product and denotes the vectorized representation of . Furthermore, denotes the vectorized representation of . Hence, , , , and are dimensional vectors in our analysis. We first summarize essential assumptions and a lemma before discussing our main results.
Note that our analysis also extends to that of RASAvec directly.
4.1 Definitions and assumptions
For simplicity, the analysis hereinafter assumes the standard Riemannian metric as (Euclidean inner product) and the standard Riemannian norm as (Euclidean norm), where .
Definition 4.1.
UpperHessian bounded (Absil et al., 2008). The function is said to be upperHessian bounded in with respect to retraction if there exists a constant such that , for all and with , and all such that for all .
The above class of functions in the Riemannian setting corresponds to the set of continuous functions with Lipschitz continuous gradients in the Euclidean space.
We now state our assumptions on problem (4).
Assumption 1.
For problem (4), we assume the following:
(A1) The function is continuously differentiable and is lower bounded, i.e., where is an optimal solution of (4).
(A2) The function has bounded Riemannian stochastic gradient, i.e., or equivalently .
(A3) The function is upperHessian bounded.
Existing works (Zhou et al., 2018; Chen et al., 2019), which focus on the convergence of (Euclidean) ADAM (Kingma & Ba, 2015), use a Euclidean variant of A2. Furthermore, A2 holds when the manifold is compact like the Grassmann manifold (Absil et al., 2008), or through slight modification of the objective function and the algorithm (Kasai et al., 2018).
Finally, we define Retraction smooth functions via the following lemma.
Lemma 4.2.
4.2 Convergence analysis of Algorithm 1
Our proof structure extends existing convergence analysis of ADAMtype algorithms in the Euclidean space (Zhou et al., 2018; Chen et al., 2019) into the Riemannian setting. However, such an extension is nontrivial and key challenges include

projection onto the tangent space: the Riemannian gradient adaptively weighted by and needs to be projected back onto a tangent space. Our analysis needs to additionally take care of this.
To this end, we first provide a lemma for the upper bounds of the full Riemannian gradient as well as the elements of the vector , where and are defined in steps of Algorithm 1.
Lemma 4.3.
We now present our main result in Theorem 4.4.
Theorem 4.4.
Proof sketch: We first derive a Retraction smooth inequality from Lemma 4.2 with respect to the adaptive gradient . Exploiting the symmetric property of , we obtain the upper bound of . Here, to remove the dependency of on , is evaluated instead of . Then, taking the expectation and telescoping the inequality, we obtain the desired result. The complete proof is in Section A.
Remark 1: The first term of (4.4) represents the weighted term of the sum of squared step length, and appears in the standard RSGD (Bonnabel, 2013). From this point of view, the advantage of the adaptive gradient algorithms can be seen that they reduce the effect of the term compared with the standard RSGD algorithm.
Remark 2: Theorem 4.4 reproduces the results of (Chen et al., 2019, Theorem 3.1) when is defined as the Euclidean space.
Corollary 4.5 (Convergence rate of Algorithm 1).
5 Related works
The asymptotic convergence of stochastic gradient descent on Riemannian manifolds (RSGD) was first proved by Bonnabel (2013). Among the first works that aim to adapt stochastic gradients in optimization on Riemannian matrix manifolds is the cRMSProp algorithm (Roy et al., 2018). They effectively perceive adaptive weights as vectors. More concretely, if elements in have matrix representation of size , then cRMSProp computes an adaptive weight matrix and adapts the gradient matrix as . It can be observed that this interpretation of adaptive weights ignores the matrix structure of , , and in general , and process adaptive weights similar to the Euclidean algorithms discussed in Section 2. Roy et al. (2018) also constrain the adaptive weight matrix to lie on the tangent plane, thereby requiring computationally expensive vector transport operations in each iteration. It should also be noted that Roy et al. (2018) do not provide convergence analysis for cRMSProp.
Another direction of work by Cho & Lee (2017); Bécigneul & Ganea (2019) derive the ADAM algorithm (Kingma & Ba, 2015) on matrix manifolds. In particular, they employ a momentum term along with the Riemannian gradient to compute the search direction. However, the adaptive weight vector in the ADAM algorithm is now substituted with a scalar weight, which effectively adapts the step size rather than the search direction. While Cho & Lee (2017) do not discuss convergence analysis, Bécigneul & Ganea (2019) provide convergence guarantees limited to geodesically convex functions (Zhang & Sra, 2016). It should be noted that both require vector transport operations in each iteration due to the momentum term.
Our approach, on the other hand, preserves the underlying matrix structure of the manifold and compute adaptive weight matrices corresponding to row and column subspaces of the Riemannian gradient. We do not enforce any constraint on the adaptive weight matrices and avoid parallel transport altogether. Like most optimization algorithms on manifolds, our algorithm can be easily be generalized to product of manifolds.
Existing works have also explored other directions of improvement to RSGD in specific settings, similar to the Euclidean counterpart. These include variance reduction
(Zhang et al., 2016; Sato et al., 2017), averaged RSGD (Tripuraneni et al., 2018), and recursive gradients (Kasai et al., 2018), among others.6 Experiments
In this section, we compare our proposed adaptive algorithm, RASA, and its variants with the following baseline Riemannian stochastic algorithms.

RSGD (Bonnabel, 2013) is the vanilla Riemannian stochastic gradient algorithm.

cRMSProp proposed by Roy et al. (2018).

cRMSPropM: a variant of cRMSProp, which uses Riemannian gradients for adaptive weights computations instead of the Euclidean gradients.

Radagrad proposed by Bécigneul & Ganea (2019) considers scalar adaptive weight. Hence, it adapts the step size instead of the gradient.

Radam and Ramsgrad by Bécigneul & Ganea (2019): include momentum terms similar to ADAM and AMSGrad. Additionally, like Radagrad, they adapt only the step size.
We compare the baselines with our proposed variants RASAL, RASAR, and RASALR that either adapt the row (left) subspace, column (right) subspace, or both (left and right subspaces).
All the considered algorithms are implemented in the Matlab toolbox Manopt (Boumal et al., 2014).
The algorithms are evaluated in terms of the number of iterations and are initialized from the same point. The algorithms are stopped when the maximum iterations count reaches a predefined value. We fix the batchsize to (except in the MovieLens datasets, where it is set to ). The step size sequence is generated as (Chen et al., 2019; Reddi et al., 2018), where is the iteration count. The initial step size is initialized from multiple values. Performance of algorithms across different values of are also provided in supplementary material. for adaptive algorithms (all except RSGD) is fixed to . The additional momentumrelated term (used in Radam and Ramsgrad) is set to . Our simulations are performed in Matlab on a 4.0 GHz Intel Core i7 machine with 32 GB RAM.
We address the principal component analysis (PCA) and the independent component analysis (ICA) problems on the Stiefel manifold
: the set of orthogonal frames in for some (Absil et al., 2008, Sec. 3.3). The elements of are represented as matrices of size . We also consider the lowrank matrix completion (MC) problem on the Grassmann manifold , which is the set of dimensional subspaces in and is a Riemannian quotient manifold of the Stiefel manifold (Absil et al., 2008, Sec. 3.4).6.1 Principal components analysis
Given a set of data points , the PCA problem amounts to learning an orthogonal projector that minimizes the sum of squared residual errors between projected data points and the original data points. It is formulated as . This problem is equivalent to .
We first evaluate our proposed algorithms on a synthetic dataset of size (Case P1). The initial step size is selected from the range . Figure 2 (a) shows optimality gap with besttuned step sizes of each algorithm. The minimum loss for the optimality gap computation is obtained by the Matlab command pca. We observe that our algorithms RASAL/R/LR obtain the best solution.On the other hand, cRMSProp and its modified variant cRMSPropM converge poorly. Radagrad and its momentum based variants (Ramsgrad and Radam) initially proceed well but finally converge to a solution slightly worse than RSGD.
We additionally evaluate RASA on the MNIST (Case P2) and COIL20 (Case P3) datasets in the same setting as described in Case P1. MNIST contains handwritten digits data of (LeCun et al., ) and has images for training and images for testing with images of size . For MNIST, . COIL20 (Nene et al., 1996) contains normalized camera images of the objects that were taken from different angles. We use images that are resized into pixels. For COIL20, . In both MNIST and COIL20, Figures 2 (b)(c), we observe that RASALR outperforms other baselines. This emphasizes the benefit of adapting the row and column subspaces over individual entries on Riemannian manifolds.
In Figure 3, we show the performance of algorithms on COIL20 dataset when initialized with different values of . It can be seen that the proposed algorithms are more robust to changes in . As discussed before, RASALR obtains the best result on this dataset when .
6.2 Joint diagonalization in ICA
The ICA or the blind source separation problem refers to separating a signal into components so that the components are as independent as possible (Hyvärinen & Oja, 2000). A particular preprocessing step is the whitening step that is proposed through joint diagonalization on the Stiefel manifold (Theis et al., 2009). To that end, the optimization problem is , where defines the sum of the squared diagonal elements of A. The symmetric matrices s are of size and can be cumulant matrices or timelagged covariance matrices of different signal samples (Theis et al., 2009).
We use YaleB (Georghiades et al., 2001) dataset (Case I1), which contains human subjects images under different poses and illumination conditions. The image size of the original images is . From this dataset, we create a region covariance matrix (RCM) descriptors (Porikli & Tuzel, 2006; Tuzel et al., 2006; Pang et al., 2008) for images, and the resultant RCMs are of size . The initial step size is selected from and we set . For YaleB, . We also use CIFAR100 (Krizhevsky & Hinton, 2009) dataset (Case I2), which consists of color images in classes with images per class. The RCM descriptors of size from images are created and is set . For CIFAR100, . We show plots for relative optimality gap (ratio of optimality gap to the optimal objective value) for all the algorithms in Figure 4. The optimal solution value for optimality gap computation is obtained from the algorithm proposed in (Theis et al., 2009). We observe that adapting both left and right subspaces (RASALR) yields the best results.
6.3 Lowrank matrix completion (MC) on MovieLens
The lowrank matrix completion problem amounts to completing an incomplete matrix , say of size , from a small number of observed entries by assuming a lowrank model for . If is the set of the indices for which we know the entries in , then the rank MC problem amounts to finding matrices and that minimizes the error . Partitioning and exploiting the least squares structure (Boumal & Absil, 2015), the problem is equivalent to , where and is the set of observed indices for th column.
We show the results on the MovieLens datasets (Harper & Konstan, 2015) which consist of ratings of movies given by different users. The MovieLens10M dataset consists of ratings by users for movies. The MovieLens1M dataset consists of ratings by users for movies. We randomly split the data into / train/test partitions. We run different algorithms for rank and regularization parameter as suggested by Boumal & Absil (2015). Figures 5 (a)(d) show that our algorithm RASALR achieves faster convergence during the training phase and the best generalization performance on the test set. cRMSProp and cRMSPropM did not run successfully on MovieLens datasets.
7 Discussion and conclusion
We observe that the proposed modeling of adaptive weights on the row and column subspaces obtains better empirical performance than existing baselines (Roy et al., 2018; Cho & Lee, 2017; Bécigneul & Ganea, 2019). In particular, we do not observe additional benefit of the adaptive momentum methods (Radam and Ramsgrad) in the Riemannian setting when the adaptive updates exploit the geometry better.
Overall, we presented a principled approach for modeling adaptive weights for Riemannian stochastic gradient. We developed computationally efficient algorithms and discussed the convergence analysis. Our experiments validate the benefits of the proposed algorithms on different applications.
References
 Absil et al. (2008) Absil, P.A., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
 Bécigneul & Ganea (2019) Bécigneul, B. and Ganea, O.E. Riemannian adaptive optimization methods. In ICLR, 2019.
 Bonnabel (2013) Bonnabel, S. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. on Automatic Control, 58(9):2217–2229, 2013.
 Bottou et al. (2018) Bottou, L., Curtis, F. E., and Nocedal, J. Optimization mehtods for largescale machine learning. SIAM Rev., 60(2):223–311, 2018.
 Boumal & Absil (2015) Boumal, N. and Absil, P.A. Lowrank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra and its Applications, 475:200–239, 2015.
 Boumal et al. (2014) Boumal, N., Mishra, B., Absil, P.A., and Sepulchre, R. Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res., 15(1):1455–1459, 2014.
 Chen et al. (2019) Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of Adamtype algorithms for nonconvex optimization. In ICLR, 2019.

Cho & Lee (2017)
Cho, M. and Lee, J.
Riemannian approach to batch normalization.
In NIPS, 2017.  Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
 Ganea et al. (2018) Ganea, O.E., Bécigneul, B., and Hofmann, T. Hyperbolic entailment cones for learning hierarchical embeddings. In ICML, 2018.

Georghiades et al. (2001)
Georghiades, A., Belhumeur, P., and Kriegman, D.
From few to many: Illumination cone models for face recognition under variable lighting and pose.
IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.  Goldberg et al. (2001) Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. Eigentaste: a constant time collaborative filtering algorithm. Inform. Retrieval, 4(2):133–151, 2001.
 Harandi et al. (2017a) Harandi, M., Salzmann, M., and Hartley, R. Joint dimensionality reduction and metric learning: A geometric take. In ICML, 2017a.
 Harandi et al. (2017b) Harandi, M., Salzmann, M., and Richard, H. Dimensionality reduction on spd manifolds: The emergence of geometryaware methods. IEEE Trans. Pattern Anal. Mach. Intell., 2017b.
 Harper & Konstan (2015) Harper, F. M. and Konstan, J. A. The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems, 5(4):19:1–19:19, 2015.
 Huang et al. (2015) Huang, W., Gallivan, K. A., and Absil, P.A. A Broyden class of quasiNewton methods for Riemannian optimization. SIAM J. Optim., 25(3):1660–1685, 2015.
 Hyvärinen & Oja (2000) Hyvärinen, A. and Oja, E. Independent component analysis: algorithms and applications. Neural networks, 13(45):411–430, 2000.
 Kasai et al. (2018) Kasai, H., Sato, H., and Mishra, B. Riemannian stochastic recursive gradient algorithm. In ICML, 2018.
 Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
 Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

(21)
LeCun, Y., Cortes, C., and Burges, C. J.
The MNIST database.
http://yann.lecun.com/exdb/mnist/.  Lee (2003) Lee, J. M. Introduction to smooth manifolds, volume 218 of Graduate Texts in Mathematics. SpringerVerlag, New York, second edition, 2003.
 McMahan & Streeter (2010) McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In COLT. 2010.
 Mishra & Sepulchre (2014) Mishra, B. and Sepulchre, R. R3MC: A Riemannian threefactor algorithm for lowrank matrix completion. In IEEE CDC, pp. 1137–1142, 2014.
 Nene et al. (1996) Nene, S. A., Nayar, S. K., and Murase, H. Columbia university image library (COIL20). Technical Report CUCS00596, 1996.
 Nickel & Kiela (2017) Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In NIPS, 2017.
 Nickel & Kiela (2018) Nickel, M. and Kiela, D. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In ICML, 2018.
 Pang et al. (2008) Pang, Y., Yuan, Y., and Li, X. Gaborbased region covariance matrices for face recognition. IEEE Trans. Circuits Syst. Video Technol., 18(7):989–993, 2008.
 Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. Glove: Global vectors for word representation. In EMNLP, 2014.
 Porikli & Tuzel (2006) Porikli, F. and Tuzel, O. Fast construction of covariance matrices for arbitrary size image windows. In ICIP, 2006.
 Reddi et al. (2018) Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In ICLR, 2018.

Roy et al. (2018)
Roy, S. K., Mhammedi, Z., and Harandi, M.
Geometry aware constrained optimization techniques for deep learning.
In CVPR, 2018.  Sato et al. (2017) Sato, H., Kasai, H., and Mishra, B. Riemannian stochastic variance reduced gradient. arXiv preprint: arXiv:1702.05594, 2017.
 Shah et al. (2018) Shah, V., Kyrillidis, A., and Sanghavi, S. Minimum norm solutions do not always generalize well for overparameterized problems. arXiv preprint arXiv:1811.07055, 2018.
 Theis et al. (2009) Theis, F. J., Cason, T. P., and Absil, P.A. Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In ICA, 2009.
 Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 Tripuraneni et al. (2018) Tripuraneni, N., Flammarion, N., Bach, F., and Jordan, M. I. Averaging stochastic gradient descent on Riemannian manifolds. In COLT, 2018.
 Tuzel et al. (2006) Tuzel, O., Porikli, F., and Meer, P. Region covariance: a fast descriptor for detection and classification. In ECCV, 2006.
 Wilson et al. (2017) Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. In NIPS. 2017.
 Zaheer et al. (2018) Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. In NeurIPS. 2018.
 Zeiler (2012) Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
 Zhang & Sra (2016) Zhang, H. and Sra, S. Firstorder methods for geodesically convex optimization. In COLT, 2016.
 Zhang et al. (2016) Zhang, H., Reddi, S. J., and Sra, S. Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In NIPS, 2016.
 Zhou et al. (2018) Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.
Appendix A Proofs
This section provides the complete proofs of the lemmas and the convergence analysis.
Hereinafter, we use
to express expectation with respect to the joint distribution of all random variables. For example,
is determined by the realizations of the independent random variables , the total expectation of for any can be taken as . We also use to denote an expected value taken with respect to the distribution of the random variable .a.1 Proof of Lemma 4.3
Proof.
For any , we have
Next, we derive the upper bound of the th element () of , i.e., . Denoting as for simplicity, we have from the algorithm. Now, we first assume that . Then, addressing due to , we have
Consequently, for any , we have . Similarly, having for , and assuming , we have . Thus, we have .
Now, when , due to . Therefore, supposing that , and also that and , we have
Thus, for any , we have .
This completes the proof. ∎
a.2 Proof of Theorem 4.4
This subsection provides the complete proof of Theorem 4.4.
For this purpose, we first provide an essential lemma for projection matrix as presented below.
Lemma A.1 (Projection matrix).
Suppose that is the constraint of the tangent space at to be projected onto, then the projection matrix is derived as
It should be noted that is symmetric.
Now, we provide the proof of Theorem 4.4.
Proof of Theorem 4.4.
For simplicity, we assume that the parameter represents its vectorized form throughout the analysis below. We also assume that the Riemannian stochastic gradients and the Riemannian gradient at represent their vectorized forms of their original matrix forms, which are and , respectively. Accordingly, we assume a projection matrix as in Lemma A.1 as the projection operator .
Since is retraction smooth in Lemma 4.2, we have
(A.1) 
We first bound the second term in the right hand side of above. Here, we notice that
where the first equality holds because is symmetric as stated at Lemma A.1 and the second equality holds because is on the tangent space and is just . This observation can be also obtained by noticing , where in Lemma A.1, as presented below.
where the first equality holds because belongs to the normal space to the tangent space , and its inner product with the tangent vector is zero. We can also bound the second term of (A.1) as
where the inequality holds because of the contraction of the projection operator.
Consequently, (A.1) results in
(A.2) 
Now, we consider the case (A.2) with . We have
Taking expectation and rearranging terms of above yields
(A.3)  
Next, we consider the case (A.2) with . We have