Adaptive stochastic gradient algorithms on Riemannian manifolds

02/04/2019 ∙ by Hiroyuki Kasai, et al. ∙ University of Electro-Communications ∙ Microsoft

Adaptive stochastic gradient algorithms in the Euclidean space have attracted much attention lately. Such explorations on Riemannian manifolds, on the other hand, are relatively new, limited, and challenging. This is because of the intrinsic non-linear structure of the underlying manifold and the absence of a canonical coordinate system. In machine learning applications, however, most manifolds of interest are represented as matrices with notions of row and column subspaces. In addition, the implicit manifold-related constraints may also lie on such subspaces. For example, the Grassmann manifold is the set of column subspaces. Such rich structure should therefore not be lost by transforming matrices into just a stack of vectors while developing optimization algorithms on manifolds. We propose novel stochastic gradient algorithms for problems on Riemannian manifolds by adapting the row and column subspaces of gradients. Our algorithms are provably convergent and achieve a convergence rate of order O(log(T)/√T), where T is the number of iterations. Our experiments illustrate that the proposed algorithms outperform existing Riemannian adaptive stochastic algorithms.


1 Introduction

Large-scale machine learning applications are predominantly trained using stochastic gradient descent (SGD) based algorithms today (Bottou et al., 2018). A (sub)class of such algorithms that has become increasingly common lately adapts the learning rate of each coordinate of the gradient vector based on past iterates (McMahan & Streeter, 2010; Duchi et al., 2011). A key motivation for this is to have different learning rates for different coordinates (Zeiler, 2012; Pennington et al., 2014), a feature which vanilla SGD lacks. ADAM (Kingma & Ba, 2015), arguably the most popular adaptive gradient method, additionally employs a momentum term to modify the search direction as well. Adaptive gradient methods have enjoyed varying degrees of success in various applications (Pennington et al., 2014; Wilson et al., 2017; Zaheer et al., 2018; Shah et al., 2018).

In this paper, we focus on adaptive stochastic gradient algorithms on Riemannian manifolds. Riemannian geometry is a generalization of Euclidean geometry (Lee, 2003). It includes several non-Euclidean spaces such as the set of symmetric positive-definite matrices and the set of orthogonal matrices, to name a few. Numerous machine learning problems can be cast as optimization problems on Riemannian manifolds. Examples include principal component analysis (PCA), matrix completion (Mishra & Sepulchre, 2014; Boumal & Absil, 2015), learning taxonomies or latent hierarchies (Nickel & Kiela, 2017, 2018; Ganea et al., 2018), deep metric learning (Harandi et al., 2017a), and applications in computer vision (Harandi et al., 2017b), among others. Several Euclidean algorithms, e.g., steepest descent, conjugate gradients, and trust-regions, have been generalized to Riemannian manifolds (Absil et al., 2008).

Bonnabel (2013) first proposed the Riemannian SGD (RSGD) algorithm, a generalization of the (Euclidean) SGD algorithm to Riemannian manifolds. However, such generalizations of Euclidean adaptive gradient algorithms remain relatively unexplored. A key difficulty in this regard is the absence of a canonical coordinate system on manifolds and the inherent non-linear geometry of the manifold. Recent works in this direction compute Euclidean-style adaptive weights in the Riemannian setting, ignoring the geometry of the underlying manifold. Roy et al. (2018) propose an RMSProp-style algorithm for manifolds, viewing the gradient as a vector and computing a corresponding adaptive weight vector. On the other hand, Cho & Lee (2017) and Bécigneul & Ganea (2019) adapt the step size by computing a scalar weight, instead of directly adapting the gradient, in momentum-based Riemannian ADAM/AMSGrad-style algorithms.

We develop a novel approach of adapting the Riemannian gradient, which allows us to exploit the structure of the underlying manifold. In particular, we propose to adapt the row and column subspaces of the Riemannian stochastic gradient G. Euclidean adaptive algorithms compute (positive) adaptive weight vectors for the gradient vectors. Our model instead computes left and right adaptive weight matrices for G, denoted by L and R, respectively: L adapts the row subspace of G and R adapts the column subspace of G. Both L and R are positive definite matrices and are computed using the row covariance and column covariance matrices of G, respectively. For computational efficiency, we model L and R as diagonal matrices, taking a cue from AdaGrad (Duchi et al., 2011). Overall, we propose computationally efficient Riemannian adaptive stochastic gradient algorithms, henceforth collectively termed RASA.

Under a set of mild conditions, our analysis guarantees the convergence of our algorithms with a rate of order O(log(T)/√T), where T is the number of iterations. To the best of our knowledge, ours is the first Riemannian adaptive gradient algorithm with a convergence analysis for the non-convex stochastic optimization setting. Among the existing works, Bécigneul & Ganea (2019) provide convergence analysis only for geodesically convex functions.

Empirically, we compare our algorithms with the existing Riemannian adaptive gradient algorithms. In applications such as principal components analysis and matrix completion, we observe that our algorithms perform better than the baselines in most experiments both on synthetic and real-world datasets.

The main contributions of this work are:
  • we propose a principled approach for modeling adaptive weights for the Riemannian stochastic gradient. We model adaptive weight matrices for the row and column subspaces, exploiting the geometry of the manifold.
  • we develop efficient Riemannian adaptive stochastic gradient algorithms based on the proposed modeling approach.
  • we provide convergence analysis of our algorithms under a set of mild conditions. Our algorithms achieve a rate of convergence of order O(log(T)/√T), where T is the number of iterations, for non-convex stochastic optimization.

The paper is organized as follows. Section 2 discusses preliminaries of the Riemannian (stochastic) optimization framework and adaptive stochastic gradient algorithms in the Euclidean space. In Section 3, we present our Riemannian adaptive stochastic algorithms. Section 4 provides the convergence analysis of the proposed algorithms. Related works are discussed in Section 5, while empirical results are presented in Section 6. Section 7 concludes the paper.

2 Preliminaries

In this section, we briefly summarize various concepts of Riemannian geometry and optimization. We refer the interested readers to (Absil et al., 2008) for more details.

Notions on Riemannian manifolds

Informally, a manifold is a generalization of the notion of a surface to higher dimensions. A manifold M of dimension d can be locally approximated by ℝ^d. For example, the Stiefel manifold St(r, n) is the set of n × r orthonormal matrices and has dimension d = nr − r(r+1)/2 (Absil et al., 2008). The first-order approximation of M around a point x ∈ M is a d-dimensional vector space known as the tangent space T_xM. The tangent space T_xM contains all tangent vectors to M at x. Corresponding to each tangent space T_xM, we define an inner product ⟨·, ·⟩_x, which varies smoothly with x. This inner product is termed the Riemannian metric in differential geometry. A Riemannian manifold comprises a manifold M along with the collection of Riemannian metrics ⟨·, ·⟩_x.

It should be noted that the Riemannian metric induces a norm on the tangent space T_xM at x: ‖ξ‖_x = √⟨ξ, ξ⟩_x, where ξ ∈ T_xM. This in turn induces a (local) distance function on M, which is employed to compute the length of a path between two points on M. A geodesic between two points is a path of shortest length on the manifold, generalizing the notion of a straight line in Euclidean space to manifolds.

Minimizing an objective function f over the manifold M requires the notion of gradients at every point x ∈ M. Correspondingly, the Riemannian gradient grad f(x) at x ∈ M is defined as an element of the tangent space T_xM.

Riemannian stochastic gradient update

The Riemannian stochastic gradient descent (RSGD) update (Bonnabel, 2013) is given by

x_{t+1} = R_{x_t}(−α_t grad f_t(x_t)),     (1)

where α_t is a (decaying) step size. The term grad f_t(x_t) represents a Riemannian stochastic gradient and is typically computed as the Riemannian gradient of the objective function on a batch of data B_t, i.e., grad f(x_t; B_t). The retraction R_x : T_xM → M maps the tangent space onto M with a local rigidity condition that preserves the gradients at x (Absil et al., 2008, Sec. 4.1). The exponential mapping is an instance of the retraction.
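As a concrete illustration (not from the paper), the sketch below performs one RSGD step of the form (1) on the Stiefel manifold, using the standard projection of a Euclidean gradient onto the tangent space and a QR-based retraction; the function names and the choice of retraction are ours.

```python
import numpy as np

def stiefel_proj(X, E):
    # Project a Euclidean gradient E onto the tangent space of the Stiefel
    # manifold at X (embedded metric): E - X sym(X^T E).
    XtE = X.T @ E
    return E - X @ ((XtE + XtE.T) / 2)

def qf_retraction(X, xi):
    # QR-based retraction: the orthonormal factor of X + xi, with signs fixed
    # so that the triangular factor has a positive diagonal.
    Q, R = np.linalg.qr(X + xi)
    return Q * np.sign(np.sign(np.diag(R)) + 0.5)

def rsgd_step(X, egrad, alpha):
    # One RSGD step as in (1): retract along the negative Riemannian gradient.
    rgrad = stiefel_proj(X, egrad)
    return qf_retraction(X, -alpha * rgrad)
```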

It should be noted that when the manifold M is the Euclidean space ℝ^d and the Riemannian metric is the standard Euclidean inner product, the RSGD update (1) simplifies to the (vanilla) stochastic gradient descent (SGD) update (Bottou et al., 2018):

x_{t+1} = x_t − α_t ∇f_t(x_t),     (2)

where ∇f_t(x_t) is a stochastic gradient, commonly computed as the gradient of the objective function on a batch of data B_t, i.e., ∇f(x_t; B_t). The retraction in the Euclidean space is given by R_x(ξ) = x + ξ.

Euclidean adaptive stochastic gradient updates

Adaptive stochastic gradient methods employ the past gradients to compute a local distance measure and subsequently rescale the learning rate of each coordinate of the (Euclidean) model parameter. The AdaGrad algorithm (Duchi et al., 2011) introduced the following update, when x_t ∈ ℝ^d:

x_{t+1} = x_t − α_t V_t^(−1/2) g_t,   with  v_t = v_{t−1} + g_t ⊙ g_t,     (3)

where g_t = ∇f_t(x_t), V_t is the diagonal matrix corresponding to the vector v_t, i.e., V_t = Diag(v_t), and the symbol '⊙' denotes the entry-wise product (Hadamard product). RMSProp (Tieleman & Hinton, 2012) proposed an exponentially moving average of the adaptive term as follows: v_t = β v_{t−1} + (1 − β) g_t ⊙ g_t, where β is another hyper-parameter.
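For reference, a minimal NumPy sketch of the two Euclidean preconditioners described above; the names alpha, beta, and eps are illustrative hyper-parameters, not taken from the paper.

```python
import numpy as np

def adagrad_step(x, g, v, alpha=0.01, eps=1e-8):
    # AdaGrad: accumulate squared gradients and rescale each coordinate, as in (3).
    v = v + g * g
    return x - alpha * g / (np.sqrt(v) + eps), v

def rmsprop_step(x, g, v, alpha=0.01, beta=0.99, eps=1e-8):
    # RMSProp: exponentially moving average of squared gradients.
    v = beta * v + (1.0 - beta) * g * g
    return x - alpha * g / (np.sqrt(v) + eps), v
```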

Various stochastic gradient algorithms also employ an adaptive momentum term, ADAM (Kingma & Ba, 2015) being the most popular of them. In this paper, we restrict our discussion to algorithms without a momentum term because the calculation of the momentum term needs an additional vector transport operation in every iteration (Absil et al., 2008, Sec. 8.1). However, we include existing adaptive momentum based Riemannian stochastic gradient algorithms (Bécigneul & Ganea, 2019) as baselines in our experiments.

3 Riemannian adaptive stochastic gradient

In this section, we propose novel adaptive stochastic gradient algorithms in the Riemannian setting. We first introduce a few notations, which will be useful in the rest of the manuscript. For given vectors a and b, √a (or a^(1/2)) denotes the element-wise square root, a² the element-wise square, a/b element-wise division, and max(a, b) the element-wise maximum. For a given matrix A, vec(A) denotes its vectorial representation. The term I_d represents the identity matrix of dimension d.

We consider the following problem

min_{x ∈ M} f(x),     (4)

where elements of the manifold M are represented as matrices of size n × r. Such a geometry is true for most Riemannian manifolds employed in machine learning applications. Prominent examples of such manifolds include the Stiefel, Grassmann, spherical, symmetric positive definite, spectrahedron, and hyperbolic manifolds, to name a few.

In this paper, we are interested in a stochastic optimization setting where we assume that, at each time step t, the algorithm computes a feasible point x_{t+1} ∈ M given the current iterate x_t ∈ M. For simplicity, the Riemannian stochastic gradient at x_t is denoted by G_t, an n × r matrix. We exploit the matrix structure of G_t by proposing separate adaptive weight matrices corresponding to the row and column subspaces, L_t (of size n × n) and R_t (of size r × r), respectively. These weights are computed in an exponentially weighted manner as

L_t = β L_{t−1} + (1 − β) G_t G_t^T,   R_t = β R_{t−1} + (1 − β) G_t^T G_t,     (5)

where β is a hyper-parameter. It should be noted that L_t and R_t correspond to row and column covariance matrices, respectively. We propose to adapt the Riemannian gradient as

G̃_t = L_t^(−1/4) G_t R_t^(−1/4).     (6)

(For numerical stability, a small ε is added to the diagonal entries of L_t and R_t before computing their roots in (6).) We observe that the weight matrices L_t and R_t suitably modify the Riemannian gradient while respecting the matrix structure of G_t, rather than plainly viewing G_t as a vector in ℝ^(nr). For instance, on the Grassmann manifold, our right adaptive weight matrix R_t essentially provides different learning rates to different column subspaces instead of giving different learning rates to the individual entries (coordinates) of G_t. Our framework also allows adapting only the row or the column subspace of G_t by simply substituting R_t = I_r or L_t = I_n, respectively. In vectorial representation, the adaptive update (6) may be viewed as

vec(G̃_t) = (R_t ⊗ L_t)^(−1/4) vec(G_t),     (7)

where '⊗' denotes the Kronecker product.
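A small numerical check (not from the paper) of the equivalence between the matrix form (6) and the vectorized form (7), using column-major vec; here L and R are made symmetric positive definite by adding a ridge term.

```python
import numpy as np

def spd_power(A, p):
    # Fractional power of a symmetric positive definite matrix via eigendecomposition.
    w, V = np.linalg.eigh(A)
    return (V * w ** p) @ V.T

rng = np.random.default_rng(0)
n, r = 5, 3
G = rng.standard_normal((n, r))
L = G @ G.T + np.eye(n)   # row covariance plus a ridge (SPD)
R = G.T @ G + np.eye(r)   # column covariance plus a ridge (SPD)

lhs = (spd_power(L, -0.25) @ G @ spd_power(R, -0.25)).flatten(order="F")  # matrix form (6)
rhs = spd_power(np.kron(R, L), -0.25) @ G.flatten(order="F")              # vectorized form (7)
print(np.allclose(lhs, rhs))  # expected: True
```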

It should be noted that G̃_t does not, in general, lie on the tangent space T_{x_t}M. Hence, the proposed adaptive Riemannian gradient is P_{x_t}(G̃_t), where P_{x_t} is the linear operator that projects onto the tangent space T_{x_t}M. Overall, the proposed update is

x_{t+1} = R_{x_t}(−α_t P_{x_t}(G̃_t)),     (8)

where G̃_t is defined as in (6).

Diagonal adaptation and the proposed algorithm

The proposed full-matrix update (8) is computationally prohibitive for high-dimensional data. Hence, taking a cue from the Euclidean adaptive gradient algorithms (Duchi et al., 2011), we propose a diagonal adaptation of the adaptive weight matrices. In particular, we model L_t and R_t as diagonal matrices corresponding to vectors l_t and r_t, which are computed as

l_t = β l_{t−1} + (1 − β) diag(G_t G_t^T),   r_t = β r_{t−1} + (1 − β) diag(G_t^T G_t),     (9)

where the function diag(·) returns the diagonal vector of a square matrix. The diagonal adaptation significantly reduces the effective dimension of the adaptive weights to n + r and the computational complexity of adapting G_t with the (diagonal) weight matrices to O(nr).

The final algorithm is summarized in Algorithm 1 and is henceforth referred to as RASA (Riemannian Adaptive Stochastic gradient Algorithm). RASA generalizes Euclidean AdaGrad-type and RMSProp-type algorithms to Riemannian matrix manifolds.

We employ l̂_t = max(l̂_{t−1}, l_t) and r̂_t = max(r̂_{t−1}, r_t) in our final update (step 8, Algorithm 1). The non-decreasing sequences of adaptive weights (along with a non-increasing sequence of step sizes α_t) ensure the convergence of RASA. Such a sequence of adaptive weights was introduced by Reddi et al. (2018) in ADAM to provide convergence guarantees. We present our convergence analysis of Algorithm 1 in Section 4.

We end this section by noting that the effective dimension of our adaptive weights (vectors l_t and r_t) is n + r, which is less than the ambient dimension nr of the Riemannian gradient G_t. In Euclidean algorithms such as ADAM, AMSGrad (Reddi et al., 2018), AdaGrad, or RMSProp, the effective dimension of the adaptive weights is the same as that of the Euclidean gradient. To observe the effect of the proposed reduction in the dimension of adaptive weights in the Riemannian setting, we develop a baseline Riemannian adaptive gradient algorithm, RASA-vec, which follows RMSProp's procedure of computing adaptive weights from the Riemannian gradient. Hence, for RASA-vec, the Riemannian gradient is adapted as

vec(G̃_t) = V_t^(−1/2) vec(G_t),     (10)

where the adaptive weight matrix is V_t = Diag(v_t), with v_t = β v_{t−1} + (1 − β) vec(G_t) ⊙ vec(G_t). Figure 1 compares the performance of the proposed RASA against RASA-vec and RSGD on principal component analysis and matrix completion problems. We develop three variants of RASA: RASA-R (adapts only the column subspace), RASA-L (adapts only the row subspace), and RASA-LR (adapts both row and column subspaces). The results illustrate the benefits of the proposed modeling approach. Our experiments confirm this observation across datasets.

0:  Step size sequence {α_t}, hyper-parameter β.
1:  Initialize x_1 ∈ M, l_0 = l̂_0 = 0 ∈ ℝ^n, r_0 = r̂_0 = 0 ∈ ℝ^r.
2:  for t = 1, 2, …, T do
3:     Compute the Riemannian stochastic gradient G_t at x_t.
4:     Update l_t = β l_{t−1} + (1 − β) diag(G_t G_t^T).
5:     Calculate l̂_t = max(l̂_{t−1}, l_t).
6:     Update r_t = β r_{t−1} + (1 − β) diag(G_t^T G_t).
7:     Calculate r̂_t = max(r̂_{t−1}, r_t).
8:     x_{t+1} = R_{x_t}(−α_t P_{x_t}(Diag(l̂_t)^(−1/4) G_t Diag(r̂_t)^(−1/4))).
9:  end for
Algorithm 1 Riemannian adaptive stochastic algorithm
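The following NumPy sketch gives one possible reading of Algorithm 1 (diagonal RASA), with the left/right fourth-root scaling and max-based weights written out as assumptions; proj and retr stand for the manifold's tangent-space projection and retraction, rgrad supplies a Riemannian stochastic gradient, and eps is an illustrative stability constant.

```python
import numpy as np

def rasa(x0, rgrad, proj, retr, n_iters=1000, alpha0=0.01, beta=0.99, eps=1e-8):
    # x0: n-by-r matrix on the manifold; rgrad(x, t): Riemannian stochastic gradient;
    # proj(x, Z): projection onto the tangent space at x; retr(x, xi): retraction.
    x = x0
    n, r = x0.shape
    l = np.zeros(n); l_hat = np.zeros(n)
    rvec = np.zeros(r); r_hat = np.zeros(r)
    for t in range(1, n_iters + 1):
        G = rgrad(x, t)
        l = beta * l + (1 - beta) * np.einsum('ij,ij->i', G, G)       # diag(G G^T)
        l_hat = np.maximum(l_hat, l)                                   # non-decreasing weights
        rvec = beta * rvec + (1 - beta) * np.einsum('ij,ij->j', G, G)  # diag(G^T G)
        r_hat = np.maximum(r_hat, rvec)
        # Left/right diagonal scaling (assumed fourth-root form), then project and retract.
        G_ad = G / ((l_hat + eps) ** 0.25)[:, None] / ((r_hat + eps) ** 0.25)[None, :]
        alpha = alpha0 / np.sqrt(t)                                    # decaying step size
        x = retr(x, -alpha * proj(x, G_ad))
    return x
```

Used together with the stiefel_proj and qf_retraction sketches from Section 2, this would run RASA-LR on the Stiefel manifold; fixing l_hat (resp. r_hat) to all ones recovers the RASA-R (resp. RASA-L) variant.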


(a) Principal component analysis on the Stiefel manifold.


(b) Low-rank matrix completion on the Grassmann manifold.
Figure 1: The proposed algorithms RASA-L/R/LR perform better than RASA-vec, which ignores the underlying geometry of the manifold while computing adaptive weights. We also observe that adapting both row and column subspaces (RASA-LR) is usually better than adapting only one of them (RASA-L and RASA-R). Additional plots in this regard are presented in Section B.

4 Convergence rate analysis

For the purpose of analyzing the convergence rate of the proposed Algorithm 1, we view our proposed update, step 8 in Algorithm 1, as

x_{t+1} = R_{x_t}(−α_t P_{x_t}(V̂_t^(−1/2) g_t)),     (11)

where t represents the iteration number, g_t = vec(G_t), and V̂_t is defined as

V̂_t = Diag(v̂_t),  with  v̂_t = r̂_t^(1/2) ⊗ l̂_t^(1/2),     (12)

where the symbol '⊗' represents the Kronecker product and g_t denotes the vectorized representation of G_t. Furthermore, grad f(x_t) denotes the vectorized representation of the full Riemannian gradient at x_t. Hence, x_t, g_t, grad f(x_t), and v̂_t are nr-dimensional vectors in our analysis. We first summarize essential assumptions and a lemma before discussing our main results.

Note that our analysis also extends to that of RASA-vec directly.

4.1 Definitions and assumptions

For simplicity, the analysis hereinafter assumes the standard Riemannian metric ⟨ξ, η⟩_x = ξ^T η (Euclidean inner product) and the induced norm ‖ξ‖ = √(ξ^T ξ) (Euclidean norm), where ξ, η ∈ T_xM.

Definition 4.1.

Upper-Hessian bounded (Absil et al., 2008). The function f is said to be upper-Hessian bounded in U ⊂ M with respect to a retraction R if there exists a constant L > 0 such that d²f(R_x(tη))/dt² ≤ L, for all x ∈ U and η ∈ T_xM with ‖η‖ = 1, and all t such that R_x(τη) ∈ U for all τ ∈ [0, t].

The above class of functions in the Riemannian setting corresponds to the set of continuous functions with Lipschitz continuous gradients in the Euclidean space.

We now state our assumptions on problem (4).

Assumption 1.

For problem (4), we assume the following:

(A1) The function f is continuously differentiable and lower bounded, i.e., f(x*) > −∞, where x* is an optimal solution of (4).

(A2) The function f has G-bounded Riemannian stochastic gradients, i.e., ‖G_t‖_F ≤ G for all t, or equivalently ‖g_t‖ ≤ G.

(A3) The function is upper-Hessian bounded.

Existing works (Zhou et al., 2018; Chen et al., 2019), which focus on the convergence of (Euclidean) ADAM (Kingma & Ba, 2015), use a Euclidean variant of A2. Furthermore, A2 holds when the manifold is compact, such as the Grassmann manifold (Absil et al., 2008), or can be enforced through a slight modification of the objective function and the algorithm (Kasai et al., 2018).

Finally, we define retraction L-smooth functions via the following lemma.

Lemma 4.2.

Retraction L-smooth (Huang et al., 2015; Kasai et al., 2018). Suppose Assumption 1 holds. Then, for all x ∈ M, ξ ∈ T_xM, and the constant L in Definition 4.1, we have

f(z) ≤ f(x) + ⟨grad f(x), ξ⟩ + (L/2)‖ξ‖²,     (13)

where z = R_x(ξ). In particular, such a function f is called retraction L-smooth with respect to R.

The proof of Lemma 4.2 is provided in (Kasai et al., 2018, Lemma 3.5).

4.2 Convergence analysis of Algorithm 1

Our proof structure extends the existing convergence analyses of ADAM-type algorithms in the Euclidean space (Zhou et al., 2018; Chen et al., 2019) to the Riemannian setting. However, such an extension is non-trivial, and the key challenges include

  • the update formula based on l̂_t and r̂_t in Algorithm 1 requires an upper bound on the elements of v̂_t defined in (12);

  • projection onto the tangent space: the Riemannian gradient adaptively weighted by l̂_t and r̂_t needs to be projected back onto the tangent space, and our analysis needs to additionally take care of this.

To this end, we first provide a lemma for the upper bounds of the full Riemannian gradient as well as the elements of the vector v̂_t, where l̂_t and r̂_t are defined in steps 5 and 7 of Algorithm 1.

Lemma 4.3.

Let v̂_t = r̂_t^(1/2) ⊗ l̂_t^(1/2), where l̂_t and r̂_t are defined in steps 5 and 7 of Algorithm 1. Then, under Assumption 1, we have the following results:
    ‖grad f(x_t)‖ ≤ G, and
    the i-th element of v̂_t satisfies (v̂_t)_i ≤ G².

We now present our main result in Theorem 4.4.

Theorem 4.4.

Let {x_t} and {v̂_t} be the sequences obtained from Algorithm 1, where the step sizes α_t are non-increasing. Then, under Assumption 1, we have the following result for Algorithm 1:

(14)

where the last term is a constant independent of T.

Proof sketch: We first derive a retraction L-smooth inequality from Lemma 4.2 with respect to the adaptive gradient P_{x_t}(V̂_t^(−1/2) g_t). Exploiting the symmetry of P_{x_t}, we obtain an upper bound on the resulting inner-product term. Here, to remove the dependency of V̂_t on g_t, V̂_{t−1} is evaluated instead of V̂_t. Then, taking the expectation and telescoping the inequality, we obtain the desired result. The complete proof is in Section A.

Remark 1: The first term of (14) represents a weighted sum of squared step lengths and also appears in the analysis of standard RSGD (Bonnabel, 2013). From this point of view, the advantage of the adaptive gradient algorithms is that they reduce the effect of this term compared with the standard RSGD algorithm.

Remark 2: Theorem 4.4 reproduces the results of (Chen et al., 2019, Theorem 3.1) when M is the Euclidean space.

In Corollary 4.5, we derive a convergence rate of Algorithm 1.

Corollary 4.5 (Convergence rate of Algorithm 1).

Let {x_t} and {v̂_t} be the sequences obtained from Algorithm 1, where α_t = α/√t. Suppose each element of v̂_t is lower-bounded by a constant c > 0, and let d denote the dimension of the manifold M. Then, under Assumption 1, the output x_out of Algorithm 1 satisfies

(15)

where the constants are given by

(16)

The proof of Corollary 4.5 is in Section A.

5 Related works

The asymptotic convergence of stochastic gradient descent on Riemannian manifolds (RSGD) was first proved by Bonnabel (2013). Among the first works that aim to adapt stochastic gradients in optimization on Riemannian matrix manifolds is the cRMSProp algorithm (Roy et al., 2018), which effectively perceives adaptive weights as vectors. More concretely, if elements in M have a matrix representation of size n × r, then cRMSProp computes an adaptive weight matrix of size n × r and adapts the gradient matrix entry-wise, in the spirit of RMSProp. It can be observed that this interpretation of adaptive weights ignores the matrix structure (the row and column subspaces) of the gradient, and processes adaptive weights similarly to the Euclidean algorithms discussed in Section 2. Roy et al. (2018) also constrain the adaptive weight matrix to lie on the tangent plane, thereby requiring computationally expensive vector transport operations in each iteration. It should also be noted that Roy et al. (2018) do not provide a convergence analysis for cRMSProp.

Another direction of work by Cho & Lee (2017) and Bécigneul & Ganea (2019) derives the ADAM algorithm (Kingma & Ba, 2015) on matrix manifolds. In particular, they employ a momentum term along with the Riemannian gradient to compute the search direction. However, the adaptive weight vector of the ADAM algorithm is substituted with a scalar weight, which effectively adapts the step size rather than the search direction. While Cho & Lee (2017) do not discuss convergence analysis, Bécigneul & Ganea (2019) provide convergence guarantees limited to geodesically convex functions (Zhang & Sra, 2016). It should be noted that both require vector transport operations in each iteration due to the momentum term.

Our approach, on the other hand, preserves the underlying matrix structure of the manifold and computes adaptive weight matrices corresponding to the row and column subspaces of the Riemannian gradient. We do not enforce any constraint on the adaptive weight matrices and avoid parallel transport altogether. Like most optimization algorithms on manifolds, our algorithms can easily be generalized to products of manifolds.



(a) Case P1: Synthetic dataset.


(b) Case P2: MNIST dataset.


(c) Case P3: COIL20 dataset.
Figure 2: Performance on the PCA datasets (numbers in the parentheses within legends show the best-tuned initial step size α_0).

Existing works have also explored other directions of improvement to RSGD in specific settings, similar to the Euclidean counterpart. These include variance reduction (Zhang et al., 2016; Sato et al., 2017), averaged RSGD (Tripuraneni et al., 2018), and recursive gradients (Kasai et al., 2018), among others.



Figure 3: Performance on the COIL20 dataset across different values of the initial step size α_0 (Case P3); panels (a)–(d) correspond to different values of α_0.

6 Experiments

In this section, we compare our proposed adaptive algorithm, RASA, and its variants with the following baseline Riemannian stochastic algorithms.

  • RSGD (Bonnabel, 2013) is the vanilla Riemannian stochastic gradient algorithm.

  • cRMSProp proposed by Roy et al. (2018).

  • cRMSProp-M: a variant of cRMSProp, which uses Riemannian gradients for adaptive weights computations instead of the Euclidean gradients.

  • Radagrad proposed by Bécigneul & Ganea (2019) considers a scalar adaptive weight. Hence, it adapts the step size instead of the gradient.

  • Radam and Ramsgrad by Bécigneul & Ganea (2019): include momentum terms similar to ADAM and AMSGrad. Additionally, like Radagrad, they adapt only the step size.

We compare the baselines with our proposed variants RASA-L, RASA-R, and RASA-LR that either adapt the row (left) subspace, column (right) subspace, or both (left and right subspaces).

All the considered algorithms are implemented in the Matlab toolbox Manopt (Boumal et al., 2014).

The algorithms are evaluated in terms of the number of iterations and are initialized from the same point. The algorithms are stopped when the iteration count reaches a predefined maximum. We fix the batch size across experiments (a different batch size is used for the MovieLens datasets). The step size sequence is generated as α_t = α_0/√t (Chen et al., 2019; Reddi et al., 2018), where t is the iteration count. The initial step size α_0 is tuned over multiple values; performance of the algorithms across different values of α_0 is also provided in the supplementary material. The hyper-parameter β for the adaptive algorithms (all except RSGD) is fixed across experiments, as is the additional momentum-related parameter used in Radam and Ramsgrad. Our simulations are performed in Matlab on a 4.0 GHz Intel Core i7 machine with 32 GB RAM.

We address the principal component analysis (PCA) and the independent component analysis (ICA) problems on the Stiefel manifold St(r, n): the set of orthonormal r-frames in ℝ^n for some r ≤ n (Absil et al., 2008, Sec. 3.3). The elements of St(r, n) are represented as matrices of size n × r. We also consider the low-rank matrix completion (MC) problem on the Grassmann manifold Gr(r, n), which is the set of r-dimensional subspaces in ℝ^n and is a Riemannian quotient manifold of the Stiefel manifold (Absil et al., 2008, Sec. 3.4).

The details of the manifolds and the derivations of the Riemannian gradients are provided in Sections D and E.

6.1 Principal components analysis

Given a set of data points z_1, …, z_N ∈ ℝ^n, the PCA problem amounts to learning an orthogonal projector UU^T, with U ∈ St(r, n), that minimizes the sum of squared residual errors between the projected data points and the original data points. It is formulated as min_{U ∈ St(r,n)} (1/N) Σ_i ‖z_i − UU^T z_i‖². This problem is equivalent to min_{U ∈ St(r,n)} −(1/N) Σ_i z_i^T UU^T z_i.
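A small sketch of a stochastic Riemannian gradient oracle for this PCA cost (batch form) on the Stiefel manifold; the mini-batch sampling and function names here are illustrative, not the paper's exact setup.

```python
import numpy as np

def pca_rgrad_factory(Z, batch_size=10, seed=0):
    # Z: N-by-n data matrix; cost f(U) = -(1/N) sum_i z_i^T U U^T z_i for U on St(r, n).
    rng = np.random.default_rng(seed)
    N = Z.shape[0]
    def rgrad(U, t):
        batch = Z[rng.choice(N, size=batch_size, replace=False)]
        egrad = -(2.0 / batch.shape[0]) * batch.T @ (batch @ U)  # Euclidean gradient
        UtE = U.T @ egrad                                        # project onto the tangent
        return egrad - U @ ((UtE + UtE.T) / 2)                   # space at U (Stiefel)
    return rgrad
```

Plugging this oracle into the rasa sketch above (with the Stiefel projection and QR retraction) would run the proposed method on this problem, up to the hyper-parameters tuned in the experiments.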



(a) Case I1: YaleB dataset.


(b) Case I2: CIFAR-100 dataset.
Figure 4: Performance on the ICA problem (numbers in the parentheses within legends show the best-tuned initial step size α_0).

We first evaluate our proposed algorithms on a synthetic dataset (Case P1). The initial step size α_0 is selected from a range of values. Figure 2 (a) shows the optimality gap with the best-tuned step sizes for each algorithm. The minimum loss for the optimality-gap computation is obtained by the Matlab command pca. We observe that our algorithms RASA-L/R/LR obtain the best solutions. On the other hand, cRMSProp and its modified variant cRMSProp-M converge poorly. Radagrad and its momentum-based variants (Ramsgrad and Radam) initially proceed well but finally converge to a solution slightly worse than RSGD.

We additionally evaluate RASA on the MNIST (Case P2) and COIL20 (Case P3) datasets in the same setting as described in Case P1. MNIST (LeCun et al.) contains handwritten-digit images, with 60,000 images for training and 10,000 images for testing, each of size 28 × 28. COIL20 (Nene et al., 1996) contains normalized camera images of 20 objects taken from different angles, which we resize to a smaller resolution. In both MNIST and COIL20, Figures 2 (b)-(c), we observe that RASA-LR outperforms the other baselines. This emphasizes the benefit of adapting the row and column subspaces over adapting individual entries on Riemannian manifolds.

In Figure 3, we show the performance of the algorithms on the COIL20 dataset when initialized with different values of α_0. It can be seen that the proposed algorithms are more robust to changes in α_0. As discussed before, RASA-LR obtains the best result on this dataset.



(a) MovieLens-1M (train).


(b) MovieLens-1M (test).


(c) MovieLens-10M (train).


(d) MovieLens-10M (test).
Figure 5: Performance on the MovieLens datasets (numbers in the parentheses within legends show the best-tuned initial step size α_0). Our proposed algorithm RASA-LR performs competitively and shows faster convergence.

6.2 Joint diagonalization in ICA

The ICA or blind source separation problem refers to separating a signal into components so that the components are as independent as possible (Hyvärinen & Oja, 2000). A particular preprocessing step is the whitening step, which has been proposed through joint diagonalization on the Stiefel manifold (Theis et al., 2009). To that end, the optimization problem is min_{U ∈ St(r,n)} −Σ_i ‖ddiag(U^T C_i U)‖²_F, where ‖ddiag(A)‖²_F denotes the sum of the squared diagonal elements of A. The symmetric matrices C_i are of size n × n and can be cumulant matrices or time-lagged covariance matrices of different signal samples (Theis et al., 2009).
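A sketch of this joint-diagonalization cost and its Euclidean gradient, written as a minimization (the sign convention and symbol names are assumptions); the Riemannian gradient again follows by projecting onto the tangent space of the Stiefel manifold as in the earlier sketches.

```python
import numpy as np

def jd_cost_egrad(U, Cs):
    # Cs: list of symmetric n-by-n matrices; cost = -sum_i ||diag(U^T C_i U)||^2.
    cost, egrad = 0.0, np.zeros_like(U)
    for C in Cs:
        d = np.diag(U.T @ C @ U)            # diagonal of U^T C_i U
        cost -= np.sum(d ** 2)
        egrad -= 4.0 * C @ U @ np.diag(d)   # gradient of -||diag(U^T C_i U)||^2 (C symmetric)
    return cost, egrad
```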

We use the YaleB dataset (Georghiades et al., 2001) (Case I1), which contains images of human subjects under different poses and illumination conditions. From this dataset, we create region covariance matrix (RCM) descriptors (Porikli & Tuzel, 2006; Tuzel et al., 2006; Pang et al., 2008) for the images. The initial step size α_0 is selected from multiple values. We also use the CIFAR-100 dataset (Krizhevsky & Hinton, 2009) (Case I2), which consists of 60,000 color images of size 32 × 32 in 100 classes, with 600 images per class; RCM descriptors are likewise created from these images. We show plots of the relative optimality gap (the ratio of the optimality gap to the optimal objective value) for all the algorithms in Figure 4. The optimal solution value for the optimality-gap computation is obtained from the algorithm proposed in (Theis et al., 2009). We observe that adapting both left and right subspaces (RASA-LR) yields the best results.

6.3 Low-rank matrix completion (MC) on MovieLens

The low-rank matrix completion problem amounts to completing an incomplete matrix X, say of size n × m, from a small number of observed entries by assuming a low-rank model for X. If Ω is the set of indices for which we know the entries in X, then the rank-r MC problem amounts to finding matrices U ∈ ℝ^(n×r) and W ∈ ℝ^(r×m) that minimize the error ‖P_Ω(UW) − P_Ω(X)‖²_F, where P_Ω retains the entries indexed by Ω and sets the rest to zero. Partitioning X = [x_1, …, x_m] and exploiting the least-squares structure (Boumal & Absil, 2015), the problem is equivalent to min_{U ∈ Gr(r,n)} (1/m) Σ_i ‖P_{Ω_i}(U w_i − x_i)‖², where w_i solves the corresponding least-squares problem for the i-th column and Ω_i is the set of observed indices for the i-th column.
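A sketch of evaluating this partitioned cost, and the corresponding Euclidean gradient in U, for a single column; the closed-form regularized least-squares solve and the variable names are assumptions, and the Riemannian (Grassmann) gradient would be obtained by a further projection.

```python
import numpy as np

def mc_column_cost_egrad(U, x_obs, idx, lam=1e-6):
    # U: n-by-r basis; x_obs: observed values of one column at row indices idx.
    U_om = U[idx, :]
    # w = argmin_w ||U_om w - x_obs||^2 + lam ||w||^2  (regularized least squares).
    w = np.linalg.solve(U_om.T @ U_om + lam * np.eye(U.shape[1]), U_om.T @ x_obs)
    res = U_om @ w - x_obs
    cost = np.sum(res ** 2) + lam * np.sum(w ** 2)
    egrad = np.zeros_like(U)
    # By the envelope theorem (w is optimal), the gradient w.r.t. U is the partial derivative.
    egrad[idx, :] = 2.0 * np.outer(res, w)
    return cost, egrad
```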

We show results on the MovieLens datasets (Harper & Konstan, 2015), which consist of ratings of movies given by different users. The MovieLens-10M dataset consists of about 10 million ratings and the MovieLens-1M dataset of about 1 million ratings. We randomly split the data into train/test partitions. We run the different algorithms for a fixed rank r and regularization parameter as suggested by Boumal & Absil (2015). Figures 5 (a)-(d) show that our algorithm RASA-LR achieves faster convergence during the training phase and the best generalization performance on the test set. cRMSProp and cRMSProp-M did not run successfully on the MovieLens datasets.

7 Discussion and conclusion

We observe that the proposed modeling of adaptive weights on the row and column subspaces obtains better empirical performance than the existing baselines (Roy et al., 2018; Cho & Lee, 2017; Bécigneul & Ganea, 2019). In particular, we do not observe an additional benefit from the adaptive momentum methods (Radam and Ramsgrad) in the Riemannian setting once the adaptive updates exploit the geometry.

Overall, we presented a principled approach for modeling adaptive weights for Riemannian stochastic gradient. We developed computationally efficient algorithms and discussed the convergence analysis. Our experiments validate the benefits of the proposed algorithms on different applications.

References

  • Absil et al. (2008) Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
  • Bécigneul & Ganea (2019) Bécigneul, B. and Ganea, O.-E. Riemannian adaptive optimization methods. In ICLR, 2019.
  • Bonnabel (2013) Bonnabel, S. Stochastic gradient descent on Riemannian manifolds. IEEE Trans. on Automatic Control, 58(9):2217–2229, 2013.
  • Bottou et al. (2018) Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Rev., 60(2):223–311, 2018.
  • Boumal & Absil (2015) Boumal, N. and Absil, P.-A. Low-rank matrix completion via preconditioned optimization on the Grassmann manifold. Linear Algebra and its Applications, 475:200–239, 2015.
  • Boumal et al. (2014) Boumal, N., Mishra, B., Absil, P.-A., and Sepulchre, R. Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res., 15(1):1455–1459, 2014.
  • Chen et al. (2019) Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of Adam-type algorithms for non-convex optimization. In ICLR, 2019.
  • Cho & Lee (2017) Cho, M. and Lee, J. Riemannian approach to batch normalization. In NIPS, 2017.
  • Duchi et al. (2011) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
  • Ganea et al. (2018) Ganea, O.-E., Bécigneul, B., and Hofmann, T. Hyperbolic entailment cones for learning hierarchical embeddings. In ICML, 2018.
  • Georghiades et al. (2001) Georghiades, A., Belhumeur, P., and Kriegman, D. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.
  • Goldberg et al. (2001) Goldberg, K., Roeder, T., Gupta, D., and Perkins, C. Eigentaste: a constant time collaborative filtering algorithm. Inform. Retrieval, 4(2):133–151, 2001.
  • Harandi et al. (2017a) Harandi, M., Salzmann, M., and Hartley, R. Joint dimensionality reduction and metric learning: A geometric take. In ICML, 2017a.
  • Harandi et al. (2017b) Harandi, M., Salzmann, M., and Hartley, R. Dimensionality reduction on SPD manifolds: The emergence of geometry-aware methods. IEEE Trans. Pattern Anal. Mach. Intell., 2017b.
  • Harper & Konstan (2015) Harper, F. M. and Konstan, J. A. The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems, 5(4):19:1–19:19, 2015.
  • Huang et al. (2015) Huang, W., Gallivan, K. A., and Absil, P.-A. A Broyden class of quasi-Newton methods for Riemannian optimization. SIAM J. Optim., 25(3):1660–1685, 2015.
  • Hyvärinen & Oja (2000) Hyvärinen, A. and Oja, E. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.
  • Kasai et al. (2018) Kasai, H., Sato, H., and Mishra, B. Riemannian stochastic recursive gradient algorithm. In ICML, 2018.
  • Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Krizhevsky & Hinton (2009) Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • LeCun et al. LeCun, Y., Cortes, C., and Burges, C. J. The MNIST database. http://yann.lecun.com/exdb/mnist/.
  • Lee (2003) Lee, J. M. Introduction to smooth manifolds, volume 218 of Graduate Texts in Mathematics. Springer-Verlag, New York, second edition, 2003.
  • McMahan & Streeter (2010) McMahan, H. B. and Streeter, M. Adaptive bound optimization for online convex optimization. In COLT. 2010.
  • Mishra & Sepulchre (2014) Mishra, B. and Sepulchre, R. R3MC: A Riemannian three-factor algorithm for low-rank matrix completion. In IEEE CDC, pp. 1137–1142, 2014.
  • Nene et al. (1996) Nene, S. A., Nayar, S. K., and Murase, H. Columbia university image library (COIL-20). Technical Report CUCS-005-96, 1996.
  • Nickel & Kiela (2017) Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. In NIPS, 2017.
  • Nickel & Kiela (2018) Nickel, M. and Kiela, D. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. In ICML, 2018.
  • Pang et al. (2008) Pang, Y., Yuan, Y., and Li, X. Gabor-based region covariance matrices for face recognition. IEEE Trans. Circuits Syst. Video Technol., 18(7):989–993, 2008.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. Glove: Global vectors for word representation. In EMNLP, 2014.
  • Porikli & Tuzel (2006) Porikli, F. and Tuzel, O. Fast construction of covariance matrices for arbitrary size image windows. In ICIP, 2006.
  • Reddi et al. (2018) Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In ICLR, 2018.
  • Roy et al. (2018) Roy, S. K., Mhammedi, Z., and Harandi, M. Geometry aware constrained optimization techniques for deep learning. In CVPR, 2018.
  • Sato et al. (2017) Sato, H., Kasai, H., and Mishra, B. Riemannian stochastic variance reduced gradient. arXiv preprint: arXiv:1702.05594, 2017.
  • Shah et al. (2018) Shah, V., Kyrillidis, A., and Sanghavi, S. Minimum norm solutions do not always generalize well for over-parameterized problems. arXiv preprint arXiv:1811.07055, 2018.
  • Theis et al. (2009) Theis, F. J., Cason, T. P., and Absil, P.-A. Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold. In ICA, 2009.
  • Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • Tripuraneni et al. (2018) Tripuraneni, N., Flammarion, N., Bach, F., and Jordan, M. I. Averaging stochastic gradient descent on Riemannian manifolds. In COLT, 2018.
  • Tuzel et al. (2006) Tuzel, O., Porikli, F., and Meer, P. Region covariance: a fast descriptor for detection and classification. In ECCV, 2006.
  • Wilson et al. (2017) Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. In NIPS. 2017.
  • Zaheer et al. (2018) Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. In NeurIPS. 2018.
  • Zeiler (2012) Zeiler, M. D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
  • Zhang & Sra (2016) Zhang, H. and Sra, S. First-order methods for geodesically convex optimization. In COLT, 2016.
  • Zhang et al. (2016) Zhang, H., Reddi, S. J., and Sra, S. Riemannian SVRG: fast stochastic optimization on Riemannian manifolds. In NIPS, 2016.
  • Zhou et al. (2018) Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.

Appendix A Proofs

This section provides the complete proofs of the lemmas and the convergence analysis.

Hereinafter, we use E[·] to express expectation with respect to the joint distribution of all random variables. For example, since x_t is determined by the realizations of the independent random variables up to iteration t − 1, the total expectation of f(x_t) for any t can be taken over all of these variables. We also use E_t[·] to denote an expected value taken with respect to the distribution of the random variable at iteration t.

A.1 Proof of Lemma 4.3

Proof.

For any , we have

Next, we derive the upper bound of the -th element () of , i.e., . Denoting as for simplicity, we have from the algorithm. Now, we first assume that . Then, addressing due to , we have

Consequently, for any , we have . Similarly, having for , and assuming , we have . Thus, we have .

Now, when , due to . Therefore, supposing that , and also that and , we have

Thus, for any , we have .

This completes the proof. ∎

A.2 Proof of Theorem 4.4

This subsection provides the complete proof of Theorem 4.4.

For this purpose, we first provide an essential lemma for the projection matrix, as presented below.

Lemma A.1 (Projection matrix).

Suppose that B_x is a matrix whose columns span the normal space of the tangent space T_xM at x, i.e., the tangent space is characterized by the constraint B_x^T ξ = 0. Then the projection matrix onto T_xM is derived as

P_x = I − B_x(B_x^T B_x)^(−1) B_x^T.

It should be noted that P_x is symmetric.

Now, we provide the proof of Theorem 4.4.

Proof of Theorem 4.4.

For simplicity, we assume that the parameter x_t represents its vectorized form throughout the analysis below. We also treat the Riemannian stochastic gradients and the Riemannian gradient at x_t in the vectorized forms of their original matrix representations, denoted g_t and grad f(x_t), respectively. Accordingly, we use a projection matrix P_{x_t} as in Lemma A.1 as the projection operator onto T_{x_t}M.

Since f is retraction L-smooth by Lemma 4.2, we have

(A.1)

We first bound the second term on the right-hand side above. Here, we notice that

where the first equality holds because P_{x_t} is symmetric, as stated in Lemma A.1, and the second equality holds because the Riemannian gradient grad f(x_t) lies on the tangent space, so that P_{x_t} grad f(x_t) is just grad f(x_t). This observation can also be obtained by noticing the form of P_{x_t} in Lemma A.1, as presented below.

where the first equality holds because the corresponding term belongs to the normal space of the tangent space T_{x_t}M, so its inner product with the tangent vector is zero. We can also bound the second term of (A.1) as

where the inequality holds because of the contraction property of the projection operator.

Consequently, (A.1) results in

(A.2)

Now, we consider the case (A.2) with . We have

Taking the expectation and rearranging the terms above yields