# Efficient Riemannian Optimization on the Stiefel Manifold via the Cayley Transform

Strictly enforcing orthonormality constraints on parameter matrices has been shown advantageous in deep learning. This amounts to Riemannian optimization on the Stiefel manifold, which, however, is computationally expensive. To address this challenge, we present two main contributions: (1) A new efficient retraction map based on an iterative Cayley transform for optimization updates, and (2) An implicit vector transport mechanism based on the combination of a projection of the momentum and the Cayley transform on the Stiefel manifold. We specify two new optimization algorithms: Cayley SGD with momentum, and Cayley ADAM on the Stiefel manifold. Convergence of Cayley SGD is theoretically analyzed. Our experiments for CNN training demonstrate that both algorithms: (a) Use less running time per iteration relative to existing approaches that enforce orthonormality of CNN parameters; and (b) Achieve faster convergence rates than the baseline SGD and ADAM algorithms without compromising the performance of the CNN. Cayley SGD and Cayley ADAM are also shown to reduce the training time for optimizing the unitary transition matrices in RNNs.

## Authors

• 129 publications
• 17 publications
• 15 publications
• ### Geoopt: Riemannian Optimization in PyTorch

Geoopt is a research-oriented modular open-source package for Riemannian...
05/06/2020 ∙ by Max Kochurov, et al. ∙ 0

• ### DEAM: Accumulated Momentum with Discriminative Weight for Stochastic Optimization

Optimization algorithms with momentum, e.g., Nesterov Accelerated Gradie...
07/25/2019 ∙ by Jiyang Bai, et al. ∙ 0

• ### Quasi-hyperbolic momentum and Adam for deep learning

Momentum-based acceleration of stochastic gradient descent (SGD) is wide...
10/16/2018 ∙ by Jerry Ma, et al. ∙ 0

• ### On Tight Convergence Rates of Without-replacement SGD

For solving finite-sum optimization problems, SGD without replacement sa...
04/18/2020 ∙ by Kwangjun Ahn, et al. ∙ 0

• ### APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm

Adam is the important optimization algorithm to guarantee efficiency and...
08/26/2020 ∙ by Hanlin Tang, et al. ∙ 11

• ### SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum

Distributed optimization is essential for training large models on large...
10/01/2019 ∙ by Jianyu Wang, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Orthonormality has recently gained much interest, as there are significant advantages of enforcing orthonormality on parameter matrices of deep neural networks. For CNNs,

Bansal et al. (2018) show that orthonormality constraints improve accuracy and gives a faster empirical convergence rate, Huang et al. (2018a) show that orthonormality stabilizes the distribution of neural activations in training, and Cogswell et al. (2015) show that orthonormality reduces overfitting and improves generalization. For RNNs, Arjovsky et al. (2016); Zhou et al. (2006)

show that the orthogonal hidden-to-hidden matrix alleviates the vanishing and exploding-gradient problems.

Riemannian optimization on the Stiefel manifold, which represents the set of all orthonormal matrices of the same size, is an elegant framework for optimization under orthonormality constraints. But its high computational cost is limiting its applications, including in deep learning. Recent efficient approaches incorporate orthogonality in deep learning only for square parameter matrices (Arjovsky et al., 2016; Dorobantu et al., 2016), or indirectly through regularization (Bansal et al., 2018), which however does not guarantee the exact orthonormality of parameter matrices.

To address the aforementioned limitations, our first main contribution is an efficient estimation of the retraction mapping based on the Cayley transform for updating large

non-square matrices of parameters on the Stiefel manifold. We specify an efficient iterative algorithm for estimating the Cayley transform that consists of only a few matrix multiplications, while the closed-form Cayley transform requires costly matrix inverse operations (Vorontsov et al., 2017; Wisdom et al., 2016). The efficiency of the retraction mapping is both theoretically proved and empirically verified in the paper.

Our second main contribution is aimed at improving convergence rates of training by taking into account the momentum in our optimization on the Stiefel manifold. We derive a new approach to move the tangent vector between tangent spaces of the manifold, instead of using the standard parallel transport. Specifically, we regard the Stiefel manifold as a submanifold of a Euclidean space. This allows for representing the vector transport (Absil et al., 2009) as a projection onto the tangent space. As we show, since the Cayley transform implicitly projects gradients onto the tangent space, the momentum updates result in an implicit vector transport. Thus, we first compute a linear combination of the momentum and the gradient in the Euclidean space, and then update the network parameters using the Cayley transform, without explicitly performing the vector transport.

We apply the above two contributions to generalize the standard SGD with momentum and ADAM (Kingma & Ba, 2014) to the Stiefel manifold, resulting in our two new optimization algorithms called Cayley SGD with momentum and Cayley ADAM. A theoretical analysis of the convergence of Cayley SGD is presented. Similar analysis for Cayley ADAM is omitted, since it is very similar to the analysis presented in (Becigneul & Ganea, 2019).

Cayley SGD and Cayley ADAM are empirically evaluated on image classification using VGG and Wide ResNet on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky & Hinton, 2009)

. The experiments show that Cayley SGD and Cayley ADAM achieve better classification performance and faster convergence rate than the baseline SGD with momentum and ADAM. While the baselines take less time per epoch since they do not enforce the orthonormality constraint, they take more epochs until convergence than Cayley SGD and Cayley ADAM. In comparison with existing optimization methods that also account for orthonormality – e.g., Polar decomposition, QR decomposition, or closed-form Cayley transform – our Cayley SGD and Cayley ADAM run much faster and yield equally good or better performance in image classification.

Finally, we apply the aforementioned two contributions to training of the unitary RNNs. Wisdom et al. (2016) proposes the full capacity unitary RNN that updates the hidden-to-hidden transition matrix with the closed-form Cayley transform. In contrast, for our RNN training, we use the more efficient Cayley SGD with momentum and Cayley ADAM. The results show that our RNN training takes less running time per iteration without compromising performance.

## 2 Related Work

There is a host of literature on using orthonormality constraints in neural-network training. This section reviews closely related work, which can be broadly divided in two groups: orthonormality regularization and Riemannian optimization.

Regularization approaches can be divided as hard, which strictly enforce orthonormality, and soft. Hard regularizations are computationally expensive. For example, Huang et al. (2018b)

extend Batch Normalization

(Ioffe & Szegedy, 2015)

with ZCA, and hence require costly eigenvalue decomposition.

Huang et al. (2018a) derive a closed-form solution that also requires eigenvalue decomposition. Bansal et al. (2018); Cisse et al. (2017); Xiong et al. (2016) introduce mutual coherence regularization and spectral restricted isometry regularization; however, these regularizations are soft in that they can not guarantee orthonormality.

Riemannian optimization guarantees that the solution respects orthonormality constraints. For example, Cho & Lee (2017) replace Batch Normalization layers in a CNN with Riemannian optimization on the Grassmann manifold , where the parameters are normalized but not orthonormalized. Also, Vorontsov et al. (2017); Wisdom et al. (2016); Lezcano-Casado & Martínez-Rubio (2019); Helfrich et al. (2017) perform Riemannian optimization on the group of unitary matrices to stablize RNNs training, but their technique cannot be applied to non-square parameter matrices. Becigneul & Ganea (2019) introduce a more general Riemannian optimization, but do not show how to efficiently perform this optimization on the Stiefel manifold.

The key challenge of Riemannian optimization is that exponential mapping — the standard step for estimating the next update point — is computationally expensive on the Stiefel manifold. Some methods use an efficient pseudo-geodesic retraction instead of the exponential mapping. For example, Absil & Malick (2012); Gawlik & Leok (2018); Manton (2002) use a projection-based method to map the gradient back to the Stiefel manifold that rely on computational expensive SVD. Other approaches are based on the closed-form Cayley transform (Fiori et al., 2012; Jiang & Dai, 2015; Nishimori & Akaho, 2005; Zhu, 2017), but require the costly matrix inversion. Also, Wen & Yin (2013) reduce the cost of the Cayley transform by making the restrictive assumption that the matrix size is such that . Unfortunately, this algorithm is not efficient when .

## 3 Preliminary

This section briefly reviews some well-known properties of the Riemannian and Stiefel manifolds. The interested reader is referred to Boothby (1986); Edelman et al. (1998) and references therein.

### 3.1 Riemannian Manifold

###### Definition 1.

Riemannian Manifold: A Riemannian manifold () is a smooth manifold equipped with a Riemannian metric defined as the inner product on the tangent space for each point , .

###### Definition 2.

Geodesic, Exponential map and Retraction map: A geodesic is a locally shortest curve on a manifold. An exponential map, , maps a tangent vector, , to a manifold, . represents a geodesic on a manifold, s.t. . A retraction map is defined as a smooth mapping on a manifold iff and , where denotes the derivative of , denotes an identity map defined on . It is easy to show that any exponential map is a retraction map. As computing an exponential map is usually expensive, a retraction map is often used as an efficient alternative.

###### Definition 3.

Parallel transport and vector transport: Parallel transport is a method to translate a vector along a geodesic on a manifold while preserving norm. A vector transport is a smooth map defined on a retraction of a manifold , . satisfies the following properties: (1) Underlying retraction: ; (2) Consistency: ; (3) Linearity: . Usually, a vector transport is a computationally efficient alternative to a parallel transport.

### 3.2 Stiefel Manifold

The Stiefel manifold , , is a Riemannian manifold that is composed of all orthonormal matrices . In the rest of the paper, we will use notation to denote the Stiefel manifold. We regard as an embeded submanifold of a Euclidean space. Hence, the Riemannian metric is the Euclidean metric as:, where are tangent vectors in . The tangent space of at is defined as

 TXM={Z:Z⊤X+X⊤Z=0} (1)

The projection of a matrix to can be computed as

 πTX(Z) =WX where:W =^W−^W⊤,^W=ZX⊤−12X(X⊤ZX⊤). (2)

Given a derivative of the objective function at in the Euclidean space, we can compute the Riemannian gradient on the Stiefel manifold as a projection onto using given by Eq.(3.2). It follows that optimization of on the Riemannian manifold can be computed as follows. First, compute in . Second, transport the momentum to the current tangent space and combine it linearly with the current Riemannian gradient to update the momentum . Finally, third, update the new parameter along a curve on the manifold with the initial direction as .

While the exponential map and parallel transport can be used to update parameters and momentums in optimization on the Riemannian manifold, they are computationally infeasible on the Stiefel manifold. In the following section, we specify our computationally efficient alternatives.

#### 3.2.1 Parameter Updates by Iterative Cayley Transform

The Cayley transform computes a parametric curve on the Stiefel manifold using a skew-symmetric matrix

(Nishimori & Akaho, 2005). The closed-form of the Cayley transform is given by:

 Y(α)=(I−α2W)−1(I+α2W)X, (3)

where is a skew-symmetric matrix, i.e. , is on the Stiefel manifold, and is a parameter that represents the length on the curve. It is straightforward to verify that

 Y(0)=XandY′(0)=WX. (4)

From Definition 2 and the definition of the tangent space of the Stiefel manifold given by Eq.(1), the Cayley transform is a valid retraction map on the Stiefel manifold. By choosing , where as in Eq.(3.2), we see that the Cayley transform implicitly projects gradient on the tangent space as its initial direction. Therefore, the Cayley transform can represent an update for the parameter matrices on the Stiefel manifold.

However, the closed-form Cayley transform in Eq.(3) involves computing the expensive matrix inversion, which cannot be efficiently performed in large deep neural networks.

Our first contribution is an iterative estimation of the Cayley transform that efficiently uses only matrix multiplications, and thus is more efficient than the closed form in Eq.(3). We represent the Cayley transform with the following fixed-point iteration:

 Y(α)=X+α2W(X+Y(α)), (5)

which can be proved by moving on the right-hand side to the left-hand side in Eq.(3). The expression in Eq.(5) is an efficient approximation of Eq.(3). In Sec. 5, we will analyze its convergence rate to the closed-form Eq.(3).

#### 3.2.2 Momentum Updates by the Implicit Vector Transport

Our second contribution is an efficient way to perform momentum updates on the Stiefel manifold. We specify an implicit vector transport by combining the Cayley transform and momentum updates in an elegant way without explicitly computing the vector transport. As the Stiefel manifold can be viewed as a submanifold of the Euclidean space , we have a natural inclusion of the tangent space . Consequently, the vector transport on the Stiefel manifold is the projection on the tangent space. Formally, for two tangent vectors , the vector transport of along a retraction map , denoted as , can be computed as:

 τηX(ξX) =πTr(ηX)(ξX). (6)

We specify the retraction map as the Cayley transform in Eq.(3). At optimization step , in Eq.(6), we choose , where is the momentum in step . Then, we compute the vector transport of as . As the projection onto a tangent space is a linear map, then . Thus we first compute a linear combination of the Euclidean gradient and the momentum , as in the Euclidean space, and then use the iterative Cayley transform to update the parameters, without explicitly estimating the vector transport, since the Cayley transform implicitly project a vector onto the tangent space.

## 4 Algorithms

This section specifies our Cayley SGD and Cayley ADAM algorithms. Both represent our efficient Riemannian optimization on the Stiefel manifold that consists of two main steps. As the Cayley transform implicitly projects gradient and momentum vectors onto the tangent space, we first linearly combine the momentum of the previous iteration with the stochastic gradient of the objective function at the current point , denoted as . Then, we use the iterative Cayley transform to estimate the next optimization point based on the updated momentum. This is used to generalize the conventional SGD with momentum and ADAM algorithms to our two new Riemannian optimizations on the Stiefel manifold, as described in Section 4.1 and Section 4.2.

### 4.1 Cayley SGD with Momentum

We generalize the heavy ball (HB) momentum update (Ghadimi et al., 2015; Zavriev & Kostyuk, 1993) in the th optimization step111Note that we use index in superscript to indicate our iterative steps in estimation of the Cayley transform, and index in subscript to indicate optimization steps. to the Stiefel manifold. Theoretically, it can be represented as:

 Mk+1=βπTXk(Mk)−GM(Xk),Xk+1=Y(α,Xk,Wk) (7)

where is a curve that starts at with length on the Stiefel manifold, specified by the Cayley transform in Eq.(3). In practice, we efficiently perform the updates in Eq.(7) by the proposed iterative Cayley transform and implicit vector transport on the Stiefel manifold. Specially, we first update the momentum as if it were in the Euclidean space. Then, we update the new parameters by iterative Cayley transform. Finally, we correct the momentum by projecting it to . The details of our Cayley SGD algorithm are shown in Alg. 1.

### 4.2 ADAM on the Stiefel Manifold

ADAM is a recent first-order optimization method for stochastic objective functions. It estimates adaptive lower-order moments and uses adaptive learning rate. The algorithm is designed to combine the advantages of both AdaGrad, which performs well in sparse-gradient cases, and RMSProp, which performs well in non-stationary cases.

We generalize ADAM to the Stiefel manifold by making three modifications to the vanilla ADAM. First, we replace the standard gradient and momentum with the corresponding ones on the Stiefel manifold, as described in Section 4.1. Second, we use a manifold-wise adaptive learning rate that assign a same learning rate for all entries in a parameter matrix as in (Absil et al., 2009). Third, we use the Cayley transform to update the parameters. Cayley ADAM is summarized in Alg 2.

## 5 Convergence Analysis

In this section, we analyze convergence of the iterative Cayley transform and Cayley SGD with momentum. To facilitate our analysis, we make the following common assumption.

###### Assumption 1.

The gradient of the objective function is Lipschitz continuous

 ∥∇f(X)−∇f(Y)∥≤L∥X−Y∥,∀X,Y, where L>0 is a constant. (8)

Lipschitz continity is widely applicable to deep learning architectures. For some models using ReLU, the derivative of ReLU is Lipschitz continuous almost everywhere with an appropriate Lipschitz constant

in Assumption 1 , except for a small neighbourhood around 0, whose measure tends to . Such cases do not affect either analysis in theory or training in practice.

The above assumption allows proving the following contraction mapping theorem.

###### Theorem 1.

For , the iteration is a contraction mapping and converges to the closed-form Cayley transform given by Eq.(3). Specifically, at iteration , .

Theorem 1 shows the iterative Cayley transform will converge. Specially, it converges faster than other approximation algorithms, such as, e.g., the Newton iterative and Neumann Series which achieves error bound at the iteration. We further prove the following result on convergence:

###### Theorem 2.

Given an objective function that satisfies Assumption 1, let Cayley SGD with momentum run for iterations with . For , where is a positive constant, we have: .

The proofs of Theorem 1 and Theorem 2 are presented in the appendix.

## 6 Experiments

### 6.1 orthonormality in CNNs

In CNNs, for a convolutional layer with kernel , we first reshape into a matrix of size , where , . In most cases, we have . Then, we restrict the matrix on the Stiefel manifold using Cayley SGD or Cayley ADAM, while other parameters are optimized with SGD and ADAM.

Datasets: We evaluate Cayley SGD or Cayley ADAM in image classification on the CIFAR10 and CIFAR100 datasets (Krizhevsky & Hinton, 2009). CIFAR10 and CIFAR100 consist of of 50,000 training images and 10,000 test images, and have 10 and 100 mutually exclusive classes.

Models: We use two networks — VGG (Simonyan & Zisserman, 2014) and Wide ResNet (Zagoruyko & Komodakis, 2016) — that obtain state of the art performance on CIFAR10 and CIFAR100. For VGG, every convolutional layer is followed by a batch normalization layer and a ReLU. For Wide ResNet, we use basic blocks, where two consecutive 3-by-3 convolutional layers are followed by the batch normalization and ReLU activation, respectively.

Training Strategies: We use different learning rates and for weights on the Euclidean space and the Stiefel manifold, respectively. We set the weight decay as 0.0005, momentum as 0.9, and minibatch size as 128. The initial learning rates are set as and for Cayley SGD and and for Cayley ADAM. During training, we reduce the learning rates by a factor of

at 60, 120, and 160 epochs. The total number of epochs in training is 200. In training, the data samples are normalized using the mean and variance of the training set, and augmented by randomly flipping training images.

Our baselines include SGD with momentum and ADAM. We follow the same training strategies as mentioned above, except for the initial learning rates set to 0.1 and 0.001, respectively.

Performance: Table 1 and Table 2 show classification errors on CIFAR10 and CIFAR100 respectively using different optimization algorithms. As shown in the tables, the proposed two algorithms achieve competitive performance, and for certain deep architectures, the best performance. Specifically, the network WRN-28-10 trained with Cayley ADAM achieves the best error rate of and on CIFAR10 and CIFAR100 respectively. Fig. 1

compares training curves of our algorithms and baselines in terms of epochs, and shows that both Cayley SGD and Cayley ADAM converge faster than the baselines. In particular, the training curves of the baselines tend to get stuck in a plateau before the learning rate drops, which is not the case with our algorithms. This might be because the baselines do not enforce orthonormality of network parameters. In training, the backpropagation of orthonormal weight vectors, in general, does not affect each other, and thus has greater chances to explore new parameter regions. Fig.

2 also compares the training loss curve in terms of time. Our Cayley SGD and Cayley ADAM converge the fastest among methods that also address orthonormality. Although the baselines SGD and ADAM converge faster at the beginning due to their training efficiency, our Cayley SGD and Cayley ADAM can catch up with the baseline after 12000 seconds, which corresponds to the 120th epoch of SGD and ADAM, and the 60th epoch of Cayley SGD and Cayley ADAM.

Comparison with State of the Art. We compare the proposed algorithms with two sets of state of the art. The first set of approaches are soft orthonormality regularization approaches (Bansal et al., 2018). Specially, for a weight matrix , SO penalizes , DSO penalizes (), the SRIP penalizes the spectral norm of . The second set of approaches includes the following hard orthonormality methods: Polar decomposition(Absil et al., 2009), QR decomposition(Absil et al., 2009), closed-form Cayley transform, Wen&Yin (Wen & Yin, 2013), OMDSM(Huang et al., 2018a), DBN(Huang et al., 2018b). Note that we do not include momentum in Polar decomposition and QR decomposition as previous work does not specify the momentum. Also, we use the closed-form Cayley transform without momentum as an ablation study of the momentum effect. All experiments are evaluated on the benchmark network WRN-28-10. Table 3 shows that our algorithms achieve comparable error rates with state of the art. All algorithms are run on one TITAN Xp GPU, and their average training time are compared per epoch. Table 3 shows that our algorithms run much faster than existing algorithms, except for the baseline SGD and ADAM which do not impose orthonormality constraints.

### 6.2 orthonormality in RNNs

In RNNs, the hidden-to-hidden transition matrix

can be modeled as a unitary matrix

(Arjovsky et al., 2016). Wisdom et al. (2016) model the hidden-to-hidden transition matrix as a full-capacity unitary matrix on the complex Stiefel manifold: .

Pixel-by-pixel MNIST: We evaluate the proposed algorithms on the challenging pixel-by-pixel MNIST task for long-term memory. The task was used to test the capacity of uRNNs (Arjovsky et al., 2016; Wisdom et al., 2016). Following the same setting as in Wisdom et al. (2016), we reshape MNIST images of pixels to a pixel-by-pixel sequences, and select 5,000 out of the 60,000 training examples for the early stopping validation.

Training: Wisdom et al. (2016) restricted the transition unitary matrix on the Stiefel manifold via a closed-form Cayley transform. On the contrary, we use Cayley SGD with momentum and Cayley ADAM to reduce the training time. Table 4 shows that the proposed algorithms reduce the training time by about 35% for all settings of the network, while maintaining the same level of accuracy. All experiments are performed on one TITAN Xp GPU.

Checking Unitariness: To show that the proposed algorithms are valid optimization algorithms on the Stiefel manifold, we check the unitariness of the hidden-to-hidden matrix by computing the error term during training. Table 5 compares average errors for varying numbers of iterations . As can be seen, the iterative Cayley transform can approximate the unitary matrix when . The iterative Cayley transform performs even better than the closed form Cayley transform, which might be an effect of the rounding error of the matrix inversion as in Eq.(3).

## 7 Conclusion

We proposed an efficient way to enforce the exact orthonormal constraints on parameters by optimization on the Stiefel manifold. The iterative Cayley transform was applied to the conventional SGD and ADAM for specifying two new algorithms: Cayley SGD with momentum and Cayley ADAM, and the theoretical analysis of convergence of the former. Experiments show that both algorithms achieve comparable performance and faster convergence over the baseline SGD and ADAM in training of the standard VGG and ResNet on CIFAR10 and CIFAR100, as well as RNNs on the pixel-by-pixel MNIST task. Both Cayley SGD with momentum and Cayley ADAM take less runtime per iteration than all existing hard orthonormal methods and soft orthonormal methods, and can be applied to non-square parameter matrices.

#### Acknowledgements

This work was supported in part by NSF grant IIS-1911232, DARPA XAI Award N66001-17-2-4029 and AFRL STTR AF18B-T002.

## Appendix A Preliminary

In this section, we derive some properties of the Stiefel manifold in this section to facilitate the proofs of Theorem 1 and Theorem 2 in the main paper.

Considering the Stiefel manifold is a bounded set and Lipschitz Assumption 1 on the gradient in Sec. 5, one straightforward conclusion is that both and its Stiefel manifold gradient that is a projection onto the tangent space are bounded. Formally, there exists a positive constant , such that

 ||∇Mf(X)||≤||∇f(X)||≤G,∀X∈M (9)

As stochastic gradient is the gradient of a sub-dataset, where is a stochastic variable for data samples and we are working are a finite dataset, it’s straightforward to show that and its Riemannian stochastic gradient are also bounded. For brevity, we still use the same upper bound , such that:

 ||GM(X)||≤||G(X)||≤G,∀X∈M (10)

Recall the recursion in Eq.(7), we show that the momentum is also bounded:

 ||Mk+1|| ≤β||Mk||+||GM(Xk)|| ≤k∑i=0βk−i||GM(Xi)|| ≤11−βG (11)

Therefore, we know that in Eq.(7) is bounded.

## Appendix B Proof of Theorem 1

###### Proof.

By subtracting the iterative relationship Eq.(5) by its iteration , we have:

 ||Yi+1−Y(α)||≤α||W||2||Yi−Y(α)|| (12)

Therefore, since is bounded, for , such that , the iteration in Eq.(5) is a contraction mapping, and it will converge to the closed-from solution .

By differentiate Eq.(5), we have:

 dY(α)dα =W(X+Y(α)2)+α2WdY(α)dα d2Y(α)dα2 =(I−α2W)−1WdY(α)dα (13)

therefore, and are bounded, i.e. there exist a positive constant , such that:

 ||d2Y(α)dα2||≤C (14)

Using the Taylor expansion of in Eq.(3), we have:

 Y(α)=Xk+αMk+1+12α2d2Y(γkα)dα2 (15)

where . Given , we have:

 ||Y0−Y(α)||=o(α2) (16)

Since is bounded and , then,

 ||Yi−Y(α)||≤(α||W||2)i||Y0−Y(α)||=o(α2+i) (17)

## Appendix C Proof of Theorem 2

###### Proof.

Use Taylor expansion of , the process of Cayley SGD with momentum Eq.(7) can be written as:

 Mk+1 =πTXk(βMk)−GM(Xk) Xk+1 =Xk+αMk+1+12α2d2Y(γkα)dα2 (18)

where .

Using the fact that

 M=πTX(M)+πNX(M) (19)

where is the projection onto the normal space, and

 πNX(M)=XX⊤M+M⊤X2 (20)

Then, the projection of momentum can be represented as:

 πTXk(Mk) =Mk−πNXk(Mk) =Mk−XkX⊤kMk+M⊤kXk2 =Mk−12Xk{[Xk−1+αMk+12α2d2Y(γk−1α)dα2]⊤Mk +M⊤k[Xk−1+αMk+12α2d2Y(γk−1α)dα2]} =Mk−αXkM⊤kMk−14α2Xk[d2Y(γk−1α)dα2⊤Mk+M⊤kd2Y(γk−1α)dα2] (21)

Then the momentum update in Eq.(7) is equivalent to:

 Mk+1 =βMk−αβXkM⊤kMk−GM(Xk) −14α2βXk[d2Y(γk−1α)dα2⊤Mk+M⊤kd2Y(γk−1α)dα2] (22)

Therefore, the paramter update in Eq.(7) can be represented as:

 Xk+1 =Xk+αβMk−αGM(Xk) −14α3βXk[d2Y(γk−1α)dα2⊤Mk+M⊤kd2Y(γk−1α)dα2] −α2βXkM⊤kMk+12α2d2Y(γkα)dα2 =Xk+β(Xk−Xk−1)−αGM(Xk)+α2U (23)

where

 U =−14αβXk[d2Y(γk−1α)dα2⊤Mk+M⊤kd2Y(γk−1α)dα2] −βXkM⊤kMk+12d2Y(γkα)dα2−β2d2Y(γk−1α)dα2 (24)

Since , , and are bounded, there is a positive constant , such that

 ||U||≤D (25)

To facilitate the analysis of Cayle SGD with momentum, we introduce auxiliary variables , such that:

 Zk+1 =Zk−α1−βGM(Xk)+α21−βU

where

 Zk =Xk+Pk (27)

and

 Pk=⎧⎪⎨⎪⎩β1−β(Xk−Xk−1),k≥10,k=0

Since is a smooth function according to Assumption 1, we have:

 f(Y)−f(X)−tr(∇f(X)⊤(Y−X)) = ∫10∇tr(f(Y+t(Y−X))⊤(Y−X))dt−tr(∇f(X)⊤(Y−X)) ≤ ||∫10(∇f(Y+t(Y−X))−∇f(X))dt||×||Y−X|| ≤ ∫10L||t(Y−X)||dt×||Y−X|| ≤ L2||Y−X||2 (28)

Then, we have

 f(Zk+1)≤ f(Zk)+tr(∇f(Zk)⊤(Zk+1−Zk))+L2||Zk+1−Zk||2 = f(Zk)+tr(∇f(Zk)⊤(Zk+1−Zk))+L2||α1−βGM(Xk)−α21−βU||2 ≤ f(Zk)+tr(∇f(Zk)⊤(Zk+1−Zk))+Lα2(1−β)2||GM(Xk)||2+Lα4(1−β)2||U||2 ≤ f(Zk)−α1−βtr(GM(Xk)⊤∇f(Zk))+α21−βtr(U⊤∇f(Zk))+Lα2(1−β)2G2 + Lα4(1−β)2D2 ≤ f(Zk)−α1−βtr(GM(Xk)⊤∇f(Zk))+α22(1−β)(||U||2+||∇f(Zk)||2)+Lα2(1−β)2G2 + Lα4(1−β)2D2 ≤ f(Zk)−α1−βtr(GM(Xk)⊤∇f(Zk))+α22(1−β)(D2+G2) + Lα2(1−β)2G2+Lα4(1−β)2D2 (29)

By taking expectation over the both sides, we have:

 E[f(Zk+1)−f(Zk))] ≤ E[−α1−βtr(∇f(Zk)⊤∇Mf(Xk))]+α22(1−β)(D2+G2)+Lα2(1−β)2G2+Lα4(1−β)2D2 ≤ E[−α1−βtr(∇f(Zk)−∇f(Xk))⊤∇Mf(Xk)−α1−βtr(∇f(Xk)⊤∇Mf(Xk))] + α22(1−β)(D2+G2)+Lα2(1−β)2G2+Lα4(1−β)2D2 = E[−α1−βtr(∇f(Zk)−∇f(Xk))⊤∇Mf(Xk)−α1−β||∇Mf(Xk)||2] + α22(1−β)(D2+G2)+Lα2(1−β)2G2+Lα4(1−β)2D2 (30)

By noticing that:

 −α1−β(∇f(Zk)−∇f(Xk))⊤∇Mf(Xk) ≤ 12L||∇f(Zk)−∇f(Xk)||2+Lα22(1−β)2||∇Mf(Xk)||2 (31)

Then

 E[f(Zk+1)−f(Zk))] ≤12LE||∇f(Zk)−∇f(Xk)||2+(Lα22(1−β)2−α1−β)E||∇Mf(Xk)||2 +α22(1−β)(D2+G2)+Lα2(1−β)2G2+Lα4(1−β)2D2 (32)

According to the Lipschitz continuous property in Assumption 1, we have:

 ||∇f(Zk)−∇f(Xk)||2≤ L2||Zk−Xk|