Riemannian stochastic quasi-Newton algorithm with variance reduction and its convergence analysis

03/15/2017 ∙ by Hiroyuki Kasai, et al. ∙ Amazon ∙ University of Electro-Communications

Stochastic variance reduction algorithms have recently become popular for minimizing the average of a large, but finite number of loss functions. The present paper proposes a Riemannian stochastic quasi-Newton algorithm with variance reduction (R-SQN-VR). The key challenges of averaging, adding, and subtracting multiple gradients are addressed with notions of retraction and vector transport. We present convergence analyses of R-SQN-VR on both non-convex and retraction-convex functions under retraction and vector transport operators. The proposed algorithm is evaluated on the Karcher mean computation on the symmetric positive-definite manifold and the low-rank matrix completion on the Grassmann manifold. In all cases, the proposed algorithm outperforms the state-of-the-art Riemannian batch and stochastic gradient algorithms.


1 Introduction

Let $f \colon \mathcal{M} \to \mathbb{R}$ be a smooth real-valued function on a Riemannian manifold $\mathcal{M}$ [1]. The problem under consideration in the present paper is the minimization of the expected risk of $f$ for a given model variable $w \in \mathcal{M}$ taken with respect to the distribution of $z$, i.e., $\min_{w \in \mathcal{M}} \mathbb{E}_z[f(w; z)]$, where $z$ is a random seed representing a single sample or a set of samples. When given a set of $N$ realizations of $z$, we define the loss incurred by the parameter vector $w$ with respect to the $n$-th sample as $f_n(w)$, and then the empirical risk is defined as the average of the sample losses:

$$\min_{w \in \mathcal{M}} \; f(w) := \frac{1}{N} \sum_{n=1}^{N} f_n(w), \qquad (1)$$

where $N$ is the total number of samples. This problem has many applications that include, to name a few, principal component analysis (PCA) and the subspace tracking problem [2] on the Grassmann manifold. The low-rank matrix/tensor completion problem is a promising example on the manifold of fixed-rank matrices/tensors [3, 4]. The linear regression problem is also defined on the manifold of fixed-rank matrices [5].

Riemannian gradient descent requires the Riemannian full gradient estimation, i.e., $\mathrm{grad} f(w) = \frac{1}{N}\sum_{n=1}^{N} \mathrm{grad} f_n(w)$, at every iteration, where $\mathrm{grad} f_n(w)$ is the Riemannian stochastic gradient of the $n$-th sample on the Riemannian manifold $\mathcal{M}$. This estimation is computationally heavy when $N$ is extremely large. A popular alternative is Riemannian stochastic gradient descent (R-SGD), which extends stochastic gradient descent (SGD) in the Euclidean space [6]. Because it uses only one $\mathrm{grad} f_n(w)$ per iteration, the complexity per iteration is independent of $N$. However, similarly to SGD [7], R-SGD suffers from slow convergence due to a decaying step-size sequence. Variance reduction (VR) methods have recently been proposed to accelerate the convergence of SGD in the Euclidean space [8, 9, 10, 11, 12]. One distinguishing feature is to calculate a full gradient estimation periodically and to re-use it to reduce the variance of the noisy stochastic gradient. However, because all of these algorithms are first-order algorithms, their convergence can be slow because of their poor curvature approximations in ill-conditioned problems, as seen in Section 4. One promising approach is second-order algorithms, such as stochastic quasi-Newton (QN) methods using Hessian evaluations [13, 14, 15, 16], which achieve faster convergence by exploiting the curvature information of the objective function $f$. Furthermore, combining these two acceleration techniques, [17] and [18] propose hybrid algorithms of the stochastic QN method with the VR method.
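
To make the variance-reduction idea concrete, the following minimal sketch (our own Euclidean illustration on a least-squares loss, not code from the paper; all names are hypothetical) shows the SVRG-style estimator that re-uses a periodically computed full gradient.

```python
import numpy as np

# Minimal Euclidean SVRG sketch on a least-squares problem (illustrative only).
# Loss: f_n(w) = 0.5 * (a_n^T w - b_n)^2, so grad f_n(w) = a_n * (a_n^T w - b_n).
rng = np.random.default_rng(0)
N, d = 1000, 20
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d) + 0.01 * rng.standard_normal(N)

def grad_n(w, n):
    return A[n] * (A[n] @ w - b[n])

def full_grad(w):
    return A.T @ (A @ w - b) / N

w_tilde = np.zeros(d)              # snapshot kept fixed during an epoch
alpha, m = 1e-3, 2 * N             # fixed step-size and inner-loop length
for epoch in range(10):
    mu = full_grad(w_tilde)        # full gradient, computed once per epoch
    w = w_tilde.copy()
    for _ in range(m):
        n = rng.integers(N)
        # Variance-reduced gradient: unbiased, with variance that shrinks
        # as both w and w_tilde approach the solution.
        v = grad_n(w, n) - grad_n(w_tilde, n) + mu
        w = w - alpha * v
    w_tilde = w                    # new snapshot (last inner iterate)
```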

Turning back to Riemannian manifolds, many challenges of the QN method have been addressed in deterministic settings [19, 20, 21]. The VR method in the Euclidean space has also been extended to Riemannian manifolds, yielding the so-called R-SVRG [22, 23]. Nevertheless, a second-order stochastic algorithm with the VR method has not been explored thoroughly for the problem (1). To this end, we propose a Riemannian stochastic QN method based on L-BFGS and the VR method.

Our contributions are three-fold: (i) we propose a novel (and, to the best of our knowledge, the first) Riemannian limited-memory QN algorithm with a VR method. (ii) Our convergence analysis deals with both non-convex and (strongly) retraction-convex functions. In this paper, a function is said to be (strongly) retraction-convex when it is (strongly) convex along curves on the manifold defined by a retraction (Assumption 3), while all other functions are called non-convex. (iii) The proposed algorithm and its analyses are considered under computationally efficient retraction and vector transport operations instead of the more restrictive exponential mapping and parallel translation operations. This is more challenging than R-SVRG [23], but yields an advantage beyond computational efficiency, namely a wider class of applicable manifolds. For example, while [23] cannot be applied to the Stiefel and fixed-rank manifolds because these manifolds do not have closed-form expressions for parallel translation, our analysis and algorithm apply to them directly.

The specific features of the algorithm are two-fold: (i) we update the curvature pair of the QN method once every outer loop by exploiting the full gradient estimations of the VR method, and thereby capture more precise and more stable curvature information. This avoids the additional sweeps over samples required in the Euclidean stochastic QN method [16], the additional gradient estimations required in the Euclidean online BFGS (oBFGS) [14, 13, 24], and the additional sub-sampling of the Hessian [16, 17]. (ii) Compared with a simple Riemannian extension of the QN method, a noteworthy advantage of its combination with the VR method is that, as revealed below, the frequent transport of curvature information between different tangent spaces, which would be unavoidable in such a simple extension, can be drastically reduced. This is a benefit specific to the Riemannian hybrid algorithm, with no counterpart in the Euclidean case [17, 18]. More specifically, the calculations of the curvature information and the second-order modified Riemannian stochastic gradient are performed uniformly on the tangent space at the outer-loop iterate.

The paper is organized as follows. Section 2 presents details of the proposed R-SQN-VR. Section 3 presents the convergence analyses. In Section 4, numerical comparisons with R-SGD and R-SVRG on two problems are provided, with results suggesting the superior performance of R-SQN-VR. The proposed R-SQN-VR is implemented in the Matlab toolbox Manopt [25]. Concrete proofs of the theorems and additional experiments are provided in the supplementary material.

2 Riemannian stochastic quasi-Newton algorithm with variance reduction (R-SQN-VR)

 Algorithm 1 Riemannian stochastic quasi-Newton with variance reduction (R-SQN-VR).  

0:  Update frequency, step-size, memory size, number of epochs K, and cautious update threshold.
1:  Initialize , and calculate the Riemannian full gradient .
2:  for  do
3:     Store .
4:     for  do
5:        Choose uniformly at random.
6:        Calculate the tangent vector from to by .
7:        if  then
8:           Transport the stochastic gradient to by .
9:           Calculate as .
10:           Calculate , transport back to by , and obtain .
11:           Update from as .
12:        else
13:           Calculate as .
14:           Update from as .
15:        end if
16:     end for
17:     Option I: (or for randomly chosen ).
18:     Option II: .
19:     Calculate the Riemannian full gradient .
20:     Calculate the tangent vector from to by .
21:     Compute , and where .
22:     if  then
23:        Discard pair when , and store pair .
24:     end if
25:     Transport to by .
26:  end for
27:  Option III: output
28:  Option IV: output for randomly chosen and .

 

We assume that the manifold $\mathcal{M}$ is endowed with a Riemannian metric structure, i.e., a smooth inner product $\langle \cdot, \cdot \rangle_w$ of tangent vectors is associated with the tangent space $T_w \mathcal{M}$ for all $w \in \mathcal{M}$ [1]. The norm $\|\cdot\|_w$ of a tangent vector is the norm associated with the Riemannian metric. The metric structure allows a systematic framework for optimization over manifolds. Conceptually, the constrained optimization problem (1) is translated into an unconstrained problem over the manifold $\mathcal{M}$.

2.1 R-SGD and R-SVRG

R-SGD: Given a starting point $w_0 \in \mathcal{M}$, R-SGD produces a sequence $\{w_t\}$ in $\mathcal{M}$ that converges to a first-order critical point of (1). Specifically, it updates $w_t$ as $w_{t+1} = R_{w_t}(-\alpha_t\, \mathrm{grad} f_n(w_t))$, where $\alpha_t$ is the step-size and $\mathrm{grad} f_n(w_t)$ is a Riemannian stochastic gradient, which is a tangent vector at $w_t \in \mathcal{M}$. $\mathrm{grad} f_n(w_t)$ represents an unbiased estimator of the Riemannian full gradient $\mathrm{grad} f(w_t)$, and the expectation of $\mathrm{grad} f_n(w_t)$ over the choices of $n$ is $\mathrm{grad} f(w_t)$, i.e., $\mathbb{E}_n[\mathrm{grad} f_n(w_t)] = \mathrm{grad} f(w_t)$. The update moves from $w_t$ in the direction $-\mathrm{grad} f_n(w_t)$ with step-size $\alpha_t$ while remaining on $\mathcal{M}$. The mapping used for this purpose, denoted as $R_{w_t}: T_{w_t}\mathcal{M} \to \mathcal{M}$, is called a retraction at $w_t$; it maps the tangent bundle onto $\mathcal{M}$ with a local rigidity condition that preserves gradients at $w_t$. The exponential mapping is an instance of the retraction.

R-SVRG: R-SVRG has double loops, where the $k$-th outer loop, called an epoch, has $m_k$ inner iterations. R-SVRG keeps $\tilde w^k \in \mathcal{M}$ after the $m_k$ inner iterations of the $k$-th epoch, and computes the Riemannian full gradient $\mathrm{grad} f(\tilde w^k)$ only for this stored $\tilde w^k$. It also computes the Riemannian stochastic gradient $\mathrm{grad} f_n(\tilde w^k)$ for the $n$-th sample. Then, picking the $n$-th sample for the $t$-th inner iteration of the $(k+1)$-th epoch at $w_t^{k+1}$, we calculate a modified Riemannian stochastic gradient $\xi_t^{k+1}$, i.e., by modifying $\mathrm{grad} f_n(w_t^{k+1})$ using both $\mathrm{grad} f(\tilde w^k)$ and $\mathrm{grad} f_n(\tilde w^k)$. Because they belong to different tangent spaces, a simple addition of them is not well-defined because Riemannian manifolds are not vector spaces. Therefore, after $\mathrm{grad} f(\tilde w^k)$ and $\mathrm{grad} f_n(\tilde w^k)$ are transported to $T_{w_t^{k+1}}\mathcal{M}$ by the vector transport $\mathcal{T}$, $\xi_t^{k+1}$ is set as

$$\xi_t^{k+1} = \mathrm{grad} f_n(w_t^{k+1}) - \mathcal{T}\big(\mathrm{grad} f_n(\tilde w^k) - \mathrm{grad} f(\tilde w^k)\big),$$

where $\mathcal{T}$ represents the vector transport from $\tilde w^k$ to $w_t^{k+1}$ along the tangent vector $\eta = R^{-1}_{\tilde w^k}(w_t^{k+1})$, which satisfies $R_{\tilde w^k}(\eta) = w_t^{k+1}$. The vector transport is associated with a retraction $R$ and, for all $w \in \mathcal{M}$ and $\eta_w, \xi_w \in T_w\mathcal{M}$, it holds that (i) $\mathcal{T}_{\eta_w}(\xi_w) \in T_{R_w(\eta_w)}\mathcal{M}$, (ii) $\mathcal{T}_{0_w}(\xi_w) = \xi_w$, and (iii) $\mathcal{T}_{\eta_w}$ is a linear map. Parallel translation is an instance of the vector transport. Consequently, the final update is defined as $w_{t+1}^{k+1} = R_{w_t^{k+1}}(-\alpha \xi_t^{k+1})$.
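
The sketch below is our own simplified illustration of these operators on the unit sphere (not the paper's implementation; the manifold, cost, and all names are assumptions): the retraction is taken as normalization, the vector transport as projection onto the new tangent space, and a few R-SVRG-style inner steps are performed for a Rayleigh-quotient-type cost.

```python
import numpy as np

# Illustrative R-SVRG-style steps on the unit sphere S^{d-1} (our simplification).
# Cost: f(w) = -(1/N) * sum_n (a_n^T w)^2, minimized along the leading eigenvector
# direction of (1/N) A^T A.
rng = np.random.default_rng(0)
N, d = 500, 10
A = rng.standard_normal((N, d))

def proj(w, v):                    # orthogonal projection onto the tangent space at w
    return v - (w @ v) * w

def rgrad_n(w, n):                 # Riemannian stochastic gradient (projected Euclidean gradient)
    return proj(w, -2.0 * A[n] * (A[n] @ w))

def rgrad_full(w):                 # Riemannian full gradient
    return proj(w, -2.0 * A.T @ (A @ w) / N)

def retract(w, xi):                # retraction: step in the tangent direction, then normalize
    v = w + xi
    return v / np.linalg.norm(v)

def transport(w_from, w_to, xi):   # vector transport by projection onto the target tangent space
    return proj(w_to, xi)

w_tilde = rng.standard_normal(d)
w_tilde /= np.linalg.norm(w_tilde)
mu_tilde = rgrad_full(w_tilde)     # full gradient at the snapshot
w, alpha = w_tilde.copy(), 0.05
for t in range(20):
    n = rng.integers(N)
    # Modified stochastic gradient: transport the snapshot terms to the tangent space at w.
    xi = rgrad_n(w, n) - transport(w_tilde, w, rgrad_n(w_tilde, n) - mu_tilde)
    w = retract(w, -alpha * xi)
```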

2.2 Proposed R-SQN-VR

We propose a Riemannian stochastic QN method accompanied with a VR method (R-SQN-VR). A straightforward extension is to update $w_t^{k+1}$ by premultiplying the modified stochastic gradient $\xi_t^{k+1}$ with a linear inverse Hessian approximation operator at $w_t^{k+1}$, i.e., $w_{t+1}^{k+1} = R_{w_t^{k+1}}(-\alpha\, \mathcal{H}_t^{k+1}[\xi_t^{k+1}])$, where $\mathcal{H}_t^{k+1} = \mathcal{T} \circ \tilde{\mathcal{H}}^{k} \circ \mathcal{T}^{-1}$ by denoting the inverse Hessian approximation at $\tilde w^k$ simply as $\tilde{\mathcal{H}}^{k}$. Here, $\mathcal{T}$ is an isometric vector transport explained in Section 3. $\tilde{\mathcal{H}}^{k}$ should be positive definite, i.e., $\langle \tilde{\mathcal{H}}^{k}[\xi], \xi \rangle_{\tilde w^k} > 0$ for all nonzero $\xi \in T_{\tilde w^k}\mathcal{M}$, and its inverse should be close to the Hessian of $f$. It is noteworthy that $\tilde{\mathcal{H}}^{k}$ is calculated only once every outer epoch, and remains in use throughout the corresponding $(k+1)$-th epoch.

Curvature pair $(s^{k+1}, y^{k+1})$: This paper particularly addresses the operator used in L-BFGS, which is intended for large-scale data. Thus, let $s^{k+1}$ and $y^{k+1}$ be the variable variation and the gradient variation at $\tilde w^{k+1}$, respectively, where the superscript expresses explicitly that they belong to $T_{\tilde w^{k+1}}\mathcal{M}$. It should be noted that the curvature pair is calculated at the new $\tilde w^{k+1}$ just after the $k$-th epoch has finished. Furthermore, after the epoch index is incremented, the curvature pair must be used only at $\tilde w^{k+1}$ because the calculation of $\tilde{\mathcal{H}}^{k+1}$ is performed only at $\tilde w^{k+1}$.

The variable variation is calculated from the difference between $\tilde w^{k+1}$ and $\tilde w^k$. This is represented by the tangent vector from $\tilde w^k$ to $\tilde w^{k+1}$, which is calculated using the inverse of the retraction as $R^{-1}_{\tilde w^k}(\tilde w^{k+1})$. Since this vector belongs to $T_{\tilde w^k}\mathcal{M}$, transporting it onto $T_{\tilde w^{k+1}}\mathcal{M}$ yields

$$s^{k+1} = \mathcal{T}\big(R^{-1}_{\tilde w^k}(\tilde w^{k+1})\big). \qquad (2)$$

The gradient variation is calculated from the difference between the new full gradient $\mathrm{grad} f(\tilde w^{k+1})$ and the previous, but transported, full gradient $\mathrm{grad} f(\tilde w^k)$ [21] as

$$y^{k+1} = \mathrm{grad} f(\tilde w^{k+1}) - \mathcal{T}\big(\mathrm{grad} f(\tilde w^k)\big), \qquad (3)$$

where $\mathcal{T}$ denotes the isometric vector transport from $T_{\tilde w^k}\mathcal{M}$ to $T_{\tilde w^{k+1}}\mathcal{M}$ explained in Section 3.
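
The sketch below (our own hypothetical helper, with the manifold primitives passed in as callables) shows how the curvature pair of (2) and (3) could be assembled once per epoch.

```python
from typing import Callable, Tuple
import numpy as np

Vec = np.ndarray

def curvature_pair(
    w_old: Vec, w_new: Vec,
    grad_old: Vec, grad_new: Vec,               # full gradients at w_old and w_new
    inv_retract: Callable[[Vec, Vec], Vec],     # R_{w_old}^{-1}(w_new), a tangent vector at w_old
    transport: Callable[[Vec, Vec, Vec], Vec],  # move a tangent vector from w_old to w_new
) -> Tuple[Vec, Vec]:
    """Variable variation s and gradient variation y, both located at w_new; cf. (2)-(3)."""
    s = transport(w_old, w_new, inv_retract(w_old, w_new))
    y = grad_new - transport(w_old, w_new, grad_old)
    return s, y
```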

Inverse Hessian approximation operator $\tilde{\mathcal{H}}^{k}$: $\tilde{\mathcal{H}}^{k}$ is calculated using the past curvature pairs. More specifically, it is updated as $\tilde{\mathcal{H}}^{k+1} = \mathcal{V}^{k+1} \circ \tilde{\mathcal{H}}^{k} \circ (\mathcal{V}^{k+1})^{*} + \rho^{k+1} s^{k+1} (s^{k+1})^{\flat}$, where $\rho^{k+1} = 1 / \langle s^{k+1}, y^{k+1} \rangle_{\tilde w^{k+1}}$ and $\mathcal{V}^{k+1} = \mathrm{id} - \rho^{k+1} s^{k+1} (y^{k+1})^{\flat}$ with the identity mapping id [21] (here $\tilde{\mathcal{H}}^{k}$ is understood to have been transported to $T_{\tilde w^{k+1}}\mathcal{M}$). Therein, $\xi^{\flat}$ denotes the flat of a tangent vector $\xi$, i.e., the linear functional $\xi^{\flat}: \zeta \mapsto \langle \xi, \zeta \rangle$. Thus, $\tilde{\mathcal{H}}^{k+1}$ depends on $\tilde{\mathcal{H}}^{k}$ and $(s^{k+1}, y^{k+1})$, and similarly $\tilde{\mathcal{H}}^{k}$ depends on $\tilde{\mathcal{H}}^{k-1}$ and $(s^{k}, y^{k})$. Proceeding recursively, $\tilde{\mathcal{H}}^{k+1}$ is a function of the initial approximation and all previous curvature pairs. Meanwhile, L-BFGS restricts use to the $L$ most recent pairs, since older pairs are likely to carry little curvature information. Based on this idea, L-BFGS performs $L$ updates starting from the initial approximation. We use all $k$ available pairs when $k < L$.
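
For reference, the Euclidean L-BFGS counterpart of this recursive update (a standard textbook formula, quoted here only as an analogue) reads

$$H_{k+1} = \big(I - \rho_k\, s_k y_k^{\top}\big)\, H_k\, \big(I - \rho_k\, y_k s_k^{\top}\big) + \rho_k\, s_k s_k^{\top}, \qquad \rho_k = \frac{1}{y_k^{\top} s_k},$$

with the flat operator playing the role of the Euclidean transpose in the Riemannian formula above.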

Now, we consider the final calculation of $\tilde{\mathcal{H}}^{k+1}$ used in the inner iterations of the $(k+1)$-th outer epoch using the $L$ most recent curvature pairs. Here, since this calculation is executed at $\tilde w^{k+1}$ and a Riemannian manifold is in general not a vector space, all the curvature pairs must be located at $T_{\tilde w^{k+1}}\mathcal{M}$. To this end, just after the curvature pair $(s^{k+1}, y^{k+1})$ is calculated in (2) and (3), the past pairs are transported into $T_{\tilde w^{k+1}}\mathcal{M}$ by the same vector transport used when calculating $s^{k+1}$ and $y^{k+1}$. It should be emphasized that this transport is necessary only once every outer epoch instead of at every inner iteration, and it results in a drastic reduction of computational complexity in comparison with a straightforward extension of the Euclidean stochastic L-BFGS [24] to the manifold setting. Consequently, the update is defined by applying the recursive formula above over the $L$ most recent (transported) pairs, starting from an initial inverse Hessian approximation. Because the initial approximation is not necessarily the identity mapping, and can be any positive definite self-adjoint operator, we use the scaled identity $\frac{\langle s^{k+1}, y^{k+1} \rangle}{\langle y^{k+1}, y^{k+1} \rangle}\, \mathrm{id}$, similar to the Euclidean case. The practical update of $\tilde{\mathcal{H}}^{k+1}$ uses the two-loop recursion algorithm [26] in Algorithm A.1 of the supplementary material.
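
As a concrete reference for the two-loop recursion, the sketch below gives the Euclidean version (our own code, not Algorithm A.1 of the paper; the Riemannian variant would additionally operate on transported pairs and replace inner products by the Riemannian metric).

```python
import numpy as np

def two_loop_recursion(g, s_list, y_list):
    """Euclidean L-BFGS two-loop recursion: approximates H @ g, where H is the
    inverse Hessian approximation built from the stored pairs (oldest first)."""
    q = np.array(g, dtype=float)
    alphas, rhos = [], []
    for s, y in zip(reversed(s_list), reversed(y_list)):    # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)
        rhos.append(rho)
    # Scaled initial approximation H_0 = gamma * I, as in the text above.
    if s_list:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q
    for (s, y), a, rho in zip(zip(s_list, y_list), reversed(alphas), reversed(rhos)):
        b = rho * (y @ r)
        r += (a - b) * s
    return r
```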

Cautious update:

Euclidean L-BFGS can fail on non-convex problems because the eigenvalues of the Hessian approximation are not guaranteed to be bounded away from zero and to be uniformly bounded above. To circumvent this issue, the cautious update has been proposed in the Euclidean space [27]. Following it, we skip the update of the curvature pair when the following condition is not satisfied:

$$\frac{\langle y^{k+1}, s^{k+1} \rangle_{\tilde w^{k+1}}}{\|s^{k+1}\|^2_{\tilde w^{k+1}}} \;\geq\; \epsilon, \qquad (4)$$

where $\epsilon > 0$ is a predefined constant parameter. Under this update, the positive definiteness of $\tilde{\mathcal{H}}^{k+1}$ is guaranteed as long as the previous approximation $\tilde{\mathcal{H}}^{k}$ is positive definite.
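
A minimal sketch of the skip rule in (4) (our own illustration with Euclidean inner products; the threshold name eps is hypothetical):

```python
import numpy as np

def cautious_accept(s: np.ndarray, y: np.ndarray, eps: float = 1e-4) -> bool:
    """Accept the curvature pair (s, y) only if <y, s> >= eps * ||s||^2,
    which keeps the inverse Hessian approximation positive definite."""
    return float(y @ s) >= eps * float(s @ s)

# Usage: store the pair only when the condition holds; otherwise skip the update.
# if cautious_accept(s_new, y_new):
#     s_list.append(s_new); y_list.append(y_new)
```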

Second-order modified stochastic gradient $\xi_t^{k+1}$: R-SVRG transports $\mathrm{grad} f(\tilde w^k)$ and $\mathrm{grad} f_n(\tilde w^k)$ at $\tilde w^k$ into $T_{w_t^{k+1}}\mathcal{M}$ to add them to $\mathrm{grad} f_n(w_t^{k+1})$ at $w_t^{k+1}$. If we followed the same strategy, we would also have to transport the curvature pairs into the current $T_{w_t^{k+1}}\mathcal{M}$ at every inner iteration. Addressing this problem, and exploiting the fact that both the full gradient and the curvature pairs belong to the same tangent space $T_{\tilde w^k}\mathcal{M}$, we instead transport $\mathrm{grad} f_n(w_t^{k+1})$ from $T_{w_t^{k+1}}\mathcal{M}$ into $T_{\tilde w^k}\mathcal{M}$ and complete all the calculations on $T_{\tilde w^k}\mathcal{M}$. More specifically, after transporting $\mathrm{grad} f_n(w_t^{k+1})$ from $w_t^{k+1}$ to $\tilde w^k$ using $\mathcal{T}^{-1}$, the modified stochastic gradient $\xi_t^{k+1}$ is computed as

$$\xi_t^{k+1} = \mathcal{T}^{-1}\big(\mathrm{grad} f_n(w_t^{k+1})\big) - \mathrm{grad} f_n(\tilde w^k) + \mathrm{grad} f(\tilde w^k).$$

After calculating $\tilde{\mathcal{H}}^{k}[\xi_t^{k+1}]$ using the two-loop recursion algorithm, we obtain the search direction at $w_t^{k+1}$ by transporting $\tilde{\mathcal{H}}^{k}[\xi_t^{k+1}]$ back to $T_{w_t^{k+1}}\mathcal{M}$ by $\mathcal{T}$. Finally, we update $w_{t+1}^{k+1}$ from $w_t^{k+1}$ as $w_{t+1}^{k+1} = R_{w_t^{k+1}}(-\alpha\, \mathcal{T}(\tilde{\mathcal{H}}^{k}[\xi_t^{k+1}]))$. It should be noted that, although $-\xi_t^{k+1}$ is not generally guaranteed to be a descent direction, its expectation over the sample choice is a descent direction. Furthermore, the positive definiteness of $\tilde{\mathcal{H}}^{k}$ yields that $-\tilde{\mathcal{H}}^{k}[\xi_t^{k+1}]$ is a descent direction on average.
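
Putting the pieces together, the following schematic (our own sketch with hypothetical helper names, not the released Manopt implementation) performs one inner iteration entirely on the tangent space at the snapshot, as described above.

```python
def rsqn_vr_inner_step(w, w_tilde, n, alpha,
                       rgrad_n,            # stochastic Riemannian gradient: (point, sample) -> tangent vector
                       full_grad_tilde,    # precomputed full gradient at w_tilde
                       retract,            # retraction: (point, tangent vector) -> point
                       transport,          # vector transport: (from, to, tangent vector) -> tangent vector
                       apply_H_tilde):     # two-loop recursion at w_tilde: tangent vector -> tangent vector
    """One inner iteration of an R-SQN-VR-style update (schematic).

    The variance-reduction and quasi-Newton computations are done on the tangent
    space at the snapshot w_tilde; only the final direction is moved back to w."""
    # 1) Transport the current stochastic gradient to the tangent space at w_tilde.
    g = transport(w, w_tilde, rgrad_n(w, n))
    # 2) Variance-reduced (modified) gradient on that tangent space.
    xi = g - rgrad_n(w_tilde, n) + full_grad_tilde
    # 3) Apply the inverse Hessian approximation kept at w_tilde.
    d = apply_H_tilde(xi)
    # 4) Transport the direction back to w and take a retraction step.
    return retract(w, -alpha * transport(w_tilde, w, d))
```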

3 Convergence analysis

This section presents convergence analyses on both non-convex and retraction-convex functions under retraction and vector transport operations. The concrete proofs are given in the supplementary file.

Assumption 1.

We assume the following [21]:

(1.1) The objective function and its components are twice continuously differentiable.

(1.2) For a sequence generated by Algorithm 1, there exists a compact and connected set such that for all . Also, for each , there exists a totally retractive neighborhood of such that stays in for any , where the -totally retractive neighborhood of is a set such that for all , , and is a diffeomorphism on , which is the ball in with center and radius , where is the zero vector in . Furthermore, suppose that there exists such that .

(1.3) The sequence continuously remains in -totally retractive neighborhood of critical point and is retraction-smooth with respect to retraction in . Here, is said to be retraction-smooth in if for all , i.e., there exists a constant such that , for all , all , and all such that for all .

(1.4) The vector transport is isometric on . It satisfies for any and .

(1.5) There exists a constant such that the vector transport satisfies the following conditions for all , which is some neighborhood of an arbitrary point : where denotes the differentiated retraction, i.e., with , and .

(1.6) The norm of the Riemannian stochastic gradient is bounded by a constant, as in [14, 15, 16].

Essential inequalities. We briefly summarize essential inequalities. They are detailed in the supplementary material. For all , which is a neighborhood of , the difference between the parallel translation and the vector transport is given with a constant as (Lemma C.14)

(5)

where and . Similarly, as for the difference between the exponential mapping and the retraction, there exist , for all and all small length of such that (Lemma C.15)

(6)

Then, the variance of is upper bounded by (Lemma D.9)

(7)

where is the constant of Assumption 1, is a Lipschitz constant, and is the constant in (5). Finally, there exist such that (Proposition C.7 for non-convex functions and Proposition D.6 for retraction-convex functions)

(8)

where the notation $\mathcal{A} \preceq \mathcal{B}$ for operators means that $\mathcal{B} - \mathcal{A}$ is positive semidefinite.

Now, we first present a global convergence analysis to a critical point starting from any initialization point, which is common in non-convex settings, under additional but mild assumptions:

Assumption 2.

We assume that $f$ is bounded below by a scalar, and that the decaying step-size sequence $\{\alpha_t\}$ satisfies $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$. Additionally, since the set in Assumption (1.2) is compact, all continuous functions on it can be bounded. Therefore, there exists a constant that uniformly bounds the corresponding quantities for all iterates and all samples $n$.
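
For instance, a standard decaying schedule satisfying these two conditions (given here only as an example; $\alpha_0$ and $\lambda$ are our placeholder names) is

$$\alpha_t = \frac{\alpha_0}{1 + \alpha_0 \lambda t}, \qquad \alpha_0, \lambda > 0, \qquad \text{for which} \quad \sum_{t=0}^{\infty} \alpha_t = \infty \quad \text{and} \quad \sum_{t=0}^{\infty} \alpha_t^2 < \infty.$$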

Theorem 3.1 (Global convergence analysis on non-convex functions).

Let $\mathcal{M}$ be a Riemannian manifold and $w^*$ be a non-degenerate local minimizer of $f$. Consider Algorithm 1, suppose Assumptions 1 and 2 hold, and suppose that there exists a positive real number bounding the largest eigenvalue of the Riemannian Hessian of the objective for all iterates. Then, the Riemannian gradient converges to zero in expectation, i.e., the sequence converges to a critical point of (1).

We next present a global convergence rate analysis. This requires a stricter selection of a fixed step-size satisfying the condition below but, in return, provides a convergence rate under it.

Theorem 3.2 (Global convergence rate analysis on non-convex functions).

Let be a Riemannian manifold and be a non-degenerate local minimizer of . Consider Algorithm 1 with Option II and IV, and suppose Assumption 1. Let the constants in (5), and in (6), and , and in (7). is the constant Assumption 1.3, and and are the constants and in (8). Set and , where , and . Given sufficiently small , suppose that is chosen such that holds. Set and . Then, we have

(9)

The total number of gradient evaluations required to obtain an $\epsilon$-solution follows from (9). The proof is given by extending those of [12, 23, 15].

As a final analysis, we present a local convergence rate in a neighborhood of a local minimum by additionally introducing a local retraction-convexity assumption below. This assumption is very common and standard in manifold optimization.

Assumption 3.

We assume that the objective function is strongly retraction-convex with respect to in . Here, is said to be strongly retraction-convex in if for all and is strongly convex, i.e., there exists a constant such that , for all , all , and all such that