1 Introduction
Optimization algorithms are a mainstay of machine learning research, underpinning solvers for a wide swath of problems ranging from linear regression and SVMs to deep learning. Consequently, scaling such algorithms to large-scale datasets while preserving theoretical guarantees is of paramount importance. An important challenge in this field is designing scalable algorithms for optimization problems in the presence of constraints on the search space, a situation frequently encountered in practice. One approach to handling such constrained optimization problems on vector spaces is to reformulate them as optimization tasks on a suitable Riemannian manifold, with the constraints acting as the manifold parametrization. Often, the problems can be shown to possess desirable geometric properties like convexity with respect to distance-minimizing geodesics on the manifold, leading to provably efficient optimization algorithms
[1, 2, 3, 4]. These ideas can then be combined with stochastic optimization techniques in the spirit of [5] to deal with large datasets while retaining theoretical convergence guarantees; see [6, 7] for recent examples. For instance, consider the problem of computing leading eigenvectors in the PCA setting
[8] with unit-norm constraints. Projection-based strategies are normally used for this kind of problem [9], but alternating between solving and projecting can be prohibitively expensive in high dimensions. However, the unit-norm constraint can be used to cast the eigenvector problem as an unconstrained optimization problem on the unit sphere, which happens to be one of the most well-behaved Riemannian manifolds.

Once the problems have been cast onto manifolds, one would want fast optimization algorithms that potentially use stochastic minibatches to deal with very large datasets. Such algorithms operating in Euclidean space have been widely researched in the optimization literature, but their development for Riemannian manifolds has been limited so far. In particular, one should note that the convergence speed limitations of unconstrained stochastic algorithms in the Euclidean case apply to manifold optimization as well; for instance, a straightforward port of stochastic gradient descent to Riemannian manifolds
[2] attains the same sublinear convergence seen in Euclidean space. There has been extensive work in the Euclidean domain using variance-reduced gradients to address this issue, with the aim of improving convergence rates by explicitly reducing the variance of stochastic gradients with suitably spaced full-gradient evaluations [8, 10]. A further advantage of this technique is that decaying learning rates are no longer needed to prove convergence, thereby solving the sublinearity issue as well as sidestepping the nontrivial task of selecting an appropriate decay rate for SGD-like algorithms in large-scale optimization scenarios. Researchers have begun porting these methods to the manifold optimization domain, with a stochastic first-order variance-reduced technique [7] showing robust convergence guarantees for both convex and nonconvex problems on geodesically complete manifolds.

Another complementary approach to improving convergence rates is of course using second-order updates for the iterates. In the Euclidean setting, one can show quadratic convergence rates for convex problems using Newton iterations, but these tend to be prohibitively expensive in high-dimensional big-data settings due to the need to store and invert the Hessian matrix. This limitation has led to the development of quasi-Newton methods, most notably L-BFGS [11]
, which uses a limited history of first-order information to approximate the inverse Hessian. The curvature information provided by the Hessian estimate allows superlinear convergence in ideal settings
[12]. While widely used for small-to-medium-scale problems, adoption of these methods for big-data problems has been limited, since the second-order updates can be prohibitively expensive to compute in such settings. Indeed, most optimization algorithms in the literature that use stochastic minibatching techniques to deal with large datasets are modifications of first-order gradient descent [13, 14], with relatively slow convergence in practice. This has recently begun to be addressed, with researchers devising stochastic variants of the L-BFGS technique [15] with straightforward convergence analyses. This approach has also been combined with variance reduction techniques and shown to have a linear convergence rate for convex problems in Euclidean space [16]. Our work in this paper is in a similar vein: we study quasi-Newton L-BFGS updates with stochastic variance reduction techniques for optimization problems on Riemannian manifolds, and analyze their convergence behavior for convex and nonconvex functions.

Contributions: The main contributions of this work may be summarized as follows:
1. We propose a stochastic L-BFGS method for Riemannian manifolds that uses stochastic variance reduction techniques for the first-order gradient estimates, and analyze its convergence for both convex and nonconvex functions under standard assumptions.
2. Our proof for strongly convex functions differs from those of recently proposed stochastic L-BFGS algorithms using variance-reduced gradients in Euclidean space [16], due to different bounds on the stochastic gradients. We do not use sectional curvature bounds in our proof for the convex case, making it structurally different from that of Riemannian SVRG [7].
3. We show strong experimental results on Karcher mean computation and calculation of leading eigenvalues, with noticeably better performance than Riemannian SVRG and VR-PCA; the latter is one of the best-performing Euclidean algorithms for finding dominant eigenvalues, and also uses stochastic variance-reduced gradients.
2 Preliminaries
2.1 Riemannian geometry
We begin with a brief overview of the differential geometric concepts we use in this work. We consider (smooth) manifolds $\mathcal{M}$ that are locally homeomorphic to open subsets of $\mathbb{R}^n$, in the sense that the neighborhood of each point can be assigned a system of coordinates of appropriate dimensionality. Formally, this is defined with the notion of a chart $(U, \varphi)$ at each $x \in \mathcal{M}$, where $U$ is an open subset containing $x$. Smooth manifolds are those covered by collections of differentiable ($C^\infty$-compatible) charts. A Riemannian metric $g_x(\cdot, \cdot)$ is a bilinear tensor field of type $(0, 2)$ that is also symmetric and positive definite. A manifold endowed with such a metric is called a Riemannian manifold. The tangent space $T_x\mathcal{M}$ at every $x \in \mathcal{M}$ is a vector space, with the Riemannian metric as the attendant inner product. $g_x$ then induces a norm for vectors in the tangent space $T_x\mathcal{M}$, which we denote by $\|\cdot\|_x$.
Riemannian manifolds are endowed with the Levi-Civita connection, which induces the notion of parallel transport of vectors from one tangent space to another along a geodesic, in a metric-preserving way. That is, we have an operator $P_\gamma : T_x\mathcal{M} \to T_y\mathcal{M}$ where, informally speaking, the geodesic $\gamma$ joins $x$ and $y$, and for any $u, v \in T_x\mathcal{M}$ we have $g_x(u, v) = g_y(P_\gamma u, P_\gamma v)$. The parallel transport is thus an isometry.
For every smooth curve $\gamma : [0, 1] \to \mathcal{M}$ lying in $\mathcal{M}$, we denote its velocity vector by $\dot{\gamma}(t)$ for each $t$, with the “speed” given by $\|\dot{\gamma}(t)\|_{\gamma(t)}$. The length of such a curve is measured as $\ell(\gamma) = \int_0^1 \|\dot{\gamma}(t)\|_{\gamma(t)}\, dt$. Denoting the covariant derivative along $\gamma$, taken with respect to the Riemannian (Levi-Civita) connection, by $\nabla_{\dot{\gamma}}$, we call $\nabla_{\dot{\gamma}}\dot{\gamma}$ the acceleration of the curve. Curves with constant velocities ($\nabla_{\dot{\gamma}}\dot{\gamma} = 0$) are called geodesics, and can be shown to generalize the notion of Euclidean straight lines. We assume that every pair $x, y \in \mathcal{M}$ can be connected by a geodesic $\gamma$ s.t. $\gamma(0) = x$ and $\gamma(1) = y$. Immediately we have the notion of “distance” between any $x, y \in \mathcal{M}$ as the minimum length over all geodesics connecting $x$ and $y$, assuming the manifolds are geodesically complete as mentioned above, in that every countable decreasing sequence of lengths of geodesics connecting a pair of points has a well-defined limit.
The geodesic induces a useful operator called the exponential map, defined as $\mathrm{Exp}_x(v) = \gamma(1)$, where $\gamma$ is the geodesic with $\gamma(0) = x$ and $\dot{\gamma}(0) = v$. If there is a unique geodesic connecting $x$ and $y$, then the exponential map has an inverse, denoted by $\mathrm{Exp}_x^{-1}(y)$. The length of this geodesic can therefore be seen to be $d(x, y) = \|\mathrm{Exp}_x^{-1}(y)\|_x$.
The derivative of a differentiable function $f : \mathcal{M} \to \mathbb{R}$ is defined using the Riemannian connection; it acts as a linear functional on each tangent space. Then, by the Riesz representation theorem, there exists a gradient $\mathrm{grad} f(x) \in T_x\mathcal{M}$ s.t. $Df(x)[v] = g_x(\mathrm{grad} f(x), v)$ for all $v \in T_x\mathcal{M}$. Similarly, the Hessian is the bilinear form obtained by differentiating the gradient field with the connection: $\mathrm{Hess} f(x)[u, v] = g_x(\nabla_u\, \mathrm{grad} f(x), v)$. We accordingly pass from such bilinear forms to their Riesz (Lax–Milgram) representations as linear operators on the tangent space. One can consult standard textbooks on differential geometry [17, 18] for more details.
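To make these operators concrete, the following minimal numpy sketch instantiates them for the unit sphere, one of the manifolds used in Section 4 (function names are illustrative and not taken from the paper):

```python
import numpy as np

def exp_map(x, v):
    """Exponential map Exp_x(v) on the unit sphere: follow the geodesic from x with velocity v."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def log_map(x, y):
    """Inverse exponential map Exp_x^{-1}(y); its norm is the geodesic distance d(x, y)."""
    u = y - np.dot(x, y) * x                      # project y onto the tangent space at x
    nu = np.linalg.norm(u)
    if nu < 1e-12:
        return np.zeros_like(x)
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0)) * (u / nu)

def parallel_transport(x, y, v):
    """Transport v in T_x M to T_y M along the connecting geodesic (an isometry)."""
    u = log_map(x, y)
    nu = np.linalg.norm(u)
    if nu < 1e-12:
        return v
    u_hat = u / nu
    # Rotate the component of v along the geodesic direction; leave the rest untouched.
    return v - np.dot(u_hat, v) * u_hat + np.dot(u_hat, v) * (np.cos(nu) * u_hat - np.sin(nu) * x)

def egrad_to_rgrad(x, egrad):
    """Riemannian gradient = projection of the Euclidean gradient onto T_x M."""
    return egrad - np.dot(x, egrad) * x
```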
2.2 Convexity and Lipschitz smoothness on manifolds
Similar to [2, 3, 7], we define manifold (or geodesic) convexity concepts analogous to their Euclidean counterparts, as follows: a set $\mathcal{X} \subseteq \mathcal{M}$ is convex on the manifold if, for any $x, y \in \mathcal{X}$, there exists a geodesic $\gamma$ connecting them that lies completely in $\mathcal{X}$. Then, a function $f$ can be defined as convex w.r.t. geodesics if for any $x, y$ and geodesic $\gamma$ connecting them on the manifold, we have:
$$f(\gamma(t)) \le (1 - t)\, f(x) + t\, f(y), \qquad t \in [0, 1].$$
We can also define a notion of strong convexity as follows: a function $f$ is called $S$-strongly convex if for any $x, y$ and (sub)gradient $g_x$ at $x$, we have
$$f(y) \ge f(x) + \langle g_x, \mathrm{Exp}_x^{-1}(y) \rangle_x + \frac{S}{2}\, d^2(x, y). \qquad (1)$$
We define Lipschitz smoothness of a function $f$ by imposing Lipschitz continuity on its gradients, as follows:
$$\left\| \mathrm{grad} f(x) - P_\gamma\, \mathrm{grad} f(y) \right\|_x \le L\, d(x, y),$$
where $L$ is the smoothness parameter and $P_\gamma$ transports $\mathrm{grad} f(y)$ to $T_x\mathcal{M}$ along the geodesic connecting $y$ and $x$. Analogous to the Euclidean case, this property can also be formulated as:
$$f(y) \le f(x) + \langle \mathrm{grad} f(x), \mathrm{Exp}_x^{-1}(y) \rangle_x + \frac{L}{2}\, d^2(x, y). \qquad (2)$$
3 Stochastic Riemannian L-BFGS
In this section we present our stochastic variance-reduced L-BFGS algorithm on Riemannian manifolds and analyze its convergence behavior for convex and nonconvex differentiable functions. We assume the objective functions to be Lipschitz smooth, as defined above, and the manifolds to admit unique distance-minimizing geodesics between every two points, i.e. our manifolds are geodesically complete; this gives us a well-defined inverse exponential map that encodes the distance between a pair of points on the manifold. For the convergence analysis, we also assume the objective to have a unique minimum at $x^* \in \mathcal{X}$, where $\mathcal{X}$ is a compact convex subset of the manifold.
3.1 The Algorithm
The pseudocode is shown in Algorithm 1. We provide a brief discussion of its salient properties, and compare it to similar algorithms in the Euclidean domain, for example [15, 16], as well as those on Riemannian manifolds, for example [3]. To begin, note that we work with the Riesz representation of the gradient, as defined in Section 2.1, and denote full gradients and stochastic gradients separately. Similar to other stochastic algorithms with variance reduction, we use two loops: each iteration of the inner loop corresponds to drawing one minibatch from the data and performing the stochastic gradient computations, whereas each outer loop iteration corresponds to two passes over the full dataset, once to compute the full gradient and once to make multiple minibatch runs. Compared to the Euclidean setting, note that the computation of the variance-reduced gradient involves an extra step: the gradients reside in the tangent spaces of the respective iterates, so we have to perform parallel transport to bring them to the same tangent space before taking linear combinations.
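As a concrete illustration of this transport step, the following sketch (reusing the parallel transport primitive above; variable names are hypothetical, not the paper's) forms the variance-reduced gradient at the current iterate:

```python
def vr_gradient(x, x_tilde, full_grad_tilde, grad_fn, batch):
    """Variance-reduced Riemannian gradient at the current iterate x.

    grad_fn(i, p) returns the Riemannian gradient of the i-th component at p;
    full_grad_tilde is the full gradient at the snapshot x_tilde and therefore
    lives in the tangent space at x_tilde."""
    g_x = np.mean([grad_fn(i, x) for i in batch], axis=0)
    g_tilde = np.mean([grad_fn(i, x_tilde) for i in batch], axis=0)
    # The snapshot terms live in T_{x_tilde}M; transport their difference to
    # T_x M before forming the linear combination.
    return g_x - parallel_transport(x_tilde, x, g_tilde - full_grad_tilde)
```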
To avoid memory issues and complications arising from Hessian-vector computations in the Riemannian setting, we chose to update the second correction variable using a simple difference-of-gradients approximation. We should note here that the parallel transport parametrization should be clear from the context; the subscript of the transport operator denotes the connecting geodesic, and we omit further annotations from the transport symbol to prevent notational overload. We calculate the first correction pair in one of two ways, denoted Option 1 and Option 2 respectively in Alg. 1; in both cases, the relevant vectors are parallel-transported to the tangent space of the current iterate. In our experiments, we noticed faster convergence for the strongly convex centroid computation problem with Option 1, along with computation of the correction pairs every iteration and a low memory pool. For calculating dominant eigenvalues on the unit-radius sphere, Option 2 yielded better results. Once the correction pairs have been computed, we compute the descent step in Step 21 using the standard two-loop recursion formula given in [12], using the correction pairs stored in memory. Note that we use fixed step-sizes in the update steps and in computing the correction pairs, and do not impose or perform calculations designed to have them satisfy Armijo or Wolfe conditions.
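For reference, a minimal sketch of the standard two-loop recursion of [12], applied to correction pairs that are assumed to have already been transported into the current tangent space (names and the initial scaling choice are illustrative):

```python
def two_loop_recursion(grad, s_list, y_list):
    """Approximate inverse-Hessian/gradient product via the L-BFGS two-loop
    recursion [12]. The correction pairs in s_list/y_list are assumed to live
    in the same tangent space as grad, so ordinary inner products apply."""
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * np.dot(s, q)
        alphas.append(alpha)
        q = q - alpha * y
    # Initial inverse-Hessian scaling gamma = s'y / y'y, as in Nocedal & Wright.
    s_last, y_last = s_list[-1], y_list[-1]
    r = (np.dot(s_last, y_last) / np.dot(y_last, y_last)) * q
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * np.dot(y, r)
        r = r + (alpha - beta) * s
    return r  # the descent direction is -r
```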
Compared to the Euclidean algorithms [15, 16], Alg. 1 has some key differences: 1) we did not notice any significant advantage from using separate minibatches for the variance-reduced gradient and the correction elements, so we use the same minibatch for both; 2) we do not keep a running average of the iterates for computing the correction element; 3) we use constant step-sizes throughout, in contrast to [15], which uses a decaying sequence. Note that we use the first-order VR gradient to update the iterates during the first few iterations; this is because we calculate correction pairs at fixed intervals and evaluate the gradient–inverse-Hessian product only once at least two pairs have been collected. Similar to [12], we drop the oldest pair to maintain the memory depth. Compared to the algorithms in [3, 19], ours uses stochastic VR gradients, with all the attendant modifications and advantages, and does not use line-search techniques to satisfy Wolfe conditions.
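Since Algorithm 1 itself is not reproduced here, the following rough sketch (building on the utilities above, with hypothetical parameter names) only illustrates how the pieces fit together; the exact step ordering, correction-pair options, and initial first-order phase of the actual algorithm differ in detail:

```python
def rsv_lbfgs(x0, grad_fn, full_grad_fn, n, outer_iters=10, inner_iters=100,
              batch_size=10, eta=0.01, memory=4, update_freq=5, rng=None):
    """Illustrative stochastic variance-reduced Riemannian L-BFGS loop."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x_tilde = x0
    s_list, y_list = [], []
    for _ in range(outer_iters):
        full_grad = full_grad_fn(x_tilde)              # one full pass over the data
        x = x_tilde
        for t in range(inner_iters):
            batch = rng.integers(0, n, size=batch_size)
            v = vr_gradient(x, x_tilde, full_grad, grad_fn, batch)
            if len(s_list) >= 2:
                direction = -two_loop_recursion(v, s_list, y_list)
            else:
                direction = -v                         # plain VR steps until pairs exist
            x_new = exp_map(x, eta * direction)
            # Keep stored correction pairs in the tangent space of the new iterate.
            s_list = [parallel_transport(x, x_new, si) for si in s_list]
            y_list = [parallel_transport(x, x_new, yi) for yi in y_list]
            if t % update_freq == 0:
                s = -log_map(x_new, x)                 # displacement, expressed at x_new
                g_new = vr_gradient(x_new, x_tilde, full_grad, grad_fn, batch)
                y = g_new - parallel_transport(x, x_new, v)   # gradient-difference proxy
                s_list.append(s)
                y_list.append(y)
                if len(s_list) > memory:               # drop the oldest pair
                    s_list.pop(0)
                    y_list.pop(0)
            x = x_new
        x_tilde = x
    return x_tilde
```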
3.2 Analysis of convergence
In this section we provide the main convergence results for the algorithm. We analyze convergence for finite-sum empirical risk minimization problems of the following form:
$$\min_{x \in \mathcal{X} \subseteq \mathcal{M}}\ f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (3)$$
where the Riemannian manifold is denoted by $\mathcal{M}$. Note that the iterates are updated in Algorithm 1 by taking the exponential map of the descent step multiplied by the step-size, with the descent step computed as the product of the inverse Hessian estimate and the stochastic variance-reduced gradient using the standard two-loop recursion formula. Thus, to bound the optimization error of the iterates, we will need bounds on both the stochastic gradients and the inverse Hessians. As mentioned in [7], the methods used to derive the former bounds for Euclidean algorithms cannot be ported directly to manifolds due to metric nonlinearities; see the proof of Proposition 1 for details. For the latter, we follow the standard template for L-BFGS algorithms in the literature [4, 12, 15]. To begin, we make the following assumptions:
Assumption 1. The function $f$ in (3) is strongly convex on the manifold, whereas the $f_i$'s are individually convex.
Assumption 2. There exist constants $0 < \lambda \le \Lambda$ s.t. the (Riesz representations of the) Hessians of the $f_i$ satisfy $\lambda\, \mathrm{id} \preceq \mathrm{Hess} f_i(x) \preceq \Lambda\, \mathrm{id}$ for all $i$ and all $x \in \mathcal{X}$.
These two assumptions allow us to (a) guarantee that $f$ has a unique minimizer in the convex sublevel set under consideration, and (b) derive bounds on the inverse Hessian updates using the BFGS update formulae for the Hessian approximations. Similar to the Euclidean case, these can be written as follows:
(4) 
and, by the Sherman–Morrison–Woodbury lemma, that of the inverse:
where the scalar coefficient is defined in terms of the correction pair, and the operator is the Lax–Milgram representation of the Hessian. Details on these constructs can be found in [4], in addition to [17, 18].
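For reference, in the Euclidean setting these two updates take the familiar forms below, with $s_k$ the iterate-difference and $y_k$ the gradient-difference correction pair; the Riemannian versions above replace these vectors by their parallel-transported counterparts and the matrices by the corresponding Lax–Milgram operators (this Euclidean form is shown only as an illustration, not as the paper's exact manifold update):

$$B_{k+1} = B_k - \frac{B_k s_k s_k^{\top} B_k}{s_k^{\top} B_k s_k} + \frac{y_k y_k^{\top}}{y_k^{\top} s_k}, \qquad H_{k+1} = \left( I - \rho_k s_k y_k^{\top} \right) H_k \left( I - \rho_k y_k s_k^{\top} \right) + \rho_k s_k s_k^{\top}, \quad \rho_k = \frac{1}{y_k^{\top} s_k}.$$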
3.2.1 Trace and determinant bounds
To start off our convergence discussions for both convex and nonconvex cases, we derive bounds for the trace and determinants of the Hessian approximations, followed by those for their inverses. The techniques used to do so are straightforward ports of the Euclidean originals [12], with some minor modifications to account for differential geometric technicalities. Using the assumptions above, we can prove the following bounds [4]:
Lemma 1.
Let be the approximation of the Hessian generated by Algorithm 1, and and be the corresponding Lax–Milgram representations. Let , the memory parameter, be the number of correction pairs used to update the inverse Hessian approximation. Then, under Assumptions 1 and 2, we have:
Also, , for some .
From a notational perspective, recall that our notation for the parallel transport operator is , with the subscript denoting the geodesic. The symbols and in Lemma 1 above are unrelated to these geometric concepts, merely being the derived bounds on the eigenvalues of inverse Hessian approximations. The proof is given in the supplementary for completeness.
3.2.2 Convergence result for strongly convex functions
Our convergence result for strongly convex functions on the manifold can be stated as follows:
Proposition 1.
Let Assumptions 1 and 2 hold. Further, let the $f$ in (3) be $S$-strongly convex, and each of the $f_i$ be $L$-smooth, as defined earlier. Define the following constants: , and . Denote the global optimum by $x^*$. Then the iterate obtained after outer loop iterations will satisfy the following condition:
where the constants are chosen to satisfy for linear convergence.
For proving this statement, we will use the smoothness (2) and strong convexity (1) conditions mentioned earlier. As in the Euclidean case [10, 16], we will also require a bound on the stochastic variance-reduced gradients. These can be bounded using triangle inequalities and smoothness on Riemannian manifolds, as shown in [7]. This alternative is necessary since the Euclidean bound first derived in [10], which uses the difference of the objective function values at the iterates, cannot be ported directly to manifolds due to metric nonlinearities. Thus we take a different approach in our proof compared to the Euclidean case of [16], using inter-point distances defined via the norms of inverse exponential maps. We do not use trigonometric distance inequalities [3, 6] for the convex case either, making the overall structure different from the proof of Riemannian SVRG as well. The details are deferred to the supplementary material due to space limitations. However, we do use the trigonometric inequality, along with an assumed lower bound on sectional curvature, to show convergence for nonconvex functions, as described next.
3.2.3 Convergence for the nonconvex case
Here we provide a convergence result for nonconvex functions satisfying the following condition: , which automatically holds for strongly convex functions. We assume this to hold even if is nonconvex, since it allows us to show convergence of the iterates using . Further, similar to [2, 7] we assume that the sectional curvature of the manifold is lower bounded by . This allows us to derive a trigonometric inequality analogous to the Euclidean case, where the sides of the “triangle” are geodesics [6]. The details are given in the supplementary. Additionally, we assume that the eigenvalues of the inverse Hessian are bounded by within some suitable region around an optimum. The main result of this section may be stated as follows:
Proposition 2.
Let the sectional curvature of the manifold be bounded below by , and each of the $f_i$ be $L$-smooth. Let $x^*$ be an optimum of $f$ in (3). Assume the eigenvalues of the inverse Hessian estimates are bounded. Set , , and , where and . Then, for suitable choices of the inverse Hessian bounds, we can find values for the constants so that the following holds:
Here, following [7], $\zeta$ is defined as $\sqrt{|\kappa|}\, D / \tanh(\sqrt{|\kappa|}\, D)$ if the curvature lower bound $\kappa$ is negative, and $1$ otherwise; $D$ is an upper bound on the diameter of the compact set mentioned earlier, containing an optimum $x^*$. The proof is inspired by similar results from both Euclidean [20] and Riemannian [7] analyses, and is given in the supplementary material. One way to deal with negative curvature in Hessians in Euclidean space is to add a suitable positive constant to the diagonal, ensuring bounds on the eigenvalues. Investigation of such “damping” methods in the Riemannian context could be an interesting direction for future work.
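For completeness, the trigonometric distance inequality invoked above takes the following standard form [6, 7]: for a geodesic triangle in a manifold with sectional curvature bounded below by $\kappa < 0$, with side lengths $a$, $b$, $c$ and angle $A$ between the sides of lengths $b$ and $c$,

$$a^2 \le \frac{\sqrt{|\kappa|}\, c}{\tanh\!\left( \sqrt{|\kappa|}\, c \right)}\, b^2 + c^2 - 2\, b\, c \cos A.$$

When $\kappa \ge 0$ the curvature-dependent factor is replaced by $1$, matching the definition of $\zeta$ above.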
4 Experiments
4.1 Karcher mean computation for PD matrices
We begin with a synthetic experiment on learning the Karcher mean (centroid) [21] of positive definite matrices. For a collection of such matrices $\{A_i\}_{i=1}^{n}$, the optimization problem can be stated as follows:
$$\min_{X \succ 0}\ f(X) = \frac{1}{n} \sum_{i=1}^{n} d^2(X, A_i), \qquad d(X, A) = \left\| \log\!\left( X^{-1/2} A X^{-1/2} \right) \right\|_F,$$
where $d$ is the geodesic distance induced by the affine-invariant metric on the positive definite cone.
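As an illustration of the Riemannian machinery this objective involves (not the paper's actual Manopt/Mixest implementation; function names are ours), the gradient of each squared-distance term follows from the general identity $\mathrm{grad}_X\, \tfrac{1}{2} d^2(X, A) = -\mathrm{Exp}_X^{-1}(A)$:

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm, inv

def spd_log(X, A):
    """Inverse exponential map Exp_X^{-1}(A) on the SPD manifold (affine-invariant metric)."""
    Xh = sqrtm(X)
    Xh_inv = inv(Xh)
    return Xh @ logm(Xh_inv @ A @ Xh_inv) @ Xh

def spd_exp(X, V):
    """Exponential map Exp_X(V) on the SPD manifold."""
    Xh = sqrtm(X)
    Xh_inv = inv(Xh)
    return Xh @ expm(Xh_inv @ V @ Xh_inv) @ Xh

def karcher_gradient(X, A_list):
    """Riemannian gradient of f(X) = (1/n) sum_i d^2(X, A_i)."""
    n = len(A_list)
    return -(2.0 / n) * sum(spd_log(X, A) for A in A_list)

# One full (non-stochastic) Riemannian gradient step with step-size eta would be:
# X_next = spd_exp(X, -eta * karcher_gradient(X, A_list))
```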
We compare our minibatched implementation of the Riemannian SVRG algorithm from [7], denoted rSVRG, with the stochastic variance-reduced L-BFGS procedure from Algorithm 1, denoted rSVLBFGS. We implemented both algorithms using the Manopt [22] and Mixest [23] toolkits. We generated three sets of random positive definite matrices of a fixed size, with varying condition numbers, and computed the ground truths using code from [24]. Both algorithms used equal batch sizes on each dataset and were initialized identically, with learning rates satisfying their respective convergence conditions. In general we found rSVLBFGS to perform better with frequent correction-pair calculations and a low retention rate, ostensibly due to the strong convexity; we therefore used these settings for all three datasets. As mentioned earlier, the correction pair was calculated using Option 1, with the descent direction computed using the two-loop recursion. We used standard retractions to approximate the exponential maps; the retraction formulae for both the symmetric PD manifold and the sphere manifold used in the next section are given in the supplementary material.
We calculated the error of each iterate relative to the ground truth, and the log errors are plotted against the number of data passes in Fig. 1. Comparing convergence speed in terms of the number of data passes is often the preferred approach for benchmarking ML algorithms, since it is implementation-agnostic and focuses on the key bottleneck (I/O) for big-data problems. Comparisons of rSVRG with Riemannian gradient descent methods, both batch and stochastic, can be found in [7]. From Fig. 1, we find rSVLBFGS to converge faster than rSVRG on all three datasets.
4.2 Leading eigenvalue computation
Next we conduct a synthetic experiment on calculating the leading eigenvalue of a matrix. This is a common problem in machine learning, and in Euclidean space it is a unit-norm-constrained nonconvex optimization problem. It can be written as:
$$\max_{w \in \mathbb{R}^d,\ \|w\|_2 = 1}\ \frac{1}{n} \sum_{i=1}^{n} \left( x_i^{\top} w \right)^2 = \frac{1}{n}\, w^{\top} X X^{\top} w,$$
where $X$ is the data matrix and the $x_i$ are its columns. We can transform this into an unconstrained manifold optimization problem on the sphere defined by the norm constraint. To that end, we generated four sets of data points with varying eigengaps, using the techniques described in [8]; each dataset contains vectors of a fixed dimension. We used the same minibatch size for the two Riemannian algorithms. As before, learning rates for rSVRG were chosen according to the bounds in [7]. Selecting appropriate values for the four parameters in rSVLBFGS (the first- and second-order learning rates, the correction-pair update frequency, and the memory parameter) was a nontrivial task; after careful grid searches within the bounds defined by the convergence conditions, we fixed a common setting for all four datasets, with one parameter set differently for one of them. The correction pair was calculated using Option 2. We plot the performance of rSVLBFGS, rSVRG and VR-PCA in Fig. 2. Extensive comparisons of VR-PCA with other Euclidean algorithms have been conducted in [8]; we do not repeat them here. We computed the error of each iterate relative to the ground truth obtained from Matlab's built-in eigenvalue routine.
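For illustration, the stochastic Riemannian gradient used on the sphere is simply the tangent-space projection of the Euclidean gradient of each summand; a minimal sketch using the sphere utilities from Section 2.1 (the error measure shown is a common choice, not necessarily the paper's exact metric):

```python
def pca_stochastic_rgrad(w, x_i):
    """Riemannian (sphere) gradient of the negated summand -(x_i' w)^2 at w."""
    egrad = -2.0 * np.dot(x_i, w) * x_i       # Euclidean gradient of the summand
    return egrad_to_rgrad(w, egrad)           # project onto the tangent space at w

def leading_eigvec_error(w, v_true):
    """Squared sine of the angle between the iterate and the true leading eigenvector."""
    return 1.0 - np.dot(w, v_true) ** 2
```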
We see that the rSVLBFGS method performs well on all four datasets, reaching low errors well before VR-PCA and rSVRG in the last three cases. The performance delta relative to VR-PCA is particularly noticeable in each of the four cases; we consider this a noteworthy result for fixed step-size algorithms on Riemannian manifolds.
5 Conclusion
We propose a novel L-BFGS algorithm on Riemannian manifolds with variance-reduced stochastic gradients, and provide convergence analyses for both strongly convex and nonconvex functions on manifolds. We conduct experiments on computing Riemannian centroids of symmetric positive definite matrices and on calculating leading eigenvalues, both using large-scale datasets. Our algorithm outperforms other Riemannian optimization algorithms with fixed step-sizes in both cases, and performs noticeably better than VR-PCA, one of the fastest stochastic algorithms in Euclidean space, in the latter case.
References
 [1] P. A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
 [2] H. Zhang and S. Sra. First-order methods for geodesically convex optimization. In COLT, 2016.
 [3] S. Sra and R. Hosseini. Conic geometric optimization on the manifold of positive definite matrices. SIAM Journal on Optimization, 25(1):713–739, 2015.
 [4] W. Ring and B. Wirth. Optimization methods on Riemannian manifolds and their application to shape space. SIAM Journal on Optimization, 22(2):596–627, 2012.
 [5] H. Robbins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
 [6] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
 [7] H. Zhang, S. J. Reddi, and S. Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In NIPS, 2016.
 [8] O. Shamir. A Stochastic PCA and SVD Algorithm with an Exponential Convergence Rate. In ICML, 2015.

 [9] E. Oja. Principal components, minor components, and linear neural networks. Neural Networks, 5(6):927–935, 1992.
 [10] R. Johnson and T. Zhang. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. In NIPS, 2013.
 [11] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1–3):503–528, 1989.
 [12] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2006.
 [13] L. Bottou and Y. LeCun. Large scale online learning. In NIPS, 2004.
 [14] L. Bottou. Large-scale machine learning with stochastic gradient descent. In International Conference on Computational Statistics, 2010.
 [15] R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer. A Stochastic Quasi-Newton Method for Large-scale Optimization, 2014. arXiv:1410.1068.
 [16] P. Moritz, R. Nishihara, and M. I. Jordan. A Linearly-Convergent Stochastic L-BFGS Algorithm. In AISTATS, 2016.
 [17] J. Lee. Riemannian Manifolds: An Introduction to Curvature. Springer-Verlag, 1997.
 [18] W. M. Boothby. An Introduction to Differentiable Manifolds and Riemannian Geometry. Academic Press Inc., 1986.
 [19] R. Hosseini and S. Sra. Matrix Manifold Optimization for Gaussian Mixtures. In NIPS, 2015.
 [20] S. J. Reddi, A. Hefny, S. Sra, B. Póczós, and A. Smola. Stochastic Variance Reduction for Nonconvex Optimization. In ICML, 2016.
 [21] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2007.
 [22] N. Boumal, B. Mishra, P.A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15:1455–1459, 2014.
 [23] R. Hosseini and M. Mash’al. Mixest: An Estimation Toolbox for Mixture Models, 2015. arXiv:1507.06065.
 [24] D. A. Bini and B. Iannazzo. Computing the Karcher mean of symmetric positive definite matrices. Linear Algebra and its Applications, 483(4):1700–1710, 2013.
 [25] A. Mokhtari and A. Ribeiro. Global Convergence of Online Limited Memory BFGS. Journal of Machine Learning Research, 16(1):3151–3181, 2015.
6 Appendices
We present the proofs of the convergence results stated in Propositions 1 and 2 of the main text in this section.
6.1 Analysis of convergence
We analyze convergence for finite-sum empirical risk minimization problems of the following form:
$$\min_{x \in \mathcal{X} \subseteq \mathcal{M}}\ f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (5)$$
where the Riemannian manifold is denoted by $\mathcal{M}$. Note that the iterates are updated in Algorithm 1 by taking the exponential map of the descent step multiplied by the step-size, with the descent step computed as the product of the inverse Hessian estimate and the stochastic variance-reduced gradient using the standard two-loop recursion formula. Thus, to bound the optimization error of the iterates, we will need bounds on both the stochastic gradients and the inverse Hessians. As mentioned in [7], the methods used to derive the former bounds for Euclidean algorithms cannot be ported directly to manifolds due to metric nonlinearities; see the proof of Proposition 1 for details. For the latter, we follow the standard template for L-BFGS algorithms in the literature [4, 12, 15]. To begin, we make the following assumptions:
Assumption 1. The function $f$ in (5) is strongly convex on the manifold, whereas the $f_i$'s are individually convex.
Assumption 2. There exist constants $0 < \lambda \le \Lambda$ s.t. the (Riesz representations of the) Hessians of the $f_i$ satisfy $\lambda\, \mathrm{id} \preceq \mathrm{Hess} f_i(x) \preceq \Lambda\, \mathrm{id}$ for all $i$ and all $x \in \mathcal{X}$.
These two assumptions allow us to (a) guarantee that $f$ has a unique minimizer in the convex sublevel set under consideration, and (b) derive bounds on the inverse Hessian updates using the BFGS update formulae for the Hessian approximations. Similar to the Euclidean case, these can be written as follows:
(6) 
and, by the Sherman–Morrison–Woodbury lemma, that of the inverse:
where the scalar coefficient is defined in terms of the correction pair, and the operator is the Lax–Milgram representation of the Hessian [4].
6.1.1 Trace and determinant bounds
To start off our convergence discussions for both convex and nonconvex cases, we derive bounds for the trace and determinants of the Hessian approximations, followed by those for their inverses. The techniques used to do so are straightforward ports of the Euclidean originals [12], with some minor modifications to account for differential geometric technicalities. Using the assumptions above, we can prove the following bounds [4]:
Lemma 1.
Let be the approximation of the Hessian generated by Algorithm 1, and and be the corresponding Lax–Milgram representations. Let , the memory parameter, be the number of correction pairs used to update the inverse Hessian approximation. Then, under Assumptions 1 and 2, we have:
Also, , for some .
Proof.
For brevity of notation we temporarily drop the superscript. The proof for the Euclidean case [15, 25] can be generalized to the Riemannian scenario in a straightforward way, as follows. Define the average Hessian by
such that . Then, it can be easily shown that satisfies the bounds in Assumption 2. Furthermore, we have the following useful inequalities
(7) 
Let be the Riesz representation of . Recall that parallel transport is an isometry along geodesics, which implies invariance of the trace operator. Then, using the L-BFGS update (6) and (7), we can bound the trace of the Lax–Milgram representation of the Hessian approximations as follows:
This therefore proves boundedness of the largest eigenvalue of the estimates.
Similarly, to get a lower bound for the minimum eigenvalue, we bound the determinant as follows:
where we use to denote the maximum eigenvalue of , and use (7). Since is bounded above by the trace of , we can telescope the inequality above to get
The bounds on the maximum and minimum eigenvalues thus derived allow us to infer corresponding bounds for the eigenvalues of the inverse Hessian approximations as well, since the latter are by definition the operator inverses of the former. ∎
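For clarity, the final inference from the trace and determinant bounds to eigenvalue bounds uses the elementary relations valid for any symmetric positive definite operator $B$ on a $d$-dimensional space:

$$\lambda_{\max}(B) \le \operatorname{tr}(B), \qquad \lambda_{\min}(B) \ge \frac{\det(B)}{\lambda_{\max}(B)^{\, d-1}} \ge \frac{\det(B)}{\operatorname{tr}(B)^{\, d-1}}.$$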
6.1.2 Convergence results for the strongly convex case
Next we provide a brief overview of the bounds necessary to prove our convergence result. First, note the following bound implied by the Lipschitz continuity of the gradients:
Note the update step from line 11 of Algorithm 1: . We can replace the inverse exponential map in the inner product above by the quantity in the parentheses. In order to replace it by the eigenvalue bounds from Lemma 1, we invoke the following result (Lemma 5.8 from [17]):
Lemma 2.
For any and , ,
where is the “speed” of the geodesic. This allows us to write . Recall that for Riemannian geodesics the speed is constant.
Proposition 1.
Let Assumptions 1 and 2 hold. Further, let the $f$ in (5) be $S$-strongly convex, and each of the $f_i$ be $L$-smooth, as defined earlier. Define the following constants: , and . Denote the global optimum by $x^*$. Then the iterate obtained after outer loop iterations will satisfy the following condition:
where the constants are chosen to satisfy for linear convergence.
Proof.
From the smoothness condition (2), we have the following:
where we have omitted subscripts from the metric. Taking expectations, and using the bounds on the inverse Hessian estimates derived in Lemma 1, we have the following:
(8) 
where we have used the following bound on the stochastic variance-reduced gradients, derived in [7]:
This can be derived using triangle inequalities and the smoothness assumption. Note that the bound is different from the Euclidean case [10], due to technicalities introduced by the Riemannian metric not being linear in general.
Now, recall the condition , which follows from strong convexity. Using this, we can derive a bound on the term in (8) as follows:
where the second inequality follows from strong convexity (1), since . Plugging this into (8), we have the following:
(9) 
Now, note that strong convexity allows us to write the following:
where the last step follows from smoothness. Taking expectations of both sides, and using (9) for the first component on the right, we have
(10) 
Now, we denote , and
. Then, taking expectations over the sigma algebra of all the random variables up to the current minibatch , and after some algebra, it can be shown that:
where . Note that this provides a bound on the iterate at the end of the inner minibatch loop. Telescoping further, we have the bound
where . Then, using this result with a final appeal to the Lipschitz and strong convexity conditions, we have the bounds
thereby completing the proof. ∎
6.1.3 Convergence results for the nonconvex case
Lemma 3.
Let the sectional curvature of a Riemannian manifold be bounded below by . Let be the angle between sides of length and in a triangle on the manifold, with the third side of length , as usual. Then the following holds:
The cosine is defined using inner products, as in the Euclidean case, and the distances using inverse exponential maps, as seen above. The following sequence of results and proofs is inspired by the basic structure of [20], with suitable modifications involving the inverse Hessian estimates from the L-BFGS updates.
Lemma 4.
Let the assumptions of Proposition 2 hold. Define the following functions:
where . Further, for , define the Lyapunov function . Then we have the following bound: