Stochastic gradient descent on Riemannian manifolds

11/22/2011 ∙ by Silvère Bonnabel

Stochastic gradient descent is a simple approach for finding local minima of a cost function whose evaluations are corrupted by noise. In this paper, we develop a procedure extending stochastic gradient descent algorithms to the case where the function is defined on a Riemannian manifold. We prove that, as in the Euclidean case, the gradient descent algorithm converges to a critical point of the cost function. The algorithm has numerous potential applications, and is illustrated here by four examples. In particular a novel gossip algorithm on the set of covariance matrices is derived and tested numerically.


1 Introduction

Stochastic approximation provides a simple approach, of great practical importance, to find the local minima of a function whose evaluations are corrupted by noise. It has had a long history in optimization and control with numerous applications (e.g. [19, 6, 36]). To demonstrate the main ideas on a toy example, we briefly mention a traditional procedure to optimize the ballistic trajectory of a projectile in a fluctuating wind. Successive gradient corrections (i.e. corrections proportional to the distance between the projectile impact and the target) are performed on the angle at which the projectile is launched. With a decreasing step size tending to zero, one can reasonably hope the launching angle will converge to a fixed value which is such that the corresponding impacts are centered on the target on average. One of the first formal algorithms of this kind is the Robbins-Monro algorithm [29], which dates back to the 1950s. It proves that for a smooth cost function $C$ having a unique minimum, the algorithm $w_{t+1} = w_t - \gamma_t H(z_t, w_t)$, where $H(z_t, w_t)$ is a noisy evaluation of the gradient of $C$ at $w_t$, converges in quadratic mean to the minimum, under specific conditions on the step size sequence $(\gamma_t)_{t \ge 0}$.

Although stochastic gradient has found applications in control, system identification, and filtering theories (for instance a Kalman filter for noisy observations of a constant process is a Robbins-Monro algorithm), new challenging applications stem from the active machine learning community. The work of L. Bottou, a decade ago [9], has popularized the stochastic gradient approach, both to address the online learning problem (identification of a constant parameter in real time from noisy output measurements) and large-scale learning (with ever-increasing datasets, approximating the cost function with a simpler appropriate stochastic function can lead to a reduced numerical complexity). Some recent problems have been strong drivers for the development of new estimation methods, such as the one proposed in the present paper, dealing with stochastic gradient descent on manifolds.

The paper is organized as follows. In Section 2 general stochastic gradient descent algorithms on Riemannian manifolds are introduced. The algorithms already used in [23, 22, 3, 4, 26] can all be cast in the proposed general framework. The main algorithms are completely intrinsic, i.e. they do not depend on a specific embedding of the manifold or on a choice of local coordinates.

In Section 3 the convergence properties of the algorithms are analyzed. In the Euclidean case, almost sure (a.s.) convergence of the parameter to a critical point of the cost function is well-established under reasonable assumptions (see e.g. [9]), but this result has never been proven to hold for non-Euclidean spaces. In this paper, almost sure convergence of the proposed algorithms is obtained under several assumptions, extending the results of the Euclidean case to the Riemannian case.

In Section 4 the algorithms and the convergence results of the preceding sections are applied to four examples. The first example revisits the celebrated Oja algorithm [26] for online principal component analysis (PCA). This algorithm can be cast in our versatile framework, and its convergence properties immediately follow from the theorems of Section 3. Moreover, the other results of the present paper allow us to define alternative algorithms for online PCA with guaranteed convergence properties. The second example is concerned with the randomized computation of intrinsic means on a hyperbolic space, the Poincaré disk. This is a rather tutorial example, meant to illustrate the assumptions and results of the third theorem of Section 3. The convergence follows from this theorem. The last two examples are more detailed and include numerical experiments. The third example is concerned with a particular algorithm of [23]. The goal is to identify a positive semi-definite matrix (a kernel or a Mahalanobis distance) from noisy measurements. The theoretical convergence results of Section 3 allow us to complete the work of [23], and simulations illustrate the convergence properties. The last example is concerned with a consensus application on the set of covariance matrices (see e.g. [20]). A novel randomized gossip algorithm based on the Fisher Information Metric is proposed. The algorithm has a meaningful statistical interpretation, and admits several invariance and guaranteed convergence properties that follow from the results of Section 3. As the state space is convex, the usual gossip algorithm [10] is well defined and can be implemented on this space. Simulations indicate that the proposed Riemannian consensus algorithm converges faster than the usual gossip algorithm.

The appendix contains a brief recap of differential geometry. Preliminary results can be found in [8, 7].

2 Stochastic gradient on Riemannian manifolds

2.1 Standard stochastic gradient in $\mathbb{R}^n$

Let $C(w) = \mathbb{E}_z[Q(z, w)] = \int Q(z, w)\, dP(z)$ be a three times continuously differentiable cost function, where $w \in \mathbb{R}^n$ is a minimization parameter, and $P$ is a probability measure on a measurable space $\mathcal{Z}$. Consider the optimization problem

$\min_{w \in \mathbb{R}^n} C(w).$   (1)

In stochastic approximation, the cost function $C$ cannot be computed explicitly, as the distribution $P$ is assumed to be unknown. Instead, one has access to a sequence of independent observations $z_1, z_2, \ldots$ of a random variable $z$ drawn with probability law $P$. At each time step $t$, the user can compute the so-called loss function $Q(z_t, w)$ for any parameter $w$. The loss can be viewed as an approximation of the (average) cost function evaluated under the input $z_t$. Stochastic gradient descent is a standard technique to treat this problem. At each step the algorithm receives an input $z_t$ drawn according to $P$, and performs a gradient descent on the approximated cost,

$w_{t+1} = w_t - \gamma_t H(z_t, w_t),$

where $H(z_t, w_t) = \nabla_w Q(z_t, w_t)$ can be viewed as the gradient of the loss, i.e., on average $\mathbb{E}_z[H(z, w)] = \nabla C(w)$. As $C$ is not convex in many applications, one cannot hope for a much better result than almost sure (a.s.) convergence of $C(w_t)$ to some value $C_\infty$, and convergence of $\nabla C(w_t)$ to $0$. Such a result holds under a set of standard assumptions, summarized in e.g. [9]. Note that a.s. convergence is a very desirable property, for instance in online estimation, as it ensures asymptotic convergence is always achieved in practice.
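
To make the preceding description concrete, here is a minimal sketch of the Euclidean update $w_{t+1} = w_t - \gamma_t H(z_t, w_t)$ in Python. The quadratic loss, the synthetic data, and the step-size schedule are illustrative assumptions, not part of the original setting.

```python
import numpy as np

def sgd(grad_loss, w0, samples, a=1.0, b=10.0):
    # Plain Euclidean stochastic gradient descent: w <- w - gamma_t * H(z_t, w),
    # with gamma_t = a / (b + t), a schedule satisfying condition (4).
    w = np.asarray(w0, dtype=float)
    for t, z in enumerate(samples):
        w = w - (a / (b + t)) * grad_loss(z, w)
    return w

# Toy usage: online least squares with loss Q(z, w) = 0.5 * (y - x @ w)**2, z = (x, y).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
samples = []
for _ in range(5000):
    x = rng.normal(size=2)
    samples.append((x, x @ w_true + 0.1 * rng.normal()))
w_hat = sgd(lambda z, w: -(z[1] - z[0] @ w) * z[0], np.zeros(2), samples)
```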

2.2 Limits of the approach: a motivating example

A topical problem that has attracted a lot of attention in the machine learning community over the last years is low-rank matrix estimation (or matrix completion, which can be viewed as the matrix counterpart of sparse approximation problems), and in particular the collaborative filtering problem: given a matrix containing the preference ratings of $n$ users about $d$ items (movies, books), the goal is to compute personalized recommendations of these items. Only a small subset of entries is known, and there are many ways to complete the matrix. A standard approach to overcome this ambiguity, and to filter the noise, is to constrain the state space by assuming the tastes of the users are explained by a reduced number of criteria (say, $r$). This yields the following non-linear optimization problem

$\min_{W \in \mathbb{R}^{n \times d},\ \operatorname{rank}(W) = r}\ \sum_{(i, j) \in \Omega} \big(W_{ij} - X_{ij}\big)^2,$

where $X$ is the matrix of known ratings and $\Omega$ is the set of indices of the observed entries.

The matrix being potentially of very high dimension (as in the so-called Netflix prize problem), a standard method to reduce the computational burden is to draw random elements of $\Omega$, and perform gradient descent ignoring the remaining entries. Unfortunately, the updated matrix does not have rank $r$. Seeking the matrix of rank $r$ which best approximates it can be numerically costly, especially for very large matrices. A more natural way to enforce the rank constraint is to endow the parameter space with a Riemannian metric, and to perform a gradient step within the manifold of fixed-rank matrices. In [22] this approach has led to stochastic gradient algorithms that compete with state of the art methods. Yet a convergence proof is still lacking. The convergence results below are general, and in Section 4.3 they will be shown to apply to this problem for the particular case of a symmetric positive semi-definite matrix $W$.
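
For illustration only, the sketch below performs entry-wise stochastic gradient steps on a factorization $W = U V^T$, which enforces the rank constraint through the parametrization rather than through a Riemannian metric; it is not the fixed-rank manifold algorithm of [22], and the function name, initialization, and step sizes are hypothetical choices.

```python
import numpy as np

def sgd_matrix_completion(entries, n, d, r, n_passes=20, gamma0=0.05):
    # Minimal SGD sketch for low-rank completion with W = U @ V.T (rank <= r).
    # entries: list of (i, j, value) triplets for the observed ratings.
    rng = np.random.default_rng(0)
    U = 0.1 * rng.normal(size=(n, r))
    V = 0.1 * rng.normal(size=(d, r))
    t = 0
    for _ in range(n_passes):
        for idx in rng.permutation(len(entries)):
            i, j, x_ij = entries[idx]
            gamma = gamma0 / (1.0 + 1e-4 * t)
            e = U[i] @ V[j] - x_ij                      # residual on the sampled entry
            U[i], V[j] = U[i] - gamma * e * V[j], V[j] - gamma * e * U[i]
            t += 1
    return U, V
```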

2.3 Proposed general stochastic gradient algorithm on Riemannian manifolds

In this paper we propose a new procedure to address problem (1), where $C(w) = \mathbb{E}_z[Q(z, w)]$ is a three times continuously differentiable cost function and where $w$ is now a minimization parameter belonging to a smooth connected Riemannian manifold $M$. On $M$, we propose to replace the usual update $w_{t+1} = w_t - \gamma_t H(z_t, w_t)$ with the following update

$w_{t+1} = \exp_{w_t}\!\big(-\gamma_t H(z_t, w_t)\big),$   (2)

where $\exp_w$ is the exponential map at $w$, and $H(z_t, w_t)$ can be viewed as the Riemannian gradient of the loss, i.e., we have on average $\mathbb{E}_z[H(z, w)] = \nabla C(w)$, where $\nabla C(w)$ denotes the Riemannian gradient of $C$ at $w$. The proposed update (2) is a straightforward transposition of the standard gradient update in the Euclidean case. Indeed, $-H(z_t, w_t)$ is a tangent vector to the manifold that describes the direction of steepest descent for the loss. In update (2), the parameter moves along the geodesic emanating from the current parameter position $w_t$, in the direction defined by $-H(z_t, w_t)$ and with intensity $\gamma_t \|H(z_t, w_t)\|$. If the manifold at hand is $\mathbb{R}^n$ equipped with the usual Euclidean scalar product, the geodesics are straight lines, and the definitions coincide. Note that the procedure here is totally intrinsic, i.e. the algorithm is completely independent of the choice of local coordinates on the manifold.

In many cases, the exponential map is not easy to compute (a calculus of variations problem must be solved, or the Christoffel symbols need to be known), and it is much easier and much faster to use a first-order approximation of the exponential, called a retraction. Indeed, a retraction $R_w$ maps the tangent space at $w$ to the manifold, and it is such that $R_w(0) = w$ and the curve $t \mapsto R_w(t\xi)$ agrees with the geodesic $t \mapsto \exp_w(t\xi)$ to first order. It yields the alternative update

$w_{t+1} = R_{w_t}\!\big(-\gamma_t H(z_t, w_t)\big).$   (3)

Let us give a simple example to illustrate the ideas: if the manifold were the sphere endowed with the natural metric inherited through immersion in the ambient Euclidean space, a retraction would consist of a simple addition in the ambient space followed by a projection onto the sphere. This is a numerically very simple operation that avoids computing geodesics explicitly. See the Appendix for more details on Riemannian manifolds.
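
As a concrete toy instance of updates (2) and (3), the sketch below runs Riemannian stochastic gradient descent on the unit sphere, using either the exact exponential map (a step along the great circle) or the projection retraction just described. The cost function and the step-size schedule are illustrative assumptions.

```python
import numpy as np

def sphere_exp(w, v):
    # Exponential map on the unit sphere: follow the great circle from w
    # with initial tangent velocity v.
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return w
    return np.cos(nv) * w + np.sin(nv) * (v / nv)

def sphere_retract(w, v):
    # Projection retraction: step in the ambient space, then renormalize.
    x = w + v
    return x / np.linalg.norm(x)

def riemannian_sgd_sphere(samples, w0, use_exp=True, a=0.5, b=10.0):
    # Illustrative cost C(w) = -0.5 E[(z . w)^2] on the sphere, minimized at the
    # dominant eigenvector directions of E[z z^T].
    w = np.asarray(w0, dtype=float)
    w = w / np.linalg.norm(w)
    for t, z in enumerate(samples):
        euclid_grad = -(z @ w) * z
        riem_grad = euclid_grad - (euclid_grad @ w) * w   # project onto the tangent space
        step = -(a / (b + t)) * riem_grad
        w = sphere_exp(w, step) if use_exp else sphere_retract(w, step)
    return w
```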

3 Convergence results

In this section, the convergence of the proposed algorithms (2) and (3) is analyzed. The parameter $w_t$ is proved to converge almost surely to a critical point of the cost function in various cases and under various conditions. More specifically, three general results are derived.

In Subsection 3.1, a first general result is derived: when the parameter is proved to remain in a compact set, the algorithm (2) converges a.s. under standard conditions on the step size sequence. This theorem applies in particular to all connected compact manifolds. Important examples of such manifolds in applications are the orthogonal group, the group of rotations, the sphere, the real projective space, and the Grassmann and Stiefel manifolds. In Subsection 3.2, the result is proved to hold when a twice continuously differentiable retraction is used instead of the exponential map.

Finally, in Subsection 3.3, we consider a slightly modified version of algorithm (2) on specific nonpositively curved Riemannian manifolds. The step size is adapted at each step in order to take into account the effects of negative curvature that tend to destabilize the algorithm. Under a set of mild assumptions naturally extending those of the Euclidean case, the parameter is proved to a.s. remain in a compact set, and thus a.s. convergence is proved. Important examples of such manifolds are the Poincaré disk or the Poincaré half-plane, and the space of real symmetric positive definite matrices. The sequence of step sizes $(\gamma_t)_{t \ge 0}$ will satisfy the usual condition in stochastic approximation:

$\sum_{t} \gamma_t^2 < \infty \quad \text{and} \quad \sum_{t} \gamma_t = +\infty.$   (4)
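
As a concrete example, any schedule of the form

$\gamma_t = a / (b + t)^{\alpha}, \qquad a, b > 0, \quad \tfrac{1}{2} < \alpha \le 1,$

satisfies (4): $\sum_t \gamma_t^2 < \infty$ because $2\alpha > 1$, while $\sum_t \gamma_t = \infty$ because $\alpha \le 1$.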

3.1 Convergence on compact sets

The following theorem proves the a.s. convergence of the algorithm under some assumptions when the trajectories have been proved to remain in a predefined compact set at all times. This is of course the case if the manifold $M$ is compact.

Theorem 1.

Consider the algorithm (2) on a connected Riemannian manifold $M$ with injectivity radius uniformly bounded from below by $I > 0$. Assume the sequence of step sizes $(\gamma_t)$ satisfies the standard condition (4). Suppose there exists a compact set $K$ such that $w_t \in K$ for all $t \ge 0$. We also suppose that the gradient is bounded on $K$, i.e. there exists $A > 0$ such that for all $w \in K$ and all $z$ we have $\|H(z, w)\| \le A$. Then $C(w_t)$ converges a.s. and $\nabla C(w_t) \to 0$ a.s.

Proof.

The proof builds upon the usual proof in the Euclidean case (see e.g. [9]). As the parameter is proved to remain in a compact set, all continuous functions of the parameter can be bounded. Moreover, as $\sum_t \gamma_t^2 < \infty$, there exists $t_0$ such that for $t \ge t_0$ we have $\gamma_t A < I$. Suppose now that $t \ge t_0$; then there exists a geodesic linking $w_t$ and $w_{t+1}$, as $\gamma_t \|H(z_t, w_t)\| < I$, and thus the Taylor formula implies that (see Appendix)

$C(w_{t+1}) - C(w_t) \le -\gamma_t \big\langle H(z_t, w_t), \nabla C(w_t) \big\rangle + \gamma_t^2 \|H(z_t, w_t)\|^2\, k_1,$   (5)

where $k_1$ is an upper bound on the Riemannian Hessian of $C$ in the compact set $K$. Let $(\mathcal{F}_t)$ be the increasing sequence of $\sigma$-algebras generated by the variables available just before time $t$: $\mathcal{F}_t = \sigma(z_1, \ldots, z_{t-1})$.

$w_t$ being computed from $z_1, \ldots, z_{t-1}$, it is measurable with respect to $\mathcal{F}_t$. As $z_t$ is independent from $\mathcal{F}_t$ we have $\mathbb{E}\big[\langle H(z_t, w_t), \nabla C(w_t)\rangle \mid \mathcal{F}_t\big] = \|\nabla C(w_t)\|^2$. Thus

$\mathbb{E}\big[C(w_{t+1}) - C(w_t) \mid \mathcal{F}_t\big] \le -\gamma_t \|\nabla C(w_t)\|^2 + \gamma_t^2 A^2 k_1,$   (6)

as $\|H(z_t, w_t)\| \le A$. As $\sum_t \gamma_t^2 A^2 k_1 < \infty$, this proves that $C(w_t) + A^2 k_1 \sum_{s \ge t} \gamma_s^2$ is a nonnegative supermartingale, hence it converges a.s., implying that $C(w_t)$ converges a.s. Moreover, summing the inequalities (6) we have

$\sum_{t \ge t_0} \gamma_t \|\nabla C(w_t)\|^2 \;\le\; -\sum_{t \ge t_0} \mathbb{E}\big[C(w_{t+1}) - C(w_t) \mid \mathcal{F}_t\big] \;+\; A^2 k_1 \sum_{t \ge t_0} \gamma_t^2.$   (7)

Here we would like to prove the right term is bounded so that the left term converges. But the fact that $C(w_t)$ converges does not imply it has bounded variations. However, as in the Euclidean case, we can use a theorem by D.L. Fisk [15] ensuring that $C(w_t)$ is a quasi-martingale, i.e., it can be decomposed into a sum of a martingale and a process whose trajectories are of bounded variation. For a random variable $X$, let $X^+$ denote the quantity $\max(X, 0)$.

Proposition 1.

[Fisk (1965)] Let $(X_t)$ be a non-negative stochastic process with bounded positive variations, i.e., such that $\sum_{t=1}^{\infty} \mathbb{E}\big[\big(\mathbb{E}[X_{t+1} - X_t \mid \mathcal{F}_t]\big)^+\big] < \infty$. Then the process is a quasi-martingale, i.e.

$\sum_{t=1}^{\infty} \big|\,\mathbb{E}[X_{t+1} - X_t \mid \mathcal{F}_t]\,\big| < \infty \ \text{ a.s.,} \qquad \text{and } X_t \text{ converges a.s.}$

Summing (6) over $t$, it is clear that $C(w_t)$ satisfies the proposition's assumptions, and thus $C(w_t)$ is a quasi-martingale, implying that $\sum_t \gamma_t \|\nabla C(w_t)\|^2$ converges a.s. because of inequality (7), where the central term can be bounded by its absolute value, which is convergent thanks to the proposition. But, as $\sum_t \gamma_t = \infty$, this does not prove that $\|\nabla C(w_t)\|^2$ converges a.s. However, if $\|\nabla C(w_t)\|^2$ is proved to converge a.s., it can only converge to $0$ a.s. because of condition (4).

Now consider the nonnegative process $\|\nabla C(w_t)\|^2$. Bounding the second derivative of $w \mapsto \|\nabla C(w)\|^2$ by $k_2$ along the geodesic linking $w_t$ and $w_{t+1}$, a Taylor expansion yields $\|\nabla C(w_{t+1})\|^2 - \|\nabla C(w_t)\|^2 \le -2\gamma_t \langle \nabla^2 C(w_t)\, \nabla C(w_t), H(z_t, w_t)\rangle + \gamma_t^2 A^2 k_2$, and thus, bounding from below the Hessian of $C$ on the compact set $K$ by $-k_1$, we have $\mathbb{E}\big[\|\nabla C(w_{t+1})\|^2 - \|\nabla C(w_t)\|^2 \mid \mathcal{F}_t\big] \le 2\gamma_t k_1 \|\nabla C(w_t)\|^2 + \gamma_t^2 A^2 k_2$. We just proved the sum of the right term is finite. It implies $\|\nabla C(w_t)\|^2$ is a quasi-martingale, thus it implies a.s. convergence of $\|\nabla C(w_t)\|^2$ towards a value which can only be 0. ∎

3.2 Convergence with a retraction

In this section, we prove Theorem 1 still holds when a retraction is used instead of the exponential map.

Theorem 2.

Let $M$ be a connected Riemannian manifold with injectivity radius uniformly bounded from below by $I > 0$. Let $R$ be a twice continuously differentiable retraction, and consider the update (3). Assume the sequence of step sizes $(\gamma_t)$ satisfies the standard condition (4). Suppose there exists a compact set $K$ such that $w_t \in K$ for all $t \ge 0$. We suppose also that the gradient is bounded on $K$, i.e. for $w \in K$ and all $z$ we have $\|H(z, w)\| \le A$ for some $A > 0$. Then $C(w_t)$ converges a.s. and $\nabla C(w_t) \to 0$ a.s.

Proof.

Let $\tilde{w}_{t+1} = \exp_{w_t}(-\gamma_t H(z_t, w_t))$. The proof essentially relies on the fact that the points $w_{t+1} = R_{w_t}(-\gamma_t H(z_t, w_t))$ and $\tilde{w}_{t+1}$ are close to each other on the manifold for sufficiently large $t$. Indeed, as the retraction is twice continuously differentiable, there exists $c > 0$ such that, for $\gamma_t$ sufficiently small, $d(w_{t+1}, \tilde{w}_{t+1}) \le c\, \gamma_t^2 \|H(z_t, w_t)\|^2$, and $\|H(z_t, w_t)\| \le A$. As $\gamma_t A$ can be made arbitrarily small for sufficiently large $t$ (in particular smaller than the injectivity radius), this implies $d(w_t, w_{t+1}) \le \gamma_t A + c\, \gamma_t^2 A^2$.

We can now reiterate the proof of Theorem 1. We have $C(w_{t+1}) - C(w_t) = \big[C(\tilde{w}_{t+1}) - C(w_t)\big] + \big[C(w_{t+1}) - C(\tilde{w}_{t+1})\big]$. The term $C(\tilde{w}_{t+1}) - C(w_t)$ can be bounded as in (5), whereas we have just proved $C(w_{t+1}) - C(\tilde{w}_{t+1})$ is bounded by $c\, \gamma_t^2 A^2 k$, where $k$ is a bound on the Riemannian gradient of $C$ in $K$. Thus $C(w_t)$ is a quasi-martingale and $\sum_t \gamma_t \|\nabla C(w_t)\|^2 < \infty$ a.s. It means that if $\|\nabla C(w_t)\|^2$ converges, it can only converge to zero.

Let us consider the variations of the function $\|\nabla C\|^2$. Writing $\|\nabla C(w_{t+1})\|^2 - \|\nabla C(w_t)\|^2 = \big[\|\nabla C(w_{t+1})\|^2 - \|\nabla C(\tilde{w}_{t+1})\|^2\big] + \big[\|\nabla C(\tilde{w}_{t+1})\|^2 - \|\nabla C(w_t)\|^2\big]$ and bounding the first term of the right-hand side by $c\, \gamma_t^2 A^2 k'$, where $k'$ is a bound on the gradient of $\|\nabla C\|^2$ on $K$, we see the inequalities of Theorem 1 are unchanged up to second order terms in $\gamma_t$. Thus $\|\nabla C(w_t)\|^2$ is a quasi-martingale and thus converges. ∎

3.3 Convergence on Hadamard manifolds

In the previous section, we proved convergence as long as the parameter is known to remain in a compact set. For some manifolds, the algorithm can be proved to converge without this assumption. This is the case for instance in the Euclidean space, where the trajectories can be proved to be confined to a compact set under a set of conditions [9]. In this section, we extend those conditions to the important class of Hadamard manifolds, and we prove convergence. Hadamard manifolds are complete, simply-connected Riemannian manifolds with nonpositive sectional curvature. In order to account for curvature effects, the step size must be slightly adapted at each iteration. This step adaptation yields a more flexible algorithm, and allows one to relax one of the standard conditions even in the Euclidean case.

Hadamard manifolds have strong properties. In particular, the exponential map at any point is globally invertible (e.g. [27]). Let $d$ denote the geodesic distance and $d^2(\cdot, \cdot)$ the squared geodesic distance. Consider the following assumptions, which can be viewed as an extension of the usual ones in the Euclidean case:

  1. There is a point $v \in M$ and a radius $R > 0$ such that the opposite of the gradient points towards $v$ when $d(w, v)$ becomes larger than $R$, i.e. $\inf_{d(w, v) > R} \big\langle -\nabla C(w),\ \exp_w^{-1}(v) \big\rangle > 0.$

  2. There exists a lower bound on the sectional curvature, denoted by $\kappa \le 0$.

  3. There exists a continuous function $f : \mathbb{R}_+ \to \mathbb{R}_+$ that satisfies $\mathbb{E}_z\big[\|H(z, w)\|^2\big] \le f\big(d^2(w, v)\big)$ for all $w \in M$.

Theorem 3.

Let $M$ be a Hadamard manifold. Consider the optimization problem (1). Under Assumptions 1-3, the modified algorithm

(8)

is such that $C(w_t)$ converges a.s. and $\nabla C(w_t) \to 0$ a.s.

Assumptions 1-3 are mild assumptions that encompass the Euclidean case. In this latter case, Assumption 3 is usually replaced with the stronger condition $\mathbb{E}_z\big[\|H(z, w)\|^2\big] \le A + B\, d^2(w, v)$ for constants $A, B > 0$ (note that this condition immediately implies the existence of the function $f$). Indeed, on the one hand our general procedure based on the adaptive step allows one to relax this standard condition, also in the Euclidean case, as will be illustrated by the example of Section 4.3. On the other hand, contrary to the Euclidean case, one could object that the user must provide at each step an upper bound on a function of $d(w_t, v)$, where $v$ is the point appearing in Assumption 1, which requires some knowledge of $v$. This can appear to be a limitation, but in fact finding a point $v$ fulfilling Assumption 1 may be quite obvious in practice, and may be far from requiring direct knowledge of the point the algorithm is supposed to converge to, as illustrated by the example of Section 4.2.

Proof.

The following proof builds upon the Euclidean case [9]. We are first going to prove that the trajectories asymptotically remain in a compact set. Theorem 1 will then easily apply. A second order Taylor expansion yields

(9)

where $k_t$ is an upper bound on the operator norm of half of the Riemannian Hessian of the squared distance $w \mapsto d^2(w, v)$ along the geodesic joining $w_t$ to $w_{t+1}$ (see the Appendix). If the sectional curvature is bounded from below by $\kappa \le 0$, we have ([12] Lemma 3.12)

$\lambda_{\max}\Big(\nabla^2 \tfrac{1}{2} d^2(\cdot, v)\big|_w\Big) \le \sqrt{|\kappa|}\; d(w, v)\, \coth\!\big(\sqrt{|\kappa|}\, d(w, v)\big),$

where $\nabla^2 \tfrac{1}{2} d^2(\cdot, v)$ is the Hessian of the half squared distance and $\lambda_{\max}$ denotes the largest eigenvalue of an operator. This implies that $k_t$ is bounded by a continuous function of the distance $d(\cdot, v)$ along the geodesic joining $w_t$ to $w_{t+1}$.

Moreover, along the geodesic linking $w_t$ and $w_{t+1}$, the triangle inequality allows the distance $d(\cdot, v)$ to be controlled in terms of $d(w_t, v)$ and the length $\gamma_t \|H(z_t, w_t)\|$ of the step, so that $k_t$ can be bounded accordingly. Let $\mathcal{F}_t$ be the increasing sequence of $\sigma$-algebras generated by the variables available just before time $t$: $\mathcal{F}_t = \sigma(z_1, \ldots, z_{t-1})$. As $z_t$ is independent from $\mathcal{F}_t$, and $w_t$ is $\mathcal{F}_t$-measurable, we have $\mathbb{E}\big[\langle H(z_t, w_t), \operatorname{grad}_w d^2(w_t, v)\rangle \mid \mathcal{F}_t\big] = \langle \nabla C(w_t), \operatorname{grad}_w d^2(w_t, v)\rangle$. Conditioning (9) on $\mathcal{F}_t$, and using Assumption 3:

(10)

Let be a smooth function such that

  • for

  • for

  • for

and let . Let us prove it converges a.s. to . As for all a second order Taylor expansion on yields

Because of the triangle inequality we have . Thus which is less than for . Using Assumption 3 and the fact that is measurable we have . Using (10) we have

(11)

as is positive, and less than 1. Either and then we have and thus . Or , and Assumption 1 ensures is negative. As , (11) implies . In both cases , proving is a positive supermartingale, hence it converges a.s. Let us prove it necessarily converges to . We have . Proposition 1 proves that is a quasi-martingale. Using (11) we have inequality

and as is a quasi-martingale we have a.s.

(12)

Consider a sample trajectory for which converges to . It means that for large enough and thus . Because of Assumption 1 we have also . This contradicts (12) as . The last equality comes from (4) and the fact that is continuous and thus bounded along the trajectory.

It has been proved that almost every trajectory asymptotically enters the ball of center and radius and stays inside of it. Let us prove that we can work on a fixed compact set. Let . We have just proved . Thus to prove a.s. convergence, it is thus sufficient to prove a.s. convergence on each of those sets. We assume from now on the trajectories all belong to the ball of center and radius . As this is a compact set, all continuous functions of the parameter can be bounded. In particular for some and thus the modified step size verifies the conditions of Theorem 1. Moreover, for some on the compact as it is dominated by . As there is no cut locus, this weaker condition is sufficient, since it implies that (6) holds. The proof follows from a mere application of Theorem 1 on this compact set.

Note that it would be possible to derive an analogous result when a retraction is used instead of the exponential map, using the ideas of the proof of Theorem 2. However, due to a lack of relevant examples, this result is not presented.

4 Examples

Four application examples are presented. The first two examples are rather tutorial. The first one illustrates Theorems 1 and 2. The second one provides a graphical interpretation of Theorem 3 and its assumptions. The third and fourth examples are more detailed and include numerical experiments. Throughout this section, $(\gamma_t)_{t \ge 0}$ is a sequence of positive step sizes satisfying the usual condition (4).

4.1 Subspace tracking

We propose to first revisit, in the light of the preceding results, the well-known subspace tracking algorithm of Oja [26], which is a generalization of the power method for computing the dominant eigenvector. In several applications, one wants to compute the $r$ principal eigenvectors, i.e. perform principal component analysis (PCA), of a covariance matrix $\Sigma = \mathbb{E}[z z^T]$, where $z \in \mathbb{R}^n$. Furthermore, for computational reasons or for adaptiveness, the measurements are supposed to be a stream of $n$-dimensional data vectors $z_1, z_2, \ldots$ with $\mathbb{E}[z_t z_t^T] = \Sigma$ (online estimation). The problem boils down to estimating an element of the Grassmann manifold $\mathrm{Gr}(r, n)$ of $r$-dimensional subspaces in an $n$-dimensional ambient space, which can be identified with the set of rank-$r$ orthogonal projectors

$\{P \in \mathbb{R}^{n \times n} : P^T = P,\ P^2 = P,\ \operatorname{rank}(P) = r\}.$

Those projectors can be represented by matrices $W W^T$, where $W$ belongs to the Stiefel manifold $\mathrm{St}(r, n)$, i.e., the set of $n \times r$ matrices whose columns are orthonormal. Define the cost function

$C(W) = -\tfrac{1}{2}\, \mathbb{E}_z\big[\|W^T z\|^2\big] = -\tfrac{1}{2} \operatorname{Tr}\!\big(W^T \Sigma W\big),$

which is minimal when $W$ is a basis of the dominant subspace of the covariance matrix $\Sigma$. It is invariant to rotations $W \mapsto W O$, with $O$ an $r \times r$ orthogonal matrix. The state space is therefore the set of equivalence classes $[W] = \{W O : O \in O(r)\}$. This set is denoted by $\mathrm{St}(r, n)/O(r)$. It is a quotient representation of the Grassmann manifold $\mathrm{Gr}(r, n)$. This quotient geometry has been well studied in e.g. [13]. The Riemannian gradient of the loss under the event $z$ is $H(z, W) = -(I - W W^T)\, z z^T W$. We have the following result.

Proposition 2.

Suppose the inputs $z_t$ are uniformly bounded. Consider the stochastic Riemannian gradient algorithm

$W_{t+1} = W_t\, V \cos(\Theta)\, V^T + U \sin(\Theta)\, V^T,$   (13)

where $U \Theta V^T$ is the compact SVD of the matrix $\gamma_t\, (I - W_t W_t^T)\, z_t z_t^T W_t$. Then $W_t$ converges a.s. to an invariant subspace of the covariance matrix $\Sigma$.

Proof.

The proof is a straightforward application of Theorem 1. Indeed, the update (13) corresponds to (2), as it states that $W_{t+1}$ is on the geodesic emanating from $W_t$ with tangent vector $-\gamma_t H(z_t, W_t)$, at a distance $\gamma_t \|H(z_t, W_t)\|$ from $W_t$. As the input sequence is bounded, so is the sequence of gradients. The injectivity radius of the Grassmann manifold is $\pi/2$, and is thus bounded away from zero, and the Grassmann manifold is compact. Thus Theorem 1 proves that $W_t$ a.s. converges to a point $W_\infty$ such that $\nabla C(W_\infty) = 0$, i.e. $(I - W_\infty W_\infty^T)\, \Sigma\, W_\infty = 0$. For such points there exists an $r \times r$ matrix $M$ such that $\Sigma W_\infty = W_\infty M$, proving that the span of $W_\infty$ is an invariant subspace of $\Sigma$. A local analysis proves the dominant subspace of $\Sigma$ (i.e. the subspace associated with the $r$ largest eigenvalues) is the only stable subspace of the averaged algorithm [26] under basic assumptions. ∎
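
To make the geodesic step explicit, the sketch below assumes the standard closed-form expression for geodesics on the Grassmann manifold (a step in the tangent direction $\Delta$ with compact SVD $U \Theta V^T$); the step-size schedule is an illustrative choice.

```python
import numpy as np

def grassmann_geodesic_step(W, Delta):
    # Move along the Grassmann geodesic from span(W) in the tangent direction Delta
    # (W: n x r with orthonormal columns, W.T @ Delta = 0), using the closed form
    # W V cos(Theta) V^T + U sin(Theta) V^T, with U Theta V^T the compact SVD of Delta.
    U, theta, Vt = np.linalg.svd(Delta, full_matrices=False)
    return W @ Vt.T @ np.diag(np.cos(theta)) @ Vt + U @ np.diag(np.sin(theta)) @ Vt

def subspace_tracking_geodesic(samples, W0, a=0.5, b=10.0):
    # Stochastic Riemannian gradient step: the descent direction is
    # gamma_t * (I - W W^T) z z^T W, followed along the geodesic.
    W = np.linalg.qr(np.asarray(W0, dtype=float))[0]
    for t, z in enumerate(samples):
        gamma = a / (b + t)
        Wz = W.T @ z
        Delta = gamma * np.outer(z - W @ Wz, Wz)          # gamma (I - W W^T) z z^T W
        W = grassmann_geodesic_step(W, Delta)
    return W
```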

We also have the following result

Proposition 3.

Consider a twice differentiable retraction $R$. The algorithm

$W_{t+1} = R_{W_t}\!\big(\gamma_t\, (I - W_t W_t^T)\, z_t z_t^T W_t\big)$   (14)

converges a.s. to an invariant subspace of the covariance matrix $\Sigma$.

The result is a mere application of Theorem 2. Consider in particular the following retraction: $R_W(\Delta) = \mathrm{qf}(W + \Delta)$, where $\mathrm{qf}(\cdot)$ extracts the orthogonal factor in the QR decomposition of its argument. For small $\gamma_t$, this retraction amounts to following the gradient in the Euclidean ambient space $\mathbb{R}^{n \times r}$, and then orthonormalizing the columns of the matrix at each step. It is an infinitely differentiable retraction [1]. The algorithm (14) with this particular retraction is known as Oja's vector field for subspace tracking and has already been proved to converge in [25]. Using the general framework proposed in the present paper, we see this convergence result directly stems from Theorem 2.

This example clearly illustrates the benefits of using a retraction. Indeed, from a numerical viewpoint, the geodesic update (13) requires performing an SVD at each time step, whereas update (14) only requires orthonormalizing the columns of an $n \times r$ matrix, which has a lower computational cost; this can be very advantageous, especially when $n$ is large.
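
The retraction-based variant is even simpler to implement: a Euclidean gradient step in the ambient space followed by re-orthonormalization of the columns via $\mathrm{qf}(\cdot)$. The sketch below is again illustrative (the step-size schedule and function names are not from the original text).

```python
import numpy as np

def qf(A):
    # Orthonormal factor of the thin QR decomposition of A.
    return np.linalg.qr(A)[0]

def subspace_tracking_qr(samples, W0, a=0.5, b=10.0):
    # Oja-type update with the QR retraction: W <- qf(W + gamma (I - W W^T) z z^T W).
    W = qf(np.asarray(W0, dtype=float))
    for t, z in enumerate(samples):
        gamma = a / (b + t)
        Wz = W.T @ z
        W = qf(W + gamma * np.outer(z - W @ Wz, Wz))
    return W
```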

4.2 Randomized computation of a Karcher mean on a hyperbolic space

We propose to illustrate Theorem 3, and the assumptions on which it relies, on a well-known and tutorial manifold. Consider the unit disk $D = \{x \in \mathbb{R}^2 : \|x\| < 1\}$ with the Riemannian metric defined on the tangent plane at $x$ by

$\langle u, v \rangle_x = \frac{4}{(1 - \|x\|^2)^2}\, \langle u, v \rangle,$

where $\langle u, v \rangle$ represents the conventional scalar product in $\mathbb{R}^2$. The metric tensor is thus proportional to the identity (in particular diagonal), so the angles between two intersecting curves in the Riemannian metric are the same as in the Euclidean space. However, the distances differ: as a point moves closer to the boundary of the disk, the distances are dilated so that the boundary cannot be reached in finite time. As illustrated in Figure 1, the geodesics are either arcs of circles that are orthogonal to the boundary circle, or diameters. The Poincaré disk equipped with this metric is a Hadamard manifold.

Figure 1: The Poincaré disk. The boundary is at infinite distance from the center. The geodesics (solid lines) are either arcs of circles perpendicular to the boundary of the disk, or diameters. The dashed circle is the boundary of a geodesic ball centered at 0. Assumption 1 is obviously verified: if a point outside the ball makes a small move towards any point inside the ball along the geodesic linking them, its distance to 0 decreases.

The Karcher (or Fréchet) mean on a Riemannian manifold is defined as the minimizer of the average squared geodesic distance to the data points. It can be viewed as a natural extension of the usual Euclidean barycenter to the Riemannian case. It is intrinsically defined, and it is unique on Hadamard manifolds. There has been growing interest in computing Karcher means recently, in particular for filtering on manifolds, see e.g. [2, 5, 4]. On the Poincaré disk we propose to compute the mean of $N$ points $x_1, \ldots, x_N$ in a randomized way. The method is as follows, and is closely related to the approach of [4]. Let $w$ be the optimization parameter. The goal is to find the minimum of the cost function

$C(w) = \frac{1}{2N} \sum_{i=1}^{N} d^2(w, x_i).$

At each time step, a point $x_i$ is randomly picked with a uniform probability law. The loss function is $Q(i, w) = \tfrac{1}{2} d^2(w, x_i)$, and $H(i, w)$ is the Riemannian gradient of the half squared geodesic distance $\tfrac{1}{2} d^2(\cdot, x_i)$. On the Poincaré disk, the distance function is defined by $d(x, y) = \operatorname{arccosh}\big(1 + \delta(x, y)\big)$, where $\delta(x, y) = \frac{2\|x - y\|^2}{(1 - \|x\|^2)(1 - \|y\|^2)}$. As the metric tensor is proportional to the identity, the Riemannian gradient and the Euclidean gradient of the distance have the same direction. Moreover, the Riemannian norm of the gradient of $\tfrac{1}{2} d^2(\cdot, x_i)$ at $w$ is simply $d(w, x_i)$ (see the Appendix). It is thus easy to prove that

$H(i, w) = \operatorname{grad}_w \tfrac{1}{2} d^2(w, x_i) = -\exp_w^{-1}(x_i),$   (15)

i.e., minus the initial velocity of the unit-time geodesic from $w$ to $x_i$.

When there is a lot of redundancy in the data, i.e. when some points are very close to each other, a randomized algorithm may be much more efficient numerically than a batch algorithm. This becomes obvious in the extreme case where the $x_i$'s are all equal. In this case, the approximated gradient coincides with the (Riemannian) gradient of the cost $C$. However, computing this latter quantity requires $N$ times more operations than computing the approximated gradient. When $N$ is large and when there is a lot of redundancy in the data, we thus see a randomized algorithm can lead to a drastic reduction in the computational burden. Besides, note that the stochastic algorithm can also be used in order to filter a stream of noisy measurements of a single point on the manifold (and thus track this point in case it slowly moves). Indeed, it is easily seen that if the manifold is $\mathbb{R}^n$ and $d$ is the Euclidean distance, the proposed update boils down to a first order discrete low pass filter, as it computes a weighted mean between the current update $w_t$ and the new measurement $x_i$.

Proposition 4.

Suppose at each time a point is randomly drawn. Let be such that and let . Consider the algorithm (8) where is given by (15) and

Then $w_t$ converges a.s. to the Karcher mean of the points $x_1, \ldots, x_N$.

Proof.

The conditions of Theorem 3 are easily checked. Assumption 1: it is easy to see on the figure that Assumption 1 is verified with $v = 0$, and $R$ the radius of an open geodesic ball centered at $0$ and containing all the points $x_i$. More technically, suppose $d(w, 0) > R$. The quantity $\langle \exp_w^{-1}(x_i), \operatorname{grad}_w d^2(w, 0)\rangle$ is equal to $-2\, d(w, 0)\, d(w, x_i) \cos\alpha_i$, where $\cos\alpha_i$ (the cosine of the angle at $w$ between the geodesics towards $x_i$ and towards $0$) is a positive quantity bounded away from zero for $d(w, 0) > R$. As there exists $c > 0$ such that $d(w, x_i) \ge c$, and $d(w, 0) > R$, the term is negative and bounded away from zero, and so is its average over the $x_i$'s. Assumption 2: in dimension 2, the sectional curvature is known to be identically equal to $-1$. Assumption 3 is obviously satisfied, as $\|H(i, w)\| = d(w, x_i) \le d(w, 0) + d(0, x_i)$ by the triangle inequality.

Note that one could object that, in general, finding a function $f$ satisfying Assumption 3 of Theorem 3 requires controlling $d(w, v)$ and thus requires some knowledge of the point $v$ of Assumption 1. However, we claim (without proof) that in applications on Hadamard manifolds, there may be obvious choices for $v$. This is the case in the present example, where finding $v$ such that Assumptions 1-3 are satisfied requires very little (or no) knowledge of the point the algorithm is supposed to converge to. Indeed, $v = 0$ is a straightforward choice that always fulfills the assumptions. This choice is convenient for calculations, as the geodesics emanating from $0$ are radii of the disk, but many other choices would have been possible.
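
The following sketch illustrates the randomized mean computation on the Poincaré disk. It assumes the standard Möbius-addition formulas for the exponential and logarithm maps of the metric above (conformal factor $2/(1 - \|x\|^2)$ on the length element), and it uses plain steps $\gamma_t$ rather than the curvature-adapted step of algorithm (8); it is therefore an illustration of the idea, not an implementation of Proposition 4.

```python
import numpy as np

def mobius_add(x, y):
    # Moebius addition on the Poincare disk (curvature -1).
    xy, x2, y2 = x @ y, x @ x, y @ y
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def exp_map(x, v):
    # Riemannian exponential map at x applied to the tangent vector v.
    nv = np.linalg.norm(v)
    if nv < 1e-15:
        return x
    lam = 2.0 / (1.0 - x @ x)
    return mobius_add(x, np.tanh(lam * nv / 2.0) * v / nv)

def log_map(x, y):
    # Riemannian logarithm: tangent vector at x pointing towards y, whose
    # Riemannian norm equals the geodesic distance d(x, y).
    u = mobius_add(-x, y)
    nu = np.linalg.norm(u)
    if nu < 1e-15:
        return np.zeros_like(x)
    lam = 2.0 / (1.0 - x @ x)
    return (2.0 / lam) * np.arctanh(nu) * u / nu

def randomized_karcher_mean(points, n_iter=20000, a=1.0, b=10.0):
    # At each step pick x_i uniformly at random and move towards it along the
    # geodesic: w <- exp_w(gamma_t * log_w(x_i)), since -log_w(x_i) is the
    # Riemannian gradient of 0.5 * d(w, x_i)^2.
    rng = np.random.default_rng(0)
    w = points[rng.integers(len(points))].copy()
    for t in range(n_iter):
        x = points[rng.integers(len(points))]
        gamma = min(a / (b + t), 1.0)
        w = exp_map(w, gamma * log_map(w, x))
    return w

# Toy usage with three points inside the unit disk.
pts = [np.array([0.1, 0.2]), np.array([-0.3, 0.4]), np.array([0.5, -0.1])]
w_mean = randomized_karcher_mean(pts)
```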

4.3 Identification of a fixed rank symmetric positive semi-definite matrix

To illustrate the benefits of the approach on a recent non-linear problem, we focus in this section on an algorithm of [23], and we prove new rigorous convergence results. Least Mean Squares (LMS) filters have been extensively utilized in adaptive filtering for online regression. Let $x_t \in \mathbb{R}^n$ be the input vector, and $y_t \in \mathbb{R}$ be the output defined by $y_t = \bar{w}^T x_t + \nu_t$, where the unknown vector $\bar{w}$ is to be identified (filter weights), and $\nu_t$ is a noise. At each step we let $z_t = (x_t, y_t)$, and the approximated cost function is $Q(z_t, w) = \tfrac{1}{2}(y_t - w^T x_t)^2$. Applying the steepest descent leads to the stochastic gradient algorithm known as LMS: $w_{t+1} = w_t + \gamma_t\, x_t\, (y_t - w_t^T x_t)$.
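
For reference, here is a minimal LMS sketch; the constant step size and the toy data stream are illustrative choices (a decreasing schedule satisfying (4) recovers the Robbins-Monro setting).

```python
import numpy as np

def lms(stream, n, gamma=0.01):
    # Least Mean Squares: stochastic gradient descent on Q(z, w) = 0.5*(y - w.x)^2.
    w = np.zeros(n)
    for x, y in stream:
        w = w + gamma * (y - w @ x) * x
    return w

# Toy usage: identify fixed filter weights from noisy input/output pairs.
rng = np.random.default_rng(0)
w_bar = np.array([0.5, -1.0, 2.0])
stream = []
for _ in range(10000):
    x = rng.normal(size=3)
    stream.append((x, w_bar @ x + 0.1 * rng.normal()))
w_hat = lms(stream, n=3)
```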

We now consider a non-linear generalization of this problem coming from the machine learning field (see e.g. [37]), where