We consider constrained optimization in the case where the constraint set is an implicitly-defined manifold or real algebraic variety. We also consider statistical models in Section 3. The algorithms we develop are local, motivated by examples which are too large for global methods like Lagrange multipliers. In Riemannian optimization, the basic idea of local methods is simple. To find the argmin, start somewhere on the manifold and move along a descent direction in the tangent space. This (usually) takes you off the manifold, so you must retract (see Definition 1) back to the manifold, and repeat. Eventually you hope to arrive at a local minimizer. Robust techniques and theoretical results accompany a well-chosen Riemannian metric and retraction map [3, 8, 22]. However, such methods require prior knowledge of a suitable retraction map, ideally with an explicit formula. These are available for many common matrix manifolds, making Riemannian optimization a powerful tool in applications. For less common, implicitly-defined manifolds or algebraic varieties, on the other hand, there are no readily available formulas for retractions. Therefore, retraction maps applicable to an arbitrary manifold are desirable.
To construct retraction maps on embedded manifolds or algebraic varieties, intuition suggests retracting to the closest point on the manifold. Such maps are called metric projections in [8, p.111]. When we step linearly off a sphere, we can normalize the resulting point. When we leave the manifold of fixed-rank matrices, we can apply the singular value decomposition. However, for most implicitly-defined manifolds, retracting to the closest point admits no explicit formula or easy solution, and “it is difficult to compute in general” [8, p.164]. On the other hand, whenever we can compute metric projections, it has been shown that they are second-order retractions [8, Def. 5.41], a desirable property since they then match geodesics on the manifold up to second order. After all, retractions should approximate the exponential map, and they are useful precisely when they are easier to compute in practice.
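Both classical metric projections just mentioned admit one-line implementations. The following sketch (our own illustration in Python with NumPy, not taken from any package discussed here) shows them side by side:

```python
import numpy as np

def project_to_sphere(y):
    """Metric projection onto the unit sphere: for nonzero y,
    the closest point is y normalized."""
    return y / np.linalg.norm(y)

def project_to_rank_r(Y, r):
    """Metric projection onto the manifold of rank-r matrices.
    By the Eckart-Young theorem, the closest rank-r matrix in the
    Frobenius norm is the truncated singular value decomposition."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

# Step linearly off the sphere, then retract back by normalizing.
x = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 0.3, 0.0])   # tangent to the sphere at x
y = project_to_sphere(x + v)
```

For most implicitly-defined constraint sets no such closed form exists, which is the gap this paper addresses.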
On the one hand this is an obvious choice: retract to the closest point on the manifold. Upon further reflection an absurdity is noted. We begin with a constrained optimization problem, to minimize some function over the manifold. Then we suggest solving another constrained optimization problem over the same manifold repeatedly, at every local step, with the goal of finding the closest point; in particular, we do not have an explicit formula. Have we replaced one hard problem with many hard problems? The absurdity is resolved by use of a small trick. As we explain in Section 2, we can use homotopy continuation methods [6, 9, 24] to easily find the closest point at each local step. We do this by creating a system of equations whose solutions we know, and deforming them via a homotopy. To be clear, without this tool, the metric retraction might be an absurd choice, since we cannot compute it easily. Note that we use homotopy continuation locally, rather than globally: we only use it to compute the retraction map at each local step.
In Section 2 we describe Algorithm 1, which uses homotopy continuation to compute the Euclidean distance retraction (see Theorem 1) for any implicitly-defined submanifold of Euclidean space. In Section 3, we view statistical models as Riemannian submanifolds of the probability simplex equipped with the Fisher metric. In this setting, we prove Theorem 2, which shows that ideas from maximum likelihood estimation may be used to create a second-order retraction applicable to arbitrary statistical models at smooth points. To show the second-order property, we use geodesics and covariant derivatives with the Levi-Civita connection corresponding to the Fisher metric. Theorem 2 shows that replacing the linear step from Algorithm 1 by a quadratic curve in the usual coordinates allows the resulting retraction to follow geodesics to second-order accuracy. We also describe Algorithm 2, which again uses homotopy continuation to compute this retraction. Section 4 describes a Julia package which implements Algorithm 1 and briefly demonstrates its usage, while Section 5 lists a few questions for future research.
Acknowledgements: We want to thank Paul Breiding, who suggested using the Euclidean distance equations for the homotopy, rather than another system of equations which we were first using for Algorithm 1.
2 Euclidean distance retraction by homotopy
We begin by considering submanifolds of Euclidean space with the usual metric, and which are implicitly-defined, meaning a smooth function is known whose zero set is the manifold. Since our methods are local, we also consider real algebraic varieties, which are smooth manifolds outside their singular locus. Given any smooth function, our main objective is to solve the constrained optimization problem of minimizing that function restricted to the manifold or variety.
Consider the smooth manifold structure on induced by viewing the identity map as a global chart. For all of our computations we will use the coordinates denoted for this global chart.
The chart-induced global frame for the tangent bundle is denoted accordingly. Recall, a global frame of a smooth vector bundle is an ordered tuple of global sections whose values at every point yield a basis. The values at a point will be denoted so that these vectors form a basis for the tangent space at that point. We view Euclidean space as a Riemannian manifold whose metric is given by the identity matrix in the global coordinates.
Given a point , we can evaluate the Euclidean gradient of the objective at that point. Projecting onto the tangent space, we obtain a descent direction. There are other (often better) ways to produce descent directions as well, for example [8, Alg. 6.3 (Riemannian trust-region)] or [3, Alg. 5 (Riemannian Newton method)]. After obtaining a descent direction, the idea is to move linearly from the point along that direction, and then retract back to the manifold in such a way as to approximate the exponential map. This leads us to the following definition.
Let be a manifold and a point. A (first-order) retraction at the point is a smooth map from the tangent space to the manifold satisfying two conditions: the zero tangent vector is sent to the point itself, and the differential at the zero vector is the identity map.
Equivalently, for each tangent vector, the curve obtained by retracting its scalar multiples passes through the point at time zero with initial velocity equal to that tangent vector. A local retraction at p is a retraction defined only in some neighborhood of the zero vector, but satisfying the same two properties above.
A retraction is called second-order if, in addition, the curve above has zero initial (intrinsic) acceleration for every tangent vector. This implies that the curve matches geodesics up to second order: the deviation from the geodesic with the same initial velocity is of third order in the step size [2, Prop. 3].
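This third-order error can be observed numerically. In the following sketch (ours, using the normalization retraction on the unit sphere, which appears as an example below), halving the step cuts the deviation from the great-circle geodesic by roughly a factor of eight, consistent with an error of third order:

```python
import numpy as np

x = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])   # unit tangent vector to the circle at x

def retraction(t):
    """Normalization retraction: step linearly, then project back."""
    y = x + t * v
    return y / np.linalg.norm(y)

def geodesic(t):
    """Great-circle geodesic through x with initial velocity v."""
    return np.cos(t) * x + np.sin(t) * v

# Halving t should cut the error by about 2**3 = 8 for a
# second-order retraction (error of order t**3).
errs = [np.linalg.norm(retraction(t) - geodesic(t)) for t in (0.1, 0.05, 0.025)]
ratios = [errs[i] / errs[i + 1] for i in range(2)]
```

The observed ratios cluster near 8, as a second-order retraction predicts.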
We note that a retraction map has a different definition in topology, although the two notions are related. Our Definition 1 is appropriate in the context of Riemannian optimization, and matches the definition in the textbook [8], for instance. Retractions were first defined in earlier literature on optimization on manifolds. In the more general setting of Section 3 we consider Riemannian submanifolds of a Riemannian manifold which is not Euclidean space. In that case, the condition for a second-order retraction must instead be that the covariant derivative of the vector field defined by the velocity of the curve lies in the normal space [8, Def. 8.64 and Eqn. 8.27]. We delay the additional details until Section 3.
Theorem 1 (Euclidean distance retraction).
Let be a smooth manifold or real algebraic variety. For any smooth point , define the relation by
There exists a neighborhood of in such that defines a local, second-order retraction. The curve matches the geodesic for up to second-order at .
Denote the unit sphere in by . The expression obtained by normalizing the stepped point provides a smooth map . As can be easily checked, this formula defines a retraction, called the Euclidean distance retraction, which admits an explicit formula in this simple example. Consequently, we could rewrite it as a metric projection, since the argmin is unique for every point except the origin, which the stepped point never equals.
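The uniqueness of the argmin can be spot-checked numerically. In this sketch (ours), no random sample on the sphere comes closer to a generic outside point than its normalization does:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=3)            # a generic point off the sphere
p = y / np.linalg.norm(y)         # claimed closest point on the sphere

# Compare against many random points sampled uniformly on the sphere.
samples = rng.normal(size=(2000, 3))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
closest_sampled = np.linalg.norm(samples - y, axis=1).min()
```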
Now choose the orthogonal group. Denote by the QR decomposition of a real, invertible matrix into an orthogonal matrix and an upper triangular matrix with strictly positive diagonal entries. Notice that the general linear group GL is the complement of a hypersurface, cut out by the vanishing of the determinant. Then, the map
is a (local) retraction on the orthogonal group, defined on its tangent bundle via the projection onto the first argument. A proof can be found in [3, Ex. 4.1.2].
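A sketch of this QR-based retraction (our own illustration; the sign fix enforces the positive-diagonal convention above, which NumPy's QR does not guarantee):

```python
import numpy as np

def qr_retraction(X, V):
    """Retraction on the orthogonal group: the Q-factor of X + V,
    normalized so the triangular factor has a positive diagonal
    (which makes the QR decomposition unique)."""
    Q, R = np.linalg.qr(X + V)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs   # flip columns of Q where R's diagonal was negative

# Example: X orthogonal; V = X @ A with A skew-symmetric is tangent at X.
n = 4
rng = np.random.default_rng(1)
X, _ = np.linalg.qr(rng.normal(size=(n, n)))
A = rng.normal(size=(n, n))
A = A - A.T                      # skew-symmetric
Y = qr_retraction(X, 0.1 * (X @ A))
```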
In general, the Euclidean distance retraction will not admit an explicit formula, except in special cases like those above. Therefore, it is desirable to develop techniques for computing the Euclidean distance retraction on arbitrary, implicitly-defined manifolds, in case an explicit formula is not available. This will be accomplished via Algorithm 1 below.
2.1 Homotopy continuation for the Euclidean distance retraction
In order to compute the Euclidean distance retraction of Theorem 1, we need to solve an optimization problem over the constraint set, namely, minimizing the Euclidean distance. At first glance, this is absurd. We started with an optimization problem (minimizing some other function restricted to the same set), and now we suggest taking local steps by solving more optimization problems with the same constraints. However, we have tricks to solve these local optimization problems efficiently, so our proposal is surprisingly useful. In this section we describe how homotopy continuation allows us to easily compute the Euclidean distance retraction.
Given any smooth function we consider the constrained optimization problem of minimizing restricted to , where is a smooth map and . In what follows, let denote the Jacobian matrix of in the standard basis, evaluated at the point , and let denote the transpose matrix. Given a point , and a descent direction , the idea is to move from to , which in general will not lie on the manifold . Subsequently, we want to retract back to the manifold in such a way as to approximate the exponential map. Theorem 1 shows that computing the Euclidean closest point to the point achieves this goal.
Homotopy Continuation. Choose randomly, say, from a multivariate Gaussian distribution. Set . Let be the map whose first component functions are those of and whose last component functions are those of , denoted . Fixing the parameters , we know that solves the system of equations for the variables , by construction.
Now set the target parameters, and create a straight-line path in parameter space so that it begins at the start parameters and ends at the target parameters. For the homotopy (1) defined below, certain parameter values might correspond to a system of equations with singular Jacobian. Such problematic parameter values are called the discriminant. By Lemma 7.1.2, and since we chose our start parameters randomly, this path avoids the discriminant in the parameter space with probability one. Using complex numbers rather than real numbers was crucial here, since the discriminant forms a subset of real codimension two in the complex parameter space, and hence our straight line will avoid it with probability one. Therefore, the solution paths from the homotopy (1) below will consist of nonsingular solutions for all fixed parameter values along the way. We set up the homotopy to begin at one and end at zero since the only singular solution can occur at the endpoint, and there are more floating point numbers near zero.
Define the homotopy by
Consider the system of ordinary differential equations with initial values . Then the solution path can be computed numerically from to to obtain the previously unknown solutions to the equations . By construction, , and we will have successfully computed the Euclidean distance retraction. Note that although our solution path travels in complex space to ensure everywhere nonsingular Jacobians by avoiding the discriminant, the solution we obtain at the end is again real-valued (see Remark 1 below).
A few remarks are in order. One could object that we are solving a system of ODEs to compute a retraction, so why not simply solve the system of ODEs defining the exponential map? One reason is that for arbitrary implicitly-defined manifolds we do not have parametrizations, and hence even setting up the ODEs corresponding to geodesics becomes difficult. Another reason is that the ODEs arising in homotopy continuation are very easy to solve, since at every value of the path parameter the true path point satisfies the homotopy equations. The errors that normally accumulate during the numerical solution of ODEs can be corrected by applying Newton’s method to this square system of equations at any fixed parameter value. Even a naive algorithm using Euler’s method to predict and Newton’s method to correct is surprisingly effective. Adaptive step sizes can be chosen so that we never leave the region of quadratic convergence of Newton’s method, and hence the corrector steps are accomplished with just a few linear system solves. There already exists efficient software specially designed for solving the ODEs arising in homotopy continuation; see for example [6, 9, 11, 19, 25].
Path-tracking. Although in practice one can use the specialized software listed above, we will give a brief description of a naive path-tracking algorithm for the reader’s benefit. To simplify notation, we lump the variables into one symbol, $x$, for this paragraph, and write $H(x,t)$ for the homotopy. Solving the system of ordinary differential equations to numerically approximate the solution path is much easier than for an arbitrary system of ODEs. The main reason is that we can use Newton’s method to quickly correct any accumulated errors, since for any fixed $t$ we know that the true solution point solves the square system of equations $H(x,t)=0$, and that its Jacobian will be nonsingular along the path. By a first-order Taylor series, consider that
$$H(x+\Delta x,\, t+\Delta t) \;\approx\; H(x,t) + H_x(x,t)\,\Delta x + H_t(x,t)\,\Delta t. \qquad (2)$$
In (2), $H(x,t)$ is a vector resulting from the map evaluated at the current point, $H_x(x,t)$ is the square Jacobian matrix of partial derivatives in the variables $x$, and $H_t(x,t)$ is a vector of derivatives in $t$, both evaluated at the current point. To predict the next point on the solution path, choose $\Delta t$, set the left-hand side of (2) to zero, and solve the resulting system of linear equations for $\Delta x$. This is essentially Euler’s method. If at some point we notice that $H(x,t)$ is no longer zero, errors have begun to accumulate, and so we correct via Newton’s method: in (2), set $\Delta t = 0$, set the left-hand side to zero, and solve the resulting linear system for $\Delta x$. If the step sizes are chosen appropriately, we should never leave the region of quadratic convergence of Newton’s method, and hence the correction steps will quickly eliminate any errors in path-tracking.
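The predictor-corrector loop just described fits in a few lines. The following sketch (ours) tracks a toy scalar homotopy from a known solution at t = 1 to an unknown one at t = 0; for systems of equations, the scalar divisions below become linear solves:

```python
import numpy as np

def track(H, H_x, H_t, x0, steps=100, newton_iters=3):
    """Naive path-tracker: Euler predictor plus Newton corrector.
    Tracks the solution of H(x, t) = 0 from t = 1, where x0 is a
    known solution, down to t = 0."""
    x, t = x0, 1.0
    dt = -1.0 / steps
    for _ in range(steps):
        # Predictor: solve H_x * dx + H_t * dt = 0 for dx (Euler step).
        x = x - H_t(x, t) * dt / H_x(x, t)
        t = t + dt
        # Corrector: Newton's method on x -> H(x, t) at the new fixed t.
        for _ in range(newton_iters):
            x = x - H(x, t) / H_x(x, t)
    return x

# Toy homotopy: start system x^2 - 4 at t = 1, target x^2 - 9 at t = 0.
H   = lambda x, t: t * (x**2 - 4) + (1 - t) * (x**2 - 9)
H_x = lambda x, t: 2 * x
H_t = lambda x, t: (x**2 - 4) - (x**2 - 9)   # partial derivative in t
sol = track(H, H_x, H_t, x0=2.0)             # should end near x = 3
```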
We hope the brief explanation above has at least convinced the reader to learn more about homotopy continuation. We defer to either of the two excellent textbooks [6, 24] or the instructive tutorials at https://www.juliahomotopycontinuation.org/ for more details on how homotopy continuation works, and its many benefits. At this point, we record the main idea in the following algorithm.
A step size is also given, which may be chosen appropriately using back-tracking line search or many other methods.
In general, the target system will have many solutions, since there can be many critical points for minimizing the Euclidean distance from the stepped point to the manifold. One of these critical points is the closest point we seek. If the step size is too large, the solution of the start system may evolve into some far-away solution of the target system rather than the closest point. If we had all solutions to the start system and tracked them all, it is guaranteed that we would find the closest point; for a precise statement of this theorem, see [24, Thm. 8.4.1]. However, we only start with one solution. This issue is easily identified in practice. If the solution computed via homotopy is complex-valued rather than real, it cannot be the closest point. If the solution is real-valued, but the distance is much larger than expected based on the step size, then we have likely found a far-away critical point. In either case, one may choose new random parameters or simply shorten the step, bringing the stepped point closer to the manifold. Running the algorithm again for this new, “shorter” homotopy should solve the issue. In fact, in all examples we computed using the software described in Section 4, we never ran into this problem. We discuss this further in Section 5 in terms of future research.
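Putting the pieces together, the following sketch (ours) runs the whole Euclidean distance retraction by homotopy on the unit circle, where the answer is known in closed form. A fixed complex multiplier stands in for the random Gaussian draw, and a simple Newton corrector replaces adaptive path-tracking:

```python
import numpy as np

def G(z, u):
    """Critical equations for the closest point on the unit circle to u:
    z = (y1, y2, lam); the circle equation plus y - u + lam * grad F(y) = 0."""
    y1, y2, lam = z
    return np.array([y1**2 + y2**2 - 1,
                     y1 - u[0] + 2 * lam * y1,
                     y2 - u[1] + 2 * lam * y2])

def JG(z, u):
    """Jacobian of G with respect to (y1, y2, lam)."""
    y1, y2, lam = z
    return np.array([[2 * y1,      2 * y2,      0],
                     [1 + 2 * lam, 0,           2 * y1],
                     [0,           1 + 2 * lam, 2 * y2]], dtype=complex)

x = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])          # tangent to the circle at x
u1 = x + 0.2 * v                  # stepped point to be projected back

lam0 = 0.5 + 0.3j                 # fixed complex multiplier (illustration)
u0 = x + 2 * lam0 * x             # start parameters: (x, lam0) solves G = 0

z = np.array([x[0], x[1], lam0], dtype=complex)
steps = 200
for k in range(steps):            # track t from 1 down to 0
    t = 1 - (k + 1) / steps
    u = t * u0 + (1 - t) * u1     # straight-line parameter path
    for _ in range(5):            # Newton corrector at fixed t
        z = z - np.linalg.solve(JG(z, u), G(z, u))

y = z[:2].real                    # endpoint: the retracted point
expected = u1 / np.linalg.norm(u1)   # closed form on the circle
```

Although the path travels through complex space, the endpoint is real to machine precision and agrees with the normalization formula.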
3 Statistics: Maximum likelihood retraction
In this section we leave the setting of submanifolds of Euclidean space and instead consider statistical models as submanifolds of a more general Riemannian manifold whose metric varies from point to point. In this context, we will show that if minimizing Euclidean distance is replaced by maximizing likelihood, then we can define analogous retraction maps to optimize functions on statistical models . Theorem 2 shows how to make these retractions second-order accurate for the Levi-Civita connection on . Finally, Algorithm 2 describes how homotopy continuation may be used to compute these retraction maps on arbitrary, implicitly-defined statistical models.
Consider the smooth manifold structure on induced by viewing the identity map as a global chart. We will use the coordinates denoted for this global chart for all of our computations. The chart-induced global frame for the cotangent bundle over is denoted , while will denote the chart-induced global frame for the tangent bundle. Recall, a global frame of a smooth vector bundle is an ordered tuple of global sections whose values at every point yield a basis. The values at a point will be denoted , so that the vectors form a basis for .
Let be the open probability simplex defined by
Given we define the log-likelihood function by
A smooth statistical model is a subset which is also a smooth manifold. In contrast, an algebraic statistical model is a subset which is also a real algebraic variety. Given a point coming from the results of an experiment, maximum likelihood estimation seeks to find the point of the model which maximizes the likelihood of observing those results. Such a point also maximizes the log-likelihood function restricted to the model. For more details, see [14, Ch. 6].
A Riemannian metric is a smooth symmetric covariant 2-tensor field that is positive definite at each point. In the global chart, the Fisher metric is defined by
where and for any are smooth functions written in the global chart coordinates. Indeed, it is a Riemannian metric (see [5, p. 31]). In statistics and information geometry, the Fisher metric arises naturally, being the unique metric invariant under sufficient statistics [5, Thm 1.2]. In maximum likelihood estimation, one chooses the parameters of a model so as to maximize the likelihood of the observed data. Since the Fisher metric appears in the Cramér-Rao inequality [5, Thm 1.3], it is also related to the reliability of the estimator.
Given a function on a manifold, the vector field is given in local coordinates [17, p. 27] by
where the are smooth functions giving the inverse of the matrix of the metric. For the Fisher metric, and for . Since we see that
Compare this with the Euclidean gradient, which we denote , given by
in the same global frame, but computed with the Euclidean metric . The metrics themselves will be denoted by and when they act as inner products on the tangent space . Accordingly, and denote the normal spaces inside with respect to to the Fisher and Euclidean metrics. Viewing as a Riemannian submanifold of , we will use the Levi-Civita connection (cf. [17, Ch. 5]) corresponding to the Fisher metric in our computations of covariant derivatives below.
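Assuming the Fisher metric takes the standard diagonal form whose entries are the reciprocals of the coordinates (our reading of the definition above), the two gradients differ only by a coordinate-wise rescaling by the inverse metric. The following sketch (ours) makes the comparison numerically:

```python
import numpy as np

def euclidean_grad(f, p, h=1e-6):
    """Numerical Euclidean gradient of f at p via central differences."""
    g = np.zeros_like(p)
    for i in range(len(p)):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2 * h)
    return g

def fisher_grad(f, p):
    """Gradient for the assumed diagonal Fisher metric g_ii = 1 / p_i:
    the inverse metric rescales each Euclidean component by p_i."""
    return p * euclidean_grad(f, p)

p = np.array([0.5, 0.3, 0.2])     # a point of the open simplex
f = lambda q: np.sum(q ** 2)      # a test function
```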
Let be a smooth statistical model. For and , consider the relation defined by where has components . Then there exists a neighborhood of where the argmax is unique and the relation defines a function
which is a first-order, local retraction on at .
Furthermore, viewing as a Riemannian submanifold of with the Fisher metric and its Levi-Civita connection, is a second-order retraction, so that the curve matches geodesics on up to second-order at .
First we check that there exists a neighborhood of such that actually defines a function . Then we will check that this function is a retraction. Consider the normal bundle as a submanifold of and define by in the global coordinates, denoted for short. This is a smooth map, being the restriction of the addition map on . Its differential at is bijective. To see this, consider restricted to the subset . Then is a diffeomorphism of onto and hence its differential at is bijective to . Restricting to gives a map onto the affine subspace whose differential is also bijective. Since we see that is bijective. Therefore is a local diffeomorphism at by the inverse function theorem for manifolds [16, Thm. 4.5]. In particular, there is a neighborhood of in where is injective.
The image of this neighborhood under is an open neighborhood of in for which the maximum likelihood problem has a unique solution. Say it had two points which maximized the log-likelihood function restricted to . By the usual first-order criteria for local maxima constrained to a subset of , we know that the Euclidean gradients and lie in the Euclidean normal spaces and , respectively. We claim that this forces both and to be in and , respectively. To see this we calculate with an arbitrary tangent vector ,
The first term is zero by first-order criteria for local maxima, and the second term is zero because is a statistical model, and hence lives in an affine hyperplane whose Euclidean normal vector is the all ones vector, since probabilities sum to one. In this case, the symbol denotes the all ones vector . Therefore . Similarly, . But then and contradicts the injectivity of on . Therefore is a neighborhood of for which the maximum likelihood problem has a unique solution. Let be the map sending a tangent vector to the tuple , where all coordinates are in terms of the global chart. Then there exists a neighborhood of where whenever , and hence the relation defines a function on .
As a result of the previous discussion, for and we can define the function
First we check that is smooth. Let be the projection onto , sending . For we have . Since is a diffeomorphism between and we know the inverse exists and is smooth on . Since is the composition of smooth maps, is also smooth.
Now we check that . But this follows since the argmax of the function for any over the larger domain is unique and equal to . This fact is well-known, and can be checked by noting that is the only critical point of on and that the Hessian is negative definite since . Therefore since , when we have a unique argmax at as needed.
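The well-known fact just used — that the normalized data vector is the unique maximizer of the log-likelihood over the full simplex — can be spot-checked numerically. In this sketch (ours), the normalized counts beat thousands of random points of the simplex:

```python
import numpy as np

rng = np.random.default_rng(3)
u = np.array([4.0, 7.0, 9.0])     # observed counts
p_hat = u / u.sum()               # claimed unique argmax on the simplex

def log_likelihood(p):
    """Log-likelihood of the data u at the distribution p."""
    return np.sum(u * np.log(p))

# No random point of the open simplex should beat p_hat.
q = rng.dirichlet(np.ones(3), size=2000)
best_sampled = max(log_likelihood(qi) for qi in q)
```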
Finally, we check the conditions for a first-order and second-order retraction. We don’t need the Riemannian structure for first-order, but since we are proving second-order as well we treat both cases uniformly. Define for in a small enough neighborhood of zero, such that . Then is the unique point in which maximizes the log-likelihood function restricted to . Let be the vector field on the curve given in the global chart by , where denotes the derivative of the component function of the curve . To show that is a first-order retraction we need to show that is equal to . To show that is a second-order retraction we need to show that [8, Def. 8.64 and Eqn 8.27]. Here, denotes the covariant derivative along the curve , using the Levi-Civita connection for the Fisher metric on (See [17, Thm 4.24]).
Let be any vector field on the curve which is tangent to , so that for all . By first-order criteria for local maxima we know that for all , since is everywhere tangent to . We also know that since . By denote the curve in with components , and let be the vector field along the curve given by where . Then, we have
Note that is the orthogonal projection of onto along the curve when using the Fisher metric. Applying and using [17, Prop 5.5] we find
Recall (see [17, p. 102]) the covariant derivative of a vector field along a curve is given by
where are the Christoffel symbols of the metric in the given local frame. For the Fisher metric with the corresponding Levi-Civita connection in the global frame induced by the global chart, every except , evaluating at the point along the curve (cf. [17, p. 123]). Using this in our calculations, equation (5) becomes , but at this becomes . Then, since we have , and since both and are in the tangent space at , equation (3) implies that and so , which completes the proof that is a first-order retraction.
Having proved the first-order property, both and are zero, and the first two terms of equation (4) vanish at . A somewhat tedious calculation (which we omit) shows that . Calculating using equation (5), we note that . Therefore, equation (4) becomes
Since this is true for any vector field which is everywhere tangent to , and hence for any tangent vector , we have , which completes the proof. ∎
3.1 Maximum likelihood retraction by homotopy
Having established Theorem 2, we now describe how to use homotopy continuation to compute this new retraction map on statistical models.
Let , , and a stepsize be given. Say that is equal to the zeros of the smooth map . We will also assume that the dimension of is equal to . If not, standard techniques of randomization in numerical nonlinear algebra can be used to create a new map so that , , and the additional zeros of are well-controlled. See [24, Ch. 13.5]. In what follows, let denote the Jacobian matrix of in the standard basis, denote its transpose, and let denote the diagonal matrix with entries . Since , we may assume that is the first component function of the map implicitly defining . Choose randomly from a multivariate Gaussian distribution, and then set the first component to , producing . Then set . Notice that since the first scalar in equals , is equal to plus a random, small step in the complexified normal space . By construction, solves the equation for the variables . These equations say that the Euclidean gradient of at is in the Euclidean normal space at , making a critical point of restricted to . We will use as a starting solution to a homotopy that begins with parameters and ends with parameters .
Let send , where the component functions of are those of plus those of . Then let be the homotopy defined by . As is the convention in the homotopy continuation literature, and since there are more floating point numbers near zero than one, we begin our homotopy at and end at . At we have by construction . If , then . We can compute by homotopy continuation using the start solution at , though we only keep and throw away . We record this idea in the following algorithm.
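As an end-to-end illustration of the maximum likelihood retraction (our sketch; the 2x2 independence model is our choice here because its MLE is the outer product of the normalized marginals), the retraction reproduces the base point at step zero and deviates from the linear step only at second order:

```python
import numpy as np

def mle_independence(u):
    """Closed-form MLE for the 2x2 independence model: the outer
    product of the normalized row and column marginals of the table u."""
    n = u.sum()
    return np.outer(u.sum(axis=1), u.sum(axis=0)) / n**2

def ml_retraction(p, v, t):
    """Maximum likelihood retraction: step to p + t*v off the model,
    then return the MLE of the perturbed table."""
    return mle_independence(p + t * v)

a, b = np.array([0.6, 0.4]), np.array([0.3, 0.7])
p = np.outer(a, b)                     # a point of the independence model
da, db = np.array([0.1, -0.1]), np.array([0.2, -0.2])
v = np.outer(da, b) + np.outer(a, db)  # a tangent vector to the model at p

# Deviation from the linear step shrinks quadratically in t.
errs = [np.linalg.norm(ml_retraction(p, v, t) - (p + t * v)) for t in (0.1, 0.05)]
```

Halving t cuts the deviation by a factor of four, consistent with the first-order retraction property; for general implicitly-defined models without a closed-form MLE, Algorithm 2 computes the same retraction by homotopy continuation.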