Mirror Descent (Nemirovsky & Yudin, 1983; Beck & Teboulle, 2003) is an important template first-order optimization method for optimizing a convex function w.r.t. a geometry specified by a strongly convex potential function. It enjoys rigorous guarantees, and its stochastic and online variants are even optimal for certain learning settings (Srebro et al., 2011). As its name implies, Mirror Descent was derived and is typically described in terms of performing gradient steps in the dual space: in each iteration, one maps the iterate to the dual space, performs an update there, and then mirrors it back to the primal space. Understanding Mirror Descent in this way requires explicitly discussing the dual space.
In this paper we derive a direct “primal” understanding of Mirror Descent, and in order to do so, turn to Riemannian Gradient Flow. The infinitesimal limit of Mirror Descent, where the stepsize is taken to zero, corresponds to a Riemannian Gradient Flow on a manifold with a metric tensor that is given by the Hessian of the potential function used by Mirror Descent (see Section 2.2). The standard forward Euler discretization of this Riemannian Gradient Flow gives rise to the Natural Gradient Descent algorithm (Amari, 1998). Our main observation is that a “partial” discretization of the flow, where we discretize the optimization objective but not the metric tensor specifying the geometry, gives rise precisely to Mirror Descent. This view allows us to understand how Mirror Descent is more faithful to the geometry compared to Natural Gradient Descent.
The relationship we reveal between Mirror Descent and Natural Gradient Descent is different from, and complementary to, the relationship discussed by Raskutti & Mukherjee (2015)—while their work showed how Mirror Descent and Natural Gradient Descent are dual to each other, in the sense that Mirror Descent is equivalent to Natural Gradient Descent in the dual space, we avoid the duality altogether. We work only in the primal space, derive Mirror Descent directly, and show how they are different discretizations of the same flow in the primal space.
Furthermore, our derivation of Mirror Descent also allows us to conceptually generalize Mirror Descent to any Riemannian manifold, including situations where the metric tensor is not specified by the Hessian of any potential, and so there are no corresponding link functions and Bregman divergence.
1.1 Background: Mirror Descent
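Consider optimizing a smooth function $F:\mathcal{W}\to\mathbb{R}$ over a closed convex set $\mathcal{W}\subseteq\mathbb{R}^d$, i.e., $\min_{w\in\mathcal{W}} F(w)$. We will focus on unconstrained optimization, i.e., $\mathcal{W}=\mathbb{R}^d$.
Mirror Descent is a template first-order optimization algorithm specified by a strictly convex potential function $\psi:\mathcal{W}\to\mathbb{R}$. Mirror Descent was developed as a generalization of gradient descent to non-Euclidean geometries, where the local geometry is specified by the Bregman divergences w.r.t. $\psi$ given by $D_\psi(w,w')=\psi(w)-\psi(w')-\langle\nabla\psi(w'),\,w-w'\rangle$. The iterative updates of Mirror Descent with stepsize $\eta$ are defined as:
$$w^{(k+1)}=\arg\min_{w\in\mathcal{W}}\ \eta\,\langle w,\nabla F(w^{(k)})\rangle+D_\psi(w,w^{(k)}). \tag{1}$$
For unconstrained optimization, the updates in eq. (1) are equivalently given by:
$$\nabla\psi(w^{(k+1)})=\nabla\psi(w^{(k)})-\eta\,\nabla F(w^{(k)}), \tag{2}$$
where $\nabla\psi$ is called the link function and provides a mapping between the primal optimization space and the dual space of gradients. Mirror Descent thus performs gradient updates in the dual “mirror”.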
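The dual-space update described above can be sketched in a few lines. This is only an illustration under stated assumptions: the potential is taken to be the unnormalized negative entropy $\psi(w)=\sum_i (w_i\log w_i - w_i)$ on the positive orthant, so that the link function is $\log$ and its inverse is $\exp$, and the quadratic objective and the names `mirror_descent`, `target`, and `grad_F` are hypothetical choices made for the example.

```python
import numpy as np

def mirror_descent(grad_F, w0, eta, num_steps):
    """Unconstrained Mirror Descent written as a gradient step in the dual space.

    Assumption (not from the text): psi(w) = sum_i (w_i log w_i - w_i),
    so the link function is log(w) and the inverse link is exp.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        dual = np.log(w)                # map the iterate to the dual space
        dual = dual - eta * grad_F(w)   # plain gradient step in the dual space
        w = np.exp(dual)                # mirror back to the primal space
    return w

# Hypothetical objective: F(w) = 0.5 * ||w - target||^2
target = np.array([0.5, 2.0, 1.0])
grad_F = lambda w: w - target
w_final = mirror_descent(grad_F, w0=np.ones(3), eta=0.1, num_steps=2000)
```

With this choice of potential the update is multiplicative, $w \leftarrow w \odot \exp(-\eta\,\nabla F(w))$, so the iterates stay in the positive orthant without any projection.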
1.2 Background: Riemannian Gradient Flow
Let $(\mathcal{M},H)$ denote a Riemannian manifold over $\mathcal{M}=\mathbb{R}^d$ equipped with a metric tensor $H(w)$ at each point $w\in\mathcal{M}$. The metric tensor denotes a smoothly varying local inner product on the tangent space at $w$, $\mathcal{T}_w\mathcal{M}$. Intuitively, the tangent space $\mathcal{T}_w\mathcal{M}$ is the vector space of all infinitesimal directions $dw$ that we can move in while following a smooth path on $\mathcal{M}$ (for a more detailed exposition see, e.g., Do Carmo, 2016). For manifolds over $\mathbb{R}^d$, metric tensors can be identified with positive definite matrices $H(w)\in\mathbb{R}^{d\times d}$ that define local distances as $d(w,w+dw)=\sqrt{dw^\top H(w)\,dw}$ for infinitesimal $dw$.
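We consider minimizing a smooth objective:
$$\min_{w\in\mathcal{M}} F(w). \tag{3}$$
The Riemannian Gradient Flow dynamics for (3) with initialization $w(0)=w_{\mathrm{init}}$ are obtained by seeking an infinitesimal change in $w(t)$ that would lead to the best improvement in objective value, while controlling the length of the change in terms of the manifold geometry:
$$w(t+dt)=\arg\min_{w}\ F(w)\,dt+\tfrac{1}{2}\,d(w,w(t))^2. \tag{4}$$
For infinitesimal $dt$, using $w(t+dt)=w(t)+dw$, we can replace $F(w)$ with its first order approximation $F(w(t))+\langle\nabla F(w(t)),dw\rangle$, and $d(w,w(t))^2=dw^\top H(w(t))\,dw$. Solving for $dw$ from eq. (4), we obtain:
$$\dot{w}(t)=-H(w(t))^{-1}\nabla F(w(t)), \tag{GF}$$
where here and throughout we denote $\dot{w}=\frac{dw}{dt}$.
We refer to the path $w(t)$ specified by eq. (GF) and initial condition $w(0)=w_{\mathrm{init}}$ as Riemannian Gradient Flow or sometimes simply as gradient flow.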
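For readers who want the intermediate step, the following worked derivation spells out how the flow arises. It is a sketch that assumes the variational problem has the form "first-order change in the objective, scaled by $dt$, plus half the squared local length of the step", with $H$ denoting the metric tensor, $F$ the objective, and $w(t)$ the current point.

```latex
\begin{align*}
dw^\star
  &= \arg\min_{dw}\ \big\langle \nabla F(w(t)),\, dw \big\rangle\, dt
     + \tfrac{1}{2}\, dw^\top H(w(t))\, dw
  && \text{(linearize $F$, use the local distance)} \\
0 &= \nabla F(w(t))\, dt + H(w(t))\, dw^\star
  && \text{(first-order optimality; $H \succ 0$ makes it a minimum)} \\
dw^\star &= -H(w(t))^{-1}\, \nabla F(w(t))\, dt
  && \text{(solve the linear system)} \\
\frac{dw}{dt} &= -H(w(t))^{-1}\, \nabla F(w(t)).
\end{align*}
```

The constant term $F(w(t))\,dt$ is dropped in the first line because it does not depend on $dw$.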
Examples of Riemannian metrics and corresponding gradient flow that arise in learning and related areas include:
The standard Euclidean geometry is recovered with $H(w)=I$. In this case (GF) reduces to the standard gradient flow $\dot{w}=-\nabla F(w)$. When $H(w)=H$ is fixed to some other positive definite $H$, we get the pre-conditioned gradient flow $\dot{w}=-H^{-1}\nabla F(w)$, which can also be thought of as the gradient flow dynamics on a reparametrization $\tilde{w}=H^{1/2}w$, i.e. with respect to geometry specified by a linear distortion.
For any strongly convex potential function $\psi$ over $\mathbb{R}^d$, the Hessian $\nabla^2\psi(w)$ defines a non-Euclidean metric tensor. Examples include squared $\ell_p$ norms $\psi(w)=\frac{1}{2}\|w\|_p^2$ for $1<p\le 2$ ($p=2$ again recovers the standard Euclidean geometry) and a particularly important example is the simplex, endowed with an entropy potential $\psi(w)=\sum_i w_i\log w_i$.
Information geometry (Amari, 2012) is concerned with a manifold of probability distributions, e.g. in some parametric family, typically endowed with the metric derived from the KL-divergence. In our notation, we would consider this as defining a Riemannian metric structure over the manifold parameters $w$, with a metric tensor $H(w)$ which is given by the Fisher information matrix. Such a geometry can also be obtained by considering the entropy as a potential function and taking its Hessian.
Another useful Riemannian geometry on the manifold of probability distributions is given by the Wasserstein distance. The dynamics of the probability law of a process that follows Langevin dynamics are given by Riemannian Gradient Flow with respect to the Wasserstein distance, when minimizing the KL divergence to a target distribution (Jordan et al., 1998). Going beyond probability distributions, Riemannian Gradient Flow on the space of (signed) measures with respect to the Wasserstein distance captures training infinite width neural networks: gradient descent dynamics on individual weights corresponds to Wasserstein flow on the measure over the weights (e.g. Mei et al., 2018; Chizat & Bach, 2018).
2 Discretizing Riemannian Gradient Flow
The Riemannian Gradient Flow is a continuous object defined in terms of a differential equation (GF). To utilize it algorithmically, we consider discretizations of the flow.
2.1 Natural Gradient Descent
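Natural Gradient Descent is obtained as the forward Euler discretization with stepsize $\eta$ of the gradient flow (GF):
$$w^{(k+1)}=w^{(k)}-\eta\,H(w^{(k)})^{-1}\nabla F(w^{(k)}). \tag{5}$$
These Natural Gradient Descent updates were suggested and popularized by Amari (1998), particularly in the context of information geometry, where the metric tensor is given by the Fisher information matrix of some family of distributions.
Equivalently, the forward Euler discretization can be written as a differential equation whose right-hand side is held fixed within each step:
$$\dot{w}(t)=-H\big(w(\lfloor t\rfloor_\eta)\big)^{-1}\nabla F\big(w(\lfloor t\rfloor_\eta)\big), \tag{NGD}$$
where $\lfloor t\rfloor_\eta=\eta\lfloor t/\eta\rfloor$ denotes a discretization at scale $\eta$, i.e. the largest $t'\le t$ such that $t'$ is an integer multiple of $\eta$.
The differential equation (NGD) specifies a piecewise linear solution $w(t)$, interpolating between the Natural Gradient Descent iterates. In particular, the Natural Gradient Descent iterates in eq. (5) are given by $w^{(k)}=w(\eta k)$ where $w(t)$ is the solution of (NGD) with the initial condition $w(0)=w_{\mathrm{init}}$, and we could have alternatively defined the forward Euler discretization in this way.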
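A minimal sketch of the forward Euler update follows, in which both the metric tensor and the gradient are evaluated at the current iterate. The metric `metric(w) = diag(1/w)` (the Hessian of an entropy potential on the positive orthant), the quadratic objective, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def natural_gradient_descent(grad_F, metric, w0, eta, num_steps):
    """Forward Euler discretization of the flow dw/dt = -H(w)^{-1} grad F(w)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        # Both the metric and the gradient are frozen at the current iterate,
        # so the step is a straight line segment in the primal space.
        w = w - eta * np.linalg.solve(metric(w), grad_F(w))
    return w

# Hypothetical setup: entropy-Hessian metric and a quadratic objective.
metric = lambda w: np.diag(1.0 / w)
target = np.array([0.5, 2.0, 1.0])
grad_F = lambda w: w - target
w_ngd = natural_gradient_descent(grad_F, metric, np.ones(3), eta=0.1, num_steps=2000)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse explicitly, which is the usual choice when only the metric tensor itself is available.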
2.2 Mirror Descent
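We now consider an alternative, “partial” discretization of the Riemannian Gradient Flow (GF), in which the gradient of the objective is held fixed within each step but the metric tensor is not:
$$\dot{w}(t)=-H(w(t))^{-1}\nabla F\big(w(\lfloor t\rfloor_\eta)\big). \tag{MD}$$
The resulting solution $w(t)$ is piecewise smooth. We will again consider the sequence of iterates at discrete points:
$$w^{(k)}=w(\eta k). \tag{6}$$
Theorem 1. Let $\psi$ be a smooth strictly convex potential and let $H(w)=\nabla^2\psi(w)$. Then the iterates (6) of the solution of (MD) with initial condition $w(0)=w_{\mathrm{init}}$ coincide with the Mirror Descent iterates (2) with potential $\psi$, the same initialization, and the same stepsize $\eta$.
Proof. Consider the Mirror Descent iterates $w^{(k)}$ with step size $\eta$ for link function $\nabla\psi$ from eq. (2).
Define a Mirror Descent path $\hat{w}(t)$ by linearly interpolating in the dual space as follows:
$$\nabla\psi(\hat{w}(t))=\nabla\psi(w^{(k)})-(t-\eta k)\,\nabla F(w^{(k)})\quad\text{for } t\in[\eta k,\eta(k+1)].$$
By construction $\hat{w}(\eta k)=w^{(k)}$. Differentiating both sides in $t$ and applying the chain rule gives $\nabla^2\psi(\hat{w}(t))\,\dot{\hat{w}}(t)=-\nabla F(\hat{w}(\lfloor t\rfloor_\eta))$, which is (MD) with $H=\nabla^2\psi$, so $\hat{w}(t)$ is the solution of (MD).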
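As a numerical sanity check of the claim in this section, the sketch below integrates the partially discretized flow over one step (gradient frozen, metric tensor following the path) and compares the endpoint with the dual-space Mirror Descent update and with the forward Euler (Natural Gradient Descent) step. It assumes the entropy-type potential $\psi(w)=\sum_i (w_i\log w_i-w_i)$, whose Hessian is $\mathrm{diag}(1/w)$ and whose link function is $\log$; the RK4 integrator, the function name `md_ode_step`, and the test vectors are illustrative choices.

```python
import numpy as np

def md_ode_step(w, g, eta, metric, substeps=200):
    """Integrate dw/dt = -H(w(t))^{-1} g over [0, eta] with classical RK4.

    The gradient g stays frozen at its value from the start of the step,
    while the metric H is re-evaluated along the path.
    """
    rhs = lambda v: -np.linalg.solve(metric(v), g)
    h = eta / substeps
    for _ in range(substeps):
        k1 = rhs(w)
        k2 = rhs(w + 0.5 * h * k1)
        k3 = rhs(w + 0.5 * h * k2)
        k4 = rhs(w + h * k3)
        w = w + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return w

metric = lambda w: np.diag(1.0 / w)      # Hessian of the assumed potential
w = np.array([1.0, 2.0, 0.5])
g = np.array([0.3, -1.0, 2.0])           # stands in for grad F at the iterate
eta = 0.5

w_ode = md_ode_step(w, g, eta, metric)              # partial discretization
w_md = np.exp(np.log(w) - eta * g)                  # dual-space update
w_ngd = w - eta * np.linalg.solve(metric(w), g)     # forward Euler step
```

For this potential the flow with frozen gradient has the closed form $w(t)=w(0)\odot\exp(-t\,g)$, which is exactly the Mirror Descent step, whereas the forward Euler step $w\odot(1-\eta g)$ only agrees with it to first order in $\eta$.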
2.3 A More Faithful Discretization
Comparing the two discretizations (NGD) and (MD) allows us to understand the relationship between Natural Gradient updates (5) and Mirror Descent updates (2). Although both updates have the same infinitesimal limit, as previously discussed by, e.g. Gunasekar (2018), they differ in how the discretization is done with finite stepsizes: while Natural Gradient Descent corresponds to discretizing both the objective and the geometry, Mirror Descent involves discretizing only the objective, but not the geometry. In this sense, Mirror Descent is a “more accurate” discretization, being more faithful to the geometry.
The price we pay for this “more accurate” discretization is that Mirror Descent might be more demanding to implement. Natural Gradient Descent can be easily implemented if we have access to gradients of the objective and the inverse of the metric tensor (i.e. we need to either obtain and invert the metric tensor, or have direct access to its inverse). But at least for a traditional implementation of the Mirror Descent updates in eq. (2), (a) we need the metric tensor to be a Hessian map, i.e. the differential equation $\nabla^2\psi=H$ should have a solution, and (b) we need to be able to efficiently calculate the link and inverse link functions $\nabla\psi$ and $(\nabla\psi)^{-1}$.
2.4 When Does a Potential Exist?
In Theorem 1 we established that for any smooth strictly convex potential $\psi$, we can define a metric tensor $H=\nabla^2\psi$ so that Mirror Descent is obtained as a discretization of the Riemannian Gradient Flow (GF), and (GF) is obtained as the infinitesimal limit of Mirror Descent. One might ask whether this connection with Mirror Descent holds for any Riemannian Gradient Flow.
The Riemannian Gradient Flow (GF), and hence also the discretization (MD), can be defined for any smooth Riemannian manifold, or any smooth Riemannian metric tensor $H$. But the classic Mirror Descent updates (2) are only defined in terms of a potential function $\psi$. That is, to make the connection with Mirror Descent, we needed $H$ to be a Hessian map, i.e. for the differential equation $\nabla^2\psi=H$ to have a solution. Since $H$ is already symmetric, this happens if and only if the following condition holds (see Proposition 4.1, Duistermaat, 2001):
$$\frac{\partial H_{ij}(w)}{\partial w_k}=\frac{\partial H_{ik}(w)}{\partial w_j}\quad\text{for all } w \text{ and all } i,j,k\in\{1,\dots,d\}. \tag{7}$$
Additionally, since $H(w)$ must also be positive definite, if $H=\nabla^2\psi$, then $\psi$ must be strictly convex. Hence, we can conclude that the discretization of Riemannian Gradient Flow in (MD) corresponds to classic Mirror Descent (2) for some strictly convex potential $\psi$ if and only if (7) holds. But the requirement (7) is non-trivial and does not hold in general: even a seemingly simple metric tensor can fail to satisfy (7) and therefore not be a Hessian map (see Footnote 1).
Footnote 1: Such a metric tensor can arise by considering a $d$-dimensional manifold embedded in $\mathbb{R}^{d+1}$, with local distances induced by the Euclidean geometry on $\mathbb{R}^{d+1}$.
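Whether a given metric tensor is a Hessian map can be probed numerically. The sketch below assumes the condition in question is the standard integrability condition $\partial H_{ij}/\partial w_k=\partial H_{ik}/\partial w_j$ and checks it at a single point by central finite differences. Both test metrics are our own illustrative choices: $\mathrm{diag}(1/w)$, which is the Hessian of an entropy potential, and $I+ww^\top$, which is the metric induced on the graph of $\frac{1}{2}\|w\|^2$ embedded in $\mathbb{R}^{d+1}$. The helper names are hypothetical.

```python
import numpy as np

def metric_derivatives(metric, w, eps=1e-6):
    """T[i, j, k] approximates dH_ij / dw_k by central finite differences."""
    d = w.size
    T = np.empty((d, d, d))
    for k in range(d):
        e = np.zeros(d)
        e[k] = eps
        T[:, :, k] = (metric(w + e) - metric(w - e)) / (2 * eps)
    return T

def looks_like_hessian_map(metric, w, tol=1e-5):
    """Check dH_ij/dw_k == dH_ik/dw_j at the point w (a local, numerical test)."""
    T = metric_derivatives(metric, w)
    return np.allclose(T, np.transpose(T, (0, 2, 1)), atol=tol)

w = np.array([1.0, 2.0])
entropy_metric = lambda v: np.diag(1.0 / v)                  # Hessian of an entropy potential
graph_metric = lambda v: np.eye(v.size) + np.outer(v, v)     # illustrative non-Hessian metric

is_hessian_entropy = looks_like_hessian_map(entropy_metric, w)
is_hessian_graph = looks_like_hessian_map(graph_metric, w)
```

For $I+ww^\top$ the failure is easy to see by hand: $\partial H_{12}/\partial w_1=w_2$ while $\partial H_{11}/\partial w_2=0$. A check at one point can only refute the condition; passing it at sampled points is evidence, not proof, that a potential exists.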
2.5 Contrast With Prior Derivations
We emphasize that our observation of Mirror Descent as a partial discretization of Riemannian Gradient Flow is rather different from, and complementary to, a previous relationship pointed out between Natural Gradient Descent and Mirror Descent by Raskutti & Mukherjee (2015), which does rely on duality. Raskutti & Mukherjee showed that Mirror Descent corresponds to Natural Gradient Descent in the dual, that is, after a change of parametrization given by the link function $\nabla\psi$. Thus, while Natural Gradient Descent takes steps along straight rays in the primal space, Mirror Descent takes steps along straight rays in the dual.
In terms of the discretized differential equations (NGD) and (MD), the above relationship can be stated as follows: the path of the Natural Gradient Descent discretization (NGD) is piecewise linear in the primal space, i.e. $w(t)$ is piecewise linear, while the path of the Mirror Descent discretization (MD) is piecewise linear in the dual space, i.e. $\nabla\psi(w(t))$ is piecewise linear, and consequently curved in the primal space. But this view does not explain why Mirror Descent is preferable when we are interested in the primal geometry. In contrast, here we focus only on the (primal) Riemannian geometry, and highlight why Mirror Descent is more faithful to this geometry.
The above dual view is also captured by another way of developing Mirror Descent as a discretization of a differential equation, which is again substantially different from our development as it does rely on a link function: when the metric tensor is a Hessian map and $H=\nabla^2\psi$, then (GF) can also equivalently be written as follows (Nemirovsky & Yudin, 1983; Warmuth & Jagota, 1997; Raginsky & Bouvrie, 2012):
$$\frac{d}{dt}\nabla\psi(w(t))=-\nabla F(w(t)). \tag{8}$$
A forward Euler discretization of the differential equation (8) yields the Mirror Descent updates (2). This can be viewed as using a standard full forward Euler discretization, corresponding to piecewise linear updates and Natural Gradient Descent, but in the dual variables, i.e. discretizing $\dot{\tilde{w}}=-\nabla F(w)$ where $\tilde{w}=\nabla\psi(w)$.
Viewing Mirror Descent as a forward Euler discretization of (8), or as dual to Natural Gradient Descent, depends crucially on having a link function and on the metric tensor being a Hessian map $H=\nabla^2\psi$. But could there be an analogue to the link function even if the metric tensor is not a Hessian map? In other words, could we have a function $\Phi:\mathbb{R}^d\to\mathbb{R}^d$ such that $\Phi(w(t))$ is piecewise linear under the Mirror Descent dynamics (MD), i.e. such that
$$\frac{d}{dt}\Phi(w(t))=-\nabla F\big(w(\lfloor t\rfloor_\eta)\big) \tag{9}$$
is equivalent to (MD) for any smooth objective $F$? Applying the chain rule to the left hand side of (9), this would require that $\nabla\Phi(w)=H(w)$, which further implies that $H$ is a Hessian map (see Footnote 2). Therefore, if the metric tensor is not a Hessian map, there is no analogue to the link function, and we cannot obtain Mirror Descent as Natural Gradient Descent after some change of variables, nor as a forward Euler discretization of a differential equation similar to (8).
Footnote 2: Applying the chain rule on (9) and substituting (MD) we get $\nabla\Phi(w(t))\,H(w(t))^{-1}\nabla F(w(\lfloor t\rfloor_\eta))=\nabla F(w(\lfloor t\rfloor_\eta))$. If this holds for any objective $F$, it must be that $\nabla\Phi(w)=H(w)$. But $\nabla\Phi=H$ indicates that $H$ is a Jacobian map, and since $H$ is always symmetric, this further implies $\frac{\partial H_{ij}}{\partial w_k}=\frac{\partial H_{ik}}{\partial w_j}$ since both sides evaluate to $\frac{\partial^2\Phi_i}{\partial w_j\,\partial w_k}$.
A distinguishing feature of our novel derivation of Mirror Descent is that it does not require a potential, dual or link function, and so it does not rely on the metric tensor being a Hessian map. This also allows us to conceptually generalize Mirror Descent to metric tensors that are not Hessian maps.
3 Potential-free Mirror Descent
As we emphasized, a significant difference between our derivation and previous, or “classical”, derivations of Mirror Descent, is that our derivation did not involve, or even rely on the existence of, a potential function; that is, it did not rely on the metric tensor being a Hessian map. If the metric tensor is not a Hessian map, we cannot define the link function nor the Bregman divergence, and the standard Mirror Descent updates (1)–(2) are not defined. Nevertheless our equivalent primal-only derivation of Mirror Descent (6) does allow us to generalize Mirror Descent as a first order optimization procedure to any metric tensor, even if it is not a Hessian map. We simply use (6) as the definition of Mirror Descent.
To be more precise, we can define $w^{(k+1)}$ iteratively as follows: given $w^{(k)}$ and the gradient $\nabla F(w^{(k)})$, consider the path defined by the differential equation
$$\dot{w}(t)=-H(w(t))^{-1}\nabla F(w^{(k)}),\qquad w(0)=w^{(k)}, \tag{10}$$
and set $w^{(k+1)}=w(\eta)$.
For a general metric tensor $H$, the above updates require computing the solution of a differential equation at each step, which may or may not be efficiently computable. However, it is important to note that the differential equation (10) depends on the objective $F$ only through the gradient $\nabla F(w^{(k)})$. That is, the only required access to the objective in order to implement the method is a single gradient access per iteration; the rest is just computation in terms of the pre-specified geometry (similar to computing the link and inverse link in standard Mirror Descent). The Mirrorless Mirror Descent procedure we defined is thus a valid first order optimization method, and independent of the tractability of solving the differential equation, could be of interest in studying optimization with first order oracle access under general geometries.
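The procedure just described can be sketched for a metric tensor that has no potential. This is an illustration under stated assumptions rather than a reference implementation: the metric $H(w)=I+ww^\top$ is our own example of a non-Hessian metric, its inverse is applied in closed form via the Sherman–Morrison identity, the inner differential equation is solved approximately with RK4, and the objective and function names are hypothetical.

```python
import numpy as np

def inv_metric_times(w, g):
    """Apply H(w)^{-1} to g for the assumed metric H(w) = I + w w^T (Sherman-Morrison)."""
    return g - w * (w @ g) / (1.0 + w @ w)

def mirrorless_md_step(w, g, eta, substeps=20):
    """One step: follow dw/dt = -H(w(t))^{-1} g for time eta, with g held fixed."""
    rhs = lambda v: -inv_metric_times(v, g)
    h = eta / substeps
    for _ in range(substeps):
        k1 = rhs(w)
        k2 = rhs(w + 0.5 * h * k1)
        k3 = rhs(w + 0.5 * h * k2)
        k4 = rhs(w + h * k3)
        w = w + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return w

# Hypothetical objective: F(w) = 0.5 * ||w - target||^2
target = np.array([1.0, -1.0])
grad_F = lambda w: w - target

w = np.zeros(2)
for _ in range(1000):
    # Exactly one gradient query per outer iteration; the inner solve
    # touches only the geometry.
    w = mirrorless_md_step(w, grad_F(w), eta=0.1)
w_mirrorless = w
```

The number of inner sub-steps trades accuracy of the geometric solve against computation, but it does not change the number of gradient evaluations, which stays at one per iteration.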
4 Importance of the Parametrization
Our development in Section 2 relied not only on an abstract Riemannian manifold $(\mathcal{M},H)$, but on a specific parametrization (or “chart”) for the manifold, or in our presentation, on identifying the manifold $\mathcal{M}$, and so also its tangent spaces, with $\mathbb{R}^d$. Let us consider now the effect of a change of parametrization (i.e. of using a different chart).
Consider a change of parameters $\tilde{w}=\phi(w)$ for some smooth invertible $\phi$ with invertible Jacobian $J(w)=\nabla\phi(w)$, that specifies an isometric Riemannian manifold $(\tilde{\mathcal{M}},\tilde{H})$, i.e., such that $dw^\top H(w)\,dw=d\tilde{w}^\top\tilde{H}(\tilde{w})\,d\tilde{w}$ for all infinitesimal $dw$. The metric tensor for the isometric manifold is given by
$$\tilde{H}(\tilde{w})=J(w)^{-\top}H(w)\,J(w)^{-1}, \tag{11}$$
where recall that the Jacobian of the inverse is the inverse Jacobian, $\nabla\phi^{-1}(\tilde{w})=J(w)^{-1}$. This can also be thought of as using a different chart for the manifold (in our case, a global chart, since the manifold is isomorphic to $\mathbb{R}^d$).
In understanding methods operating on a manifold, it is important to separate what is intrinsic to the manifold and its geometry, and what aspects of the method are affected by the parametrization, especially since one might desire “intrinsic” methods that depend only on the manifold and its geometry, but not on the parametrization. We therefore ask how changing the parametrization affects our development. In particular, does the Mirror Descent discretization, and with it the Mirror Descent updates, change with the parametrization?
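Consider minimizing $F(w)$, which after the reparametrization we refer to as $\tilde{F}(\tilde{w})=F(\phi^{-1}(\tilde{w}))$. The Riemannian Gradient Flow on $(\tilde{\mathcal{M}},\tilde{H})$ is then defined by
$$\dot{\tilde{w}}(t)=-\tilde{H}(\tilde{w}(t))^{-1}\nabla\tilde{F}(\tilde{w}(t)), \tag{12}$$
where note that $\nabla\tilde{F}(\tilde{w})=J(w)^{-\top}\nabla F(w)$.
Since our initial development of the Riemannian Gradient Flow in (4) was independent of the parametrization, it should be the case that the solution $\tilde{w}(t)$ of (12), i.e. gradient flow in $\tilde{\mathcal{M}}$, is equivalent to gradient flow in $\mathcal{M}$, i.e. $\tilde{w}(t)=\phi(w(t))$ where $w(t)$ is the solution of (GF). It is, however, insightful to verify this directly: for the path $w(t)$ from (GF), setting $\tilde{w}(t)=\phi(w(t))$ and calculating, we have (writing $J=J(w(t))$):
$$\dot{\tilde{w}}(t)=J\,\dot{w}(t)=-J\,H(w(t))^{-1}\nabla F(w(t))=-\big(J\,H(w(t))^{-1}J^{\top}\big)\,J^{-\top}\nabla F(w(t))=-\tilde{H}(\tilde{w}(t))^{-1}\nabla\tilde{F}(\tilde{w}(t)), \tag{13}$$
thus verifying that $\tilde{w}(t)$ indeed satisfies (12).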
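The invariance of the continuous flow under reparametrization can also be checked numerically at a single point. The sketch below is illustrative only: the reparametrization $\phi(w)=\exp(w)$ (elementwise), the metric $I+ww^\top$, and the quadratic objective are our own choices, and the gradient of the reparametrized objective is obtained by finite differences so that the check does not simply restate the algebra.

```python
import numpy as np

# Hypothetical reparametrization w_tilde = phi(w) = exp(w), applied elementwise.
phi = lambda w: np.exp(w)
phi_inv = lambda wt: np.log(wt)
jacobian = lambda w: np.diag(np.exp(w))

# Illustrative metric and objective in the original parametrization.
H = lambda w: np.eye(w.size) + np.outer(w, w)
target = np.array([0.3, -0.7])
F = lambda w: 0.5 * np.sum((w - target) ** 2)
grad_F = lambda w: w - target

def num_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

w = np.array([0.2, -0.4])
J = jacobian(w)
J_inv = np.linalg.inv(J)
w_tilde = phi(w)

H_tilde = J_inv.T @ H(w) @ J_inv                 # metric of the isometric manifold
F_tilde = lambda wt: F(phi_inv(wt))              # objective after reparametrization
grad_F_tilde = num_grad(F_tilde, w_tilde)

velocity = -np.linalg.solve(H(w), grad_F(w))                 # flow direction in w
velocity_tilde = -np.linalg.solve(H_tilde, grad_F_tilde)     # flow direction in w_tilde
pushed_forward = J @ velocity                                # image of the w-velocity under phi
```

If the flow is intrinsic, `pushed_forward` and `velocity_tilde` should agree up to finite-difference error, which is what the identity $\tilde{H}^{-1}=J\,H^{-1}J^\top$ predicts.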
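Do the same arguments hold also for the Mirror Descent discretization (MD)? Taking the solution $w(t)$ of (MD) and setting $\tilde{w}(t)=\phi(w(t))$, we can follow the same derivation as above, except now the metric tensor and gradient are calculated at different points ($w(t)$ and $w(\lfloor t\rfloor_\eta)$, respectively):
$$\dot{\tilde{w}}(t)=-J(w(t))\,H(w(t))^{-1}\nabla F\big(w(\lfloor t\rfloor_\eta)\big)=-\tilde{H}(\tilde{w}(t))^{-1}\,J(w(t))^{-\top}J\big(w(\lfloor t\rfloor_\eta)\big)^{\top}\,\nabla\tilde{F}\big(\tilde{w}(\lfloor t\rfloor_\eta)\big). \tag{14}$$
We can see why the Mirror Descent discretization, and hence also the Mirror Descent iterates, are not invariant to changes in parametrization: if $J(w)=J$ is fixed, i.e., the reparametrization is affine, we have $J(w(t))^{-\top}J(w(\lfloor t\rfloor_\eta))^{\top}=I$ and (14) shows that $\tilde{w}(t)$ satisfies the Mirror Descent discretized differential equation w.r.t. $\tilde{H}$. But more generally, the discretization would be affected by the “alignment” of the Jacobians along the solution path.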
A related question is how a reparametrization affects whether the metric tensor is a Hessian map. Indeed, for a particular parametrization (i.e. chart), the existence of a potential function $\psi$ such that $H=\nabla^2\psi$ has nothing to do with the manifold, and just depends on whether $H$ satisfies (7), and it may well be the case that $H$ is not a Hessian map but $\tilde{H}$ is, or vice versa (in fact, in general if $\phi$ is non-linear we cannot expect both $H$ and $\tilde{H}$ to be Hessian maps). Is there always such a reparametrization? Amari & Armstrong (2014) showed that while all Riemannian manifolds isomorphic to $\mathbb{R}^2$ admit a parametrization for which the metric tensor is a Hessian map, this is no longer true in higher dimensions, and even a manifold isomorphic to $\mathbb{R}^d$ with $d\ge 3$ might not admit any parametrization with a Hessian metric tensor.
We see then how our potential-free derivation of Section 3 can indeed be much more general than the traditional view of Mirror Descent, which applies only when the metric tensor is a Hessian map and a potential function exists: for many Riemannian manifolds, there is no parametrization with a Hessian metric tensor, and so it is not possible to define Mirror Descent updates classically such that the Riemannian gradient flow is obtained as their limit. Yet, the approach of Section 3 always allows us to do so. Furthermore, even for manifolds for which there exists a parametrization where the metric tensor is the Hessian of some potential function, our approach allows considering discretizations in other isometric parametrizations.
In this paper we presented a “primal” derivation of Mirror Descent, based on a discretization of the Riemannian Gradient Flow, and showed how it can be useful for understanding, thinking about, and potentially analyzing Mirror Descent, Natural Gradient Descent, and Riemannian Gradient Flow. We also showed how this view suggests a generalization of (Mirrorless) Mirror Descent to any Riemannian geometry.
Appendix A Stochastic Discretization
Finally, we also briefly discuss how yet another discretization of Riemannian Gradient Flow captures Stochastic Mirror Descent, and could be useful in studying optimization versus statistical issues in training.
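We have so far discussed exact, or batch, Mirror Descent, but a popular variant is Stochastic Mirror Descent, where at each iteration we update based on an unbiased estimator $g^{(k)}$ of the gradient $\nabla F(w^{(k)})$, i.e. such that $\mathbb{E}[g^{(k)}]=\nabla F(w^{(k)})$, as
$$\nabla\psi(w^{(k+1)})=\nabla\psi(w^{(k)})-\eta\,g^{(k)}. \tag{15}$$
Consider a stochastic objective of the form:
$$F(w)=\mathbb{E}_{z}\big[f(w,z)\big]. \tag{16}$$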
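We can derive Stochastic Mirror Descent from the following stochastic discretization of Riemannian Gradient Flow (GF):
$$\dot{w}(t)=-H(w(t))^{-1}\nabla f\big(w(\lfloor t\rfloor_\eta),\,z_{\lfloor t/\nu\rfloor}\big), \tag{17}$$
where $z_0,z_1,\dots$ are sampled i.i.d., and we used two different resolutions, $\eta$ and $\nu$, to control the discretization.
Setting $\nu=\eta$ and taking $w^{(k)}=w(\eta k)$ we recover “single example” Stochastic Mirror Descent, i.e. where at each iteration we use a gradient estimator based on a single i.i.d. example. But varying $\nu$ relative to $\eta$ also allows us to obtain other variants.
Taking $\nu<\eta$, e.g. $\nu=\eta/b$ for an integer $b$, we recover Mini-Batch Stochastic Mirror Descent, where at each iteration we use a gradient estimator obtained by averaging across $b$ i.i.d. examples. To see this, note that solving (17) as in Theorem 1 we have that $\frac{d}{dt}\nabla\psi(w(t))=-\nabla f(w^{(k)},z_{\lfloor t/\nu\rfloor})$ for $t\in[\eta k,\eta(k+1))$, and so $\nabla\psi(w^{(k+1)})=\nabla\psi(w^{(k)})-\eta\cdot\frac{1}{b}\sum_{i=0}^{b-1}\nabla f(w^{(k)},z_{kb+i})$.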
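The mini-batch claim can be illustrated with a short sketch: within one step of length $\eta$, switching to a fresh sample every $\eta/b$ units of time while keeping the gradient location frozen gives the same endpoint as one Mirror Descent step with the averaged gradient. The sketch assumes an entropy-type potential with link function $\log$, a hypothetical per-example loss $f(w,z)=\frac{1}{2}\|w-z\|^2$, and works directly in the dual variables, where the dynamics are piecewise linear.

```python
import numpy as np

rng = np.random.default_rng(0)

w = np.array([1.0, 2.0, 0.5])            # current iterate (positive orthant)
eta, b = 0.4, 8                          # outer stepsize and mini-batch size
nu = eta / b                             # finer resolution at which samples change
samples = rng.normal(size=(b, 3))        # i.i.d. examples used within this step

# Hypothetical per-example loss f(w, z) = 0.5 * ||w - z||^2
sample_grad = lambda w, z: w - z

# Fine-resolution dynamics in the dual: a new sample every nu units of time,
# with the gradient always evaluated at the frozen iterate w.
dual = np.log(w)
for z in samples:
    dual = dual - nu * sample_grad(w, z)
w_fine = np.exp(dual)

# One mini-batch Mirror Descent step with the averaged gradient.
g_avg = np.mean([sample_grad(w, z) for z in samples], axis=0)
w_batch = np.exp(np.log(w) - eta * g_avg)
```

The two endpoints agree because $\nu\sum_{i}\nabla f(w,z_i)=\eta\cdot\frac{1}{b}\sum_i\nabla f(w,z_i)$ when $\nu=\eta/b$.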
It is also interesting to consider $\eta<\nu$, in particular when $\eta\to 0$ while $\nu$ is fixed. This corresponds to optimization using stochastic (infinitesimal) gradient flow, where over a time $T$ we use $T/\nu$ samples. Studying how close the discretization (17) remains to the population Riemannian Gradient Flow (GF), in terms of $\eta$ and $\nu$, could allow us to tease apart the optimization complexity and sample complexity of learning (minimizing the population objective).
- Amari (1998) Amari, S.-I. Natural gradient works efficiently in learning. Neural computation, 1998.
- Amari (2012) Amari, S.-i. Differential-geometrical methods in statistics, volume 28. Springer Science & Business Media, 2012.
- Amari & Armstrong (2014) Amari, S.-i. and Armstrong, J. Curvature of hessian manifolds. Differential Geometry and its Applications, 33:1–12, 2014.
- Beck & Teboulle (2003) Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 2003.
- Chizat & Bach (2018) Chizat, L. and Bach, F. On the global convergence of gradient descent for over-parameterized models using optimal transport. In Advances in neural information processing systems, pp. 3036–3046, 2018.
- Do Carmo (2016) Do Carmo, M. P. Differential Geometry of Curves and Surfaces: Revised and Updated Second Edition. Courier Dover Publications, 2016.
- Duistermaat (2001) Duistermaat, J. J. On hessian riemannian structures. Asian journal of mathematics, 5(1):79–91, 2001.
- Gunasekar et al. (2018) Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Characterizing implicit bias in terms of optimization geometry. In International Conference on Machine Learning, 2018.
- Jordan et al. (1998) Jordan, R., Kinderlehrer, D., and Otto, F. The variational formulation of the fokker–planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
- Mei et al. (2018) Mei, S., Montanari, A., and Nguyen, P.-M. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
- Nemirovsky & Yudin (1983) Nemirovsky, A. S. and Yudin, D. B. Problem complexity and method efficiency in optimization. 1983.
- Raginsky & Bouvrie (2012) Raginsky, M. and Bouvrie, J. Continuous-time stochastic mirror descent on a network: Variance reduction, consensus, convergence. In 2012 IEEE 51st IEEE Conference on Decision and Control (CDC). IEEE, 2012.
- Raskutti & Mukherjee (2015) Raskutti, G. and Mukherjee, S. The information geometry of mirror descent. IEEE Transactions on Information Theory, 2015.
- Srebro et al. (2011) Srebro, N., Sridharan, K., and Tewari, A. On the universality of online mirror descent. In Advances in neural information processing systems, pp. 2645–2653, 2011.
- Warmuth & Jagota (1997) Warmuth, M. K. and Jagota, A. K. Continuous and discrete-time nonlinear gradient descent: Relative loss bounds and convergence. International Symposium on Artificial Intelligence and Mathematics, 1997.