Computational models such as the Hopfield neural networks (HNN) [1, 2] have long been motivated as analog machines to solve optimization problems. Both deterministic and stochastic differential equations for the HNN have appeared in the literature, and the numerical experiments using these models have reported appealing performance [3, 4, 5, 6, 7]. In 
, introducing the stochastic HNN as the “diffusion machine”, Wong compared the same with other stochastic optimization algorithms, and commented: “As such it is closely related to both the Boltzmann machine and the Langevin algorithm, but superior to both in terms of possible integrated circuit realization”, and that “A substantial speed advantage for diffusion machines is likely”. Despite these promising features often backed up via numerical simulations, studies seeking basic understanding of the evolution equations remain sparse[9, 10]. The purpose of this paper is to revisit the differential equations for the HNN from a geometric standpoint.
Our main contribution is to clarify the interplay between the Riemannian geometry induced by the choice of HNN activation functions, and the associated variational interpretations for the HNN flow. Specifically, in the deterministic case, we argue that the HNN flow is natural gradient descent (Section II) of a function over a manifold
with respect to (w.r.t.) a distance induced by certain metric tensorthat depends on the activation functions of the HNN. This identification leads to an equivalent mirror descent interpretation (Section III). An example is worked out (Section IV) to elucidate these ideas.
In the stochastic HNN, the random fluctuations resulting from the noise do not allow the sample paths of the stochastic differential equation to be interpreted as gradient flow on . However, we argue (Section V) that the stochastic sample path evolution engenders a (deterministic) flow of probability measures supported on which does have an infinite dimensional gradient descent interpretation. In other words, the gradient flow interpretation for the stochastic HNN holds in the macroscopic or ensemble sense. This viewpoint seems novel in the context of stochastic HNN, and indeed leads to proximal algorithms enabling fast computation (Section V-C). A comparative summary of the gradient flow interpretations for the deterministic and the stochastic HNN is provided in Table I (after Section V-C).
The notation stands for the Euclidean gradient operator. We sometimes put a subscript as in
to denote the gradient w.r.t. vector, and omit the subscript when its meaning is obvious from the context. As usual, acting on vector-valued function returns the Jacobian. The symbol is used for the Hessian. We use the superscript to denote matrix transposition, and
to denote the expectation operator w.r.t. the probability density function (PDF), i.e., . In the development that follows, we assume that all probability measures are absolutely continuous, i.e., the corresponding PDFs exist. The notation stands for the set of all joint PDFs supported on
with finite second raw moments, i.e.,. The inequality means that the matrix is symmetric positive definite. The symbol
denotes an identity matrix of appropriate dimension. For, the symbol denotes the tangent space of at ; we use to denote the tangent space at a generic point in . The set of natural numbers is denoted as . We use the symbols and for denoting element-wise (i.e., Hadamard) product and division, respectively.
Ii HNN Flow as Natural Gradient Descent
We consider the deterministic HNN dynamics given by
with initial condition . Here, the function is continuously differentiable, and therefore admits a minimum on . Without loss of generality, we assume to be non-negative on . The vectors are referred to as the state and hidden state, respectively. The so called “activation function” is assumed to satisfy the following properties:
differentiable almost everywhere,
i.e., its -th component depends only on the -th component of the argument,
for all are strictly increasing.
Examples of activation function include the logistic function, hyperbolic tangent function, among others.
To view the continuous-time dynamics (1) from a geometric standpoint, we start by rewriting it as
In words, the flow of is generated by a vector field that is negative of the Jacobian of evaluated at , times the gradient of w.r.t. . Due to (2), the Jacobian in (3) is a diagonal matrix. Furthermore, since each is strictly increasing and differentiable almost everywhere, the diagonal terms are positive. Therefore, the right-hand-side (RHS) of (3) has the following form: negative of a positive definite matrix times the gradient of . This leads to the result below.
Amari’s natural gradient descent  describes the steepest descent of on a Riemannian manifold with metric tensor , and is given by the recursion
where the discrete iteration index , and the step-size is small. For , we get the natural gradient dynamics
Notice that due to the assumptions listed on the activation function , its Jacobian appearing in (3) is a diagonal matrix with positive entries. Therefore, (3) is of the form (6) with , . In particular, for all . This completes the proof.
i.e., an orthogonal coordinate system can be associated with . Further insights can be obtained by interpreting as a Hessian manifold . To this end, we would like to view the positive definite metric tensor as the Hessian of a (twice differentiable) strictly convex function. Such an identification will allow us to associate a mirror descent [13, 14] in the dual manifold  corresponding to the natural gradient descent in manifold . Before delving into mirror descent, we next point out that (7) allows us to compute geodesics on the manifold , which completes the natural gradient descent interpretation.
The (minimal) geodesic curve parameterized via that connects , solves the well-known Euler-Lagrange equation
subject to the boundary conditions , . Here denote the Cristoffel symbols of the second kind for , and are given by
wherein denotes the -th element of the inverse metric tensor . For , and diagonal , we get
For our diagonal metric tensor (4), since the -th diagonal element of only depends on , the terms (10a)–(10c) are all zero, i.e., the only non-zero Cristoffel symbols in (10) are (10d). This simplifies (8) as a set of decoupled ODEs:
Once the geodesic is obtained from (12), the geodesic distance induced by the metric tensor , can be computed as
In view of the notation , we can define the Riemannian gradient , and rewrite the dynamics (6) as .
Iii HNN Flow as Mirror Descent
We now provide an alternative way to interpret the flow generated by (3) as mirror descent on a Hessian manifold that is dual to . For and , consider the mapping given by the map for some strictly convex and differentiable map , i.e.,
The map is referred to as the “mirror map”, formally defined below. In the following, the notation stands for the domain of .
(Mirror map) A differentiable, strictly convex function , on an open convex set , satisfying as the argument of approaches the boundary of the closure of , is called a mirror map.
Given a mirror map, one can introduce a Bregman divergence  associated with it as follows.
(Bregman divergence associated with a mirror map) For a mirror map , the associated Bregman divergence , is given by
Geometrically, is the error at due to first order Taylor approximation of about . It is easy to verify that iff . However, the Bregman divergence is not symmetric in general, i.e., .
In the sequel, we assume that the mirror map is twice differentiable. Then, being strictly convex, its Hessian is positive definite. This allows one to think of as a Hessian manifold with metric tensor . The mirror descent for function on a convex set , is a recursion of the form
where is the step-size, and the Bregman projection
Raskutti and Mukherjee  pointed out that if is twice differentiable, then the map given by , establishes a one-to-one correspondence between the natural gradient descent (5) and the mirror descent (15). Specifically, the mirror descent (15) on associated with the mirror map , is equivalent to the natural gradient descent (5) on with metric tensor , where
See for example, [21, Section 2.2].
for , where the notation means that the function does not depend on the -th component of the vector . Therefore, the map associated with the HNN flow is of the form (19).
where is defined in (19), and are constants. The constants define a parametrized family of maps . Combining (14) and (20) yields , and hence (due to (18b)). In Section IV, we illustrate these ideas on a concrete example.
for some to-be-determined mirror map , thereby interpreting (1) as mirror descent in itself, with as the dual variable, and as the primal variable. This interpretation, however, is contingent on the assumption that the monotone vector function is expressible as the gradient of a convex function (here, ), for which the necessary and sufficient condition is that the map be maximally cyclically monotone [20, Section 2, Corollary 1 and 2]. This indeed holds under the structural assumption (2).
Iv An Example
Consider the activation function given by the soft-projection operator
Iv-a Natural Gradient Descent Interpretation
Direct calculations yield
for all . Notice that implies , and hence the RHS of (23) is positive. We have
In this case, (12) reduces to the nonlinear ODE
which solved with the boundary conditions , , , determines the geodesic curve connecting . Introducing the change-of-variable , noting that , and enforcing , we can transcribe (25) into the separable ODE
where are constants of integration. Using the boundary conditions , in (28) yields the geodesic curve as
where , and . From (29), it is easy to verify that is component-wise in , and satisfies the boundary conditions , . From (13b) and (29), the geodesic distance associated with the activation function (22) is
where all vector operands such as square-root, arcsin, division (denoted by ), are element-wise. The contour plots in Fig. 2 show the geodesic balls of given by (30), centered at . We summarize: the HNN flow generated by (3) with activation function (22) is natural gradient descent of measured w.r.t. the geodesic distance (30).
Iv-B Mirror Descent Interpretation
To derive a mirror descent interpretation, integrating (24) we obtain
where , and as before, means that the function does not depend on the -th component of the vector . Equation (31) implies that
where are constants. Therefore, (31) can be re-written as
Iv-B1 The case
Let us consider a special case of (32), namely , that is particularly insightful. In this case,
which is the dual Logistic loss [21, Table 1], and the corresponding mirror map can be identified as
Thus, an alternative interpretation of the HNN dynamics with activation function (22) is as follows: it can be seen as mirror descent of the form (15) on in variables given by (33); a candidate mirror map is given by (38) with associated Bregman divergence (37).
V Stochastic HNN Flow As Gradient Descent In the Space of Probability Measures
We next consider the stochastic HNN flow, also known as the “diffusion machine” [8, Section 4], given by the Itô stochastic differential equation (SDE)
where the state***We make the standard assumption that at . If the activation functions do not satisfy this, then a reflecting boundary is needed at each , to keep the sample paths within the unit hypercube; see e.g., . , the standard Wiener process , the notation stands for the vector of ones, and the parameter denotes the thermodynamic/annealing temperature. Recall that the metric tensor given by (4) is diagonal, and hence its inverse and square roots are diagonal too with respective diagonal elements being the inverse and square roots of the original diagonal elements.
The diffusion machine (39) was proposed in [8, 24] to solve global optimization problems on the unit hypercube, and is a stochastic version of the deterministic HNN flow discussed in Sections II–IV. This can be seen as a generalization of the Langevin model [25, 26] for the Boltzmann machine 
. Alternatively, one can think of the diffusion machine as the continuous-state continuous-time version of the HNN in which Gaussian white noise is injected at each node.
where the given initial joint PDF . The drift and diffusion coefficients in (39) are tailored so that admits the stationary joint PDF
i.e., a Gibbs PDF where is a normalizing constant (known as the “partition function”) to ensure . This follows from noting that for (39), we have†††We remind the readers that the raised indices denote the elements of the inverse of the metric tensor, i.e., .
and that for each . From (41), the critical points of coincides with that of , which is what allows to interpret (39) as an analog machine for globally optimizing the function . In fact, for the FPK operator (42), the solution of (40) enjoys an exponential rate of convergence [29, p. 1358-1359] to (41). We remark here that the idea of using SDEs for solving global optimization problems have also appeared in [30, 31].
In the context of diffusion machine, the key observation that part of the drift term can be canceled by part of the diffusion term in the FPK operator (42a), thereby guaranteeing that the stationary PDF associated with (42) is (41), was first pointed out by Wong . While this renders the stationary solution of the FPK operator meaningful for globally minimizing , it is not known if the transient PDFs generated by (40) admit a natural interpretation. We next argue that (40) is indeed a gradient flow in the space of probability measures supported on the manifold .
V-a Wasserstein Gradient Flow
For , consider the free energy functional given by
along any solution generated by (43), thanks to for all . In particular, the RHS of (46) equals zero iff , i.e., at the stationary PDF (41). For , the RHS of (46) is for any transient PDF solving (43).
To view (43) as gradient flow over , we will need the notion of (quadratic) Wasserstein metric between two probability measures and on , defined as
where denotes a joint probability measure supported on , and the geodesic distance is given by (13). The infimum in (47) is taken over the set , which we define as the set of all joint probability measures having finite second moment that are supported on , with prescribed -marginal , and prescribed -marginal . If and have respective PDFs and , then we can use the notation in lieu of . The subscript in (47) is indicative of its dependence on the ground Riemannian metric , and generalizes the Euclidean notion of Wasserstein metric , i.e., case. See also [33, Section 3] and . Following [32, Ch. 7] and using the positive definiteness of , it is easy to verify that (47) indeed defines a metric on , i.e., is non-negative, zero iff , symmetric in its arguments, and satisfies the triangle inequality.
The square of (47) is referred to as the “optimal transport cost” (see  for the Euclidean case) that quantifies the minimum amount of work needed to reshape the PDF to (or equivalently, to ). This can be formalized via the following dynamic variational formula [36, Corollary 2.5] for (47):
The infimum above is taken over the PDF-vector field pairs , where . Recognizing that (48a) is the continuity/Liouville equation for the single integrator dynamics , one can interpret (48) as a fixed horizon “minimum weighted energy” stochastic control problem subject to endpoint PDF constraints (48b)-(48c).
Following  and [36, Section 1.2], we can endow with the metric , resulting in an infinite dimensional Riemannian structure. Specifically, the tangent space can be equipped with the inner product given by
for tangent vectors . Given a smooth curve in , the associated tangent vector is characterized as the vector field that solves (48). We refer the readers to [38, Ch. 13] for further details. In particular, the “Wasserstein gradient” [39, Ch. 8] of a functional at , can be defined as
where denotes the functional derivative of . Consequently, the FPK PDE (43) can be transcribed into the familiar gradient flow form:
Furthermore, the Euler (i.e., first order in time) discretization of the LHS of (51) yields the familiar gradient descent form:
This is because the Itô SDE (39) was hand-crafted in  such that the associated stationary PDF becomes (41) w.r.t. the volume measure , which is desired from an optimization perspective since then, the local minima of would coincide with the location of the modes of . In contrast, (53) admits stationary PDF w.r.t. the volume measure (up to a normalization constant), and in general, would not correspond to the local minima of .
V-B Proximal Recursion on
with the initial PDF , and small step-size . In particular, in strong sense as . That a discrete time-stepping procedure like (54) can approximate the solution of FPK PDE, was first proved in  for , and has since become a topic of burgeoning research; see e.g., [39, 42, 43].
and (41) is the fixed point of this recursion. Just like the proximal viewpoint of finite dimensional gradient descent, (54) can be taken as an alternative definition of gradient descent of on the manifold w.r.t. the metric ; see e.g., [47, 48, 49, 50].
The utilitarian value of (54) over (52) is as follows. While the algebraic recursion (52) involves first order differential operator (w.r.t. ), the variational recursion (54) is zero-th order. In the optimization community, the proximal operator is known  to be amenable for large scale computation. In our context too, we will see in Section V-C that (52) allows scattered weighted point cloud-based computation avoiding function approximation or spatial discretization, which will otherwise be impossible had we pursued a direct numerical method for the algebraic recursion (52).