I Introduction
Computational models such as the Hopfield neural networks (HNN) [1, 2] have long been motivated as analog machines to solve optimization problems. Both deterministic and stochastic differential equations for the HNN have appeared in the literature, and the numerical experiments using these models have reported appealing performance [3, 4, 5, 6, 7]. In [8]
, introducing the stochastic HNN as the “diffusion machine”, Wong compared the same with other stochastic optimization algorithms, and commented: “As such it is closely related to both the Boltzmann machine and the Langevin algorithm, but superior to both in terms of possible integrated circuit realization”, and that “A substantial speed advantage for diffusion machines is likely”. Despite these promising features often backed up via numerical simulations, studies seeking basic understanding of the evolution equations remain sparse
[9, 10]. The purpose of this paper is to revisit the differential equations for the HNN from a geometric standpoint.Our main contribution is to clarify the interplay between the Riemannian geometry induced by the choice of HNN activation functions, and the associated variational interpretations for the HNN flow. Specifically, in the deterministic case, we argue that the HNN flow is natural gradient descent (Section II) of a function over a manifold
with respect to (w.r.t.) a distance induced by certain metric tensor
that depends on the activation functions of the HNN. This identification leads to an equivalent mirror descent interpretation (Section III). An example is worked out (Section IV) to elucidate these ideas.In the stochastic HNN, the random fluctuations resulting from the noise do not allow the sample paths of the stochastic differential equation to be interpreted as gradient flow on . However, we argue (Section V) that the stochastic sample path evolution engenders a (deterministic) flow of probability measures supported on which does have an infinite dimensional gradient descent interpretation. In other words, the gradient flow interpretation for the stochastic HNN holds in the macroscopic or ensemble sense. This viewpoint seems novel in the context of stochastic HNN, and indeed leads to proximal algorithms enabling fast computation (Section VC). A comparative summary of the gradient flow interpretations for the deterministic and the stochastic HNN is provided in Table I (after Section VC).
Notations
The notation stands for the Euclidean gradient operator. We sometimes put a subscript as in
to denote the gradient w.r.t. vector
, and omit the subscript when its meaning is obvious from the context. As usual, acting on vectorvalued function returns the Jacobian. The symbol is used for the Hessian. We use the superscript to denote matrix transposition, andto denote the expectation operator w.r.t. the probability density function (PDF)
, i.e., . In the development that follows, we assume that all probability measures are absolutely continuous, i.e., the corresponding PDFs exist. The notation stands for the set of all joint PDFs supported onwith finite second raw moments, i.e.,
. The inequality means that the matrix is symmetric positive definite. The symboldenotes an identity matrix of appropriate dimension. For
, the symbol denotes the tangent space of at ; we use to denote the tangent space at a generic point in . The set of natural numbers is denoted as . We use the symbols and for denoting elementwise (i.e., Hadamard) product and division, respectively.Ii HNN Flow as Natural Gradient Descent
We consider the deterministic HNN dynamics given by
(1a)  
(1b) 
with initial condition . Here, the function is continuously differentiable, and therefore admits a minimum on . Without loss of generality, we assume to be nonnegative on . The vectors are referred to as the state and hidden state, respectively. The so called “activation function” is assumed to satisfy the following properties:

homeomorphism,

differentiable almost everywhere,

has structure
(2) i.e., its th component depends only on the th component of the argument,

for all are strictly increasing.
Examples of activation function include the logistic function, hyperbolic tangent function, among others.
To view the continuoustime dynamics (1) from a geometric standpoint, we start by rewriting it as
(3) 
In words, the flow of is generated by a vector field that is negative of the Jacobian of evaluated at , times the gradient of w.r.t. . Due to (2), the Jacobian in (3) is a diagonal matrix. Furthermore, since each is strictly increasing and differentiable almost everywhere, the diagonal terms are positive. Therefore, the righthandside (RHS) of (3) has the following form: negative of a positive definite matrix times the gradient of . This leads to the result below.
Theorem 1.
Proof.
Amari’s natural gradient descent [11] describes the steepest descent of on a Riemannian manifold with metric tensor , and is given by the recursion
(5) 
where the discrete iteration index , and the stepsize is small. For , we get the natural gradient dynamics
(6) 
Notice that due to the assumptions listed on the activation function , its Jacobian appearing in (3) is a diagonal matrix with positive entries. Therefore, (3) is of the form (6) with , . In particular, for all . This completes the proof.
Theorem 1 allows us to interpret the flow generated by (3) as the steepest descent of in a Riemannian manifold with length element given by
(7) 
i.e., an orthogonal coordinate system can be associated with . Further insights can be obtained by interpreting as a Hessian manifold [12]. To this end, we would like to view the positive definite metric tensor as the Hessian of a (twice differentiable) strictly convex function. Such an identification will allow us to associate a mirror descent [13, 14] in the dual manifold [15] corresponding to the natural gradient descent in manifold . Before delving into mirror descent, we next point out that (7) allows us to compute geodesics on the manifold , which completes the natural gradient descent interpretation.
Iia Geodesics
The (minimal) geodesic curve parameterized via that connects , solves the wellknown EulerLagrange equation
(8) 
subject to the boundary conditions , . Here denote the Cristoffel symbols of the second kind for , and are given by
(9) 
wherein denotes the th element of the inverse metric tensor . For , and diagonal , we get
(10a)  
(10b)  
(10c)  
(10d) 
For our diagonal metric tensor (4), since the th diagonal element of only depends on , the terms (10a)–(10c) are all zero, i.e., the only nonzero Cristoffel symbols in (10) are (10d). This simplifies (8) as a set of decoupled ODEs:
(11) 
Using (4) and (10d), for , we can rewrite (11) as
(12) 
which solved together with the endpoint conditions , , yields the geodesic curve . In Section IV, we will see an explicitly solvable example for the system of nonlinear ODEs (12).
Once the geodesic is obtained from (12), the geodesic distance induced by the metric tensor , can be computed as
(13a)  
(13b) 
This allows us to formally conclude that the flow generated by (3), or equivalently by (6), can be seen as the gradient descent of on the manifold , measured w.r.t. the distance given by (13b).
Remark 1.
In view of the notation , we can define the Riemannian gradient , and rewrite the dynamics (6) as .
Iii HNN Flow as Mirror Descent
We now provide an alternative way to interpret the flow generated by (3) as mirror descent on a Hessian manifold that is dual to . For and , consider the mapping given by the map for some strictly convex and differentiable map , i.e.,
The map is referred to as the “mirror map”, formally defined below. In the following, the notation stands for the domain of .
Definition 1.
(Mirror map) A differentiable, strictly convex function , on an open convex set , satisfying as the argument of approaches the boundary of the closure of , is called a mirror map.
Given a mirror map, one can introduce a Bregman divergence [16] associated with it as follows.
Definition 2.
(Bregman divergence associated with a mirror map) For a mirror map , the associated Bregman divergence , is given by
(14) 
Geometrically, is the error at due to first order Taylor approximation of about . It is easy to verify that iff . However, the Bregman divergence is not symmetric in general, i.e., .
In the sequel, we assume that the mirror map is twice differentiable. Then, being strictly convex, its Hessian is positive definite. This allows one to think of as a Hessian manifold with metric tensor . The mirror descent for function on a convex set , is a recursion of the form
(15a)  
(15b) 
where is the stepsize, and the Bregman projection
(16) 
For background on mirror descent, we refer the readers to [13, 14]. For a recent reference, see [17, Ch. 4].
Raskutti and Mukherjee [15] pointed out that if is twice differentiable, then the map given by , establishes a onetoone correspondence between the natural gradient descent (5) and the mirror descent (15). Specifically, the mirror descent (15) on associated with the mirror map , is equivalent to the natural gradient descent (5) on with metric tensor , where
(17) 
is the LegendreFenchel conjugate [18, Section 12] of . In our context, the Hessian is given by (4). To proceed further, the following Lemma will be useful.
Lemma 1.
Proof.
See for example, [21, Section 2.2].
Combining (4) with (18a), we get
(19) 
for , where the notation means that the function does not depend on the th component of the vector . Therefore, the map associated with the HNN flow is of the form (19).
To derive the mirror descent (15) associated with the HNN flow, it remains to determine . Thanks to (18b), we can do so without explicitly computing . Specifically, integrating (19) yields , i.e.,
(20) 
where is defined in (19), and are constants. The constants define a parametrized family of maps . Combining (14) and (20) yields , and hence (due to (18b)). In Section IV, we illustrate these ideas on a concrete example.
Remark 2.
We note that one may rewrite (1) in the form of a mirror descent ODE [19, Section 2.1], given by
(21) 
for some tobedetermined mirror map , thereby interpreting (1) as mirror descent in itself, with as the dual variable, and as the primal variable. This interpretation, however, is contingent on the assumption that the monotone vector function is expressible as the gradient of a convex function (here, ), for which the necessary and sufficient condition is that the map be maximally cyclically monotone [20, Section 2, Corollary 1 and 2]. This indeed holds under the structural assumption (2).
Iv An Example
Consider the activation function given by the softprojection operator
(22) 
see Fig. 1. We next illustrate the natural gradient descent and the mirror descent interpretations of (3) with activation function (22).
Iva Natural Gradient Descent Interpretation
Direct calculations yield
(23) 
for all . Notice that implies , and hence the RHS of (23) is positive. We have
(24) 
which allows us to deduce the following: the HNN flow with activation function (22) is the natural gradient descent (5) with diagonal metric tensor given by (24), i.e., , .
In this case, (12) reduces to the nonlinear ODE
(25) 
which solved with the boundary conditions , , , determines the geodesic curve connecting . Introducing the changeofvariable , noting that , and enforcing , we can transcribe (25) into the separable ODE
(26) 
which gives
(27) 
and consequently,
(28) 
where are constants of integration. Using the boundary conditions , in (28) yields the geodesic curve as
(29) 
where , and . From (29), it is easy to verify that is componentwise in , and satisfies the boundary conditions , . From (13b) and (29), the geodesic distance associated with the activation function (22) is
(30) 
where all vector operands such as squareroot, arcsin, division (denoted by ), are elementwise. The contour plots in Fig. 2 show the geodesic balls of given by (30), centered at . We summarize: the HNN flow generated by (3) with activation function (22) is natural gradient descent of measured w.r.t. the geodesic distance (30).
IvB Mirror Descent Interpretation
To derive a mirror descent interpretation, integrating (24) we obtain
(31) 
where , and as before, means that the function does not depend on the th component of the vector . Equation (31) implies that
(32) 
where are constants. Therefore, (31) can be rewritten as
(33) 
Comparing the RHS of (33) with (19), for this specific example, we have that
(34) 
IvB1 The case
Let us consider a special case of (32), namely , that is particularly insightful. In this case,
i.e., a separable (weighted) sum of the bit entropy (see [21, Table 1] and [22, Table 1]). The associated Bregman divergence
(35) 
can be interpreted as the weighted sum of logistic loss functions associated with the Bernoulli random variables. Furthermore,
yields(36) 
Consequently, using (18b) and (36), we get
(37) 
which is the dual Logistic loss [21, Table 1], and the corresponding mirror map can be identified as
(38) 
Thus, an alternative interpretation of the HNN dynamics with activation function (22) is as follows: it can be seen as mirror descent of the form (15) on in variables given by (33); a candidate mirror map is given by (38) with associated Bregman divergence (37).
V Stochastic HNN Flow As Gradient Descent In the Space of Probability Measures
We next consider the stochastic HNN flow, also known as the “diffusion machine” [8, Section 4], given by the Itô stochastic differential equation (SDE)
(39) 
where the state^{*}^{*}*We make the standard assumption that at . If the activation functions do not satisfy this, then a reflecting boundary is needed at each , to keep the sample paths within the unit hypercube; see e.g., [23]. , the standard Wiener process , the notation stands for the vector of ones, and the parameter denotes the thermodynamic/annealing temperature. Recall that the metric tensor given by (4) is diagonal, and hence its inverse and square roots are diagonal too with respective diagonal elements being the inverse and square roots of the original diagonal elements.
The diffusion machine (39) was proposed in [8, 24] to solve global optimization problems on the unit hypercube, and is a stochastic version of the deterministic HNN flow discussed in Sections II–IV. This can be seen as a generalization of the Langevin model [25, 26] for the Boltzmann machine [27]
. Alternatively, one can think of the diffusion machine as the continuousstate continuoustime version of the HNN in which Gaussian white noise is injected at each node.
Let be the FokkerPlanckKolmogorov (FPK) operator [28] associated with (39) governing the evolution of the joint PDFs , i.e.,
(40) 
where the given initial joint PDF . The drift and diffusion coefficients in (39) are tailored so that admits the stationary joint PDF
(41) 
i.e., a Gibbs PDF where is a normalizing constant (known as the “partition function”) to ensure . This follows from noting that for (39), we have^{†}^{†}†We remind the readers that the raised indices denote the elements of the inverse of the metric tensor, i.e., .
(42a)  
(42b) 
and that for each . From (41), the critical points of coincides with that of , which is what allows to interpret (39) as an analog machine for globally optimizing the function . In fact, for the FPK operator (42), the solution of (40) enjoys an exponential rate of convergence [29, p. 13581359] to (41). We remark here that the idea of using SDEs for solving global optimization problems have also appeared in [30, 31].
In the context of diffusion machine, the key observation that part of the drift term can be canceled by part of the diffusion term in the FPK operator (42a), thereby guaranteeing that the stationary PDF associated with (42) is (41), was first pointed out by Wong [8]. While this renders the stationary solution of the FPK operator meaningful for globally minimizing , it is not known if the transient PDFs generated by (40) admit a natural interpretation. We next argue that (40) is indeed a gradient flow in the space of probability measures supported on the manifold .
Va Wasserstein Gradient Flow
Using (42b), we rewrite the FPK PDE (40) as
(43) 
where
(44) 
For , consider the free energy functional given by
(45) 
which is a sum of the potential energy , and the internal energy . The free energy (45) serves as the Lyapunov functional for (43) since direct calculation yields (Appendix A)
(46) 
along any solution generated by (43), thanks to for all . In particular, the RHS of (46) equals zero iff , i.e., at the stationary PDF (41). For , the RHS of (46) is for any transient PDF solving (43).
To view (43) as gradient flow over , we will need the notion of (quadratic) Wasserstein metric between two probability measures and on , defined as
(47) 
where denotes a joint probability measure supported on , and the geodesic distance is given by (13). The infimum in (47) is taken over the set , which we define as the set of all joint probability measures having finite second moment that are supported on , with prescribed marginal , and prescribed marginal . If and have respective PDFs and , then we can use the notation in lieu of . The subscript in (47) is indicative of its dependence on the ground Riemannian metric , and generalizes the Euclidean notion of Wasserstein metric [32], i.e., case. See also [33, Section 3] and [34]. Following [32, Ch. 7] and using the positive definiteness of , it is easy to verify that (47) indeed defines a metric on , i.e., is nonnegative, zero iff , symmetric in its arguments, and satisfies the triangle inequality.
The square of (47) is referred to as the “optimal transport cost” (see [35] for the Euclidean case) that quantifies the minimum amount of work needed to reshape the PDF to (or equivalently, to ). This can be formalized via the following dynamic variational formula [36, Corollary 2.5] for (47):
(48a)  
(48b)  
(48c) 
The infimum above is taken over the PDFvector field pairs , where . Recognizing that (48a) is the continuity/Liouville equation for the single integrator dynamics , one can interpret (48) as a fixed horizon “minimum weighted energy” stochastic control problem subject to endpoint PDF constraints (48b)(48c).
Following [37] and [36, Section 1.2], we can endow with the metric , resulting in an infinite dimensional Riemannian structure. Specifically, the tangent space can be equipped with the inner product given by
(49) 
for tangent vectors . Given a smooth curve in , the associated tangent vector is characterized as the vector field that solves (48). We refer the readers to [38, Ch. 13] for further details. In particular, the “Wasserstein gradient” [39, Ch. 8] of a functional at , can be defined as
(50) 
where denotes the functional derivative of . Consequently, the FPK PDE (43) can be transcribed into the familiar gradient flow form:
(51) 
Furthermore, the Euler (i.e., first order in time) discretization of the LHS of (51) yields the familiar gradient descent form:
(52) 
where for . The functional recursion (52) evolves on the metric space . We will see next that (52) can be recast as an equivalent proximal recursion, which will prove helpful for computation.
Remark 3.
We clarify here that the FPK operator appearing in the RHS of (43), or equivalently in the RHS of (51), is not quite same as the LaplaceBeltrami operator [40]
(53) 
This is because the Itô SDE (39) was handcrafted in [8] such that the associated stationary PDF becomes (41) w.r.t. the volume measure , which is desired from an optimization perspective since then, the local minima of would coincide with the location of the modes of . In contrast, (53) admits stationary PDF w.r.t. the volume measure (up to a normalization constant), and in general, would not correspond to the local minima of .
VB Proximal Recursion on
A consequence of identifying (43) as gradient flow in w.r.t. the metric is that the solution of (51) can be approximated [36] via variational recursion
(54) 
with the initial PDF , and small stepsize . In particular, in strong sense as . That a discrete timestepping procedure like (54) can approximate the solution of FPK PDE, was first proved in [41] for , and has since become a topic of burgeoning research; see e.g., [39, 42, 43].
The RHS in (54) is an infinite dimensional version of the MoreauYosida proximal operator [44, 45, 46], denoted as , i.e., (54) can be written succinctly as
(55) 
and (41) is the fixed point of this recursion. Just like the proximal viewpoint of finite dimensional gradient descent, (54) can be taken as an alternative definition of gradient descent of on the manifold w.r.t. the metric ; see e.g., [47, 48, 49, 50].
The utilitarian value of (54) over (52) is as follows. While the algebraic recursion (52) involves first order differential operator (w.r.t. ), the variational recursion (54) is zeroth order. In the optimization community, the proximal operator is known [51] to be amenable for large scale computation. In our context too, we will see in Section VC that (52) allows scattered weighted point cloudbased computation avoiding function approximation or spatial discretization, which will otherwise be impossible had we pursued a direct numerical method for the algebraic recursion (52).
Notice from (45) that is strictly convex in . Furthermore, the map is convex in for any . Therefore, (55) is a strictly convex functional recursion wherein each step admits unique minimizer.
Comments
There are no comments yet.