Hopfield Neural Network Flow: A Geometric Viewpoint

We provide gradient flow interpretations for the continuous-time continuous-state Hopfield neural network (HNN). The ordinary and stochastic differential equations associated with the HNN were introduced in the literature as analog optimizers, and were reported to exhibit good performance in numerical experiments. In this work, we point out that the deterministic HNN can be transcribed into Amari's natural gradient descent, and thereby uncover the explicit relation between the underlying Riemannian metric and the activation functions. By exploiting an equivalence between the natural gradient descent and the mirror descent, we show how the choice of activation function governs the geometry of the HNN dynamics. For the stochastic HNN, we show that the so-called “diffusion machine", while not a gradient flow itself, induces a gradient flow when lifted in the space of probability measures. We characterize this infinite dimensional flow as the gradient descent of certain free energy with respect to a Wasserstein metric that depends on the geodesic distance on the ground manifold. Furthermore, we demonstrate how this gradient flow interpretation can be used for fast computation via recently developed proximal algorithms.



There are no comments yet.


page 1

page 8


Natural gradient in Wasserstein statistical manifold

We study the Wasserstein natural gradient in parametric statistical mode...

Gradient Flow Algorithms for Density Propagation in Stochastic Systems

We develop a new computational framework to solve the partial differenti...

The AdaBoost Flow

We introduce a dynamical system which we call the AdaBoost flow. The flo...

Mirrorless Mirror Descent: A More Natural Discretization of Riemannian Gradient Flow

We present a direct (primal only) derivation of Mirror Descent as a "par...

Thoughts on the Consistency between Ricci Flow and Neural Network Behavior

The Ricci flow is a partial differential equation for evolving the metri...

Information-geometry of physics-informed statistical manifolds and its use in data assimilation

The data-aware method of distributions (DA-MD) is a low-dimension data a...

Neural Networks Optimally Compress the Sawbridge

Neural-network-based compressors have proven to be remarkably effective ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Computational models such as the Hopfield neural networks (HNN) [1, 2] have long been motivated as analog machines to solve optimization problems. Both deterministic and stochastic differential equations for the HNN have appeared in the literature, and the numerical experiments using these models have reported appealing performance [3, 4, 5, 6, 7]. In [8]

, introducing the stochastic HNN as the “diffusion machine”, Wong compared the same with other stochastic optimization algorithms, and commented: “As such it is closely related to both the Boltzmann machine and the Langevin algorithm, but superior to both in terms of possible integrated circuit realization”, and that “A substantial speed advantage for diffusion machines is likely”. Despite these promising features often backed up via numerical simulations, studies seeking basic understanding of the evolution equations remain sparse

[9, 10]. The purpose of this paper is to revisit the differential equations for the HNN from a geometric standpoint.

Our main contribution is to clarify the interplay between the Riemannian geometry induced by the choice of HNN activation functions, and the associated variational interpretations for the HNN flow. Specifically, in the deterministic case, we argue that the HNN flow is natural gradient descent (Section II) of a function over a manifold

with respect to (w.r.t.) a distance induced by certain metric tensor

that depends on the activation functions of the HNN. This identification leads to an equivalent mirror descent interpretation (Section III). An example is worked out (Section IV) to elucidate these ideas.

In the stochastic HNN, the random fluctuations resulting from the noise do not allow the sample paths of the stochastic differential equation to be interpreted as gradient flow on . However, we argue (Section V) that the stochastic sample path evolution engenders a (deterministic) flow of probability measures supported on which does have an infinite dimensional gradient descent interpretation. In other words, the gradient flow interpretation for the stochastic HNN holds in the macroscopic or ensemble sense. This viewpoint seems novel in the context of stochastic HNN, and indeed leads to proximal algorithms enabling fast computation (Section V-C). A comparative summary of the gradient flow interpretations for the deterministic and the stochastic HNN is provided in Table I (after Section V-C).


The notation stands for the Euclidean gradient operator. We sometimes put a subscript as in

to denote the gradient w.r.t. vector

, and omit the subscript when its meaning is obvious from the context. As usual, acting on vector-valued function returns the Jacobian. The symbol is used for the Hessian. We use the superscript to denote matrix transposition, and

to denote the expectation operator w.r.t. the probability density function (PDF)

, i.e., . In the development that follows, we assume that all probability measures are absolutely continuous, i.e., the corresponding PDFs exist. The notation stands for the set of all joint PDFs supported on

with finite second raw moments, i.e.,

. The inequality means that the matrix is symmetric positive definite. The symbol

denotes an identity matrix of appropriate dimension. For

, the symbol denotes the tangent space of at ; we use to denote the tangent space at a generic point in . The set of natural numbers is denoted as . We use the symbols and for denoting element-wise (i.e., Hadamard) product and division, respectively.

Ii HNN Flow as Natural Gradient Descent

We consider the deterministic HNN dynamics given by


with initial condition . Here, the function is continuously differentiable, and therefore admits a minimum on . Without loss of generality, we assume to be non-negative on . The vectors are referred to as the state and hidden state, respectively. The so called “activation function” is assumed to satisfy the following properties:


differentiable almost everywhere,

has structure


i.e., its -th component depends only on the -th component of the argument,

for all are strictly increasing.

Examples of activation function include the logistic function, hyperbolic tangent function, among others.

To view the continuous-time dynamics (1) from a geometric standpoint, we start by rewriting it as


In words, the flow of is generated by a vector field that is negative of the Jacobian of evaluated at , times the gradient of w.r.t. . Due to (2), the Jacobian in (3) is a diagonal matrix. Furthermore, since each is strictly increasing and differentiable almost everywhere, the diagonal terms are positive. Therefore, the right-hand-side (RHS) of (3) has the following form: negative of a positive definite matrix times the gradient of . This leads to the result below.

Theorem 1.

The flow governed by (3) (equivalently by (1)) is a natural gradient descent for function on the manifold with metric tensor , given by


where , and denotes derivative.


Amari’s natural gradient descent [11] describes the steepest descent of on a Riemannian manifold with metric tensor , and is given by the recursion


where the discrete iteration index , and the step-size is small. For , we get the natural gradient dynamics


Notice that due to the assumptions listed on the activation function , its Jacobian appearing in (3) is a diagonal matrix with positive entries. Therefore, (3) is of the form (6) with , . In particular, for all . This completes the proof.

Theorem 1 allows us to interpret the flow generated by (3) as the steepest descent of in a Riemannian manifold with length element given by


i.e., an orthogonal coordinate system can be associated with . Further insights can be obtained by interpreting as a Hessian manifold [12]. To this end, we would like to view the positive definite metric tensor as the Hessian of a (twice differentiable) strictly convex function. Such an identification will allow us to associate a mirror descent [13, 14] in the dual manifold [15] corresponding to the natural gradient descent in manifold . Before delving into mirror descent, we next point out that (7) allows us to compute geodesics on the manifold , which completes the natural gradient descent interpretation.

Ii-a Geodesics

The (minimal) geodesic curve parameterized via that connects , solves the well-known Euler-Lagrange equation


subject to the boundary conditions , . Here denote the Cristoffel symbols of the second kind for , and are given by


wherein denotes the -th element of the inverse metric tensor . For , and diagonal , we get


For our diagonal metric tensor (4), since the -th diagonal element of only depends on , the terms (10a)–(10c) are all zero, i.e., the only non-zero Cristoffel symbols in (10) are (10d). This simplifies (8) as a set of decoupled ODEs:


Using (4) and (10d), for , we can rewrite (11) as


which solved together with the endpoint conditions , , yields the geodesic curve . In Section IV, we will see an explicitly solvable example for the system of nonlinear ODEs (12).

Once the geodesic is obtained from (12), the geodesic distance induced by the metric tensor , can be computed as


This allows us to formally conclude that the flow generated by (3), or equivalently by (6), can be seen as the gradient descent of on the manifold , measured w.r.t. the distance given by (13b).

Remark 1.

In view of the notation , we can define the Riemannian gradient , and rewrite the dynamics (6) as .

Iii HNN Flow as Mirror Descent

We now provide an alternative way to interpret the flow generated by (3) as mirror descent on a Hessian manifold that is dual to . For and , consider the mapping given by the map for some strictly convex and differentiable map , i.e.,

The map is referred to as the “mirror map”, formally defined below. In the following, the notation stands for the domain of .

Definition 1.

(Mirror map) A differentiable, strictly convex function , on an open convex set , satisfying as the argument of approaches the boundary of the closure of , is called a mirror map.

Given a mirror map, one can introduce a Bregman divergence [16] associated with it as follows.

Definition 2.

(Bregman divergence associated with a mirror map) For a mirror map , the associated Bregman divergence , is given by


Geometrically, is the error at due to first order Taylor approximation of about . It is easy to verify that iff . However, the Bregman divergence is not symmetric in general, i.e., .

In the sequel, we assume that the mirror map is twice differentiable. Then, being strictly convex, its Hessian is positive definite. This allows one to think of as a Hessian manifold with metric tensor . The mirror descent for function on a convex set , is a recursion of the form


where is the step-size, and the Bregman projection


For background on mirror descent, we refer the readers to [13, 14]. For a recent reference, see [17, Ch. 4].

Raskutti and Mukherjee [15] pointed out that if is twice differentiable, then the map given by , establishes a one-to-one correspondence between the natural gradient descent (5) and the mirror descent (15). Specifically, the mirror descent (15) on associated with the mirror map , is equivalent to the natural gradient descent (5) on with metric tensor , where


is the Legendre-Fenchel conjugate [18, Section 12] of . In our context, the Hessian is given by (4). To proceed further, the following Lemma will be useful.

Lemma 1.

(Duality) Consider a mirror map and its associated Bregman divergence as in Definitions 1 and 2. It holds that


See for example, [21, Section 2.2].

Combining (4) with (18a), we get


for , where the notation means that the function does not depend on the -th component of the vector . Therefore, the map associated with the HNN flow is of the form (19).

To derive the mirror descent (15) associated with the HNN flow, it remains to determine . Thanks to (18b), we can do so without explicitly computing . Specifically, integrating (19) yields , i.e.,


where is defined in (19), and are constants. The constants define a parametrized family of maps . Combining (14) and (20) yields , and hence (due to (18b)). In Section IV, we illustrate these ideas on a concrete example.

Remark 2.

We note that one may rewrite (1) in the form of a mirror descent ODE [19, Section 2.1], given by


for some to-be-determined mirror map , thereby interpreting (1) as mirror descent in itself, with as the dual variable, and as the primal variable. This interpretation, however, is contingent on the assumption that the monotone vector function is expressible as the gradient of a convex function (here, ), for which the necessary and sufficient condition is that the map be maximally cyclically monotone [20, Section 2, Corollary 1 and 2]. This indeed holds under the structural assumption (2).

Fig. 1: For , the activation function associated with the soft-projection operator (22).

Iv An Example

Consider the activation function given by the soft-projection operator


see Fig. 1. We next illustrate the natural gradient descent and the mirror descent interpretations of (3) with activation function (22).

Iv-a Natural Gradient Descent Interpretation

Direct calculations yield


for all . Notice that implies , and hence the RHS of (23) is positive. We have


which allows us to deduce the following: the HNN flow with activation function (22) is the natural gradient descent (5) with diagonal metric tensor given by (24), i.e., , .

In this case, (12) reduces to the nonlinear ODE


which solved with the boundary conditions , , , determines the geodesic curve connecting . Introducing the change-of-variable , noting that , and enforcing , we can transcribe (25) into the separable ODE


which gives


and consequently,


where are constants of integration. Using the boundary conditions , in (28) yields the geodesic curve as


where , and . From (29), it is easy to verify that is component-wise in , and satisfies the boundary conditions , . From (13b) and (29), the geodesic distance associated with the activation function (22) is


where all vector operands such as square-root, arcsin, division (denoted by ), are element-wise. The contour plots in Fig. 2 show the geodesic balls of given by (30), centered at . We summarize: the HNN flow generated by (3) with activation function (22) is natural gradient descent of measured w.r.t. the geodesic distance (30).

Fig. 2: For , shown above are the contour plots for the geodesic balls of given by (30) with for , centered at .

Iv-B Mirror Descent Interpretation

To derive a mirror descent interpretation, integrating (24) we obtain


where , and as before, means that the function does not depend on the -th component of the vector . Equation (31) implies that


where are constants. Therefore, (31) can be re-written as


Comparing the RHS of (33) with (19), for this specific example, we have that


Iv-B1 The case

Let us consider a special case of (32), namely , that is particularly insightful. In this case,

i.e., a separable (weighted) sum of the bit entropy (see [21, Table 1] and [22, Table 1]). The associated Bregman divergence


can be interpreted as the weighted sum of logistic loss functions associated with the Bernoulli random variables. Furthermore,



Consequently, using (18b) and (36), we get


which is the dual Logistic loss [21, Table 1], and the corresponding mirror map can be identified as


Thus, an alternative interpretation of the HNN dynamics with activation function (22) is as follows: it can be seen as mirror descent of the form (15) on in variables given by (33); a candidate mirror map is given by (38) with associated Bregman divergence (37).

V Stochastic HNN Flow As Gradient Descent In the Space of Probability Measures

We next consider the stochastic HNN flow, also known as the “diffusion machine” [8, Section 4], given by the Itô stochastic differential equation (SDE)


where the state***We make the standard assumption that at . If the activation functions do not satisfy this, then a reflecting boundary is needed at each , to keep the sample paths within the unit hypercube; see e.g., [23]. , the standard Wiener process , the notation stands for the vector of ones, and the parameter denotes the thermodynamic/annealing temperature. Recall that the metric tensor given by (4) is diagonal, and hence its inverse and square roots are diagonal too with respective diagonal elements being the inverse and square roots of the original diagonal elements.

The diffusion machine (39) was proposed in [8, 24] to solve global optimization problems on the unit hypercube, and is a stochastic version of the deterministic HNN flow discussed in Sections IIIV. This can be seen as a generalization of the Langevin model [25, 26] for the Boltzmann machine [27]

. Alternatively, one can think of the diffusion machine as the continuous-state continuous-time version of the HNN in which Gaussian white noise is injected at each node.

Let be the Fokker-Planck-Kolmogorov (FPK) operator [28] associated with (39) governing the evolution of the joint PDFs , i.e.,


where the given initial joint PDF . The drift and diffusion coefficients in (39) are tailored so that admits the stationary joint PDF


i.e., a Gibbs PDF where is a normalizing constant (known as the “partition function”) to ensure . This follows from noting that for (39), we haveWe remind the readers that the raised indices denote the elements of the inverse of the metric tensor, i.e., .


and that for each . From (41), the critical points of coincides with that of , which is what allows to interpret (39) as an analog machine for globally optimizing the function . In fact, for the FPK operator (42), the solution of (40) enjoys an exponential rate of convergence [29, p. 1358-1359] to (41). We remark here that the idea of using SDEs for solving global optimization problems have also appeared in [30, 31].

In the context of diffusion machine, the key observation that part of the drift term can be canceled by part of the diffusion term in the FPK operator (42a), thereby guaranteeing that the stationary PDF associated with (42) is (41), was first pointed out by Wong [8]. While this renders the stationary solution of the FPK operator meaningful for globally minimizing , it is not known if the transient PDFs generated by (40) admit a natural interpretation. We next argue that (40) is indeed a gradient flow in the space of probability measures supported on the manifold .

V-a Wasserstein Gradient Flow

Using (42b), we rewrite the FPK PDE (40) as




For , consider the free energy functional given by


which is a sum of the potential energy , and the internal energy . The free energy (45) serves as the Lyapunov functional for (43) since direct calculation yields (Appendix -A)


along any solution generated by (43), thanks to for all . In particular, the RHS of (46) equals zero iff , i.e., at the stationary PDF (41). For , the RHS of (46) is for any transient PDF solving (43).

To view (43) as gradient flow over , we will need the notion of (quadratic) Wasserstein metric between two probability measures and on , defined as


where denotes a joint probability measure supported on , and the geodesic distance is given by (13). The infimum in (47) is taken over the set , which we define as the set of all joint probability measures having finite second moment that are supported on , with prescribed -marginal , and prescribed -marginal . If and have respective PDFs and , then we can use the notation in lieu of . The subscript in (47) is indicative of its dependence on the ground Riemannian metric , and generalizes the Euclidean notion of Wasserstein metric [32], i.e., case. See also [33, Section 3] and [34]. Following [32, Ch. 7] and using the positive definiteness of , it is easy to verify that (47) indeed defines a metric on , i.e., is non-negative, zero iff , symmetric in its arguments, and satisfies the triangle inequality.

The square of (47) is referred to as the “optimal transport cost” (see [35] for the Euclidean case) that quantifies the minimum amount of work needed to reshape the PDF to (or equivalently, to ). This can be formalized via the following dynamic variational formula [36, Corollary 2.5] for (47):


The infimum above is taken over the PDF-vector field pairs , where . Recognizing that (48a) is the continuity/Liouville equation for the single integrator dynamics , one can interpret (48) as a fixed horizon “minimum weighted energy” stochastic control problem subject to endpoint PDF constraints (48b)-(48c).

Following [37] and [36, Section 1.2], we can endow with the metric , resulting in an infinite dimensional Riemannian structure. Specifically, the tangent space can be equipped with the inner product given by


for tangent vectors . Given a smooth curve in , the associated tangent vector is characterized as the vector field that solves (48). We refer the readers to [38, Ch. 13] for further details. In particular, the “Wasserstein gradient” [39, Ch. 8] of a functional at , can be defined as


where denotes the functional derivative of . Consequently, the FPK PDE (43) can be transcribed into the familiar gradient flow form:


Furthermore, the Euler (i.e., first order in time) discretization of the LHS of (51) yields the familiar gradient descent form:


where for . The functional recursion (52) evolves on the metric space . We will see next that (52) can be recast as an equivalent proximal recursion, which will prove helpful for computation.

Remark 3.

We clarify here that the FPK operator appearing in the RHS of (43), or equivalently in the RHS of (51), is not quite same as the Laplace-Beltrami operator [40]


This is because the Itô SDE (39) was hand-crafted in [8] such that the associated stationary PDF becomes (41) w.r.t. the volume measure , which is desired from an optimization perspective since then, the local minima of would coincide with the location of the modes of . In contrast, (53) admits stationary PDF w.r.t. the volume measure (up to a normalization constant), and in general, would not correspond to the local minima of .

V-B Proximal Recursion on

A consequence of identifying (43) as gradient flow in w.r.t. the metric is that the solution of (51) can be approximated [36] via variational recursion


with the initial PDF , and small step-size . In particular, in strong sense as . That a discrete time-stepping procedure like (54) can approximate the solution of FPK PDE, was first proved in [41] for , and has since become a topic of burgeoning research; see e.g., [39, 42, 43].

The RHS in (54) is an infinite dimensional version of the Moreau-Yosida proximal operator [44, 45, 46], denoted as , i.e., (54) can be written succinctly as


and (41) is the fixed point of this recursion. Just like the proximal viewpoint of finite dimensional gradient descent, (54) can be taken as an alternative definition of gradient descent of on the manifold w.r.t. the metric ; see e.g., [47, 48, 49, 50].

The utilitarian value of (54) over (52) is as follows. While the algebraic recursion (52) involves first order differential operator (w.r.t. ), the variational recursion (54) is zero-th order. In the optimization community, the proximal operator is known [51] to be amenable for large scale computation. In our context too, we will see in Section V-C that (52) allows scattered weighted point cloud-based computation avoiding function approximation or spatial discretization, which will otherwise be impossible had we pursued a direct numerical method for the algebraic recursion (52).

Notice from (45) that is strictly convex in . Furthermore, the map is convex in for any . Therefore, (55) is a strictly convex functional recursion wherein each step admits unique minimizer.

(a) The function given by (59) over .
(b) The PDF given by (41) with as in (59), and .
Fig. 3: (a) The scaled Himmelblau function given by (59) has four local minima in . (b) The corresponding stationary PDF has four modes, i.e., local maxima, positioned at the argmins of . The proximal recursion (54) for the Example in Section V-C1 converges to this stationary PDF , as shown via the weighted point cloud evolution in Fig. 4.

V-B1 Interpretation for small

In the global optimization context, the parameter in the free energy (45) plays an important role. For , the SDE (39) reduces to the ODE (6), and (45) simplifies to with being the time-varying Dirac PDF along the solution of (6). In that case,