# Discrete symbolic optimization and Boltzmann sampling by continuous neural dynamics: Gradient Symbolic Computation

Gradient Symbolic Computation is proposed as a means of solving discrete global optimization problems using a neurally plausible continuous stochastic dynamical system. Gradient symbolic dynamics involves two free parameters that must be adjusted as a function of time to obtain the global maximizer at the end of the computation. We provide a summary of what is known about the GSC dynamics for special cases of settings of the parameters, and also establish that there is a schedule for the two parameters for which convergence to the correct answer occurs with high probability. These results put the empirical results already obtained for GSC on a sound theoretical footing.

## Authors

• 4 publications
• 18 publications
• 1 publication

Stochastic Gradient Langevin Dynamics (SGLD) ensures strong guarantees w...
06/12/2020 ∙ by Vyacheslav Kungurtsev, et al. ∙ 0

• ### Dimensionally Constrained Symbolic Regression

We describe dimensionally constrained symbolic regression which has been...
06/20/2011 ∙ by Suyong Choi, et al. ∙ 0

• ### Continuous-time integral dynamics for Aggregative Game equilibrium seeking

In this paper, we consider continuous-time semi-decentralized dynamics f...
03/28/2018 ∙ by Claudio De Persis, et al. ∙ 0

• ### Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem

Model-free reinforcement learning attempts to find an optimal control ac...
12/26/2019 ∙ by Hesameddin Mohammadi, et al. ∙ 0

• ### Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization

We present a unified framework to analyze the global convergence of Lang...
07/20/2017 ∙ by Pan Xu, et al. ∙ 0

• ### Hypocoercivity properties of adaptive Langevin dynamics

Adaptive Langevin dynamics is a method for sampling the Boltzmann-Gibbs ...
08/25/2019 ∙ by Benedict Leimkuhler, et al. ∙ 0

• ### Hypercoercivity properties of adaptive Langevin dynamics

Adaptive Langevin dynamics is a method for sampling the Boltzmann-Gibbs ...
08/25/2019 ∙ by Benedict Leimkuhler, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction: Unifying symbolic and neural optimization

### 1.1 The historical and contemporary context of the work

The recent spectacular successes and real-world deployment of neural networks in Artificial Intelligence (AI) systems have placed a premium on understanding the knowledge in such networks and on explaining their behavior

Voosen (2017)

. Such explanations are difficult in part because of the very different formal universes in which neural networks and human understanding live. Usually, the state space of a neural network is taken to be a continuous real vector space. But an explanation of how any system performs a task must, by definition, make contact with the human conceptual system

Keil (1989). Higher cognition deploys (at least approximately) discrete categories and rules — as characterized formally by traditional theories, rooted in symbolic computation, within cognitive science Chomsky (1965); Newell  Simon (1972); Marcus (2001) and AI Nilsson (1980). Unifying neural and symbolic computation promises a path not only to explainable neural networks but also to a new generation of AI systems and cognitive theories that combine the strengths of these two very different formalizations of computation Eliasmith . (2012).

The work presented here contributes to the mathematical foundations of one approach to such neural-symbolic unification Smolensky  Legendre (2006). In this approach, a single processing system can be formally characterized at two levels of description. At a fine-grained, lower level of description, it takes the form of a specially-structured type of neural network; at a coarser-grained, higher level of description, it constitutes a novel type of symbolic computation in which symbols within structures have continuously variable levels of activity. This architecture defines Gradient Symbolic Computation (GSC) Smolensky . (2014); Cho . (2017).

A key aspect of the GSC approach characterizes processing as optimization: given an input, processing constructs an output which maximizes a well-formedness measure called Harmony () Smolensky (1983, 1986). Thus GSC networks instantiate an ‘energy-based approach’ Hopfield (1982); Cohen  Grossberg (1983); Ackley . (1985); LeCun . (2007) (with Harmony corresponding to negative energy).

Crucially, in GSC a discrete subset of the real vector space of network states is isomorphic to a combinatorial space of discrete symbol structures: an explicit mapping ‘embeds’ each symbol structure as a distributed numerical vector Smolensky (1990); Legendre . (1991), as sometimes done in classic and contemporary Deep Neural Network modeling Pollack (1990); Plate (1993); Socher . (2010). Harmony can be computed at both the neural and the symbolic levels Smolensky (2006): the symbolic-level Harmony of a symbol structure equals the neural-level Harmony of the vector that embeds that structure.

At the symbolic level, an important case is when the Harmony of a symbol structure measures the wellformedness of that structure — the extent to which it satisfies the requirements of a grammar: for a given input, the symbol structure with maximal Harmony (typically unique) is the grammar’s output. This is Harmonic Grammar, in its deterministic form Legendre . (1990). In Probabilistic Harmonic Grammar Culbertson . (2013), the maximum-Harmony output is the most likely one, but other symbol structures also have non-zero probabilities of being the output: the probability that the output equals a given symbol structure is an exponentially increasing function of its Harmony .

Note that although we will sometimes refer to Harmonic Grammar, the results in the paper are general, and no assumptions regarding grammar per se are made.

### 1.2 The technical problem

To summarize the preceding discussion:

 (1) The desired output
1. Deterministic problem: global optimization. Output the symbol structure that (globally) maximizes Harmony (typically unique).

2. Probabilistic problem: sampling. Output a symbol structure with probability proportional to .

The randomness parameter , the computational temperature

, determines how concentrated the probability distribution is on the maximum-Harmony structure: the lower

, the greater the concentration. In the limit , the probabilistic problem reduces to the deterministic problem. Indeed, we will see that, when processing a single input, we need to decrease to zero during the computation, leading the network to converge to a stable state. For solving the sampling problem, decreases only to the level desired for the Boltzmann distribution being sampled. (In the computational linguistics literature, Probabilistic Harmonic Grammars are known under the name Maximum-Entropy or Maxent grammars Goldwater  Johnson (2003); Hayes  Wilson (2008), and typically has a single, non-dynamic value.)

The particular problem of interest is:

 (2) The desired computation
1. Optimization Problem. Through a continuous dynamical process in the embedding space (a continuous processing algorithm for the underlying neural network), converge to the vector embedding the symbol structure that maximizes Harmony (presumed unique).

2. Sampling Problem. Through a continuous dynamical process in the embedding space, converge to the vector embedding a symbol structure with probability proportional to , i.e., produce a sample from the Boltzmann distribution defined by the Harmony function .

GSC is a framework for models of how language-users might meet the requirements of grammar use in a neurally plausible way, searching continuously through a real vector space in which the discrete candidate outputs are embedded, rather than jumping directly from one discrete alternative to the next.

Smolensky . (2014) proposed a GSC neural dynamics and conjectured that these dynamics solve the problems (2). The correctness of the conjecture is required to validate the fundamental mode of explanation deployed in GSC, which uses the Harmony of output candidates to reason about their relative probabilities as outputs of the GSC network dynamics. In this paper we establish formal results concerning the correctness of the method, although the dynamics we study differs technically (but not conceptually) from that of Smolensky . (2014). (The new dynamics studied here has been used in cognitive models in Cho  Smolensky (2016); Cho . (2017).)

With the results presented here in place, in the grammatical setting the processing behavior of the underlying neural network can justifiably be formally understood in terms of maximizing symbolic grammatical Harmony, with knowledge of the grammar being realized in the interconnection weights that determine the Harmony of neural states Smolensky (1993). (We note that the problem of learning such weights is not addressed here.)

Thus the work presented below bears on several issues of considerable interest, some quite general, others more specialized:

• opening a path towards explainable neural network processing

• developing the mathematical foundations of an architecture for unifying neural and symbolic computation

• providing theorems concerning global optimization and sampling over discrete symbol structures through continuous neural computation

• validating the grounding in neural computation of a grammatical formalism widely employed in phonological theory.

(Regarding the last point, the grammatical theory directly relevant here, Harmonic Grammar, is itself currently attracting increasing attention of linguists Pater (2009); Potts . (2010). Importantly, Harmonic Grammar also provides the neural grounding of Optimality Theory Prince  Smolensky (1993/2004), which grew out of Harmonic Grammar and can be viewed as a crucial special case of it Prince  Smolensky (1997). Optimality Theory has been deployed by linguists to analyze all levels of linguistic structure, from phonetics to pragmatics ($$\mathtt{http://roa.rutgers.edu}$$); it is a dominant paradigm in the theory of phonology Prince  Smolensky (1993/2004); McCarthy  Prince (1993); McCarthy (2008), which characterizes the complex discrete symbol structures that mentally represent the physical realization of linguistic expressions, motoric and perceptual Chomsky  Halle (1968); Goldsmith (1990). GSC departs from previous Harmonic Grammar and Optimality Theory work in exploiting gradient symbolic representations.)

The innovation implicit in Smolensky . (2014) and explicit in Cho  Smolensky (2016) was to introduce another type of Harmony in addition to the Harmony defining grammatical wellformedness which we would like to optimize or sample from. The new type of Harmony is called quantization Harmony . assigns 0 Harmony to all neural states which are the embedding of a discrete symbolic structure, so adding it to grammatical Harmony does not change which discrete structure has maximal Harmony. But penalizes with negative Harmony neural states that are not near symbolically-interpretable states. The total Harmony is a weighted sum of the grammatical Harmony and quantization Harmony; maximizing this requires finding states that have high grammatical Harmony and have symbolic interpretation (avoiding a penalty from ). The weight assigned to in the total Harmony is called ; it measures the strength, relative to grammatical Harmony, of the requirement for symbolic interpretability (or ‘discreteness’). The quantity turns out to play a central conceptual and formal role in the theory: in order to produce a discretely-interpretable output, increases during the processing of an input. The interplay between the dynamics of change of and the dynamics of neural activation is at the heart of GSC.

Thus the stochastic dynamical equations we study here have two dynamic parameters: the degree of randomness, , and the level of discreteness, . By choosing a schedule for how and are changed over time during computation of an output, the system can be shown to perform global optimization — that is, enter into the global maximum of grammatical Harmony in a finite period of time with high probability. More generally, under other schedules for and , the system can be shown to perform Boltzmann sampling — that is, to terminate near discrete outputs with a probability that is exponential in the Harmony of those outputs.

The plan for the paper goes as follows. Section 2 formally specifies the computational task that needs to be solved in GSC, describing it in terms of maximizing a Harmony function over the states in a discrete ‘grid’, or more generally, producing discrete outputs in accordance with the Boltzmann distribution at some non-zero temperature . We introduce the Harmony function defined on all points in the vector space of neural states and specify a system of stochastic differential equations whose trajectories seek optima of . In Section 3 we state some of the basic properties of the local maxima of the function as . Section 4 establishes several basic mathematical results about the behavior of our stochastic differential equation for various limiting cases of and , establishing formal senses in which the GSC dynamics solves problems (2). We use these results in Section 5 to derive the existence of cooling schedules that with arbitrarily high probability lead the system to be arbitrarily close to the Harmony maximum. In Section 6 we conclude with a discussion of the use of our framework for Gradient Symbolic Computation.

## 2 The formal problem and preview of results

The formal embedding of discrete symbol structures in GSC employs tensor product representations Smolensky (1990). This method starts by choosing a filler/role decomposition of the target set of discrete structures: each particular structure is characterized as a set of filler/role bindings , each binding identifying which symbol from an alphabet fills each structural role , where the roles determine the structural type of instances of . For example, if is the set of strings over alphabet , a natural filler/role decomposition employs roles identifying the element of the string. Then, e.g., ; the filler/role binding encodes that the second element of is . Let and respectively denote the number of distinct fillers (symbols) and roles.

Once a filler/role decomposition has been chosen for , the remaining steps in defining a tensor product representation are to choose a vector-embedding mapping for the set of fillers, and another such mapping for the set of roles, . The tensor product representation is thus determined: it is the vector-embedding mapping

 (3) ψS:S→VS≡VF⊗VR,ψS:S↦∑f/r∈B(S)ψF(f)⊗ψR(r)

Below, the set of discrete structures of interest will be the set of candidate outputs for a Harmonic Grammar: e.g., a set of strings, or trees, or attribute-value structures. We will assume that the structural roles have been defined such that no discrete structure may have more than one symbol in any given role (i.e., position). It is convenient for the ensuing analysis to further assume that in any instance , every role has exactly one filler. This assumption does not entail any real loss of generality as we can always assume there is a ‘null filler’  Ø that fills any role that would otherwise be empty (e.g., the role for the length-2 string ). Furthermore we assume that permits any symbol in to fill any role . This too will not entail any loss of generality in the following analysis as we can always, if necessary, add to the Harmony function terms that assign high penalties to structures in which certain disfavored symbols fill particular roles; then, despite being in the set of candidate outputs , candidates with those disfavored filler/role bindings will be selected with vanishing probability as outputs of the Harmony-maximizing dynamics we will define.

Under these assumptions of convenience we can concisely characterize the set of discrete output candidates to be exactly those in which each role is filled by exactly one symbol. Then under a tensor product representation embedding, the vectors encoding elements of constitute the following set (where is an enumeration of the elements of and an enumeration of ):

 (4)

where here and below we abbreviate the filler vector as and the role vector as ; abbreviates .

We refer to the set as the grid. In the simplest non-trivial case, where and , is the set of four vectors

 f1⊗r1+f1⊗r2,     f1⊗r1+f2⊗r2,     f2⊗r1+f1⊗r2,     f2⊗r1+f2⊗r2.

These vectors can be rewritten as where and . Thus they lie on the vertices of a parallelogram in .

Finally, assume that the filler vectors , and the role vectors , are respectively bases of and . This implies that each of these sets is linearly independent, which is essential, and that they span their respective vector spaces, which is just a convenient assumption. The independence of and implies the independence of the set . Then it follows that is a basis for , where is the product of the number of fillers () and the number of roles (). Henceforth, variables such as etc. will be assumed to range over , while etc. range over . Thus the general vector can be characterized by the coefficients :

 (5) y=∑φρyφρfφ⊗rρ

Correspondingly, a vector on the grid has the form:

 (6) x=∑φρxφρfφ⊗rρ=∑ρfφx(rρ)⊗rρ∈G

Since , for each role there is a single filler which has coefficient ; all other . Algebraically, we can express this statement as:

 (7) x∈G if and only if for all φ,ρ:
 a.xφρ(1−xφρ)=0b.∑φ′x2φ′ρ−1=0

Focussing now on the Optimization Problem in GSC (2.1), the problem is to find the point that maximizes , i.e.,

 (8) maxx∈GH(x)

where is the plain Grammatical Harmony (without any quantization Harmony component).

A fundamental assumption of GSC is that takes the form of a quadratic function

 (9) H(y)=12yTWy+bTy.

This is the quadratic function that we want to maximize over , a finite set of discrete points.

Since we want to use neural computation to model how the brain solves the problem of maximizing Harmony on the grid , we choose to implement our computations in units with continuous activation varying continuously in time. We need to show how our discrete optimization problem can be encoded in this continuous way. GSC employs a standard way of doing this: we create a function that penalizes distance away from the grid. In particular, we choose to be zero on the grid and negative off the grid. Then, for large values of , we maximize over all of . Then we can either send to infinity or find some other way to round finally-computed states to the nearest grid point to obtain a maximizer on the grid.

With these goals and (7) in mind, we define Quantization Harmony as:

 (10) Q(y)≡−12∑ρ{[∑φy2φρ−1]2+∑φy2φρ(1−yφρ)2}

The first term in braces is if and only if for the particular role , the sum of the squared is , which is condition (7b) . The second term is if and only if or , which is condition (7a). Together, this ensures that for all and if and only if .

Together, and define the Total Harmony , parameterized by :

 (11) Hq≡H+qQ.

Now a basic strategy for solving (8) suggests itself. Choose a ‘large’ value of . Solve the continuous optimization problem

 (12) maxy∈RNHq(y)

For any finite , the maximizer of will not be a point on . But, as we will show, it will be close to the optimal solution on the grid.

In optimization this approach is sometimes known as a penalty method (Nocedal  Wright, 2006, Ch. 17). The problem is first solved with a small value of and then is gradually increased while the solution is updated. The procedure stops when increasing further does not change the solution significantly.

A penalty method does not however solve a fundamental aspect of our problem which is that we want global maxima of rather than local maxima. For large , every point in is close to a separate local maximum of , and so we can not use a simple steepest ascent method to find global maxima.

One solution to the problem of finding global maxima is to use simulated annealing Kirkpatrick . (1983). Simulated annealing is a popular method when the function to be optimized has many local optima but a global optimum is desired. Standard algorithms for finding local optima typically involve going ‘uphill’ until a local maximum is found. Simulated annealing combines uphill moves with occasional downhill moves to explore more of the state space. During simulated annealing the parameter , known as ‘temperature’, is decreased with time. When temperature is high, the algorithm is almost equally likely to take downhill steps as uphill steps. As is decreased, the algorithm becomes more and more conservative, eventually only going uphill. A wealth of computational experiments and theoretical analysis has shown simulated annealing to be effective for many global optimization problems.

Thus our approach is to combine both a penalty method (in that we choose a sufficiently large value of ) and simulated annealing to find solutions to (8). See Robini  Reissman (2013) for a similar framework.

To implement our method for solving problem (8), we introduce the following system of stochastic differential equations:

 (13) dy=∇Hq(y)dt+√2TdB

where is a standard -dimensional Brownian motion. This equation describes how the activations, given in the vector , change in time. The first term indicates there is a net drift of in the direction of increasing . The second term indicates that on top of this drift there is noise being continually added to the values of . Together they show that the system is undergoing a noisy random walk biased towards going uphill with respect to . When is large, the noise is large compared to the uphill motion, whereas when is small the randomness is negligible.

If we rewrite these equations in the form

 (14) dy=−∇(−Hq(y))dt+√2TdB

we see that it is a standard equation of mathematical physics known as either (overdamped) Langevin diffusion Roberts  Tweedie (1996); Mattingly . (2002), Brownian dynamics Schuss (2013), or a gradient system with additive noise Givon . (2004).

For a system of the form (13), we say is an invariant measure if when is distributed according to then is distributed according to for any . In other words, once the the state of the system is distributed according to , it remains distributed according to . Invariant measures are extremely important for systems that are ergodic, that is, where the system has a unique invariant measure and, given an initial distribution (including a deterministic one), the distribution of the system converges to the invariant measure. Among other results, in Section 4 we will see that the GSC dynamics are ergodic for any finite fixed and .

Under reasonable conditions on , the dynamics has an invariant distribution . The particular assumptions GSC makes about the function guarantee the following results, as we will show in Section 4.

Fact 1. For any fixed , the density is the unique invariant distribution of (13) and can be normalized to be a probability density. For all initial conditions , the probability distribution of converges to this unique invariant probability measure exponentially fast. (Note that the rate of convergence will depend on and .)

Fact 2. For fixed , as all the probability mass in the equilibrium distribution will be concentrated near points in the grid . The probability of being near point is proportional to — as required for solving the Sampling Problem (2.2).

Fact 3. For fixed , there is a cooling schedule for such that with probability 1, will converge to the maximum of —as required for solving the Optimization Problem (2.1)

In Section 5 we’ll use these results to establish a schedule for and that will suffice for the process to converge to the global maximum on the grid. Since this schedule will take infinite time to converge, we will also describe finite-time schedules such that for any there is a combined schedule for and such that with probability at least , will converge to the maximum of on the grid .

In what follows we will make use of two different notions of convergence of random variables. Suppose

is a probability measure for a random variable , so that

 P(X∈A)=π(A)

for all belonging to the collection of measurable sets. And suppose is the measure for another random variable . We define the total variation metric Gibbs  Su (2002) between and (or equivalently between and ) to be

 ∥π−ν∥=supA∈M|π(A)−ν(A)|.

Given a sequence of random variables with probability measures , , we say converges to in total variation if as . Another, weaker, definition of convergence is that of weak convergence where we say weakly converges to if for all bounded continuous

 ∫fdπn→∫fdπ.

## 3 Basic Properties of the Harmony Function as q→∞

Recall that the harmony function is given by

 Hq(y)=H(y)+qQ(y)

where is the quadratic function in (9) and is given in (10). has the property that it is at grid points and negative elsewhere, so it has global maxima at all points on the grid. For large , we expect to act like , and so we might hope that has local maxima near the grid points. We would like the global maximum of to be near the grid point where is greatest.

Our hopes are well founded, as the following result shows. First, in order to use the implicit function theorem, we compute the first two derivatives of .

 (15) ∇H = Wy+b ∇2H = W ∇Hq(y) = Wy+b+q∇Q(y) ∇2Hq(y) = W+q∇2Q(y)

where

 (16) [∇Q(y)]φρ=−2yφρ[∑φ′y2φ′ρ−1]−yφρ(1−yφρ)(1−2yφρ),
 (17) [∇2Q(y)]φ′ρ′,φρ = −δρρ′{2δφφ′[∑φ′′(yφ"ρ)2−1]+4yφρyφ′ρ+δφφ′[1−6(yφρ)(1−yφρ)]}.

An important observation is that for grid points we have

 [∇2Q(x)]φ′ρ′,φρ=−δρρ′δφφ′(1+4x2φρ),

because on the grid, , , and for any , unless . So the Laplacian of is diagonal and negative definite at grid points.

The following theorem shows that the local maxima of are within of the nearest point in . As well, the values of are within of the values of on .

###### Theorem 3.1.

Let . There is a neighborhood of and a such that for there is a local maximum of with

and

 Hq(xq)=H(x∗)−q−1∇H(x∗)T[∇2Q(x∗)]−1∇H(x∗)+O(q−2)
###### Proof.

Note that since is smooth, local optima satisfy .

To study how solutions to this equation depend on we let and study the equivalent equation

 h(x,ϵ)≡ϵ∇H(x)+∇Q(x)=0.

We use the implicit function theorem: see (Nocedal  Wright, 2006, p. 631). Since has solution , is twice continuously differentiable everywhere, and is nonsingular at the point , we have that we can uniquely solve for in a neighborhood of in terms of for all sufficiently close to . Furthermore, is twice continuously differentiable and

 dxϵdϵ(0)=−[∇xh(x∗,0)]−1∇ϵh(x∗,0)=−[∇2Q(x∗)]−1∇H(x∗).

Putting it back in terms of and building the Taylor expansion of gives the first result. The second result comes from substituting the first result into the Taylor expansion for about . ∎

As a straightforward corollary of the previous theorem, we have that the global maximum of is close to the global maximum of on for large .

###### Corollary 3.2.

Suppose has a unique global maximum on and the gap between the global maximum of on and the second highest local maximum is at least . Let be given. Then there is a such that for , has a unique global maximum satisfying both

 ∥xq−x∗∥≤η1,

and is at least away from the value of the second highest maximum.

###### Proof.

Follows straightforwardly from the previous theorem. ∎

## 4 Mathematical Results for the GSC Dynamics

In this section we establish the mathematical results about GSC dynamics that we outlined in Section 2.

### 4.1 For fixed q,T, convergence to invariant distribution as t→∞

We use the framework of Roberts  Tweedie (1996) to obtain the following result:

###### Theorem 4.1.

For the stochastic differential equation defined by (13) with fixed and

1. is defined for all time

2. there is a unique invariant probability measure where is a normalizing constant depending on and .

3. for any fixed initial condition the distribution of converges exponentially to as

Here exponential convergence of the distribution of to means there are constants and such that

 (18) ∥Pt(y0,⋅)−π∥≤Ry0ρt

for . is the probability that is in given . The constant in general depends on .

###### Proof.

The paper Roberts  Tweedie (1996) studies equations of the form

 (19) dxτ=12∇logπ(xτ)dτ+dBτ

where , , and as before is dimensional standard Brownian motion. The authors assume that is everywhere non-zero, differentiable and integrates to 1. This equation has invariant density as can be checked via the Fokker-Planck equation (Gardiner, 2004, Ch. 5).

We can transform our equation (13) into the form of Roberts  Tweedie (1996) using change of variables, , and . This gives

 dx=12T∇Hq(x)dτ+dB.

Now this can be transformed into (19) by letting

 12logπ(x)=12THq(x)

or

 π(x)=exp[T−1Hq(x)].

So everything Roberts  Tweedie (1996) proved for (19) with applies for our equation, though with a -dependent rescaling of time.

Theorem 2.1 of Roberts  Tweedie (1996) asserts that if is continuously differentiable, and if for some ,

 [∇logπ(x)]⋅x≤a∥x∥2+b,    for all ∥x∥>M,

then the dynamics are almost surely defined for all time, and the probability density function of the process converges to

in the total variation norm for all initial conditions.

In our case . So we need an expression for . Since , we first compute,

 ∇H(y)⋅y=(Wy+b)⋅y=yTWy+bTy≤wmax∥y∥2+∥b∥∥y∥≤(wmax+∥b∥)∥y∥2

if , where

is the maximum eigenvalue of

. Then, using the expression for from (16), we have

 [q∇Q(y)]⋅y=−q∑φρyφρ⎧⎨⎩2yφρ⎡⎣∑φ′y2φ′ρ−1⎤⎦+yφρ(1−yφρ)(1−2yφρ)⎫⎬⎭

The first term on the right can be bounded as

 −q∑φρyφρ⎧⎨⎩2yφρ⎡⎣∑φ′y2φ′ρ−1⎤⎦⎫⎬⎭ ≤ −2q∑φρy2φρ[y2φρ−1] ≤ 2q∑φρy2φρ=2q∥y∥2.

For the second term on the right we have

 −q∑φρyφρ{yφρ(1−yφρ)(1−2yφρ)}≤q8∑φρy2φρ=q8∥y∥2

where we have used that for all . Putting these bounds together gives, if ,

 1T∇Hq(y)⋅y≤1T[(wmax+∥b∥)∥y∥2+178q∥y∥2].

Theorem 2.1 of Roberts  Tweedie (1996) then gives results 1 and 2 in the statement of Theorem 4.1

To demonstrate exponential convergence to we use Theorem 2.3 of Roberts  Tweedie (1996). They state the exponential convergence is guaranteed for (13) if

1. there exist an such that is bounded for .

2. there exists a , , such that

 (20) liminf|x|→∞(1−d)∥∇logπ(x)∥2+∇2logπ(x)>0.

The first condition is true even for , since is bounded. For the second condition, recall that . So

 ∇2logπ=T−1Trace[∇2Hq(y)]=T−1%Trace[W+q∇2Q(y)].

If we take the trace of (17) we find that is negative for large but that no term grows faster than quadratically in . On the other hand,

 ∇logπ(y) = 1T∇Hq(y)=1T[Wy+b+q∇Q(y)], [∇logπ(y)]φρ = −2qTyφρ⎡⎣∑φ′y2φ′ρ+y2φρ⎤⎦+O(∥y∥2).

So

 (21) ∥∇logπ(y)∥2≥4q2T2∑φρy6φρ+O(∥y∥5).

So the left-hand side in (20) grows like a 6th order polynomial in for and and so the condition is satisfied. Theorem 2.3 of Roberts  Tweedie (1996) then gives result 3 in the statement of our theorem. ∎

Exponential convergence sounds good but the in (18) may be quite close to for large and small

. To give a rough estimate of how the rate of convergence scales with

and we perform in informal analysis using the Arrhenius formula (see (Gardiner, 2004, p. 141) or (Van Kampen, 1992, p. 334)). The Arrhenius formula gives an order of magnitude estimate for how long it takes a diffusion to exit one optimum and enter another. Generally, for a process to reach the equilibrium distribution from a fixed initial condition, the process has to visit representative points in the state space more than once. So the time to get from one local maximum to another gives a very conservative lower bound on how long it will take the process to reach equilibrium.

The one-dimensional version of the Arrhenius formula gives that the expected time to exit a state by getting over a saddle at point is

 τ=2π√U′′(a)|U′′(b)|exp[U(b)−U(a)T]

where is the potential. Adapting the result to our multidimensional case, we know that scales like as . So the expected time to converge to equilibrium at least goes like for some constant . This time diverges very rapidly as either or .

### 4.2 For fixed T, as q→∞, convergence to Boltzmann distribution

Here we consider fixed and see what happens to the equilibrium distribution as . From the previous subsection we know that the equilibrium distribution for fixed is

 Z−1qexp[Hq(x)/T]=Z−1qexp[(H(x)+qQ(x))/T]

where is a - and -dependent constant chosen to yield a probability distribution. (We suppress the dependence on in this section since we will not vary .)

In what follows let be all the points within distance of . Let be all the points where , that is, the grid points .

###### Theorem 4.2.

There exists a such that for all

 limq→∞∫B(xi,η)Z−1qexp[(H(x)+qQ(x))/T]dx=Z−1exp[H(xi)/T]

where is a normalizing constant depending on but not on or . Furthermore,

 limq→∞∫RN∖∪iB(xi,η)Z−1qexp[(H(x)+qQ(x))/T]dx=0.

Informally: for large , all the probability mass of the equilibrium distribution is concentrated about the . In this limit, each has probability mass proportional to . So this result effectively shows that GSC is Probabilistic Harmonic Grammar in the limit, solving the Sampling Problem (2.2).

###### Proof.

This result is obtained straightforwardly from (Kolokoltsov, 2007, Prop B2). We quote the result there in full: Let

 I(h)=∫Ωf(x)exp[−S(x)/h]dx

where is any closed subset of the Euclidean space , the functions and are continuous and for some positive .

We need the following assumptions:

1. the above integral is absolutely convergent for .

2. is thrice continuously differentiable

3. contains a neighborhood of the origin. As well, for and .

4. and is positive definite.

5. there exists positive such that

1. for all and some positive real

2. , where .

Let .

Proposition B2 states: Let the above assumptions hold and let be two times continuously differentiable and let be four times continuously differentiable. Then

 I(h)=(2πh)d/2(f(0)D−1/2+h[D−1/2δ1(h)+Λ−d/2δ2(h)])+δ3(h),

where

• has an -independent bound,

• has an -independent bound,

• converges to zero as goes to zero faster than any polynomial.

There are explicit expressions for all these bounds, but we do not need them here.

To apply this result to our integrals (without the normalizing constant), we need to change some variables. So we let

 f(x) = exp[H(x)/T] S(x) = −Q(x)/T h = 1/q

Also, for each integral we imagine translating the functions so that is at the origin. Our corresponds to , where is a constant we define shortly.

So do our and satisfy the conditions of the theorem? We go through the conditions stated above in the same order.

1. Immediately true because is bounded and the integrand is bounded on bounded sets.

2. Follows because is thrice continuously differentiable.

3. This is true as long as we choose to be small enough so that includes only one local maximum of .

4. Recall that the are local maxima of , so . As shown in Section 2, is negative definite and so is positive definite.

5. This is vacuously true since is bounded.

6. This is another constraint on how big is. Let be small enough so that for some . This is possible since is positive definite. This is condition 6(b). Condition 6(a) follows since is then convex on . (The only point is . So the minimum of on is either on the inner boundary or the outer boundary. If it is on the outer boundary then there must be an even lower point on the inner boundary by convexity.) To see that condition 6(c) holds for some and , note that . So just let be small enough so that .

Furthermore, just and being smooth satisfies the additional conditions. So we get that

 (22) ∫B(xi,η)exp[(H(x)+qQ(x))/T]dx=(2π/q)d/2(exp[H(xi)/T]D−1/2+q−1δ(q))

where has a -independent bound, for all sufficiently large . This is true for all . Note that in our case

 D=T−1det−∇2Q(xi)=T−1∏rf(1+4xrf)2=T−15R,

where is the number of roles, which takes the same values for all grid points.

What about the rest of ? Another useful result in Kolokoltsov (2007) is the following. If then

 I(h)≤Cexp[−M/h]

where is some -independent constant.

For our case this becomes

 (23) ∫Rd∖∪iB(xi,η)exp[(H+qQ)/T]dx≤Cexp[−Mq].

By the definition of the normalization constant we know that

 1 = Z−1q∫Rdexp[(H(x)+qQ(x))/T]dx = Z−1q∑i∫B(xi,η)exp[(H(x)+qQ(x))/T]dx+Z−1q∫Rd∖∪iB(xi,η)exp[(H+qQ)/T]dx.

Taking the limit as and observing the decreases faster than any power of , we obtain

 1=limq→∞[Z−1q(2πq)d/2D−1/2]∑iexp[H(xi)/T].

If we let

 Z=limq→∞Zq(2πq)−d/2D−1/2

then we obtain the desired result.

### 4.3 For fixed q, as T→0, convergence to the global maximum of Hq

Reducing to over time—simulated annealing—is a common technique for finding global optima of functions Kirkpatrick . (1983); Van Laarhoven  Aarts (1987); Hajek (1988). For a sufficiently slow cooling schedule, it is known that the process will converge to the global maximum of . There are many versions of this result; here we use Chiang . (1987) as it is closest to our framework, using a diffusion process to optimize a continuous function on . We will use their results to show that converges to the maximum of as , for sufficiently slow cooling, showing that the GSC dynamics can solve the Optimization Problem (2.1)

Here we state Chiang . (1987)’s main result. They consider the diffusion equation

 dX(t)=−∇U(X(t))dt+σ(t)dW(t),    X(0)=x0.

where is -dimensional Brownian motion. Let be the probability distribution proportional to and let have a unique weak limit as , . Let be a twice continuously differentiable function from to such that

1. and as .

2. .

Finally, assume that and for large . Then there is a such that for , for any bounded continuous function

 p(0,x0,t,f)→π(f),   as t→∞.

Here is the expected value of given .

The in Chiang . (1987) corresponds to our except that . This can be obtained by replacing with . This does not change the dynamics or the location of the maxima at all. Continuing to apply their framework, we need to take . We let be the distribution normalized to be a probability distribution (i.e. to have total mass 1). We need to show that has a weak limit as . This again can be shown using the results of Kolokoltsov (2007). We let be the normalizing constant such that . We let be the local maxima of and let be the points where it attains its global maxima. (Typically, we expect there to be only one point in . The result is that the limit distribution as is equal point masses distributed at all the points of . Of primary interest to us is the situation when there is a unique global maximum and hence the limiting distribution is a single point mass at the global optimum.)

###### Lemma 4.3.

For sufficiently large , the probability distribution with density converges weakly to

 1|G′q|∑x∈