Exploring Discrete Analogs of Hamiltonian Monte Carlo

09/11/2019, by Guangyao Zhou, et al.

Hamiltonian Monte Carlo (HMC) has emerged as a powerful way to sample from continuous distributions. However, a fundamental limitation of HMC is that it can't be applied to discrete distributions. In this paper, we propose the momentum sampler as a general framework for exploring the idea of building discrete analogs of HMC. The momentum sampler is a novel family of Markov Chain Monte Carlo (MCMC) algorithms that extends arbitrary single-site Metropolis Hastings (MH) samplers for discrete distributions with an auxiliary momentum process. It connects and generalizes a variety of existing methods, and points to a more general framework where we can mix stochastic evolution of discrete variables with deterministic evolution of continuous variables. The auxiliary momentum process can be integrated exactly, resulting in a rejection-free sampling process. The various trade-offs involved in picking different proposal distributions and kinetic energies have clear physical interpretations, and are demonstrated by numerical experiments on a simple Potts model.


1 Introduction

Markov Chain Monte Carlo (MCMC) methods are widely used to sample from complex probability distributions. Recently, Hamiltonian Monte Carlo (HMC) Neal2012-cp has emerged as a powerful MCMC method for sampling from continuous distributions. A fundamental limitation of HMC is that it cannot be applied to discrete distributions. There have been some recent efforts Pakman2013-gi ; Nishimura2017-wv ; Zhang2012-qm ; Dinh2017-td to apply HMC to discrete distributions. However, they often support only a limited family of distributions: Pakman2013-gi only works for binary distributions; Zhang2012-qm , without approximation, only works for binary pairwise models; and Dinh2017-td is specifically designed for phylogenetic trees. More importantly, there are no clear connections between the behavior of these methods and that of HMC: these methods are usually based on embedding the discrete state space into the continuum and naively applying HMC (or its discontinuous Pakman2013-gi ; Mohasel_Afshar2015-uh ; Nishimura2017-wv and probabilistic path Dinh2017-td variants) on the continuous state space, while the keys to HMC's success lie in its ability to exploit gradient information, to make global proposals by traversing long isoprobability contours, and to make use of the "momentum behavior" (i.e., accelerating in high probability/low energy regions and decelerating in low probability/high energy regions).

In this paper, we explore the idea of building discrete analogs of HMC. We propose the momentum sampler as a general framework for this exploration. The momentum sampler is a novel family of MCMC algorithms that extends arbitrary single-site Metropolis Hastings (MH) samplers for discrete distributions with an auxiliary momentum process. The auxiliary momentum process can be integrated exactly, resulting in a rejection-free sampling process. The various trade-offs involved in picking different proposal distributions and kinetic energies have clear physical interpretations.

In single-site MH samplers (sometimes also referred to as Metropolis-within-Gibbs samplers), each proposal only changes the value of a single random variable. Most of the commonly used MCMC methods for discrete distributions, including the random walk Metropolis sampler and the Gibbs sampler, fall within this family. In fact, many embedding-based HMC methods for discrete distributions are also closely connected to single-site MH samplers. In Section 4.3 of Nishimura2017-wv , the authors point out connections between their method and single-site MH samplers. In Section 3.1 of this paper, we show that the Gaussian binary HMC sampler proposed in Pakman2013-gi behaves, for a suitable travel time, almost the same as a systematic scan single-site random walk Metropolis sampler.

The momentum sampler has rich connections to existing methods. In Section 3, we prove the Gaussian/exponential binary HMC samplers proposed in Pakman2013-gi are special cases of the momentum sampler, and show that the momentum sampler further generalizes the existing discontinuous Pakman2013-gi ; Mohasel_Afshar2015-uh and probabilistic path Dinh2017-td variants of HMC: instead of working with only a uniform proposal (as in Dinh2017-td ), it allows the use of arbitrary proposals while maintaining the desired behavior of HMC. It also points to a more general framework where we can mix stochastic evolution of discrete variables with deterministic evolution of continuous variables.

The rest of the paper is organized as follows. In Section 2, we describe the momentum sampler, prove that it samples from the correct distribution, and discuss the various trade-offs involved in picking different proposal distributions and kinetic energies. In Section 3, we establish connections between the momentum sampler and a variety of existing methods Pakman2013-gi ; Dinh2017-td ; Mohasel_Afshar2015-uh . In Section 4, we present some numerical experiments on a simple 1D 6-state Potts model, to demonstrate the behavior of the momentum sampler with different choices of travel time, proposal distributions and kinetic energies. Finally, we give conclusions and future directions in Section 5.

2 The Momentum Sampler

2.1 The Algorithm

We want to sample from a target discrete distribution $\pi$ on $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ (known up to a normalization constant), where $x = (x_1, \ldots, x_n)$ is a random vector and each $\mathcal{X}_i$ is a finite set. We write $\pi(x) \propto e^{-U(x)}$, where $U$ is the potential energy. Use $Q_i(\cdot \mid x)$ to denote single-site proposal distributions for a single-site MH sampler, where $Q_i(x' \mid x) > 0$ only when $x'_j = x_j$ for all $j \neq i$. In other words, the proposal distribution $Q_i$ only changes the value of the random variable $x_i$. We require each $Q_i$ to be irreducible.

Formally, the momentum sampler operates on the expanded state space $\mathcal{X} \times \mathbb{T}^n \times \mathbb{R}^n$, where $\mathbb{T}^n$ denotes the $n$-dimensional flat torus. We are going to identify $\mathbb{T}^n$ with the $n$-dimensional hypercube $[0, 1]^n$ with the 0's and 1's in different dimensions glued together. The auxiliary location variable $y \in \mathbb{T}^n$ and the auxiliary momentum variable $p \in \mathbb{R}^n$ evolve according to Hamiltonian dynamics, with a flat potential energy on $\mathbb{T}^n$ and a kinetic energy $K(p) = \sum_{i=1}^{n} k(p_i)$, where $k$ is some one-dimensional kinetic energy.
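The joint invariant distribution on this expanded state space, referenced again in Section 2.2, can be written down explicitly. In the notation above (and assuming, as the flat potential suggests, a uniform distribution over the auxiliary location $y$),

\[
\rho(x, y, p) \;\propto\; \pi(x)\, e^{-K(p)} \;=\; \pi(x) \prod_{i=1}^{n} e^{-k(p_i)}, \qquad (x, y, p) \in \mathcal{X} \times \mathbb{T}^n \times \mathbb{R}^n,
\]

so that resampling $p$ at each iteration is a Gibbs step with respect to $\rho$.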

At a high level, the momentum sampler evolves according to the following dynamics. As long as no coordinate of $y$ sits on the boundary, the discrete state $x$ remains unchanged, and $y$ and $p$ evolve according to the Hamiltonian dynamics with the flat potential, i.e. $\dot{y}_i = k'(p_i)$ and $\dot{p}_i = 0$ for all $i$. If $y$ hits either 0 or 1 at site $i$, we propose a new value $x'_i$ using $Q_i(\cdot \mid x)$, calculate the energy difference $\Delta E$ of the corresponding single-site MH move (the negative log of the MH acceptance ratio), and modify $x$ and $p_i$ accordingly: if the kinetic energy $k(p_i)$ exceeds $\Delta E$, the proposal is accepted and $p_i$ is refracted so that the total energy is conserved; otherwise the proposal is rejected and $p_i$ is reflected.

Because of the flat potential energy, we can exactly integrate the Hamiltonian dynamics for an arbitrary kinetic energy $k$. How to choose the kinetic energy for HMC is an active research area Livingstone2017-zd ; Betancourt2017-vx . In general, $k$ needs to be differentiable and symmetric around 0, have a well-defined inverse on $[0, +\infty)$ (which we denote by $k^{-1}$), and we should be able to sample from the momentum distribution $\propto e^{-k(p)}$. In this paper, we consider a simple family of kinetic energies that is general enough to encompass all commonly used kinetic energies and clearly demonstrates the underlying physical interpretations: we take the momentum distribution to be a generalized normal distribution $\mathcal{GN}(0, \alpha, \beta)$ with location 0, scale $\alpha$ and shape $\beta$, which corresponds to the kinetic energy $k(p) = |p/\alpha|^{\beta}$ ($\beta = 2$ recovers the usual Gaussian kinetic energy, and $\beta = 1$ a Laplace-type kinetic energy).
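To make the roles of $k$, $k'$ and $k^{-1}$ concrete, here is a minimal Python sketch of this kinetic energy family, assuming $k(p) = |p/\alpha|^{\beta}$ as above; the function names and the gamma-based sampling trick are our own choices, not taken from a reference implementation.

```python
import numpy as np

def kinetic(p, alpha, beta):
    """Kinetic energy k(p) = |p / alpha|^beta (one-dimensional)."""
    return np.abs(p / alpha) ** beta

def velocity(p, alpha, beta):
    """Velocity dy/dt = k'(p); it grows with |p| when beta > 1."""
    return (beta / alpha) * np.sign(p) * np.abs(p / alpha) ** (beta - 1.0)

def kinetic_inverse(energy, alpha, beta):
    """|p| such that k(p) = energy, i.e. k^{-1} on [0, +inf)."""
    return alpha * energy ** (1.0 / beta)

def sample_momenta(n, alpha, beta, rng):
    """Draw n momenta with density proportional to exp(-|p/alpha|^beta).
    If G ~ Gamma(1/beta, 1), then alpha * G**(1/beta) has the right magnitude."""
    magnitudes = alpha * rng.gamma(1.0 / beta, 1.0, size=n) ** (1.0 / beta)
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs * magnitudes

rng = np.random.default_rng(0)
p = sample_momenta(5, alpha=1.0, beta=1.5, rng=rng)
print(kinetic(p, 1.0, 1.5), velocity(p, 1.0, 1.5))
```

The gamma trick works because if $G \sim \mathrm{Gamma}(1/\beta, 1)$, then $\alpha G^{1/\beta}$ has density proportional to $e^{-(p/\alpha)^{\beta}}$ on $[0, \infty)$.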

A detailed description of the momentum sampler algorithm is given in Algorithm 1.

1: Input: $\pi$, target discrete distribution; $\{Q_i\}$, single-site proposals; $\beta$, parameter for the kinetic energy $k$; $N$, number of MCMC samples; $T$, travel time for each step
2: Output: $x^{(1)}, \ldots, x^{(N)}$, MCMC samples from the momentum sampler
3: Initialize $x^{(0)}$ (e.g. sample uniformly from $\mathcal{X}$)
4: Sample $y^{(0)}$ uniformly from $\mathbb{T}^n$
5: for $t$ from $1$ to $N$ do
6:     Resample $p$, with each $p_i$ drawn independently from the momentum distribution $\propto e^{-k(p_i)}$
7:     $(x^{(t)}, y^{(t)}, p) \leftarrow$ TakeMomentumStep$(x^{(t-1)}, y^{(t-1)}, p, T)$ ▷ See Algorithm 2
8: end for
9: return $x^{(1)}, \ldots, x^{(N)}$
Algorithm 1 The Momentum Sampler Algorithm
1: Input: $\pi$, target discrete distribution; $\{Q_i\}$, single-site proposal distributions; $\beta$, parameter for the kinetic energy $k$
2: Input: $x$, current discrete state; $y$, current location; $p$, current momentum; $T$, travel time
3: Output: $x$, next discrete state; $y$, next location; $p$, next momentum
4: function TakeMomentumStep($x, y, p, T$)
5:     Set the remaining travel time $\tau \leftarrow T$, and the velocities $v_i \leftarrow k'(p_i)$ for all $i$
6:     For each site $i$, compute the time $t_i$ at which $y_i$ would reach 0 or 1 when moving at velocity $v_i$
7:     Set $t^* \leftarrow \min_i t_i$, with $i^*$ the minimizing site
8:     while $t^* \le \tau$ do
9:         Advance the location, $y \leftarrow y + t^* v$, so that $y_{i^*}$ reaches 0 or 1, and set $\tau \leftarrow \tau - t^*$
10:        Propose a new value $x'_{i^*} \sim Q_{i^*}(\cdot \mid x)$, and let $x'$ agree with $x$ at all other sites
11:        Compute the energy difference $\Delta E \leftarrow U(x') - U(x) + \log Q_{i^*}(x'_{i^*} \mid x) - \log Q_{i^*}(x_{i^*} \mid x')$
12:        if $k(p_{i^*}) > \Delta E$ then
13:            Accept: $x \leftarrow x'$
14:            Refract: $p_{i^*} \leftarrow \operatorname{sign}(p_{i^*})\, k^{-1}\big(k(p_{i^*}) - \Delta E\big)$
15:            Wrap $y_{i^*}$ around the torus (from 1 to 0, or from 0 to 1)
16:        else
17:            Reject: keep $x$ unchanged
18:            Reflect: $p_{i^*} \leftarrow -p_{i^*}$
19:        end if
20:        Update $v_{i^*} \leftarrow k'(p_{i^*})$, recompute the hitting times $t_i$, and reset $t^*$ and $i^*$
21:    end while
22:    Advance the location by the remaining time, $y \leftarrow y + \tau v$
23:    $p \leftarrow -p$ ▷ Exists to ensure the TakeMomentumStep function is reversible
24:    return $(x, y, p)$
25: end function
Algorithm 2 The TakeMomentumStep Function
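To complement the pseudocode, the following is a minimal, self-contained Python sketch of one TakeMomentumStep call as reconstructed above. It is a plausible realization rather than the authors' reference implementation: the exact accept/refract/reflect rule, the torus wrap-around, and all function and argument names (U, propose, log_q) are our assumptions.

```python
import numpy as np

def take_momentum_step(x, y, p, T, U, propose, log_q, alpha, beta, rng):
    """One TakeMomentumStep call (sketch).
    x: discrete state (list); y, p: arrays of locations on [0, 1] and momenta.
    U(x): potential energy; propose(i, x, rng): proposed new value for site i;
    log_q(i, value, x): log Q_i(value | x)."""
    x = list(x)
    y, p = np.array(y, dtype=float), np.array(p, dtype=float)
    tau = T
    while True:
        # Velocities k'(p_i) for the assumed kinetic energy k(p) = |p / alpha|^beta.
        v = (beta / alpha) * np.sign(p) * np.abs(p / alpha) ** (beta - 1.0)
        # Time for each coordinate to reach the boundary of the unit interval.
        with np.errstate(divide="ignore", invalid="ignore"):
            t_hit = np.where(v > 0, (1.0 - y) / v, np.where(v < 0, -y / v, np.inf))
        i = int(np.argmin(t_hit))
        t_star = float(t_hit[i])
        if t_star > tau:  # no further site visitation within the remaining travel time
            y = np.clip(y + tau * v, 0.0, 1.0)
            break
        y = np.clip(y + t_star * v, 0.0, 1.0)
        y[i] = 1.0 if v[i] > 0 else 0.0   # snap to the boundary that was hit
        tau -= t_star
        # Single-site proposal at the site that hit the boundary.
        new_val = propose(i, x, rng)
        x_new = list(x)
        x_new[i] = new_val
        delta_e = (U(x_new) - U(x)) + log_q(i, new_val, x) - log_q(i, x[i], x_new)
        k_i = np.abs(p[i] / alpha) ** beta
        if k_i > delta_e:
            # Accept: refract (kinetic energy at site i drops by delta_e), wrap around the torus.
            x = x_new
            p[i] = np.sign(p[i]) * alpha * (k_i - delta_e) ** (1.0 / beta)
            y[i] = 0.0 if v[i] > 0 else 1.0
        else:
            # Reject: reflect the momentum; y_i stays on the boundary and moves back.
            p[i] = -p[i]
    return x, y, -p  # final momentum flip, for reversibility
```

Algorithm 1 then simply alternates resampling the momenta (e.g. with sample_momenta from the earlier sketch) with calls to take_momentum_step.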

2.2 The Momentum Sampler Samples from the Correct Distribution

In this section, we prove that the momentum sampler samples from the correct distribution $\pi$. For this purpose, we show that the momentum sampler preserves the joint invariant distribution $\rho$ defined above, and establish its irreducibility and aperiodicity.

At each iteration, the resampling of $p$ can be seen as a Gibbs step, where we sample from the conditional distribution of $p$ given $(x, y)$. This obviously preserves $\rho$. So we only need to prove detailed balance of the TakeMomentumStep function w.r.t. $\rho$. Formally, for a given travel time $T$, TakeMomentumStep defines a transition probability kernel on the expanded state space, assigning to each starting point $(x, y, p)$ and each measurable set of destinations the probability of landing in that set. We have

Theorem 1.

(Detailed Balance) The TakeMomentumStep function satisfies detailed balance w.r.t. the joint invariant distribution $\rho$, i.e. the probability (under $\rho$) of starting in any measurable set $A$ and ending in any measurable set $B$ after one TakeMomentumStep call equals the probability of starting in $B$ and ending in $A$.

We defer the proof of this theorem to the supplementary materials. This theorem immediately implies that the momentum sampler preserves the joint invariant distribution $\rho$.

In each iteration, since we can sample arbitrary values for the momentum variable $p$, we can affect the location variable $y$ in arbitrary ways. This, coupled with the irreducibility of the proposal distributions $Q_i$, can typically guarantee the irreducibility of the momentum sampler. The potential problem of periodicity can be solved by randomly choosing the travel time $T$ for each TakeMomentumStep call. The above argument establishes the irreducibility and aperiodicity of the momentum sampler, and proves that the momentum sampler samples from the correct distribution $\pi$.

2.3 Various Trade-offs

2.3.1 Trade-offs in Picking Single-site MH Proposal Distributions

To understand the trade-offs in picking different single-site MH proposal distributions, we look at the key quantity in Algorithm 2, the energy difference $\Delta E = U(x') - U(x) + \log Q_i(x'_i \mid x) - \log Q_i(x_i \mid x')$. Note that $\Delta E$ is exactly the negative log of the MH acceptance ratio. Intuitively, it measures how far the $Q_i$'s are from detailed balance.

The main trade-offs in picking the $Q_i$'s are between how far the $Q_i$'s are from detailed balance and how much information goes into the auxiliary momentum process:


  • On one extreme, we have the Gibbs sampler, which perfectly satisfies detailed balance. In this case, $\Delta E$ is always 0 (see the short calculation after this list), no information goes into the auxiliary momentum process, and the momentum sampler is equivalent to the systematic scan Gibbs sampler.

  • On the other extreme, we have the random walk Metropolis sampler, which is "furthest away" from detailed balance. In this case, $\Delta E = U(x') - U(x)$, and arguably the largest amount of information goes into the auxiliary momentum process.

  • In between the two extremes, we have various other choices. In many cases, we want the $Q_i$'s to be "informed" to some degree. Similar problems have recently been considered in Zanella2017-ro . Thinking physically, these informed proposals can be seen as providing the analog of gradient information for discrete distributions. Depending on how far the $Q_i$'s are from detailed balance, different amounts of information go into the auxiliary momentum process.
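As a quick check of the two extremes above, using the $\Delta E$ convention reconstructed in this section: for the Gibbs proposal $Q_i(x'_i \mid x) = \pi(x'_i \mid x_{-i})$, the proposal ratio exactly cancels the target ratio,

\[
\Delta E \;=\; U(x') - U(x) + \log \frac{\pi(x'_i \mid x_{-i})}{\pi(x_i \mid x_{-i})}
\;=\; -\log\frac{\pi(x')}{\pi(x)} + \log\frac{\pi(x')}{\pi(x)} \;=\; 0,
\]

while for a symmetric random walk proposal the log proposal terms cancel each other and $\Delta E = U(x') - U(x)$.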

2.3.2 Trade-offs in Picking Kinetic Energies

We start with some illuminating calculations. Given the momentum $p$, the velocity is given by $v = k'(p)$. Regarding the magnitude of the velocity as a function of the kinetic energy $E = k(p)$ gives us $|v|(E) = k'\big(k^{-1}(E)\big)$. Taking the derivative of this function gives us $\frac{d|v|}{dE} = \frac{k''\big(k^{-1}(E)\big)}{k'\big(k^{-1}(E)\big)}$.

The main trade-offs in picking kinetic energies are between how the velocity is affected by changes in the kinetic energy and how much kinetic energy we typically have:


  • On the one hand, to capture the momentum behavior of HMC, it is desirable to have a larger velocity when the kinetic energy increases. This requires $k$ to be convex, or more precisely $\frac{d|v|}{dE} \ge 0$. Furthermore, we want the changes in velocity to be sensitive to the changes in the kinetic energy. For the generalized normal family, this typically points to having a larger shape parameter $\beta$.

  • On the other hand, we need to make use of the kinetic energy to overcome energy barriers $\Delta E$, and intuitively, it is desirable to typically have a large kinetic energy. Calculating the expected kinetic energy under the momentum distribution points to having a smaller $\beta$ (a short calculation follows this list).
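To make both directions of this trade-off explicit, here is a short calculation, assuming the kinetic energy family $k(p) = |p/\alpha|^{\beta}$ we used in Section 2.1 (itself a reconstruction of the generalized-normal parameterization):

\[
|v|(E) \;=\; k'\big(k^{-1}(E)\big) \;=\; \frac{\beta}{\alpha}\, E^{\,1 - 1/\beta}, \qquad
\frac{d\,|v|}{dE} \;=\; \frac{\beta - 1}{\alpha}\, E^{-1/\beta}, \qquad
\mathbb{E}\big[k(p)\big] \;=\; \frac{1}{\beta} \;\; \text{for } p \sim \mathcal{GN}(0, \alpha, \beta).
\]

So $\beta > 1$ is needed for the velocity to grow with the kinetic energy, while the typical amount of kinetic energy available for overcoming energy barriers, $1/\beta$, shrinks as $\beta$ grows.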

3 Connections to Existing Methods

3.1 Binary HMC Samplers are Special Cases of the Momentum Sampler

In this section, we prove that the Gaussian/exponential binary HMC samplers proposed in Pakman2013-gi are special cases of the momentum sampler. We use "visit site $i$" to denote hitting 0 at site $i$ for the binary HMC samplers, and hitting 0 or 1 at site $i$ for the momentum sampler. The key to proving this result is to understand that, in essence, the auxiliary momentum process in the momentum sampler and the embedding into the continuum in the binary HMC samplers are modeling the same two things: the time it takes to visit a site again (or, more practically, at which site we are going to make the next proposal), and how much total (potential and kinetic) energy we have at each site.

Formally, in order to establish the equivalences between the binary HMC samplers and certain momentum samplers, for a particular site $i$, we need to study:

  1. The distribution of the initial time it takes to visit site $i$

  2. The distribution of the initial total energy at site $i$

  3. After we visit site $i$, the time it takes to visit site $i$ again if we have total energy $E$ at site $i$

For two samplers, if we can establish the equivalence of the joint distribution on the initial time and initial total energy, and the equivalence of the time it takes to visit this site again, we would be able to establish the equivalence of the corresponding samplers.

To simplify the proof, for the momentum sampler, rather than using the discrete state space , we work with instead. For the Gaussian binary HMC sampler, we have

Proposition 1.

The Gaussian binary HMC sampler is equivalent to the momentum sampler with deterministic single-site flip proposals and an appropriately matched kinetic energy.

As a by-product of the above proposition, we also have

Proposition 2.

If, in addition to resampling the momenta, we also resample the auxiliary locations at every iteration, then the Gaussian binary HMC sampler with a particular choice of travel time is equivalent to a systematic scan single-site random walk Metropolis sampler, where the site-visitation order is a random permutation of $\{1, \ldots, n\}$ and is refreshed after every $n$ site visits.

For the original Gaussian binary HMC sampler with this travel time, since we don't resample the auxiliary locations at each iteration, its behavior is not exactly the same as that of a systematic scan single-site random walk Metropolis sampler. Nevertheless, it's easy to see that these two samplers are closely related.

For the exponential binary HMC sampler, we have

Proposition 3.

The exponential binary HMC sampler is equivalent to the momentum sampler with deterministic single-site flip proposals and a correspondingly matched kinetic energy.

We defer the proofs of these propositions to the supplementary materials. These results also explain the observations in Pakman2013-gi that the exponential binary HMC sampler is less efficient than the Gaussian binary HMC sampler: looking back at the trade-offs discussed in Section 2.3.2, it's easy to see that the kinetic energy implied by the exponential sampler induces undesirable momentum behavior (where we decelerate in high probability/low energy regions and accelerate in low probability/high energy regions), makes it easy for the process to get trapped in local energy minima, and makes the algorithm less efficient.

3.2 The Momentum Sampler Generalizes Discontinuous and Probabilistic Path HMCs

A foundational tool in applying HMC to discrete distributions is the discontinuous variant Pakman2013-gi ; Mohasel_Afshar2015-uh of HMC, which operates on piecewise continuous potential energies. The discontinuous HMC was first proposed in Pakman2013-gi , where the authors considered a simple piecewise continuous potential energy arising from embedding binary distributions into the continuum. The Hamiltonian dynamics was integrated exactly across the discontinuities. This was later generalized in Mohasel_Afshar2015-uh to arbitrary piecewise continuous potential energies with discontinuities across affine boundaries, and a modified Leapfrog algorithm was introduced to account for such discontinuities.

In Dinh2017-td , the authors introduced a further generalization, the probabilistic path variant of HMC. In addition to its successful applications to the space of phylogenetic trees, an important contribution of probabilistic path HMC is the introduction of probabilistic paths to Hamiltonian dynamics, mixing the stochastic evolution of discrete variables with the deterministic evolution of continuous variables. However, the authors only considered a simple uniform proposal distribution for the evolution of the discrete variables. Moreover, a version of the probabilistic path HMC algorithm which allows for discontinuities in the potential energy was presented, but detailed balance, irreducibility and aperiodicity were only established for the case of continuous potential energy.

The momentum sampler generalizes discontinuous Pakman2013-gi ; Mohasel_Afshar2015-uh and probabilistic path Dinh2017-td HMCs, by introducing probabilistic paths that can evolve according to arbitrary proposal distributions to Hamiltonian dynamics, while properly accounting for the corresponding discontinuities in the potential energy. Its detailed balance, irreducibility and aperiodicity are rigorously established. In the momentum sampler, we focus on a very simple auxiliary momentum process. However, it’s easy to see that the essential idea is applicable to more general cases, in which we have complicated auxiliary state spaces and nontrivial potential energies. This points to a much more general framework where we can introduce interesting interactions between stochastic evolution of discrete variables and deterministic evolution of continuous variables.

4 Numerical Experiments on a 1D Potts Model

4.1 Experimental Setup

In this section, we carry out numerical experiments on a simple 1D 6-state Potts model of size 400 with periodic boundary conditions, to demonstrate the behavior of the momentum sampler. More concretely, we consider the discrete distribution

\[
\pi(x) \;\propto\; \exp\Big( J \sum_{i=1}^{400} \mathbb{1}\{x_i = x_{i+1}\} \Big), \qquad x_i \in \{1, \ldots, 6\}, \quad x_{401} \equiv x_1,
\]

where the coupling parameter $J$ controls the strength of correlations among neighboring locations. In our experiments, we use a value of $J$ that induces relatively strong correlations among neighboring locations.
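For reference, a minimal Python sketch of this target's potential energy; the coupling value below is a placeholder, since the specific value used in the experiments is not reproduced here.

```python
import numpy as np

N_SITES, N_STATES = 400, 6
J = 1.0  # placeholder coupling; the experiments use a value giving fairly strong correlations

def potential(x):
    """U(x) = -J * (number of equal neighboring pairs), with periodic boundary conditions."""
    x = np.asarray(x)
    return -J * float(np.sum(x == np.roll(x, -1)))

def log_prob(x):
    """Unnormalized log pi(x) = -U(x)."""
    return -potential(x)
```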

For the momentum sampler, following Zanella2017-ro , we consider 4 different proposals: RW, GB, LB1, LB2. For site $i$, the four different proposals share the common form

\[
Q_i(x'_i \mid x) \;\propto\; g\!\left( \frac{\pi(x')}{\pi(x)} \right), \qquad x'_j = x_j \;\; \forall j \neq i,
\]

where $g$ is a balancing function. For RW, $g$ is constant; for GB, $g(t) = t$; and LB1 and LB2 use the locally balanced choices $\sqrt{t}$ and $t/(1+t)$. We experiment with a range of values of $\beta$ and of the travel time $T$.
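Continuing the sketch above (it reuses log_prob and N_STATES), the single-site informed proposals can be implemented along these lines; whether the current value of a site is excluded from the proposal's support is a detail we have not verified, so it is kept here for simplicity.

```python
BALANCING = {
    "RW":  lambda t: np.ones_like(t),
    "GB":  lambda t: t,
    "LB1": lambda t: np.sqrt(t),
    "LB2": lambda t: t / (1.0 + t),
}

def single_site_proposal(i, x, g, rng):
    """Sample a new value for site i from Q_i(. | x) proportional to g(pi(x') / pi(x))."""
    x = np.asarray(x)
    ratios = np.empty(N_STATES)
    for s in range(N_STATES):
        x_new = x.copy()
        x_new[i] = s
        ratios[s] = np.exp(log_prob(x_new) - log_prob(x))  # pi(x') / pi(x)
    weights = BALANCING[g](ratios)
    weights = weights / weights.sum()
    return rng.choice(N_STATES, p=weights), weights  # proposed value and the full proposal pmf
```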

For comparison, we also test several basic single-site MH samplers: random/systematic scan Gibbs samplers and random/systematic scan MH samplers with the RW, GB, LB1, LB2 proposals (which we call "informed samplers"). For random scan samplers, at each iteration, we randomly pick a site $i$ and make a proposal using $Q_i$, while for systematic scan samplers, we repeatedly pick a random order and visit all 400 sites in this order to make proposals. Note that our informed samplers are different from the samplers considered in Zanella2017-ro : in Zanella2017-ro , the proposal distribution has the form $Q(x' \mid x) \propto g\big(\pi(x')/\pi(x)\big)$ over all states $x'$ that differ from $x$ at exactly one site, which simultaneously picks a site to visit and proposes a new state for this site; our informed samplers instead rely on the random/systematic scan to pick a site $i$, and use $Q_i$ to make the proposal.

For the Potts model, a useful summary statistic is the "order parameter". For a particular state $x$, the order parameter is defined as $\big(6 \max_{s} N_s(x) / 400 - 1\big) / 5$, where $N_s(x)$ is the number of sites in state $s$. The order parameter ranges from 0 (totally disordered) to 1 (totally ordered). In our experiments, for each run of the sampler, we also calculate the corresponding sequence of order parameters.
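Continuing the same sketch, the order parameter (under the definition we assumed above) is a one-liner:

```python
def order_parameter(x):
    """(q * largest state fraction - 1) / (q - 1): 0 = totally disordered, 1 = totally ordered."""
    counts = np.bincount(np.asarray(x), minlength=N_STATES)
    return (N_STATES * counts.max() / len(x) - 1.0) / (N_STATES - 1.0)
```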

To make a fair comparison of different samplers, we run each sampler until it makes approximately the same total number of site visits. For the momentum sampler with travel time $T$, this translates to a correspondingly different number of MCMC samples. We use the effective sample size (ESS) of the order parameters per site visit as our performance measure. The ESS is calculated using the effective_sample_size function in the ArviZ Python package Kumar2019-uh . To improve estimation accuracy, for a particular sampler, we use the order parameters from 50 different runs to estimate the ESS. In our experiments, we always initialize the system in a totally ordered state, where all $x_i$'s are in state 1.
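For example, with the order-parameter traces from the 50 runs stored row-wise in a NumPy array, the ESS can be computed along these lines (we call arviz.ess here, the ESS entry point in current ArviZ releases; the text above refers to an effective_sample_size function):

```python
import arviz as az
import numpy as np

order_params = np.random.rand(50, 1000)  # placeholder: one row of order parameters per run
print(az.ess(order_params))              # rows are treated as independent chains
```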

4.2 Results

Figure 1: Evolution of States and Order Parameters. Visualizations of the evolution of states and order parameters for the 3 best-performing samplers: the systematic scan Gibbs sampler, the systematic scan informed sampler with proposal GB, and the momentum sampler with proposal GB at its best-performing $\beta$ and travel time $T$. The corresponding ESS's of the order parameters (larger is better) are shown.
Figure 2: ESS as a function of the travel time $T$, the proposal distribution, and $\beta$. Upper figure: how the ESS changes as a function of the travel time $T$ for the momentum sampler with proposals GB, LB1, LB2, together with reference ESS's of the systematic scan Gibbs sampler and the systematic scan informed sampler with proposal GB. Lower figure: how the ESS changes as a function of the proposal distribution and $\beta$ for the momentum sampler with proposals GB, LB1, LB2.

Before going into the details of the experimental results, we want to emphasize that the goal of our experiments is not to argue for the efficiency of the momentum sampler, but rather to demonstrate its behavior. Since the momentum sampler is built on top of single-site MH samplers, it's not hard to imagine having more efficient cluster Monte Carlo algorithms Ito1994-tp . Even if we only consider single-site MH algorithms, we can make use of the problem structure to design more efficient samplers: for our 1D Potts model, using a fixed sequential scan from 1 to 400 with GB can easily outperform all the above samplers. The momentum sampler provides a general framework to explore the idea of building discrete analogs of HMC, and the experiments here serve the purpose of demonstrating how its performance changes as we vary the travel time $T$, the proposal distributions, and $\beta$, and how the auxiliary momentum process affects the performance of the underlying single-site MH samplers.

Since we are working with a 1D model, we first visualize the evolution of the states and order parameters for three good-performing samplers (Figure 1): the systematic scan Gibbs sampler, the systematic scan informed sampler with proposal GB, and the momentum sampler with proposal GB at its best-performing travel time $T$ and $\beta$. These are in fact the best-performing Gibbs/informed/momentum samplers. We can clearly see that the momentum sampler performs the best, improving upon the underlying single-site MH sampler (the informed sampler with proposal GB), while the Gibbs sampler performs the worst among the three kinds of samplers.

To better understand the performance of the momentum sampler, we investigate how the ESS changes as a function of the travel time $T$, the proposal distribution, and $\beta$. The results are summarized in Figure 2. From the figure, we can see that for the momentum sampler with proposals GB, LB1 and LB2, the ESS generally increases as $T$ increases. This effect is particularly pronounced for GB, and is potentially an indicator of the momentum sampler's ability to make global proposals by traversing long isoprobability contours, much like HMC. We also see that the best performance of the momentum sampler generally happens at intermediate values of $\beta$, demonstrating the trade-offs discussed in Section 2.3.2.

5 Conclusions and Future Directions

In this paper, we propose the momentum sampler as a general framework for exploring the idea of building discrete analogs of HMC. It connects and generalizes a variety of existing methods, and points to a more general framework where we can mix stochastic evolution of discrete variables with deterministic evolution of continuous variables. Numerical experiments on a simple 1D Potts model demonstrate the value of the auxiliary momentum process for single-site MH samplers, and the various trade-offs involved in picking different proposal distributions and kinetic energies.

The 1D Potts model used in this paper only serves as a proof of concept, to demonstrate the behavior of the momentum sampler. Comprehensive numerical experiments on more complicated discrete distributions should be an immediate next step. In addition, working out the more general framework indicated by the momentum sampler in full generality and establishing further connections with other existing methods remain exciting future directions.

Appendix A Proof of Theorem 1

a.1 Proof of the Theorem

Theorem 1.

(Detailed Balance) The TakeMomentumStep function satisfies detailed balance w.r.t. the joint invariant distribution $\rho$, i.e. the probability (under $\rho$) of starting in any measurable set $A$ and ending in any measurable set $B$ after one TakeMomentumStep call equals the probability of starting in $B$ and ending in $A$.

Proof.

For notational simplicity, we use $z = (x, y, p)$ and $z' = (x', y', p')$ to denote two points in $\mathcal{X} \times \mathbb{T}^n \times \mathbb{R}^n$.

Sequence of Proposals and Probabilistic Paths

If we start from $z$, for a given travel time $T$, a concrete run of the TakeMomentumStep function would involve a finite sequence of random proposals. Assume the length of the sequence is $M$. The sequence of random proposals can be denoted as $\mathcal{S} = (s^{(1)}, \ldots, s^{(M)})$. This sequence of proposals indicates that, for this particular run of TakeMomentumStep, we reach 0 or 1 at individual sites $M$ times, and at the $m$-th such time the system makes a proposal to go to the discrete state $s^{(m)}$ from the current discrete state.

If we fix the sequence of proposals $\mathcal{S}$, the TakeMomentumStep function in fact specifies a deterministic mapping, and would map $z$ to a single point $z'$. For each such sequence of proposals $\mathcal{S}$, we introduce an associated probabilistic path $\sigma$, which contains all the information of the system going from $z$ to $z'$ in time $T$ through the function TakeMomentumStep. Formally, $\sigma$ is specified by

  • The sequence of random proposals $\mathcal{S}$

  • The indices of the sites for the $M$ site visitations

  • The times of the $M$ site visitations

  • The discrete states of the system at the $M$ site visitations

  • The accept/reject decisions for the $M$ site visitations

  • The evolution of the location variables $y(t)$ and the momentum variables $p(t)$, $t \in [0, T]$. Note that these trajectories might have discontinuities; at such points we use $t^-$ to denote the left limit and $t^+$ the right limit.

Countable Number of Probabilistic Paths

In order for a probabilistic path to be valid, the different components of $\sigma$ have to interact with each other in the way determined by the TakeMomentumStep function; for example, between consecutive site visitations the location and momentum variables have to follow the Hamiltonian dynamics, and at each visitation the discrete states and momenta have to be consistent with the corresponding accept/reject decision. For $z$ and some given travel time $T$, we say a sequence of proposals $\mathcal{S}$ is compatible with $z$ and TakeMomentumStep if we can find a corresponding probabilistic path that is valid.

Not all sequences of proposals correspond to valid probabilistic paths. But even if we don't consider the compatibility of the sequence of proposals with $z$ and TakeMomentumStep, the set of all possible such sequences has only a countable number of elements. This is because we only need to look at sequences of finite length (because of the fixed travel time $T$), and all the individual proposals are on discrete state spaces with a finite number of states.

The above analysis indicates that for a given starting point $z$ and travel time $T$, running the TakeMomentumStep function would result in only a countable number of possible destinations $z'$. Furthermore, for each such $z'$, there are at most a countable number of probabilistic paths which bring $z$ to $z'$ in time $T$ through TakeMomentumStep.

Formally, given some travel time $T$ and a sequence of proposals $\mathcal{S}$, define $D_{\mathcal{S}}$ to be the set of starting points with which $\mathcal{S}$ is compatible. Use $\Phi_{\mathcal{S}}$ to denote the deterministic mapping defined by TakeMomentumStep for the given $\mathcal{S}$ in time $T$ (so that $D_{\mathcal{S}}$ represents the domain of the mapping $\Phi_{\mathcal{S}}$), and use $\Phi_{\mathcal{S}}(D_{\mathcal{S}})$ to denote the image of the mapping $\Phi_{\mathcal{S}}$. For a given discrete state $x$, use $\phi_{\mathcal{S}, x}$ to denote the deterministic mapping induced by $\Phi_{\mathcal{S}}$ on the continuous variables (i.e. $\Phi_{\mathcal{S}}(x, y, p) = (x', \phi_{\mathcal{S}, x}(y, p))$, where $x'$ is determined by $x$ and $\mathcal{S}$). Define $\mathcal{Z}(z, T)$ to be the set of possible destinations of TakeMomentumStep when starting from $z$ with travel time $T$; for $z'$ with $z' \in \mathcal{Z}(z, T)$, further define $\Sigma(z, z', T)$ to be the set of sequences of proposals that are compatible with $z$ and map $z$ to $z'$. Then both $\mathcal{Z}(z, T)$ and $\Sigma(z, z', T)$ have at most a countable number of elements.

Proof of Detailed Balance

For a given travel time $T$ and a sequence of proposals $\mathcal{S}$ compatible with $z$, we use $P_{\mathcal{S}}(z \to z')$ to denote the probability of going from $z$ to $z'$ through the associated probabilistic path. Since TakeMomentumStep defines a deterministic mapping for a given $\mathcal{S}$ and $T$, the only non-zero term is $P_{\mathcal{S}}(z \to \Phi_{\mathcal{S}}(z))$, which equals the product of the proposal probabilities along $\mathcal{S}$; for all other $z'$, we have $P_{\mathcal{S}}(z \to z') = 0$.

Using the above notation, for measurable sets $A$ and $B$, we can write the two sides of the detailed balance condition as

For a given travel time $T$ and a sequence of proposals $\mathcal{S}$, if $z \in D_{\mathcal{S}}$, then $\Phi_{\mathcal{S}}(z)$ is well defined. In Lemma 3, we prove that the absolute value of the determinant of the Jacobian of $\phi_{\mathcal{S}, x}$ is 1 wherever it is defined. Furthermore, the deterministic mapping $\Phi_{\mathcal{S}}$ is reversible, and there exists a sequence of proposals $\tilde{\mathcal{S}}$ s.t. the corresponding deterministic mapping brings $\Phi_{\mathcal{S}}(z)$ back to $z$.

In Lemma 4, we prove that

Using the above results, it’s not hard to see that

which proves the desired detailed balance property of TakeMomentumStep w.r.t. $\rho$.

a.2 Useful Lemmas

In this section, we prove a few useful lemmas to complete the proof of Theorem 1.

First, we prove two lemmas, similar to Lemma 1 and Lemma 2 in Section 5.1 of Mohasel_Afshar2015-uh .

Lemma 1.

(Refraction) Let $\Phi$ be a transformation on $\mathbb{T}^n \times \mathbb{R}^n$ that takes a unit mass located at $(y, p)$ and moves it with constant velocity $\big(k'(p_1), \ldots, k'(p_n)\big)$. Assume it reaches 0 or 1 at site $i$ first. Subsequently $y_i$ is wrapped to the opposite end of the unit interval, and $p_i$ is changed to $\tilde{p}_i$ (where $\Delta E$ is a constant and $\tilde{p}_i$ satisfies $k(\tilde{p}_i) = k(p_i) - \Delta E$, with the sign of $p_i$ preserved). The move is carried on, with the velocity at site $i$ changed to $k'(\tilde{p}_i)$, for the total time period $t$ till it ends in location $y'$ and momentum $p'$, before it reaches 0 or 1 again at any site. Then $\Phi$ is volume preserving, i.e. the absolute value of the determinant of its Jacobian is 1.

Proof.

Following the same argument as in the proof of Lemma 1 of Mohasel_Afshar2015-uh , we have

If we define , then

This implies that the absolute value of the determinant of the Jacobian of $\Phi$ is 1.
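As a sketch of the computation behind this proof (under the setup of Lemma 1 as stated above, which is itself a reconstruction): only the coordinate $i$ that visits the boundary behaves nontrivially, so it suffices to consider the map $(y_i, p_i) \mapsto (y'_i, p'_i)$. Writing $t_1$ for the time to reach the boundary (e.g. $t_1 = (1 - y_i)/k'(p_i)$ when the boundary at 1 is hit) and $\tilde{p}_i$ for the refracted momentum with $k(\tilde{p}_i) = k(p_i) - \Delta E$,

\[
p'_i = \tilde{p}_i, \qquad y'_i = k'(\tilde{p}_i)\,(t - t_1) = k'(\tilde{p}_i)\Big(t - \frac{1 - y_i}{k'(p_i)}\Big),
\]

so that

\[
\frac{\partial y'_i}{\partial y_i} = \frac{k'(\tilde{p}_i)}{k'(p_i)}, \qquad
\frac{\partial p'_i}{\partial p_i} = \frac{k'(p_i)}{k'(\tilde{p}_i)}, \qquad
\frac{\partial p'_i}{\partial y_i} = 0,
\]

and the Jacobian of this map is triangular with determinant 1; the remaining coordinates move with constant velocity and contribute unit-determinant blocks as well.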

Lemma 2.

(Reflection) Let $\Phi$ be a transformation on $\mathbb{T}^n \times \mathbb{R}^n$ that takes a unit mass located at $(y, p)$ and moves it with constant velocity $\big(k'(p_1), \ldots, k'(p_n)\big)$. Assume it reaches 0 or 1 at site $i$ first. Subsequently $p_i$ is changed to $-p_i$. The move is carried on, with the velocity at site $i$ changed to $k'(-p_i) = -k'(p_i)$, for the total time period $t$ till it ends in location $y'$ and momentum $p'$, before it reaches 0 or 1 at any site again. Then $\Phi$ is volume preserving, i.e. the absolute value of the determinant of its Jacobian is 1.

Proof.

Following the same argument as in the proof of Lemma 2 of Mohasel_Afshar2015-uh , we have

If we define , then