Markov Chain Monte Carlo (MCMC) methods are widely used to sample from complex probability distributions. Recently, Hamiltonian Monte Carlo (HMC)Neal2012-cp has emerged as a powerful MCMC method to sample from continuous distributions. A fundamental limitation of HMC is that it can’t be applied to discrete distributions. There have been some recent efforts Pakman2013-gi ; Nishimura2017-wv ; Zhang2012-qm ; Dinh2017-td in applying HMC to discrete distributions. However, they often support only a limited family of distributions: Pakman2013-gi only works for binary distributions; Zhang2012-qm , without approximation, only works for binary pairwise models; and Dinh2017-td is specifically designed for phylogenetic trees. More importantly, there are no clear connections between the behavior of these methods and that of HMC: these methods are usually based on embedding the discrete state space into the continuum, and naively applying HMC (or its discontinuous Pakman2013-gi ; Mohasel_Afshar2015-uh ; Nishimura2017-wv and probabilistic path Dinh2017-td variants) on the continuous state space, while the keys to HMC’s success lie in its ability to exploit gradient information, to make global proposals by traversing long isoprobability contours, and to make use of the “momentum behavior” (i.e. accelerate in high probability/low energy regions and decelerate in low probability/high energy regions).
In this paper, we explore the idea of building discrete analogs of HMC. We propose the momentum sampler as a general framework for this exploration. The momentum sampler is a novel family of MCMC algorithms that extends arbitrary single-site Metropolis Hastings (MH) samplers for discrete distributions with an auxiliary momentum process. The auxiliary momentum process can be integrated exactly, resulting in a rejection-free sampling process. The various trade-offs involved in picking different proposal distributions and kinetic energies have clear physical interpretations.
In single-site MH samplers (sometimes also referred to as Metropolis-within-Gibbs samplers), each proposal only changes the value of a single random variable. Most of the commonly used MCMC methods for discrete distributions, including the random walk Metropolis sampler and the Gibbs sampler, fall within this family. In fact, many embedding-based HMC methods for discrete distributions are also closely connected to single-site MH samplers. In Section 4.3 ofNishimura2017-wv , the authors point out connections between their method and single-site MH samplers. In Section 3.1 of this paper, we show that the Gaussian binary HMC sampler proposed in Pakman2013-gi with travel time behaves almost the same as a systematic scan single-site random walk Metropolis sampler.
The momentum sampler has rich connections to existing methods. In Section 3, we prove the Gaussian/exponential binary HMC samplers proposed in Pakman2013-gi are special cases of the momentum sampler, and show that the momentum sampler further generalizes the existing discontinuous Pakman2013-gi ; Mohasel_Afshar2015-uh and probabilistic path Dinh2017-td variants of HMC: instead of working with only a uniform proposal (as in Dinh2017-td ), it allows the use of arbitrary proposals while maintaining the desired behavior of HMC. It also points to a more general framework where we can mix stochastic evolution of discrete variables with deterministic evolution of continuous variables.
The rest of the paper is organized as follows. In Section 2, we describe the momentum sampler, prove that it samples from the correct distribution, and discuss the various trade-offs involved in picking different proposal distributions and kinetic energies. In Section 3, we establish connections between the momentum sampler and a variety of existing methods Pakman2013-gi ; Dinh2017-td ; Mohasel_Afshar2015-uh . In Section 4, we present some numerical experiments on a simple 1D 6-state Potts model, to demonstrate the behavior of the momentum sampler with different choices of travel time, proposal distributions and kinetic energies. Finally, we give conclusions and future directions in Section 5.
2 The Momentum Sampler
2.1 The Algorithm
We want to sample from a target discrete distribution on (known up to a normalization constant), where
is a random vector, and. Use to denote single-site proposal distributions for a single-site MH sampler, where only when . In other words, the proposal distribution only changes the value of the random variable . We require each to be irreducible.
Formally, the momentum sampler operates on the expanded state space , where denotes the -dimensional flat torus. We are going to identify as the -dimensional hypercube with the 0’s and 1’s in different dimensions glued together. The auxiliary location variable and auxiliary momentum variable evolve according to the Hamiltonian dynamics, with a flat potential energy and a kinetic energy , where is some one-dimensional kinetic energy.
On a high level, the momentum sampler evolves according to the following dynamics: If , remains unchanged, and and evolve according to the Hamiltonian dynamics
If hits the either or at site (i.e. ), we propose a new using , calculate , and modify and accordingly
Because of the flat potential energy , we can exactly integrate the Hamiltonian dynamics for arbitrary kinetic energy . How to choose the kinetic energy for HMC is an active research area Livingstone2017-zd ; Betancourt2017-vx . In general, needs to be differentiable and symmetric around 0, have a well-defined inverse on (which we denote by , i.e. ), and we should be able to sample from . In this paper, we consider a simple family of kinetic energies that’s general enough to encompass all commonly used kinetic energies and clearly demonstrates the underlying physical interpretations. Use
to denote the generalized normal distribution with location, scale and shape , corresponds to , and .
A detailed description of the momentum sampler algorithm is given in Algorithm 1
2.2 The Momentum Sampler Samples from the Correct Distribution
In this section, we prove that the momentum sampler samples from the correct distribution . For this purpose, we show the momentum sampler preserves the joint invariant distribution , and establish its irreducibility and aperiodicity.
At each iteration, the resampling of can be seen as a Gibbs step, where we sample from the conditional distribution of given . This obviously preserves . So we only need to prove detailed balance of the TakeMomentumStep function w.r.t. . Formally, , TakeMomentumStep defines a transition probability kernel : and measurable
For all measurable, and , define . We have
(Detailed Balance) The TakeMomentumStep function satisfies detailed balance w.r.t. the joint invariant distribution , i.e. for any measurable sets ,
We defer the proof of this theorem to the supplementary materials. This theorem immediately implies that the momentum sampler preserves the joint invariant distribution .
In each iteration, since we can sample arbitrary values for the momentum variable , we can affect the location variable in arbitrary ways. This, coupled with the irreducibility of the proposal distributions , can typically guarantee the irreducibility of the momentum sampler. The potential problem of periodicity can be solved by randomly choosing the travel time for each TakeMomentumStep call. The above argument establishes the irreducibility and aperiodicity of the momentum sampler, and proves the momentum sampler samples from the correct distribution .
2.3 Various Trade-offs
2.3.1 Trade-offs in Picking Single-site MH Proposal Distributions
To understand the trade-offs in picking different single-site MH proposal distributions, we look at the key step in Algorithm 2: . Note that also appears in MH acceptance probability. Intuitively, it measures how far the ’s are from detailed balance.
The main trade-offs in picking the ’s are between how far the ’s are from detailed balance and how much information goes into the auxiliary momentum process:
On one extreme, we have the Gibbs sampler, which perfectly satisfies detailed balance. In this case, is always 0, no information goes into the auxiliary momentum process, and the momentum sampler is equivalent to the systematic scan Gibbs sampler.
On the other extreme, we have the random walk Metropolis sampler, which is “furthest away” from detailed balance. In this case, , and arguably the largest amount of information goes into the auxiliary momentum process.
In between the two extremes, we have various other choices. In many cases, we want the ’s to be “informed” to some degree. Similar problems have recently been considered in Zanella2017-ro . Thinking physically, these informed proposals can be seen as providing the analog of gradients information for discrete distributions. Depending on how far the ’s are from detailed balance, different amounts of information go into the auxiliary momentum process.
2.3.2 Trade-offs in Picking Kinetic Energies
We start with some illuminating calculations. Given , the velocity is given by . Regarding the magnitude of the velocity as a function of gives us . Taking the derivative of this function gives us .
The main trade-offs in picking kinetic energies are between how the velocity is affected by changes in the kinetic energy and how much kinetic energy we typically have:
One the one hand, to capture the momentum behavior of HMC, it’s desirable to have larger velocity when the kinetic energy increases. This requires to be convex, or more precisely . Furthermore, we want the changes in velocity to be sensitive to the changes in the kinetic energy. This typically points to having a larger .
On the other hand, we need to make use of the kinetic energy to overcome energy barriers , and intuitively, it’s desirable to typically have large kinetic energy. Calculating the expected kinetic energy points to having a smaller .
3 Connections to Existing Methods
3.1 Binary HMC Samplers are Special Cases of the Momentum Sampler
In this section, we prove that the Gaussian/exponential binary HMC samplers proposed in  are special cases of the momentum sampler. We use “visit site ” to denote hitting 0 at site for the binary HMC samplers, and hitting 0 or 1 at site for the momentum sampler. The key to proving this result is to understand that, in essence, the auxiliary momentum process in the momentum sampler and the embedding into the continuum in the binary HMC samplers are modeling the same two things: the time it takes to visit a site again (or more practically, at which site are we going to make the next proposal?), and how much total (potential and kinetic) energy we have at each site.
Formally, in order to establish the equivalences between the binary HMC samplers and certain momentum samplers, for a particular site , we need to study:
The distribution on the initial time it takes to visit site
The distribution on the initial total energy at site
After we visit site , the time it takes to visit site again if we have total energy at site
For two samplers, if we can establish the equivalence of the joint distribution on the initial time and initial total energy, and the equivalence of the time it takes to visit this site again, we would be able to establish the equivalence of the corresponding samplers.
To simplify the proof, for the momentum sampler, rather than using the discrete state space , we work with instead. For the Gaussian binary HMC sampler, we have
The Gaussian binary HMC sampler is equivalent to the momentum sampler with deterministic proposals and .
As a by-product of the above proposition, we also have
If, in addition to resampling , we also resample at every iteration, then the Gaussian binary HMC sampler with travel time is equivalent to a systematic scan single-site random walk Metropolis sampler, where the site-visitation order is a random permutation of and is refreshed after every site visits.
For the original Gaussian binary HMC sampler with travel time , since we don’t resample at each iteration, its behavior is not exactly the same as a systematic scan single-site random walk Metropolis sampler. Nevertheless, it’s easy to see that these two samplers are closely related.
For the exponential binary HMC sampler, we have
The exponential binary HMC sampler is equivalent to the momentum sampler with deterministic proposal and .
We defer the proofs of these propositions to the supplementary materials. These results also explain the observations in Pakman2013-gi that the exponential binary HMC sampler is less efficient than the Gaussian binary HMC sampler: looking back at the trade-offs discussed in Section 2.3.2, it’s easy to see that using induces undesirable momentum behavior (where we decelerate in high probability/low energy regions and accelerate in low probability/high energy regions), makes it easy for the process to get trapped in local energy minima, and makes the algorithm less efficient.
3.2 The Momentum Sampler Generalizes Discontinuous and Probabilistic Path HMCs
A foundational tool in applying HMC to discrete distributions is the discontinuous variant Pakman2013-gi ; Mohasel_Afshar2015-uh of HMC, which operates on piecewise continuous potential energies. The discontinuous HMC was first proposed in Pakman2013-gi , where the authors considered a simple piecewise continuous potential energy arising from embedding binary distributions into the continuum. The Hamiltonian dynamics was integrated exactly across the discontinuities. This was later generalized in Mohasel_Afshar2015-uh to arbitrary piecewise continuous potential energies with discontinuities across affine boundaries, and a modified Leapfrog algorithm was introduced to account for such discontinuities.
In Dinh2017-td , the authors introduced a further generalization, the probabilistic path variant of HMC. In addition to its successful applications to the space of phylogenetic trees, an important contribution of probabilistic path HMC is the introduction of probabilistic paths to Hamiltonian dynamics, mixing the stochastic evolution of discrete variables with the deterministic evolution of continuous variables. However, the authors only considered a simple uniform proposal distribution for the evolution of the discrete variables. Moreover, a version of the probabilistic path HMC algorithm which allows for discontinuities in the potential energy was presented, but detailed balance, irreducibility and aperiodicity were only established for the case of continuous potential energy.
The momentum sampler generalizes discontinuous Pakman2013-gi ; Mohasel_Afshar2015-uh and probabilistic path Dinh2017-td HMCs, by introducing probabilistic paths that can evolve according to arbitrary proposal distributions to Hamiltonian dynamics, while properly accounting for the corresponding discontinuities in the potential energy. Its detailed balance, irreducibility and aperiodicity are rigorously established. In the momentum sampler, we focus on a very simple auxiliary momentum process. However, it’s easy to see that the essential idea is applicable to more general cases, in which we have complicated auxiliary state spaces and nontrivial potential energies. This points to a much more general framework where we can introduce interesting interactions between stochastic evolution of discrete variables and deterministic evolution of continuous variables.
4 Numerical Experiments on a 1D Potts Model
4.1 Experimental Setup
In this section, we carry out numerical experiments on a simple 1D 6-state Potts model of size with periodic conditions, to demonstrate the behavior of the momentum sampler. More concretely, we consider the discrete distribution
controls the strength of correlations among neighboring locations. In our experiments, we use , which induces relatively strong correlations among neighboring locations.
For the momentum sampler, following Zanella2017-ro , we consider 4 different proposals: RW, GB, LB1, LB2. For , the four different proposals share the common form
For RW, GB, LB1, LB2, takes the form respectively. We experiment with and travel time .
For comparison, we also test four basic single-site MH samplers: random/systematic scan Gibbs samplers and random/systematic scan MH samplers with RW, GB, LB1, LB2 proposals (which we call “informed samplers”). For random scan samplers, at each iteration, we randomly pick a site and make a proposal using , while for systematic scan samplers, we repeatedly pick a random order and visit all 400 sites in this order to make proposals. Note that our informed samplers are different from the samplers considered in Zanella2017-ro : in Zanella2017-ro , they use proposal distributions of the form
which simultaneously pick a site to visit and propose a new state for this site; for our informed samplers, we rely on the random/systematic scan to pick a site , and use to make the proposal.
For the Potts model, a useful summary statistic is the “order parameter”. For a particular , the order parameter is defined as . The order parameter ranges from 0 (totally disordered) to 1 (totally ordered). In our experiments, for each run of the sampler, we would also calculate the corresponding sequence of order parameters.
To make a fair comparison of different samplers, we run each sampler until they make approximately site visits. For the momentum sampler with travel time , this translates to approximately (exactly for ) samples. We use the effective sample size (ESS) (of the order parameters) per site visits as our performance measure. The ESS is calculated using the effective_sample_size function in the Arviz python package Kumar2019-uh
. To improve estimation accuracy, for a particular sampler, we use the order parameters from 50 different runs to estimate the ESS. In our experiments, we always initialize the system in a totally ordered state, where all’s are in state 1.
Before going into the details of the experimental results, we want to emphasize that the goal of our experiments is not to argue for the efficiency of the momentum sampler, but rather to demonstrate its behavior. Since the momentum sampler is built on top of single-site MH samplers, it’s not hard to imagine having more efficient cluster Monte Carlo algorithms Ito1994-tp . Even if we only consider single-site MH algorithms, we can make use of the problem structure to design more efficient samplers: for our 1D Potts model, using a fixed sequential scan from 1 to 400 with GB can easily outperform all the above samplers. The momentum sampler provides a general framework to explore the idea of building discrete analogs of HMC, and the experiments here serve the purpose of demonstrating how its performance changes as we vary the travel time , proposal distributions and , and how the auxiliary momentum process affects the performances of the underlying single-site MH samplers.
Since we are working with a 1D model, we first visualize the evolution of the states and order parameters for three good-performing samplers: the systematic scan Gibbs sampler, the systematic scan informed sampler with proposal GB, and the momentum sampler with proposal GB, travel time and . These are in fact the best-performing Gibbs/informed/momentum samplers. We can clearly see that the momentum sampler performs the best, improving upon the underlying single-site MH sampler (the informed sampler with proposal GB), while the Gibbs sampler performs the worst among the three different kinds of samplers.
To better understand the performance of the momentum sampler, we investigate how the ESS changes as a function of the travel time , proposal distributions and . The results are summarized in Figure 2. From the figure, we can see that for the momentum sampler with proposals GB, LB1 and LB2, ESS generally increases as increases. This effect is particularly pronounced for GB, and is potentially a indicator of the momentum sampler’s ability to make global proposals by traversing long isoprobability contours, much like HMC. We also see that the best performance for the momentum sampler generally happens at around , demonstrating the trade-offs discussed in Section 2.3.2.
5 Conclusions and Future Directions
In this paper, we propose the momentum sampler as a general framework for exploring the idea of building discrete analogs of HMC. It connects and generalizes a variety of existing methods, and points to a more general framework where we can mix stochastic evolution of discrete variables with deterministic evolution of continuous variables. Numerical experiments on a simple 1D Potts model demonstrate the value of the auxiliary momentum process for single-site MH samplers, and the various trade-offs involved in picking different proposal distributions and kinetic energies.
The 1D Potts model used in this paper only serves as a proof-of-concept, to demonstrate the behavior of the momentum sampler. Comprehensive numerical experiments on more complicated discrete distributions should be an immediate next step. In addition, working out the more general framework indicated by the momentum sampler in full generality and establish further connections with other existing methods remain an exciting future direction.
Appendix A Proof of Theorem 1
a.1 Proof of the Theorem
(Detailed Balance) The TakeMomentumStep function satisfies detailed balance w.r.t. the joint invariant distribution , i.e. for any measurable sets ,
For notational simplicity, we use and to denote two points in .
Sequence of Proposals and Probabilistic Paths
If we start from , for a given travel time , a concrete run of the TakeMomentumStep function would involve a finite sequence of random proposals. Assume the length of the sequence is . The sequence of random proposals can be denoted as
This sequence of proposals indicates that, for this particular run of TakeMomentumStep, we reach 0 or 1 at individual sites times, and each time the system makes a proposal to go to the discrete state from the current discrete state.
If we fix , the TakeMomentumStep function in fact specifies a deterministic mapping, and would map to a single point . For each such sequence of proposals , we introduce an associated probabilistic path , which contains all the information of the system going from to in time through the function TakeMomentumStep. Formally, is specified by
The sequence of random proposals
The indices of the sites for the site visitations
The times of the site visitations
The discrete states of the system at site visitations
Accept/reject decisions for the site visitations , where
The evolution of the location variables and the momentum variables . Note that we might have discontinuities in . We use to denote the left limit and to denote the right limit.
Countable Number of Probabilistic Paths
In order for a probabilistic path to be valid, the different components of have to interact with each other in a way as determined by the TakeMomentumStep function. For example, we should have and
For and some given travel time , we say a sequence of proposals is compatible with and TakeMomentumStep if we can find a corresponding probabilistic path that’s valid.
Not all sequences of proposals correspond to valid probabilistic paths. But even if we don’t consider the compatibility of the sequence of proposals with and TakeMomentumStep, the set of all possible such sequences has only a countable number of elements. This is because we only need to look at sequences of finite length (because of the fixed travel time ), and all the individual proposals are on discrete state spaces with a finite number of states.
The above analysis indicates that for some starting point and travel time , running the TakeMomentumStep function would result in only a countable number of possible destinations . Furthermore, for which , there are at most a countable number of probabilistic paths which bring to in time through TakeMomentumStep.
Formally, given some travel time and a sequence of proposals , define
Use to denote the deterministic mapping defined by TakeMomentumStep for the given in time (so that represents the domain of the mapping ), and use
to denote the image of the mapping . For a given , use
to denote the deterministic mapping induced by on . (i.e. , , where ). Define
for which , further define
Then both and have at most a countable number of elements.
Proof of Detailed Balance
For a given travel time and a sequence of proposals , , we use to denote the probability of going from to through the probabilistic path . Since TakeMomentumStep defines a deterministic mapping for given and , the only non-zero term is . For all , we have .
Using the above notation, and measurable, we can write as
For a given travel time , , if , then . In Lemma 3, we prove that , the absolute value of the determinant of the Jacobian of is , for all where . Furthermore, the deterministic mapping is reversible, and there exists a sequence of proposals , s.t. .
In Lemma 4, we prove that
Using the above results, it’s not hard to see that
which proves the desired detailed balance property of TakeMomentumStep w.r.t.
a.2 Useful Lemmas
In this section, we prove a few useful lemmas to complete the proof of Theorem 1.
First, we prove two lemmas, similar to Lemma 1 and Lemma 2 in Section 5.1 of Mohasel_Afshar2015-uh .
(Refraction) Let be a transformation in that takes a unit mass located at and moves it with constant velocity . Assume it reaches 0 or 1 at site first. Subsequently is changed to , and is changed to (where is a constant and satisfies ). The move is carried on, with the velocity changed to , for the total time period till it ends in location and momentum , before it reaches 0 or 1 again at any sites. Then is volume preserving, i.e. the absolute value of the determinant of its Jacobian .
Following the same argument as in the proof of Lemma 1 of Mohasel_Afshar2015-uh , we have
If we define , then
(Reflection) Let be a transformation in that takes a unit mass located at and moves it with constant velocity . Assume it reaches 0 or 1 at site first. Subsequently is changed to . The move is carried on, with the velocity changed to , for the total time period till it ends in location and momentum , before it reaches 0 or 1 at any sites again. Then is volume preserving, i.e. the absolute value of the determinant of its Jacobian .
Following the same argument as in the proof of Lemma 2 of Mohasel_Afshar2015-uh , we have
If we define , then