Roll-back Hamiltonian Monte Carlo

09/08/2017 · Kexin Yi, et al. · Harvard University

We propose a new framework for Hamiltonian Monte Carlo (HMC) on truncated probability distributions with smooth underlying density functions. Traditional HMC requires computing the gradient of the potential function associated with the target distribution, and therefore cannot be applied at full power to truncated distributions, which lack continuity and differentiability at the truncation boundary. In our framework, we introduce a sharp sigmoid factor into the density function to approximate the probability drop at the truncation boundary. The target potential function is replaced by a new potential that extends smoothly to the entire sample space, and HMC is then performed on this approximate potential. Our method, which we call roll-back HMC (RBHMC), is easy to implement, applies to a wide range of problems, and achieves computational efficiency comparable to other baseline methods on various sampling tasks. RBHMC also gives rise to a new approach for Bayesian inference on constrained spaces.


1 Introduction

Markov Chain Monte Carlo (MCMC) provides an important family of sampling tools for Bayesian statistical inference Brooks2011 . Among the various MCMC methods, Hamiltonian Monte Carlo (HMC) is one of the most widely adopted samplers for its ability to reduce random-walk behavior and efficiently explore the sample space Neal2011 ; Hoffman2014 ; Betancourt2017 ; Carpenter2016 . Given a smooth target distribution π(x), HMC maps it to a potential function U(x) by

U(x) = −log π(x)    (1)

and proposes new samples according to the motion of a particle moving under this potential. The particle's dynamics are governed by the Hamiltonian equations of motion Goldstein1965

dx/dt = ∂H/∂p,    dp/dt = −∂H/∂x    (2)

where x is the spatial position of the particle in the sample space, p is its momentum, and H is the Hamiltonian of the particle, defined as H(x, p) = U(x) + K(p) with kinetic energy K(p). At each iteration, a trajectory of the particle is simulated by discrete leapfrog updates of length L with step size ε. A new sample is proposed at the end point of the trajectory (see Neal2011 ).

In many applications it is desirable to draw samples from a truncated distribution rather than the full distribution. For example, some Bayesian models, such as Bayesian non-negative matrix factorization Schmidt2009 , assign truncated priors to the latent variables to fix their values within a constrained region. Many common probability distributions, such as the exponential and Beta distributions, are themselves defined on subregions of a larger space. Such truncated distributions can be characterized by an underlying density function f(x) and a region of interest (ROI) D, such that

π(x) ∝ f(x) · 1_D(x)    (3)

Simulation of Hamiltonian dynamics requires computing ∇U(x) at every point of the particle trajectory. For truncated distributions, therefore, HMC is no longer effective: even though the underlying density function may be globally differentiable, the truncated density is discontinuous at the truncation boundary. When the particle steps out of the ROI, the probability density drops to zero, the potential jumps to infinity, and the Hamiltonian dynamics becomes ill-defined. The sampler can still proceed by terminating the current trajectory and discarding the new sample, but this treatment lowers the acceptance rate, especially in high-dimensional spaces.

Several studies have been conducted on extending HMC to discontinuous distributions. Afshar et al. Afshar2015 proposed Reflective Hamiltonian Monte Carlo (RHMC), which incorporates reflection and refraction moves into the particle's dynamics at discontinuity boundaries. Pakman et al. developed an HMC method that samples from piecewise Gaussian distributions by exactly solving the Hamiltonian equations of motion Pakman2013 ; Pakman2014 . Lan et al. introduced spherical HMC for distributions whose constraint boundaries are specified by a q-norm of the sample parameters Lan2014 . Other baseline methods for sampling from truncated distributions include rejection sampling MacKay1998 and the Gibbs sampler Gelfand1992 .

The existing HMC-based methods either impose explicit boundary checks on the particle's dynamics Afshar2015 or require a specific form of the density function Pakman2014 or the domain boundary Lan2014 . Non-HMC methods often suffer from random-walk behavior Neal2011 . In this work, we introduce a new HMC framework that is both easy to implement and applicable to general distributions with arbitrary truncation boundaries. Motivated by the barrier functions of constrained optimization Lueberger1973 , we introduce a sigmoid function that approximates the drop in probability density, which effectively builds an infinitely rising inclined surface in the potential at the truncation boundary while leaving the target distribution almost unchanged within the region of interest. We call this method roll-back Hamiltonian Monte Carlo (RBHMC), because the particle rolls back whenever it reaches the boundary and climbs the rising surface.

2 Sigmoid approximator for the truncation boundary

In this section we introduce the key ingredient of our method: a sharp sigmoid function that approximates the probability drop at the truncation boundary. We assume the underlying density function f(x) is continuous and differentiable over the entire sample space, which we call smooth in this context. We begin with the one-dimensional unit step function and then generalize the approximator to arbitrary boundary shapes in higher dimensions.

2.1 Sigmoid approximation for the 1D step function

On the real line, a truncated distribution can be characterized by the density function f(x) multiplied by a unit step function at the truncation point x₀:

π(x) ∝ f(x) h(x − x₀)    (4)

where h(x) = 1 when x ≥ 0 and h(x) = 0 when x < 0. The ROI corresponding to the above expression is [x₀, ∞). In general, one can add multiple step functions at various locations to specify multiple truncation points. As an example, the exponential distribution can be written as π(x) = λ e^{−λx} h(x).

The sigmoid function

σ_k(x) = 1 / (1 + e^{−kx})    (5)

is continuous and differentiable on the entire real line, and its value increases monotonically from 0 at −∞ to 1 at +∞. As k → ∞, the sigmoid function converges to the unit step function located at the origin (figure 1a). This observation suggests that a smooth approximation of the truncated distribution can be obtained by replacing h(x) with σ_k(x) for a reasonably large k. An example of the approximated exponential distribution is shown in figure 1c. The sigmoid factor sharply flattens the exponentially rising head for negative x (red dashed curve), but causes very little change to the target density within the ROI. The precision of the approximation can always be matched to the requirement of any numerical purpose by increasing k. In practice, one can invert the sign of, or add a constant shift to, the argument of the sigmoid in order to create an approximate truncation at any other location.
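To make the 1D construction concrete, here is a minimal numerical sketch (assuming NumPy; the rate λ = 1 and the sharpness k = 50 are illustrative choices, not values from the paper) comparing the truncated exponential density f(x)h(x) with its smooth approximation f(x)σ_k(x):

```python
import numpy as np

def sigmoid(x, k):
    """Sharp sigmoid sigma_k(x) = 1 / (1 + exp(-k x))."""
    return 1.0 / (1.0 + np.exp(-k * x))

lam, k = 1.0, 50.0                      # rate and sharpness (illustrative)
x = np.linspace(-1.0, 5.0, 1000)

exact = lam * np.exp(-lam * x) * (x >= 0)        # f(x) h(x)
smooth = lam * np.exp(-lam * x) * sigmoid(x, k)  # f(x) sigma_k(x)

# Inside the ROI the two densities agree to high precision;
# increasing k tightens the agreement near the truncation point.
print(np.max(np.abs(exact - smooth)[x > 0.1]))   # ~6e-3 for k = 50
```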

Figure 1: (a) The sigmoid function σ_k(x). (b) The boundary potential associated with the sigmoid factor. (c) Continuous approximation of the exponential distribution (blue curve) obtained by multiplying the sigmoid factor into the underlying density function (red dashed curve).

2.2 Generalization to multi-dimensional sample spaces

In the 1D example, the sudden jump of the sigmoid function occurs at x = 0. We can use this property to generalize the approximator to arbitrary region boundaries in higher dimensions by introducing an intermediate mapping g such that the ROI is given by

D = {x : g(x) > 0}    (6)

Similar to the unit step function in 1D, the following modified sigmoid function

σ_k(g(x)) = 1 / (1 + e^{−k g(x)})    (7)

continuously approximates the region's indicator function 1_D(x). For example, suppose we want to approximate an arbitrary smooth density function f(x) on ℝ² truncated to the unit disk centered at the origin. We can choose, e.g., g(x) = 1 − ‖x‖². Then f(x) σ_k(g(x)) gives the smooth approximator of the target distribution.

In principle, the choice of g is not unique: any g that specifies the correct ROI will do. However, in order to create a sharp jump in probability at the region boundary, g should be sensitive to changes in each component of x at the boundary. We discuss the resulting numerical requirements in the following section.
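As a minimal sketch of this construction (the unit-disk ROI with the quadratic g from the example above; the sharpness k = 100 is an illustrative value):

```python
import numpy as np

def g_disk(x):
    """g(x) = 1 - ||x||^2: positive inside the unit disk, negative outside."""
    return 1.0 - np.sum(np.asarray(x)**2)

def smooth_indicator(x, k=100.0):
    """sigma_k(g(x)), a smooth approximation of the disk's indicator."""
    return 1.0 / (1.0 + np.exp(-k * g_disk(x)))

print(smooth_indicator([0.0, 0.0]))   # ~1: deep inside the ROI
print(smooth_indicator([2.0, 0.0]))   # ~0: outside the ROI
```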

3 Roll-back Hamiltonian Monte Carlo

With the sigmoid trick, we can now simulate the Hamiltonian dynamics and apply HMC to draw samples from the smooth distribution. In this section, we discuss the particle dynamics under the new approximate potential, and address some numerical issues associated with the simulation of Hamiltonian dynamics.

3.1 Hamiltonian dynamics near ROI boundary

The potential function associated with the approximate distribution is given by

Ũ(x) = −log f(x) + log(1 + e^{−k g(x)})    (8)

where the second term is called the boundary potential and denoted U_b(x), because it originates from the sigmoid approximator for the truncation boundary. The term U_b has the following desired properties. First, its value remains close to zero when g(x) > 0, so the particle dynamics within the ROI is essentially unaffected and the HMC updates preserve the original target distribution. Second, the potential puts up an infinitely rising inclined surface at the boundary (figure 1b). When the particle steps out of the ROI, it first climbs the inclined surface and then quickly rolls back in a reflective fashion. Without the current trajectory having to be restarted, the particle's motion is restricted to the region of interest.

We now consider the particle's dynamics near the truncation boundary. During each leapfrog step, the change in momentum caused by the boundary potential follows the direction of ∇g(x), which is perpendicular to the boundary. When k is very large, this term dominates the total gradient of Ũ outside the boundary, so the component of the momentum parallel to the boundary remains essentially unchanged. On the other hand, conservation of energy requires the particle to have the same kinetic energy before and after rolling back, so the perpendicular component of the momentum must be inverted with the same magnitude. The particle thus effectively experiences a reflection at the boundary (figure 2):

p′_∥ = p_∥,    p′_⊥ = −p_⊥    (9)
Figure 2: (a) Schematic sketch of the roll-back dynamics at the ROI boundary. Under the desired approximation setup, the angle of incidence equals the angle of reflection. (b) Simulated reflective trajectory for a 2D Gaussian distribution truncated by the x-axis.

We note that RBHMC leads to the same reflective dynamics at truncation boundaries as RHMC Afshar2015 . The difference between the two methods is that RHMC manually imposes the reflective move on the particle's dynamics, whereas in RBHMC it emerges naturally from the Hamiltonian dynamics under the approximate potential.
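The emergent reflection is easy to verify numerically. The sketch below (our own illustration, with arbitrary values of k and ε chosen to satisfy the requirements of section 3.2) integrates plain leapfrog steps under the approximate potential of a 2D standard Gaussian truncated to the upper half-plane, g(x) = x₂, and shows the perpendicular momentum flipping sign:

```python
import numpy as np

k, eps = 200.0, 0.002    # sharpness and step size (illustrative values)

def grad_U(x):
    """Gradient of U(x) = ||x||^2 / 2 + log(1 + exp(-k x_2)), i.e. g(x) = x_2."""
    s = 1.0 / (1.0 + np.exp(k * x[1]))    # sigma(-k g(x))
    return x + np.array([0.0, -k * s])

x = np.array([0.0, 0.5])
p = np.array([1.0, -1.0])                 # heading into the boundary
p = p - 0.5 * eps * grad_U(x)             # initial half step
for _ in range(500):                      # bare leapfrog integration
    x = x + eps * p
    p = p - eps * grad_U(x)

# p[1] has flipped sign after the roll-back, while p[0] changes only
# through the Gaussian pull; total energy is (approximately) conserved.
print(x, p)
```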

3.2 Numerical requirements on model and sampler parameters

In practice, we don’t have access to the exact continuous dynamics of the particle, and instead, the motion of the particle is simulated by the leapfrog algorithm based on discrete time steps. This leads to two major numerical issues that should be addressed for RBHMC to be correctly applied.

First, the slope of the inclined surface introduced by the boundary term should be large enough to dominate the original potential. Consider the gradient of Ũ, which is proportional to the change of momentum during a leapfrog update at point x:

∇Ũ(x) = −∇log f(x) − [k e^{−k g(x)} / (1 + e^{−k g(x)})] ∇g(x)    (10)

When g(x) < 0 and k is large, the second term (the gradient of the boundary term) is well approximated by −k ∇g(x), which shows that the slope of the inclined surface is proportional to k. To enforce the reflective roll-back and prevent the particle from penetrating too far into the forbidden region, the second term should dominate the first at the boundary:

k ‖∇g(x)‖ ≫ ‖∇log f(x)‖    (11)

This condition suggests that, besides k being large, g should also have a large, or at least non-vanishing, gradient at the boundary. Since the samples are drawn from the approximate distribution rather than the true distribution, a large k is also required for a precise approximation and good sample quality.

Second, in order to simulate the full roll-back process, the leapfrog step size cannot be too large: the particle must be able to feel the smooth rise in potential at the boundary. Assuming the kinetic energy is given by K(p) = ‖p‖²/2, the step size should satisfy

ε < ‖p‖ / (2 k ‖∇g(x)‖)    (12)

This upper bound can be derived from the following physical picture: by pushing forward one leapfrog step against the boundary, the particle should not be able to gain more potential energy than its incoming kinetic energy. One step moves the particle a distance ε‖p‖ against a slope of magnitude approximately k‖∇g(x)‖, so we require ε k ‖∇g(x)‖ ‖p‖ < ‖p‖²/2, from which the above upper bound follows. Failing to meet this constraint results in the particle picking up extra perpendicular momentum after bouncing off the boundary, leading to incorrect dynamics.
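As a quick sanity check of this bound (a sketch with arbitrary values; g(x) = x₂ so that ‖∇g‖ = 1):

```python
import numpy as np

k = 100.0                        # sigmoid sharpness
p = np.array([1.0, -1.0])        # incoming momentum
grad_g = np.array([0.0, 1.0])    # gradient of g(x) = x_2 at the boundary

# One step of size eps against a slope ~ k ||grad g|| gains potential energy
# ~ eps * k * ||grad g|| * ||p||; keeping this below ||p||^2 / 2 bounds eps.
eps_max = np.linalg.norm(p) / (2.0 * k * np.linalg.norm(grad_g))
print(eps_max)                   # ~0.0071: larger steps risk unstable roll-backs
```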

3.3 Implementation

We summarize the steps for implementing RBHMC as follows. Given the density function f(x) of the target distribution, write down the potential U(x) = −log f(x) and its gradient ∇U(x). For each truncation boundary, select a g that satisfies equation 6 and has a non-vanishing gradient at every point of the boundary. Then choose a reasonably large k, add U_b(x) = log(1 + e^{−k g(x)}) to the potential function and ∇U_b(x) to the gradient. Finally, choose the HMC step size ε to satisfy equation 12, and tune the trajectory length L to optimize the sampler's exploration over the region of interest.
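The recipe above translates into a short sampler. The following is a minimal sketch of our reading of it (not the authors' reference implementation), targeting a 2D standard Gaussian truncated to the unit disk with g(x) = 1 − ‖x‖²; the values of k, ε, and L are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
k, eps, L = 100.0, 0.003, 50           # sharpness, step size, trajectory length

def g(x):                              # ROI: unit disk, g(x) > 0 inside
    return 1.0 - x @ x

def grad_g(x):
    return -2.0 * x

def U(x):                              # -log f(x) plus the boundary potential
    return 0.5 * (x @ x) + np.log1p(np.exp(-k * g(x)))

def grad_U(x):
    s = 1.0 / (1.0 + np.exp(k * g(x)))   # sigma(-k g(x))
    return x - k * s * grad_g(x)

def rbhmc_step(x):
    """One HMC update (leapfrog + Metropolis correction) on the potential U."""
    p = rng.standard_normal(x.shape)
    H0 = U(x) + 0.5 * (p @ p)
    x_new, p_new = x.copy(), p - 0.5 * eps * grad_U(x)
    for _ in range(L):
        x_new = x_new + eps * p_new
        p_new = p_new - eps * grad_U(x_new)
    p_new = p_new + 0.5 * eps * grad_U(x_new)   # undo the extra half step
    H1 = U(x_new) + 0.5 * (p_new @ p_new)
    return x_new if np.log(rng.uniform()) < H0 - H1 else x

x = np.zeros(2)
samples = []
for _ in range(5000):
    x = rbhmc_step(x)
    samples.append(x)
samples = np.asarray(samples)
print(np.mean(np.sum(samples**2, axis=1) <= 1.0))   # fraction inside ROI, ~1
```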

In contrast to our method, RHMC requires computing the first intersection between the ongoing trajectory and the truncation boundary at every leapfrog step Afshar2015 . This can be very hard to implement because it requires an analytic expression for the intersection point starting from any x and p, especially when there are multiple boundaries. Our method is free from this complication and extends easily to arbitrarily complex truncation setups, since multiple truncation boundaries are combined by simply summing the corresponding boundary potential terms.

4 Empirical Study

In this section we present experimental results for RBHMC on various sampling tasks. We first show its performance as a general sampling method and demonstrate its ease of implementation on a 2D Gaussian distribution under various truncation boundaries. We then compare its computational efficiency with other HMC-based methods and investigate its application to Bayesian inference with an example of Bayesian non-negative matrix factorization.

4.1 Sampling from a 2D truncated Gaussian distribution

We first test RBHMC as a basic sampling tool on a 2D truncated Gaussian distribution. In general, HMC's advantage is most pronounced in higher dimensions, but here we stay in 2D in order to provide a straightforward visualization of the sampling results against the density function.

We run the sampler under five different truncation boundaries. As shown in figure 3, all sample histograms are properly normalized and compared to the true density function, which is chosen to be the 2D standard Gaussian N(0, I). Figure 3a shows samples from the full distribution with no truncation. Figures 3b and 3c show the results for single and multiple linear truncation boundaries. Figures 3d and 3e show samples drawn from a disk and a half-disk centered at the origin. Figure 3f shows the result under an open parabolic truncation boundary. In all six cases the sampler parameters are held fixed, and the histograms all match the density function very well. The boundary potential terms and their closed-form gradients for each truncation boundary are listed in table 1. As mentioned in the previous section, combining multiple truncation boundaries in RBHMC is a trivial matter of adding up the boundary terms and their gradients. As a demonstration of this property, the half-disk boundary in figure 3e is constructed by simply adding the boundary term for the straight line in 3b to that for the full disk in 3d.
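For instance, a sketch of how two boundary terms combine for a half-disk-like region (the specific g's here, a line g₁(x) = x₁ and a unit disk g₂(x) = 1 − ‖x‖², are assumptions for illustration; the boundaries in figure 3 may be placed differently):

```python
import numpy as np

def U_b(x, g, k=100.0):
    """Boundary potential log(1 + exp(-k g(x))) for a single boundary."""
    return np.log1p(np.exp(-k * g(x)))

g_line = lambda x: x[0]                     # half-plane x_1 > 0
g_disk = lambda x: 1.0 - np.sum(x**2)       # unit disk

def U_b_half_disk(x):
    """Half-disk ROI: simply the sum of the two boundary terms."""
    return U_b(x, g_line) + U_b(x, g_disk)

print(U_b_half_disk(np.array([0.5, 0.0])))   # ~0 inside the half disk
print(U_b_half_disk(np.array([-0.5, 0.0])))  # large in the wrong half-plane
```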

Figure 3: Sample histograms of the 2D Gaussian distribution under various truncation boundaries, drawn by RBHMC. Each square on the surface of the density function corresponds to a horizontal area. The boundary terms associated with each subfigure and their gradients are listed in table 1.
Table 1: Boundary potentials U_b(x) and gradients ∇U_b(x) for the truncation boundaries of figure 3. For the untruncated case (a), U_b = 0 and ∇U_b = 0; each remaining row follows from its boundary function g(x) through U_b(x) = log(1 + e^{−k g(x)}) and ∇U_b(x) = −[k e^{−k g(x)} / (1 + e^{−k g(x)})] ∇g(x).

4.2 Evaluating computational performance

We now test the computational performance of RBHMC using an experimental setup similar to that of the RHMC paper Afshar2015 . The potential function is given by the following piecewise expression

U(x) = xᵀAx for ‖x‖ ≤ R, and U(x) = +∞ otherwise    (13)

where ‖x‖ is the Euclidean norm of the vector x, A is a positive definite matrix, and R sets the truncation radius. We compare our method to RHMC and to a baseline HMC in which the current trajectory is terminated and the new sample rejected whenever the sampler steps out of the ROI. Since the potential is centered at the origin, a Markov chain that has reached the stationary distribution should have zero sample mean in every direction. For the HMC samplers, we take the mean of all drawn samples and compute its deviation from the origin to measure how well the sampler mixes. More specifically, we introduce the worst mean absolute error (WMAE):

WMAE = max_d | (1/n) Σ_{s=1}^{n} x_d^{(s)} |    (14)

The same error measure is used in the experiments of Afshar2015 .
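In code, this error measure is simply the largest coordinate-wise absolute sample mean (a sketch following the description above):

```python
import numpy as np

def wmae(samples):
    """Worst mean absolute error over an (n, d) array of draws.

    The target is centered at the origin, so each coordinate-wise sample
    mean should shrink toward zero as the chain mixes.
    """
    return np.max(np.abs(samples.mean(axis=0)))
```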

Under three different dimensionalities, each sampler is run for 10 rounds. For simplicity, we set the matrix A to be diagonal, with elements randomly selected with equal probability from a fixed set of values for each round. All samplers are initialized at the same random location. The results are shown in figure 4: the horizontal axis is the log-scale running time in seconds, and the vertical axis is the WMAE of all samples drawn up to the current iteration. The thick lines are the WMAE values averaged over the 10 rounds, and the color-shaded areas represent the confidence regions. The results suggest that RBHMC and RHMC perform better than baseline HMC in high dimensions, since the particle is then more likely to move out of the ROI, and baseline HMC always rejects these trajectories. In this particular set of experiments, RBHMC performs slightly better than RHMC at all dimensionalities. We re-emphasize that even though the computational costs of the two methods are comparable and depend strongly on the shape of the truncation boundaries, RBHMC requires much less effort to implement than RHMC.

Figure 4: Worst mean absolute error (WMAE) versus running time (in seconds) for RBHMC, RHMC, and baseline HMC at three different dimensionalities. In the highest-dimensional case, the baseline HMC is not able to accept any new samples. All three samplers share the same parameter settings.

4.3 Bayesian non-negative matrix factorization

Many Bayesian models place truncated prior distributions on their latent variables in order to constrain their values within a desired range. For example, consider the following simplified version of the Bayesian model for non-negative matrix factorization (NMF) Schmidt2009 :

X_{ij} | A, B ∼ N((AB)_{ij}, σ²)    (15)
A_{ij} ∼ Exponential(λ),    B_{ij} ∼ Exponential(λ)    (16)

where X is an N by M observation matrix. The goal is to factorize X into the product of an N by K matrix A and a K by M matrix B. Elements of both A and B are required to be non-negative. This constraint is imposed by the exponential priors on A and B, which cause a discontinuous drop in probability density at 0.

Here we demonstrate the use of RBHMC to sample from the posterior over A and B. Replacing each unit step function with a sigmoid factor, the approximate potential function for Bayesian NMF can be written as

Ũ(A, B) = (1/2σ²) Σ_{ij} (X_{ij} − (AB)_{ij})² + Σ_{ij} [λA_{ij} + log(1 + e^{−k A_{ij}})] + Σ_{ij} [λB_{ij} + log(1 + e^{−k B_{ij}})] + const    (17)
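A sketch of this potential and its gradient with respect to A (the gradient in B is analogous; the functions follow our reconstruction of equation 17, with σ the noise scale and λ the prior rate, and the default values are illustrative):

```python
import numpy as np

def nmf_potential(X, A, B, sigma=1.0, lam=1.0, k=100.0):
    """Approximate potential: Gaussian likelihood, exponential priors, and
    sigmoid boundary terms replacing the step functions at zero."""
    resid = X - A @ B
    likelihood = np.sum(resid**2) / (2.0 * sigma**2)
    prior = lam * (np.sum(A) + np.sum(B))
    boundary = np.sum(np.log1p(np.exp(-k * A))) + np.sum(np.log1p(np.exp(-k * B)))
    return likelihood + prior + boundary

def nmf_grad_A(X, A, B, sigma=1.0, lam=1.0, k=100.0):
    """Gradient of the approximate potential with respect to A."""
    resid = X - A @ B
    return -(resid @ B.T) / sigma**2 + lam - k / (1.0 + np.exp(k * A))
```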

We run the sampler on a synthetic data set of 6 by 6 images generated from the 4 base images shown in figure 5a. Each image is associated with 4 randomly drawn binary numbers representing the presence or absence of the corresponding base image in the data. The resulting observation matrix is the product of an N by 4 binary matrix and a 4 by 36 feature matrix whose rows are the rearranged pixel values of the base images. Gaussian noise is also added to the generated images (figure 5b). RBHMC simultaneously draws samples of A and B from the posterior. Figure 5c shows the 4 base images reconstructed from one sample of the matrix B drawn by RBHMC. The result indicates a clean restoration of the original base patterns in the posterior sample.

To further quantify the performance of the sampler, we use as the error measure the mean absolute difference between the observation matrix and the product of the sampled component matrices:

D(A, B) = (1/NM) Σ_{ij} | X_{ij} − (AB)_{ij} |    (18)

We compare the performance of RBHMC with the Gibbs sampler, the baseline inference method introduced in the original proposal of the model Schmidt2009 . Each sampler is run 10 times for 2000 iterations. The average mean absolute difference over the 10 rounds at each iteration is shown in figure 6. For RBHMC, the average difference over all rounds and all samples drawn after the 100th iteration is 0.4023819, and the standard deviation across the 10 rounds, averaged over the last 1900 iterations, is 0.0013394. For the Gibbs sampler, the average difference is 0.4065109 and the average standard deviation is 0.0013314. The results show a statistically significant difference in performance between the two methods, which suggests that RBHMC draws samples that better reconstruct the observation matrix than the Gibbs sampler, and therefore serves as a useful alternative inference method for the Bayesian model.

Figure 5: (a) Base images from which the observation matrix is generated; each bright pixel corresponds to the value 1 in the corresponding matrix element. (b) Examples of generated observation images. (c) Sampled base images from RBHMC.
Figure 6: Mean absolute difference versus iteration for RBHMC and the Gibbs sampler.

5 Conclusion

In this work, we introduced RBHMC, an extension of Hamiltonian Monte Carlo for sampling from truncated distributions. Despite some numerical restrictions, our discussion demonstrates the following advantages of the new method. First, RBHMC provides an easy-to-implement way to sample from arbitrarily truncated distributions in high dimensions, with the only requirement being that the underlying density function is smooth over the sample space. Second, RBHMC achieves computational efficiency comparable to other existing HMC-based methods on constrained spaces. Third, RBHMC gives rise to a new inference method for Bayesian models with truncated prior distributions. Furthermore, since HMC does not require conjugacy, our method has rich potential applications in non-conjugate constrained-space Bayesian models, a promising direction for future research.

References

  • [1] Steve Brooks, Andrew Gelman, Galin L. Jones, and Xiao-Li Meng. Handbook of Markov Chain Monte Carlo. Chapman and Hall/CRC, 2011.
  • [2] Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, volume 2, pages 113–162, 2011.
  • [3] Matthew D. Hoffman and Andrew Gelman. The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
  • [4] Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434, 2017.
  • [5] Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A. Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 20, 2016.
  • [6] Herbert Goldstein. Classical Mechanics. Pearson Education India, 1965.
  • [7] Mikkel N. Schmidt, Ole Winther, and Lars Kai Hansen. Bayesian non-negative matrix factorization. In International Conference on Independent Component Analysis and Signal Separation, pages 540–547. Springer, 2009.
  • [8] Hadi Mohasel Afshar and Justin Domke. Reflection, refraction, and Hamiltonian Monte Carlo. In Advances in Neural Information Processing Systems, pages 3007–3015, 2015.
  • [9] Ari Pakman and Liam Paninski. Auxiliary-variable exact Hamiltonian Monte Carlo samplers for binary distributions. In Advances in Neural Information Processing Systems, pages 2490–2498, 2013.
  • [10] Ari Pakman and Liam Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542, 2014.
  • [11] Shiwei Lan, Bo Zhou, and Babak Shahbaba. Spherical Hamiltonian Monte Carlo for constrained target distributions. In International Conference on Machine Learning, pages 629–637, 2014.
  • [12] David J.C. MacKay. Introduction to Monte Carlo methods. In Learning in Graphical Models, pages 175–204. Springer, 1998.
  • [13] Alan E. Gelfand, Adrian F.M. Smith, and Tai-Ming Lee. Bayesian analysis of constrained parameter and truncated data problems using Gibbs sampling. Journal of the American Statistical Association, 87(418):523–532, 1992.
  • [14] David G. Luenberger. Introduction to Linear and Nonlinear Programming. Addison-Wesley, Reading, MA, 1973.