NIPS2016
This project collects the different accepted papers and their link to Arxiv or Gitxiv
view repo
We unify slice sampling and Hamiltonian Monte Carlo (HMC) sampling, demonstrating their connection via the Hamiltonian-Jacobi equation from Hamiltonian mechanics. This insight enables extension of HMC and slice sampling to a broader family of samplers, called Monomial Gamma Samplers (MGS). We provide a theoretical analysis of the mixing performance of such samplers, proving that in the limit of a single parameter, the MGS draws decorrelated samples from the desired target distribution. We further show that as this parameter tends toward this limit, performance gains are achieved at a cost of increasing numerical difficulty and some practical convergence issues. Our theoretical results are validated with synthetic data and real-world applications.
READ FULL TEXT VIEW PDF
Hamiltonian Monte Carlo (HMC) is an efficient Bayesian sampling method t...
read it
In this paper, we develop Bayesian Hamiltonian Monte Carlo methods for
i...
read it
Gradient-based Monte Carlo sampling algorithms, like Langevin dynamics a...
read it
We give numerical evidence that the standard leapfrog algorithm may not ...
read it
Slice Sampling has emerged as a powerful Markov Chain Monte Carlo algori...
read it
In this paper, we propose a new sampling method named as the
thermostat-...
read it
Hamiltonian Monte Carlo (HMC) has been widely adopted in the statistics
...
read it
This project collects the different accepted papers and their link to Arxiv or Gitxiv
Markov Chain Monte Carlo (MCMC) sampling [1] stands as a fundamental approach for probabilistic inference in many computational statistical problems. In MCMC one typically seeks to design methods to efficiently draw samples from an unnormalized density function. Two popular auxiliary-variable sampling schemes for this task are Hamiltonian Monte Carlo (HMC) [2, 3] and the slice sampler [4]. HMC exploits gradient information to propose samples along a trajectory that follows Hamiltonian dynamics [3], introducing momentum as an auxiliary variable. Extending the random proposal associated with Metropolis-Hastings sampling [4], HMC is often able to propose large moves with acceptance rates close to one [2]. Recent attempts toward improving HMC have leveraged geometric manifold information [5] and have used better numerical integrators [6]. Limitations of HMC include being sensitive to parameter tuning and being restricted to continuous distributions. These issues can be partially solved by using adaptive approaches [7, 8], and by transforming sampling from discrete distributions into sampling from continuous ones [9, 10].
Seemingly distinct from HMC, the slice sampler [4]
alternates between drawing conditional samples based on a target distribution and a uniformly distributed slice variable (the auxiliary variable). One problem with the slice sampler is the difficulty of solving for the slice interval,
i.e., the domain of the uniform distribution, especially in high dimensions. As a consequence, adaptive methods are often applied [4]. Alternatively, one recent attempt to perform efficient slice sampling on latent Gaussian models samples from a high-dimensional elliptical curve parameterized by a single scalar [11]. It has been shown that in some cases slice sampling is more efficient than Gibbs sampling and Metropolis-Hastings, due to the adaptability of the sampler to the scale of the region currently being sampled [4].Despite the success of slice sampling and HMC, little research has been performed to investigate their connections. In this paper we use the Hamilton-Jacobi equation from classical mechanics to show that slice sampling is equivalent to HMC with a (simply) generalized kinetic function. Further, we also show that different settings of the HMC kinetic function correspond to generalized slice sampling, with a non-uniform conditional slicing distribution. Based on this relationship, we develop theory to analyze the newly proposed broad family of auxiliary-variable-based samplers. We prove that under this special family of distributions for the momentum in HMC, as the distribution becomes more heavy-tailed, the one-step autocorrelation of samples from the target distribution converges asymptotically to zero, leading to potentially decorrelated samples. While of limited practical
impact, this theoretical result provides insights into the properties of the proposed family of samplers. We also elaborate on the practical tradeoff between the increased computational complexity associated with improved theoretical sampling efficiency. In the experiments, we validate our theory on both synthetic data and with real-world problems, including Bayesian Logistic Regression (BLR) and Independent Component Analysis (ICA), for which we compare the mixing performance of our approach with that of standard HMC and slice sampling.
A Hamiltonian system consists of a kinetic function with momentum variable , and a potential energy function with coordinate
. We elaborate on multivariate cases in the Appendix. The dynamics of a Hamiltonian system are completely determined by a set of first-order Partial Differential Equations (PDEs) known as
Hamilton’s equations [12]:(1) |
where is the Hamiltonian, and is the system time. Solving (1) gives the dynamics of and as a function of system time . In a Hamiltonian system governed by (1), is a constant for every [12]. A specified , together with the initial point , defines a Hamiltonian trajectory , in space.
It is well known that in many practical cases, a direct solution to (1) may be difficult [13]. Alternatively, one might seek to transform the original HMC system to a dual space in hope that the transformed PDEs in the dual space becomes simpler than the original PDEs in (1). One promising approach consists of using the Legendre transformation [12, 14]. This family of transformations defines a unique mapping between primed and original variables, where the system time, , is identical. In the transformed space, the resulting dynamics are often simpler than the original Hamiltonian system.
An important property of the Legendre transformation is that the form of (1) is preserved in the new space [15], i.e., To guarantee a valid Legendre transformation between the original Hamiltonian system and the transformed Hamiltonian system , both systems should satisfy the Hamilton’s principle [13], which equivalently express Hamilton’s equations (1). The form of this Legendre transformation is not unique. One possibility is to use a generating function approach [13], which requires the transformed variables to satisfy , where
follows from the chain rule and
is a Type-2 generating function defined as [15], with being the Hamilton’s principal function [16], defined below. The following holds due to the independency of , and in the previous transformation (after replacing by its definition):(2) |
We then obtain the desired Legendre transformation by setting . The resulting (2) is known as the Hamilton-Jacobi equation (HJE). We refer the reader to [13, 12] for extensive discussions on the Legendre transformation and HJE.
Recall from above that the Legendre transformation preserves the form of (1). Since , are time-invariant (constant for every ). Importantly, the time-invariant point corresponds to a Hamiltonian trajectory in the original space, and it defines the initial point in the original space ; hence, given , one may update the point along the trajectory by specifying the time . A new point in the original space along the Hamiltonian trajectory, with system time , can be determined from the transformed point via solving (2).
One typically specifies the kinetic function as [2], and Hamilton’s principal function as , where is a function to be determined (defined below). From (2), and the definition of , we can write
(3) |
where the second equality is obtained by replacing and the third equality by replacing from (2) into . From (3), represents the total Hamiltonian in the original space , and uniquely defines a Hamiltonian trajectory in .
Define as the slice interval, which for constant corresponds to a set of valid coordinates in the original space . Solving (3) for gives
(4) |
where and is a constant. In addition, from (2) we have
(5) |
where the second equality is obtained by substituting by its definition and the third equality is obtained by applying Fubini’s theorem on (4). Hence, for constant , equation (5) uniquely defines in the original space, for a specified system time .
Suppose we are interested in sampling a random variable
from an unnormalized density function , where is the potential energy function. Hamiltonian Monte Carlo (HMC) augments the target density with an auxiliary momentum random variable , that is independent of . The distribution of is specified as , where is the kinetic energy function. Define as the Hamiltonian. We have omitted the dependency of , and on the system time for simplicity. HMC iteratively performs dynamic evolving and momentum resampling steps, by sampling from the target distribution and from the momentum distribution (Gaussian as ), respectively, for iterations. Figure 1 illustrates two iterations of this procedure. Starting from point at the -th (discrete) iteration, HMC leverages the Hamiltonian dynamics, governed by Hamilton’s equations in (1) to propose the next sample , at system time . The position in HMC at iteration is updated as (dynamic evolving). A new momentumis resampled independently from a Gaussian distribution (assuming
), establishing the next initial point for iteration (momentum resampling). The latter point corresponds to the initial point of a new trajectory because the Hamiltonian is commensurately updated. This means that trajectories correspond to distinct values of .Typically, numerical integrators such as the leap-frog method [2] are employed to numerically approximate the Hamiltonian dynamics. In practice, a random number (uniformly drawn from a fixed range) of discrete numerical integration steps (leap-frog steps) are often used (corresponding to random time along the trajectory), which has been shown to have better convergence properties than a single leap-frog step [17]. The discretization error introduced by the numerical integration is corrected by a Metropolis Hastings (MH) step.
Slice sampling is conceptually simpler than HMC. It augments the target unnormalized density with a random variable
, with joint distribution expressed as
, s.t. , where is the normalization constant, and the marginal distribution ofexactly recovers the target normalized distribution
. To sample from the target density, slice sampling iteratively performs a conditional sampling step from and sampling a slice from . At iteration , starting from , a slice is uniformly drawn from . Then, the next sample , at iteration , is uniformly drawn from the slice interval .HMC and slice sampling both augment the target distribution with auxiliary variables
and can propose long-range moves with high acceptance probability.
Consider the dynamic evolving step in HMC, i.e., in Figure 1. From Section 2, the Hamiltonian dynamics in space with initial point can be performed by mapping to space and updating via selecting a and solving (5). As we show in the Appendix, from (5) and in univariate cases^{*}^{*}*For multidimensional cases, the Hamiltonian dynamics are semi-periodic, yet a similar conclusion still holds. Details are discussed in the Appendix. the Hamiltonian dynamics has period and is symmetric along (due to the symmetric form of the kinetic function). Also from (5), the system time, , is specified uniformly sampled from a half-period of the Hamiltonian dynamics. i.e., . Intuitively, is the “anchor” of the initial point , w.r.t. the start of the first half period, i.e, when . Further, we only need consider half a period because for a symmetric kinetic function, , the Hamiltonian dynamics for the two half-periods are mirrored [15]. For the same reason, Figure 1 only shows half of the space, when .
Given the sampled and the constant , equation (5) can be solved for , i.e., the value of at time . Interestingly, the integral in (5) can be interpreted as (up to normalization constant) a cumulative density function (CDF) of . From the inverse CDF transform sampling method, uniformly sampling from half of a period and solving for from (5), are equivalent to directly sampling from the following density
(6) |
We note that this transformation does not make the analytic solution of generally tractable. However, it provides the basic setup to reveal the connection between the slice sampler and HMC.
In the momentum resampling step of HMC, i.e., in Figure 1, and using the previously described kinetic function, , resampling corresponds to drawing from a Gaussian distribution [2].
The algorithm to analytically sample from the HMC (analytic HMC) proceeds as follows: at iteration , momentum is drawn from a Gaussian distribution. The previously sampled value of and the newly sampled yield a Hamiltonian . Then, the next sample is drawn from (6). This procedure relates HMC to the slice sampler. To clearly see the connection, we denote . Instead of directly sampling as just described, we sample instead. By substituting with in (6), the conditional updates for this new sampling procedure can be rewritten as below, yielding the HMC slice sampler (HMC-SS), with conditional distributions defined as
(7) | |||
(8) |
where (other values of considered below), is an unnormalized density, and and are the normalization constants.
Comparing these two procedures, analytic HMC and HMC-SS, we see that the resampling momentum in analytic HMC corresponds to sampling a slice in HMC-SS. Further, the dynamic evolving in HMC corresponds to the conditional sampling in MG-SS. We have thus shown that HMC can be equivalently formulated as a slice sampler procedure via (7) and (8).
In standard slice sampling (described in Section 3.1), both conditional sampling and sampling a slice are drawn from uniform distributions. However those for HMC-SS in (7) and (8) represent non-uniform distributions. Interestingly, if we change in (7) and (8) from to , we obtain the desired uniform distributions for standard slice sampling. This key observation leads us to consider a generalized form of the kinetic function for HMC, described below.
Consider the generalized family of kinetic functions with . One may rederive equations (3)-(8) using this generalized kinetic energy. As shown in the Appendix, these equations remained unchanged, with the update that each isolated in these equations is replaced by , and is replaced by .
Sampling (for the momentum resampling step) with the generalized kinetics, corresponds to drawing from , with . All the formulation in the paper still holds for arbitrary , see Appendix for details. We denote this distribution the monomial Gamma (MG) distribution, MG, where is the mass parameter, and is the monomial parameter. Note that this is equivalent to the exponential power distribution with zero-mean, described in [18]. We summarize some properties of the MG distribution in the Appendix.
To generate random samples from the MG distribution, one can draw and a uniform sign variable , then follows the MG distribution. We call the HMC sampler based on the generalized kinetic function, : Monomial Gamma Hamiltonian Monte Carlo (MG-HMC). The algorithm to analytically sample from the MG-HMC is shown in Algorithm 1. The only difference between this procedure and the previously described is the momentum resampling step, in that for analytic HMC, is drawn Gaussian instead of MG. However, note that the Gaussian distribution is a special case of MG when .
Interestingly, when , the Monomial Gamma Slice sampler (MG-SS) in Algorithm 2 recovers exactly the same update formulas as in standard slice sampling, described in Section 3.1, where the conditional distributions in (7) and (8) are both uniform. When , we have to iteratively alternate between sampling from non-uniform distributions (7) and (8), for both auxiliary (slicing) variable and target variable .
Using the same argument from the convergence analysis of standard slice sampling [4], the iterative sampling procedure in (7) and (8), converges to an invariant joint distribution (detailed in the Appendix). Further, the marginal distribution of recovers the target distribution as , while the marginal distribution of is given by .
The MG-SS can be divided into three broad regimes: and (illustrated in the Appendix). When , the conditional distribution
is skewed towards the current unnormalized density value
. The conditional draw of encourages taking samples with smaller density value (inefficient moves), within the domain of the slice interval . On the other hand, when , draws of tend to take smaller values, while draws of encourage sampling from those with large density function values (efficient moves). The case corresponds to the conventional slice sampler. Intuitively, setting to be small makes the auxiliary variable, , stay close to , thus is close to . As a result, a larger seems more desirable. This intuition is justified in the following sections.We analyze theoretical properties of the MG sampler. All the proofs as well as the ergodicity properties of analytic MG-SS are given in the Appendix.
One-step autocorrelation of analytic MG-SS We present results on the univariate distribution case: . We first investigate the impact of the monomial parameter on the one-step autocorrelation function (ACF), , as . Theorem 1 characterizes the limiting behavior of .
For a univariate target distribution, if is thrice differentiable with bounded third-order derivative, and has finite integral over , the one-step autocorrelation of the MG-SS parameterized by , asymptotically approaches zero as , i.e., .
In the Appendix we also show that . In addition, we show that is a non-negative decreasing function of the time lag in discrete steps .
Effective sample size
The variance of a Monte Carlo estimator is determined by its Effective Sample Size (ESS)
[19], defined as , where is the total number of samples, is the -step autocorrelation function, which can be calculated in a recursive manner. We prove in the Appendix that is non-negative. Further, assuming the MG sampler is uniformly ergodic and is monotonically decreasing, it can be shown that . When ESS approaches full sample size, , the resulting sampler delivers excellent mixing efficiency [5]. Details and further discussion are provided in the Appendix.Case study
To examine a specific 1D example, we consider sampling from the exponential distribution,
, with energy function given by , where . This case has analytic and ESS. After some algebra (details in the Appendix),These results are in agreement with Theorem 1 and related arguments of ESS and monotonicity of autocorrelation w.r.t. . Here denotes the expectation of the -lag sample, starting from any . The relative difference decays exponentially in , with a factor of . In fact, the for the exponential family class of models introduced in [20], with potential energy , where , can be analytically calculated. The result, provided in the Appendix, indicates that for this family, decays at a rate of .
MG-HMC mixing performance In theory, the analytic MG-HMC (the dynamics in (5) can be solved exactly) is expected to have the same theoretical properties of the analytic MG-SS for unimodal cases, since they are derived from the same setup. However, the mixing performance of the two methods could differ significantly when sampling from a multimodal distribution, due to the fact that the Hamiltonian dynamics may get “trapped” into a single closed trajectory (one of the modes) with low energy, whereas the analytic MG-SS does not suffer from this problem as is able to sample from disjoint slice intervals (one per mode). This is a well-known property of slice sampling [4] that arises from (7) and (8). However, if is large enough, as we show in the Appendix, the probability of getting into a low-energy level associated with more than one Hamiltonian trajectory, which restrict movement between modes, is arbitrarily small. As a result, the analytic MG-HMC with large value of is able to approach the stationary mixing performance of MG-SS.
MG-HMC with numerical integrator In practice, MG-SS (performing Algorithm 2) requires: 1) analytically solving for the slice interval , which is typically infeasible for multivariate cases [4]; or 2) analytically computing the integral over , implied by the non-uniform conditionals from MG-SS. These are usually computationally infeasible, though adaptive estimation of could be done using schemes like “doubling” and “shrinking” strategies from the slice sampling literature [4].
It is more convenient to perform approximate MG-HMC using a numerical integrator like in traditional HMC, i.e., in each iteration, the momentum is first initialized by sampling from , then second order Störmer-Verlet integration [2] is performed for the Hamiltonian dynamics updates:
(9) |
where . When , for any dimension , independent of and . To avoid moving on a grid when , we employ a random step-size from a uniform distribution within non-negative range , as suggested in [2].
No free lunch With a numerical integrator for MG-HMC, however, the argument about choosing large (of great theoretical advantage as discussed in the previous section) may face practical issues.
First, a large value of will lead to a less accurate numerical integrator. This is because as gets larger, the trajectory of the total Hamiltonian becomes “stiffer”, i.e., that the maximum curvature becomes larger. When , the Hamiltonian trajectory in the phase space, , has at least (
denotes the total dimension) non-differentiable points (“turnovers”), at each intersection point with the hyperplane
. As a result, directly applying Störmer-Verlet integration would lead to high integration error as becomes large.Second, if the sampler is initialized in the tail region of a light-tailed target distribution, MG-HMC with may converge arbitrarily slow to the true target distribution, i.e., the burn-in period could take arbitrarily long time. For example, with , can be very large when is in the light-tailed region, leading the update to be arbitrary close to , i.e., the sampler does not move.
To ameliorate these issues, we provide mitigating strategies. For the first (numerical) issue, we propose two possibilities: 1) As an analog to the “reflection” action of [2], in (9), whenever the -th dimension(s) of the momentum changes sign, we “recoil” the point of these dimension(s) to the previous iteration, and negate the momentum of these dimension(s), i.e., , . 2) Substituting the kinetic function with a “softened” kinetic function, and use importance sampling to sample the momentum. The details and comparison between the “reflection” action and “softened” kinetics are discussed in the Appendix.
For the second (convergence) issue, we suggest using a step-size decay scheme, e.g., . In our experiments we use , where is problem-specific. This approach empirically alleviates the slow convergence problem, however we note that a more principled way would be adaptively selecting during sampling, which is left for further investigation.
As a compromise between theoretical gains and practical issues, we suggest setting (HMC implementation of a slice sampler) when the dimension is relatively large. This is because in our experiments, when , numerical errors and convergence issues tend to overwhelm the theoretical mixing performance gains described in Section 4.
[figure]style=plain,subcapbesideposition=top
1D unimodal problems We first evaluate the performance of the MG sampler with several univariate distributions: 1) Exponential distribution, . 2) Truncated Gaussian,
. 3) Gamma distribution,
. Note that the performance of the sampler does not depend on the scale parameter . We compare the empirical and ESS of the analytic MG-SS and MG-HMC with their theoretical values. In the Gamma distribution case, analytic derivations of the autocorrelations and ESS are difficult, thus we resort to a numerical approach to compute and ESS. Details are provided in the Appendix. Each method is run for 30,000 iterations with 10,000 burn-in samples. The number of leap-frog steps is set to be uniformly drawn from with , as suggested by [17]. We also compared MG-HMC () with standard slice sampling using doubling and shrinking scheme [4] As expected, the resulting ESS (not shown) for these two methods is almost identical. The experiment settings and results are provided in the Appendix. The acceptance rates decrease from around to around for each case, when grows from to , as shown in Figure 2(a)-(d),The results for analytic MG-SS match well with the theoretical results, however MG-HMC seems to suffer from practical difficulties when is large, evidenced by results gradually deviating from the theoretical values. This issue is more evident in the Gamma case (see Figure 2(e)), where first decreases then increases. Meanwhile, the acceptance rates decreases from to .
1D and 2D bimodal problems We further conduct simulation studies to evaluate the efficiency of MG-HMC when sampling 1D and 2D multimodal distributions. For the univariate case, the potential energy is given by ; whereas in the bivariate case. We show in the Appendix that if the energy functions are symmetric along , where is a constant, in theory, the analytic MG-SS will have ESS equal to the total sample size. However, as shown in Section 4, the analytic MG-HMC is expected to have an ESS less than its corresponding analytic MG-SS, and the gap between the analytic MG-HMC and analytic MG-SS counterpart should decrease with . As a result, despite numerical difficulties, we expect the MG-HMC based on numerical integration to have better mixing performance with large .
1D | ESS | |
---|---|---|
5175 | 0.60 | |
10157 | 0.43 | |
24298 | 0.11 |
2D | ESS | |
---|---|---|
4691 | 0.67 | |
16349 | 0.60 | |
18007 | 0.53 |
To verify our theory, we run MG-HMC for for 30,000 iterations with 10,000 burn-in samples. The parameter settings and the acceptance rates are detailed in the Appendix. Empirically, we find that the efficiency of HMC is significantly improved with a large as shown in Table 1, which coincides with the theory in Section 4.
From Figure 3, we observe that the MG-HMC sampler with monomial parameter performs better at jumping between modes of the target distribution, when compared to standard HMC, which confirms the theory in Section 4. We also compared MG-HMC () with standard SS [4]. As expected, in the 1D case, the standard SS yields ESS close to full sample size, while in 2D case, the resulting ESS is lower than MG-HMC () (details are provided in the Appendix).
Bayesian logistic regression We evaluate our methods on 6 real-world datasets from the UCI repository [21]: German credit (G), Australian credit (A), Pima Indian (P), Heart (H), Ripley (R) and Caravan (C) [22]. Feature dimensions range from to , and total data instances are between to . All datasets are normalized to have zero mean and unit variance. Gaussian priors are imposed on the regression coefficients. We draw 5000 iterations with 1000 burn-in samples for each experiment. The leap-frog steps are set to be uniformly drawn from with . Other experimental settings ( and ) are provided in the Appendix.
Results in terms of minimum ESS are summarized in Table 2. Prediction accuracies estimated via cross-validation are almost identical all across (reported in the Appendix). It can be seen that MG-HMC with outperforms (in terms of ESS) the other two settings with and , indicating increased numerical difficulties counter the theoretical gains when becomes large. This can be also seen by noting that the acceptance rates drop from around to around as increases from to . The dimensionality also seems to have an impact on the optimal setting of , since in the high-dimensional dataset Cavaran, the improvement of MG-HMC with is less significant compared with other datasets, and seems to suffer more of numerical difficulties. Comparisons between MG-HMC () and standard slice sampling are provided in the Appendix. In general, standard slice sampling with adaptive search underperforms relative to MG-HMC ().
Dataset (dim) | A (15) | G (25) | H (14) | P (8) | R (7) | C (87) |
---|---|---|---|---|---|---|
3124 | 3447 | 3524 | 3434 | 3317 | 33 (median 3987) | |
4308 | 4353 | 4591 | 4664 | 4226 | 36 (median 4531) | |
1490 | 3646 | 4315 | 4424 | 1490 | 7 (median 740) |
ICA (25) |
---|
2677 |
3029 |
1534 |
ICA We finally evaluate our methods on the MEG [23] dataset for Independent Component Analysis (ICA), with 17,730 time points and 25 feature dimension. All experiments are based on 5000 MCMC samples. The acceptance rates for are . Running time is almost identical for different . Settings (including and ) are provided in the Appendix. As shown in Table 2, when , MG-HMC has better mixing performance compared with other settings.
We demonstrated the connection between HMC and slice sampling, introducing a new method for implementing a slice sampler via an augmented form of HMC. With few modifications to standard HMC, our MG-HMC can be seen as a drop-in replacement for any scenario where HMC and its variants apply, for example, Hamiltonian Variational Inference (HVI) [24]. We showed the theoretical advantages of our method over standard HMC, as well as numerical difficulties associated with it. Several future extensions can be explored to mitigate numerical issues, e.g., performing MG-HMC on the Riemann manifold [5] so that step-sizes can be adaptively chosen, and using a high-order symplectic numerical method [25, 26] to reduce the discretization error introduced by the integrator.
The Journal of Machine Learning Research
, 15(1), 2014.
Comments
There are no comments yet.