Stochastic Approximation Hamiltonian Monte Carlo

10/11/2018, by Ick Hoon Jin et al.

Recently, Hamiltonian Monte Carlo (HMC) has become widespread as one of the more reliable approaches to efficient sample generation. However, HMC has difficulty sampling from multimodal posterior distributions because the HMC chain cannot cross the energy barrier between modes due to the energy conservation property. In this paper, we propose a Stochastic Approximation Hamiltonian Monte Carlo (SAHMC) algorithm for generating samples from multimodal densities under the HMC framework. SAHMC can adaptively lower the energy barrier so that the Hamiltonian trajectory moves between modes more frequently and more easily. The convergence of the algorithm is established under mild conditions. Numerical studies on a Gaussian mixture model and a neural network model show that SAHMC is superior to HMC when the target posterior density has multiple modes.

1 Introduction

Markov chain Monte Carlo (MCMC) is one of the conventional algorithms in Bayesian inference for generating samples from the posterior distribution. However, many MCMC algorithms are inefficient because of the random-walk behavior of their sample generation. Recently, HMC (Duane et al., 1987; Neal, 2010), which imitates Hamiltonian dynamics in the sample generation procedure, has gained popularity in the MCMC community as an efficient sampling method (Carpenter et al., 2017).

To construct an MCMC sampler using Hamiltonian dynamics, one first defines a Hamiltonian function for the posterior distribution from which samples are to be generated. The Hamiltonian function involves two variables: the "position" variable, corresponding to a realization from the posterior distribution, and an auxiliary "momentum" variable, which helps the chain produce efficient samples by exploiting the energy conservation property. Typically, the momentum variable is assigned an independent Gaussian distribution (Neal, 2010) or a distribution based on a Riemannian manifold whose mass matrix is the Fisher information (RMHMC; Girolami and Calderhead, 2011). After the momentum variable is updated, a new state is proposed by computing a trajectory that moves along the same level set according to the energy conservation property. The state proposed by Hamiltonian mechanics can be far from the current state, but it is nonetheless accepted with high probability.

HMC can be classified as an auxiliary variable MCMC algorithm and is also called hybrid Monte Carlo because it combines MCMC with a deterministic simulation algorithm, the leapfrog method, for implementing Hamiltonian dynamics. The tuning parameters of the leapfrog method, the step size $\epsilon$ and the number of leapfrog steps $L$, are among the practical issues in implementing HMC. The no-U-turn sampler (Hoffman and Gelman, 2014) has been proposed to eliminate the need to set the number of leapfrog steps by stopping the trajectory automatically when it starts to double back and retrace its steps. A detailed description of HMC based on its geometric foundations is given in Betancourt et al. (2014), and the theoretical properties of HMC have been studied in Livingstone et al. (2016) and Durmus et al. (2017).

One of the well-known problems of HMC is the generation of samples from multimodal posterior distributions: the HMC trajectory cannot move to another mode beyond the energy barrier (Nishimura and Dunson, 2016). The energy barrier with respect to the potential $U$ from a position $x$ to a position $x^\ast$ is the minimum amount of kinetic energy needed to reach $x^\ast$ from $x$ in a single iteration (Nishimura and Dunson, 2016):

$$\Delta_U(x, x^\ast) = \inf_{\phi \in \Phi(x, x^\ast)} \sup_{s \in [0,1]} U(\phi(s)) - U(x), \tag{1}$$

where $\Phi(x, x^\ast)$ denotes the class of continuous functions $\phi : [0,1] \to \mathcal{X}$ with $\phi(0) = x$ and $\phi(1) = x^\ast$. Note that, due to the energy conservation property,

$$U(x(t)) + K(p(t)) = U(x(0)) + K(p(0)) \quad \text{for all } t. \tag{2}$$

If the kinetic energy $K(p)$ at $x$ is less than the energy barrier $\Delta_U(x, x^\ast)$, then HMC will not be able to reach $x^\ast$. HMC and its variants (Girolami and Calderhead, 2011; Hoffman and Gelman, 2014; Neal, 2010; Shahbaba et al., 2013; Sohl-Dickstein et al., 2014) cannot easily move from one mode to another within a small number of iterations when the modes of the parameters of interest are separated by low-probability regions, because the energy barrier between the modes is then high.
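
For illustration (a toy example, not one considered in the paper), consider a one-dimensional double-well potential with modes at $x = -1$ and $x = 1$:

$$U(x) = (x^2 - 1)^2, \qquad \Delta_U(-1, 1) = \inf_{\phi \in \Phi(-1, 1)} \sup_{s \in [0,1]} U(\phi(s)) - U(-1) = U(0) - U(-1) = 1,$$

since every continuous path from $x = -1$ to $x = 1$ must pass through $x = 0$, where the potential along the best path is largest. A trajectory started at $x = -1$ with kinetic energy $K(p) < 1$ therefore can never reach the mode at $x = 1$, no matter how long it is run; this is the behavior that SAHMC is designed to overcome.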

Several variants of HMC have been proposed to address this multimodality problem. Wormhole Hamiltonian Monte Carlo (Lan et al., 2014) modifies RMHMC (Girolami and Calderhead, 2011) to create a short path (a so-called wormhole) that connects modes and thereby facilitates movement between them. This method is of limited practical use because the locations of the modes must be known in advance. A tempering approach is combined with HMC in Nishimura and Dunson (2016), where the authors suggest a new type of mass matrix that lowers the energy barrier between modes so that the Hamiltonian trajectory can move more often from one mode to another. However, this algorithm requires an appropriate temperature schedule so that the trajectory can efficiently navigate the parameter space. In addition, the standard integrator is not applicable to this sampler because its velocity is unbounded in low-probability regions.

In this article, we propose a new algorithm, Stochastic Approximation Hamiltonian Monte Carlo (SAHMC), for generating samples from a multimodal density under the HMC framework. SAHMC is an HMC version of the Stochastic Approximation Monte Carlo (SAMC) algorithm (Liang, 2007; Liang et al., 2007), which draws samples from each subregion of the sample space with a pre-determined frequency. SAHMC uses the weights of SAMC, which are updated by stochastic approximation (Robbins and Monro, 1951) in proportion to the difference between the actual visiting frequency of each subregion and its pre-specified frequency, to adaptively lower the energy barrier between modes and allow the chain to pass easily through the low-probability regions between them. The convergence of the algorithm is established under mild conditions. Compared to Lan et al. (2014), SAHMC does not need to know the locations of the modes before implementation. Compared to Nishimura and Dunson (2016), neither a temperature schedule nor a variable-step integrator needs to be specified for an SAHMC implementation. Numerical results show that the new algorithm works well, especially when the posterior density has multiple modes.

The rest of the article is organized as follows. In Section 2, we describe the SAMC algorithm. In Section 3, we incorporate HMC into the framework of SAMC and study its theoretical properties. In Section 4, we test the SAHMC algorithm on a Gaussian mixture model, with an extensive comparison with HMC. In Section 5, we apply SAHMC to a neural network model and compare the results with HMC. In Section 6, we conclude the article with a brief discussion.

2 SAMC Algorithm

Suppose that we are interested in sampling from the distribution

$$f(x) = c\,\psi(x), \quad x \in \mathcal{X}, \tag{3}$$

where $c$ is an unknown normalizing constant and $\mathcal{X}$ is the sample space. For mathematical convenience, we assume that $\mathcal{X}$ is either finite or compact.

We let $E_1, \ldots, E_m$ denote a partition of $\mathcal{X}$ according to the potential energy function $U(x) = -\log \psi(x)$, i.e., $E_1 = \{x : U(x) \le u_1\}$, $E_2 = \{x : u_1 < U(x) \le u_2\}$, $\ldots$, $E_{m-1} = \{x : u_{m-2} < U(x) \le u_{m-1}\}$, and $E_m = \{x : U(x) > u_{m-1}\}$, where $u_1 < u_2 < \cdots < u_{m-1}$ are pre-specified numbers. Let $\pi = (\pi_1, \ldots, \pi_m)$ be an $m$-vector with $0 < \pi_i < 1$ and $\sum_{i=1}^{m} \pi_i = 1$, which denotes the desired sampling frequency for each of the subregions. Then, the estimate of equation (3) can be written as

$$f_w(x) \propto \sum_{i=1}^{m} \frac{\psi(x)}{w_i}\, I(x \in E_i), \tag{4}$$

where $w_i = \int_{E_i} \psi(x)\, dx$ is the normalizing constant of the $i$th subregion. SAMC allows the existence of empty subregions in simulations and provides an automatic way to learn the normalizing constants $w_1, \ldots, w_m$.

Let $\{\gamma_t\}$ denote the gain factor sequence, a positive, non-increasing sequence satisfying

$$\sum_{t=1}^{\infty} \gamma_t = \infty, \qquad \sum_{t=1}^{\infty} \gamma_t^{\zeta} < \infty, \tag{5}$$

for some $\zeta \in (1, 2)$. For example, Liang et al. (2007) suggest

$$\gamma_t = \frac{t_0}{\max(t_0, t)}, \quad t = 1, 2, \ldots, \tag{6}$$

for some specified value of $t_0 > 1$.

Let $\theta_t = (\theta_{t1}, \ldots, \theta_{tm})$ denote a working estimate of $(\log(w_1/\pi_1), \ldots, \log(w_m/\pi_m))$ obtained at iteration $t$. With the foregoing notation, one iteration of SAMC can be described as follows:

SAMC Algorithm

  1. (Sample Generation) Simulate a sample $x_{t+1}$ by a single MH update whose proposal density is $q(x_t, \cdot)$ and whose invariant distribution is $f_{\theta_t}(x) \propto \sum_{i=1}^{m} \frac{\psi(x)}{e^{\theta_{ti}}}\, I(x \in E_i)$.

  2. ($\theta$-updating step) Set

    $$\theta_{t+\frac{1}{2}} = \theta_t + \gamma_{t+1}\,(e_{t+1} - \pi),$$

    where $e_{t+1} = (e_{t+1,1}, \ldots, e_{t+1,m})$ and $e_{t+1,i} = 1$ if $x_{t+1} \in E_i$ and 0 otherwise. If $\theta_{t+\frac{1}{2}} \in \Theta$, set $\theta_{t+1} = \theta_{t+\frac{1}{2}}$; otherwise, set $\theta_{t+1} = \theta_{t+\frac{1}{2}} + c^{\ast}$, where $c^{\ast}$ can be an arbitrary vector which satisfies the condition $\theta_{t+\frac{1}{2}} + c^{\ast} \in \Theta$.
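
To make the update concrete, the following minimal Python sketch implements the $\theta$-updating step (illustrative code only; the sample $x_{t+1}$ is assumed to come from the MH step above, and all function and variable names are our own):

```python
import numpy as np

def samc_theta_update(theta, x_new, U, edges, pi, gamma):
    """One theta-updating step of SAMC (sketch).

    theta : current log-weight estimates theta_t, shape (m,)
    x_new : sample x_{t+1} returned by the MH step with invariant density f_theta
    U     : potential energy function, U(x) = -log psi(x)
    edges : increasing energy cutoffs u_1 < ... < u_{m-1} defining the partition
    pi    : desired sampling frequencies (pi_1, ..., pi_m), summing to 1
    gamma : gain factor gamma_{t+1}, e.g. t0 / max(t0, t + 1) as in (6)
    """
    m = len(pi)
    j = int(np.searchsorted(edges, U(x_new)))   # index of the subregion containing x_new
    e = np.zeros(m)
    e[j] = 1.0
    theta = theta + gamma * (e - pi)            # stochastic-approximation update
    theta -= theta.max()                        # location shift; f_theta is unchanged
    return theta
```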

For effective implementation of SAMC, several issues must be considered (Liang et al., 2007):

  • Sample Space Partition  The sample space is partitioned according to our goal and the complexity of the given problem. For example, when we generate samples from a distribution, the sample space can be partitioned according to the energy function. The maximum energy difference within each subregion should be bounded; for example, Liang et al. (2007) suggest a bound of 2. Within the same subregion, the SAHMC move reduces to a local HMC move.

  • Choice of the desired sampling distribution  If we aim to estimate the normalizing constants $w_i$, then we may set the desired distribution to be uniform, as is done in all examples of this article. However, the desired distribution may also be biased toward low-energy regions. To ensure convergence, all subregions of the sample space must be visited in proportion to the desired sampling distribution.

  • Choice of $t_0$ and the number of iterations  To estimate the $w_i$ accurately, $\gamma_t$ should be very close to 0 at the end of the simulation, and the speed at which $\gamma_t$ goes to 0 can be controlled by $t_0$. In practice, we choose $t_0$ according to the complexity of the problem: the more complex the problem, the larger the value of $t_0$ that should be chosen. A large $t_0$ will make SAHMC reach all subregions quickly, even in the presence of multiple local energy minima.

3 SAHMC Algorithm

To substitute the sample generation step of SAMC with HMC, we first define the potential energy function as $U(x) = -\log \psi(x)$ and the kinetic energy function as

$$K(p) = \frac{1}{2}\, p^{T} M^{-1} p, \tag{7}$$

where the auxiliary variable $p \in \mathbb{R}^{d}$ is interpreted as a momentum variable, $d$ is the dimension of $x$, and the covariance matrix $M$ denotes a mass matrix. We can prespecify the mass matrix as a diagonal matrix or define it using the Riemannian manifold (Girolami and Calderhead, 2011). Then, the Hamiltonian and its corresponding probability function are

$$H(x, p) = U(x) + K(p), \qquad f(x, p) \propto \exp\{-H(x, p)\} = \exp\{-U(x)\}\exp\{-K(p)\}. \tag{8}$$

The partial derivatives of $H(x, p)$ determine how $x$ and $p$ change over time according to the Hamiltonian equations,

$$\frac{dx}{dt} = \frac{\partial H}{\partial p} = M^{-1} p, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial x} = -\nabla U(x). \tag{9}$$

Note that $M^{-1} p$ can be interpreted as a velocity.
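
In practice, the dynamics in (9) are simulated with the leapfrog integrator, which appears as the Proposal Step of the algorithm below. A minimal Python sketch (with a diagonal mass matrix and a user-supplied gradient of $U$; names are our own) is:

```python
import numpy as np

def leapfrog(x, p, grad_U, eps, L, M_diag):
    """Leapfrog discretization of the Hamiltonian equations (9).

    x, p   : current position and momentum (1-D arrays)
    grad_U : function returning the gradient of U(x) = -log psi(x)
    eps, L : leapfrog step size and number of leapfrog steps
    M_diag : diagonal of the mass matrix M
    """
    x, p = x.astype(float).copy(), p.astype(float).copy()
    p -= 0.5 * eps * grad_U(x)        # half step for the momentum
    for l in range(L):
        x += eps * p / M_diag         # full step for the position: dx/dt = M^{-1} p
        if l < L - 1:
            p -= eps * grad_U(x)      # full step for the momentum: dp/dt = -grad U(x)
    p -= 0.5 * eps * grad_U(x)        # final half step for the momentum
    return x, -p                      # negate momentum so the proposal is symmetric
```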

Under the aforementioned energy partition, the estimate of equation (3) can be written as

$$f_w(x) \propto \sum_{i=1}^{m} \frac{\psi(x)}{w_i}\, I(x \in E_i), \tag{10}$$

where the normalizing constant of the $i$th subregion, $w_i$, is

$$w_i = \int_{E_i} \exp\{-U(x)\}\, dx = \int_{E_i} \psi(x)\, dx. \tag{11}$$

SAHMC allows the existence of empty subregions in simulations and provides an automatic way to learn the normalizing constants $w_1, \ldots, w_m$.

Let $\theta_t = (\theta_{t1}, \ldots, \theta_{tm})$ denote a working estimate of $(\log(w_1/\pi_1), \ldots, \log(w_m/\pi_m))$ obtained at iteration $t$. With the foregoing notation, one iteration of SAHMC can be described as follows:

SAHMC Algorithm

  1. (Momentum Updating) Draw an independent normal random variable $p_t \sim N(0, M)$, and set $x^{(0)} = x_t$ and $p^{(0)} = p_t$. Also, set $U^{(0)} = U(x^{(0)})$ and $K^{(0)} = K(p^{(0)})$.

  2. (Proposal Step)

    1. Make a half step for the momentum at the beginning: $p^{(0)} \leftarrow p^{(0)} - \frac{\epsilon}{2} \nabla U(x^{(0)})$.

    2. Alternate full steps for position and momentum. For $l = 1, \ldots, L$, do the following:

      1. Make a full step for the position: $x^{(l)} = x^{(l-1)} + \epsilon\, M^{-1} p^{(l-1)}$.

      2. Make a full step for the momentum, except at the end of the trajectory: if $l < L$, set $p^{(l)} = p^{(l-1)} - \epsilon\, \nabla U(x^{(l)})$.

      3. Make a half step for the momentum at the end: $p^{(L)} = p^{(L-1)} - \frac{\epsilon}{2} \nabla U(x^{(L)})$.

      4. Negate the momentum at the end of the trajectory to make the proposal symmetric: $p^{(L)} \leftarrow -p^{(L)}$. Also, set $x^{\ast} = x^{(L)}$ and $p^{\ast} = p^{(L)}$.

  3. (Decision Step) Set $U^{\ast} = U(x^{\ast})$ and $K^{\ast} = K(p^{\ast})$, and calculate

    $$r = \exp\left\{\left(U^{(0)} + K^{(0)}\right) - \left(U^{\ast} + K^{\ast}\right) + \theta_{t, J(x_t)} - \theta_{t, J(x^{\ast})}\right\}, \tag{12}$$

    where $J(x)$ denotes the index of the subregion that $x$ belongs to. Accept the proposal with probability $\min(1, r)$. If accepted, set $x_{t+1} = x^{\ast}$; otherwise, set $x_{t+1} = x_t$.

  4. ($\theta$-updating step) Set

    $$\theta_{t+\frac{1}{2}} = \theta_t + \gamma_{t+1}\,(e_{t+1} - \pi),$$

    where $e_{t+1} = (e_{t+1,1}, \ldots, e_{t+1,m})$ and $e_{t+1,i} = 1$ if $x_{t+1} \in E_i$ and 0 otherwise. If $\theta_{t+\frac{1}{2}} \in \Theta$, set $\theta_{t+1} = \theta_{t+\frac{1}{2}}$; otherwise, set $\theta_{t+1} = \theta_{t+\frac{1}{2}} + c^{\ast}$, where $c^{\ast}$ can be an arbitrary vector which satisfies the condition $\theta_{t+\frac{1}{2}} + c^{\ast} \in \Theta$.
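
Putting the four steps together, one SAHMC iteration can be sketched in Python as follows (an illustrative reading of the algorithm above, not the authors' implementation; it reuses the `leapfrog` sketch from earlier in this section and assumes `len(pi) == len(edges) + 1`):

```python
import numpy as np

def subregion_index(u, edges):
    """Index J(x) of the subregion for a state with potential energy u."""
    return int(np.searchsorted(edges, u))

def sahmc_iteration(x, theta, U, grad_U, edges, pi, eps, L, M_diag, gamma, rng):
    """One SAHMC iteration (sketch): momentum refresh, leapfrog proposal,
    theta-weighted accept/reject as in (12), and theta update."""
    # 1. Momentum updating: p_t ~ N(0, M) with diagonal M
    p = rng.normal(size=x.size) * np.sqrt(M_diag)
    # 2. Proposal step: leapfrog trajectory of L steps with step size eps
    x_star, p_star = leapfrog(x, p, grad_U, eps, L, M_diag)
    # 3. Decision step: acceptance ratio (12) including the SAMC weights theta
    H_cur = U(x) + 0.5 * np.sum(p ** 2 / M_diag)
    H_star = U(x_star) + 0.5 * np.sum(p_star ** 2 / M_diag)
    j_cur = subregion_index(U(x), edges)
    j_star = subregion_index(U(x_star), edges)
    log_r = (H_cur - H_star) + (theta[j_cur] - theta[j_star])
    if np.log(rng.uniform()) < min(0.0, log_r):
        x = x_star
    # 4. Theta-updating step, as in SAMC
    e = np.zeros(len(pi))
    e[subregion_index(U(x), edges)] = 1.0
    theta = theta + gamma * (e - pi)
    theta -= theta.max()              # location shift; f_theta is unchanged
    return x, theta
```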

Like SAMC (Liang, 2009; Liang et al., 2007), SAHMC falls into the category of stochastic approximation algorithms (Andrieu et al., 2005; Benveniste et al., 1990). Theoretical results on the convergence of SAHMC are given in the Appendix. The theory states that, under mild conditions,

$$\theta_{ti} \to \begin{cases} C + \log\left(\int_{E_i} \psi(x)\, dx\right) - \log\left(\pi_i + \nu\right), & \text{if } E_i \neq \emptyset, \\ -\infty, & \text{if } E_i = \emptyset, \end{cases} \tag{13}$$

where $\nu = \sum_{j \in \{i: E_i = \emptyset\}} \pi_j / (m - m_0)$, $m_0$ is the number of empty subregions, and $C$ represents an arbitrary constant. Since $f_{\theta_t}(x)$ is invariant with respect to a location transformation of $\theta_t$, $C$ cannot be determined by the samples drawn from $f_{\theta_t}(x)$. To determine the value of $C$, extra information is needed; for example, that $\sum_{i=1}^{m} e^{\theta_{ti}}$ is equal to a known number.

The main advantage of the SAHMC algorithm is that it can adaptively lower the energy barrier between modes and move the Hamiltonian trajectory more frequently and easily across the low-probability regions between them. Suppose there are two modes, $x_1$ and $x_2$, in the target density, separated by a low-probability region, and assume the current location of the chain is near $x_1$. For HMC, from equation (2), the kinetic energy at $x_1$ must be larger than the energy barrier $\Delta_U(x_1, x_2)$ for the chain to reach the other peak $x_2$. However, for SAHMC, the energy barrier in equation (1) is effectively replaced by

$$\Delta_{\theta_t}(x_1, x_2) = \Delta_U(x_1, x_2) - \left(\theta_{t, J(x_1)} - \theta_{t, J(x_2)}\right), \tag{14}$$

where $J(x_1)$ and $J(x_2)$ denote the indices of the subregions that $x_1$ and $x_2$ belong to, respectively. By our assumption, the chain has stayed near $x_1$ for several iterations. That means the subregion $E_{J(x_1)}$ is oversampled relative to its desired frequency $\pi_{J(x_1)}$, while $E_{J(x_2)}$ is undersampled relative to $\pi_{J(x_2)}$, resulting in $\theta_{t, J(x_1)}$ being larger than $\theta_{t, J(x_2)}$. Then, under the SAHMC framework, the energy barrier is lowered by $\theta_{t, J(x_1)} - \theta_{t, J(x_2)}$, so the kinetic energy at $x_1$ can move the trajectory more easily toward the other mode $x_2$. The amount by which SAHMC lowers the energy barrier is determined adaptively according to the frequency of visits to each subregion.

The other benefit of the SAHMC algorithm is its flexibility with respect to other variants of HMC. Because SAHMC can be implemented by adding one more step to HMC, all existing HMC variants can be easily implemented within the SAHMC framework. For example, by replacing the mass matrix with the Fisher information matrix, we can easily implement RMHMC (Girolami and Calderhead, 2011) under the SAHMC framework. The no-U-turn sampler (Hoffman and Gelman, 2014) can also be incorporated into SAHMC.

4 An Illustrative Example: A Gaussian Mixture Distribution

Set 1: a = -6, b = 4
Parameter Method Time(s) ESS (min, med, max) s / min ESS Relative Speed
HMC 3.9 ( 787, 882, 941) 0.00496 1.0
SAHMC 3.9 (2041, 2984, 3549) 0.00191 2.6
HMC 3.9 ( 802, 879, 950) 0.00486 1.0
SAHMC 3.9 (2120, 3034, 3493) 0.00184 2.6
Set 2: a = -8, b = 6
Parameter Method Time(s) ESS (min, med, max) s / min ESS Relative Speed
HMC 3.9 ( 18, 25, 78) 0.21666 1.0
SAHMC 3.9 (533, 723, 1033) 0.00732 29.6
HMC 3.9 ( 17, 26, 78) 0.22941 1.0
SAHMC 3.9 (581, 683, 825) 0.00671 34.2
Table 1: Gaussian Mixture Examples: Comparison of SAHMC and HMC

As our illustrative example, we compare SAHMC with HMC using the following Gaussian mixture distribution:

(15)

which is identical to that given by Gilks et al. (1998) except that the mean vectors are separated by a larger distance in each dimension. With this example, we show how SAHMC outperforms the original HMC method in generating samples from a multimodal density.

To compare SAHMC with HMC, we used two sets of $(a, b)$: $(a, b) = (-6, 4)$ and $(a, b) = (-8, 6)$. For both HMC and SAHMC, we used the same leapfrog step size $\epsilon$ and the same number of leapfrog steps $L$. To run SAHMC, we set the sample space to be compact and partitioned it into subregions of equal energy bandwidth. Additionally, we fixed the gain-factor constant $t_0$ and set the desired sampling distribution to be uniform for SAHMC. Both HMC and SAHMC were independently run ten times, and each run consists of 1,000,000 iterations, where the first 200,000 iterations were discarded as burn-in.
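
Because the exact specification of (15) and of the energy partition is not reproduced above, the following sketch only illustrates how such a target and partition could be coded for the SAHMC sketches of Section 3; the three component means, the variance, and the cutoff values are hypothetical placeholders, not the paper's settings:

```python
import numpy as np
from scipy.special import logsumexp

# Hypothetical two-dimensional, three-component Gaussian mixture (placeholder for (15))
means = np.array([[-6.0, -6.0], [0.0, 0.0], [4.0, 4.0]])   # illustrative means only
sigma2 = 1.0

def U(x):
    """Potential energy U(x) = -log psi(x) of the mixture, up to an additive constant."""
    sq = np.sum((x - means) ** 2, axis=1) / (2.0 * sigma2)
    return -logsumexp(-sq) + np.log(len(means))

def grad_U(x, h=1e-5):
    """Central-difference gradient of U; sufficient for a sketch."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = h
        g[i] = (U(x + d) - U(x - d)) / (2.0 * h)
    return g

# Equal-bandwidth energy partition and uniform desired sampling frequencies (illustrative)
edges = np.arange(1.0, 51.0, 2.0)
pi = np.full(len(edges) + 1, 1.0 / (len(edges) + 1))
```

Objects of this form (`U`, `grad_U`, `edges`, `pi`) can then be passed directly to the `sahmc_iteration` sketch of Section 3.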

Table 1 summarizes the performance of HMC and SAHMC in terms of effective sample size (ESS) and relative speed. There is no difference in total computational time: for both HMC and SAHMC, each run takes 3.9 s on a 4.0 GHz Intel i7 processor. However, in terms of the time needed to generate one effective sample (s / min ESS), SAHMC is about 2.6 times faster than HMC for Set 1 (a = -6, b = 4). For Set 2 (a = -8, b = 6), which has a larger distance between modes and therefore poses more difficulty in generating MCMC samples, the performance of SAHMC is approximately 30 times better than that of HMC.

Figure 1: Position and momentum coordinates for the first 1,000 iterations and scatter plots for the Gaussian mixture example with a = -8 and b = 6. Row 1, HMC: (a) position, (b) momentum, (c) scatter plot. Row 2, SAHMC: (d) position, (e) momentum, (f) scatter plot.

Figures 1(a)-(c) show the position and momentum coordinates for the first 1,000 iterations and the scatter plot of the MCMC samples drawn by HMC, and Figures 1(d)-(f) show those of the MCMC samples generated by SAHMC. The scatter plots show that HMC cannot reach one of the three modes, whereas SAHMC explores the entire parameter space. The weight terms $\theta_t$, which are updated adaptively based on the frequency of chain visits, help the SAHMC chain move more widely in both the position and momentum coordinates, so that it explores more of the sample space while maintaining the efficiency of HMC. Therefore, SAHMC performs much better than HMC when the target density has multiple modes.

5 An Application to Neural Networks

Feed-forward neural networks, also known as multilayer perceptrons (MLPs), are among the best-known models in the machine learning community (Schmidhuber, 2015). Given a set of connection weights $(\alpha, \beta)$, a single-hidden-layer MLP can be written as

$$f(x_i) = \psi\left(\alpha_0 + \sum_{k=1}^{q} \alpha_k\, \psi\left(\beta_{k0} + \sum_{j=1}^{p} \beta_{kj}\, x_{ij}\right)\right), \tag{16}$$

where $q$ is the number of hidden units, $p$ is the number of input units, $x_i = (x_{i1}, \ldots, x_{ip})$ is the $i$-th input pattern, and $\alpha_k$ and $\beta_{kj}$ are the weights on the connections from the $k$-th hidden unit to the output unit and from the $j$-th input unit to the $k$-th hidden unit, respectively. The function $\psi(\cdot)$ is the activation function of the hidden and output units. Popular choices of $\psi$ include the sigmoid function, the hyperbolic tangent function, and the rectified linear unit (ReLU).

Figure 2: An example of trace plots: (a) SAHMC, (b) HMC.

There have been multiple studies of the computational aspects of Bayesian neural network models via MCMC algorithms (Andrieu et al., 2000; Lampinen and Vehtari, 2001), but their practical performance has been questioned because of the highly correlated parameters in the posterior space. Alternatively, in Neal (2012), HMC was used to sample the weight parameters, improving the convergence of the MCMC chain. However, the highly multimodal nature of the posterior distribution of the MLP still hinders the practical implementation of neural network models. To address this issue, we apply SAHMC to neural network models and consider simulated and real datasets to examine the capacity of SAHMC to efficiently explore the multimodal posterior space.

A Simulation Study

We consider a simple regression setting with one predictor for a one-hidden-layer feed-forward neural network, and we generate the data from $y_i = f_0(x_i) + \epsilon_i$, where the true function $f_0$ is constructed with the ReLU activation $\psi(z) = \max(z, 0)$ and the $\epsilon_i$ are independent Gaussian errors, for $i = 1, \ldots, n$. We independently replicate 200 simulated data sets and use SAHMC and HMC to sample from the posterior distribution of the connection weights.

For both HMC and SAHMC, we used the same leapfrog step size $\epsilon$ and the same number of leapfrog steps $L$. To run SAHMC, we set the sample space to be compact and partitioned it into subregions of equal energy bandwidth. Additionally, we fixed the gain-factor constant $t_0$ and set the desired sampling distribution to be uniform for SAHMC. Both HMC and SAHMC were run independently, and each run consists of 55,000 iterations, where the first 5,000 iterations were discarded as burn-in. All initial points were randomly selected for all simulations.
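
For readers who wish to reproduce a setting of this kind, the sketch below writes down a potential energy (negative log-posterior, up to a constant) for a one-hidden-layer ReLU network with a single input and a linear output layer; the Gaussian noise variance `sigma2` and the Gaussian prior variance `tau2` are illustrative assumptions, since the paper's exact prior and noise specifications are not reproduced above:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def nn_predict(w, X, q):
    """One-hidden-layer ReLU network with one input and a linear output.
    w packs alpha_0, alpha_1..alpha_q, then (beta_{k0}, beta_{k1}) for k = 1..q."""
    alpha0, alpha = w[0], w[1:1 + q]
    beta = w[1 + q:].reshape(q, 2)                   # per hidden unit: [bias, input weight]
    H = relu(beta[:, 0] + np.outer(X, beta[:, 1]))   # n x q hidden activations
    return alpha0 + H @ alpha

def U(w, X, y, q, sigma2=1.0, tau2=10.0):
    """Potential U(w) = -log posterior up to a constant, under Gaussian noise and an
    independent N(0, tau2) prior on all weights (both are illustrative choices)."""
    resid = y - nn_predict(w, X, q)
    return 0.5 * np.sum(resid ** 2) / sigma2 + 0.5 * np.sum(w ** 2) / tau2
```

The gradient of `U` (analytic or numerical) can then be supplied to the leapfrog and `sahmc_iteration` sketches of Section 3.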

To measure the performance of each procedure, we consider the posterior risk of the regression function, that is, the posterior expectation of the squared-error loss of the fitted regression function, where the expectation is taken with respect to the posterior distribution of the connection weights, and we also consider the ESS. We also report the minimum energy found by SAHMC and by HMC, and the proportion of cases in which SAHMC finds a lower-energy region than the other procedure. The results are averaged over the 200 replicated data sets.

Method Posterior Loss ESS (min, med, max)
HMC 0.112 (4.1, 40.3, 282.7)
SAHMC 0.077 (8.3, 184.9, 335.8)
Table 2: Results of the Simulation Study: Comparison of SAHMC and HMC

Table 2 summarizes the performance of SAHMC and HMC, and shows that the posterior expectation of the loss of the regression function is smaller under SAHMC than under HMC. The median ESS of SAHMC is about 4.5 times larger than that of HMC.

In Figure 2, we provide an example of the trace plots of an SAHMC chain and an HMC chain for one simulated data set. This example illustrates how different the Markov chains of SAHMC and HMC are. The SAHMC chain explores all energy levels between 503 and 571, while the HMC chain searches only a narrower range of energy levels, between 509 and 541. This demonstrates the capacity of SAHMC to escape local modes of the posterior distribution and to explore a wider region of the posterior space.

Pima Indians Diabetes Classification

We consider a real data set that contains 768 records of female Pima Indians, characterized by eight physiological predictors and the presence of diabetes (Smith et al., 1988). We model the data by an artificial neural network (ANN) with a single layer of 25 hidden nodes. The ANN is trained using SAHMC and HMC on a randomly selected subset of the samples, and the out-of-sample prediction error is evaluated on the remaining samples. We replicate this procedure 100 times and report the test error, the ESS, and the average of the minimum energy values found by each procedure. All other algorithm settings are the same as in the simulation study except for the energy partition.

Method Test Error ESS (min, med, max) min.Energy
HMC 0.383 (1.7,4.0,24.9) 294.90
SAHMC 0.265 (4.4,15.0,34.5) 289.13
Table 3: Pima Indians Diabetes Data Set

Table 3 shows that the SAHMC algorithm collects more effective samples and achieves a smaller test error than HMC. The average minimum energy found by SAHMC is also 5.77 lower than that found by HMC.

6 Concluding Remarks

In this paper, we propose a new algorithm, SAHMC, which generates samples from multimodal densities under the HMC framework. Because SAHMC can adaptively lower the energy barrier between modes, it can explore a rugged energy landscape efficiently. We compare the results of SAHMC with those of HMC and show that SAHMC works more efficiently when the target density has multiple modes.

One difficulty in applying SAHMC is how to set the boundaries of the sample space partition. One approach to overcome this difficulty is to run SAHMC in two stages: in the first stage, we run HMC for a few hundred iterations, and then we run SAHMC with the range of the sample space determined by the results of the first stage.
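
A rough sketch of this two-stage hand-off, under the same illustrative conventions as the earlier code, is:

```python
import numpy as np

def partition_from_pilot(pilot_energies, m=30, slack=5.0):
    """Build an equal-bandwidth energy partition from the energies U(x) visited
    by a short pilot HMC run (stage 1), for use by SAHMC (stage 2)."""
    u_min = np.min(pilot_energies) - slack
    u_max = np.max(pilot_energies) + slack
    edges = np.linspace(u_min, u_max, m - 1)   # m - 1 cutoffs define m subregions
    pi = np.full(m, 1.0 / m)                   # uniform desired sampling frequencies
    return edges, pi
```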

References

  • Andrieu et al. (2000) Andrieu, C., N. De Freitas, and A. Doucet (2000). Reversible jump MCMC simulated annealing for neural networks. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 11–18. Morgan Kaufmann Publishers Inc.
  • Andrieu et al. (2005) Andrieu, C., E. Moulines, and P. Priouret (2005). Stability of stochastic approximation under verifiable conditions. SIAM Journal on Control and Optimization 44, 283–312.
  • Benveniste et al. (1990) Benveniste, A., M. Métivier, and P. Priouret (1990). Adaptive Algorithms and Stochastic Approximations. New York, NY: Springer-Verlag.
  • Betancourt et al. (2014) Betancourt, M., S. Byrne, S. Livingstone, and M. Girolami (2014). The geometric foundations of Hamiltonian Monte Carlo. arXiv:1410.5110v1.
  • Carpenter et al. (2017) Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell (2017). Stan: A probabilistic programming language. Journal of Statistical Software 76.
  • Duane et al. (1987) Duane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987). Hybrid Monte Carlo. Physics Letters B 195, 216–222.
  • Durmus et al. (2017) Durmus, A., E. Moulines, and E. Saksman (2017). On the convergence of Hamiltonian Monte Carlo. arXiv:1705.00166.
  • Gilks et al. (1998) Gilks, W. R., G. O. Roberts, and S. K. Sahu (1998). Adaptive Markov chain Monte Carlo through regeneration. Journal of the American Statistical Association 93, 1045–1054.
  • Girolami and Calderhead (2011) Girolami, M. and B. Calderhead (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B 73, 123–214.
  • Hoffman and Gelman (2014) Hoffman, M. D. and A. Gelman (2014). The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research 15, 1593–1623.
  • Lampinen and Vehtari (2001) Lampinen, J. and A. Vehtari (2001). Bayesian approach for neural networks-review and case studies. Neural networks 14(3), 257–274.
  • Lan et al. (2014) Lan, S., J. Streets, and B. Shahbaba (2014). Wormhole Hamiltonian Monte Carlo. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Palo Alto, CA, pp. 1953–1959. AAAI Press.
  • Liang (2007) Liang, F. (2007). Annealing stochastic approximation Monte Carlo for neural network training. Machine Learning 68, 201–233.
  • Liang (2009) Liang, F. (2009). Improving SAMC using smoothing methods: Theory and applications to Bayesian model selection problems. The Annals of Statistics 37(5B), 2626–2654.
  • Liang et al. (2007) Liang, F., C. Liu, and R. J. Carroll (2007). Stochastic approximation in Monte Carlo computation. Journal of the American Statistical Association 102, 305–320.
  • Livingstone et al. (2016) Livingstone, S., M. Betancourt, S. Byrne, and M. Girolami (2016). On the geometric ergodicity of Hamiltonian Monte Carlo. arXiv:1601.08057.
  • Neal (2010) Neal, R. (2010). MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. Jones, and X.-L. Meng (Eds.), Handbook of Markov Chain Monte Carlo, pp. 113–162. New York: Chapman & Hall.
  • Neal (2012) Neal, R. (2012). Bayesian learning for neural networks, Volume 118. Springer Science & Business Media.
  • Nishimura and Dunson (2016) Nishimura, A. and D. Dunson (2016). Geometrically tempered Hamiltonian Monte Carlo. arXiv:1604.00871.
  • Robbins and Monro (1951) Robbins, H. and S. Monro (1951). A stochastic approximation method. The Annals of Mathematical Statistics 22, 400–407.
  • Schmidhuber (2015) Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks 61, 85–117.
  • Shahbaba et al. (2013) Shahbaba, B., S. Lan, W. O. Johnson, and R. M. Neal (2013). Split Hamiltonian Monte Carlo. Statistics and Computing 24, 339–349.
  • Smith et al. (1988) Smith, J. W., J. Everhart, W. Dickson, W. Knowler, and R. Johannes (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care, pp. 261. American Medical Informatics Association.
  • Sohl-Dickstein et al. (2014) Sohl-Dickstein, J., M. Mudigonda, and M. DeWeese (2014). Hamiltonian Monte Carlo without detailed balance. In Proceedings of the 31st International Conference on Machine Learning, Volume 32, pp. 719–726.