1 Introduction
The development of increasingly sophisticated Bayesian models in modern machine learning has accentuated the need for efficient generation of asymptotically exact samples from complex posterior distributions. Markov Chain Monte Carlo (MCMC) is an important framework for drawing samples from a target density function. MCMC sampling typically aims to estimate a desired expectation in terms of a collection of samples, avoiding the need to compute intractable integrals. The Metropolis algorithm
(Metropolis et al., 1953) was originally proposed to tackle this task. Despite great success, this method is based on random walk exploration, which often leads to inefficient posterior sampling (with a finite number of samples). Alternatively, exploration of a target distribution can be guided using proposals inspired by Hamiltonian dynamics, leading to Hamiltonian Monte Carlo (HMC) (Duane et al., 1987). Aided by gradient information, HMC is able to move efficiently in parameter space, thus greatly improving exploration. However, the emergence of big datasets poses a new challenge for HMC, as evaluation of gradients on whole datasets becomes computationally demanding, if not prohibitive, in many cases.To scale HMC methods to big data, recent advances in Stochastic Gradient MCMC (SGMCMC) have subsampled the dataset into minibatches in each iteration, to decrease computational burden (Welling & Teh, 2011; Chen et al., 2014; Ding et al., 2014; Ma et al., 2015). Stochastic Gradient Langevin Dynamics (SGLD) (Welling & Teh, 2011) was first proposed to generate approximate samples from a posterior distribution using minibatches. Since then, research has focused on leveraging the minibatch idea while also providing theoretical guarantees. For instance, Teh et al. (2014) showed that by appropriately injecting noise while using a stepsizedecay scheme, SGLD is able to converge asymptotically to the desired posterior. Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) (Chen et al., 2014) extended SGLD with auxiliary momentum variables, akin to HMC, and introduced a friction term to counteract the stochastic noise due to subsampling. However, exact estimation of such noise is needed to guarantee a correct SGHMC sampler. To alleviate this issue, the Stochastic Gradient NoséHoover Thermostat (SGNHT) (Ding et al., 2014) algorithm introduced socalled thermostat variables to adaptively estimate stochastic noise via a thermalequilibrium condition.
One standing challenge of SGMCMC methods is inefficiency when exploring complex multimodal distributions. This limitation is commonly found in latent variable models with a multilayer structure. Inefficiency is manifested because sampling algorithms have difficulties moving across modes, while traveling along the surface of the distribution. As a result, it may take a very large number of iterations (posterior samples) to cover more than one mode, greatly limiting scalability.
We investigate strategies for improving mixing in SGMCMC. We propose the Stochastic Gradient Monomial Gamma Thermostat (SGMGT), building upon the Monomial Gamma Sampler (MGS) proposed by Zhang et al. (2016b). They showed that a generalized kinetic function typically improves the stationary mixing efficiency of HMC, especially when the target distribution has multiple modes. However, this advantage comes with numerical difficulties, and convergence problems due to poor initialization. By defining a smooth version of this generalized kinetic function, we can leverage its mixing efficiency, while satisfying the required conditions for stationarity of the corresponding stochastic process, as well as alleviating numerical difficulties arising from differentiability issues. To ameliorate the convergence issues, we further introduce ) a sampler with an underlying elliptic stochastic differential equation system and ) a resampling scheme for auxiliary variables (momentum and thermostats) with theoretical guarantees. The result is an elegant framework to improve stationary mixing performance on existing SGMCMC algorithms augmented with auxiliary variables.
2 Preliminaries
Hamiltonian Monte Carlo
Suppose we are interested in sampling from a posterior distribution represented as , where denotes model parameters and represents data points. Assuming i.i.d. data, the potential energy function denotes the negative log posterior density, up to a normalizing constant, i.e., . For simplicity, in the following we omit the conditioning of in , and write
. In HMC, the posterior density is augmented with an auxiliary momentum random variable
; is independent of, and typically has a marginal Gaussian distribution with zeromean and covariance matrix
. The joint distribution is written as
, where is the total energy (or Hamiltonian) and is the standard (Gaussian) kinetic energy function, and is the mass matrix. HMC leverages Hamiltonian dynamics, driven by the following differential equations(1) 
where is the system’s time index.
The total Hamiltonian is preserved under perfect simulation, i.e., by solving (1) exactly. However, closedform solutions for and are often intractable, thus numerical integrators such as the leapfrog method are utilized to generate approximate samples of (Neal, 2011). This leads to the following update scheme:
(2) 
where is the stepsize.
Monomial Gamma HMC
In the Monomial Gamma Hamiltonian Monte Carlo (MGHMC) (Zhang et al., 2016b) algorithm, the following generalized kinetic function is employed as a substitute for the Gaussian kinetics of standard HMC:
(3) 
where denotes the elementwise power operation, is the monomial parameter. Note that when , (3) recovers the standard (Gaussian) kinetics. For general , the update equations are identical to (2), except for
(4) 
Zhang et al. (2016b) proved in the univariate case that MGHMC can yield better mixing performance when the sampler reaches its stationary distribution, under perfect dynamic simulation, i.e.,
infinitesimal stepsize in the limit and adequate (finite) simulation stepsize. Additionally, it was shown that for multimodal distributions sampled via MGMHC, the probability of getting trapped in a single mode goes to zero, as
.However, these theoretical advantages are accompanied by two practical issues: ) the numerical difficulties accentuate dramatically as increases, due to the lack of differentiability of for , and ) convergence is slow with poor initialization. For example, in (3) and (4), if is far away from the mode(s) of the distribution, will be large, causing the updated momentum to blow up. This renders the change of , i.e., , to be arbitrarily small for large , thus slowing convergence.
Stochastic Gradient MCMC
SGMCMC is desirable when the dataset, , is too large to evaluate the potential using all samples. The idea behind SGMCMC is to replace with an unbiased stochastic likelihood, , evaluated from a subset of data (termed a minibatch)
(5) 
where is a random subset of of size . SGMCMC algorithms are typically driven by a continuoustime Markov stochastic process of the form (Chen et al., 2015)
(6) 
where denotes the parameters of the augmented system, e.g., and , and are referred as drift and diffusionvectors, respectively, and denotes a standard Wiener process.
In SGHMC (Chen et al., 2014), the resulting stochastic dynamic process is governed by the following Stochastic Differential Equations (SDEs) (with ):
(7) 
where , is a function of , and is a function of . is modeled as , where and is the discretization stepsize. is an estimator of , is a userspecified diffusion factor and
is the identity matrix.
Chen et al. (2014) set for simplicity. The reasoning is that the injected noise will dominate as ( remains as a constant), whereas goes to zero. Unfortunately, the covariance function, , of the stochastic noise, , is difficult to estimate in practice.Recently, SGNHT (Ding et al., 2014) considered incorporating additional auxiliary variables (thermostats). The resulting SDEs correspond to
(8)  
(9) 
where represents the Hadamard (elementwise) product and are thermostat variables. Note that the diffusion factor, , is decoupled in (8), thus can adaptively fit to the unknown noise from the stochastic gradient .
3 Stochastic Gradient Monomial Gamma Sampler
We now consider ) a more efficient (generalized) kinetic function, ) adapting the proposed kinetics to satisfy stationary requirements and alleviate numerical difficulties, ) incorporating an additional firstorder stochastic process to (8) and ) stochastic resampling of the momentum and thermostats to lessen convergence issues.
Generalized kinetics
The statistical physics literature traditionally considers a quadratic form of the kinetics function, and a Gaussian distribution for the thermostats in (8), when analyzing the dynamic system of a canonical ensemble (Tuckerman, 2010). Inspired by this, one typical assumption in previous SGMCMC work is that the marginal distribution for the momentum and thermostat is Gaussian (Ding et al., 2014; Li et al., 2016). However, this assumption, while convenient, does not necessarily guarantee an optimal sampler.
In recent work, Lu et al. (2016) extended the standard (Newtonian) kinetics to a more general form inspired by relativity theory. By bounding the momentum, their relativistic Monte Carlo can lessen the problem associated with large potential gradients, , thus resulting in a more robust alternative to standard HMC. Further, Zhang et al. (2016b) demonstrated that adopting nonGaussian kinetics delivers better mixing and reduces sampling autocorrelation, especially for cases where the posterior distribution has multiple modes.
These ideas motivate a more general framework to characterize SGMCMC, with potentially nonGaussian kinetics and thermostats. As a relaxation of SGNHT (Ding et al., 2014; Ma et al., 2015), we consider a Hamiltonian system defined in a more general form
(10) 
where and are any valid potential functions, inherently implying that and
, define valid probability density functions.
We first consider the SDEs of SGNHT with generalized kinetics . The system can be obtained by generalizing (with identity mass matrix for simplicity) in (8) with arbitrary , thus
(11) 
However, if we set as in (3) with , the dynamics governing the SDEs in (11) will often fail to converge. This is because the sufficient condition to guarantee that the Itô process governed by the SDEs in (11) converge to a stationary distribution generally requires the FokkerPlanck equation to hold (Risken, 1984). Further, the existence and uniqueness of the solutions to the FokkerPlanck equation require Lipschitz continuity of drift and diffusion vectors in (6) (Bris & Lions, 2008). Unfortunately, this is not the case for the drift vectors in (11) when , as is nondifferentiable at the origin, i.e., .
Softened kinetics
The above limitation can be avoided by using a softened kinetic function . However, to keep the performance benefits from the original stiff kinetics, we must ensure that has the same tail behavior. We propose that for , the softened kinetics are (for clarity we consider 1D case, however higher dimensions still apply)
(12) 
where is a softening parameter. Note that is (infinitely) differentiable for any and asymptotically approaches the stiff kinetics as . A comparison between stiff kinetics, , and softened kinetics is shown in Figure 1, for different values of . Discussion and formulation of the softened kinetics for arbitrary (and ) are provided in the Supplementary Material (SM).
To generate samples of the momentum variable, , from the density with softened kinetics, which is proportional to , we use a coordinatewise rejection sampling, i.e., the proposed for the th dimension is rejected with probability .
In practice, setting to a relatively large value would still make the gradient illposed close to , thus causing high integration error when simulating the Hamiltonian dynamics. Conversely, setting to a small value will cause a high approximation error w.r.t. the original , thus resulting in a less efficient sampler. Consequently, has to be determined empirically as a tradeoff between integration and approximation errors.
Additional First Order Dynamics
Inspired by Ma et al. (2015), we consider adding Brownian motion to and in (8
), with variances
and , respectively, while maintaining the stochastic process (asymptotically) converging to the correct marginal distribution of . Specifically, we consider the following SDEs:(13) 
The variances control the Brownian motion for , respectively, and denotes a rescaling factor for the friction term of momentum updates. The additional terms and can be understood as firstorder Langevin dynamics (Welling & Teh, 2011). The variance term, , controls the contribution of to the update of w.r.t.
. This is analogous to the hyperparameter balancing
and in the SGDwithmomentum algorithm (Rumelhart et al., 1988). Derivation details for and in (12), as well as other values of , are provided in the SM.The following theorem, proven in the SM, shows that under regularity conditions, the SDEs in (13) lead to posterior samples from the invariant joint distribution , yielding the desired marginal distribution w.r.t. as .
Theorem 1.
The reasoning behind increasing stochasticity in the SDEs is twofold. First, the additional Langevin dynamics are crucial to SGMCMC with generalized kinetics for large . For instance, for , the update for from (11) is . When and is large, will be close to zero, thus (the next sample) will be close to , i.e., the sampler moves arbitrarily slow. As discussed by Zhang et al. (2016b), this can happen when moves to a region where the gradient takes a large absolute value, e.g., near the lowdensity regions in a lighttailed distribution. Fortunately, the additional Langevin dynamics in (13), , compensate for the weak updating signal from , by an immediate gradient signal . Additionally, when becomes small, will become large. As a result, these two updating signals and compensate each other, thereby delivering a stable updating scheme. Likewise, the immediate gradient in (13) can provide complementary updating signal for the thermostat variables, , to offset the weak deterministic update , when is large.
Second, (13) has noise components on all parameters, , making the corresponding SDEs elliptic
. From a theoretical perspective, ellipticity/hypoellipticity are necessary conditions to guarantee existence of bounded solutions for a particular partial differential equation related to the diffusion’s
infinitesimal generator, which lies in the core of most recent SGMCMC theory (Teh et al., 2014; Vollmer et al., 2016; Chen et al., 2015). Ellipticity is characterized by a noise process (Brownian motion) covering all components of the system, via the diffusion, , in (6). This means is block diagonal, thus a positive definite matrix (Mattingly et al., 2010). In a typical hypoelliptic case, the noise process is imposed on a subset of . However, hypoellipticity also requires the noise to be able to spread through the system via the drift term, , which may not be true for general . For instance, in (8), and is block diagonal with entries , i.e., and are not explicitly influenced by the noise process, , thus hypoellipticity cannot be guaranteed.To the authors’ knowledge, for existing SGMCMC algorithms, only SGLD where , satisfies the ellipticity property, while other algorithms such as SGHMC and SGNHT assume hypoellipticity, thus their corresponding are not positive definite.
Stochastic resampling
When generating samples from the stochastic process in (13), we resample momentum and thermostats from their marginal distribution with a fixed frequency, instead of every iteration from their conditionals. Since the momentum and thermostats are drawn from the independent marginals of stationary distribution , it can be shown that reconstructing the stochastic process with the solution of the SDEs will still leave the stochastic process invariant to the target stationary distribution (Brunick et al., 2013).
To simplify the discussion, consider a stochastic process of a particle as in (11) with fixed . As show in Figure 2, suppose the initial value of is far from the maximum a posteriori value. The dynamics governed by (11) will stochastically move along the Hamiltonian contour. The total Hamiltonian energy level is affected by the joint effect of the stochastic diffusion and momentum refraction (i.e., ), which changes continuously over time.
From previous discussions, moving on a high Hamiltonian contour when is less efficient because the absolute value of the momentum, , will get increasingly large, slowing down the movement of . Resampling of momentum according to its marginal will enable the sampler to immediately move to a lower Hamiltonian energy level.
At the burnin stage, this momentumaccumulation/energydrop cycle seen in Figure 2(bottom) via resampling momentum can happen several times, until equilibrium is found. In practice, the resulting energy level is often much lower than initially, thereby delivering a more efficient and accurate dynamic updating.
The frequency of resampling from the marginal of the stationary distribution can have a direct impact on the mixing performance. Setting the frequency too high will result in a randomwalk behavior. Conversely, with a low frequency resampling, the randomwalk behavior is suppressed at a cost of fewer jumps between trajectories associated with different energy levels. It is advisable to increase the resampling frequency if the sampler is initialized on lowdensity (e.g. lighttailed) region.
The resampling step on and plays a role that is similar to adding a Langevin component to , in the sense that both improve convergence for . However, these two strategies (resampling and Langevin) are fundamentally different. We empirically observe that resampling is most helpful during burnin, while the additional Langevinstyle updates are more helpful with mixing during stationary sampling.
Sgmgt
The specifications described above constitute an SGMCMC method for the SDEs in (11), which we call Stochastic Gradient Monomial Gamma Thermostat (SGMGT). We denote the SGMCMC method with additional Brownian motion on and in (13) as SGMGTD (Diagonal), i.e., with and . The complete update scheme, with Euler integrator, for SGMGT is presented in the SM. Note that with , SGMGTD recovers SGHMC as in Chen et al. (2014). Moreover, when , it becomes SGNHT as in Ding et al. (2014).
We note that SGMGTD improves upon SGNHT in three respects: () we introduce generalized kinetics, which provably yield lower autocorrelations than standard HMC, especially in multimodal cases; () the additional stochastic noise on thermostat variables yields more efficient mixing; () we use stochastic resampling to allow for faster interchange between different energy levels, thus alleviating sampling stickiness.
To the authors’ knowledge, despite existing analysis for Langevin Monte Carlo (Bubeck et al., 2015; Dalalyan, 2016), rigorous analysis and comparison of the mixing performance of general SGMCMC is very difficult, thus not yet established. Toward understanding the mixing performance of SGMGTD, we argue that as the minibatch size increases, and the contribution of the diffusion in (6) decreases, the SGMGTD will approach MGHMC, in which case, a large will result in high stationary mixing performance, especially when sampling multimodal distribution, as theoretically shown by Zhang et al. (2016b). Although our experiments support our intuition, a more formal theoretical justification is needed. We leave this as interesting future work.
We observe empirically that when increasing the value of , SGMGTD may not always achieve superior mixing performance. One possible reason for this is a larger value of induces “stiffer” behavior of at , which typically requires a higher level of softening, thus higher rejection rates during the rejection sampling step. Also, when the dimensionality of is higher, the rejection rate of the rejection sampling step increases (proportional to ). In such cases, the efficiency of the sampler decreases with large . For these reasons, we limit our experiments to .
We clearly have more hyperparameters than SGNHT. In practice, we fix , , and set the resampling frequency , which provides robust performance. Thus, only two additional hyperparameters are employed ( and ) compared to SGNHT, and these parameters require further tuning. We use either validation or a holdout set in our experiments.
More accurate numerical integrators
Using a firstorder Euler integrator to approximate the solution of the continuoustime SDEs in (13), leads to errors in the approximate samples (Chen et al., 2016). Alternatively, we can use the symmetric splitting scheme of Chen et al. (2016) to reduce the order of the approximate error to . Details of the splitting used in this work are provided in the SM.
Convergence properties
The SGMGT framework, as an instance of SGMCMC, enjoys the same convergence properties of general SGMCMC algorithms studied in Chen et al. (2015)
. It’s worth to mention that on challenging problems the posterior may not be densely sampled to yield ideal posterior computation, and the asymptotic theory is being used as a useful heuristic. Specifically, it is of interest to quantify how fast the sample average,
, converges to the true posterior average, , for , where is number of iterations. Here we make the same assumptions of Chen et al. (2015), and further assume that a firstorder Euler integrator and a fixed stepsize are used.Proposition 2.
For the proposed SGMGT and SGMGTD algorithms, if a fixed stepsize is used, we have:
Bias:  
MSE: 
This proposition indicates that with larger number of iterations and smaller step sizes, smaller bias and MSE bounds can be achieved. We note that these bounds have similar rates compared to other SGMCMC algorithms such as SGLD, however, as we demonstrate below in the experiments, SGMGT and SGMGTD usually converge faster than existing SGMCMC methods.
In addition, for stochastic resampling, we can extend Proposition 2 to the following complementary results:
Lemma 1.
Let be the stationary distribution of SGMGTD. The stationary distribution of SGMGTD with momentum resampling is the same as .
Lemma 2.
The optimal finitetime bias and MSE bounds for SGMGTD with momentum replacement remain the same as SGMGTD.
Proofs of Lemma 1 and Lemma 2 are provided in the SM. The proposed SGMGT framework has a strong connection with secondorder stochastic optimization methods, leading to a sampling scheme with minibatches with similar mixing performance as slice sampling (Neal, 2003). We discuss the details of this in the SM.
4 Experiments
4.1 Multiplewell Synthetic Potential
We first evaluate the mixing efficiency of SGMGT and SGMGTD for a synthetic problem, to generate samples from a complex multimodal distribution. The distribution is shown in Figure 3(left). See SM for the definition of its potential energy. The modes are almost isolated with a lowdensity region connecting each other. Consequently, traversing between modes is difficult. In order to simulate the noise of the gradient estimates, we set , similar to Ding et al. (2014), where .
We compare SGNHT with SGMGT and SGMGTD with monomial parameter and fix . For all three algorithms, we try a number of hyperparameter settings, e.g., stepsize , , and the soft parameter , and present the best results in Figure 3. Standard SGNHT fails to escape from one of the modes of the distribution. For SGMGT with , the generated samples reached 3 modes. For SGMGTD with , the sampler identified all 5 modes. In Figure 3(right), SGMGTD adequately moves across different modes and yields rapid mixing performance, unlike SGMGT which exhibits stickier behavior.
AUROC ()  A (15)  G (25)  H (14)  P(8)  R (7)  C (87) 

SGNHT  0.89  0.75  0.90  0.86  0.95  0.65 
SGMGT(a=1)  0.92  0.78  0.91  0.86  0.87  0.70 
SGMGTD(a=1)  0.95  0.86  0.95  0.93  0.98  0.73 
SGMGT(a=2)  0.93  0.79  0.93  0.88  0.86  0.62 
SGMGTD(a=2)  0.95  0.90  0.95  0.90  0.97  0.69 
ESS ()  A (15)  G (25)  H (14)  P(8)  R (7)  C (87) 
SGNHT  869  941  1911  2077  1761  1873 
SGMGTD(a=1)  3147  2131  2448  4244  1494  3605 
SGMGTD(a=2)  2700  1989  2768  3430  2265  2969 
4.2 Bayesian Logistic Regression
We evaluated the mixing efficiency and accuracy of SGMGT and SGMGTD using Bayesian logistic regression (BLR) on 6 realworld datasets from the UCI repository
(Bache & Lichman, 2013): German credit (G), Australian credit (A), Pima Indian (P), Heart (H), Ripley (R) and Caravan (C). The data dimensionality ranges from 7 to 87, and total observations vary between 250 to 5822. Gaussian priors are imposed on the regression coefficients. We set the minibatch size to 16. Other hyperparameters are provided in the SM. For each experiment, we draw 5000 iterations with 1000 burnin samples.Results in terms of median Effective Sample Size (ESS) and prediction accuracies as Area Under Receiver Operating Characteristic (AUROC) are summarized in Table 1. All the results are averages over 5 independent runs with random initialization. In general, SGMGTD performs better than SGMGT. For higherdimensional dataset Cavaran, the performance of decreases significantly, indicating numerical difficulties. The performance gap between SGMGT and SGMGTD with or is usually larger than the gap between SGNHT (). Presumably when is greater than 1, SGMGTD has better convergence.
4.3 Latent Dirichlet Allocation
We also test our methods on Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Details of LDA and our implementation are provided in the SM. We use the ICML dataset (Chen et al., 2013), which contains 765 documents corresponding to abstracts of ICML proceedings from 2007 to 2011. After stopword removal, we obtain a vocabulary size of 1918 and about 44K words. We use 80% of the documents for training and the remaining 20% for testing. The number of topics is set to 30, resulting in 57,540 parameters. We use a symmetric Dirichlet prior with concentration . All experiments are based on 5000 MCMC samples with 1000 burnin rounds. We set the minibatch size to 16. Other hyperparameter settings are provided in the SM.
stepsize  0.04  0.05  0.06  0.07  0.08  0.09  0.1 

SGLD  1058  1054  1058  1067  1037  1048  1057 
SGNHT  1104  1144  1039  1024  1043  1067  1107 
SGMGT(a=1)  996  988  990  986  996  998  997 
SGMGTD(a=1)  987  983  996  996  992  1013  1029 
SGMGT(a=2)  1024  1029  1030  1013  1030  1022  1043 
SGMGTD(a=2)  968  994  973  957  961  954  970 
Table 2 shows the test perplexities for SGMGT and SGMGTD for different stepsizes. For each method we highlight the best perplexity. The SGMGTD with outperforms other methods, however SGMGT with fails to achieve a comparable result with SGMGT with , probably because a good initialization is hard to achieve for a highdimensional distribution.
4.4 Discriminative RBM
We applied our SGMGT to the Discriminative Restricted Boltzmann Machine (DRBM)
(Larochelle & Bengio, 2008) on MNIST data. We choose DRBM instead of RBM because it provides explicit stochastic gradient formulas.We evaluated our methods empirically and compare them with SGNHT. We use one hidden layer with 500 units. For each method we performed 1500 iterations with 200 burnin samples. The minibatch size is set to 100. Details of the hyperparameter settings for SGMGT and SGMGTD are provided in the SM. As shown in Figure 4(rightmost 3 panels), we observe that SGMGTD with yields a superior mixing performance. For SGMGTD with , the posterior samples demonstrated both rapid local mixing, and longrange movement. In contrast, SGLD seems trapped into a local mode after around 500 iterations.
Figure 4(left) shows that SGMGTD with delivers the fastest convergence with the highest test accuracy, 0.976. The SGMGTD improves over SGMGT, while performance of SGMGTD seems to increase with a large value of . We observed that the stochastic resampling played a crucial role for SGMGT, as removing the resampling step resulted in a large drop in testing performance and mixing efficiency.
4.5 Recurrent Neural Network
We test our framework on Recurrent Neural Networks (RNNs) for sequence modeling
(Gan et al., 2017). We consider two tasks: (i) polyphonic music prediction; and (ii) wordlevel language modeling, detailed below. Additional details of the experiment are provided in the SM.Polyphonic music prediction
We use four datasets: Pianomidi.de (Piano), Nottingham (Nott), MuseData (Muse) and JSB chorales (JSB) (BoulangerLewandowski et al., 2012). Each of these are represented as a collection of 88dimensional binary sequences, that span the whole range of piano from A0 to C8.
We use a onelayer LSTM (Hochreiter & Schmidhuber, 1997)
model, and set the number of hidden units to 200. The total number of parameters is around 200K. Each model is trained for 100 epochs. We perform early stopping, while selecting the stepsize and other hyperparameters by monitoring the performance on validation sets. Updates are performed using minibatches from one sequence.
Language modeling
The Penn Treebank (PTB) corpus (Marcus et al., 1993) is used for wordlevel language modeling. We adopt the standard split (929K training words, 73K validation words, and 82K test words). The vocabulary size is 10K. We train a twolayer LSTM model on this dataset. The total number of parameters is approximately 6M. Each LSTM layer contains 200 units.
Algorithms  Piano  Nott  Muse  JSB  PTB 

SGLD  11.37  6.07  10.83  11.25  127.47 
SGNHT  9.00  4.24  7.85  9.27  131.3 
SGMGT (a=1)  7.90  4.35  8.42  8.67  120.6 
SGMGT (a=2)  10.17  4.64  8.51  8.84  250.5 
SGMGTD (a=1)  7.51  3.33  7.11  8.46  113.8 
SGMGTD (a=2)  7.53  3.35  7.09  8.43  109.0 
SGD  11.13  5.26  10.08  10.81  120.44 
RMSprop  7.70  3.48  7.22  8.52  120.45 
Results are shown in Table 3. The best loglikelihood results on the test set are achieved by using SGMGTD with either or (depending on the dataset). To compare with optimizationbased methods, we also include results for SGD (Bottou, 2010) and RMSprop (Tieleman & Hinton, 2012). A more comprehensive comparison is provided in the SM.
Learning curves for Nott and PTB datasets are shown in Figure 5. We omit the SGLD results since they are not comparable with other methods. For both datasets, we observe that SGMGTD delivers fastest convergence. The best negative loglikelihood is achieved by SGMGTD . The difference between and is small, though SGMGTD with seems to decrease slightly faster after 20 epochs for PTB data.
We also observe that the SGMGT with seems suboptimal compared with SGMGT with and SGNHT. We hypothesize that numerical difficulties hinder the success of SGMGT with , especially in higherdimensional cases, and without the additional Langevin components of SGMGTD.
5 Conclusions
We improve upon existing SGMCMC methods with several generalizations. We employed a general kinetic function, which we have shown to have better mixing efficiency, especially for multimodal distributions. Since practical use of the generalized kinetics is limited by convergence issues during burnin, we injected additional Langevin dynamics and incorporated a stochastic resampling step to obtain generalized SDEs that alleviate the convergence issues. Possible areas of future research include designing an algorithm in a slicesampling fashion, which maintains the invariant distribution by leveraging the connections between HMC and slice sampling (Zhang et al., 2016a). In addition, it is desirable to design an algorithm that can adaptively choose the monomial parameter , thereby achieving better mixing while automatically avoiding numerical difficulties.
Acknowledgments
This research was supported by ARO, DARPA, DOE, NGA, ONR and NSF.
References
 Bache & Lichman (2013) Bache, Kevin and Lichman, Moshe. UCI machine learning repository, 2013.
 Blei et al. (2003) Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet allocation. JMLR, 3, 2003.

Bottou (2010)
Bottou, Léon.
Largescale machine learning with stochastic gradient descent.
In COMPSTAT, 2010.  BoulangerLewandowski et al. (2012) BoulangerLewandowski, Nicolas, Bengio, Yoshua, and Vincent, Pascal. Modeling temporal dependencies in highdimensional sequences: Application to polyphonic music generation and transcription. In ICML, 2012.
 Bris & Lions (2008) Bris, C Le and Lions, PL. Existence and uniqueness of solutions to fokker–planck type equations with irregular coefficients. Communications in Partial Differential Equations, 33(7):1272–1317, 2008.
 Brunick et al. (2013) Brunick, Gerard, Shreve, Steven, et al. Mimicking an itô process by a solution of a stochastic differential equation. The Annals of Applied Probability, 23(4):1584–1628, 2013.
 Bubeck et al. (2015) Bubeck, Sebastien, Eldan, Ronen, and Lehec, Joseph. Finitetime analysis of projected langevin monte carlo. In NIPS, 2015.
 Chen et al. (2013) Chen, Changyou, Rao, Vinayak, Buntine, Wray, and Whye Teh, Yee. Dependent normalized random measures. In ICML, 2013.
 Chen et al. (2015) Chen, Changyou, Ding, Nan, and Carin, Lawrence. On the convergence of stochastic gradient mcmc algorithms with highorder integrators. In NIPS, pp. 2278–2286, 2015.
 Chen et al. (2016) Chen, Changyou, Carlson, David, Gan, Zhe, Li, Chunyuan, and Carin, Lawrence. Bridging the gap between stochastic gradient mcmc and stochastic optimization. In AISTATS, 2016.
 Chen et al. (2014) Chen, Tianqi, Fox, Emily B, and Guestrin, Carlos. Stochastic gradient hamiltonian monte carlo. ArXiv, 2014.
 Dalalyan (2016) Dalalyan, Arnak S. Theoretical guarantees for approximate sampling from smooth and logconcave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.
 Ding et al. (2014) Ding, Nan, Fang, Y, Babbush, R, Chen, C, Skeel, R. D, and Neven, H. Bayesian sampling using stochastic gradient thermostats. In NIPS, 2014.
 Duane et al. (1987) Duane, Simon, Kennedy, Anthony D, Pendleton, Brian J, and Roweth, Duncan. Hybrid Monte Carlo. Physics letters B, 195(2), 1987.
 Gan et al. (2017) Gan, Zhe, Li, Chunyuan, Chen, Changyou, Pu, Yunchen, Su, Qinliang, and Carin, Lawrence. Scalable bayesian learning of recurrent neural networks for language modeling. In ACL, 2017.
 Hochreiter & Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jürgen. Long shortterm memory. Neural computation, 1997.
 Larochelle & Bengio (2008) Larochelle, H. and Bengio, Y. Classification using discriminative restricted boltzmann machines. In ICML, 2008.
 Li et al. (2016) Li, Chunyuan, Chen, Changyou, Fan, Kai, and Carin, Lawrence. Highorder stochastic gradient thermostats for bayesian learning of deep models. In AAAI, 2016.
 Lu et al. (2016) Lu, Xiaoyu, Perrone, Valerio, Hasenclever, Leonard, Teh, Yee Whye, and Vollmer, Sebastian J. Relativistic monte carlo. arXiv, 2016.
 Ma et al. (2015) Ma, YiAn, Chen, Tianqi, and Fox, Emily. A complete recipe for stochastic gradient mcmc. In NIPS, pp. 2917–2925, 2015.
 Marcus et al. (1993) Marcus, Mitchell P, Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 1993.
 Mattingly et al. (2010) Mattingly, J. C., Stuart, A. M., and Tretyakov, M. V. Construction of numerical timeaverage and stationary measures via Poisson equations. SIAM J. NUMER. ANAL., 48(2):552–577, 2010.
 Metropolis et al. (1953) Metropolis, Nicholas, Rosenbluth, Arianna W, Rosenbluth, Marshall N, Teller, Augusta H, and Teller, Edward. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6), 1953.
 Neal (2003) Neal, Radford M. Slice sampling. Annals of statistics, 2003.
 Neal (2011) Neal, Radford M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.
 Risken (1984) Risken, Hannes. Fokkerplanck equation. In The FokkerPlanck Equation, pp. 63–95. Springer, 1984.
 Rumelhart et al. (1988) Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by backpropagating errors. Cognitive modeling, 5(3):1, 1988.
 Teh et al. (2014) Teh, Yee Whye, Thiéry, Alexandre, and Vollmer, Sebastian. Consistency and fluctuations for stochastic gradient langevin dynamics. ArXiv, 2014.
 Tieleman & Hinton (2012) Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 2012.
 Tuckerman (2010) Tuckerman, Mark. Statistical mechanics: theory and molecular simulation. Oxford University Press, 2010.
 Vollmer et al. (2016) Vollmer, Sebastian J, Zygalakis, Konstantinos C, and Teh, Yee Whye. Exploration of the (non) asymptotic bias and variance of stochastic gradient langevin dynamics. Journal of Machine Learning Research, 17(159):1–48, 2016.
 Welling & Teh (2011) Welling, Max and Teh, Yee W. Bayesian learning via stochastic gradient langevin dynamics. In ICML, 2011.
 Zhang et al. (2016a) Zhang, Yizhe, Chen, Changyou, Henao, Ricardo, and Carin, Lawrence. Laplacian hamiltonian monte carlo. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 98–114. Springer, 2016a.
 Zhang et al. (2016b) Zhang, Yizhe, Wang, Xiangyu, Chen, Changyou, Henao, Ricardo, Fan, Kai, and Carin, Lawrence. Towards unifying hamiltonian monte carlo and slice sampling. In NIPS, 2016b.
Comments
There are no comments yet.