Faster Hamiltonian Monte Carlo by Learning Leapfrog Scale

10/10/2018 ∙ by Changye Wu, et al. ∙ 0

Hamiltonian Monte Carlo samplers have become standard algorithms for MCMC implementations, as opposed to more basic versions, but they still require some amount of tuning and calibration. Exploiting the U-turn criterion of the NUTS algorithm (Hoffman and Gelman, 2014), we propose a version of HMC that relies on the distribution of the integration time of the associated leapfrog integrator. Using in addition the primal-dual averaging method for tuning the step size of the integrator, we achieve an essentially calibration free version of HMC. When compared with the original NUTS on several benchmarks, this algorithm exhibits a significantly improved efficiency.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 16

page 18

page 19

page 20

page 21

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Hamiltonian Monte Carlo (HMC, Duane et al., 1987; Neal et al., 2011)

has emerged as an efficient Markov Chain Monte Carlo (MCMC) sampling method, which, in particular, is able to deal with high-dimensional target distribution. The method relies on a deterministic differential flow stemming from Hamiltonian mechanics to produce transitions across the parameter space of an augmented distribution, referred to as Hamiltonian. Such time-continuous dynamics leave the augmented distribution invariant but in practice the latter need to be discretised. The transitions are hence computed using a gradient-based symplectic integrator — the most popular one being the second order Störmer-Verlet or leapfrog integrator — and a Metropolis-Hastings acceptance ratio need to be added towards correcting the approximate numerical solution and hence preserving the proper target measure.



Whilst the algorithm theoretically benefits from a fast exploration of the parameter space by accepting large transitions with high probability, this is marred by its high sensitivity to hand-tuned parameters, namely the step size

of the discretisation scheme, the number of steps of the integrator, and the covariance matrix of the auxiliary variables. Indeed, the discretisation error due to a poorly tuned step size will lead to a low acceptance rate. Furthermore calibrating the number of steps of the integrator is paramount to avoid slow mixing chains or to save some computation cost. In other words, if is not large enough, HMC exhibits a random walk behaviour similar to standard Metropolis Hastings algorithms and may thus struggle to properly explore the support of the target. If is too large the associated dynamic retraces its steps back to a neighbourhood of the initial state (Betancourt et al., 2014) wasting computation efforts and diminishing mixing. Finally the choice of a well-suited covariance matrix for the auxiliary variable can enhance both the speed and the mixing of HMC.

Recent developments have addressed this tuning issue: the automatic tuning strategy of Wang et al. (2013), the Riemann manifold approach of Girolami and Calderhead (2011) that adapts the covariance matrix of the auxiliary variables to the curvature of the target distribution, just to name a few although these versions are not of direct interest. Among those, the No-U-Turn Sampler (NUTS, Hoffman and Gelman, 2014) is arguably the most popular, in particular because it automatically tunes both the step size and the number of leapfrogs. The algorithm is just as efficient, if not better than standard HMC, albeit this comes at an extra cost that may prove burdensome. The core idea behind NUTS is to pick the step size via primal-dual averaging (Nesterov, 2009) in a burn-in phase and to build at each iteration a proposal based on a locally longest path on a level set of the Hamiltonian. This is achieved by a recursive algorithm that, at each call to the leapfrog integrator, requires to evaluate both the gradient of the target distribution and the Hamiltonian itself. Roughly speaking an iteration of NUTS costs twice as much as HMC with the same number of calls to the integrator.

In this paper, we aim at reducing this extra cost and explore the opportunity to learn an empirical distribution of the longest leapfrog path — consistent with the target distribution — by exploiting the U-turn criterion of NUTS for the level set visited during a burn-in phase. The latter is then used to randomly pick at each iteration of an HMC scheme.

The paper begins with a brief description of HMC and NUTS in Section 2. We then introduce the learning strategy of the empirical distribution in Section 3. We illustrate the efficiency of the proposed algorithm compared to the original NUTS on several benchmarks in Section 4, and conclude in Section 5.

All examples in the paper can be reproduced using R and Rstan111https://github.com/wcythh/eHMC (Stan Development Team, 2018).

2 Hamiltonian Monte Carlo

Consider a probability measure on with density, also denoted , with respect to the Lebesgue measure, , where is a continuously differentiable function called the potential function. HMC allows to sample from by augmenting the target distribution with an auxiliary variable , referred to as momentum variable (as opposed to called the position), distributed according to a

-dimensional Normal distribution,

, that is HMC samples from , which marginal chain in is the distribution of interest. We refer the reader to Betancourt (2017) for a discussion about other momentum distributions and Girolami and Calderhead (2011) for a -dependent momentum distribution based on Riemann manifolds.

algocf[!t]    

Define the Hamiltonian as the unnormalised negative joint log-density

HMC generates proposals for based on the Hamiltonian dynamics which write with respect to a fictitious time

(1)

Such a mechanism aims at efficiently exploring the parameter space as compared to standard random-walk Metropolis-Hastings proposals while preserving the measure — the solution flow being reversible. In practice, the above differential equations cannot be solved analytically and HMC samplers resort to symplectic numerical integrators. The leapfrog integrator is commonly used for its trade-off between second order accuracy and computational cost. Given a discretisation time-step , it yields a map defined by

It corresponds to a three stage procedure that alternatively updates and to approximate the solution at time , and to approximate solution at a time , one repeats the procedure times. is referred to as number of leapfrog steps. This discrete scheme does no longer leave the measure invariant. To account for the discretisation bias and preserve the targetted measure, an accept-reject step is introduced (see Algorithm LABEL:alg:HMC). A transition from to a proposal corresponding to the approximated solution for an integration time is accepted with probability

(2)

which implies that detailed balance is satisfied for the target . The solution of (1) keeps the Hamiltonian constant, meaning the proposals are always accepted for the exact dynamic. After discretisation, the deviation to can be bounded (Leimkuhler and Reich, 2004),

Hence, the acceptance rate in Algorithm LABEL:alg:HMC still tends to be high even for a proposal quite far from .

2.1 Nuts

In this section, we briefly introduce the NUTS algorithm, referring the reader to Hoffman and Gelman (2014) for more details.

The so-called no-U-Turn sampler (NUTS) is a version of HMC sampler that eliminates the need to specify the number of HMC steps by adaptively choosing the locally largest value at each iteration of the algorithm. More precisely, given a leapfrog scale and a current value , NUTS first resamples the HMC momentum

from a Gaussian distribution

. Next, NUTS proceeds by doubling the leapfrog path, either forward or backward with equal probability, until the path begins to retrace towards the starting point. This means that the backward and forward end points of the path, and , satisfy

(3)

Given the collection of visited position-momentum pairs along the path, the final step of NUTS consists in selecting one point in by a slice sampling move. This means generating a Uniform and sampling uniformly in one of the points being able to regenerate and with Hamiltonian larger than . This allows for detailed balance to hold and thus validates the NUTS algorithm.

In addition, the step size in NUTS is tuned via the primal-dual averaging (Nesterov, 2009), by aiming at a targeted acceptance probability .

Compared with the standard HMC sampler, which relies on a fixed length across iterations, NUTS choose adaptively at each iteration while it requires an evaluation of the Hamiltonian along its entire leapfrog path. This obviously doubles the computation cost, when evaluating and its gradient require the same computation effort.

Even though the leapfrog path of NUTS is locally the longest possible in terms of Equation (3), we note that the starting position-momentum pair is almost never one of the two endpoints of the generated leapfrog path. Therefore the distance between the position of the proposed value and the current one is smaller than it could be, which induces a waste of computation time.

3 Learning leapfrog scale

In this section, we explore the opportunity to use a U-turn condition similar to Equation (3) for choosing in the HMC sampler based on the empirical distribution of the longest integration times. Regarding the step size , we also adopt the primal-dual averaging method implemented to NUTS.

3.1 Empirical distribution of leapfrog scale

Given a step size and a starting point , denote , the value of the pair after iterations of the leapfrog integrator, . We define the longest batch associated with as the first such that referred to as U-turn condition in what follows. We can get an empirical distribution of the longest batches by running iterations of HMC with the optimised step size and an initial number of leapfrog steps (see Algorithm LABEL:alg:L-emp-dist). Specifically, each iteration of HMC is ran for leapfrog steps to generate the next state of the Markov chain. At the same time, we compute the longest batch for the current state of the chain, that is if we store it on the go throughout the first steps, otherwise we keep simulating the dynamics till reaching the U-turn condition. Such an algorithm produces a sequence of longest batches

whose empirical measure is

where denotes the Dirac measure at , .

algocf[!t]    

3.2 Convergence Analysis

Let be the distribution of the longest batch conditionally on the current state with step size and its marginal distribution with respect to the augmented target . The transition kernel of the HMC sampler can be written as the composition of two Markov kernels and over :

where and is defined by Equation (2). Both kernels are reversible with respect to the augmented target and hence the composition leaves the latter invariant. As a consequence, produced by Algorithm LABEL:alg:L-emp-dist converges to .

3.3 Empirical Hamiltonian Monte Carlo

We introduce a version of HMC, called empirical HMC (eHMC) that makes use of the empirical distribution constructed in Algorithm LABEL:alg:L-emp-dist.

NUTS derives at each iteration a locally adapted scale for used in a manner compatible with the stationary distribution. Indeed, running the longest batch at each iteration, albeit interesting in increasing the jumping distance between the current and the proposed states, creates a bias in the resulting stationary distribution. NUTS hence requires a slice auxiliary variable to keep the reversibility of the Markov chain. Here we propose an orthogonal direction by considering a global distribution on the batch sizes consistent with the target distribution.

The core idea of eHMC is to randomly pick a number of leapfrog steps according to the empirical distribution at each iteration (see Algorithm LABEL:alg:eHMC). The advocated algorithm is valid since the choice of is independent from the current state of the chain. Thus the resulting transition kernel can be seen as a composition of multiple Markov kernels attached to the same stationary distribution (Tierney, 1994).

3.4 Two Variants

In this section, we propose two further HMC-based samplers by leveraging in two different ways. The first sampler, eHMCq, chooses an upper bound on the number of steps,

, based on the quantiles of

and the longest batch size in the HMC sampler is simulated at each iteration as . In our method, , the quantile of with

is the integer part by excess. The motivation for this version of uniformly distributed steps is that, when compared with standard HMC, a random number of steps helps the resulting HMC sampler to overcome the reducibility problem, which is caused by a (near-)periodicity. Hence randomness is beneficial for the mixing behaviour of the generated Markov chain. The underlying motivation for choosing a

quantile of is as follows. On the one hand, should be large enough. Since the number of steps at each iteration is distributed uniformly over , the expected number of steps integrator is just . As a result, we should choose to be something like the quantile rather than the mean or the median of . On the other hand, in order to prevent the choice of extremely large ’s, which will considerably waste computation efforts without meaningful improvement of the HMC sampler, we set as a quantile of , instead of the maximum of .

Inspired by the same appeal in adopting random steps, the second sampler, called eHMCu, runs the leapfrog integrator times in an HMC sampler for the -th iteration, where , . algocf[!t]    

4 Numerical Experiment

This section produces empirical comparisons of the performance of the NUTS and eHMC samplers, evaluated in terms of effective sample size (ESS) over several models with increasing complexity. We recall that the standard effective sample size associated with a Markov chain is defined as

where is the sample size and is the auto-correlation of lag

of the Markov chain. Informally, the ESS of an MCMC outputs is understood as the equivalent number of independent simulations from the target distribution, where equivalent means with equivalent variance. The following examples resort to the

ess function of the R package mcmcse (Flegal et al., 2016)

to produce estimates of the ESS for each component of the parameter. Hence, the more efficient a sampler is in its approximation to

i.i.d. sampling from the target, the higher the associated ESS should be. In order to differentiate between the computing requirements of the compared samplers, it seems fairer to evaluate samples generated under the same computation budget. As a result, the ESS is calibrated by the evaluation cost of the logarithm of target density function and of its gradient. This choice is based on the assumption that the largest portion of the computation budget is dedicated to compute both and . The evaluations below report the minimum of ESS per evaluation over all components, which identifies the least effective part of the sampler.

Furthermore, considering that the ESS criterion only reflects on the marginal efficiency of a sampler, we report in addition the now traditional expected squared jump distance per evaluation of and . While depending on the measure of the jump, this quantity appears as a global mixing assessment that measures another aspect of the efficiency of a sampler. We recall that, given a set of generated samples , its estimated ESJD is defined as

In each of the following experiments, NUTS is implemented in Rstan (Stan Development Team, 2018) and based on 25,000 iterations, with the first 5,000 iterations used to tune the step size for each targeted accept probability and the last 20,000 iterations used as sample draws. For each tuned , we run Algorithm LABEL:alg:L-emp-dist over 2,000 iterations to construct the set , which is then exploited by eHMC. Each experiment is repeated 40 times. Besides, In our experiments, the mass matrix, , is set to the identity, for both NUTS and eHMC sampler.

4.1 Multivariate Normal Distribution

This toy example sets the target as a multivariate Normal distribution with zero mean and covariance matrix ,

The dimension of the vector is chosen to be

, with

The 15 targeted values for the ideal acceptance probability, , in the primal-dual averaging step are spaced evenly between and . Figure 1 displays the minimum ESS across the parameter components and ESJD per evaluation for each sampler. As is clear from the plots, the gain brought by using eHMC and its variants is important on most of the range of ’s, without ever exhibiting a significant loss of efficiency.

Figure 1: Comparison amongst NUTS, eHMC, eHMCq and eHMCu samplers for a multivariate Normal distribution: the x-axis corresponds to the targeted accept probability, , associated with the primal-dual averaging step, and the y-axis in the two plots are the minimum of ESS across the parameter coordinates and ESJD per evaluation of and respectively. Each black dot shows the performance for a single experiment and the red curve is the median of the 40 dots’ y-values for a particular .

4.2 Bayesian Logistic Posterior

This example takes its target as the posterior distribution of a Bayesian logistic regression under a flat prior

,

The inference is based on a German credit dataset obtained from UCI repository (Frank and Asuncion, 2010), which contains observations and covariates. As described by Hoffman and Gelman (2014), all covariates are normalized towards zero mean and unit variance. Once more, the targeted values of the acceptance probability in primal-dual averaging algorithm are evenly spaced between 0.25 and 0.95. Figure 2 displays the minimum of ESS across all dimensions and ESJD per evaluation when varies. In this high dimensional case, the improvement brought by eHMC and its variants is stronger, with improvement in ESS up to one order of magnitude.

Figure 2: Comparison amongst NUTS, eHMC, eHMCq and eHMCu for a Bayesian logistic model: the x-axis corresponds to the targeted accept probability, , associated with the primal-dual averaging step, and the y-axis in the plots are the minimum of ESS across the parameter coordinates and ESJD per evaluation of and .

4.3 Stochastic Volatility Model

The following example aims at a posterior distribution associated with a stochastic volatility model and artificial observations. The stochastic volatility model is based on the state-space equations:

for

and for

where only the ’s are observed. The prior is chosen to be , , and . In order to avoid overflow problems, we reparametrize the three parameters , and as , and . In this example, also used by Girolami and Calderhead (2011), we generate observations with parameters chosen as , and . The dimension of the target is thus , with latent variables . The associated potential is thus

We split the parameter into two groups: and according to the corresponding scales. Figure 3 compares the minimum ESS per evaluation of and of each parameter group and Figure 4 shows the ESJD on and per evaluation of and for each sampler over 9 evenly spaced accept probability, , between and , with repetitions for each . The gain produced by eHMC and its variants is between two- and three-fold.

Figure 3: Comparison amongst NUTS, eHMC, eHMCq and eHMCu samplers for the stochastic volatility model: the x-axis is the targeted accept probability fed into primal-dual averaging, the y-axis is ESS of per evaluation of and for each parameter group.
Figure 4: Comparison amongst NUTS, eHMC, eHMCq and eHMCu for the stochastic volatility model: the x-axis is the targeted accept probability fed into primal-dual averaging, the y-axis are ESJD of and ESJD of per evaluation of and respectively. Each black dot shows the performance for a single experiment and the red curve is the median of the 40 dots’ y-values for a particular

4.4 Logistic Item Response Theory Model

This example considers a hierarchical item response theory (IRT) model (Hambleton et al., 1991), written as

with

where is the response for person to item , and are called the discrimination and difficulty parameters, respectively, and

is called the ability parameter. The priors for the hyperparameters are

, , , and under the constraints , , and . These constraints are handled in HMC by looking at the logarithm transforms with the appropriate Jacobian adjustment. We use a benchmark dataset from STAN repository222https://github.com/stan-dev/stat_comp_benchmarks/tree/master/benchmarks/irt_2pl, in which . We examine 10 evenly spaced values between and as the targeted accept probability, for which a primal-dual averaging algorithm returns the associated . As shown in Table 1, the ESS is two to three times larger for eHMC, depending on the quantity of interest. Figure 5, 6 and 7 show the detailed comparison in terms of ESS and ESJD for those ’s.

Sampler ESS () per evaluation (mean sd)
NUTS
eHMC
eHMCq
eHMCu
Table 1: Comparison between NUTS and eHMC for a two parameter logistic item response theory model: each entry corresponds to the minimum ESS per evaluation of and over a group of parameters, maximised over the ten possible values of .
Figure 5: Comparison amongst NUTS, eHMC, eHMCq and eHMCu for the IRT model: the x-axis is the targeted accept probability fed into primal-dual averaging, the y-axis is the minimum ESS across the parameter coordinates per evaluation of and for each parameter group.
Figure 6: Comparison amongst NUTS, eHMC, eHMCq and eHMCu for the IRT model: the x-axis is the targeted accept probability fed into primal-dual averaging, the y-axis is ESS of each hyperparameter per evaluation of and .
Figure 7: Comparison amongst NUTS, eHMC, eHMCq and eHMCu for the IRT model: the x-axis is the targeted accept probability fed into primal-dual averaging, the y-axis is ESJD of each parameter group per evaluation of and .

4.5 Susceptible-Infected-Recovered Model

Susceptible-infected-recovered (SIR) models are used to represent disease transmission, for epidemics like cholera, within a population. The model in this example is

The dynamic of is

where represent the number of susceptible, infected, and recovered persons, respectively, within the community, and represents the concentration of the virus in the water reservoir used by this population (Grad et al., 2012). In this model, the size of the population is assumed to be constant, which means that there is no death because of the disease or other reasons and no birth. The parameter represents the size of the population that becomes infected from being susceptible during the time interval . This example uses observations over the time steps , obtained from STAN repository333https://github.com/stan-dev/stat_comp_benchmarks/tree/master/benchmarks/sir and the priors over the parameters are , , and . Towards bypassing the constraints , , and , we once more use a reparametrisation into logarithms. Ten targeted accept probabilities are used for primal-dual tuning of the step size, evenly spaced from to . Figure 8 shows the detailed comparison amongst NUTS, eHMC, eHMCq and eHMCu in terms of ESS and ESJD per evaluation of and . Table 2 reports the best performance of each sampler across the 10 values of . These experiments show once more that eHMC and its variants, when based upon , exhibit a higher efficiency than NUTS in terms of ESS and ESJD.

Sampler ESS () per evaluation (mean sd)
NUTS
eHMC
eHMCq
eHMCu
Table 2: Comparison between NUTS and eHMC for a SIR model: each entry corresponds to the minimum ESS per evaluation of and over a group of parameters, maximised over the ten possible values of .
Figure 8: Comparison amongst NUTS, eHMC, eHMCq and eHMCu for the SIR model: the x-axis is the targeted accept probability fed into primal-dual averaging, the y-axis in the first four rows is ESS of each parameter per evaluation of and and the y-axis in the lowermost plots is ESJD of per evaluation of and .

5 Conclusion and Prospective

The eHMC algorithm is an alternative to NUTS that relies on an empirical version, , of the distribution of the longest batch sizes in the leapfrog integrator, for a given step size . Its motivation is to reduce wasted simulations created by NUTS, due to its stationarity constraint and the induced backtracking. Even though the length of the path is de facto reduced, we have shown through a collection of Monte Carlo experiments, the proposed eHMC leads to an improved efficiency over the standard NUTS, when evaluated by ESS or by effective jumping distance per potential evaluation. While we do not aim here at optimising the length of the HMC step in terms of one of these criteria, the automated random choice of at each iteration shows clear promises.

The comparisons produced in this paper seem to contradict those of Hoffman and Gelman (2014). There are two reasons for this opposition. First, Hoffman and Gelman (2014) compared NUTS with the standard HMC sampler, which the number of steps is fixed at each and every iteration, instead of an HMC with a random number of steps. It has already been observed that resorting to a random integration time in an HMC sampler should improve performances. Second, the number of evaluations of both and , rather than solely as in Hoffman and Gelman (2014), is adopted as an surrogate for the computation cost.

The improvement of eHMC over NUTS proceeds from two features. First, eHMC chooses a globally appropriate value of for an HMC sampler. Second, NUTS wastes computation effort to some extend. In fact, even though NUTS only generates longest batches to prevent from retracing at each iteration, it is impossible to make sure that the current position is close to an end-point of the generated leapfrog path. Furthermore, the NUTS proposal is uniformly distributed over the candidate set, instead of being one of the end-points. Both limitations restrict the distance between the current position and the proposal.

Obviously, contains valuable information about the number of steps in the leapfrog integrator for HMC samplers and the related distribution deserves further investigation. Besides, we can also insert Algorithm LABEL:alg:L-emp-dist into a primal-dual averaging scheme to learn a rough version of and hence save computation cost.

However, we stress that the usage of need not be restricted to one of the special forms of eHMC, like the eHMCq and eHMCu samplers. For instance, instead of setting in eHMCq, choosing from , for , , towards maximizing ESJD is also an interesting direction to investigate, even though it requires more computation effort during warmup stage.

References

  • Betancourt (2017) Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.
  • Betancourt et al. (2014) Betancourt, M., Byrne, S., and Girolami, M. (2014). Optimizing the integrator step size for Hamiltonian Monte Carlo. arXiv preprint arXiv:1411.6669.
  • Duane et al. (1987) Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.
  • Flegal et al. (2016) Flegal, J. M., Hughes, J., and Vats, D. (2016).

    mcmcse: Monte Carlo Standard Errors for MCMC

    .
    Riverside, CA and Minneapolis, MN. R package version 1.2-1.
  • Frank and Asuncion (2010) Frank, A. and Asuncion, A. (2010).

    UCI Machine Learning Repository. Irvine, CA: University of California.

    http://archive. ics. uci. edu/ml, School of Information and Computer Science, 213:2–2.
  • Girolami and Calderhead (2011) Girolami, M. and Calderhead, B. (2011). Riemann Manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214.
  • Grad et al. (2012) Grad, Y. H., Miller, J. C., and Lipsitch, M. (2012). Cholera modeling: challenges to quantitative analysis and predicting the impact of interventions. Epidemiology (Cambridge, Mass.), 23(4):523.
  • Hambleton et al. (1991) Hambleton, R. K., Swaminathan, H., and Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA.
  • Hoffman and Gelman (2014) Hoffman, M. D. and Gelman, A. (2014). The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623.
  • Leimkuhler and Reich (2004) Leimkuhler, B. and Reich, S. (2004). Simulating Hamiltonian Dynamics, volume 14. Cambridge University Press.
  • Neal et al. (2011) Neal, R. M. et al. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2.
  • Nesterov (2009) Nesterov, Y. (2009). Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259.
  • Stan Development Team (2018) Stan Development Team (2018). RStan: the R interface to Stan. R package version 2.17.3. http://mc-stan.org.
  • Tierney (1994) Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). The Annals of Statistics, 22:1701–1762.
  • Wang et al. (2013) Wang, Z., Mohamed, S., and De Freitas, N. (2013). Adaptive Hamiltonian and Riemann Manifold Monte Carlo Samplers. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML’13, pages III–1462–III–1470. JMLR.org.