Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation

04/27/2020 · Charles C. Margossian et al. · Columbia University

Gaussian latent variable models are a key class of Bayesian hierarchical models with applications in many fields. Performing Bayesian inference on such models can be challenging as Markov chain Monte Carlo algorithms struggle with the geometry of the resulting posterior distribution and can be prohibitively slow. An alternative is to use a Laplace approximation to marginalize out the latent Gaussian variables and then integrate out the remaining hyperparameters using dynamic Hamiltonian Monte Carlo, a gradient-based Markov chain Monte Carlo sampler. To implement this scheme efficiently, we derive a novel adjoint method that propagates the minimal information needed to construct the gradient of the approximate marginal likelihood. This strategy yields a scalable method that is orders of magnitude faster than state-of-the-art techniques when the hyperparameters are high dimensional. We prototype the method in the probabilistic programming framework Stan and test the utility of the embedded Laplace approximation on several models, including one where the dimension of the hyperparameter is ∼6,000. Depending on the case, the benefits are either a dramatic speed-up or an alleviation of the geometric pathologies that frustrate Hamiltonian Monte Carlo.


1 Introduction

Latent Gaussian models observe the following hierarchical structure:

φ ∼ π(φ),   θ ∼ Normal(0, K(φ)),   y ∼ π(y | θ, φ).

Typically, single observations y_i are independently distributed and only depend on a linear combination of the latent variables, that is, π(y_i | θ, φ) = π(y_i | a_i^T θ, φ), for some appropriately defined vectors a_i. This general framework finds a broad array of applications: Gaussian processes, spatial models, and multilevel regression models, to name a few examples. We denote θ the latent Gaussian variable and φ the hyperparameter, although we note that in general φ may refer to any latent variable other than θ.
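As a concrete instance of this template (anticipating the disease-mapping example of Section 4), a Gaussian process count model takes the form

φ = (α, ρ) ∼ π(φ),   θ ∼ Normal(0, K(φ)),   y_i ∼ Poisson(e^{θ_i}),

where K(φ) is, for example, a squared exponential covariance matrix with marginal standard deviation α and length scale ρ.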

We derive a method to perform Bayesian inference on latent Gaussian models, which scales when φ is high dimensional. The idea is to use a gradient-based Markov chain Monte Carlo (MCMC) sampler, coupled with a Laplace approximation to marginalize out θ. The key to successfully implementing this scheme is a novel adjoint method that efficiently differentiates the approximate marginal likelihood. In the case of a classic Gaussian process (Section 4), where φ is low dimensional, the computation required to evaluate and differentiate the marginal is on par with the GPstuff package Vanhatalo:2013, which uses the popular algorithm by Rasmussen:2006. The adjoint method is however orders of magnitude faster when φ is high dimensional. Figure 1 shows the superior scalability of the adjoint method on simulated data from a sparse kernel interaction model Agrawal:2019. We lay out the details of the algorithms and the experiment in Section 3.

Figure 1: Wall time to differentiate the marginal density using the adjoint method (Algorithm 2) and, as a benchmark, the method by Rasmussen:2006 (Algorithm 1).

1.1 Existing methods

Bayesian computation is, broadly speaking, split between two approaches: (i) MCMC methods that approximately sample the posterior, and (ii) approximation methods in which one finds a tractable distribution that approximates the posterior (e.g. variational inference, expectation propagation, and asymptotic approximations). The same holds for latent Gaussian models, where we can consider Hamiltonian Monte Carlo (HMC) sampling Neal:2012; Betancourt:2018 or marginalizing out the latent Gaussian variables with a Laplace approximation before deterministically integrating the hyperparameters Tierney:1986; Rue:2009.

Hamiltonian Monte Carlo sampling.

When using MCMC sampling, the target distribution is

π(θ, φ | y) ∝ π(y | θ, φ) π(θ | φ) π(φ),

and the Markov chain explores the joint parameter space of θ and φ.

HMC is a class of MCMC algorithms that powers many modern probabilistic programming languages, including Stan Carpenter:2017, PyMC3 Salvatier:2016, and TensorFlow Probability Dillon+etal:2017:tensorflow_distributions. Its success is both empirically and theoretically motivated (e.g. Betancourt:2017) and, amongst other things, lies in its ability to probe the geometry of the target distribution via the gradient. The algorithm is widely accessible through a combination of its dynamic variants Hoffman:2014; Betancourt:2018, which spare the users the cumbersome task of manually setting the algorithm’s tuning parameters, and automatic differentiation, which alleviates the burden of calculating gradients by hand (e.g. Margossian:2019; Baydin:2018; Griewank:2008). There are known challenges when applying HMC to hierarchical models, because of the posterior distribution’s problematic geometry Betancourt:2013. In the case of latent Gaussian models, this geometric grief is often caused by the latent Gaussian variable, θ, and its interaction with φ.

Marginalization using a Laplace approximation.

The embedded Laplace approximation is also a very successful algorithm, and a key component of the R packages INLA (integrated nested Laplace approximation, Rue:2009; Rue:2017) and TMB (template model builder, Kristensen:2016), and of the GPstuff package Vanhatalo:2013. The idea is to marginalize out θ and then use standard inference techniques on φ.

We perform the Laplace approximation

π_G(θ | φ, y) := Normal(θ | θ̂, Σ̂) ≈ π(θ | φ, y),

where θ̂ matches the mode and Σ̂ the curvature of π(θ | φ, y). Then, the marginal posterior distribution is approximated as follows:

π(φ | y) ≈ π_G(φ | y) ∝ π(φ) π_G(y | φ),

where π_G(y | φ) is the approximate marginal likelihood implied by the Laplace approximation. Once we perform inference on φ, we can recover θ using the conditional distribution π_G(θ | φ, y), effectively marginalizing φ out. For certain models, this yields much faster inference than MCMC, while retaining comparable accuracy Rue:2009. Furthermore, the Laplace approximation as a marginalization scheme enjoys very good theoretical properties Tierney:1986.
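For reference, writing W = −∇²_θ log π(y | θ̂, φ) for the negative Hessian of the log likelihood at the mode, the approximate log marginal likelihood takes the standard form (Rasmussen:2006, chapter 3):

log π_G(y | φ) = log π(y | θ̂, φ) − (1/2) θ̂^T K^{-1} θ̂ − (1/2) log det(I + W^{1/2} K W^{1/2}),

and the approximate marginal posterior follows as π_G(φ | y) ∝ π(φ) π_G(y | φ). This is the quantity returned by the Newton solver in Algorithm 3 of the Appendix.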

In the R package INLA, approximate inference is performed on φ by characterizing π_G(φ | y) around its presumed mode. This works well in many cases but presents two limitations: the posterior must be well characterized in the neighborhood of the estimated mode, and φ must be low dimensional, “2–5, not more than 20” Rue:2017. In one of the examples we study, the posterior of φ is both high dimensional (∼6,000) and multimodal.

Hybrid methods.

Naturally, we can use a more flexible inference method on π_G(φ | y), such as standard MCMC, as discussed by Gomez:2018, or HMC, as proposed in GPstuff and TMB, the latter through its extension TMBStan and AdNuts (automatic differentiation with a No-U-Turn Sampler, Monnahan:2018). The target distribution of the HMC sampler is now π_G(φ | y) ∝ π(φ) π_G(y | φ).

To use HMC, we require the gradient of log π_G(y | φ) with respect to φ. Much care must be taken to ensure an efficient computation of this gradient. TMB and GPstuff exemplify two approaches to differentiating the approximate marginal density. The first uses automatic differentiation and the second adapts the algorithms in Rasmussen:2006. One of the main bottlenecks is differentiating the estimated mode, θ̂. In theory, it is straightforward to apply automatic differentiation by brute-force propagating derivatives through θ̂, i.e. sequentially differentiating the iterations of a numerical optimizer. But this approach, termed the direct method, is prohibitively expensive. A much faster alternative is to use the implicit function theorem (e.g. Bell:2008; Margossian:2019). Given any accurate numerical solver, we can always use the implicit function theorem to get derivatives, as notably done in the Stan Math Library Carpenter:2015 and in TMB’s inverse subset algorithm Kristensen:2016. One side effect is that the numerical optimizer is treated as a black box. By contrast, Rasmussen:2006 define a bespoke Newton method to compute θ̂, meaning we can store relevant variables from the final Newton step when computing derivatives. In our experience, this leads to important computational savings. But overall this method is much less flexible, working well only when φ is low dimensional and requiring the user to pass the tensor of derivatives, ∂K/∂φ.
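To sketch the implicit function theorem argument mentioned above (a standard derivation, not specific to any particular implementation): the mode θ̂(φ) is defined implicitly by the stationarity condition, so its Jacobian is available without differentiating through the optimizer's iterations,

∇_θ log π(y | θ̂, φ) − K^{-1}(φ) θ̂ = 0   ⟹   ∂θ̂/∂φ = (K^{-1} + W)^{-1} ∂/∂φ [ ∇_θ log π(y | θ, φ) − K^{-1}(φ) θ ],

where the right-hand derivative is taken with θ held fixed at θ̂ and W = −∇²_θ log π(y | θ̂, φ). The linear solve reuses the same matrix K^{-1} + W that the Newton iterations already factorize.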

2 Aim and results of the paper

We improve the computation of HMC with an embedded Laplace approximation. Our implementation accommodates any covariance matrix K, without requiring the user to specify ∂K/∂φ; efficiently differentiates log π_G(y | φ), even when φ is high dimensional; and deploys dynamic HMC to perform inference on φ. We introduce a novel adjoint method to differentiate log π_G(y | φ), build the algorithm in C++, and add it to the Stan language. Our approach combines the Newton solver of Rasmussen:2006 with a non-trivial application of automatic differentiation.

Equipped with this implementation, we test dynamic HMC with an embedded Laplace approximation on a range of models, including ones with a high dimensional and multimodal hyperparameter. We do so by benchmarking our implementation against Stan’s dynamic HMC, which runs MCMC on both the hyperparameter and the latent Gaussian variable. For the rest of the paper, we call this standard use of dynamic HMC full HMC. We refer to marginalizing out θ and using dynamic HMC on φ as the embedded Laplace approximation.

Our computer experiments identify cases where the benefits, as tested with our implementation, are substantial. In the case of a classic Gaussian process, with dim(φ) = 2 and dim(θ) = 100, we observe an important computational speed-up. We next study a general linear regression with a sparsity-inducing prior; this time dim(φ) ≈ 6,000 while dim(θ) ≈ 100. Full HMC struggles with the posterior’s geometry, as indicated by divergent transitions, and requires a model reparameterization and extensive tuning of the sampler. On the other hand, the embedded Laplace approximation evades many of the geometric problems and solves the approximate problem efficiently. We observe similar results for a sparse kernel interaction model, which looks at second-order interactions between covariates Agrawal:2019. Our results stand in contrast to the experiments presented in Monnahan:2018, who used a different method to automatically differentiate the Laplace approximation and reported only a minor speed-up. We do however note that the authors investigated different models than the ones we study here.

In all the studied cases, the likelihood is log-concave. Combined with a Gaussian prior, log-concavity guarantees that the conditional posterior π(θ | φ, y) is unimodal. Detailed analyses of the error introduced by the Laplace approximation for log-concave likelihoods can be found in the literature (e.g. Kuss:2005; Vanhatalo+Pietilainen+Vehtari:2010; Cseke:2011; Vehtari+etal:2016:loo_glvm) and are consistent with the results from our computer experiments.

3 Implementation for probabilistic programming

In order to run HMC, we need a function that returns the approximate log marginal density, log π_G(y | φ), and its gradient with respect to the hyperparameters, ∇_φ log π_G(y | φ). The user specifies the observations, y, and a function that generates the covariance matrix K, based on input covariates x and the hyperparameters φ. In the current prototype, the user picks the likelihood, π(y | θ, φ), from a set of options: for example, a Bernoulli distribution with a logit link. (More likelihoods can be implemented through a C++ class that specifies the first three derivatives of the log-likelihood.)

Standard implementations of the Laplace approximation use the algorithms in Rasmussen:2006 to compute (i) the mode θ̂ and the approximate log marginal density log π_G(y | φ), using a Newton solver; (ii) the gradient ∇_φ log π_G(y | φ) (Algorithm 1); and (iii) simulations from π_G(θ | φ, y). The major contribution of this paper is to construct a new differentiation algorithm, i.e. item (ii).

3.1 Using automatic differentiation in the algorithm of Rasmussen:2006

The main difficulty with Algorithm 1 from Rasmussen:2006 is the requirement for ∂K/∂φ_j at line 5, inside the for loop. For classic problems, where K is, for instance, a squared exponential kernel, the derivatives are available analytically. This is not the case in general and, in line with the paradigm of probabilistic programming, we want a method that does not require the user to specify the tensor of derivatives, ∂K/∂φ.

input: y, φ, π(y | θ, φ)
saved input from the Newton solver: θ̂, K, W^{1/2}, L, a
1:  R = W^{1/2} L^T \ (L \ W^{1/2})
2:  C = L \ (W^{1/2} K)
3:  s2 = −(1/2) diag( diag(K) − diag(C^T C) ) ∇³_θ log π(y | θ̂, φ)
4:  for j = 1, …, dim(φ)
5:      ∂K = ∂K/∂φ_j
6:      s1 = (1/2) a^T ∂K a − (1/2) tr(R ∂K)
7:      b = ∂K ∇_θ log π(y | θ̂, φ)
8:      s3 = b − K R b
9:      [∇_φ log π_G(y | φ)]_j = s1 + s2^T s3
10: end for
return: ∇_φ log π_G(y | φ)
Algorithm 1 Gradient of the approximate marginal density, log π_G(y | φ), with respect to the hyperparameters, φ, adapted from algorithm 5.1 by Rasmussen:2006. We store and reuse terms computed during the final Newton step, algorithm 3.1 in Rasmussen:2006.

Fortunately, automatic differentiation allows us to numerically evaluate ∂K/∂φ based on computer code that evaluates K(φ). To do this, we introduce the map

K : R^p → R^{n×n},   φ ↦ K(φ),

where p is the dimension of φ and n that of θ. To obtain the full tensor of derivatives, ∂K/∂φ, we require either p forward mode sweeps or one reverse mode sweep per element of K(φ). Given this scaling, we favor forward mode, and this works well when p is small. However, once p becomes large, this approach is spectacularly inefficient.

3.2 Adjoint method to differentiate the approximate log marginal density

To evaluate the gradient of a composite map, it is actually not necessary to compute the full Jacobian matrix of the intermediate operations. This is an important, if often overlooked, property of automatic differentiation and the driving principle behind adjoint methods (e.g. Errico:1997). This idea motivates an algorithm that does not explicitly construct ∂K/∂φ, a calculation that is both expensive and superfluous. Indeed, it suffices to evaluate the action of the correct cotangent, w, on the Jacobian of the map φ ↦ K(φ), an operation we can do in a single reverse mode sweep of automatic differentiation.
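Concretely, for any fixed matrix w with the same dimensions as K, a single reverse mode sweep through the code that evaluates K(φ) returns

∇_φ tr(w^T K(φ)),   whose j-th element is Σ_{k,l} w_{kl} ∂K_{kl}/∂φ_j,

so the entire gradient with respect to φ is available at the cost of one sweep — provided we can identify the right cotangent w. The next result does exactly that.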

Theorem 1

Let ψ = log π_G(y | φ) be the approximate log marginal density in the context of a latent Gaussian model. Let a be defined as in the Newton solver by Rasmussen:2006 (Algorithm 3 in the Appendix), and let R and s2 be defined as in Algorithm 1. Then

[∇ψ]_j = tr( w^T ∂K/∂φ_j ),   j = 1, …, dim(φ),

where the gradient is with respect to φ and

w = (1/2) a a^T − (1/2) R + (I − R K) s2 [∇_θ log π(y | θ̂, φ)]^T.

The proof follows from Algorithm 1 and noting that all the operations involving ∂K/∂φ_j are linear. We provide the details in the Supplementary Material. Armed with this result, we build Algorithm 2, a method that combines the insights of Rasmussen:2006 with the principles of adjoint methods.
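To sketch the argument (a condensed version of the derivation in the Supplementary Material): collecting the terms of Algorithm 1 that involve ∂K/∂φ_j, and using the symmetry of K and R, the j-th gradient element reads

[∇ψ]_j = (1/2) a^T (∂K/∂φ_j) a − (1/2) tr(R ∂K/∂φ_j) + s2^T (I − K R) (∂K/∂φ_j) ∇_θ log π(y | θ̂, φ) = tr( w^T ∂K/∂φ_j ),

which is exactly the quantity a reverse mode sweep initialized with the cotangent w evaluates.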

input: y, φ, π(y | θ, φ)
1:  Do lines 1 – 3 of Algorithm 1, reusing the variables saved from the final Newton step.
2:  Initiate an expression graph for automatic differentiation with φ and evaluate K = K(φ).
3:  w = (1/2) a a^T − (1/2) R + (I − R K) s2 [∇_θ log π(y | θ̂, φ)]^T
4:  Do a reverse sweep over K(φ), with w as the initial cotangent, to obtain ∇_φ log π_G(y | φ).
return: ∇_φ log π_G(y | φ)
Algorithm 2 Gradient of the approximate marginal log density, log π_G(y | φ), with respect to the hyperparameters, φ, using reverse mode automatic differentiation and Theorem 1.

Figure 1 shows the time required for one evaluation and differentiation of log π_G(y | φ) for the sparse kernel interaction model developed by Agrawal:2019 on simulated data. The covariance structure of this model is nontrivial, and analytical derivatives are not easily available. We simulate a range of data sets with varying dimensions, p, of φ. For low dimensions, the difference is small; however, at the largest dimension we test, Algorithm 2 is more than 100 times faster than Algorithm 1, requiring 0.009 s rather than 1.47 s.

4 Gaussian process with a Poisson likelihood

We fit the disease map of Finland by Vanhatalo+Pietilainen+Vehtari:2010, which models the mortality count across the country. The data is aggregated in counties. We use 100 counties, which allows us to fit the model quickly both with full HMC and with HMC using an embedded Laplace approximation. For the i-th region, we have a 2-dimensional coordinate x_i, the count of deaths y_i, and the standardized expected number of deaths, y_e,i. The full latent Gaussian model is

(α, ρ) ∼ π(α, ρ),   θ ∼ Normal(0, K(α, ρ)),   y_i ∼ Poisson( y_e,i e^{θ_i} ),

where K is a squared exponential kernel, α is the marginal standard deviation, and ρ the characteristic length scale. Hence φ = (α, ρ).

Fitting this model with MCMC requires running the Markov chains over α, ρ, and θ. Because the data is sparse — one observation per group — the posterior has a funnel shape, which can lead to biased MCMC estimates Neal:2003; Betancourt:2013. A useful diagnostic for identifying posterior shapes that challenge the HMC sampler is divergent transitions, which occur when there is significant numerical error in the computation of the Markov chain trajectory Betancourt:2018.

To remedy these issues, we reparameterize the model and adjust the target acceptance rate, δ_a. This parameter controls the precision of HMC, with the usual trade-off between accuracy and speed. For well-behaved problems, the optimal value is 0.8 Betancourt:2015, but posteriors with highly varying curvature require a higher value. Moreover, multiple attempts at fitting the model must be made before we correctly tune the sampler and remove all the divergent transitions. See the Supplementary Material for more details.

An immediate benefit of the embedded Laplace approximation is that we marginalize out θ and only run HMC on α and ρ, a two-dimensional and typically well-behaved parameter space. In the case of the disease map, we do not need to reparameterize the model, nor adjust δ_a.

We fit the models with both methods, using 4 chains, each with 500 warmup and 500 sampling iterations. A look at the marginal distributions of α, ρ, and the first two elements of θ suggests the posterior samples generated by full HMC and the embedded Laplace approximation are in close agreement (Figure 2). With a Poisson likelihood, the bias introduced by the Laplace approximation is small, as shown by Vanhatalo+Pietilainen+Vehtari:2010. We benchmark the Monte Carlo estimates of both methods against results from running 18,000 MCMC iterations. The embedded Laplace approximation yields comparable precision when estimating expectation values and is an order of magnitude faster (Figure 2). In addition, we do not need to tune the algorithm, and the MCMC warmup time is much shorter (10 seconds against 200 seconds for full HMC).

Figure 2: (Up) Posterior samples obtained with full HMC and the embedded Laplace approximation when fitting the disease map. (Down) Error when estimating the expectation value against wall time.

5 General linear regression model with a regularized horseshoe prior

Figure 3: Expectation value for the probability of developing prostate cancer, as estimated by full HMC and HMC using an embedded Laplace approximation.

Consider a regression model with n observations and p covariates. In the “p ≫ n” regime, we typically need additional structure, such as sparsity, for accurate inference. The horseshoe prior Carvalho+Polson+Scott:2010:HS is a useful prior when it is assumed that only a small portion of the regression coefficients are non-zero. Here we use the regularized horseshoe prior by Piironen:2017. The horseshoe prior is parameterized by a global scale term, the scalar τ, and local scale terms, one for each covariate, λ_j, j = 1, …, p. Consequently, the number of hyperparameters grows with the number of covariates, p.

To use the embedded Laplace approximation, we recast the regularized linear regression as a latent Gaussian model, as shown below. The benefit of the approximation is not a significant speedup, but rather an improved posterior geometry, due to marginalizing θ out. This means we do not need to reparameterize the model, nor fine-tune the sampler. To see this, we examine the microarray classification data set on prostate cancer used by Piironen:2017 and fit a regression model with a Bernoulli likelihood and a logit link. Here, n = 102 and p = 5,966.
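To make the recasting explicit (a standard identity for Gaussian priors on regression coefficients; X denotes the n × p design matrix and λ̃_j the regularized local scales of Piironen:2017, detailed in the Supplementary Material), marginalizing the coefficients β leaves a latent Gaussian model on the linear predictor θ = Xβ:

θ | φ ∼ Normal(0, K(φ)),   K(φ) = τ² X diag(λ̃₁², …, λ̃_p²) X^T,   y_i ∼ Bernoulli( logit^{-1}(θ_i) ),

with φ = (τ, c, λ) collecting the scale parameters (the intercept is omitted here for simplicity).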

We use 1,000 iterations to warm up the sampler and 12,000 sampling iterations. Tail quantiles, such as the 90th quantile of the local scales λ_j, allow us to identify parameters which undergo little local shrinkage and hence indicate relevant covariates. The large sample size is used to reduce the Monte Carlo error in our estimates of these extreme quantiles.

Fitting this model with full HMC requires a fair amount of work: the model must be reparameterized and the sampler carefully tuned, after multiple attempts at a fit. We use a non-centered parameterization, increase the target acceptance rate δ_a (after attempting lower values), and make some additional adjustments. Even then we obtain 13 divergent transitions over 12,000 sampling iterations. The Supplementary Material describes the tuning process in all its thorny details. By contrast, running the embedded Laplace approximation with Stan’s default tuning parameters produces 0 divergent transitions. Hence the approximate problem is efficiently solved by dynamic HMC.

Table 1 shows the covariates with the highest 90th quantiles for λ_j, which are softly selected by full HMC and the embedded Laplace approximation. Figure 3 compares the expected probability of developing cancer, as estimated by the two algorithms. Figure 4 compares the posterior samples and the error when estimating various quantities of interest, namely (i) the expectation values of the global scale, τ, and the slab parameter, c; and (ii) the 90th quantile of two local scales, λ_j. As a benchmark, we use estimates obtained from 98,000 MCMC iterations.

(full) HMC:     2586  1816  4960  4238  4843  3381
HMC + Laplace:  2586  1816  4960  4647  4238  3381
Table 1: Top six covariate indices, j, with the highest 90th quantiles of λ_j for the general linear model with a regularized horseshoe prior.
Figure 4: (Up) Posterior samples obtained with full HMC and HMC using an embedded Laplace approximation when fitting a general linear regression with a regularized horseshoe prior. (Down) Error when estimating various quantities of interest against wall time. E stands for “expectation” and Q90 for “90th quantile”.

The Laplace approximation yields slightly less extreme probabilities of developing cancer than the corresponding full model. This behavior is expected for latent Gaussian models with a Bernoulli observation model, and has been studied in the cases of Gaussian processes and Gaussian random Markov fields (e.g. Kuss:2005; Cseke:2011; Vehtari+etal:2016:loo_glvm). While introducing a bias, the embedded Laplace approximation yields accuracy comparable to full HMC when evaluating quantities of interest.

6 Sparse kernel interaction model

A natural extension of the general linear model is to include interaction terms. To achieve better computational scalability, we can use the kernel interaction trick by Agrawal:2019 and build a sparse kernel interaction model (SKIM), which also uses the regularized horseshoe prior by Piironen:2017. The model is an explicit latent Gaussian model and uses a non-trivial covariance matrix. The full details of the model are given in the Supplementary Material.

When fitting the SKIM to the prostate cancer data, we encounter similar challenges as in the previous section: 150 divergent transitions with full HMC when using Stan’s default tuning parameters. The behavior when adding the embedded Laplace approximation is much better, although there are still 3 divergent transitions (we do our preliminary runs using only 4,000 sampling iterations; these numbers are estimated for 12,000 sampling iterations, and the same holds for the estimated run times), which indicates that this problem remains quite difficult even after the approximate marginalization. We also find large differences in running time. The embedded Laplace approximation runs for 10 hours, while full HMC takes 20 hours with Stan’s default δ_a and 50 hours once we increase δ_a, making it difficult to tune the sampler and run our computer experiment; and, more generally, quite impractical for modelers.

Figure 5: (Up) Samples obtained with full HMC and HMC using an embedded Laplace approximation when fitting the SKIM. (Down) Error when estimating various quantities of interest against wall time. E stands for “expectation” and Q90 for “90th quantile”.
(full) HMC:     2586  2660  2679  2581  2620  2651
HMC + Laplace:  2586  2679  2660  2581  2620  2548
Table 2: Top six covariate indices, j, with the highest 90th quantiles of λ_j for the SKIM.

For computational convenience, we fit the SKIM using only 200 covariates, indexed 2500 – 2700, to encompass covariate 2586, which we found to be strongly explanatory. This allows us to easily tune full HMC without altering the takeaways of the experiment. Note that the data used here is different from the data we used in the previous section (since we only examine a subset of the covariates), and the marginal posteriors should therefore not be compared directly.

As above, we generate 12,000 posterior draws for each method. For full HMC, we obtain 36 divergent transitions with Stan’s default δ_a and 0 once we increase δ_a. The embedded Laplace approximation produces 0 divergences with the default tuning parameters. Table 2 shows the covariates which are softly selected; Figure 5 compares the posterior samples of both methods and the error over time, benchmarked against estimates from 98,000 MCMC iterations, for certain quantities of interest. We obtain comparable estimates and see overlap between the selected covariates, as we did in the previous section. The approximation however introduces a bias, which becomes more evident over longer runtimes.

7 Discussion

Equipped with a scalable and flexible differentiation algorithm, we expand the regime of models to which we can apply the embedded Laplace approximation. HMC allows us to perform inference even when φ is high dimensional and its posterior multimodal, provided the energy barrier between modes is not too strong. In the case where φ is low dimensional, the approximation also yields a dramatic speedup. When φ is high dimensional, marginalizing θ out can still improve the geometry of the posterior, saving the user time otherwise spent tuning the sampling algorithm. However, when the posterior is well-behaved, the approximation may not provide any benefit.

Our next step is to further develop the prototype for Stan. We are also aiming to incorporate features that allow for a high-performance implementation, as seen in the packages INLA, TMB, and GPstuff. Examples include support for the sparse matrices required to fit latent Markov random fields, as well as parallelization and GPU support. We also want to improve the flexibility of the method by allowing users to specify their own likelihood. In this respect, the implementation in TMB is exemplary. It is in principle possible to use automatic differentiation to obtain the required higher-order derivatives of the likelihood, and most libraries, including Stan, support this; but, along with feasibility, there is a question of efficiency and practicality (e.g. Betancourt:2018b). The added flexibility also burdens us with more robustly diagnosing errors induced by the approximation. There is extensive literature on log-concave likelihoods, but less so for general likelihoods. Future work will investigate diagnostics such as importance sampling Vehtari+etal:2019:psis, leave-one-out cross-validation Vehtari+etal:2016:loo_glvm, and simulation based calibration Talts:2018.

Acknowledgment

We thank Michael Betancourt, Steve Bronder, Alejandro Catalina, Rok Češnovar, Sam Power, and Sean Talts for helpful discussions.

CM thanks the Office of Naval Research, the National Science Foundation, the Institute for Education Sciences, and the Sloan Foundation. CM and AV thank the Academy of Finland (grants 298742 and 313122). DS thanks the Canada Research Chairs program and the Natural Sciences and Engineering Research Council of Canada. RA’s research was supported in part by a grant from DARPA.

We acknowledge computing resources from Columbia University’s Shared Research Computing Facility project, which is supported by NIH Research Facility Improvement Grant 1G20RR030893-01, and associated funds from the New York State Empire State Development, Division of Science Technology and Innovation (NYSTAR) Contract C090171, both awarded April 15, 2010.

We also acknowledge the computational resources provided by the Aalto Science-IT project.


Appendix (Supplementary Material)

We review the Newton solver proposed by Rasmussen:2006 and prove Theorem 1, the main result required to build an adjoint method for the embedded Laplace approximation. We next present our prototype code and provide details for the models used in our computer experiments.

Appendix A Newton solver for the embedded Laplace approximation

input: K, y, π(y | θ, φ)
1:  θ = θ₀           (initialization)
2:  repeat
3:      W = −∇²_θ log π(y | θ, φ)
4:      L = cholesky(I + W^{1/2} K W^{1/2})
5:      b = W θ + ∇_θ log π(y | θ, φ)
6:      a = b − W^{1/2} L^T \ (L \ (W^{1/2} K b))
7:      θ = K a
8:  until convergence
9:  log π_G(y | φ) = −(1/2) a^T θ + log π(y | θ, φ) − Σ_i log L_ii
return: θ̂ = θ, log π_G(y | φ)
Algorithm 3 Newton solver for the embedded Laplace approximation [Rasmussen:2006, chapter 3]

Algorithm 3 is a transcription of the Newton method by Rasmussen:2006 using our notation. As a convergence criterion, we use the change in the objective function between two iterations, requiring it to fall below a specified tolerance ε. This is consistent with the approach used in GPstuff Vanhatalo:2013. We store the following variables generated during the final Newton step to use them again when computing the gradient: θ̂, K, W^{1/2}, L, and a. This avoids redundant computation and spares us an expensive Cholesky decomposition.
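For completeness, the objective that Algorithm 3 maximizes and the update it implements are standard (Rasmussen:2006, chapter 3):

Ψ(θ) = log π(y | θ, φ) − (1/2) θ^T K^{-1} θ − (1/2) log det K − (n/2) log 2π,

θ^{new} = (K^{-1} + W)^{-1} ( W θ + ∇_θ log π(y | θ, φ) ),

where W = −∇²_θ log π(y | θ, φ). Lines 3 – 7 of Algorithm 3 evaluate this update using only the Cholesky factor L of I + W^{1/2} K W^{1/2}, which avoids explicitly inverting K.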

Appendix B Building the adjoint method

To compute the gradient of the approximate log marginal density with respect to the hyperparameters, ∇_φ log π_G(y | φ), we exploit several important principles of automatic differentiation. While widely used in statistics and machine learning, these principles remain arcane to many practitioners and deserve a brief review. We will then construct the adjoint method (Theorem 1 and Algorithm 2) as a correction to Algorithm 1.

b.1 Automatic differentiation

Given a composite map

f = f_L ∘ f_{L−1} ∘ … ∘ f_1,

the chain rule teaches us that the corresponding Jacobian matrix observes a similar decomposition:

J = J_L · J_{L−1} · … · J_1.

Based on computer code to calculate f, a sweep of forward mode automatic differentiation numerically evaluates the action of the Jacobian matrix on an initial tangent v, or directional derivative J · v. Extrapolating from the chain rule,

J · v = J_L · J_{L−1} · … · J_1 · v = J_L · v_{L−1},

where the v_l’s verify the recursion relationship

v_1 = J_1 · v,   v_l = J_l · v_{l−1}.

If our computation follows the steps outlined above, we never need to explicitly compute the full Jacobian matrix, J_l, of an intermediate function, f_l; rather, we only calculate a sequence of Jacobian-tangent products. Similarly, a reverse mode sweep evaluates the action of a transposed cotangent, w^T, on a Jacobian matrix, w^T · J, by computing a sequence of cotangent-Jacobian products.
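In particular, for a map f : R^p → R^m with Jacobian J, each forward sweep yields one Jacobian-tangent product and each reverse sweep one cotangent-Jacobian product:

forward:  v ↦ J v   (full J in p sweeps);   reverse:  w ↦ w^T J   (full J in m sweeps).

A scalar-valued composite (m = 1) is therefore differentiated with respect to all p inputs in a single reverse sweep, which is precisely the regime the embedded Laplace approximation exploits once the cotangent is known.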

Hence, in the case of the embedded Laplace approximation, where the map φ ↦ K(φ) is an intermediate function, we do not need to explicitly compute its Jacobian, ∂K/∂φ, but only the action of the appropriate cotangent on this Jacobian. This type of reasoning plays a key role when differentiating functionals of implicit functions – for example, probability densities that depend on solutions to ordinary differential equations – and leads to so-called adjoint methods [e.g. Errico:1997].

b.2 Derivation of the adjoint method

In this section we provide a proof of Theorem 1. As a starting point, assume Algorithm 1 is valid; its proof can be found in Rasmussen:2006. The key observation is that all operations performed on ∂K = ∂K/∂φ_j are linear. Algorithm 1 produces a map

∂K/∂φ_j ↦ [∇_φ log π_G(y | φ)]_j = (1/2) a^T ∂K a − (1/2) tr(R ∂K) + s2^T (I − K R) ∂K ∇_θ log π(y | θ̂, φ),

and constructs the gradient one element at a time. By linearity, the same element is recovered by evaluating the scalar

ζ = (1/2) a^T K a − (1/2) tr(R K) + s2^T (K − K₀ R K) ∇_θ log π(y | θ̂, φ),

where K = K(φ) is treated as a function of φ while a, R, s2, ∇_θ log π(y | θ̂, φ), and K₀ (the numerical value of K) are held fixed, and then differentiating ζ with respect to φ. Thus an alternative approach to compute the gradient is to calculate the scalar ζ and then use a single reverse mode sweep of automatic differentiation, noting that ζ is an analytical function of K. This produces Algorithm 4.

input: y, φ, π(y | θ, φ)
1:  Do lines 1 – 3 of Algorithm 1.
2:  Initiate an expression tree for automatic differentiation with φ and evaluate K = K(φ).
3:  ζ = (1/2) a^T K a − (1/2) tr(R K) + s2^T (K − K₀ R K) ∇_θ log π(y | θ̂, φ)
4:  Do a reverse sweep over ζ to obtain ∇_φ log π_G(y | φ).
return: ∇_φ log π_G(y | φ)
Algorithm 4 Gradient of the approximate marginal log density, log π_G(y | φ), with respect to the hyperparameters, φ, using reverse mode automatic differentiation

At this point, the most important work toward scalability is done: we no longer explicitly compute ∂K/∂φ and we use a single reverse mode sweep.

Automatic differentiation, for all its relatively cheap cost, still incurs some overhead. Hence, where possible, we still want to use analytical results to compute derivatives. In particular, we can analytically work out the cotangent w that initializes the reverse sweep, instead of letting automatic differentiation propagate through every term of ζ.

Write g = ∇_θ log π(y | θ̂, φ) and consider

ζ = (1/2) a^T K a − (1/2) tr(R K) + s2^T (K − K₀ R K) g,

where, unlike in Algorithm 2, the terms are computed using the autodiff variable K, not its fixed numerical value K₀; a, R, s2, g, and K₀ are maintained fixed, meaning we do not propagate derivatives through them. Differentiating each term with respect to the elements of K gives

∂/∂K [ (1/2) a^T K a ] = (1/2) a a^T,
∂/∂K [ −(1/2) tr(R K) ] = −(1/2) R,
∂/∂K [ s2^T (K − K₀ R K) g ] = s2 g^T − (K₀ R)^T s2 g^T = (I − R K₀) s2 g^T,

using the symmetry of R and K₀. Combining the three terms, we obtain

w = ∂ζ/∂K = (1/2) a a^T − (1/2) R + (I − R K) s2 [∇_θ log π(y | θ̂, φ)]^T,

as prescribed by Theorem 1. This result is general, in the sense that it applies to any covariance matrix, K, and likelihood, π(y | θ, φ). Our preliminary experiments on the SKIM found that incorporating the analytical cotangent, w, approximately doubles the differentiation speed.

Appendix C Computer code

The code used in this work is open source and detailed in this section.

c.1 Prototype Stan code

The Stan language allows users to specify the joint log density of their model. This is done by incrementing the variable target. We add a suite of functions which return the approximate log marginal density, log π_G(y | φ). Hence, the user can specify the log joint distribution by incrementing target with log π_G(y | φ) and the prior log π(φ). A call to the approximate marginal density may look as follows:

target +=  laplace_marginal_*(y, n, K, phi, x, delta, delta_int, theta0);
  

The * specifies the likelihood, for example Bernoulli or Poisson (these are the current options in the prototype we used in this article; the immediate next version also specifies the link function). y and n are sufficient statistics for the latent Gaussian variable, θ; K is a function that takes in the arguments phi, x, delta, and delta_int, and returns the covariance matrix; and theta0 is the initial guess for the Newton solver, which seeks the mode of π(θ | φ, y). Moreover:

  • y: a vector containing the sum of counts/successes for each element of θ

  • n: a vector with the number of observations for each element of θ

  • K: a function defined in the functions block, with the signature (vector, matrix, real[], int[]) ==> matrix

  • phi: the vector of hyperparameters

  • x: a matrix of data. For Gaussian processes, this is the coordinates, and for the general linear regression, the design matrix.

  • delta: additional real data.

  • delta_int: additional integer data.

  • theta0: a vector containing the initial guess for the Newton solver.

It is also possible to specify the tolerance of the Newton solver. This structure is consistent with other higher-order functions in Stan, such as the algebraic solver and the ordinary differential equation integrators. It gives users flexibility when specifying K, but we recognize it is cumbersome. One item on our to-do list is to use variadic arguments, which would remove the constraints on the signature of K and allow users to pass any combination of arguments to K through laplace_marginal_*.
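To illustrate how these pieces fit together, here is a minimal, hypothetical Stan program built around the prototype signature above. The covariance functor, the priors, the data names, and the suffix bernoulli are placeholders chosen for this sketch; only the argument structure follows the documentation above.

  functions {
    // Hypothetical covariance functor: a squared exponential kernel, with
    // phi = (alpha, rho) and one row of x per latent Gaussian variable.
    matrix K_functor(vector phi, matrix x, real[] delta, int[] delta_int) {
      int N = rows(x);
      real alpha2 = square(phi[1]);
      real rho2 = square(phi[2]);
      matrix[N, N] K;
      for (i in 1:N) {
        K[i, i] = alpha2 + 1e-8;   // jitter on the diagonal
        for (j in (i + 1):N) {
          K[i, j] = alpha2 * exp(-0.5 * squared_distance(x[i], x[j]) / rho2);
          K[j, i] = K[i, j];
        }
      }
      return K;
    }
  }
  data {
    int<lower=1> N;          // number of latent Gaussian variables
    vector[N] y;             // sums of successes per latent variable
    vector[N] n_obs;         // number of observations per latent variable
    matrix[N, 2] x;          // covariates (here, 2-D coordinates)
  }
  transformed data {
    real delta[0];                         // no additional real data
    int delta_int[0];                      // no additional integer data
    vector[N] theta0 = rep_vector(0, N);   // initial guess for the Newton solver
  }
  parameters {
    real<lower=0> alpha;
    real<lower=0> rho;
  }
  model {
    vector[2] phi = [alpha, rho]';
    alpha ~ normal(0, 2);    // placeholder priors on the hyperparameters
    rho ~ normal(0, 2);
    // Hypothetical Bernoulli-logit variant of the prototype function.
    target += laplace_marginal_bernoulli(y, n_obs, K_functor, phi, x,
                                         delta, delta_int, theta0);
  }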

For each likelihood, we implement a corresponding random number generating function, with a call

  theta =  laplace_marginal_*_rng(y, n, K, phi, x, delta, delta_int, theta0);
  

This generates a random sample from π_G(θ | φ, y). This function can be used in the generated quantities block and is called only once per iteration – in contrast with the target function, which is called and differentiated once per integration step of HMC. Moreover, the cost of generating θ is negligible next to the cost of evaluating and differentiating log π_G(y | φ) multiple times per iteration.
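Continuing the hypothetical sketch above, recovering the latent Gaussian variable then takes a single call in the generated quantities block (again, the function name is a placeholder following the _rng convention described here):

  generated quantities {
    vector[2] phi_gq = [alpha, rho]';
    vector[N] theta = laplace_marginal_bernoulli_rng(y, n_obs, K_functor, phi_gq, x,
                                                     delta, delta_int, theta0);
  }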

c.2 C++ code

We incorporate the Laplace suite of functions inside the Stan-math library, a C++ library for automatic differentiation [Carpenter:2015]. The library is open source and available on GitHub, https://github.com/stan-dev/math. Our prototype exists on the branch try-laplace_approximation. The code is structured around a main function

    laplace_approximation(likelihood, K_functor, phi, x, delta, delta_int, theta0);
  

with

  • likelihood: a class constructed using y and n, which returns the log density, as well as its first, second, and third order derivatives.

  • K_functor: a functor that computes the covariance matrix, K(φ)

  • …: the remaining arguments are as previously described.

A user can specify a new likelihood by creating the corresponding class, meaning the C++ code is expandable.

To expose the code to the Stan language, we use Stan’s new OCaml transpiler, stanc3, https://github.com/stan-dev/stanc3 and again the branch try-laplace_approximation.

Important note: the code is prototypical and currently not merged into Stan’s release or development branch.

c.3 Code for the computer experiment

The R code is available on the GitHub public repository, anonymized/laplace_manuscript.

We make use of two new prototype packages: CmdStanR (https://mc-stan.org/cmdstanr/) and posterior (https://github.com/jgabry/posterior).

Appendix D Tuning dynamic Hamiltonian Monte Carlo

In this article, we use the dynamic Hamiltonian Monte Carlo sampler described by Betancourt:2018 and implemented in Stan. This algorithm builds on the No-U-Turn Sampler by Hoffman:2014, which adaptively tunes the sampler during a warmup phase. Hence, for most problems, the user does not need to worry about tuning parameters. However, the models presented in this article are challenging, and the sampler requires careful tuning if we do not use the embedded Laplace approximation.

The main parameter we tweak is the target acceptance rate, δ_a. To run HMC, we need to numerically compute physical trajectories across the parameter space by solving the system of differential equations prescribed by Hamilton’s equations of motion. We do this using a numerical integrator. A small step size, ε, makes the integrator more precise but generates shorter trajectories, which leads to a less efficient exploration of the parameter space. When we introduce too much numerical error, the proposed trajectory is rejected. Adapt delta, δ_a, sets the target acceptance rate of proposed trajectories. During the warmup, the sampler adjusts ε to meet this target. For well-behaved problems, the optimal value of δ_a is 0.8 Betancourt:2015.

It should be noted that the algorithm does not necessarily achieve the target set by δ_a during the warmup. One approach to remedy this issue is to extend the warmup phase, specifically the final fast adaptation interval or term buffer [see Hoffman:2014, Stan:2020]. By default, the term buffer runs for 50 iterations (when running a warmup for 1,000 iterations). Still, making the term buffer longer does not guarantee the sampler attains the target δ_a. There exist other ways of tuning the algorithm, but at this point, the technical burden on the user is already significant. What is more, probing how well the tuning parameters work usually requires running the model for many iterations.

Appendix E Model details

We review the models used in our computer experiments and point the readers to the relevant references.

e.1 Disease map

The disease map uses a Gaussian process with a squared exponential kernel,

k(x_i, x_j) = α² exp( −||x_i − x_j||² / (2ρ²) ).

The full latent Gaussian model is

(α, ρ) ∼ π(α, ρ),   θ ∼ Normal(0, K(α, ρ)),   y_i ∼ Poisson( y_e,i e^{θ_i} ),

where we put an inverse-Gamma prior on α and ρ.

When using full HMC, we construct a Markov chain over the joint parameter space (α, ρ, θ). To avoid Neal’s infamous funnel Neal:2003 and improve the geometry of the posterior distribution, it is possible to use a non-centered parameterization:

z ∼ Normal(0, I),   θ = L z,   with L L^T = K(α, ρ).

The Markov chain now explores the joint space of (α, ρ, z), and the θ’s are generated by transforming the z’s. With the embedded Laplace approximation, the Markov chain only explores the joint space of (α, ρ).
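A minimal Stan sketch of this non-centered formulation for the disease map is given below. The inverse-Gamma prior parameters, the diagonal jitter, and the data-block names are placeholders; the code used for the experiments is in the repository listed in Appendix C.3.

  data {
    int<lower=1> n;
    int y[n];                   // death counts per county
    vector<lower=0>[n] ye;      // standardized expected number of deaths
    matrix[n, 2] x;             // spatial coordinates
  }
  transformed data {
    row_vector[2] coords[n];
    for (i in 1:n) coords[i] = x[i];
  }
  parameters {
    real<lower=0> alpha;        // marginal standard deviation
    real<lower=0> rho;          // characteristic length scale
    vector[n] z;                // auxiliary standard normal variables
  }
  transformed parameters {
    vector[n] theta;
    {
      matrix[n, n] K = cov_exp_quad(coords, alpha, rho);
      for (i in 1:n) K[i, i] += 1e-8;        // jitter for numerical stability
      theta = cholesky_decompose(K) * z;     // theta ~ Normal(0, K), non-centered
    }
  }
  model {
    alpha ~ inv_gamma(2, 2);    // placeholder inverse-Gamma priors
    rho ~ inv_gamma(2, 2);
    z ~ std_normal();
    y ~ poisson_log(log(ye) + theta);
  }

With the embedded Laplace approximation, the parameters block instead contains only alpha and rho, and the model block increments target with the approximate marginal density, as in Appendix C.1.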

e.2 Regularized horseshoe prior

The horseshoe prior [Carvalho+Polson+Scott:2010:HS] is a sparsity-inducing prior that introduces a global shrinkage parameter, τ, and a local shrinkage parameter, λ_j, for each covariate slope, β_j. This prior operates a soft variable selection, effectively favoring either β_j ≈ 0 or β_j ≈ β̂_j, where β̂_j is the maximum likelihood estimator. Piironen:2017 add another prior to regularize the unshrunk β_j’s, effectively operating a “soft-truncation” of the extreme tails.

e.2.1 Details on the prior

For computational stability, the model is parameterized using the regularized local scales λ̃, rather than λ, where

λ̃_j² = c² λ_j² / (c² + τ² λ_j²),

with c the slab scale. The hyperparameter is φ = (τ, c, λ), with the priors on the slab and local scales following Piironen:2017. The prior on λ independently applies to each element, λ_j.

Following the recommendation by Piironen:2017, we set the variables of the priors as follows. Let p be the number of covariates and n the number of observations. Additionally, let p₀ be the expected number of relevant covariates – note this number does not strictly enforce the number of unregularized β_j’s, because the priors have heavy enough tails that we can depart from p₀. For the prostate data, we set p₀ following Piironen:2017. Then the global scale is centered on

τ₀ = p₀ / (p − p₀) · σ / √n.

Next we construct the prior on τ,

τ ∼ Student-t⁺(0, τ₀),

a half-Student-t distribution with scale τ₀, where the degrees of freedom and the pseudo-variance σ are set following Piironen:2017.

e.2.2 Formulations of the data generating process

The data generating process is

β_j ∼ Normal(0, τ² λ̃_j²),   y_i ∼ Bernoulli( logit^{-1}(x_i^T β) ),

or, recasting it as a latent Gaussian model,

θ ∼ Normal(0, K),   K = τ² X diag(λ̃₁², …, λ̃_p²) X^T,   y_i ∼ Bernoulli( logit^{-1}(θ_i) ).

For full HMC, we use a non-centered parameterization of the first formulation, much like we did for the disease map. The embedded Laplace approximation requires the second formulation, which comes at the cost of evaluating and differentiating K. In this scenario, the main benefit of the Laplace approximation is not an immediate speed-up but an improved posterior geometry, due to marginalizing θ (and thus, implicitly, the regression coefficients) out. This means we do not need to fine-tune the sampler.

e.2.3 Fitting the model with full HMC

This section describes how to tune full dynamic HMC to fit the model at hand. Some of the details may be cumbersome to the reader. But the takeaway is simple: tuning the algorithm is hard and can be a real burden for the modeler.

Using a non-centered parameterization and with Stan’s default parameters, we obtain 150 divergent transitions (to be precise, we here did a preliminary run using 4,000 sampling iterations and obtained 50 divergent transitions, so an expected 150 over 12,000 sampling iterations). We increase the target acceptance rate to δ_a = 0.99 but find the sampler now produces 186 divergent transitions. A closer inspection reveals the divergences all come from a single chain, which also has a larger adapted step size, ε. The problematic chain also fails to achieve the target acceptance rate. These results are shown in Table 3. From this, it seems increasing δ_a yet again may not provide any benefits. Instead we increase the term buffer from 50 iterations to 350 iterations. With this setup, we however obtain divergent transitions across all chains.

Chain Step size Acceptance rate Divergences
1 0.0065 0.99 0
2 0.0084 0.90 186
3 0.0052 0.99 0
4 0.0061 0.99 0
Table 3: Adapted tuning parameters across 4 Markov chains with δ_a = 0.99.

This outcome indicates the chains are relatively unstable and emphasizes how difficult it is, for this type of model and data, to come up with the right tuning parameters. With a higher δ_a and the extended term buffer, we observe 13 divergent transitions. It is possible this result is the product of luck, rather than better tuning parameters. To be clear, we do not claim we found the optimal model parameterization and tuning parameters. There is, however, to our knowledge, no straightforward way to do so.

e.2.4 Fitting the model with the embedded Laplace approximation

Running the algorithm with Stan’s default tuning parameters produces 0 divergent transitions over 12,000 sampling iterations.

e.3 Sparse kernel interaction model

SKIM, developed by Agrawal:2019, extends the model of Piironen:2017 by accounting for pairwise interaction effects between covariates. The generative model shown below uses the notation in E.2 instead of that in Appendix D of Agrawal:2019:

where β_j and β_{jk} are the main and pairwise interaction effects for covariates j and k, respectively, and τ, λ, and c are defined in E.2.

Instead of sampling the main and pairwise effects directly, which requires storing and computing O(p²) terms per iteration, Agrawal:2019 marginalize out all the regression coefficients, only sampling φ via MCMC. Through a kernel trick and a Gaussian process re-parameterization of the model, this marginalization is performed at a much reduced cost. The Gaussian process covariance matrix induced by the SKIM is provided below:

where “∘” denotes the element-wise Hadamard product. Finally,