Automatic Reparameterisation of Probabilistic Programs

Maria I. Gorinova et al., Google, 06/07/2019

Probabilistic programming has emerged as a powerful paradigm in statistics, applied science, and machine learning: by decoupling modelling from inference, it promises to allow modellers to directly reason about the processes generating data. However, the performance of inference algorithms can be dramatically affected by the parameterisation used to express a model, requiring users to transform their programs in non-intuitive ways. We argue for automating these transformations, and demonstrate that mechanisms available in recent modelling frameworks can implement non-centring and related reparameterisations. This enables new inference algorithms, and we propose two: a simple approach using interleaved sampling and a novel variational formulation that searches over a continuous space of parameterisations. We show that these approaches enable robust inference across a range of models, and can yield more efficient samplers than the best fixed parameterisation.

1 Introduction

Reparameterising a probabilistic model means expressing it in terms of new variables defined by a bijective transformation of the original variables of interest. The reparameterised model expresses the same statistical assumptions as the original, but can have drastically different posterior geometry, with significant implications for both variational and sampling-based inference algorithms.

Non-centring is a particularly common form of reparameterisation in Bayesian hierarchical models. Consider a random variable z ∼ N(μ, σ); we say this is in centred parameterisation (CP). If we instead work with an auxiliary, standard normal variable z̃ ∼ N(0, 1), and obtain z by applying the transformation z = μ + σ z̃, we say the variable is in its non-centred parameterisation (NCP). Although the centred parameterisation is often more intuitive, non-centring can dramatically improve the performance of inference (Betancourt and Girolami, 2015). Neal's funnel (Figure 1(a)) provides a simple example: most Markov chain Monte Carlo (MCMC) algorithms have trouble sampling from the funnel due to the strong non-linear dependence between its latent variables. Non-centring the model removes this dependence, converting the funnel into a spherical Gaussian distribution.

(a) Centred (left) and non-centred (right) parameterisation.
(b) Model that generates variables z and x.
(c) The model in the context of log_prob_at_0.
Figure 1: Neal’s funnel (Neal, 2003): z ∼ N(0, 3), x ∼ N(0, e^{z/2}).

Bayesian practitioners are often advised to manually non-centre their models (Stan Development Team et al., 2016); however, this breaks the separation between modelling and inference and requires expressing the model in a potentially less intuitive form. Moreover, it requires the user to understand the concept of non-centring and to know a priori where in the model it might be appropriate. Because the best parameterisation for a given model may vary across datasets, even experts may need to find the optimal parameterisation by trial and error, burdening modellers and slowing the model development loop (Blei, 2014).

We propose that non-centring and similar reparameterisations be handled automatically by probabilistic programming systems. We demonstrate how such program transformations may be implemented using the effect handling mechanisms present in several modern deep probabilistic programming frameworks, and consider two inference algorithms enabled by automatic reparameterisation: interleaved Hamiltonian Monte Carlo (iHMC), which alternates HMC steps between centred and non-centred parameterisations, and a novel algorithm we call Variationally Inferred Parameterisation (VIP), which searches over a continuous space of reparameterisations that includes non-centring as a special case. (Code for these algorithms and experiments is available at https://github.com/mgorinova/autoreparam.) Experiments demonstrate that these strategies enable robust inference, performing at least as well as the best fixed parameterisation across a range of models, and sometimes better, without requiring a priori knowledge of the optimal parameterisation.

2 Related work

The value of non-centring is well-known to MCMC practitioners and researchers (Stan Development Team et al., 2016; Betancourt and Girolami, 2015), and can also lead to better variational fits in hierarchical models (Yao et al., 2018). However, the literature largely treats this as a modelling choice; Yao et al. (2018) observe that “there is no general rule to determine whether non-centred parameterisation is better than the centred one.” We are not aware of prior work that treats non-centring directly as a computational phenomenon to be exploited by inference systems.

Non-centred parameterisation of probabilistic models can be seen as analogous to the reparameterisation trick in stochastic optimisation (Kingma and Welling, 2013); both involve expressing a variable in terms of a diffeomorphic transformation from a “standardised” variable. In the context of probabilistic inference, these are complementary tools: the reparameterisation trick yields low-variance stochastic gradients of variational objectives, whereas non-centring changes the geometry of the posterior itself, leading to qualitatively different variational fits and MCMC trajectories.
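
To make the comparison concrete, the two constructions can be written side by side (the variational parameters ψ and the objective f below are generic notation for this illustration, not symbols from this paper):

Reparameterisation trick: ε ∼ N(0, 1), z = μ_ψ + σ_ψ ε, so that ∇_ψ E_{q_ψ}[f(z)] = E_ε[∇_ψ f(μ_ψ + σ_ψ ε)].
Non-centring: z̃ ∼ N(0, 1), z = μ + σ z̃, where μ and σ are quantities of the model itself (possibly other latent variables).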

In the context of Gibbs sampling, Papaspiliopoulos et al. (2007) introduce a family of partially non-centred parameterisations equivalent to those we use in VIP (described below) and show that this improves mixing in a spatial GLMM. Our current work can be viewed as an extension of this work that mechanically reparameterises user-provided models and automates the choice of parameterisation. Similarly, Yu and Meng (2011) proposed a Gibbs sampling scheme that interleaves steps in centred and non-centred parameterisations; our interleaved HMC algorithm can be viewed as an automated, gradient-based descendant of their scheme.

Recently, there has been work on accelerating MCMC inference through learned reparameterisation: Parno and Marzouk (2018) and Hoffman et al. (2019) run samplers in the image of a bijective map fitted to transform the target distribution approximately to an isotropic Gaussian. These may be viewed as ‘black-box’ methods that rely on learning the target geometry, potentially using highly expressive neural variational models, while we use probabilistic-program transformations to apply ‘white-box’ reparameterisations similar to those a modeller could in principle implement themselves. Because they exploit model structure, white-box approaches can correct pathologies such as those of Neal’s funnel (Figure 1(a)) directly, reliably, and at much lower cost (in parameters and inference overhead) than black-box models. White- and black-box reparameterisations are not mutually exclusive, and may have complementary advantages; combining them is likely a fruitful direction for improving inference in structured models.

3 Understanding the effects of parameterisation

Non-centring reparameterisation is not always optimal; its usefulness depends on properties of both the model and the observed data. In this section, we work with a simple hierarchical model for which we can derive the posterior analytically. Consider a simple realisation of a model discussed by Betancourt and Girolami (2015, Eq. 2): for a vector of N datapoints y = (y_1, …, y_N) and given constants σ and τ, a latent effect z is drawn from a Gaussian prior with scale τ centred at a group-level mean μ (itself given a Gaussian prior), and each y_n is observed with Gaussian noise of scale σ.

(a) Centred. (b) Non-centred. (c) The condition number as a function of the data’s strength.
Figure 2: Effects of reparameterising a simple model with known posterior.

In the non-centred model, z is defined in terms of μ and an auxiliary variable z̃, where z̃ is a standard Gaussian variable: z̃ ∼ N(0, 1), z = μ + τ z̃.

Figures 2(a) and 2(b) show the graphical models for the two parameterisations. In the non-centred case, the direct dependency between μ and z is substituted by a conditional dependency given the data y, which creates an “explaining away” effect. Intuitively, this means that the stronger the evidence is (large N and small noise scale σ), the stronger the dependency between μ and z̃ becomes, creating a poorly-conditioned posterior that may slow inference.

As the Gaussian distribution is self-conjugate, the posterior distribution in each case (centred or non-centred) is also a Gaussian distribution, and we can analytically inspect its covariance matrix Σ. To quantify the quality of the parameterisation in each case, we investigate the condition number of the posterior covariance matrix under the optimal diagonal preconditioner. This models the common practice (implemented as the default in tools such as PyMC3 and Stan and followed in our experiments) of sampling using a fitted diagonal preconditioner.
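
Concretely, for a posterior covariance Σ this is the standard preconditioned condition number; writing D for a positive diagonal preconditioner (notation introduced here for clarity):

κ* = min_D λ_max(D^{-1/2} Σ D^{-1/2}) / λ_min(D^{-1/2} Σ D^{-1/2}), where the minimum ranges over positive diagonal matrices D.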

Figure 2(c) shows the condition numbers κ_cp and κ_ncp for each parameterisation as a function of the strength of the evidence; the full derivation is in Appendix A. This figure confirms the intuition that the non-centred parameterisation is better suited to situations where the evidence is weak, while strong evidence calls for the centred parameterisation. In this example we can determine the optimal parameterisation exactly, since the model has only one variable that can be reparameterised and the posterior has a closed form. In more realistic settings, even experts cannot predict the optimal parameterisation for hierarchical models with many variables and groups of data, and the wrong choice can lead not just to poor conditioning but to heavy tails or other pathological geometry.

4 Reparameterising probabilistic programs

An advantage of probabilistic programming is that the program itself provides a structured model representation, and we can explore model reparameterisation through the lens of program transformations. In this paper, we focus on transforming generative probabilistic programs where the program represents a sampling process describing how the data was generated from some unknown latent variables. Most probabilistic programming languages (PPLs) provide some mechanism for transforming a generative process into an inference program; our automatic reparameterisation approach is applicable to PPLs that transform generative programs using effect handling. This includes modern deep PPLs such as Pyro (Uber AI Labs, 2017) and Edward2 (Tran et al., 2018).

4.1 Effect handling-based probabilistic programming

Consider a generative program, where running the program forward generates samples from the prior over latent variables and data. Effect handling-based PPLs treat generating a random variable within such a model as an effectful operation (an operation that is understood as having side effects) and provide ways for resolving this operation in the form of effect handlers, to allow for inference. For example, we often need to transform a statement that generates a random variable to a statement that evaluates some (log) density or mass function. We can implement this using an effect handler.

The handler log_prob_at_0 handles statements of the form x ∼ d. The meaning of such a statement is normally “sample a random variable from the distribution d and record its value in x”. However, when executed in the context of log_prob_at_0 (we write with log_prob_at_0 handle model), statements that construct random variables are handled by setting the value of the variable x to 0, then evaluating the log density (or mass) function of d at 0 and recording its value in a new (program) variable.

For example, consider the function implementing Neal’s funnel in Figure 1(b). When executed without any context, this function generates two random variables, z and x. When executed in the context of the log_prob_at_0 handler, it does not generate random variables; instead, it evaluates log N(0; 0, 3) and log N(0; 0, e^{0/2}) (Figure 1(c)).
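
The following sketch shows what the model function in Figure 1(b) might look like in Edward2; it is an illustration consistent with the funnel definition above, not necessarily the authors’ exact listing:

import tensorflow as tf
import edward2 as ed  # in older TFP releases: from tensorflow_probability import edward2 as ed

def neals_funnel():
  # Centred parameterisation: z ~ N(0, 3), x ~ N(0, exp(z / 2)).
  z = ed.Normal(loc=0., scale=3., name="z")
  x = ed.Normal(loc=0., scale=tf.exp(z / 2.), name="x")
  return x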

This approach can be extended to produce a function that corresponds to the log joint density (or mass) function of the latent variables of the model. In §§ B.1, we give the pseudo-code implementation of a function make_log_joint, which takes a model that generates latent variables z and generates and observes data y, and returns the function f(z) = log p(y, z). This is a core operation, as it transforms a generative model into a function proportional to the posterior density, which can be repeatedly evaluated and automatically differentiated to perform inference.
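
For instance, a minimal usage sketch (the keyword arguments correspond to the name fields of the model’s random variables; make_log_joint_fn is outlined in §§ B.1):

log_joint = make_log_joint_fn(neals_funnel)
# Evaluates log N(z = 0; 0, 3) + log N(x = 1; 0, exp(0 / 2)) for Neal's funnel.
value = log_joint(z=0., x=1.)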

More generally, effectful operations are operations that can have side effects, e.g. writing to a file. The programming languages literature formalises cases where impure behaviour arises from a set of effectful operations in terms of algebraic effects and their handlers (Plotkin and Power, 2001; Plotkin and Pretnar, 2009; Pretnar, 2015). Concrete implementations of effectful operations are given in the form of effect handlers, which (similarly to exception handlers) are responsible for resolving the operations. Effect handlers can be used as a powerful abstraction in probabilistic programming, as discussed previously by Moore and Gorinova (2018), and shown by both Pyro and Edward2.

4.2 Model reparameterisation using effect handlers

Once equipped with an effect handling-based PPL, we can easily construct handlers to perform many model transformations, including model reparameterisation.

Non-centring handler.

A non-centring handler can be used to non-centre all standardisable latent variables in a model. (We focus on Gaussian variables, but non-centring is broadly applicable, e.g. to the location-scale family and to random variables that can be expressed as a bijective transformation of a “standardised” variable.) The handler simply applies to statements of the form z ∼ N(μ, σ), where z is not a data variable, and transforms them to z̃ ∼ N(0, 1), z = μ + σ z̃. When nested within a log_prob handler (like the one from §§ 4.1), log_prob handles the transformed standard normal statement z̃ ∼ N(0, 1). Thus, make_log_joint applied to a model in the ncp context returns the log joint function of the transformed variables z̃ rather than the original variables z.

For example, make_log_joint(neals_funnel) corresponds to f(z, x) = log N(z; 0, 3) + log N(x; 0, e^{z/2}). But make_log_joint(ncp(neals_funnel)) gives the function f(z̃, x̃) = log N(z̃; 0, 1) + log N(x̃; 0, 1), where z = 3 z̃ and x = e^{z/2} x̃.

This approach can easily be extended to other parameterisations, including partially centred parameterisations (as shown later in §§ 5.2), non-centring and whitening multivariate Gaussians, and transforming constrained variables to have unbounded support.

Edward2 implementation.

We implement reparameterisation handlers in Edward2, a deep PPL embedded in Python and TensorFlow (Tran et al., 2018). A model in Edward2 is a Python function that generates random variables. At the core of Edward2 is a special case of effect handling called interception. To obtain the joint density of a model, the language provides the function make_log_joint_fn(model), which uses a log_prob interceptor (handler) as previously described.

We extend the usage of interception to treat sample statements in one parameterisation as sample statements in another parameterisation (similarly to the ncp handler above):

def noncentring_interceptor(rv_constructor, **rv_kwargs):
 # Assumes rv_constructor is in the location-scale family
 name = rv_kwargs["name"] + "_std"
 rv_std = ed.interceptable(rv_constructor)(loc=0., scale=1., name=name)
 return rv_kwargs["loc"] + rv_kwargs["scale"] * rv_std

Wrapping the constructor with ed.interceptable ensures that we can nest this interceptor in the context of other interceptors.

We use the interceptor by executing a model of interest within the interceptor’s context (using Python’s context managers). This overrides each random variable’s constructor to construct a variable with location 0 and scale 1, and to scale and shift that variable appropriately:

with ed.interception(noncentring_interceptor): neals_funnel()

We present and explain in more detail all interceptors used for this work in Appendix B.

5 Automatic model reparameterisation

We introduce two inference strategies that exploit automatic reparameterisation: interleaved Hamiltonian Monte Carlo (iHMC), and the Variationally Inferred Parameterisation (VIP).

5.1 Interleaved Hamiltonian Monte Carlo

Automatic reparameterisation opens up the possibility of algorithms that exploit multiple parameterisations of a single model. We consider interleaved Hamiltonian Monte Carlo (iHMC), which uses two HMC steps to produce each sample from the target distribution: the first step is made in CP, using the original model latent variables, while the second step is made in NCP, using the auxiliary standardised variables. Interleaving MCMC kernels across parameterisations has been explored in previous work on Gibbs sampling (Yu and Meng, 2011; Kastner and Frühwirth-Schnatter, 2014), which demonstrated that CP and NCP steps can be combined to achieve more robust and performant samplers. Our contribution is to make the interleaving automatic and model-agnostic: instead of requiring the user to write multiple versions of their model and a custom inference algorithm, we implement iHMC as a black-box inference algorithm for centred Edward2 models.

Algorithm 1: Interleaved Hamiltonian Monte Carlo. Require: data y; a centred model. Ensure: samples from the posterior.

Algorithm 2: Variationally Inferred Parameterisation. Require: data y; a centred model. Ensure: samples from the posterior.

Algorithm 1 outlines iHMC. It takes a single centred model that defines latent variables z and generates data y. It uses the function make_ncp to automatically obtain a non-centred version of the model, which defines auxiliary variables z̃ and a function f such that z = f(z̃).
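
The interleaving logic can be sketched as follows; hmc_step (a single HMC transition targeting a given log density) and the helper names cp_log_joint, ncp_log_joint, f, and f_inv are placeholders for this illustration, not the paper’s exact interfaces:

def interleaved_hmc(cp_log_joint, ncp_log_joint, f, f_inv, z_init, num_samples):
  # cp_log_joint(z): log joint of the centred model.
  # ncp_log_joint(z_tilde): log joint of the automatically non-centred model.
  # f / f_inv: map auxiliary variables to model variables and back (z = f(z_tilde)).
  samples = []
  z = z_init
  for _ in range(num_samples):
    z = hmc_step(cp_log_joint, z)                # one HMC step in the centred space
    z_tilde = f_inv(z)                           # move to the auxiliary (non-centred) space
    z_tilde = hmc_step(ncp_log_joint, z_tilde)   # one HMC step in the non-centred space
    z = f(z_tilde)                               # return to the original variables
    samples.append(z)
  return samples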

5.2 Variationally inferred parameterisation

(a) Different parameterisations of the funnel, with mean-field normal variational fit (overlaid in white).
(b) Alternative view as implicit variational distributions (overlaid in white) on the original space.
Figure 3: Neal’s funnel: z ∼ N(0, 3), x ∼ N(0, e^{z/2}), with mean-field normal variational fit overlaid.

The best parameterisation for a given model may mix centred and non-centred representations for different variables. To efficiently search the space of reparameterisations, we propose the variationally inferred parameterisation (VIP) algorithm, which selects a parameterisation by gradient-based optimisation of a differentiable variational objective. VIP can be used as a pre-processing step to another inference algorithm; as it only changes the parameterisation of the model, MCMC methods applied to the learned parameterisation maintain their asymptotic guarantees.

Consider a model with latent variables z_1, …, z_D, where z_i ∼ N(μ_i, σ_i) and μ_i, σ_i may depend on other variables. We introduce a parameterisation parameter λ_i ∈ [0, 1] for each variable z_i, and transform the model by defining z̃_i ∼ N(λ_i μ_i, σ_i^{λ_i}) and z_i = μ_i + σ_i^{1−λ_i} (z̃_i − λ_i μ_i). This defines a continuous relaxation that includes NCP as the special case λ_i = 0 and CP as λ_i = 1. More generally, it supports a combinatorially large class of per-variable and partial centrings. We aim to choose the parameterisation λ under which the posterior p(z̃ ∣ y; λ) is “most like” an independent normal distribution.

A natural objective to minimise is KL(q(z̃; ψ) ∥ p(z̃ ∣ y; λ)), where q is an independent normal model with variational parameters ψ. Minimising this divergence corresponds to maximising a variational lower bound, the ELBO (Bishop, 2006):

ELBO(λ, ψ) = E_{q(z̃; ψ)}[ log p(y, z̃; λ) − log q(z̃; ψ) ].

Note that the auxiliary parameters λ are not statistically identifiable: the marginal likelihood p(y; λ) is constant with respect to λ. However, the computational properties of the reparameterised models differ, which is what the variational bound selects for. Our key hypothesis (which the results in Figure 5 seem to support) is that diagonal-normal approximability is a good proxy for MCMC sampling efficiency.

To search for a good model reparameterisation, we optimise the ELBO jointly over (λ, ψ) using stochastic gradients, simultaneously fitting the variational distribution to the posterior and optimising the shape of that posterior. Figure 3(a) provides a visual example: an independent normal variational distribution is a poor fit to the pathological geometry of a centred Neal’s funnel, but non-centring leads to a well-conditioned posterior, where the variational distribution is a perfect fit. In general settings where the reparameterised model is not exactly Gaussian, sampling-based inference can be used to refine the posterior; we apply VIP as a preprocessing step for HMC (summarised in Algorithm 2). Both the reparameterisation and the construction of the variational model are implemented as automatic program transformations using Edward2’s interceptors.
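
A minimal TF2-style sketch of this joint optimisation, assuming a hypothetical function vip_log_joint(z_tilde, lam) that returns the reparameterised log joint log p(y, z̃; λ) for a batch of flattened latent vectors (the function name and interface are illustrative, not part of the paper’s API):

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def fit_vip(vip_log_joint, num_latents, num_steps=3000, num_mc_samples=16):
  # Unconstrained parameterisation parameters: lambda = sigmoid(lam_raw) in [0, 1].
  lam_raw = tf.Variable(tf.zeros(num_latents))
  # Mean-field normal variational parameters psi = (m, log_s).
  m = tf.Variable(tf.zeros(num_latents))
  log_s = tf.Variable(tf.zeros(num_latents))
  opt = tf.optimizers.Adam(0.1)

  for _ in range(num_steps):
    with tf.GradientTape() as tape:
      lam = tf.nn.sigmoid(lam_raw)
      q = tfd.Normal(loc=m, scale=tf.exp(log_s))
      z_tilde = q.sample(num_mc_samples)  # reparameterised samples, shape [S, D]
      elbo = tf.reduce_mean(
          vip_log_joint(z_tilde, lam)
          - tf.reduce_sum(q.log_prob(z_tilde), axis=-1))
      loss = -elbo
    grads = tape.gradient(loss, [lam_raw, m, log_s])
    opt.apply_gradients(zip(grads, [lam_raw, m, log_s]))

  return tf.nn.sigmoid(lam_raw), m, tf.exp(log_s)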

An alternate interpretation of VIP is that it expands a variational family to a more expressive family capable of representing prior dependence. Letting z = f_λ(z̃) represent the partial centring transformation, an independent normal family on the transformed model corresponds to an implicit posterior approximation on the original model variables. Under this interpretation, the λ are variational parameters that serve to add freedom to the variational family, allowing it to interpolate from independent normal (at λ = 1, Figure 3(b) left) to a representation that captures the exact prior dependence structure of the model (at λ = 0, Figure 3(b) right).

6 Experiments

Figure 4: Effective sample size and 95% confidence intervals for the radon model across US states.

We evaluate our proposed approaches by using Hamiltonian Monte Carlo to sample from the posterior of hierarchical Bayesian models on several datasets:

Eight schools (Rubin, 1981): estimating the treatment effects θ_i of a course taught at each of eight schools, given test scores y_i and standard errors σ_i; each effect is drawn from a shared normal prior, θ_i ∼ N(μ, τ), and the scores are observed as y_i ∼ N(θ_i, σ_i). (An Edward2 sketch of this model follows the list below.)

Radon (Gelman and Hill, 2006): hierarchical linear regression, in which the radon level in a home is modelled as a function of an (unobserved) county-level effect, the county’s uranium reading, and the number of floors in the home.

German credit (Dua and Graff, 2017): logistic regression with a hierarchical prior on the coefficient scales.

Election ’88 (Gelman and Hill, 2006): a logistic model of 1988 US presidential election outcomes by county, given demographic covariates and state-level effects.

Electric Company (Gelman and Hill, 2006): paired causal analysis of the effect of viewing an educational TV show on classrooms across several grades. The classrooms were divided into pairs, and one class in each pair was treated at random.
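
As referenced above, a sketch of the eight schools model as a centred Edward2 program; the specific priors on the shared mean and scale are illustrative assumptions, not necessarily the ones used in the experiments:

import tensorflow as tf
import edward2 as ed  # in older TFP releases: from tensorflow_probability import edward2 as ed

def eight_schools(sigma):  # sigma: length-8 vector of per-school standard errors
  mu = ed.Normal(loc=0., scale=5., name="mu")            # shared mean (illustrative prior)
  log_tau = ed.Normal(loc=0., scale=5., name="log_tau")  # shared log-scale (illustrative prior)
  theta = ed.Normal(loc=mu * tf.ones(8),                 # centred school effects: theta_i ~ N(mu, tau)
                    scale=tf.exp(log_tau), name="theta")
  y = ed.Normal(loc=theta, scale=sigma, name="y")        # observed scores: y_i ~ N(theta_i, sigma_i)
  return y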

6.1 Algorithms and experimental details

For each model and dataset, we compare our methods, interleaved HMC (iHMC) and VIP-HMC, with baselines of running HMC on either fully centred (CP-HMC) or fully non-centred (NCP-HMC) models. We initialise each HMC chain with samples from an independent Gaussian variational posterior, and use the posterior scales as a diagonal preconditioner; for VIP-HMC this variational optimisation also includes the parameterisation parameters λ. All variational optimisations were run for the same number of steps, so they were a fixed cost across all methods except iHMC (which depends on preconditioners for both the centred and non-centred transition kernels). The HMC step size and number of leapfrog steps were tuned following the procedures described in Appendix C.

We report the average effective sample size per gradient evaluation, with standard errors computed across independent chains. We use gradient evaluations, rather than wallclock time, as they are the dominant operation in both HMC and VI and are easier to measure reliably; in practice, the wallclock times we observed per gradient evaluation did not differ significantly between methods.

Full details on the set up of the experiments can be found in Appendix C.

6.2 Results

Figures 4 and 5 show the results of the experiments. In most cases, either the centred or non-centred parameterisation works well, while the other does not. An exception is the German credit dataset, where both CP-HMC and NCP-HMC give a small ESS.

iHMC.

Across the datasets in both figures, we see that iHMC is a robust alternative to CP-HMC and NCP-HMC. Its performance is always within a factor of two of the best of CP-HMC and NCP-HMC, and sometimes better. In addition to being robust, iHMC can sometimes navigate the posterior more efficiently than either CP-HMC or NCP-HMC can: in the case of German credit, it performs better than both.

VIP.

Performance of VIP-HMC is typically as good as the better of CP-HMC and NCP-HMC, and sometimes better. On the German credit dataset, it achieves an effective sampling rate more than three times that of CP-HMC and NCP-HMC, and significantly better than iHMC. Figure 5 shows the correspondence between the optimised mean-field ELBO and the effective sampling rate. This result supports the ELBO as a reasonable predictor of the conditioning of a model, and further confirms the validity of VIP as a strategy for automatic reparameterisation. Finally, we show some of the parameterisations that VIP finds in Figure 6. VIP’s behaviour appears reasonable: for most datasets we looked at, VIP finds the “correct” global parameterisation, with most parameterisation parameters set to either 0 or 1 (Figure 6, left). In the cases where a global parameterisation is not optimal (e.g. radon MO, radon PA and, most notably, German credit), VIP finds a mixed parameterisation, combining centred, non-centred, and partially centred variables (Figure 6, centre and right).

Figure 6: A heat map of VIP parameterisations. Light regions correspond to CP and dark regions to NCP.

7 Discussion

Our results demonstrate that automated reparameterisation of probabilistic models is practical, and enables inference algorithms that can in some cases find parameterisations even better than those a human could realistically express. These techniques allow modellers to focus on expressing statistical assumptions, leaving computation to the computer. We view the methods in this paper as exciting proofs of concept, and hope that they will inspire additional work in this space.

While we focus on reparameterising hierarchical models naturally written in centred form, the inverse transformation—detecting and exploiting implicit hierarchical structure in models expressed as algebraic equations—is an important area of future work. This may be compatible with recent trends exploring the use of symbolic algebra systems in PPL runtimes (Narayanan et al., 2016; Hoffman et al., 2018). We also see promise in automating reparameterisations of heavy-tailed and multivariate distributions, and in designing new inference algorithms to exploit these capabilities.

References

  • Andrieu and Thoms (2008) Andrieu, C. and Thoms, J. (2008). A tutorial on adaptive MCMC. Statistics and computing, 18(4):343–373.
  • Betancourt and Girolami (2015) Betancourt, M. and Girolami, M. (2015). Hamiltonian Monte Carlo for hierarchical models. Current trends in Bayesian methodology with applications, 79:30.
  • Bishop (2006) Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
  • Blei (2014) Blei, D. M. (2014). Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application, 1:203–232.
  • Dua and Graff (2017) Dua, D. and Graff, C. (2017). UCI machine learning repository.
  • Gelman and Hill (2006) Gelman, A. and Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge university press.
  • Hoffman et al. (2019) Hoffman, M., Sountsov, P., Dillon, J. V., Langmore, I., Tran, D., and Vasudevan, S. (2019). NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. arXiv preprint arXiv:1903.03704.
  • Hoffman et al. (2018) Hoffman, M. D., Johnson, M., and Tran, D. (2018). Autoconj: Recognizing and exploiting conjugacy without a domain-specific language. In Neural Information Processing Systems.
  • Kastner and Frühwirth-Schnatter (2014) Kastner, G. and Frühwirth-Schnatter, S. (2014). Ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC estimation of stochastic volatility models. Computational Statistics & Data Analysis, 76:408–423.
  • Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kingma and Welling (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • Moore and Gorinova (2018) Moore, D. and Gorinova, M. I. (2018). Effect handling for composable program transformations in Edward2. International Conference on Probabilistic Programming.
  • Narayanan et al. (2016) Narayanan, P., Carette, J., Romano, W., Shan, C., and Zinkov, R. (2016). Probabilistic inference by program transformation in Hakaru (system description). In International Symposium on Functional and Logic Programming, FLOPS 2016, Kochi, Japan, March 4-6, 2016, Proceedings, pages 62–79. Springer.
  • Neal (2003) Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31(3):705–741.
  • Papaspiliopoulos et al. (2007) Papaspiliopoulos, O., Roberts, G. O., and Sköld, M. (2007). A general framework for the parametrization of hierarchical models. Statistical Science, pages 59–73.
  • Parno and Marzouk (2018) Parno, M. D. and Marzouk, Y. M. (2018). Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682.
  • Plotkin and Power (2001) Plotkin, G. and Power, J. (2001). Adequacy for algebraic effects. In Honsell, F. and Miculan, M., editors, Foundations of Software Science and Computation Structures, pages 1–24, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Plotkin and Pretnar (2009) Plotkin, G. and Pretnar, M. (2009). Handlers of algebraic effects. In Castagna, G., editor, Programming Languages and Systems, pages 80–94, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Pretnar (2015) Pretnar, M. (2015). An introduction to algebraic effects and handlers. Invited tutorial paper. Electronic Notes in Theoretical Computer Science, 319:19 – 35. The 31st Conference on the Mathematical Foundations of Programming Semantics (MFPS XXXI).
  • Rubin (1981) Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6(4):377–401.
  • Stan Development Team et al. (2016) Stan Development Team et al. (2016). Stan modelling language users guide and reference manual. Technical report. https://mc-stan.org/docs/2_19/stan-users-guide/.
  • Tran et al. (2018) Tran, D., Hoffman, M. D., Vasudevan, S., Suter, C., Moore, D., Radul, A., Johnson, M., and Saurous, R. A. (2018). Simple, distributed, and accelerated probabilistic programming. Advances in Neural Information Processing Systems.
  • Uber AI Labs (2017) Uber AI Labs (2017). Pyro: A deep probabilistic programming language. http://pyro.ai/.
  • Yao et al. (2018) Yao, Y., Vehtari, A., Simpson, D., and Gelman, A. (2018). Yes, but did it work?: Evaluating variational inference. arXiv preprint arXiv:1802.02538.
  • Yu and Meng (2011) Yu, Y. and Meng, X.-L. (2011). To center or not to center: That is not the question—an ancillarity–sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531–570.

Appendix A Derivation of the condition number of the posterior for a simple model

We consider the centred and non-centred parameterisations of the model defined in Section 3. As the Gaussian distribution is self-conjugate, the posterior distribution (given y) in each case (centred or non-centred) is also a Gaussian distribution, whose shape is entirely specified by its covariance matrix Σ. To quantify the quality of each parameterisation, we investigate the condition number of the posterior covariance matrix in each case under the best diagonal preconditioner.

We do this in three steps:

  1. We derive the covariance matrices Σ_cp and Σ_ncp of the centred and non-centred posteriors (Equation 1 and Equation 2).

  2. We find the best diagonal preconditioners D_cp and D_ncp: for each covariance matrix, the diagonal matrix that minimises the ratio λ_max/λ_min of the eigenvalues of the preconditioned covariance (Equation 3 and Equation 4).

  3. We compare the condition numbers κ_cp and κ_ncp, each given by the ratio of the eigenvalues of the corresponding optimally preconditioned covariance matrix (Equation 5 and Equation 6).

A.1 Deriving Σ_cp: centred parameterisation

Standard Gaussian conjugacy calculations for the centred model give the posterior covariance matrix:

(1)

A.2 Deriving Σ_ncp: non-centred parameterisation

An analogous calculation for the non-centred model gives the posterior covariance matrix:

(2)

A.3 The best diagonal preconditioner

Consider a diagonal preconditioner D. The best diagonal preconditioner for a covariance matrix Σ is the one that minimises the condition number of the preconditioned matrix D^{-1/2} Σ D^{-1/2}. In the centred case, we write the characteristic equation of the preconditioned covariance, simplify it, and minimise the ratio of its two roots with respect to the free diagonal entry of D. Expanding and simplifying yields the best preconditioner for the centred parameterisation:

(3)

We obtain the best diagonal preconditioner for the non-centred parameterisation in a similar manner, finally getting:

(4)

A.4 The condition numbers κ_cp and κ_ncp

Finally, we substitute the optimal preconditioners into the respective eigenvalue equations to derive the condition number in each case:

(5)
(6)

Appendix B Interceptors

Interceptors can be used as a powerful abstraction in probabilistic programming systems, as discussed previously by Moore and Gorinova (2018) and demonstrated by both Pyro and Edward2. In particular, we can use interceptors to automatically reparameterise a model, as well as to specify variational families. In this section, we show Edward2 pseudo-code for the interceptors used to implement iHMC and VIP-HMC.

B.1 Make log joint

The following code is an outline of Edward2’s implementation of a function that evaluates the log joint density at given values of the model’s variables:

def make_log_joint_fn(model):
  def log_joint_fn(**kwargs):
    log_prob = 0.
    def log_prob_interceptor(rv_constructor, **rv_kwargs):
      # Overrides a random variable's `value` and accumulates its log prob.
      nonlocal log_prob
      rv_name = rv_kwargs.get("name")
      rv_kwargs["value"] = kwargs.get(rv_name)
      rv = rv_constructor(**rv_kwargs)
      log_prob = log_prob + rv.distribution.log_prob(rv.value)
      return rv
    with ed.interception(log_prob_interceptor):
      model()
    return log_prob
  return log_joint_fn

By executing the model function in the context of log_prob_interceptor, we override each sample statement (a call to a random variable constructor rv_constructor) to generate a variable that takes on the value provided in the arguments of log_joint_fn. As a side effect, we also accumulate the result of evaluating each variable’s prior density at the provided value, which, by the chain rule of probability, gives us the log joint density.

B.2 Non-centred Parameterisation Interceptor

By intercepting every construction of a normal variable (or, more generally, of location-scale family variables), we can create a standard normal variable instead, and scale and shift appropriately.

def ncp_interceptor(rv_constructor, **rv_kwargs):
  # Assumes rv_constructor is in the location-scale family
  name = rv_kwargs["name"] + "_std"
  rv_std = ed.interceptable(rv_constructor)(loc=0., scale=1., name=name)
  return rv_kwargs["loc"] + rv_kwargs["scale"] * rv_std

Wrapping the constructor with ed.interceptable ensures that we can nest this interceptor in the context of other interceptors.

Running a model that declares random variables z in the context of ncp_interceptor will declare a new set of standard normal random variables z̃. Nesting this in the context of the log_prob_interceptor from §§ B.1 will then evaluate the log joint density of the standardised variables.

For example, going back to Neal’s funnel, running

with ed.interception(log_prob_interceptor):
 neals_funnel()

corresponds to evaluating log p(z, x), while running

with ed.interception(log_prob_interceptor):
 with ed.interception(ncp_interceptor):
  neals_funnel()

corresponds to evaluating log p(z̃, x̃), the joint density under the non-centred parameterisation.

b.3 VIP Interceptor

The VIP interceptor is similar to the NCP interceptor. The notable difference is that it creates new learnable Tensorflow variables, which correspond to the parameterisation parameters :

def vip_interceptor(rv_constructor, **rv_kwargs):
  name = rv_kwargs["name"] + "_vip"
  rv_loc = rv_kwargs["loc"]
  rv_scale = rv_kwargs["scale"]
  # Parameterisation parameter a = lambda, constrained to [0, 1] via a sigmoid.
  a = tf.nn.sigmoid(tf.get_variable(
      name + "_a_unconstrained",
      initializer=tf.zeros_like(rv_loc)))
  # Partially centred auxiliary variable: N(a * loc, scale ** a).
  rv_vip = ed.interceptable(rv_constructor)(
      loc=a * rv_loc, scale=rv_scale ** a, name=name)
  return rv_loc + rv_scale ** (1 - a) * (rv_vip - a * rv_loc)
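
As with the NCP interceptor, nesting this interceptor inside the log-probability interceptor yields the log joint of the partially centred variables; a usage sketch mirroring §§ B.2:

with ed.interception(log_prob_interceptor):
 with ed.interception(vip_interceptor):
  neals_funnel()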

b.4 Mean-field Variational Model Interceptor

Finally, we show a mean-field variational family interceptor, which we use both to tune the step sizes for HMC (see Appendix C) and to make use of VIP automatically. The mfvi_interceptor simply substitutes each sample statement with sampling from a normal distribution with parameters specified by some fresh variational parameters μ and σ:

def mfvi_interceptor(rv_constructor, **rv_kwargs):
  name = rv_kwargs["name"] + "_q"
  # Fresh variational parameters for this variable (shapes and initialisers elided).
  mu = tf.get_variable(name + "_mu")
  sigma = tf.nn.softplus(tf.get_variable(name + "_sigma"))  # softplus keeps the scale positive
  rv_q = ed.interceptable(ed.Normal)(
           loc=mu, scale=sigma, name=name)
  return rv_q

Appendix C Details of the experiments

Algorithms.

  • CP-HMC: HMC run on a fully centred model.

  • NCP-HMC: HMC run on a fully non-centred model.

  • iHMC: interleaved HMC.

  • VIP-HMC: HMC run on a model reparameterised as given by VIP.

Each run consists of VI pre-processing and HMC inference.

Variational inference pre-processing.

We use automatic differentiation to compute stochastic gradients of the ELBO with respect to the variational and parameterisation parameters, and perform the optimisation using Adam (Kingma and Ba, 2014). We implement the constraint λ_i ∈ [0, 1] using a sigmoid transformation, λ_i = sigmoid(λ̂_i) for an unconstrained parameter λ̂_i.

Prior to running HMC, we also run VI to approximate per-variable initial step sizes (equivalently, a diagonal preconditioning matrix), and to initialise the chains. For each of CP-HMC and NCP-HMC this is just mean-field VI, and for VIP-HMC the VI procedure is VIP.

Each VI method is run for the same number of optimisation steps, and the ELBO is approximated using Monte Carlo samples. We use the Adam optimiser, with the initial learning rate decayed after 1000 steps and again after 2000 steps, and return the result with the highest ELBO.

Hamiltonian Monte Carlo inference.

In each case we run multiple chains, each consisting of a warm-up period followed by a sampling period, and report the average effective sample size (ESS) per gradient evaluation. Since ESS is naturally estimated from scalar traces, we first estimate per-variable effective sample sizes for each model variable, and take the overall ESS to be the minimum across all variables.
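
A sketch of this reduction using TensorFlow Probability’s ESS estimator; the chains_by_variable layout and the normalisation by gradient count are illustrative choices, not the exact evaluation code:

import tensorflow as tf
import tensorflow_probability as tfp

def overall_ess(chains_by_variable, num_grad_evals):
  # chains_by_variable: dict mapping variable name -> samples with shape
  # [num_results, num_chains, ...] (illustrative layout).
  per_variable = []
  for samples in chains_by_variable.values():
    ess = tfp.mcmc.effective_sample_size(samples)  # per-chain, per-dimension ESS
    per_variable.append(tf.reduce_min(ess))        # most pessimistic trace for this variable
  # Minimum across variables, reported per gradient evaluation.
  return tf.reduce_min(tf.stack(per_variable)) / num_grad_evals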

The HMC step size was adapted to target an acceptance probability of 0.75, following a simple stochastic update rule based on the acceptance probability of the proposed state at each step (Andrieu and Thoms, 2008). The adaptation runs during the first 1500 steps of the warm-up period, after which we allow the chain to mix towards its stationary distribution.
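
One illustrative rule of this general form (not necessarily the exact one used) multiplicatively adjusts the step size towards the target acceptance rate:

ε_{t+1} = ε_t · exp(η_t (α_t − 0.75)), where α_t is the acceptance probability of the proposal at step t and η_t is a decaying adaptation rate.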

The number of leapfrog steps is chosen using ‘oracle’ tuning: each sampler is run with a logarithmically increasing number of leapfrog steps, and we report the result that maximises the effective sample size per gradient evaluation. This is intended to decouple the problem of tuning the number of leapfrog steps from the issues of parameterisation considered in this paper, and to ensure that each method is reasonably tuned. For iHMC, we tune a single number of leapfrog steps that is shared across both the CP and NCP substeps.