Automatic Reparameterisation of Probabilistic Programs
Probabilistic programming has emerged as a powerful paradigm in statistics, applied science, and machine learning: by decoupling modelling from inference, it promises to allow modellers to directly reason about the processes generating data. However, the performance of inference algorithms can be dramatically affected by the parameterisation used to express a model, requiring users to transform their programs in non-intuitive ways. We argue for automating these transformations, and demonstrate that mechanisms available in recent modeling frameworks can implement non-centring and related reparameterisations. This enables new inference algorithms, and we propose two: a simple approach using interleaved sampling and a novel variational formulation that searches over a continuous space of parameterisations. We show that these approaches enable robust inference across a range of models, and can yield more efficient samplers than the best fixed parameterisation.READ FULL TEXT VIEW PDF
Probabilistic programming provides the means to represent and reason abo...
We introduce the notion of a stochastic probabilistic program and presen...
It is natural for probabilistic programs to use conditionals to express
Probabilistic inference procedures are usually coded painstakingly from
Probabilistic programming languages (PPLs) are powerful modelling tools ...
We consider two classes of computations which admit taking linear
Probabilistic techniques are central to data analysis, but different
Automatic Reparameterisation of Probabilistic Programs
Reparameterising a probabilistic model means expressing it in terms of new variables defined by a bijective transformation of the original variables of interest. The reparameterised model expresses the same statistical assumptions as the original, but can have drastically different posterior geometry, with significant implications for both variational and sampling-based inference algorithms.
Non-centring is a particularly common form of reparameterisation in Bayesian hierarchical models. Consider a random variable; we say this is in centred parameterisation (CP). If we instead work with an auxiliary, standard normal variable , and obtain by applying the transformation , we say the variable is in its non-centred parameterisation (NCP). Although the centred parameterisation is often more intuitive, non-centring can dramatically improve the performance of inference (Betancourt and Girolami, 2015). Neal’s funnel ((a)
) provides a simple example: most Markov chain Monte Carlo (MCMC) algorithms have trouble sampling from the funnel due to the strong non-linear dependence between latent variables. Non-centring the model removes this dependence, converting the funnel into a spherical Gaussian distribution.
Bayesian practitioners are often advised to manually non-centre their models (Stan Development Team et al., 2016); however, this breaks the separation between modelling and inference and requires expressing the model in a potentially less intuitive form. Moreover, it requires the user to understand the concept of non-centring and to know a priori where in the model it might be appropriate. Because the best parameterisation for a given model may vary across datasets, even experts may need to find the optimal parameterisation by trial and error, burdening modellers and slowing the model development loop (Blei, 2014).
We propose that non-centring and similar reparameterisations be handled automatically by probabilistic programming systems. We demonstrate how such program transformations may be implemented using the effect handling mechanisms present in several modern deep probabilistic programming frameworks, and consider two inference algorithms enabled by automatic reparameterisation: interleaved Hamiltonian Monte Carlo (iHMC), which alternates HMC steps between centred and non-centred parameterisations, and a novel algorithm we call Variationally Inferred Parameterisation (VIP), which searches over a continuous space of reparameterisations that includes non-centring as a special case.111Code for these algorithms and experiments is available at https://github.com/mgorinova/autoreparam Experiments demonstrate that these strategies enable robust inference, performing at least as well as the best fixed parameterisation across a range of models, and sometimes better, without requiring a priori knowledge of the optimal parameterisation.
The value of non-centring is well-known to MCMC practitioners and researchers (Stan Development Team et al., 2016; Betancourt and Girolami, 2015), and can also lead to better variational fits in hierarchical models (Yao et al., 2018). However, the literature largely treats this as a modelling choice; Yao et al. (2018) propose that “there is no general rule to determine whether non-centred parameterisation is better than the centred one.” We are not aware of prior work that treats non-centring directly as a computational phenomenon to be exploited by inference systems.
Non-centred parameterisation of probabilistic models can be seen as analogous to the reparameterisation trick in stochastic optimisation (Kingma and Welling, 2013)
; both involve expressing a variable in terms of a diffeomorphic transformation from a "standardised" variable. In the context of probabilistic inference, these are complementary tools: the reparameterisation trick yields low-variance stochastic gradients of variational objectives, whereas non-centring changes the geometry of the posterior itself, leading to qualitatively different variational fits and MCMC trajectories.
In the context of Gibbs sampling, Papaspiliopoulos et al. (2007) introduce a family of partially non-centred parameterisations equivalent to those we use in VIP (described below) and show that it improves mixing in a spatial GLMM. Our current work can be viewed as an extension of this work that mechanically reparameterises user-provided models and automates the choice of parameterisation. Similarly, Yu and Meng (2011) proposed a Gibbs sampling scheme that interleaves steps in centered and non-centered parameterisations; our interleaved HMC algorithm can be viewed as an automated, gradient-based descendent of their scheme.
Recently, there has been work on accelerating MCMC inference through learned reparameterisation: Parno and Marzouk (2018) and Hoffman et al. (2019) run samplers in the image of a bijective map fitted to transform the target distribution approximately to an isotropic Gaussian. These may be viewed as ‘black-box’ methods that rely on learning the target geometry, potentially using highly expressive neural variational models, while we use probabilistic-program transformations to apply ‘white-box’ reparameterisations similar to those a modeller could in principle implement themselves. Because they exploit model structure, white-box approaches can correct pathologies such as those of Neal’s funnel ((a)) directly, reliably, and at much lower cost (in parameters and inference overhead) than black-box models. White- and black-box reparameterisations are not mutually exclusive, and may have complementary advantages; combining them is a likely fruitful direction for improving inference in structured models.
Non-centring reparameterisation is not always optimal; its usefulness depends on properties of both the model and the observed data. In this section, we work with a simple hierarchical model for which we can derive the posterior analytically. Consider a simple realisation of a model discussed by Betancourt and Girolami (2015, (2))
, where for a vector ofdatapoints , and some given constants and , we have:
In the non-centred model, is defined in terms of and , where is a standard Gaussian variable:
(a) and (b) show the graphical models for the two parameterisations. In the non-centred case, the direct dependency between and is substituted by a conditional dependency given the data , which creates an “explaining away” effect. Intuitively, this means that the stronger the evidence is (large , and small variance), the stronger the dependency between and becomes, creating a poorly-conditioned posterior that may slow inference.
As the Gaussian distribution is self-conjugate, the posterior distribution in each case (centred or non-centred) is also a Gaussian distribution, and we can analytically inspect its covariance matrix . To quantify the quality of the parameterisation in each case, we investigate the condition number of the posterior covariance matrix under the optimal diagonal preconditioner. This models the common practice (implemented as the default in tools such as PyMC3 and Stan and followed in our experiments) of sampling using a fitted diagonal preconditioner.
(c) shows the condition numbers and for each parameterisation as a function of ; the full derivation is in Appendix A. This figure confirms the intuition that the non-centred parameterisation is better suited for situation when the evidence is weak, while strong evidence calls for centred parameterisation. In this example we can exactly determine the optimal parameterisation, since the model has only one variable that can be reparameterised and the posterior has a closed form. In more realistic settings, even experts cannot predict the optimal parameterisation for hierarchical models with many variables and groups of data, and the wrong choice can lead not just to poor conditioning but to heavy tails or other pathological geometry.
An advantage of probabilistic programming is that the program itself provides a structured model representation, and we can explore model reparameterisation through the lens of program transformations. In this paper, we focus on transforming generative probabilistic programs where the program represents a sampling process describing how the data was generated from some unknown latent variables. Most probabilistic programming languages (PPLs) provide some mechanism for transforming a generative process into an inference program; our automatic reparameterisation approach is applicable to PPLs that transform generative programs using effect handling. This includes modern deep PPLs such as Pyro (Uber AI Labs, 2017) and Edward2 (Tran et al., 2018).
Consider a generative program, where running the program forward generates samples from the prior over latent variables and data. Effect handling-based PPLs treat generating a random variable within such a model as an effectful operation (an operation that is understood as having side effects) and provide ways for resolving this operation in the form of effect handlers, to allow for inference. For example, we often need to transform a statement that generates a random variable to a statement that evaluates some (log) density or mass function. We can implement this using an effect handler:
The handler log_prob_at_0 handles statements of the form . The meaning of such statements is normally “sample a random variable from the distribution and record its value in ”. However, when executed in the context of log_prob_at_0 (we write with log_prob_at_0 handle model), statements that contain random-variable constructions are handled by setting the value of the variable to , then evaluating the log density (or mass) function of at and recording its value in a new (program) variable .
For example, consider the function implementing Neal’s funnel in (b). When executed without any context, this function generates two random variables, and . When executed in the context of the log_prob_at_0 handler, it does not generate random variables, but it instead evaluates and ((c)).
This approach can be extended to produce a function that corresponds to the log joint density (or mass) function of the latent variables of the model. In §§ B.1, we give the pseudo-code implementation of a function make_log_joint, which takes a model — that generates latent variables and generates and observes data — and returns the function . This is a core operation, as it transforms a generative model into a function proportional to the posterior distribution, which can be repeatedly evaluated and automatically differentiated to perform inference.
More generally, effectful operations are operations that can have side effects, e.g. writing to a file. The programming languages literature formalises cases where impure behaviour arises from a set of effectful operations in terms of algebraic effects and their handlers (Plotkin and Power, 2001; Plotkin and Pretnar, 2009; Pretnar, 2015). A concrete implementation for an effectful operation is given in the form of effect handlers, which (similarly to exception handlers) are responsible for resolving the operation. Effect handlers can be used as a powerful abstraction in probabilistic programming, as discussed previously by Moore and Gorinova (2018), and shown by both Pyro and Edward2.
Once equipped with an effect handling-based PPL, we can easily construct handlers to perform many model transformations, including model reparameterisation.
A non-centring handler can be used to non-centre all standardisable 333We focus on Gaussian variables, but non-centring is broadly applicable, e.g. to the location-scale family and random variables that can be expressed as a bijective transformation of a “standardised” variable . latent variables in a model. The handler simply applies to statements of the form , where is not a data variable, and transforms them to , . When nested within a log_prob handler (like the one from §§ 4.1), log_prob handles the transformed standard normal statement . Thus, make_log_joint applied to a model in the ncp context returns the log joint function of the transformed variables rather than the original variables .
For example, corresponds to . But, gives the function , where and .
This approach can easily be extended to other parameterisations, including partially centred parameterisations (as shown later in §§ 5.2), non-centring and whitening multivariate Gaussians, and transforming constrained variables to have unbounded support.
We implement reparameterisation handlers in Edward2, a deep PPL embedded in Python and TensorFlow(Tran et al., 2018). A model in Edward2 is a Python function that generates random variables. In the core of Edward2 is a special case of effect handling called interception. To obtain the joint density of a model, the language provides the function make_log_joint_fn(model), which uses a log_prob interceptor (handler) as previously described.
We extend the usage of interception to treat sample statements in one parameterisation as sample statements in another parameterisation (similarly to the ncp handler above):
We use the interceptor by executing a model of interest within the interceptor’s context (using Python’s context managers). This overrides each random variable’s constructor to construct a variable with location and scale , and scale and shift that variable appropriately:
We present and explain in more detail all interceptors used for this work in Appendix B.
We introduce two inference strategies that exploit automatic reparameterisation: interleaved Hamiltonian Monte Carlo (iHMC), and the Variationally Inferred Parameterisation (VIP).
Automatic reparameterisation opens up the possibility of algorithms that exploit multiple parameterisations of a single model. We consider interleaved Hamiltonian Monte Carlo (iHMC), which uses two HMC steps to produce each sample from the target distribution: the first step is made in CP, using the original model latent variables, while the second step is made in NCP, using the auxiliary standardised variables. Interleaving MCMC kernels across parameterisations has been explored in previous work on Gibbs sampling (Yu and Meng, 2011; Kastner and Frühwirth-Schnatter, 2014), which demonstrated that CP and NCP steps can be combined to achieve more robust and performant samplers. Our contribution is to make the interleaving automatic and model-agnostic: instead of requiring the user to write multiple versions of their model and a custom inference algorithm, we implement iHMC as a black-box inference algorithm for centred Edward2 models.
Algorithm 1 outlines iHMC. It takes a single centred model that defines latent variables and generates data . It uses the function make_ncp to automatically obtain a non-centred version of the model, , which defines auxiliary variables and function , such that .
The best parameterisation for a given model may mix centred and non-centred representations for different variables. To efficiently search the space of reparameterisations, we propose the variationally inferred parameterisation (VIP) algorithm, which selects a parameterisation by gradient-based optimisation of a differentiable variational objective. VIP can be used as a pre-processing step to another inference algorithm; as it only changes the parameterisation of the model, MCMC methods applied to the learned parameterisation maintain their asymptotic guarantees.
Consider a model with latent variables . We introduce parameterisation parameters for each variable , and transform by defining and . This defines a continuous relaxation that includes NCP as the special case and CP as . More generally, it supports a combinatorially large class of per-variable and partial centrings. We aim to choose the parameterisation under which the posterior
is “most like” an independent normal distribution.
A natural objective to minimise is , where is an independent normal model with variational parameters . Minimising this divergence corresponds to maximising a variational lower bound, the ELBO (Bishop, 2006):
Note that the auxiliary parameters are not statistically identifiable: the marginal likelihood is constant with respect to . However, the computational properties of the reparameterised models differ, which is what the variational bound selects for. Our key hypothesis (which the results in Figure LABEL:fig:ess_and_elbo_results seem to support) is that diagonal-normal approximability is a good proxy for MCMC sampling efficiency.
To search for a good model reparameterisation, we optimise using stochastic gradients to simultaneously fit the variational distribution to the posterior and optimise the shape of that posterior. (a) provides a visual example: an independent normal variational distribution is a poor fit to the pathological geometry of a centred Neal’s funnel, but non-centring leads to a well-conditioned posterior, where the variational distribution is a perfect fit. In general settings where the reparameterised model is not exactly Gaussian, sampling-based inference can be used to refine the posterior; we apply VIP as a preprocessing step for HMC (summarised in Algorithm 2). Both the reparameterisation and the construction of the variational model are implemented as automatic program transformations using Edward2’s interceptors.
An alternate interpretation of VIP is that it expands a variational family to a more expressive family capable of representing prior dependence. Letting represent the partial centring transformation, an independent normal family on the transformed model corresponds to an implicit posterior on the original model variables. Under this interpretation,
are variational parameters that serve to add freedom to the variational family, allowing it to interpolate from independent normal (at, (b) left) to a representation that captures the exact prior dependence structure of the model (at , (b) right).
Effective sample size and 95% confidence intervals for the radon model across US states.
We evaluate our proposed approaches by using Hamiltonian Monte Carlo to sample from the posterior of hierarchical Bayesian models on several datasets:
Eight schools (Rubin, 1981)
: estimating the treatment effectsof a course taught at each of schools, given test scores
and standard errors:
Radon (Gelman and Hill, 2006)
: hierarchical linear regression, in which the radon levelin a home in county is modelled as a function of the (unobserved) county-level effect , the county uranium reading , and , the number of floors in the home:
German credit (Dua and Graff, 2017)
: logistic regression; hierarchical prior on coefficient scales:
Election ’88 (Gelman and Hill, 2006): logistic model of 1988 US presidential election outcomes by county, given demographic covariates and state-level effects :
Electric Company (Gelman and Hill, 2006): paired causal analysis of the effect of viewing an educational TV show on each of classforms over grades. The classrooms were divided into pairs, and one class in each pair was treated () at random:
For each model and dataset, we compare our methods, interleaved HMC (iHMC) and VIP-HMC, with baselines of running HMC on either fully centred (CP-HMC) or fully non-centred (NCP-HMC) models. We initialise each HMC chain with samples from an independent Gaussian variational posterior, and use the posterior scales as a diagonal preconditioner; for VIP-HMC this variational optimisation also includes the parameterisation parameters . All variational optimisations were run for the same number of steps, so they were a fixed cost across all methods except iHMC (which depends on preconditioners for both the centred and non-centred transition kernels). The HMC step size and number of leapfrog steps were tuned following the procedures described in Appendix C.
We report the average effective sample size per gradient evaluations (), with standard errors computed from chains. We use gradient evaluations, rather than wallclock time, as they are the dominant operation in both HMC and VI and are easier to measure reliably; in practice, the wallclock times we observed per gradient evaluation did not differ significantly between methods.
Full details on the set up of the experiments can be found in Appendix C.
Figures 4 and LABEL:fig:ess_and_elbo_results show the results of the experiments. In most cases, either the centred or non-centred parameterisation works well, while the other does not. An exception is the German credit dataset, where both CP-HMC and NCP-HMC give a small ESS: or respectively.
Across the datasets in both figures, we see that iHMC is a robust alternative to CP-HMC and NCP-HMC. Its performance is always within a factor of two of the best of CP-HMC and NCP-HMC, and sometimes better. In addition to being robust, iHMC can sometimes navigate the posterior more efficiently than either of CP-HMC and NCP-HMC can: in the case of German credit, it performs better than both ().
Performance of VIP-HMC is typically as good as the better of CP-HMC and NCP-HMC, and sometimes better. On the German credit dataset, it achieves , more than three times the rate of CP-HMC and NCP-HMC, and significantly better than iHMC. LABEL:fig:ess_and_elbo_results shows the correspondence between the optimised mean-field ELBO and the effective sampling rate. This result supports the ELBO as a reasonable predictor of the conditioning of a model, and further confirms the validity of VIP as a strategy for automatic reparameterisation. Finally, we show some of the parameterisations that VIP finds in Figure 6. VIP’s behaviour appears reasonable: for most datasets we looked at, VIP finds the “correct” global parameterisation: most parameterisation parameters are set to either or (Figure 6, left). In the cases where a global parameterisation is not optimal (e.g. radon MO, radon PA and, most notably, German credit), VIP finds a mixed parameterisation, combining centred, non-centred, and partially centred variables (Figure 6, centre and right).
Our results demonstrate that automated reparameterisation of probabilistic models is practical, and enables inference algorithms that can in some cases find parameterisations even better than those a human could realistically express. These techniques allow modellers to focus on expressing statistical assumptions, leaving computation to the computer. We view the methods in this paper as exciting proofs of concept, and hope that they will inspire additional work in this space.
While we focus on reparameterising hierarchical models naturally written in centred form, the inverse transformation—detecting and exploiting implicit hierarchical structure in models expressed as algebraic equations—is an important area of future work. This may be compatible with recent trends exploring the use of symbolic algebra systems in PPL runtimes (Narayanan et al., 2016; Hoffman et al., 2018). We also see promise in automating reparameterisations of heavy-tailed and multivariate distributions, and in designing new inference algorithms to exploit these capabilities.
International Symposium on Functional and Logic Programming - 13th International Symposium, FLOPS 2016, Kochi, Japan, March 4-6, 2016, Proceedings, pages 62–79. Springer.
As the Gaussian distribution is self-conjugate, the posterior distribution (given ) in each case (centred or non-centred) is also a Gaussian distribution, whose shape is entirely specified by a covariance matrix . To quantify the quality of each parameterisation, we investigate the condition number of the posterior covariance matrix in each case under the best diagonal preconditioner.
We do this in three steps:
We compare the condition numbers and , where are the eigenvalues of
At the same time, for , we have:
Thus, for , we get: And therefore:
Like in the previous subsection, we have:
Similarly to before, we derive , and therefore:
Consider a diagonal preconditioner . The best diagonal preconditioner of is such that:
Firstly, in terms of the covariance matrix in the centred case, we have:
The solutions of are the solutions of:
which, after simplification, becomes:
We want to find that minimises . Let . We are looking for , such that , in order to find . By expanding and simplifying we get:
We obtain the best diagonal preconditioner in a similar manner, finally getting:
Finally, we substitute and in the respective eigenvalue equations to derive the condition number in each case:
Interceptors can be used as a powerful abstractions in a probabilistic programming systems, as discussed previously by Moore and Gorinova (2018), and shown by both Pyro and Edward2. In particular, we can use interceptors to automatically reparameterise a model, as well as to specify variational families. In this section, we show Edward2 pseudo-code for the interceptors used to implement iHMC and VIP-HMC.
The following code is an outline of Edward2’s impllementation of a function that evaluates the log density at some given :
By executing the model function in the context of log_prob_interceptor, we override each sample statement (a call to a random variable constructor rv_constructor), to generate a variable that takes on the value provided in the arguments of log_joint_fn
. As a side effect, we also accumulate the result of evaluating each variable’s prior density at the provided value, which, by the chain rule, gives us the log joint density.
By intercepting every construction of a normal variable (or, more generally, of location-scale family variables), we can create a standard normal variable instead, and scale and shift appropriately.
Running a model that declares the random variables in the context of ncp_interceptor will declare a new set of standard normal random variables . Nesting this in the context of the log_prob_interceptor from LABEL:ap:edward will then evaluate the log joint density .
The VIP interceptor is similar to the NCP interceptor. The notable difference is that it creates new learnable Tensorflow variables, which correspond to the parameterisation parameters :
Finally, we show a mean-field variational familiy interceptor, which we use both to tune the step sizes for HMC (see Appendix C), and to make use of VIP automatically. The mfvi_interceptor simply substitutes each sample statement with sampling from a normal distribution with parameters specified by some fresh variational parameters and :
CP-HMC: HMC run on a fully centred model.
NCP-HMC: HMC run on a fully non-centred model.
iHMC: interleaved HMC.
VIP-HMC: HMC run on the a model reparameterised as given by VIP.
Each run consists of VI pre-processing and HMC inference.
We use automatic differentiation to compute stochastic gradients of the ELBO with respect to and perform the optimisation using Adam (Kingma and Ba, 2014). We implement the constraint using a sigmoid transformation; for .
Prior to running HMC, we also run VI to approximate per-variable initial step sizes (equivalently, a diagonal preconditioning matrix), and to initialise the chains. For each of CP-HMC and NCP-HMC this is just mean-field VI, and for VIP-HMC the VI procedure is VIP.
Each VI method is run for optimisation steps, and the ELBO is approximated using Monte Carlo samples. We use the Adam optimiser with initial learning rate , decayed to after 1000 steps and after 2000 steps, and returned the result with the highest ELBO.
In each case we run chains for a warm-up period of steps, followed by steps each, and report the average effective sample size (ESS) per gradient evaluations (). Since ESS is naturally estimated from scalar traces, we first estimate per-variable effective sample sizes for each model variable, and take the overall ESS to be the minimum across all variables.
The HMC step size
was adapted to target an acceptance probability of 0.75, following a simple update rule
where is the acceptance probability of the proposed state at step (Andrieu and Thoms, 2008). The adaptation runs during the first 1500 steps of the warm-up period, after which we allow the chain to mix towards a stationary distribution.
The number of leapfrog steps is chosen using ‘oracle’ tuning: each sampler is run with logarithmically increasing number of leapfrog steps in , and we report the result that maximises . This is intended to decouple the problem of tuning the number of leapfrog steps from the issues of parameterisation consider in this paper, and ensure that each method is reasonably tuned. For iHMC, we tune a single number of leapfrog steps that is shared across both the CP and NCP substeps.