Likelihood-free MCMC with Approximate Likelihood Ratios

03/10/2019 ∙ by Joeri Hermans, et al. ∙ CERN University of Liège 10

We propose a novel approach for posterior sampling with intractable likelihoods. This is an increasingly important problem in scientific applications where models are implemented as sophisticated computer simulations. As a result, tractable densities are not available, which forces practitioners to rely on approximations during inference. We address the intractability of densities by training a parameterized classifier whose output is used to approximate likelihood ratios between arbitrary model parameters. In turn, we are able to draw posterior samples by plugging this approximator into common Markov chain Monte Carlo samplers such as Metropolis-Hastings and Hamiltonian Monte Carlo. We demonstrate the proposed technique by fitting the generating parameters of implicit models, ranging from a linear probabilistic model to settings in high energy physics with high-dimensional observations. Finally, we discuss several diagnostics to assess the quality of the posterior.



There are no comments yet.


page 7

page 8

Code Repositories


Simulation-based inference in PyTorch

view repo


Neural likelihood-free methods in PyTorch.

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Bayesian inference is a major statistical tool in scientific domains. Scientific use cases are generally interested in a posterior which relates parameters of a model to observations . Although the Bayesian formulation is a natural fit for such settings, the implied computation is generally not. Often the evidence is intractable, making posterior inference using Bayes’ rule impractical. Methods such as Markov chain Monte Carlo (mcmc[1, 2] bypass the dependence on the evidence by numerically approximating the posterior. Typically, this involves evaluating some form of the likelihood ratio. In this work, we consider an equally common setting where the likelihood is also intractable, thereby calling for tractable approximations. A large body of research in the area of likelihood-free inference aims to improve the quality and the computational efficiency of such approximations.

This work proposes a novel approach to perform likelihood-free posterior inference using mcmc. Our method relies on the observation that one can train a classifier to approximate the likelihood ratio [3]

. This approximate ratio can be used to compute the transition probability in common

mcmc samplers such as Metropolis-Hastings [1, 2]. In the case the classifier is differentiable, one can derive the score , which makes the method applicable to modern mcmc samplers such as Hamiltonian Monte Carlo [4].

Contributions We adapt the training procedure proposed by [3] to improve the quality of the approximated likelihood ratio (Section 2.1). We introduce approximate likelihood ratio mcmc and derive likelihood-free Metropolis-Hastings and Hamiltonian Monte Carlo samplers (Section 3).

2 Background

2.1 Approximate likelihood ratio tests

When comparing two hypotheses and under a set of observations

, the most powerful test-statistic 

[5] is the likelihood ratio


Previous work [3] shows that it is possible to reformulate the test-statistic under a change of variables such that the likelihood ratio can be written as


This observation can be used in a supervised setting to train a classifier to distinguish samples and . The decision function modeled by the optimal classifier is


Under this decision function, the likelihood ratio of a sample can be obtained using


In some cases, one might also be interested in comparing a larger set of hypotheses. However, training for every likelihood ratio test is not practical. A solution proposed by [3] is to train a parameterized classifier against an arbitrary reference hypothesis . Similarly, the decision function modeled by is


As the reference hypothesis is known, the likelihood ratio between two hypotheses of interest and against is equivalent to


In practice, the choice of a (mathematically) arbitrary has a significant effect on the approximated likelihood ratio function. This is not only attributable to modeling choices in , but also to issues arising in the absence of support between and . In these regions, the classifier

can take on an arbitrary decision function, which results in arbitrary approximated likelihood ratios. Additionally, the use of a reference hypothesis adds a sensitive hyperparameter to the optimization algorithm which requires careful tuning. To overcome the issue with reference hypotheses, we propose training

to classify samples from the evidence and the likelihood such that the decision function becomes


which can be targeted using Algorithm 1. Under this formulation, one can use the likelihood-to-evidence ratio


to obtain the log likelihood ratio


Under the optimized classifier we obtain by computing


Contrary to other approaches which use a classifier to directly approximate the likelihood-to-evidence ratio [6], ours by construction has the desirable property that


Targeting the evidence has several additional advantages. First, we remove the reliance on a manually specified and sensitive . Second, there is support between and , resulting in a proper definition of the likelihood ratio. Third, a single uniform simulation dataset can be used to train , simplifying the implementation. The decision functions of classifiers trained with both approaches are shown in Figure 1. Finally, we would like to mention that the additional loss term in Algorithm 1 (line seven) is not required when having access to an infinite amount of simulations [3]. However, empirical evaluations with limited simulation budgets indicate this term vastly improves the stability and convergence of the resulting parameterized classifier.

Inputs: Generative model .
Prior .
Outputs: Parameterized classifier .
Hyperparameters: Batch-size .

1:while not converged do
2:     Sample
3:     Sample
4:     Simulate
5:     Simulate
10:end while
Algorithm 1 Optimization of .
Figure 1: In both scenarios an ensemble of 10 classifiers is trained under . We show the mean decision function of

and its standard deviation under an observation

. is superimposed as a blue line. Additionally, we plot the approximate posterior (shaded area near ). (Left): training procedure proposed by [3]. is trained under . (Right):

the proposed procedure. By classifying samples against the evidence, the resulting decision function has significantly less variance due to proper constraints for the optimization algorithm. This translates into

putting more density on .

2.2 Markov chain Monte Carlo

Markov chain Monte Carlo (mcmc) methods are generally applied when one is interested in a posterior density


for which the evidence is typically intractable, but point-wise evaluations of the likelihood are possible [1, 2, 7]. mcmc samples are drawn from by collecting dependent states of a Markov chain. The mechanism for proposing the next state in the Markov chain is dependent on the algorithm at hand. However, accepting the transition is usually determined by evaluating some form of the ratio


As a result, the normalizing constant is not required as cancels out within the ratio


3 Approximate Likelihood Ratio MCMC

In this work, we propose an approach to draw samples from a posterior using Markov chain Monte Carlo without likelihoods. We base our approach on the observation that mcmc samplers use the likelihood ratio to assess the quality of a candidate state against the current state . In many applications however, the evaluation of the likelihood is intractable. Recent methods [8, 9, 10] address this by modeling an approximate density and use

as a proxy in the likelihood ratio. Typically, this requires (complex) density estimators 

[11, 12, 9, 13] and expensive training procedures to properly approximate . Instead, we directly model the likelihood ratio using the technique presented in Section 2.1 such that . As a result, the optimization procedure reduces to a classification problem without making any modeling assumptions on . After training, we can use the classifier to transform common mcmc samplers into likelihood-free alternatives (Section 3.1 & Section 3.2). Building upon the results presented in this section, we propose a practical procedure in Appendix A which iteratively improves the approximate posterior.

3.1 Metropolis-Hastings

Adapting the Metropolis-Hastings (mh) algorithm [1, 2] is done by replacing the likelihood ratio test in the original Metropolis mechanism, making the sampler likelihood-free. We summarize the modified mh sampler in Algorithm 2 for completeness.

Inputs: Initial parameter .
Transition distribution .
Parameterized classifier .
Observations .
Outputs: Markov chain
Hyperparameters: Steps .

3:for  do
9:end for
Algorithm 2 Likelihood-free Metropolis-Hastings

3.2 Hamiltonian Monte Carlo

Hybrid or Hamiltonian Monte Carlo (hmc[14, 4, 15] improves the sampling efficiency of vanilla mcmc algorithms by reducing the autocorrelation within the sequence of random Monte Carlo samples. Improving the sampling efficiency is especially useful in scenarios where designing a proper transition distribution is difficult. In hmc, this is achieved by modeling the density as a potential energy function


and attributing some kinetic energy


with momentum to the current state . A new state can be proposed by simulating the Hamiltonian dynamics of . This is achieved by integrating over a fixed number of steps with initial momentum . Afterwards, the Metropolis mechanism uses


to accept or reject the transition to the candidate . The first step in making hmc likelihood-free, is by showing reduces to the log likelihood ratio,


To simulate the Hamiltonian dynamics of , we require a likelihood-free definition of . Current likelihood-free implementations of hmc rely on finite differences to estimate this gradient [16]. Instead, we make the observation that can be rewritten as


This form can be recovered by the optimal classifier under the assumption that the classifier is differentiable. To show this, consider


By expanding the likelihood-to-evidence ratio , Equation 21 reduces to


As an illustration, Figure 2 shows the ability of an approximate classifier to produce accurate estimates of . In particular, the example shows the gradient of observations with respect to and highlights that estimates are accurate when samples of have been presented to during training.

Finally, having likelihood-free forms for and , we can now replace these components in hmc to obtain a likelihood-free hmc sampler (lf-hmc).

Inputs: Initial parameter .
Momentum distribution .
Parameterized classifier .
Observations .
Outputs: Markov chain
Hyperparameters: Steps .
Leapfrog-integration steps .
Leapfrog-integration stepsize .

3:for  do
8:     for  do
13:     end for
18:end for
Algorithm 3 Likelihood-free Hamiltonian Monte Carlo
Figure 2: This figure shows the ability of to produce gradient estimates of the log likelihood with respect to the analytical form. The discrepancy originates from the absence of density in these regions during training. In particular, the classifier did not observe any samples there, while the closed-form expression for the gradient attributes some density to these regions.

4 Experiments

We demonstrate likelihood-free mcmc in several problem settings. We report the classifier structure, and the considered during the training of in every experiment. We do not apply the iterative technique proposed in Appendix A as the classifiers were sufficiently powerful. The posterior is approximated by samples. We also dedicate the computational effort to collect Approximate Bayesian Computation (abc) posterior samples for benchmarking. Additionally, mcmc includes a short burnin period.

The source code to reproduce these experiments can be found in the following To make the proposed technique directly applicable, we wrapped the method and associated diagnostics in a Python package called

(a) lf-mh
(b) abc
(c) Observations
Figure 3: Approximate posterior samples for 10 observations drawn from . The acceptance threshold for abc has been set to . The dashed blue line in Subfigure 2(c) indicates the true slope and intercept .

4.1 Linear regression model

In this first illustrative experiment, we are interested in inferring the posterior over the parameters of a linear probabilistic generative model333Adapted from


with , and . Figure 3 and Figure 4 show the posterior for 10 and 100 observations respectively. For each of these plots, the same parameterized classifier is used to draw posterior samples. We would like to note two modes should be present for as depends on the absolute value of .

Setup  We generate simulations for drawn from a uniform prior . These simulations are used to train . The architecture of

is a multilayer-perceptron with

selu [17] non-linearities. No regularization nor any data normalization technique is applied. The classifier is trained using Adam [18]. The summary statistic for abc is the combination of the slope, intercept and resulting

-value of a linear regression. Finally, the Euclidean distance is computed between the summary statistics to assess the fit of drawn parameter against the observations. Observations are simulated from


(a) lf-mh
(b) abc
(c) Observations
Figure 4: Approximate posterior samples for 100 observations drawn from . The abc acceptance threshold has been set to . Modes for are unbalanced, which is attributable to the dependence among mcmc samples.

Results  The approximate posteriors computed by the proposed method and by ABC are shown in Figures 3 and Figure 4, for 10 and 100 observations respectively. Plots suggest agreement between the two approximate posteriors, showing similar relations between parameters. In both settings, the same pre-trained parameterized classifier is used, which shows the independence between the expensive training procedure and inference. While our method requires a static number of simulations, rejection sampling schemes have to rely on active calls to the simulator. This approach might be inefficient for small acceptance thresholds under a specific prior. Additionally, the proposed approach can be trained on pre-simulated datasets (common in many scientific fields), thereby limiting the reliance on expensive simulator calls during training.

4.2 Lotka-Volterra population model

The Lotka-Volterra model [19] describes the evolution of predator and prey populations over time. The population dynamics are driven by a set of differential equations with several parameters describing the intrinsic population properties: the reproduction rate of the predator , the death rate of the predator , the reproduction rate of the prey , and the predation rate of the prey , thereby summarizing the model parameter as .

Setup  The evolution of the populations is modeled as a Markov jump process. We consider the same simulation model as in [8]. As before, simulations are generated from the prior . We start with an initial prey population of 100, and an initial predator population of 50 and simulate the population dynamics for 30 time-units with a resolution of 0.2. This results in a -dimensional matrix representing the state of the prey and predator populations over time. Subsequently, the content of the matrix is normalized between

to dampen the effect of large population sizes on the neural network. The processed matrix and a standardized

are fed into a batch normalized MLP with

relu non-linearities. The standardization of is computed using the mean and variance of , which is known a priori. The summary statistic in abc follows the definition from [8]: the mean of the population sizes and the autocorrelation of the population time series at different lags. Observations are simulated from .

Results  The approximated posteriors are depicted in Figure 5. The posteriors approximated by lf-mh and abc exhibit similar shapes. As the abc posterior is smeared out by the acceptance threshold , the variance of the lf-mh posterior is expected to be smaller. Additionally, for the considered prior, lf-mh is significantly more simulation-efficient than abc. The acceptance rate of abc under an acceptance threshold of was indicating that abc computed in the order of simulations for an equivalent amount of posterior samples. In contrast, the proposed method has a constant simulation cost. However, the acceptance rate of abc could be improved by relying on more efficient sampling procedures such as smc-abc [20].

(a) lf-mh
(b) abc
Figure 5: Approximate posterior for the Lotka-Volterra population model. abc samples have been drawn under an acceptance threshold of .

4.3 Particle physics detector alignment

To inspect the performance of the proposed method in the presence of high-dimensional observations, we turn to a particle physics inference problem. We consider the Pythia simulator [21] configured according to the parameters derived by the Monash tune [22, 23]. Under this configuration, Pythia simulates collisions at a center-of-mass energy of 91.2 GeV. The product of these collisions is a boson which decays to some quarks. These quarks are then observed by a detector, which we design as a 1-layered pixelized spherical detector [24] with a radius of 30 units, emulating a grid in pseudorapidity and azimuthal angle , covering . Additionally, the detector does not record the energy of a particle passing through its material, but only indicates a hit (i.e., a tracker). The detector is parameterized by an offset parameter in the z-axis parallel to the beam axis. An offset of means that the center of the spherical detector is centered at the collision point.

Setup  We generate simulations with drawn from the prior and 100 collisions per simulation. The parameterized classifier is a modification of vgg-11 [25] with batch-normalization [26] in the convolutional layers. The dependence on is added in the final fully-connected part of the classifier. While vgg

might be overpowered for this particular problem, we made this decision to demonstrate that off-the-shelf convolutional neural networks can be used in the proposed method. Due to the high-dimensional and binary nature of the observations we apply the cosine-similarity as the distance function in

abc with an acceptance threshold of . Lowering the acceptance threshold has a significant computational impact, making the rejection sampling scheme impractical.

Results  Figure 6 clearly shows the proposed technique is outperforming abc. The large variance and flatness of abc posteriors might be attributable to the insufficiency of the summary statistic. It is especially challenging to construct a sufficient summary statistic due to the high variability of the high-dimensional observations caused by the underlying physical processes. Despite the absence of domain knowledge, the classifier is able to extract an approximate posterior which is in agreement with the generating parameter for 100 collisions.

Figure 6: Particle tracker alignment under a single observation with 100 collisions. We consider three settings from Pythia simulations. (Top): . While abc has significantly higher variance, the density is concentrated on the negative range of the offset, suggesting the detector offset is in . No conclusive decision can be made under this particular -threshold. Similar arguments can be made for (Middle): , and (Bottom): . Employing lf-mh, the same classifier has been used across all settings. This shows that a single likelihood ratio model can be applied to a variety of problems after an expensive training procedure.

5 Diagnostics

5.1 Classifier convergence

We make the observation that given a fixed and sufficiently large dataset, the increase in the capacity or representational power of a classifier leads to a decrease in the variance of over classifiers sampled from a optimization procedure , as the error with respect to the best probabilistic classifier decreases.

The probabilistic classifier can be used to obtain the likelihood-to-evidence ratio. As this ratio is unique for a given observation and model parameter , one can assess the classifier convergence by computing . If the variance is large, then the classifier capacity is insufficient to approximate the likelihood-to-evidence ratio, which might result in a miss-estimation of the posterior. If the variance is small, we cannot necessarily conclude that the posterior is well-estimated. However, if the classifier capacity is large enough, then the approximation error of the likelihood-to-evidence ratio is likely to be small, which would result in a good approximate posterior. Figure 7 demonstrates this principle empirically. Additionally, this diagnostic can be useful in an architecture-search scheme, in which the architecture of the parameterized classifier is optimized to minimize this variance.

(a) 64 hidden units
(b) 2 hidden units
Figure 7: We train 15 high and low capacity classifiers, each having 64 and 2 hidden units respectively with a single hidden layer. We perform posterior inference with a single observation . The sample variance indicates that the resolution error of the low capacity classifiers is much greater compared to the high capacity classifiers. In some regions the sample variance of the classifier is 0, indicating a perfect approximation.

5.2 Simulation based calibration

Earlier work [27] proposes a method to verify the Bayesian computation of software. The diagnostic is based on the observation that if and

, then exact posterior quantiles for each parameter will be uniformly distributed, provided that the posteriors are absolutely continuous 

[28]. However, this diagnostic is not guaranteed to function on posterior samples [28]. Simulation Based Calibration (svb[28] circumvents this issue by discretizing the function into a rank-statistic . svb ranks independent samples of the posterior and approximates the quantiles by histogramming. The only assumption is the availability of an implicit model (simulator), thereby making the proposed technique perfectly suitable for diagnosing likelihood-free Bayesian approximations. The vanilla technique is not directly applicable to mcmc samplers (due to sample auto-correlation), as the technique requires independent posterior samples. This can be addressed by ensuring that the effective sample size is at least after uniform thinning of the Markov chain. Figure 8 shows a histogram of rank-statistics for a classifier . Bins outside of the shaded area would indicate a poor approximation of the posterior.

Figure 8: Histogram over rank statistics and independent tests computed using svb [28]. Bins outside the shaded indicate a poor approximation.

6 Related work

Sampling from an implicit model to perform statistical inference [29, 30] is a common technique in science to bypass the intractability of the likelihood. Works applying this technique often rely on rejection sampling schemes such as Approximate Bayesian Computation [31, 32]. abc approximates the posterior by accepting proposed states . This is done by comparing simulated samples against observations under a sufficient summary statistic . A proposed state is accepted when the distance of these summary statistics under some divergence is smaller or equal to , i.e., . Even when is low-dimensional, the acceptance rate of proposals can severely be affected by the prior , i.e., a large misspecified prior can cause a lot of rejections. In recent years variants of abc have been proposed to improve the acceptance rate by guiding simulations based on previously accepted states [20, 33, 34, 35]. Such approaches are crucial, as the acceptance rate drastically decreases when . On the other hand, approaches such as [36] remove the need for handcrafted summary statistics. While abc yields unbiased samples of when , it suffers from high variance as in most settings.

Other approaches take a new perspective and cast inference as an optimization problem [37, 38]. In variational inference (vi), a parameterized posterior over parameters of interest is optimized to minimize kl[39]. Amortized inference [40, 41] expands on this idea by using generative models to capture inference mappings. However, this typically leads to an amortization gap [42, 43] which has been extensively studied by [44]. Recent work in [45] proposes a novel form of variational inference by introducing an adversary in combination with reinforce-estimates [46, 47] to optimize a parameterized prior. This work lends itself naturally to a likelihood-free setting, but only provides point-estimates.

Similarly, approaches like Sequential Neural Posterior Estimation (snpe[10] borrow ideas from vi literature and iteratively adjusts a parameterized posterior defined as a Mixture Density Network [48]. Instead of learning the posterior directly, Sequential Neural Likelihood (snl[8] uses modern conditional density estimators such as autoregressive flows [11, 12, 9, 13] which serve as a proxy for the likelihood . Having access to this proxy, one can use traditional mcmc-techniques to draw samples from the posterior. While snl relies on invertible operations, it also requires careful tuning to test the expressiveness of the autoregressive flow. On the other hand, employing our approach one can rely on standard optimization techniques to train the parameterized classifier. Additionally, the sequential nature of snl makes the technique susceptible to bias when initial samples do not sufficiently represent the likelihood.

Early likelihood-free mcmc samplers relied on the abc rejection sampling scheme to draw proposals. These samples are in turn evaluated by the Metropolis-mechanism under a summary statistic [34, 49]. Other likelihood-free alternatives with theoretical guarantees can be constructed by utilizing pseudo-marginal [50, 51]

approaches, as these only require unbiased point estimates. Typically, such unbiased estimators are obtained by importance sampling.

An important aspect of likelihood-free inference is minimizing the number of simulation calls while retaining the inference quality. Active simulation strategies such as bolfi [52] and others [53, 54] achieve this by employing Bayesian optimization to make efficient use of the simulator. Other approaches attempt to learn approximate versions of the simulator [55]

which is used to perform efficient inference, adapting a similar strategy as in world models for reinforcement learning 

[56]. Our method relies on the training of an accurate likelihood-to-evidence ratio model, which would typically come with a high initial cost. However, recent works [57, 58] make it possible to significantly reduce the cost of this step, provided joint likelihood ratios and scores can be extracted from the simulator.

7 Summary & discussion

We present an approach to perform likelihood-free posterior sampling by approximating likelihood ratios and embedding these ratios in Markov chain Monte Carlo samplers. The likelihood ratios are approximated by training a parameterized classifier to distinguish samples from the likelihood and evidence . In turn, the transformed output of the classifier can be used to approximate the likelihood ratio between arbitrary model parameters. Due to the mcmc nature of the sampling procedure, the presented approach does not suffer from the computational inefficiencies that are present in traditional rejection sampling schemes. Experimental results demonstrate the efficiency, robustness and ease-of-use of the proposed method, enabling the use of existing high-performance neural architectures. Contrary to existing schemes, ours does not require the classifier to be retrained when presented with a different set of observations. However, this comes at the significant upfront cost of generating training set tuples and additional requirements for sufficient classifier capacity.

Likelihood-free mcmc is of interest in domain sciences such as astronomy and high-energy physics which typically rely on computer simulations to describe complex generative processes, as these provide a natural way to perform inference. However, before making any conclusion in a scientific process, a formal understanding of the effect of bias and variance in the classifier would be of interest.

Other interesting directions for future work include the extraction of state proposals from the parameterized classifier instead of using fixed transition distributions or relying on an expensive computation of the Hamiltonian dynamics.


Joeri Hermans would like to thank the F.R.S.-FNRS for his FRIA scholarship. Partial Computational resources have been provided by the Consortium des Équipements de Calcul Intensif (CÉCI), funded by the Fonds de la Recherche Scientifique de Belgique (F.R.S.-FNRS) under Grant No. 2.5020.11 and by the Walloon Region.


Appendix A Iterative posterior estimation

In case the capacity of a trained classifier is insufficient with respect to a given prior , its output may be biased. The bias originates from the approximation error of the likelihood-to-evidence ratio and is attributed to the discrepancy between the estimator and the optimal classifier. Typically, this bias will over- or underestimate the likelihood-to-evidence ratio, which translates into an over- or underestimation of the approximated posterior. While we do not describe the effects of the bias on the approximate posterior formally, empirical results indicate that the bias is expected to overestimate the posterior density. This observation motivates the design of a posterior estimation technique which frees up some capacity of by iteratively reducing the bounds of a prior. This procedure is summarized in Algorithm 4.

Inputs: Initial prior .
Observations .
Outputs: Markov chain
Hyperparameters: Rounds .

2:for  do
7:end for
Algorithm 4 Iterative likelihood-free mcmc

a.1 Demonstration

Consider the following inference task. An implicit model generates images of circles within a coordinate system. The simulator takes as input, where and are the coordinates of the circle’s center and is its radius. The goal is to infer given a single observation . The parameterized classifier for this task is a simple MLP with selu non-linearities. We intentionally limit its capacity to demonstrate the proposed procedure. Initially, the prior at round is . After the first iteration of Algorithm 4, we obtain a Markov chain . Figure 9 shows the approximate posterior as estimated by . At the beginning of the second round, we adjust the prior bounds whenever the approximate posterior shrinks them. In this case, we set the lower bound of the prior in round to , where computes the standard deviation of the specified chain. Likewise, the upper bound is set to . Both quantities are constrained by the initial prior. While the margins introduced by are in essence not required, we introduce them to account for possible Monte Carlo error of the sampler and the bias of the parameterized classifier. After setting the refined prior , we launch the second iteration of classifier training and mcmc sampling. Figure 10 shows the final approximate posterior.

Figure 9: Approximate posterior after the first round. is not able to capture the radius.
Figure 10: Approximate posterior after the second round.