Many real-world phenomena are best described by computer simulations. Such simulators often implement a stochastic generative process, which is based on a mechanistic model and parametrized by . In practice, these simulators are used to generate samples of observations , but the density is only defined implicitly through the simulation code. Often, the generative process involves latent variables and the density
is intractable because of the integral over a large (and possibly highly structured) latent space. Without a tractable likelihood, statistical inference on the parameters given observed data is challenging. This problem has prompted the development of likelihood-free inference methods such as Approximate Bayesian Computation [1, 2, 3, 4] and neural density or neural density ratio estimation algorithms [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. Nearly all of these established methods treat the simulator as a black box and only use its capability to generate samples for a specified values of .
In Refs. [24, 25, 26] a new paradigm was introduced that exploits additional information that can be extracted from the simulation. In particular, within the simulation where the latent variables are available, it is often possible to extract the joint likelihood ratio
and the joint score
which are conditioned on the latent variables corresponding to a particular sample.
It was then shown that certain loss functionals , which depend on the joint likelihood ratio and the joint score, are minimized by the likelihood ratio
an otherwise intractable quantity. This motivates a family of new techniques for likelihood-free inference in which the the joint likelihood ratio and joint score are used as training data for neural networks. These networks serve as surrogate models for the intractable likelihood or likelihood ratio. Experiments showed these new methods to be more sample-efficient than previously established neural density and neural density ratio estimation techniques. The authors of Refs. [24, 25, 26] coined the term “mining gold” for the process of extracting the joint likelihood ratio and joint score from the simulator – while the augmented data require some effort to extract, they are extremely valuable.
While the loss functionals originally proposed in Refs. [24, 25, 26] have the correct minima, they are not necessarily the most sample efficient. In particular, the proposed mean squared error (MSE) losses are often dominated by few samples with large joint likelihood ratios. Here we extend and improve that original work with two new algorithms for likelihood-free inference. The key improvement are new loss functions, which use an improved estimator for the cross entropy based on the joint likelihood ratio and joint score.
2 Cross-entropy estimation with augmented data
Consider the problem of estimating the likelihood ratio based on samples , labeled with ; samples , labeled ; and the joint likelihood ratio and joint score .
The familiar binary cross-entropy loss functional is defined as
For balanced samples () we have
which allows us to rewrite Eq. 5 as
It is straightforward to show that this loss functional is minimized by
To use the cross entropy to train a surrogate model with a finite number of samples, we need a tractable estimator for the cross entropy. The standard estimator, as used for instance in the Carl inference method , is given by
act as an unbiased, but high-variance estimator of. In the limit of infinite samples, this estimator therefore has the correct minimum of Eq. (9), but for finite sample sizes it may suffer from high variance.
With the availability of the joint likelihood ratio from the simulator, the are tractable and we can define the alternative estimator
By using the exact rather than the , the samples drawn according to also provide information about the second term in the loss function, and vice versa. By minimizing the loss function we get an estimator and thus a likelihood ratio estimator
This defines the Alice inference method111Approximate likelihood with improved cross-entropy estimator, which consists of mining the joint likelihood ratio from the simulator, training a neural network on the improved cross-entropy estimator in Eq. (11), and using this surrogate model for statistical inference on .
It is to be expected that a likelihood ratio estimator based on the Alice estimator for the cross-entropy should outperform the Carl method, which is based on the standard cross-entropy estimator in Eq. (10). The more interesting question is how it stacks up against the Rolr technique introduced in Refs. [24, 25, 26], in which the loss function
is minimized. In the limit of infinite samples it is minimized by . But here each event only contributes to either the squared error on or on term, which might lead to a higher variance.
In analogy to the Cascal and Rascal methods of Refs. [24, 25, 26], we can define an additional inference method which uses the joint score, i. e. an additional piece of information that describes the local (tangential) behavior of the likelihood function. If a parameterized likelihood ratio estimator is implemented with a differentiable architecture such as a neural network, we can calculate the gradient of the output with respect to and similarly calculate the corresponding score
of the estimator. For a perfect (or equivalently ) estimator, this corresponding score will also minimize the squared error loss with respect to the joint score , which can be extracted from the simulator [24, 25, 26]. Turning this argument around, we can use the joint score to guide the training of the estimator. This is the idea behind the Alices 222Approximate likelihood with improved cross-entropy estimator and score technique, which is based on the loss function
The factor is necessary to guarantee the correct minimum of the squared error on the score. The hyper-parameter weights the two terms in the loss function. This loss is the natural extension of the the Cascal loss function, but we expect it to reduce the variance compared to the Cascal approach for finite sample size. An interesting question is how it performs compared to the Rascal approach, which similarly augments the Rolr loss in Eq. (13) with the score term.
We experiment with the new methods in the particle physics problem introduced in Refs. [24, 25]. In this real-world problem, the outcome of proton-proton collisions is characterized by 42 observables, from which likelihood ratios and confidence limits on two model parameters are derived. We first consider an idealized setting neglecting the detector response where the likelihood function is tractable, which provides us with ground truth that can be used to evaluate the performance of the algorithms. For a detailed description of the setup, see Ref. .
In addition to the Carl, Rolr, Cascal, and Rascal techniques described above, we also compare to the Sally and Sallino methods. Sally and Sallino approximate a statistical model that is accurate in the neighborhood of . The methods are very sample efficient, but make approximations that limit their asymptotic performance.
Except for the new loss functions, we used the same architectures and hyper-parameters as in Ref. . In particular, we use fully connected networks with five hidden layers, 100 units each, and activation functions for both approaches. For Alices we use , which was found to give a good performance for the closely related Cascal method .
|training samples||training samples||training samples|
Table 1 and Fig. 1 show the quality of the likelihood ratio estimate based on various sized training samples for the new methods and compares them to the inference techniques presented in Ref. . As a performance metric we use an expected mean squared error on the log likelihood ratio, as defined in Ref. .
Unsurprisingly, the Alice and Rolr methods clearly outperform Carl, which does not have access to the joint likelihood ratio. More significantly, we find that Alice outperforms and Rolr, which does have access to the joint likelihood ratio. We conjecture that this improvement can be attributed to the lower variance of the cross-entropy compared to the squared error. More surprisingly, the Alice method also outperforms the Rascal method for larger training sample sizes (), even though Alice does not have access to the joint score.
For smaller training sample sizes () the Alices method outperforms the Alice method, which is not surprising given the additional information available during training. For larger training sample sizes (), the variance of the score actually deteriorates the performance of Alices compared to Alice. We did not perform hyper-parameter tuning for as a function of the training sample size, which should ensure that Alices performs at least as well as Alice. We leave a systematic tuning of the parameter and an analysis of sources of variance in this approach for future work.
Figure 2 shows expected exclusion contours at different confidence levels on the two parameters, assuming 36 observed events distributed according to . The methods are trained on the full training samples of
samples. The left panel shows contours constructed based on asymptotic properties of the profile likelihood ratio test statistic. While methods such asRascal are generally very accurate, with this construction they can sometimes lead to overly optimistic exclusion contours, visible as tighter bounds than the “truth” contour. We find that switching to Alice and Alices reduces this issue, but does not entirely solve it.
The right panel of Fig. 2
shows exclusion contours based on the frequentist confidence intervals calibrated with toy experiments. This Neyman construction guarantees coverage: while the limits from any approach may be worse than the optimal limits, they will never be overly optimistic. As test statistics we use the likelihood ratio with respect to the, which explains why the contours are generally stronger than in the left panel. We find that both Alice and Alices, like Rascal and Cascal of Refs. [24, 25, 26], lead to limits that are virtually indistinguishable from the ideal limits based on the true likelihood ratio.
Finally, in Fig. 3 we show similar expected exclusion contours, but in a more realistic setup in which the parton shower and detector effects are described with approximate smearing functions, which makes the true likelihood intractable. In this situation, we cannot compare the likelihood ratio estimators to the ground truth. Instead, we show the expected contours based on the Neyman construction, similar to the right panel of Fig. 2. In the left panel we show results for limited training samples of only events. In this setup, Alices allows for strong limits, comparable to Rascal and slightly better than for Alice. The right panel demonstrates that with the full training sample the results of Rascal, Alice, and Alices are indistinguishable.
In this work, we have extended recently developed inference techniques for the setting in which the likelihood is only implicitly defined through a stochastic generative model or simulator. By exploiting the joint likelihood ratio that can be extracted from the simulator, we introduced an improved cross-entropy estimator. This improved cross-entropy estimator is used to define two new likelihood-free inference techniques: Alice and Alices.
Our experiments comparing Alice and Alices with the other recently developed techniques indicate that they are significantly more sample efficient than Rolr, Cascal, and Rascal techniques. We attribute this to the lower variance of the improved cross-entropy estimator. For smaller training sample sizes, there are still advantages to the Sally and Sallino techniques.
We note that it is possible to use a hybrid of the traditional cross-entropy of Eq. 5 and the improved cross-entropy Eq. 11. This would be useful in situations where one may not have access to the joint ratio for practical reasons or because some training samples come from real data instead of a simulation. Furthermore, we note that the improved cross-entropy estimator of Alice and Alices can be extended from the binary setting to one where samples are generated from multiple parameters points if the joint likelihood ratio for all pairs is available. These joint likelihood ratios provide the necessary ingredient for importance sampling beyond the binary setting considered here.
The ubiquity of simulators and other implicit models indicates there is enormous potential for likelihood-free inference techniques. The use of augmented data improves the sample efficiency of these techniques significantly, and these results motivate further study of variance reduction techniques that leverage this augmented data.
JB, KC, and GL are grateful for the support of the Moore-Sloan data science environment at NYU. KC and GL were supported through the NSF grants ACI-1450310 and PHY-1505463. JP was partially supported by the Scientific and Technological Center of Valparaíso (CCTVal) under Fondecyt grant BASAL FB0821. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
-  D. B. Rubin: ‘Bayesianly justifiable and relevant frequency calculations for the applied statistician’. Ann. Statist. 12 (4), p. 1151, 1984. URL https://doi.org/10.1214/aos/1176346785.
-  M. A. Beaumont, W. Zhang, and D. J. Balding: ‘Approximate bayesian computation in population genetics’. Genetics 162 (4), p. 2025, 2002.
-  J. Alsing, B. Wandelt, and S. Feeney: ‘Massive optimal data compression and density estimation for scalable, likelihood-free inference in cosmology’ , 2018. arXiv:1801.01497.
-  T. Charnock, G. Lavaux, and B. D. Wandelt: ‘Automatic physical inference with information maximizing neural networks’. Phys. Rev. D97 (8), p. 083004, 2018. arXiv:1802.03537.
T. Kanamori, S. Hido, and M. Sugiyama: ‘A least-squares approach to direct importance estimation’. Journal of Machine Learning Research 10 (Jul), p. 1391, 2009.
-  Y. Fan, D. J. Nott, and S. A. Sisson: ‘Approximate Bayesian Computation via Regression Density Estimation’. ArXiv e-prints , 2012. arXiv:1212.1479.
-  L. Dinh, D. Krueger, and Y. Bengio: ‘NICE: Non-linear Independent Components Estimation’. ArXiv e-prints , 2014. arXiv:1410.8516.
-  D. Jimenez Rezende and S. Mohamed: ‘Variational Inference with Normalizing Flows’. ArXiv e-prints , 2015. arXiv:1505.05770.
-  K. Cranmer, J. Pavez, and G. Louppe: ‘Approximating Likelihood Ratios with Calibrated Discriminative Classifiers’ , 2015. arXiv:1506.02169.
-  K. Cranmer and G. Louppe: ‘Unifying generative models and exact likelihood-free inference with conditional bijections’. J. Brief Ideas , 2016.
-  L. Dinh, J. Sohl-Dickstein, and S. Bengio: ‘Density estimation using Real NVP’. ArXiv e-prints , 2016. arXiv:1605.08803.
-  G. Papamakarios and I. Murray: ‘Fast -free inference of simulation models with bayesian conditional density estimation’. In ‘Advances in Neural Information Processing Systems’, p. 1028–1036, 2016.
-  B. Paige and F. Wood: ‘Inference Networks for Sequential Monte Carlo in Graphical Models’. ArXiv e-prints , 2016. arXiv:1602.06701.
-  R. Dutta, J. Corander, S. Kaski, and M. U. Gutmann: ‘Likelihood-free inference by ratio estimation’. ArXiv e-prints , 2016. arXiv:1611.10242.
-  B. Uria, M.-A. Côté, K. Gregor, I. Murray, and H. Larochelle: ‘Neural Autoregressive Distribution Estimation’. ArXiv e-prints , 2016. arXiv:1605.02226.
-  A. van den Oord, S. Dieleman, H. Zen, et al.: ‘WaveNet: A Generative Model for Raw Audio’. ArXiv e-prints , 2016. arXiv:1609.03499.
-  A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu: ‘Conditional Image Generation with PixelCNN Decoders’. ArXiv e-prints , 2016. arXiv:1606.05328.
-  A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu: ‘Pixel Recurrent Neural Networks’. ArXiv e-prints , 2016. arXiv:1601.06759.
-  M. U. Gutmann, R. Dutta, S. Kaski, and J. Corander: ‘Likelihood-free inference via classification’. Statistics and Computing p. 1–15, 2017.
-  D. Tran, R. Ranganath, and D. M. Blei: ‘Hierarchical Implicit Models and Likelihood-Free Variational Inference’. ArXiv e-prints , 2017. arXiv:1702.08896.
-  G. Louppe and K. Cranmer: ‘Adversarial Variational Optimization of Non-Differentiable Simulators’. ArXiv e-prints , 2017. arXiv:1707.07113.
-  G. Papamakarios, T. Pavlakou, and I. Murray: ‘Masked Autoregressive Flow for Density Estimation’. ArXiv e-prints , 2017. arXiv:1705.07057.
-  G. Papamakarios, D. C. Sterratt, and I. Murray: ‘Sequential Neural Likelihood: Fast Likelihood-free Inference with Autoregressive Flows’. ArXiv e-prints , 2018. arXiv:1805.07226.
-  J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez: ‘Constraining Effective Field Theories with Machine Learning’ , 2018. arXiv:1805.00013.
-  J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez: ‘A Guide to Constraining Effective Field Theories with Machine Learning’ , 2018. arXiv:1805.00020.
-  J. Brehmer, G. Louppe, J. Pavez, and K. Cranmer: ‘Mining gold from implicit models to improve likelihood-free inference’ , 2018. arXiv:1805.12244.