Likelihood-free inference with an improved cross-entropy estimator

08/02/2018 ∙ by Markus Stoye, et al. ∙ Universidad Técnica Federico Santa María ∙ New York University ∙ CERN ∙ University of Liège

We extend recent work (Brehmer et al., 2018) that uses neural networks as surrogate models for likelihood-free inference. As in the previous work, we exploit the fact that the joint likelihood ratio and joint score, conditioned on both observed and latent variables, can often be extracted from an implicit generative model or simulator to augment the training data for these surrogate models. We show how this augmented training data can be used to define a new cross-entropy estimator, which provides improved sample efficiency compared to previous loss functions exploiting the augmented data.




1 Introduction

Many real-world phenomena are best described by computer simulations. Such simulators often implement a stochastic generative process, which is based on a mechanistic model and parametrized by $\theta$. In practice, these simulators are used to generate samples of observations $x \sim p(x|\theta)$, but the density $p(x|\theta)$ is only defined implicitly through the simulation code. Often, the generative process involves latent variables $z$ and the density

$$ p(x|\theta) = \int \! dz \; p(x, z|\theta) \tag{1} $$

is intractable because of the integral over a large (and possibly highly structured) latent space. Without a tractable likelihood, statistical inference on the parameters $\theta$ given observed data is challenging. This problem has prompted the development of likelihood-free inference methods such as Approximate Bayesian Computation [1, 2, 3, 4] and neural density or neural density ratio estimation algorithms [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. Nearly all of these established methods treat the simulator as a black box and only use its capability to generate samples for specified values of $\theta$.

In Refs. [24, 25, 26] a new paradigm was introduced that exploits additional information that can be extracted from the simulation. In particular, within the simulation where the latent variables $z$ are available, it is often possible to extract the joint likelihood ratio

$$ r(x, z|\theta_0, \theta_1) = \frac{p(x, z|\theta_0)}{p(x, z|\theta_1)} \tag{2} $$

and the joint score

$$ t(x, z|\theta_0) = \nabla_\theta \log p(x, z|\theta) \Big|_{\theta_0} \, , \tag{3} $$

which are conditioned on the latent variables corresponding to a particular sample.

It was then shown that certain loss functionals $L[\hat{g}]$, which depend on the joint likelihood ratio and the joint score, are minimized by the likelihood ratio

$$ r(x|\theta_0, \theta_1) = \frac{p(x|\theta_0)}{p(x|\theta_1)} \, , \tag{4} $$

an otherwise intractable quantity. This motivates a family of new techniques for likelihood-free inference in which the joint likelihood ratio and joint score are used as training data for neural networks. These networks serve as surrogate models for the intractable likelihood or likelihood ratio. Experiments showed these new methods to be more sample-efficient than previously established neural density and neural density ratio estimation techniques. The authors of Refs. [24, 25, 26] coined the term “mining gold” for the process of extracting the joint likelihood ratio and joint score from the simulator: while the augmented data require some effort to extract, they are extremely valuable.
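As a concrete illustration, consider a toy simulator of our own devising (not the physics setup of Refs. [24, 25, 26]): a single latent variable $z \sim \mathcal{N}(\theta, 1)$ and an observable $x \sim \mathcal{N}(z, 1)$. Inside the simulator, where $z$ is recorded, the joint likelihood ratio and joint score are simple closed-form expressions, even though in general computing $p(x|\theta)$ would require integrating over $z$. The function names below are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n):
    """Toy simulator: latent z ~ N(theta, 1), observable x ~ N(z, 1)."""
    z = rng.normal(theta, 1.0, size=n)
    x = rng.normal(z, 1.0)
    return x, z

def log_norm(u, mu, var):
    """Log density of a normal distribution N(mu, var)."""
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (u - mu) ** 2 / var

def mine_gold(x, z, theta0, theta1):
    """'Mine' the joint likelihood ratio and joint score inside the simulator,
    where z is available.  The p(x|z) factors cancel in the ratio, and only
    the p(z|theta) factor depends on theta."""
    log_r_joint = log_norm(z, theta0, 1.0) - log_norm(z, theta1, 1.0)
    t_joint = z - theta0  # d/dtheta log N(z; theta, 1), evaluated at theta0
    return np.exp(log_r_joint), t_joint
```

In this Gaussian toy the marginal likelihood happens to be tractable as well ($p(x|\theta) = \mathcal{N}(x; \theta, 2)$), which makes it convenient for checking that the mined per-event quantities average to the intractable marginal quantities.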

While the loss functionals originally proposed in Refs. [24, 25, 26] have the correct minima, they are not necessarily the most sample efficient. In particular, the proposed mean squared error (MSE) losses are often dominated by a few samples with large joint likelihood ratios. Here we extend and improve that original work with two new algorithms for likelihood-free inference. The key improvement is a new set of loss functions, which use an improved estimator for the cross entropy based on the joint likelihood ratio and joint score.

After introducing these new algorithms in Sec. 2, we show their performance on a problem from particle physics in Sec. 3, before giving our conclusions in Sec. 4.

2 Cross-entropy estimation with augmented data

Consider the problem of estimating the likelihood ratio $r(x|\theta_0, \theta_1)$ based on samples $x_e \sim p(x|\theta_0)$, labeled with $y_e = 0$; samples $x_e \sim p(x|\theta_1)$, labeled $y_e = 1$; and the joint likelihood ratio $r(x_e, z_e|\theta_0, \theta_1)$ and joint score $t(x_e, z_e|\theta_0)$.

The familiar binary cross-entropy loss functional is defined as

$$ L_{XE}[\hat{s}(x)] = -\int \! dx \sum_y p(x, y) \left[ y \log \hat{s}(x) + (1 - y) \log(1 - \hat{s}(x)) \right] . \tag{5} $$
For balanced samples ($p(y=0) = p(y=1) = 1/2$) we have

$$ p(x, y) = p(x|y) \, p(y) = \frac{1}{2} \, p(x|\theta_y) \, , \tag{6} $$
which allows us to rewrite Eq. 5 as

$$ L_{XE}[\hat{s}(x)] = -\frac{1}{2} \int \! dx \left[ p(x|\theta_1) \log \hat{s}(x) + p(x|\theta_0) \log(1 - \hat{s}(x)) \right] . \tag{7} $$
It is straightforward to show that this loss functional is minimized by

$$ \hat{s}(x) = s(x) = \frac{p(x|\theta_1)}{p(x|\theta_0) + p(x|\theta_1)} \tag{8} $$
$$ s(x) = \frac{1}{1 + r(x|\theta_0, \theta_1)} \, . \tag{9} $$
To use the cross entropy to train a surrogate model with a finite number of samples, we need a tractable estimator for the cross entropy. The standard estimator, as used for instance in the Carl inference method [9], is given by

$$ \hat{L}_{XE}[\hat{s}(x)] = -\frac{1}{N} \sum_{(x_e, y_e)} \left[ y_e \log \hat{s}(x_e) + (1 - y_e) \log(1 - \hat{s}(x_e)) \right] , \tag{10} $$

where the labels $y_e$ act as an unbiased, but high-variance estimator of $s(x_e)$. In the limit of infinite samples, this estimator therefore has the correct minimum of Eq. (9), but for finite sample sizes it may suffer from high variance.

With the availability of the joint likelihood ratio from the simulator, the $s(x_e, z_e|\theta_0, \theta_1) = [1 + r(x_e, z_e|\theta_0, \theta_1)]^{-1}$ are tractable and we can define the alternative estimator

$$ \hat{L}'_{XE}[\hat{s}(x)] = -\frac{1}{N} \sum_{(x_e, z_e)} \left[ s(x_e, z_e|\theta_0, \theta_1) \log \hat{s}(x_e) + \left(1 - s(x_e, z_e|\theta_0, \theta_1)\right) \log(1 - \hat{s}(x_e)) \right] . \tag{11} $$
By using the exact $s(x_e, z_e|\theta_0, \theta_1)$ rather than the labels $y_e$, the samples drawn according to $p(x|\theta_1)$ also provide information about the second term in the loss function, and vice versa. By minimizing the loss function we get an estimator $\hat{s}(x)$ and thus a likelihood ratio estimator

$$ \hat{r}(x|\theta_0, \theta_1) = \frac{1 - \hat{s}(x)}{\hat{s}(x)} \, . \tag{12} $$
This defines the Alice (Approximate likelihood with improved cross-entropy estimator) inference method, which consists of mining the joint likelihood ratio from the simulator, training a neural network on the improved cross-entropy estimator in Eq. (11), and using this surrogate model for statistical inference on $\theta$.

It is to be expected that a likelihood ratio estimator based on the Alice estimator for the cross entropy should outperform the Carl method, which is based on the standard cross-entropy estimator in Eq. (10). The more interesting question is how it stacks up against the Rolr technique introduced in Refs. [24, 25, 26], in which the loss function

$$ L_{ROLR}[\hat{r}(x)] = \frac{1}{N} \sum_{(x_e, z_e, y_e)} \left[ (1 - y_e) \left| \frac{1}{\hat{r}(x_e)} - \frac{1}{r(x_e, z_e|\theta_0, \theta_1)} \right|^2 + y_e \left| \hat{r}(x_e) - r(x_e, z_e|\theta_0, \theta_1) \right|^2 \right] \tag{13} $$

is minimized. In the limit of infinite samples it is minimized by $\hat{r}(x) = r(x|\theta_0, \theta_1)$. But here each event only contributes to either the squared error on the $r$ term or on the $1/r$ term, which might lead to a higher variance.
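For comparison, the Rolr loss of Eq. (13) can be sketched in the same minimal NumPy style (again with our own naming): each event enters exactly one of the two squared-error terms, selected by its label.

```python
import numpy as np

def rolr_loss(r_hat, r_joint, y):
    """Rolr squared-error loss, Eq. (13): theta1 events (y = 1) regress the
    joint ratio directly, while theta0 events (y = 0) regress its inverse."""
    return np.mean(y * (r_hat - r_joint) ** 2
                   + (1.0 - y) * (1.0 / r_hat - 1.0 / r_joint) ** 2)
```

Because the joint ratio can be large for individual events, these quadratic terms are easily dominated by a few samples; this is the high-variance behavior that the improved cross-entropy estimator avoids.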

In analogy to the Cascal and Rascal methods of Refs. [24, 25, 26], we can define an additional inference method which uses the joint score, i.e. an additional piece of information that describes the local (tangential) behavior of the likelihood function. If a parameterized likelihood ratio estimator $\hat{r}(x|\theta, \theta_1)$ is implemented with a differentiable architecture such as a neural network, we can calculate the gradient of the output with respect to $\theta$ and similarly calculate the corresponding score

$$ \hat{t}(x|\theta_0) = \nabla_\theta \log \hat{r}(x|\theta, \theta_1) \Big|_{\theta_0} \tag{14} $$

of the estimator. For a perfect $\hat{s}$ (or equivalently $\hat{r}$) estimator, this corresponding score will also minimize the squared error loss with respect to the joint score $t(x_e, z_e|\theta_0)$, which can be extracted from the simulator [24, 25, 26]. Turning this argument around, we can use the joint score to guide the training of the estimator. This is the idea behind the Alices (Approximate likelihood with improved cross-entropy estimator and score) technique, which is based on the loss function

$$ L_{ALICES}[\hat{s}(x)] = \hat{L}'_{XE}[\hat{s}(x)] + \alpha \, \frac{1}{N} \sum_{(x_e, z_e, y_e)} (1 - y_e) \left| t(x_e, z_e|\theta_0) - \hat{t}(x_e|\theta_0) \right|^2 . \tag{15} $$
The factor $(1 - y_e)$ is necessary to guarantee the correct minimum of the squared error on the score: only for samples drawn according to $p(x, z|\theta_0)$ is the joint score an unbiased estimator of the score $t(x|\theta_0)$. The hyper-parameter $\alpha$ weights the two terms in the loss function. This loss is the natural extension of the Cascal loss function, but we expect it to reduce the variance compared to the Cascal approach for finite sample size. An interesting question is how it performs compared to the Rascal approach, which similarly augments the Rolr loss in Eq. (13) with the score term.
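Putting the pieces together, the Alices loss above can be sketched as follows. This is a simplified scalar-$\theta$ rendering with our own naming; in practice the estimator score would come from differentiating the network output with respect to $\theta$, not from a precomputed array.

```python
import numpy as np

def alices_loss(s_hat, t_hat, r_joint, t_joint, y, alpha):
    """Alices loss: improved cross-entropy plus an alpha-weighted squared
    error on the score.  The (1 - y) factor keeps only theta0 events, for
    which the joint score is an unbiased estimate of t(x|theta0)."""
    s_joint = 1.0 / (1.0 + r_joint)
    xe = -np.mean(s_joint * np.log(s_hat) + (1.0 - s_joint) * np.log(1.0 - s_hat))
    score_mse = np.mean((1.0 - y) * (t_joint - t_hat) ** 2)
    return xe + alpha * score_mse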

3 Experiments

We experiment with the new methods in the particle physics problem introduced in Refs. [24, 25]. In this real-world problem, the outcome of proton-proton collisions is characterized by 42 observables, from which likelihood ratios and confidence limits on two model parameters are derived. We first consider an idealized setting neglecting the detector response where the likelihood function is tractable, which provides us with ground truth that can be used to evaluate the performance of the algorithms. For a detailed description of the setup, see Ref. [25].

In addition to the Carl, Rolr, Cascal, and Rascal techniques described above, we also compare to the Sally and Sallino methods. Sally and Sallino approximate a statistical model that is accurate in the neighborhood of a reference parameter point. These methods are very sample efficient, but make approximations that limit their asymptotic performance.

Except for the new loss functions, we used the same architectures and hyper-parameters as in Ref. [25]. In particular, we use fully connected networks with five hidden layers of 100 units each and the same activation functions for both approaches. For Alices we use the value of $\alpha$ that was found to give a good performance for the closely related Cascal method [25].

[Table 1 data: expected MSE per strategy for three training sample sizes; the numeric entries were not recoverable.]
Table 1: Fidelity of different strategies in the idealized scenario, using training sets of various sizes. We use the expected mean squared error as defined in Ref. [25] as a performance metric. The new methods, Alice and Alices, outperform Rascal, Cascal, and Rolr. The Sally and Sallino methods are very sample efficient, but make approximations that limit their asymptotic performance.

Table 1 and Fig. 1 show the quality of the likelihood ratio estimates for training samples of various sizes for the new methods, and compare them to the inference techniques presented in Ref. [25]. As a performance metric we use an expected mean squared error on the log likelihood ratio, as defined in Ref. [25].

Unsurprisingly, the Alice and Rolr methods clearly outperform Carl, which does not have access to the joint likelihood ratio. More significantly, we find that Alice also outperforms Rolr, which does have access to the joint likelihood ratio. We conjecture that this improvement can be attributed to the lower variance of the cross-entropy estimator compared to the squared error. More surprisingly, the Alice method also outperforms the Rascal method for larger training sample sizes, even though Alice does not have access to the joint score.

For smaller training sample sizes the Alices method outperforms the Alice method, which is not surprising given the additional information available during training. For larger training sample sizes, the variance of the score term actually deteriorates the performance of Alices compared to Alice. We did not tune the hyper-parameter $\alpha$ as a function of the training sample size; such tuning should ensure that Alices performs at least as well as Alice. We leave a systematic tuning of the parameter $\alpha$ and an analysis of the sources of variance in this approach for future work.

Figure 1: Estimator fidelity in the idealized scenario as a function of the training sample size. As a metric we use the expected mean squared error on the log likelihood ratio, see Ref. [25]. The new methods are more sample efficient than the similar Rolr and Rascal techniques.

Figure 2 shows expected exclusion contours at different confidence levels on the two parameters, assuming 36 observed events. The methods are trained on the full training sample. The left panel shows contours constructed based on asymptotic properties of the profile likelihood ratio test statistic. While methods such as Rascal are generally very accurate, with this construction they can sometimes lead to overly optimistic exclusion contours, visible as tighter bounds than the “truth” contour. We find that switching to Alice and Alices reduces this issue, but does not entirely solve it.

Figure 2: Expected exclusion limits on the model parameters in the idealized scenario for different inference methods. We assume 36 observed events. All estimators are trained on the full training sample. Left: construction of exclusion limits based on asymptotic properties of the likelihood ratio. With this method, inefficient estimators can predict overly optimistic exclusion limits, as can be seen for instance for the Rascal method. The new Alice and Alices approaches are less prone to this issue. Right: construction of exclusion limits calibrated with toy experiments (i.e. the Neyman construction). In this approach, the intervals will always cover, but might not be optimal. We find an excellent performance of the Alice and Alices methods, virtually indistinguishable from the Rascal method and the true likelihood ratio.

The right panel of Fig. 2 shows exclusion contours based on frequentist confidence intervals calibrated with toy experiments. This Neyman construction guarantees coverage: while the limits from any approach may be worse than the optimal limits, they will never be overly optimistic. As test statistic we use the likelihood ratio with respect to the maximum-likelihood estimator, which explains why the contours are generally stronger than in the left panel. We find that both Alice and Alices, like the Rascal and Cascal methods of Refs. [24, 25, 26], lead to limits that are virtually indistinguishable from the ideal limits based on the true likelihood ratio.

Figure 3: Expected exclusion limits on the model parameters in the scenario with detector effects for different inference methods. We construct the contours with the Neyman construction, which guarantees that the confidence intervals cover. The intervals are based on 36 observed events. The estimators are trained on a reduced (left) or the full (right) training sample. The Alices method leads to strong limits, comparable to the Rascal technique.

Finally, in Fig. 3 we show similar expected exclusion contours, but in a more realistic setup in which the parton shower and detector effects are described with approximate smearing functions, which makes the true likelihood intractable. In this situation, we cannot compare the likelihood ratio estimators to the ground truth. Instead, we show the expected contours based on the Neyman construction, similar to the right panel of Fig. 2. In the left panel we show results for a limited training sample. In this setup, Alices allows for strong limits, comparable to Rascal and slightly better than Alice. The right panel demonstrates that with the full training sample the results of Rascal, Alice, and Alices are indistinguishable.

4 Conclusions

In this work, we have extended recently developed inference techniques for the setting in which the likelihood is only implicitly defined through a stochastic generative model or simulator. By exploiting the joint likelihood ratio that can be extracted from the simulator, we introduced an improved cross-entropy estimator. This improved cross-entropy estimator is used to define two new likelihood-free inference techniques: Alice and Alices.

Our experiments comparing Alice and Alices with the other recently developed techniques indicate that they are significantly more sample efficient than the Rolr, Cascal, and Rascal techniques. We attribute this to the lower variance of the improved cross-entropy estimator. For smaller training sample sizes, there are still advantages to the Sally and Sallino techniques.

We note that it is possible to use a hybrid of the traditional cross-entropy of Eq. 5 and the improved cross-entropy of Eq. 11. This would be useful in situations where one may not have access to the joint ratio for practical reasons, or because some training samples come from real data instead of a simulation. Furthermore, we note that the improved cross-entropy estimator of Alice and Alices can be extended from the binary setting to one where samples are generated from multiple parameter points, provided the joint likelihood ratio for all pairs of points is available. These joint likelihood ratios provide the necessary ingredient for importance sampling beyond the binary setting considered here.
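Such a hybrid could be as simple as switching the cross-entropy target per event, a sketch under our own conventions with a boolean mask marking the events for which the joint ratio was mined (dummy ratio values can be supplied for the remaining events):

```python
import numpy as np

def xe_hybrid(s_hat, y, r_joint, has_gold):
    """Hybrid cross-entropy: use the mined target s(x, z) where available
    (has_gold == True) and fall back to the discrete label y elsewhere."""
    target = np.where(has_gold, 1.0 / (1.0 + r_joint), y)
    return -np.mean(target * np.log(s_hat) + (1.0 - target) * np.log(1.0 - s_hat))
```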

The ubiquity of simulators and other implicit models indicates there is enormous potential for likelihood-free inference techniques. The use of augmented data improves the sample efficiency of these techniques significantly, and these results motivate further study of variance reduction techniques that leverage this augmented data.


JB, KC, and GL are grateful for the support of the Moore-Sloan data science environment at NYU. KC and GL were supported through the NSF grants ACI-1450310 and PHY-1505463. JP was partially supported by the Scientific and Technological Center of Valparaíso (CCTVal) under Fondecyt grant BASAL FB0821. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.