# Classification and Bayesian Optimization for Likelihood-Free Inference

Some statistical models are specified via a data generating process for which the likelihood function cannot be computed in closed form. Standard likelihood-based inference is then not feasible, but the model parameters can be inferred by finding the values which yield simulated data that resemble the observed data. This approach faces at least two major difficulties. The first is the choice of the discrepancy measure used to judge whether the simulated data resemble the observed data. The second is the computationally efficient identification of regions in the parameter space where the discrepancy is low. We give here an introduction to our recent work, in which we tackle the two difficulties through classification and Bayesian optimization.

## 1 Introduction

The likelihood function plays a central role in statistics and machine learning. It is the joint probability of the observed data $D_{\text{obs}}$ seen as a function of the model parameters of interest $\theta$. We may assume that the data are a realization of a two-stage sampling process,

$$Z \sim p_1(Z \mid \theta_{\text{true}}), \qquad D_{\text{obs}} \sim p_2(D_{\text{obs}} \mid Z, \theta_{\text{true}}), \tag{1}$$

where $Z$ are unobserved variables and $\theta_{\text{true}}$ are some fixed but unknown values of the model parameters. The likelihood function is implicitly defined via an integral,

$$L(\theta) = p(D_{\text{obs}} \mid \theta) = \int p_2(D_{\text{obs}} \mid Z, \theta)\, p_1(Z \mid \theta)\, \mathrm{d}Z. \tag{2}$$

For many realistic data generating processes, the integral cannot be computed analytically in closed form, and numerical approximation is computationally too costly as well. Standard likelihood-based inference is then not feasible. Inference can nevertheless be performed by exploiting the possibility of simulating data from the model. Such simulation-based likelihood-free inference methods have emerged in multiple disciplines: “indirect inference” originated in economics (Gouriéroux et al., 1993), “approximate Bayesian computation” (ABC) in genetics (Beaumont et al., 2002; Marjoram et al., 2003; Sisson et al., 2007), and the “synthetic likelihood” approach in ecology (Wood, 2010). The different methods share the basic idea of identifying the model parameters by finding values which yield simulated data that resemble the observed data. The inference process is shown schematically in Algorithm 1 in the framework of ABC.

Algorithm 1 highlights two fundamental difficulties of these inference methods. The first is the measurement of similarity, or discrepancy, between the observed data and the simulated data (line 5). The choice of discrepancy measure affects the statistical quality of the inference. The second difficulty is computational in nature. Since simulating data can be computationally very costly, one would like to identify the region in the parameter space where the simulated data resemble the observed data as quickly as possible, without proposing parameters which have a negligible chance of being accepted (line 3).
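To make the accept/reject idea concrete, here is a minimal rejection-sampling sketch in the spirit of ABC. It is not a reproduction of Algorithm 1: the simulator (a unit-variance normal with unknown mean), the uniform prior on [-5, 5], the discrepancy (absolute difference of sample means), the threshold, and all function names are assumptions chosen for illustration. The comments mark the proposal step and the comparison step, which correspond to the two difficulties discussed above.

```python
import random
import statistics

def simulate(theta, n, rng):
    """Hypothetical simulator: n draws from a unit-variance normal with mean theta."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def discrepancy(sim, obs):
    """Assumed discrepancy measure: absolute difference of the sample means."""
    return abs(statistics.fmean(sim) - statistics.fmean(obs))

def rejection_abc(obs, n_proposals, threshold, rng):
    """Keep the proposed parameters whose simulated data resemble obs."""
    accepted = []
    for _ in range(n_proposals):
        theta = rng.uniform(-5.0, 5.0)          # propose from an assumed uniform prior
        sim = simulate(theta, len(obs), rng)    # simulate data at the proposal
        if discrepancy(sim, obs) <= threshold:  # compare with the observed data
            accepted.append(theta)
    return accepted

rng = random.Random(0)
obs = simulate(0.0, 100, rng)  # stand-in for observed data, generated with mean 0
samples = rejection_abc(obs, n_proposals=5000, threshold=0.1, rng=rng)
print(len(samples), statistics.fmean(samples))
```

Under these assumptions only a small fraction of the proposals is accepted, so most simulations are spent on parameter values that are then discarded. This is the inefficiency that motivates a more targeted search of the parameter space.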

We have been working on both problems, the choice of the discrepancy measure and the fast identification of the parameter regions of interest (Gutmann et al., 2014; Gutmann and Corander, 2015). The following two sections are a brief introduction to the two papers.

## 2 Discriminability as discrepancy measure

We transformed the original problem of measuring the discrepancy between the simulated data $D_\theta$ and the observed data $D_{\text{obs}}$ into a problem of classifying the data into simulated versus observed (Gutmann et al., 2014). Intuitively, it is easier to discriminate between two data sets which are very different than between data sets which are similar. When the two data sets are generated with the same parameter values, the classification task cannot be solved significantly above chance level. This motivated us to use the discriminability (classifiability) as the discrepancy measure, and to perform likelihood-free inference by identifying the parameter values which yield only chance-level discriminability (Gutmann et al., 2014).

We next illustrate this approach using a toy example. The observed data $D_{\text{obs}}$ are assumed to be sampled from a standard normal distribution (black curve in Figure 1(a)), and the parameter of interest is the mean $\theta$. For data simulated with a mean far from zero (green curve), the two densities barely overlap, so that classification is easy. In fact, linear discriminant analysis (LDA) yields a discriminability of almost 100% (Figure 1(b), green dashed curve). If the data are simulated with a mean closer to zero (red curve), the simulated data become more similar to the observed data and the classification accuracy drops to around 60% (red dashed curve). For a mean of zero, where the simulated and observed data are generated with the same value of $\theta$, only chance-level discriminability of 50% is obtained. This illustrates how discriminability can be used as a discrepancy measure.
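The toy example can be sketched in a few lines of Python. This is an illustrative sketch rather than the code behind Figure 1: the sample size, the train/test split, and the three simulated means are assumptions. It relies on the fact that, in one dimension with equal class sizes and a shared variance, the LDA decision boundary is the midpoint between the two class means. Accuracy is measured on held-out data so that identical distributions give roughly 50%.

```python
import random
import statistics

def discriminability(sim, obs):
    """Held-out accuracy of a 1-D LDA-style classifier separating sim from obs.

    With equal class sizes and a shared variance, the LDA decision boundary
    in one dimension is the midpoint between the two class means.
    """
    half = len(sim) // 2
    sim_train, sim_test = sim[:half], sim[half:]
    obs_train, obs_test = obs[:half], obs[half:]
    mean_sim = statistics.fmean(sim_train)
    mean_obs = statistics.fmean(obs_train)
    boundary = 0.5 * (mean_sim + mean_obs)
    sim_is_upper = mean_sim > mean_obs  # side of the boundary labelled "simulated"
    correct = sum((x > boundary) == sim_is_upper for x in sim_test)
    correct += sum((x > boundary) != sim_is_upper for x in obs_test)
    return correct / (len(sim_test) + len(obs_test))

rng = random.Random(1)
n = 4000  # assumed sample size per data set
obs = [rng.gauss(0.0, 1.0) for _ in range(n)]  # "observed" data, standard normal
accuracy = {}
for theta in (6.0, 0.5, 0.0):  # illustrative means, not taken from the source
    sim = [rng.gauss(theta, 1.0) for _ in range(n)]
    accuracy[theta] = discriminability(sim, obs)
    print(f"mean {theta}: discriminability {accuracy[theta]:.3f}")
```

With these assumed values, the discriminability is close to 1 for the distant mean, around 0.6 for the mean of 0.5, and near 0.5 when both data sets share the same mean.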

We analyzed the validity of this approach theoretically and demonstrated it on more challenging synthetic data as well as on real data with an individual-based epidemic model for bacterial infections in day care centers (Gutmann et al., 2014). The finding that classification can be used to measure the discrepancy has both practical and theoretical value. The main practical value is that the rather difficult problem of choosing a discrepancy measure is reduced to a more standard problem for which we can leverage effective existing solutions. The theoretical value lies in the establishment of a tight connection between likelihood-free inference and classification, two fields of research which appear rather different at first glance.

## 3 Bayesian optimization to identify parameter regions of interest

In the following, we denote a certain discrepancy measure by $\Delta_\theta$. A small value of $\Delta_\theta$ is assumed to imply that the simulated data $D_\theta$ are judged to be similar to the observed data $D_{\text{obs}}$. The difficulty in finding parameter regions where $\Delta_\theta$ is small is at least twofold. First, the mapping from $\theta$ to $\Delta_\theta$ can generally not be expressed in closed form, and derivatives are not available either. Second, $\Delta_\theta$ is actually a stochastic process because simulations are used to obtain $D_\theta$. We illustrate this in Figure 2 for our Gaussian toy example, where $\Delta_\theta$ is the discriminability between $D_\theta$ and $D_{\text{obs}}$ (for further examples, see Gutmann and Corander, 2015). The figure visualizes the distribution of $\Delta_\theta$. The fact that $\Delta_\theta$ is a random process was suppressed in Figure 1 by working with a large sample size.

We used Bayesian optimization, a combination of nonlinear (Gaussian process) regression and optimization (see, for example, Brochu et al., 2010), to quickly identify regions where the discrepancy $\Delta_\theta$ is likely to be small (Gutmann and Corander, 2015). In Bayesian optimization, the available information about the relation between $\theta$ and $\Delta_\theta$ is used to build a statistical model of $\Delta_\theta$, and new data are actively acquired in regions where the minimum of $\Delta_\theta$ is potentially located. After acquisition of the new data, e.g. a tuple $(\theta, \Delta_\theta)$, the model is updated using Bayes’ theorem.
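The loop described above (fit a model, acquire a new point where the minimum may lie, update) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Gutmann and Corander (2015): the squared-exponential kernel and its hyperparameters, the lower-confidence-bound acquisition rule, the initial design, and the stand-in objective `noisy_discrepancy` are all hypothetical choices. The stand-in uses the idealized accuracy of separating two unit-variance normals whose means differ by theta, plus Gaussian noise that mimics simulation variability.

```python
import math
import random

def kernel(a, b, scale=0.05, length=1.0):
    """Squared-exponential covariance; kernel and hyperparameters are assumed."""
    return scale * math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Return lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(xs, ys, grid, noise_var):
    """Posterior mean and standard deviation of the GP at each grid point."""
    offset = sum(ys) / len(ys)  # constant prior mean set to the data average
    K = [[kernel(a, b) + (noise_var if i == j else 0.0)
          for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    L = cholesky(K)
    alpha = solve_upper(L, solve_lower(L, [y - offset for y in ys]))
    means, sds = [], []
    for g in grid:
        k_star = [kernel(g, a) for a in xs]
        v = solve_lower(L, k_star)
        means.append(offset + sum(k * a for k, a in zip(k_star, alpha)))
        sds.append(math.sqrt(max(kernel(g, g) - sum(t * t for t in v), 0.0)))
    return means, sds

def noisy_discrepancy(theta, rng, noise_sd=0.03):
    """Hypothetical stand-in for a simulated discrepancy, minimal at theta = 0."""
    accuracy = 0.5 * (1.0 + math.erf(abs(theta) / (2.0 * math.sqrt(2.0))))
    return accuracy + rng.gauss(0.0, noise_sd)

rng = random.Random(2)
grid = [-5.0 + 0.1 * i for i in range(101)]
xs = [-4.5, -1.5, 1.5, 4.5]  # assumed initial design
ys = [noisy_discrepancy(x, rng) for x in xs]
for _ in range(10):  # ten acquisitions
    means, sds = gp_posterior(xs, ys, grid, noise_var=0.03 ** 2)
    lcb = [m - 2.0 * s for m, s in zip(means, sds)]  # lower confidence bound
    theta_next = grid[lcb.index(min(lcb))]  # most promising parameter value
    xs.append(theta_next)
    ys.append(noisy_discrepancy(theta_next, rng))  # run the "simulator" there

means, sds = gp_posterior(xs, ys, grid, noise_var=0.03 ** 2)
best_theta = grid[means.index(min(means))]
print(f"posterior mean is smallest at theta = {best_theta:.1f}")
```

The lower confidence bound trades off exploitation (low posterior mean) against exploration (high posterior uncertainty). Refitting the Gaussian process after each acquisition plays the role of the Bayesian update mentioned above.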

For our simple toy example, the region around zero was identified as the region of interest within ten acquisitions (Figure 3(a-e)). While the location of the minimum is approximately correct, the posterior mean approximates the (empirical) mean of the discrepancy $\Delta_\theta$ in Figure 2 only roughly. As more evidence about the behavior of $\Delta_\theta$ in the region of interest is acquired, the fit improves (Figure 3(f)).

In the full paper (Gutmann and Corander, 2015), we show that Bayesian optimization allows us not only to identify the regions of interest quickly but also to perform approximate posterior inference. Our findings are supported by theory and by applications to real data analysis with intractable models. In our applications, the inference was accelerated through a reduction in the number of required simulations by several orders of magnitude.