GANs for LIFE: Generative Adversarial Networks for Likelihood Free Inference

11/29/2017 ∙ by Vinay Jethava, et al.

We introduce a framework using Generative Adversarial Networks (GANs) for likelihood-free inference (LFI) and Approximate Bayesian Computation (ABC). Our approach addresses both key problems in likelihood-free inference, namely how to compare distributions and how to efficiently explore the parameter space. Our framework allows one to use the simulator model as a black box and leverages the power of deep networks to generate a rich set of features in a data-driven fashion (as opposed to previous ad hoc approaches). It is thereby a step towards a powerful alternative approach to LFI and ABC. On benchmark data sets, our approach improves on others with respect to scalability and the ability to handle high-dimensional data and complex probability distributions.


1 Introduction

Approximate Bayesian Computation (ABC) is a likelihood-free inference method that learns the parameter by generating simulated data and accepting proposed parameter values when the simulated data resemble the observed data (see Lintusaari et al., 2017, for a recent overview). In most high-dimensional settings, a summary statistic is chosen to represent the data, so that the distance between simulated and observed data can be computed in terms of the summary statistics. Rejection ABC (Pritchard et al., 1999) often suffers from low acceptance rates, and several methods have been developed to address this problem. Broadly, these can be categorized as Markov chain Monte Carlo methods (Marjoram et al., 2003; Beaumont et al., 2009; Meeds et al., 2015; Moreno et al., 2016), sequential Monte Carlo (Sisson et al., 2007), and, more recently, classifier-based approaches based on Bayesian optimization (BOLFI) (Gutmann et al., 2016).
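To make the rejection-ABC baseline concrete, here is a minimal sketch; the `simulator`, `prior_sampler`, and `summary` callables as well as the toy normal-mean example are illustrative placeholders, not part of the original paper.

```python
import numpy as np

def rejection_abc(observed, simulator, prior_sampler, summary, n_proposals=10000, eps=0.1):
    """Minimal rejection ABC: accept a proposal when its simulated summary is close to the observed one."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_proposals):
        theta = prior_sampler()                       # draw a candidate parameter from the prior
        x = simulator(theta)                          # run the (black-box) simulator
        if np.linalg.norm(summary(x) - s_obs) < eps:  # keep the candidate if the summaries are close
            accepted.append(theta)
    return np.array(accepted)                         # samples from the approximate posterior

# Toy usage: infer the mean of a normal distribution from its sample mean.
rng = np.random.default_rng(0)
observed = rng.normal(2.0, 1.0, size=100)
posterior = rejection_abc(
    observed,
    simulator=lambda mu: rng.normal(mu, 1.0, size=100),
    prior_sampler=lambda: rng.uniform(-5.0, 5.0),
    summary=lambda x: np.array([x.mean()]),
)
print(len(posterior), posterior.mean())
```

The low acceptance rate of this scheme in higher dimensions is exactly the problem the methods discussed next try to alleviate.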

In a series of recent papers, Gutmann et al. (2014, 2017, 2016) have suggested two novel ideas: treating the problem of discriminating between the observed and simulated distributions as a classification problem, and regressing the distance measure on the parameter using Gaussian processes in order to identify the regions of the parameter space with higher acceptance ratios.

This line of work is closely related to Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), with different flavors of GANs (Goodfellow et al., 2014; Nowozin et al., 2016; Arjovsky et al., 2017; Dziugaite et al., 2015) minimizing alternative divergences (or ratio losses) between the observed and simulated data (see Mohamed and Lakshminarayanan, 2016, for an excellent exposition).

In this paper, we develop the connection between GANs (Goodfellow, 2016) and Gutmann et al. (2017) and present a new differentiable architecture inspired by GANs for likelihood-free inference (Figure 1). In doing so, we make the following contributions:

  • We present a method for adapting a black-box simulator-based model with an “approximator”, a differentiable neural network module (similar to function approximation; see, e.g., Liang and Srikant, 2016, and references therein).

  • Our method provides automatic generation of summary statistics as well as a choice of different distance functions (for comparing the summary statistics), with a clear relation to the ratio tests used in GANs.

  • We adapt one of the key ideas in Gutmann et al. (2017), namely gradient-descent-based search to quickly narrow down to the acceptance region of the parameter space, to the framework of GANs.

  • We perform experiments on real-world problems (beyond the Lotka-Volterra model) previously studied in the ABC literature, showing benefits as well as cases where the differentiable neural network architecture might not be the best solution (Section 4).

2 Related Work

Summary statistics and distance function

The choice of summary statistic is known to be critical to the performance of an ABC scheme (Blum et al., 2013; Prangle, 2015). Several methods have explored automatic generation of summary statistics, e.g., projection using regression (Fearnhead and Prangle, 2012), random forests (Marin et al., 2016), etc. More recently, Prangle et al. (2017) have explored alternative distance functions for the comparison of summary statistics. Within the classification scheme of Prangle (2015) and Prangle et al. (2017), one of our contributions is the automatic computation of non-linear projection-based summary statistics together with a moment-matching distance function (MMD (Dziugaite et al., 2015) or the Wasserstein distance (Arjovsky et al., 2017)).

LFI using neural networks

Several authors have recently proposed alternative approaches for likelihood-free inference (Meeds and Welling, 2015; Cranmer et al., 2015; Papamakarios and Murray, 2016; Tran et al., 2017). In particular, Papamakarios and Murray (2016) inverted the ABC problem by sampling the parameter from a mixture of Gaussians parametrized by a neural network model. More recently, Tran et al. (2017) presented an elegant variational approach for likelihood-free inference under the restrictions that (1) the conditional density is indirectly specified as a hierarchical implicit model (similar to the “approximator” in this work), which, given input noise, outputs a sample conditioned on the latent variable and the parameter; and (2) one can sample from the target distribution.

However, it is not clear how to extend these methods to the high-dimensional setting, where the choice of summary statistics is crucial. Further, the mean-field assumption in Tran et al. (2017) is not valid for time-series models. In contrast to the above methods, this paper clearly demarcates the summarization from the approximation of the non-differentiable simulator. Additionally, not being constrained by the variational setup of Tran et al. (2017), one can use sophisticated approximator/summarizer pairs for “simulating” a black-box simulator.

Maximum Mean Discrepancy

The Maximum Mean Discrepancy (MMD) is an integral probability metric defined via a kernel and its associated Reproducing Kernel Hilbert Space (RKHS) (Muandet et al., 2017; Sriperumbudur et al., 2012; Gretton et al., 2007). Explicitly, the squared MMD with kernel k between distributions P and Q is given by MMD^2(P, Q) = E[k(x, x')] - 2 E[k(x, y)] + E[k(y, y')], where x, x' are independent copies from P and y, y' are independent copies from Q. For empirical samples {x_i} from P and {y_j} from Q, an unbiased estimate of the squared MMD is obtained by replacing the expectations with sample averages, excluding the diagonal terms k(x_i, x_i) and k(y_j, y_j). As shown in Dziugaite et al. (2015), this estimate can be differentiated with respect to the parameters generating one of the distributions.
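A minimal NumPy sketch of this unbiased estimate with a Gaussian kernel (the kernel and bandwidth are illustrative choices, not prescribed by the paper):

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)) evaluated for all pairs of rows."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of MMD^2 between samples x ~ P and y ~ Q (rows are samples)."""
    m, n = len(x), len(y)
    k_xx, k_yy, k_xy = (gaussian_kernel(u, v, bandwidth) for u, v in [(x, x), (y, y), (x, y)])
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))   # exclude k(x_i, x_i)
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))   # exclude k(y_j, y_j)
    return term_xx + term_yy - 2.0 * k_xy.mean()

# Samples from the same distribution give an estimate close to zero.
rng = np.random.default_rng(0)
print(mmd2_unbiased(rng.normal(size=(200, 2)), rng.normal(size=(200, 2))))
```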

3 Model

Algorithm 1: ABC-GAN training loop (schematic; see Section 3 for the two optimization phases).
1:   while not converged do
2:      for each element of the mini-batch do
3:         draw the generator and simulator noise
4:         draw samples as in (1)-(7)
5:      end for
        {Loss function}
6-8:    compute the loss terms for improve_approx and improve_accept
        {Optimization}
9-10:   take a gradient step for improve_approx and for improve_accept
11:  end while
Figure 1: ABC-GAN architecture for ABC computation. The external dependence on noise is shown in red. Two distinct network paths (shown in green and pink) correspond to the two different optimizations (improve_approx and improve_accept, respectively), akin to the ratio test in GANs.

Consider a target distribution with unknown parameters that need to be estimated. We have access to a black-box simulator that allows us to generate samples from this distribution for any choice of the parameters; the underlying stochasticity of the simulator is captured by a suitable noise input.

Figure 1 gives a high-level overview of the ABC-GAN architecture. The inputs to the network are the generator noise and the simulator noise, both of which are problem-specific. The network functions are, in order:

(1) observations: samples drawn from the target distribution;
(2) generator: maps the generator noise to a proposed parameter;
(3) approximator: maps the proposed parameter and the simulator noise to an approximate sample;
(4) simulator: maps the proposed parameter and the simulator noise to a simulated sample;
(5) summary-sim.: summary statistics of the simulated sample;
(6) summary-obs.: summary statistics of the observations;
(7) decoder-approx.: reconstructs the parameter from the approximator output.

The trainable parameters are those of the generator, approximator, summarizer, and decoder networks, and the optimization consists of two alternating stages, namely:

  • improve_approx: This phase trains the approximator and the summarizer against the black-box simulator using a ratio test between the summarized simulator output and the summarized approximator output, plus a loss term corresponding to a decoder for the approximator (treating the approximator output as an encoding of the parameter).

  • improve_accept: This phase trains the generator to propose parameters whose simulated data are similar to the observed data, i.e., parameters with better acceptance rates.

The improve_approx optimization (without improve_accept) reduces to function approximation of the summary statistics of the black-box simulator, with the decoder loss ensuring that the outputs of the approximator and summarizer do not identically go to zero. A network trained in this way can be used in place of the simulator, in a transparent fashion, within any ABC scheme. However, this is extremely wasteful, and the improve_accept phase instead incorporates gradient descent in parameter space to quickly reach the acceptance region, similarly to BOLFI (Gutmann et al., 2017). Given a perfect approximator (or, directly, the black-box simulator), choosing an f-divergence (Nowozin et al., 2016; Mohamed and Lakshminarayanan, 2016) as the loss ensures that the generator unit attempts to produce parameters within the acceptance region.

Algorithm 1 shows the specific loss functions and optimization procedure used in this paper. In our experiments, we observed that training with the MMD loss in lieu of a discriminator unit (Dziugaite et al., 2015) yielded the best results (Section 4).
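For concreteness, the following is a minimal sketch of the alternating optimization using an MMD loss in place of a discriminator. It is written with the modern TensorFlow 2 / Keras API rather than the paper's original TensorFlow r1.3 code, and the module sizes, toy simulator, and stand-in observed data are illustrative placeholders, not the paper's implementation.

```python
import tensorflow as tf

THETA_DIM, NOISE_DIM, DATA_DIM, SUMMARY_DIM = 2, 4, 16, 8   # illustrative sizes

generator = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="relu"),
                                 tf.keras.layers.Dense(THETA_DIM)])
approximator = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                                    tf.keras.layers.Dense(DATA_DIM)])
summarizer = tf.keras.Sequential([tf.keras.layers.Dense(SUMMARY_DIM)])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(THETA_DIM)])

opt_approx = tf.keras.optimizers.RMSprop(1e-3)
opt_accept = tf.keras.optimizers.RMSprop(1e-3)

def mmd2(x, y, bw=1.0):
    """Biased MMD^2 estimate with a Gaussian kernel (kept short for the sketch)."""
    k = lambda a, b: tf.exp(-tf.reduce_sum(tf.square(a[:, None, :] - b[None, :, :]), -1) / (2 * bw ** 2))
    return tf.reduce_mean(k(x, x)) + tf.reduce_mean(k(y, y)) - 2.0 * tf.reduce_mean(k(x, y))

def black_box_simulator(theta, noise):
    """Placeholder for the real, non-differentiable simulator."""
    return tf.cos(theta @ tf.ones((THETA_DIM, DATA_DIM))) + 0.1 * noise

y_obs = tf.random.normal((128, DATA_DIM))      # stand-in for the observed data

for step in range(1000):
    z_g = tf.random.normal((128, NOISE_DIM))
    z_s = tf.random.normal((128, NOISE_DIM))

    # improve_approx: train the approximator/summarizer/decoder to track the black-box simulator.
    theta = tf.stop_gradient(generator(z_g))
    x_sim = black_box_simulator(theta, tf.random.normal((128, DATA_DIM)))
    with tf.GradientTape() as tape:
        x_app = approximator(tf.concat([theta, z_s], axis=1))
        loss_approx = (mmd2(summarizer(x_app), summarizer(x_sim))
                       + tf.reduce_mean(tf.square(decoder(x_app) - theta)))   # decoder reconstruction term
    variables = (approximator.trainable_variables + summarizer.trainable_variables
                 + decoder.trainable_variables)
    opt_approx.apply_gradients(zip(tape.gradient(loss_approx, variables), variables))

    # improve_accept: train the generator so that its parameters yield data close to y_obs.
    with tf.GradientTape() as tape:
        x_app = approximator(tf.concat([generator(z_g), z_s], axis=1))
        loss_accept = mmd2(summarizer(x_app), summarizer(y_obs))
    opt_accept.apply_gradients(zip(tape.gradient(loss_accept, generator.trainable_variables),
                                   generator.trainable_variables))
```

In practice the placeholder simulator would be replaced by the actual black-box code, and several improve_approx steps can be taken per improve_accept step (see the Discussion below).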

Discussion

We emphasize that our implementation (Algorithm 1) is just one possible solution in the ABC-GAN architecture landscape. For a new problem of interest, we highlight some of these choices below:

  • Pretraining (improve_approx): For a specific region of parameter space (e.g., based on domain knowledge), one can pre-train and fine-tune the approximator and summarizer networks by generating more samples from parameters within the restricted parameter space.

  • Automatic summarization: Further, in domains where the summary statistic is not obvious or is rather ad hoc (Section 4.2), the improve_approx optimization provides a natural method for identifying good summary statistics.

  • Loss function/Discriminator: Prangle et al. (2017) discuss why standard distance functions like Euclidean might not be suitable in many cases. In our architecture, the loss function can be selected based on the insights gained in the GAN community (Li et al., 2015; Bellemare et al., 2017; Arora et al., 2017; Sutherland et al., 2016).

  • Training: GANs are known to be hard to train. In this work, improve_approx is more crucial (since the approximator/summarizer pair tracks the simulator) and in general should be prioritized over improve_accept (which chooses the next parameter setting to explore). We suggest using several rounds of improve_approx for each round of improve_accept in case of convergence problems.

  • Mode collapse: We did not encounter mode collapse in a simple experiment on a univariate normal distribution (Supplementary Material). However, it is a known problem and an active area of research (Arjovsky et al., 2017; Arora et al., 2017), and choosing the Wasserstein distance (Arjovsky et al., 2017) has yielded promising results in other problems.

  • Module internals: Our architecture is not constrained by independence between samples (vis-à-vis Tran et al. (2017)). For example, in time-series problems it makes sense to use deep recurrent architectures (e.g., LSTMs, GRUs) that are known to capture long-range dependencies (e.g., Section 4.2). Design of the network structure is an active area of deep learning research which can be leveraged alongside domain knowledge.

4 Experiments

All experiments are performed using TensorFlow r1.3 on a 2015 MacBook Pro laptop with a Core i5 processor, 16 GB of RAM, and no GPU support. The code for the experiments will be made available on GitHub. In addition to the experiments reported below, the supplementary material evaluates our ABC-GAN architecture on two small synthetic models, namely a univariate normal distribution and a mixture of normal distributions.

4.1 Generalized linear model

Figure 2: Inferred parameter values for one run of ABC-GAN for the GLM model (Section 4.1), with a uniform prior and two-layer feed-forward networks used for the generator and approximator in the ABC-GAN structure. We see that the ABC-GAN algorithm recovers values close to the true parameter.

Kousathanas et al. (2016) note that the basic ABC algorithm and sequential Monte Carlo methods (Sisson et al., 2007; Beaumont et al., 2009) are useful for low-dimensional models with only a handful of parameters. To tackle higher dimensions, Kousathanas et al. (2016) consider scenarios where it is possible to define sufficient statistics for subsets of the parameters, allowing parameter-specific ABC computation. This includes the class of exponential-family probability distributions.

In this example, we consider a generalized linear model defined by Kousathanas et al. (2016, Toy model 2), in which the observations are obtained by applying a fixed design matrix to the unknown parameter vector and adding multivariate normal noise. We use a uniform prior as in the original work. However, we note that Kousathanas et al. (2016) “start the MCMC chains at a normal deviate, i.e., around the true values” of the parameter. The true parameter is fixed in advance for the experiment.
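As a hedged illustration of this kind of linear-Gaussian toy model (the dimension, design matrix, and noise scale below are arbitrary placeholders, not the exact specification of Kousathanas et al. (2016)):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 10                                        # illustrative dimensionality
C = rng.normal(size=(DIM, DIM))                 # hypothetical fixed design matrix

def glm_simulator(theta, n_samples=100):
    """Toy linear-Gaussian simulator: each observation is C @ theta plus standard normal noise."""
    noise = rng.normal(size=(n_samples, DIM))
    return theta @ C.T + noise                  # shape (n_samples, DIM)

theta_true = rng.uniform(-1.0, 1.0, size=DIM)
y_obs = glm_simulator(theta_true)
print(y_obs.shape)
```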

We perform parameter inference for the multivariate Gaussian model in the above setting. Figure 2 shows the mean of the posterior samples within each mini-batch as the algorithm progresses (for clarity, a larger version of this plot is presented in Figure 1 of the supplementary material); the run uses mini-batch stochastic optimization and completes in seconds on the hardware described above. We reiterate that ABC-GAN does not use model-specific information such as knowledge of sufficient statistics for subsets of the parameters, and is thus more widely applicable than the approach of Kousathanas et al. (2016). At the same time, it enables computationally efficient inference in high-dimensional models, which is a challenge for sequential Monte Carlo based methods (Sisson et al., 2007; Beaumont et al., 2009) and BOLFI (Gutmann et al., 2017); in this high-dimensional setting, classic ABC methods or BOLFI cannot be used easily, in contrast to this work.

4.2 Ricker’s model

Figure 3: Left: network structure of ABC-GAN for the Ricker model, built from densely connected layers, sigmoid activations, and an LSTM. Right: convolutional summarizer for the DCC example, which maps each input binary matrix to a fixed-size summary representation via convolutional filters (with the number of filters, filter sizes, and strides shown), a max-pooling layer, and a final dense layer.

The stochastic Ricker model (Ricker, 1954) is an ecological model described by the nonlinear autoregressive equation N_{t+1} = r N_t exp(-N_t + e_t), where N_t is the animal population at time t and e_t is independent zero-mean Gaussian process noise with standard deviation sigma_e. The observation y_t is Poisson distributed with mean proportional to the population, y_t ~ Poisson(phi N_t), and the model parameters are (log r, sigma_e, phi). The latent time series N_t makes the inference of the parameters difficult.
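A minimal sketch of a stochastic Ricker simulator of this form (the parameter values, series length, and initial population below are illustrative, not the paper's settings):

```python
import numpy as np

def ricker_simulate(log_r, sigma_e, phi, T=50, n0=1.0, rng=None):
    """Simulate the latent Ricker population and its Poisson-distributed observations."""
    rng = rng if rng is not None else np.random.default_rng()
    n = np.empty(T)
    y = np.empty(T, dtype=int)
    n_prev = n0
    for t in range(T):
        e_t = rng.normal(0.0, sigma_e)                         # Gaussian process noise
        n_t = np.exp(log_r) * n_prev * np.exp(-n_prev + e_t)   # N_{t+1} = r * N_t * exp(-N_t + e_t)
        y[t] = rng.poisson(phi * n_t)                          # observed count y_t ~ Poisson(phi * N_t)
        n[t] = n_prev = n_t
    return y, n

y_obs, n_latent = ricker_simulate(log_r=3.8, sigma_e=0.3, phi=10.0)   # illustrative parameter values
print(y_obs[:10])
```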

Wood (2010) computed a synthetic log-likelihood by defining a set of summary statistics, including the mean of the observations, the number of zeros in the observed series, low-order auto-covariances, and coefficients of auxiliary regressions on lagged observations. They fit a multivariate Gaussian to the summary statistics of the simulated samples; the synthetic log-likelihood is then given by the probability of the observed data's summary statistics under this Gaussian model. Wood (2010) inferred the unknown parameters using a standard Markov chain Monte Carlo (MCMC) method based on this synthetic log-likelihood. Gutmann et al. (2016) used the same synthetic log-likelihood but reduced the number of required samples for parameter inference by regressing the discrepancy between simulated and observed samples (in terms of summary statistics) on the parameters using Gaussian processes.
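For intuition, a simplified subset of such hand-crafted summaries (mean, number of zeros, and low-order auto-covariances) could be computed as follows; the exact statistics used by Wood (2010) are richer than this sketch.

```python
import numpy as np

def ricker_summaries(y, max_lag=5):
    """A simplified hand-crafted summary vector for an observed count series y."""
    y = np.asarray(y, dtype=float)
    centred = y - y.mean()
    autocov = [np.mean(centred[:-k] * centred[k:]) for k in range(1, max_lag + 1)]
    return np.concatenate([[y.mean(), np.sum(y == 0.0)], autocov])

toy_series = np.random.default_rng(0).poisson(5.0, size=50)   # stand-in count series
print(ricker_summaries(toy_series))
```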

Figure 4: Inferred parameters for the Ricker model (Section 4.2) in a single run, trained using RMSProp on mini-batches of fixed-length sequences. The means of the estimated parameters for this run and their range are shown.
Figure 5: The histogram of the posterior samples for the last 1000 iterations of ABC-GAN in one run for the Ricker model. The true parameters are shown in green.

Figure 3 shows the network structure for the Ricker model (please see the supplementary material for a detailed description of the ABC-GAN architecture for this model). We follow the experimental setup of Wood (2010): we simulate observations from the Ricker model with fixed true parameters and place prior distributions on the three parameters.

Figure 4 shows the output of the generator (posterior samples for the parameters) as the algorithm progresses; the optimization is done using RMSProp on mini-batches of fixed-length sequences. Figure 5 shows the histogram of the posterior samples for the last 1000 iterations of the algorithm. We note that the algorithm converges and the estimates are quite close to the true parameters.

We report the posterior means averaged over independent runs, together with the typical wall-clock time per run. BOLFI's estimates for a comparable setup are reported in Gutmann et al. (2016, Figure 9), and a typical run of BOLFI takes considerably longer for the given setup (the code provided by Dr. Michael Gutmann uses the GNU R and C code of Wood for the synthetic log-likelihood and is considerably faster than the Python version available in ELFI).

We re-emphasize that, compared to the highly engineered features of Wood (2010), our summary representation is learned using a standard LSTM-based neural network (Hochreiter and Schmidhuber, 1997; Graves, 2012). Thus, our approach allows easier extension to other problems than manual feature engineering does.
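A minimal Keras sketch of an LSTM-based summarizer of this flavor (layer sizes are illustrative, not the exact architecture in Figure 3 or the supplementary material):

```python
import tensorflow as tf

lstm_summarizer = tf.keras.Sequential([
    tf.keras.layers.LSTM(32),     # consumes a (batch, time, 1) series and returns the last hidden state
    tf.keras.layers.Dense(8),     # fixed-size summary vector per series
])

y_batch = tf.random.poisson((16, 50, 1), lam=5.0)   # stand-in batch of count time series
print(lstm_summarizer(y_batch).shape)               # (16, 8)
```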

For complex simulators with latent variables in a low-dimensional setting, with few observations and a limited simulation budget, BOLFI performs best, albeit at a significantly higher computational cost. Further, the approach of Tran et al. (2017) is not suitable here, as the mean-field assumption is violated in this time-series model.

4.3 Infection in Daycare center

Figure 6: Results for the DCC example using the non-random features of Gutmann et al. (2017) as summary statistics. The result shows the histogram of the posterior samples generated by the algorithm.
Figure 7: Histogram of the posterior samples generated by ABC-GAN for the DCC example using the convolutional summarizer. The true parameters are shown in green.

We study the transmission of strains of Streptococcus pneumoniae among children attending day care centers (DCCs) in Oslo, Norway. The initial data were published by Vestrheim et al. (2008) and further described in a follow-up study (Vestrheim et al., 2010). Numminen et al. (2013) first presented an ABC-based approach for inferring the parameters associated with the rate of infection from an outside source, the rate of infection from within the DCC, and the rate of infection by multiple strains. Gutmann et al. (2017) presented a classification-based approach in which classification is used as a surrogate for the rejection test of the standard ABC method. Section C in the supplementary material presents a discussion of the hand-engineered features of Numminen et al. (2013) and Gutmann et al. (2017).

In the remainder of this section, we follow the nomenclature of Gutmann et al. (2017). For a single DCC, the observed data consist of the presence or absence of each strain of the bacterium in each sampled attendee at the time the swabs were taken. On average, a number of individuals attend a DCC, of which only some are sampled, and several distinct strains of the bacterium are observed in total. The data from each DCC therefore form a binary matrix whose entry is one if a given attendee carries a given strain at the sampling time and zero otherwise, and the observed data consist of the set of such binary matrices across DCCs. For the simulator, we use the code provided by Michael Gutmann, which in turn uses the code of Elina Numminen (https://www.cs.helsinki.fi/u/gutmann/code/BOLFI/).

We place a prior on each of the three rate parameters.

We first use the non-random features defined by Gutmann et al. (2017) as our summary statistics. Figure 6 shows the histogram of the posterior samples generated by ABC-GAN using these features. We note that the results are not encouraging in this case, though this is an artifact of our algorithm. To address this, we define a new summary representation using a convolutional network. Figure 3 shows the structure of the convolutional summarizer, which is used to generate summary representations in place of the non-random features. The generator and approximator modules are standard one-layer feed-forward networks, fully described in Section C of the supplementary material.
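A hedged Keras sketch of a convolutional summarizer for binary attendee-by-strain matrices; the filter counts, kernel sizes, pooling, matrix dimensions, and summary size below are placeholders rather than the values used in the paper.

```python
import tensorflow as tf

conv_summarizer = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Conv2D(16, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(16),                   # summary representation per DCC matrix
])

# A batch of 10 hypothetical binary attendee-by-strain matrices (dimensions are placeholders).
x = tf.cast(tf.random.uniform((10, 53, 33, 1), maxval=2, dtype=tf.int32), tf.float32)
print(conv_summarizer(x).shape)                  # (10, 16)
```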

Figure 7 shows the results for the convolutional summarizer. We note that the posterior samples improve, especially for two of the parameters. However, there is considerable room for improvement, for instance by using alternative summarization networks and improved training, especially more rounds of improve_approx (for a better approximation). We leave this to future work.

5 Conclusions

We present a generic architecture for training a differentiable approximator module which can be used in lieu of black-box simulator models without any need to reimplement the simulator. Our approach allows automatic discovery of summary statistics and crystallizes the choice of distance functions within the GAN framework. The goal of this paper is not to perform “better” than all existing ABC methods under all settings, but rather to provide an easy recipe for designing scalable likelihood-free inference models using popular deep learning tools.

References