1 Introduction
Approximate Bayesian Computation (ABC) is a likelihood-free inference method that infers model parameters by generating simulated data and accepting parameter proposals when the simulated data resemble the observed data (see Lintusaari et al., 2017, for a recent overview). In most high-dimensional settings, a summary statistic is chosen to represent the data so that distances can be computed between summary statistics. Rejection ABC (Pritchard et al., 1999)
often suffers from low acceptance rates, and several methods have been developed to address this problem. Broadly, these can be categorized as Markov chain Monte Carlo methods
(Marjoram et al., 2003; Beaumont et al., 2009; Meeds et al., 2015; Moreno et al., 2016), sequential Monte Carlo (Sisson et al., 2007), and, more recently, classifier-based approaches built on Bayesian optimization (BOLFI)
(Gutmann et al., 2016). In a series of recent papers, Gutmann et al. (2014, 2016, 2017) have suggested two novel ideas: treating the problem of discriminating distributions as a classification problem between observed and simulated data; and regressing the distance measure on the parameters using Gaussian processes in order to identify the regions of the parameter space with higher acceptance ratios.
This line of work is closely related to Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) with different flavors of GANs (Goodfellow et al., 2014; Nowozin et al., 2016; Arjovsky et al., 2017; Dziugaite et al., 2015) minimizing alternative divergences (or ratio losses) between the observed data and simulated data (see Mohamed and Lakshminarayanan, 2016, for an excellent exposition).
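For concreteness, the basic rejection-ABC loop can be sketched in a few lines of Python; the helper `rejection_abc` and the toy normal-mean example below are our own illustration, not a setup from this paper:

```python
import numpy as np

def rejection_abc(observed, simulator, prior_sampler, summary, n_draws=10_000, eps=0.5):
    """Basic rejection ABC: accept a proposed parameter when the simulated
    summary statistic falls within eps of the observed summary statistic."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler()            # propose a parameter from the prior
        x_sim = simulator(theta)           # run the (black-box) simulator
        if np.linalg.norm(summary(x_sim) - s_obs) < eps:
            accepted.append(theta)         # keep theta: simulated data resemble observed
    return np.array(accepted)

# Toy example: infer the mean of a unit-variance normal.
rng = np.random.default_rng(0)
true_theta = 2.0
observed = rng.normal(true_theta, 1.0, size=200)
posterior = rejection_abc(
    observed,
    simulator=lambda t: rng.normal(t, 1.0, size=200),
    prior_sampler=lambda: rng.uniform(-5, 5),
    summary=lambda x: np.array([x.mean()]),   # a one-dimensional summary statistic
    eps=0.1,
)
```

The accepted `theta` values form an approximate posterior sample; tightening `eps` sharpens the approximation at the cost of a lower acceptance rate, which is precisely the inefficiency motivating the methods above.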
In this paper, we develop the connection between GANs (Goodfellow, 2016) and Gutmann et al. (2017) and present a new differentiable architecture inspired by GANs for likelihood-free inference (Figure 1). In doing so, we make the following contributions:

We present a method for adapting a black-box simulator-based model with an “approximator”, a differentiable neural network module (similar to function approximation; see, e.g., Liang and Srikant, 2016, and references therein). 
Our method provides automatic generation of summary statistics as well as a choice of different distance functions (for comparing the summary statistics), with a clear relation to likelihood-ratio tests in GANs.

We adapt one of the key ideas in Gutmann et al. (2017), namely, gradient-descent-based search to quickly narrow down to the acceptance region of the parameter space, to the framework of GANs.

We perform experiments on real-world problems (beyond the Lotka-Volterra model) previously studied in the ABC literature, showing benefits as well as cases where the differentiable neural network architecture might not be the best solution (Section 4).
2 Related Work
Summary statistics and distance function
The choice of summary statistic is known to be critical to the performance of the ABC scheme (Blum et al., 2013; Prangle, 2015). Several methods have explored automatic generation of summary statistics, e.g., projection using regression (Fearnhead and Prangle, 2012), random forests (Marin et al., 2016), etc. More recently, Prangle et al. (2017) have explored alternative distance functions for comparison of summary statistics. Within the classification scheme of Prangle (2015) and Prangle et al. (2017), one of our contributions is automatic computation of nonlinear projection-based summary statistics and a moment-matching distance function (respectively, MMD (Dziugaite et al., 2015) or the Wasserstein distance (Arjovsky et al., 2017)).
LFI using neural networks
Several authors have recently proposed alternative approaches for likelihood-free inference (Meeds and Welling, 2015; Cranmer et al., 2015; Papamakarios and Murray, 2016; Tran et al., 2017). In particular, Papamakarios and Murray (2016) inverted the ABC problem by sampling the parameter from a mixture of Gaussians parametrized by a neural network model. More recently, Tran et al. (2017) presented an elegant variational approach for likelihood-free inference under the restrictions that (1) the conditional density is indirectly specified as a hierarchical implicit model (similar to the “approximator” in this work) which, given input noise, outputs a sample conditioned on a latent variable and the parameter; and (2) one can sample from the target distribution.
However, it is not clear how to extend these methods to the high-dimensional setting where the choice of summary statistics is crucial. Further, the mean-field assumption in Tran et al. (2017) is not valid for time-series models. In contrast to the above methods, this paper clearly demarcates the summarization from the approximation of the non-differentiable simulator. Additionally, not being constrained by the variational setup of Tran et al. (2017), one can use sophisticated approximator/summarizer pairs for “simulating” a black-box simulator.
Maximum Mean Discrepancy
The Maximum Mean Discrepancy (MMD) is an integral probability metric defined via a kernel and its associated Reproducing Kernel Hilbert Space (RKHS) (Muandet et al., 2017; Sriperumbudur et al., 2012; Gretton et al., 2007). Explicitly, the squared MMD with kernel $k$ between distributions $P$ and $Q$ is given by
$\mathrm{MMD}_k^2(P, Q) = \mathbb{E}[k(x, x')] - 2\,\mathbb{E}[k(x, y)] + \mathbb{E}[k(y, y')]$,
where $x, x'$ are independent copies from $P$ and $y, y'$ are independent copies from $Q$. For empirical samples $x_1, \ldots, x_m$ from $P$ and $y_1, \ldots, y_n$ from $Q$, an unbiased estimate of the squared MMD is
$\widehat{\mathrm{MMD}}_k^2 = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{mn} \sum_{i, j} k(x_i, y_j)$.
As shown in Dziugaite et al. (2015), this can be differentiated with respect to the parameters generating one of the distributions.
3 Model
Algorithm 1: ABC-GAN training loop
1: while not converged do
2:   for each sample in the minibatch do
3:     draw the generator and simulator noise
4:     compute the quantities in (1)-(7)
5:   end for
6:   {Optimization}
7:   improve_approx: update the approximator, summarizers, and decoder
8:   improve_accept: update the generator
9: end while
Consider a target distribution with unknown parameters that need to be estimated. We have access to a black-box simulator allowing us to generate samples from this distribution for any choice of parameter. Here, we have captured the underlying stochasticity of the simulator using suitable noise.
Figure 1 gives a high-level overview of the ABC-GAN architecture. The inputs to the network are the generator noise and the simulator noise, which are problem-specific. The network functions are given as
(observations)  (1)  
(generator)  (2)  
(approximator)  (3)  
(simulator)  (4)  
(summary-sim.)  (5)  
(summary-obs.)  (6)  
(decoder-approx.)  (7) 
The trainable parameters are those of the generator, approximator, summarizer, and decoder modules, and the optimization consists of two alternating stages, namely:

improve_approx: This phase trains the approximator and the summarizer against the black-box simulator using a ratio test between the simulated and approximated samples, together with a loss term that corresponds to a decoder for the approximator (as an encoding of the parameter).

improve_accept: This phase trains the generator in order to generate parameters whose simulated data are similar to the observed data, i.e., parameters with better acceptance rates.
The improve_approx optimization (without improve_accept) reduces to function approximation for the summary statistics of the black-box simulator, with the decoder loss ensuring that the outputs of the approximator and summarizer do not identically go to zero. A network as described above can be used in place of the simulator in a transparent fashion within an ABC scheme. However, this is extremely wasteful, and the improve_accept scheme incorporates gradient descent in parameter space to quickly reach the acceptance region, similar to BOLFI (Gutmann et al., 2017). Given a perfect approximator (or, directly, the black-box simulator), choosing an f-divergence (Nowozin et al., 2016; Mohamed and Lakshminarayanan, 2016) as the loss ensures that the generator unit attempts to produce parameters within the acceptance region.
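The alternating optimization can be written as a small training skeleton; the names `train_abcgan` and `k_approx` below are our own, and the actual gradient updates are abstracted into callables:

```python
def train_abcgan(n_iters, improve_approx_step, improve_accept_step, k_approx=5):
    """Alternating optimization: several improve_approx updates (so the
    approximator/summarizer keeps tracking the simulator) per improve_accept
    update (which moves the generator toward the acceptance region)."""
    for _ in range(n_iters):
        for _ in range(k_approx):
            improve_approx_step()   # train approximator, summarizers, decoder
        improve_accept_step()       # train generator

# The callables stand in for one gradient step each; here we just count calls.
calls = {"approx": 0, "accept": 0}
train_abcgan(
    n_iters=3,
    improve_approx_step=lambda: calls.update(approx=calls["approx"] + 1),
    improve_accept_step=lambda: calls.update(accept=calls["accept"] + 1),
)
```

Taking `k_approx > 1` mirrors the suggestion in the Discussion to run several rounds of improve_approx for each round of improve_accept.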
Algorithm 1 shows the specific loss functions and optimization procedure used in this paper. In our experiments, we observed that training with the MMD loss in lieu of the discriminator unit (Dziugaite et al., 2015) yielded the best results (Section 4).
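For reference, the unbiased squared-MMD estimate can be computed in a few lines of NumPy; the RBF kernel and its bandwidth are illustrative choices here, and in training one would use a differentiable implementation of the same expression as the loss:

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian RBF kernel matrix k(a_i, b_j)."""
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * bandwidth**2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of the squared MMD between samples x ~ P and y ~ Q:
    off-diagonal within-sample kernel means minus twice the cross-sample mean."""
    m, n = len(x), len(y)
    kxx = rbf_kernel(x, x, bandwidth)
    kyy = rbf_kernel(y, y, bandwidth)
    kxy = rbf_kernel(x, y, bandwidth)
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * kxy.mean()

# Samples from the same distribution give an estimate near zero;
# a clear mean shift gives a clearly positive value.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(500, 2))
y = rng.normal(0.0, 1.0, size=(500, 2))
z = rng.normal(3.0, 1.0, size=(500, 2))
```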
Discussion
We emphasize that our implementation (Algorithm 1) is just one possible solution in the ABC-GAN architecture landscape. For a new problem of interest, we highlight some of these choices below:

Pre-training (improve_approx): For a specific region of parameter space (e.g., based on domain knowledge), one can pre-train and fine-tune the approximator and summarizer networks by generating more samples from parameters within the restricted parameter space.

Automatic summarization: Further, in domains where the summary statistic is not obvious or is rather ad hoc (Section 4.2), the improve_approx optimization provides a natural method for identifying good summary statistics.

Loss function/Discriminator: Prangle et al. (2017) discuss why standard distance functions like the Euclidean distance might not be suitable in many cases. In our architecture, the loss function can be selected based on the insights gained in the GAN community (Li et al., 2015; Bellemare et al., 2017; Arora et al., 2017; Sutherland et al., 2016).

Training: GANs are known to be hard to train. In this work, improve_approx is more crucial (since the approximator/summarizer pair tracks the simulator) and in general should be prioritized over improve_accept (which chooses the next parameter setting to explore). We suggest using several rounds of improve_approx for each round of improve_accept in case of convergence problems.

Mode collapse: We did not encounter mode collapse for a simple experiment on a univariate normal (Supplementary Material). However, it is a known problem and an active area of research (Arjovsky et al., 2017; Arora et al., 2017), and choosing the Wasserstein distance (Arjovsky et al., 2017) has yielded promising results in other problems.

Module internals: Our architecture is not constrained by independence between samples (vis-à-vis Tran et al., 2017). For example, in time-series problems, it makes sense to have deep recurrent architectures (e.g., LSTMs, GRUs) that are known to capture long-range dependencies (e.g., Section 4.2). The design of the network structure is an active area of deep learning research which can be leveraged alongside domain knowledge.
4 Experiments
All experiments are performed using TensorFlow r1.3 on a MacBook Pro 2015 laptop with a Core i5 and 16 GB RAM, without GPU support. The code for the experiments will be made available on GitHub. In addition to the experiments reported below, a section in the supplementary material evaluates our ABC-GAN architecture on two small synthetic models, namely, a univariate normal and a mixture of normal distributions.
4.1 Generalized linear model
Kousathanas et al. (2016) note that the basic ABC algorithm and sequential Monte Carlo methods (Sisson et al., 2007; Beaumont et al., 2009) are useful for low-dimensional models with only a few parameters. To tackle higher dimensions, Kousathanas et al. (2016) consider scenarios where it is possible to define sufficient statistics for subsets of the parameters, allowing parameter-specific ABC computation. This includes the exponential family of probability distributions.
In this example, we consider the generalized linear model defined by Kousathanas et al. (2016, Toy model 2), in which the observations are obtained by applying a design matrix to the unknown parameter and adding multivariate normal noise. We use a uniform prior as in the original work. However, we note that Kousathanas et al. (2016) “start the MCMC chains at a normal deviate [...], i.e., around the true values” of the parameter. A fixed true parameter value is chosen for the experiment.
We perform parameter inference for a high-dimensional Gaussian in the above setting. Figure 2 shows the mean of the posterior samples within each minibatch as the algorithm progresses (for clarity, a larger version of this plot is presented in Figure 1 of the supplementary material). The algorithm is run with a fixed number of iterations, minibatch size, and learning rate, and its running time is measured in seconds. We reiterate that ABC-GAN does not use model-specific information such as the knowledge of sufficient statistics for subsets of parameters, and thus is more widely applicable than the approach of Kousathanas et al. (2016). Concurrently, it also enables computationally efficient inference in high-dimensional models, a challenge for sequential Monte Carlo based methods (Sisson et al., 2007; Beaumont et al., 2009) and BOLFI (Gutmann et al., 2017). In this high-dimensional setting, classic ABC methods or BOLFI cannot be easily applied, in contrast to this work.
4.2 Ricker’s model
Figure 3: Network structure for the Ricker model and the convolutional summarizer (DCC). The convolution filter size and stride, the max-pooling layer size, and the size of the summary representation output by the final dense layer for each input sample are shown in the figure.

The stochastic Ricker model (Ricker, 1954) is an ecological model described by the nonlinear autoregressive equation $N_{t+1} = r N_t \exp(-N_t + e_t)$, where $N_t$ is the animal population at time $t$ and $e_t \sim \mathcal{N}(0, \sigma^2)$. The observation $y_t$ is given by the distribution $\mathrm{Pois}(\phi N_t)$, and the model parameters are $(\log r, \sigma, \phi)$. The latent time series makes the inference of the parameters difficult.
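A simulator for this model takes only a few lines; the parameter values below are illustrative only, not necessarily those used in the experiments:

```python
import numpy as np

def simulate_ricker(log_r, sigma, phi, T=50, n0=1.0, rng=None):
    """Stochastic Ricker model: N_{t+1} = r * N_t * exp(-N_t + e_t) with
    e_t ~ N(0, sigma^2), observed through Poisson counts y_t ~ Pois(phi * N_t)."""
    rng = rng if rng is not None else np.random.default_rng()
    r = np.exp(log_r)
    n = n0
    y = np.empty(T, dtype=np.int64)
    for t in range(T):
        n = r * n * np.exp(-n + sigma * rng.standard_normal())  # latent population
        y[t] = rng.poisson(phi * n)                             # observed count
    return y

# One observed trajectory from illustrative parameter values.
y = simulate_ricker(log_r=3.8, sigma=0.3, phi=10.0, T=50, rng=np.random.default_rng(0))
```

Only `y` is observed; the latent series of populations is discarded, which is exactly what makes inference of the parameters difficult.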
Wood (2010) computed a synthetic log-likelihood by defining several summary statistics: the mean, the number of zeros in the observed series, autocovariances up to a fixed lag, and coefficients from auxiliary regressions on the series. They fit a multivariate Gaussian to the summary statistics of the simulated samples, and the synthetic log-likelihood is then given by the probability of the observed data under this Gaussian model. Wood (2010) inferred the unknown parameters using a standard Markov chain Monte Carlo (MCMC) method based on this synthetic log-likelihood. Gutmann et al. (2016) used the same synthetic log-likelihood but reduced the number of required samples for parameter inference by regressing the discrepancy between simulated and observed samples (in terms of summary statistics) on the parameters using Gaussian processes.
Figure 3 shows the network structure for the Ricker model (see the supplementary material for a detailed description of the ABC-GAN architecture for the Ricker model). We follow the experimental setup of Wood (2010): we simulate observations from the Ricker model with fixed true parameters and place prior distributions on the three parameters.
Figure 4 shows the output of the generator (posterior samples for the parameters) over the iterations of the algorithm, with the optimization done using RMSProp. Figure 5 shows the histogram of the posterior samples for the last iterations of the algorithm. We note that the algorithm converges and is quite close to the true parameters.
We obtain posterior means close to the true parameters (averaged over independent runs), with a typical running time measured in seconds. For comparison, BOLFI's parameter estimates are reported in Gutmann et al. (2016, Figure 9), and a typical run of BOLFI takes substantially longer for the given setup. (The code provided by Dr. Michael Gutmann uses the GNU R and C code of Wood for the synthetic log-likelihood and is considerably faster than the Python version available in ELFI.)
We re-emphasize that, compared to the highly engineered features of Wood (2010), our summary representation is learnt using a standard LSTM-based neural network (Hochreiter and Schmidhuber, 1997; Graves, 2012). Thus, our approach allows for easier extensions to other problems compared to manual feature engineering.
For complex simulators with latent variables (in a low-dimensional setting), with fewer observations and limited simulations, BOLFI performs best, albeit at a significantly higher computational cost. Further, the approach of Tran et al. (2017) is not suitable, as its mean-field assumption is violated in this time-series model.
4.3 Infection in Daycare center
We study the transmission of strains of Streptococcus pneumoniae among children attending day care centers (DCCs) in Oslo, Norway. The initial data was published by Vestrheim et al. (2008) and further described in a follow-up study (Vestrheim et al., 2010). Numminen et al. (2013) first presented an ABC-based approach for inferring the parameters associated with the rates of infection from an outside source, infection from within the DCC, and infection by multiple strains. Gutmann et al. (2017) presented a classification-based approach where they used classification as a surrogate for the rejection test of the standard ABC method. Section C in the supplementary material presents a discussion of the hand-engineered features of Numminen et al. (2013) and Gutmann et al. (2017).
In the remainder of this section, we follow the nomenclature in Gutmann et al. (2017). For a single DCC, the observed data consist of the presence or absence of each particular strain of the disease at the time when the swabs were taken. On average, a number of individuals attend a DCC, of which only some are sampled, and there is a fixed total number of strains of the bacteria. The data from each of the DCCs thus consist of a binary matrix whose entries indicate whether a sampled attendee carries a given strain at the sampling time, and the observed data consist of the set of such binary matrices across DCCs. For the simulator, we use the code provided by Michael Gutmann, which in turn uses the code of Elina Numminen (https://www.cs.helsinki.fi/u/gutmann/code/BOLFI/).
We place prior distributions on the three parameters.
We first use the non-random features defined by Gutmann et al. (2017) as our summary statistics. Figure 6 shows the histogram of the posterior samples generated by ABC-GAN using these features. We note that the results are not encouraging in this case, though this is an artifact of our algorithm. In order to address this, we define a new summary representation using a convolutional network. Figure 3 shows the structure of the convolutional summarizer, which is used to generate summary representations instead of the non-random features of Gutmann et al. (2017). The generator and approximator modules are standard one-layer feed-forward networks which are fully described in Section C of the supplementary material.
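To make the data flow of such a summarizer concrete, the sketch below runs a binary matrix through a convolution, ReLU, max-pooling, and per-filter averaging; the matrix dimensions, filter count, and filter size are hypothetical, and unlike the actual module the filters here are random rather than trained:

```python
import numpy as np

def conv_summary(x, filters, pool=2):
    """Convolution -> ReLU -> max-pooling -> per-filter global mean,
    mapping a binary matrix to a fixed-size summary vector."""
    n_rows, n_cols = x.shape
    feats = []
    for w in filters:
        fh, fw = w.shape
        out = np.array([[np.sum(x[i:i + fh, j:j + fw] * w)       # valid convolution
                         for j in range(n_cols - fw + 1)]
                        for i in range(n_rows - fh + 1)])
        out = np.maximum(out, 0.0)                               # ReLU
        ph, pw = out.shape[0] // pool, out.shape[1] // pool
        pooled = out[:ph * pool, :pw * pool].reshape(ph, pool, pw, pool).max(axis=(1, 3))
        feats.append(pooled.mean())                              # one feature per filter
    return np.array(feats)

# Hypothetical attendee-by-strain binary matrix and 8 random 3x3 filters.
rng = np.random.default_rng(0)
x = (rng.random((40, 30)) < 0.1).astype(float)
filters = rng.standard_normal((8, 3, 3))
s = conv_summary(x, filters)   # 8-dimensional summary representation
```

In the architecture described above, the filter weights would be learned jointly with the rest of the network during improve_approx rather than fixed as here.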
Figure 7 shows the results for the convolutional summarizer. We note that the posterior samples improve, especially for some of the parameters. However, there is considerable room for improvement by using alternative summarization networks and improved training, especially using more rounds of improve_approx (for a better approximation). We leave this to future work.
5 Conclusions
We present a generic architecture for training a differentiable approximator module which can be used in lieu of black-box simulator models without any need to reimplement the simulator. Our approach allows automatic discovery of summary statistics and crystallizes the choice of distance functions within the GAN framework. The goal of this paper is not to perform “better” than all existing ABC methods under all settings, but rather to provide an easy recipe for designing scalable likelihood-free inference models using popular deep learning tools.
References

 Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
 Arora et al. [2017] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). CoRR, abs/1703.00573, 2017. URL http://arxiv.org/abs/1703.00573.
 Beaumont et al. [2009] Mark A Beaumont, Jean-Marie Cornuet, Jean-Michel Marin, and Christian P Robert. Adaptive approximate Bayesian computation. Biometrika, 96(4):983–990, 2009.
 Bellemare et al. [2017] Marc G. Bellemare, Ivo Danihelka, Will Dabney, Shakir Mohamed, Balaji Lakshminarayanan, Stephan Hoyer, and Rémi Munos. The cramer distance as a solution to biased wasserstein gradients. CoRR, abs/1705.10743, 2017. URL http://arxiv.org/abs/1705.10743.
 Blum et al. [2013] Michael GB Blum, Maria Antonieta Nunes, Dennis Prangle, Scott A Sisson, et al. A comparative review of dimension reduction methods in approximate Bayesian computation. Statistical Science, 28(2):189–208, 2013.
 Cranmer et al. [2015] Kyle Cranmer, Juan Pavez, and Gilles Louppe. Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169, 2015.

 Dziugaite et al. [2015] Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 258–267. AUAI Press, 2015.
 Fearnhead and Prangle [2012] Paul Fearnhead and Dennis Prangle. Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):419–474, 2012.
 Goodfellow [2016] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
 Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

 Graves [2012] Alex Graves. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385. Springer, 2012.
 Gretton et al. [2007] Arthur Gretton, Karsten M Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems, pages 513–520, 2007.
 Gutmann et al. [2014] Michael U Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander. Statistical inference of intractable generative models via classification. arXiv preprint arXiv:1407.4981, 2014.
 Gutmann et al. [2016] Michael U Gutmann, Jukka Corander, et al. Bayesian optimization for likelihood-free inference of simulator-based statistical models. Journal of Machine Learning Research, 2016.
 Gutmann et al. [2017] M.U. Gutmann, R. Dutta, S. Kaski, and J. Corander. Likelihood-free inference via classification. Statistics and Computing, in press, March 2017. ISSN 1573-1375. doi: 10.1007/s11222-017-9738-6. URL https://doi.org/10.1007/s11222-017-9738-6.
 Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Kousathanas et al. [2016] Athanasios Kousathanas, Christoph Leuenberger, Jonas Helfer, Mathieu Quinodoz, Matthieu Foll, and Daniel Wegmann. Likelihood-free inference in high-dimensional models. Genetics, 203(2):893–904, 2016.
 Li et al. [2015] Yujia Li, Kevin Swersky, and Richard S. Zemel. Generative moment matching networks. CoRR, abs/1502.02761, 2015. URL http://arxiv.org/abs/1502.02761.
 Liang and Srikant [2016] Shiyu Liang and R Srikant. Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161, 2016.
 Lintusaari et al. [2017] Jarno Lintusaari, Michael U Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander. Fundamentals and recent developments in approximate Bayesian computation. Systematic Biology, 66(1):e66–e82, 2017.
 Marin et al. [2016] Jean-Michel Marin, Louis Raynal, Pierre Pudlo, Mathieu Ribatet, and Christian P Robert. ABC random forests for Bayesian parameter inference. arXiv preprint arXiv:1605.05537, 2016.
 Marjoram et al. [2003] Paul Marjoram, John Molitor, Vincent Plagnol, and Simon Tavaré. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 100(26):15324–15328, 2003.
 Meeds and Welling [2015] Edward Meeds and Max Welling. Optimization Monte Carlo: Efficient and embarrassingly parallel likelihood-free inference. CoRR, abs/1506.03693, 2015. URL http://arxiv.org/abs/1506.03693.
 Meeds et al. [2015] Edward Meeds, Robert Leenders, and Max Welling. Hamiltonian ABC. arXiv preprint arXiv:1503.01916, 2015.
 Mohamed and Lakshminarayanan [2016] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
 Moreno et al. [2016] Alexander Moreno, Tameem Adel, Edward Meeds, James M Rehg, and Max Welling. Automatic variational abc. arXiv preprint arXiv:1606.08549, 2016.
 Muandet et al. [2017] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, Bernhard Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(12):1–141, 2017.
 Nowozin et al. [2016] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
 Numminen et al. [2013] Elina Numminen, Lu Cheng, Mats Gyllenberg, and Jukka Corander. Estimating the transmission dynamics of Streptococcus pneumoniae from strain prevalence data. Biometrics, 69(3):748–757, 2013.
 Papamakarios and Murray [2016] George Papamakarios and Iain Murray. Fast ε-free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems, pages 1028–1036, 2016.
 Prangle [2015] Dennis Prangle. Summary statistics in approximate Bayesian computation. arXiv preprint arXiv:1512.05633, 2015.
 Prangle et al. [2017] Dennis Prangle et al. Adapting the ABC distance function. Bayesian Analysis, 12(1):289–309, 2017.
 Pritchard et al. [1999] Jonathan K Pritchard, Mark T Seielstad, Anna Perez-Lezaun, and Marcus W Feldman. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution, 16(12):1791–1798, 1999.
 Ricker [1954] William E Ricker. Stock and recruitment. Journal of the Fisheries Board of Canada, 11(5):559–623, 1954.
 Sisson et al. [2007] Scott A Sisson, Yanan Fan, and Mark M Tanaka. Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 104(6):1760–1765, 2007.
 Sriperumbudur et al. [2012] Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Gert RG Lanckriet, et al. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
 Sutherland et al. [2016] Dougal J Sutherland, HsiaoYu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean discrepancy. arXiv preprint arXiv:1611.04488, 2016.
 Tran et al. [2017] Dustin Tran, Rajesh Ranganath, and David Blei. Hierarchical implicit models and likelihoodfree variational inference. In Advances in Neural Information Processing Systems, pages 5529–5539, 2017.
 Vestrheim et al. [2008] Didrik F Vestrheim, E Arne Høiby, Ingeborg S Aaberge, and Dominique A Caugant. Phenotypic and genotypic characterization of Streptococcus pneumoniae strains colonizing children attending day-care centers in Norway. Journal of Clinical Microbiology, 46(8):2508–2518, 2008.
 Vestrheim et al. [2010] Didrik F Vestrheim, E Arne Høiby, Ingeborg S Aaberge, and Dominique A Caugant. Impact of a pneumococcal conjugate vaccination program on carriage among children in norway. Clinical and Vaccine Immunology, 17(3):325–334, 2010.
 Wood [2010] Simon N Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102, 2010.