# Partially Exchangeable Networks and Architectures for Learning Summary Statistics in Approximate Bayesian Computation

We present a novel family of deep neural architectures, named partially exchangeable networks (PENs) that leverage probabilistic symmetries. By design, PENs are invariant to block-switch transformations, which characterize the partial exchangeability properties of conditionally Markovian processes. Moreover, we show that any block-switch invariant function has a PEN-like representation. The DeepSets architecture is a special case of PEN and we can therefore also target fully exchangeable data. We employ PENs to learn summary statistics in approximate Bayesian computation (ABC). When comparing PENs to previous deep learning methods for learning summary statistics, our results are highly competitive, both considering time series and static models. Indeed, PENs provide more reliable posterior samples even when using less training data.

## Authors

• 4 publications
• 11 publications
• 6 publications
• 14 publications
• ### Learning Summary Statistic for Approximate Bayesian Computation via Deep Neural Network

Approximate Bayesian Computation (ABC) methods are used to approximate p...
10/08/2015 ∙ by Bai Jiang, et al. ∙ 0

• ### Dynamic Likelihood-free Inference via Ratio Estimation (DIRE)

Parametric statistical models that are implicitly defined in terms of a ...
10/23/2018 ∙ by Traiko Dinev, et al. ∙ 0

• ### Convolutional Neural Networks as Summary Statistics for Approximate Bayesian Computation

Approximate Bayesian Computation is widely used in systems biology for i...
01/31/2020 ∙ by Mattias Åkesson, et al. ∙ 0

• ### Neural Approximate Sufficient Statistics for Implicit Models

We consider the fundamental problem of how to automatically construct su...
10/20/2020 ∙ by Yanzhi Chen, et al. ∙ 0

• ### Distance-learning For Approximate Bayesian Computation To Model a Volcanic Eruption

Approximate Bayesian computation (ABC) provides us with a way to infer p...
09/28/2019 ∙ by Lorenzo Pacchiardi, et al. ∙ 0

• ### Multi-Statistic Approximate Bayesian Computation with Multi-Armed Bandits

Approximate Bayesian computation is an established and popular method fo...
05/22/2018 ∙ by Prashant Singh, et al. ∙ 0

• ### A Likelihood-Free Inference Framework for Population Genetic Data using Exchangeable Neural Networks

Inference for population genetics models is hindered by computationally ...
02/16/2018 ∙ by Jeffrey Chan, et al. ∙ 0

## Code Repositories

### PENs-and-ABC

Code for the paper "Partially Exchangeable Networks and Architectures for Learning Summary Statistics in Approximate Bayesian Computation"

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We propose a novel neural network architecture to ease the application of approximate Bayesian computation (ABC), a.k.a.

likelihood-free inference. The architecture, called partially exchangeable network (PEN), uses partial exchangeability in Markovian data, allowing us to perform ABC inference for time series models with Markovian structure. Since the DeepSets architecture (Zaheer et al., 2017) turns out to be a special case of PEN, we can also perform ABC inference for static models. Our work is about automatically construct summary statistics of the data that are informative for model parameters. This is a main challenge in the practical application of ABC algorithms, since such summaries are often handpicked (i.e. ad-hoc summaries are constructed from model domain expertise), or these are automatically constructed using a number of approaches as detailed in Section 2. Neural networks have been previously used to automatically construct summary statistics for ABC. Jiang et al. (2017) and Creel (2017)

employ standard multilayer perceptron (MLP) networks for learning the summary statistics.

Chan et al. (2018) introduce a network that exploits the exchangeability property in exchangeable data. Our PEN architecture is a new addition to the tools for automatic construction of summary statistics, and PEN produces competitive inference results compared to Jiang et al. (2017), which in turn was shown outperforming the semi-automatic regression method by Fearnhead & Prangle (2012). Moreover, our PEN architecture is more data efficient and when reducing the training data PEN outperforms Jiang et al. (2017), the factor of reduction being of order to depending on cases.

Our main contributions are:

• Introducing the partially exchangeable networks (PENs) architecture;

• Using PENs to automatically learn summary statistics for ABC inference. We consider both static and dynamic models. In particular, our network architecture is specifically designed to learn summary statistics for dynamic models.

## 2 Approximate Bayesian computation

Approximate Bayesian computation (ABC) is an increasingly popular inference method for model parameters , in that it only requires the ability to produce artificial data from a stochastic model simulator (Beaumont et al., 2002; Marin et al., 2012). A simulator is essentially a computer program, which takes

, makes internal calls to a random number generator, and outputs a vector of artificial data. The implication is that ABC can be used to produce approximate inference when the likelihood function

underlying the simulator is intractable. As such ABC methods have been applied to a wide range of disciplines (Sisson et al., 2018). The fundamental idea in ABC is to generate parameter proposals and accept a proposal if the simulated data for that proposal is similar to observed data

. Typically this approach is not suitable for high-dimensional data, and a set of summary statistics of the data is therefore commonly introduced to break the

curse-of-dimensionality. So, instead of comparing to , we compare summary statistics of the simulated data to those of observed data . Then we accept the proposed if is close to in some metric. Using this scheme, ABC will simulate draws from the following approximate posterior of

 pϵABC(θ|sobs)∝∫Kϵ(Δ(s⋆,sobs))p(s⋆|θ)p(θ)ds⋆,

where is the prior of , is a distance function between observed and simulated summaries (we use a Mahalanobis distance, see the supplementary material in Appendix A), is a kernel, which in all our applications is the uniform kernel returning 1 if and 0 otherwise, and is the so-called ABC-threshold. A smaller produces more accurate approximations to the true summaries posterior , though this implies a larger computational effort due to the increasing number of rejected proposals. An additional issue is that ideally we would like to target , not , but again unless sufficient statistics are available (impossible outside the exponential family), and since , we have to be content with samples from .

In this work we do not focus on how to sample from (see Sisson et al., 2018 for possibilities). Therefore, we employ the simplest (and also most inefficient) ABC algorithm, the so called “ABC rejection sampling” (Pritchard et al., 1999). We will use the “reference table” version of ABC rejection sampling (e.g. Cornuet et al., 2008), which is as follows:

• Generate independent proposals , and corresponding data from the simulator;

• Compute the summary statistics for each ;

• Compute the distances for each .

• Retain proposals corresponding to those that are smaller than the -th percentile of all distances.

The retained ’s form a sample from with given by the selected th percentile. An advantage of this approach is that it allows to easily compare the quality of the ABC inference based on several methods for computing the summaries, under the same computational budget . Moreover, once the “reference table” has been produced in the first step, we can recycle these simulations to produce new posterior samples using several methods for computing the summary statistics.

### 2.1 Learning summary statistics

Event though ABC rejection sampling is highly inefficient due to proposing parameters from the prior , this is not a concern for the purpose of our work. In fact, our main focus is learning the summary statistics . This is perhaps the most serious difficulty affecting the application of ABC methodology to practical problems. In fact, we require summaries that are informative for , as a replacement for the (unattainable) sufficient statistics. A considerable amount of research has been conducted on how to construct informative summary statistics (see Blum et al., 2013 and Prangle, 2015 for an overview). However their selection is still challenging since no state-of-the-art methodology exists that can be applied to arbitrarily complex problems. Fearnhead & Prangle (2012) consider a regression-based approach where they also show that the best summary statistic, in terms of the minimal quadratic loss, is the posterior mean. The latter is however unknown since

itself is unknown. Therefore, they introduce a simulation approach based on a linear regression model

 θij=E(θj|yi)+ξij=b0j+bjh(yi)+ξij (1)

with some mean-zero noise. Here and

is a vector of (non)-linear transformations of “data”

(here can be simulated or observed data). Therefore Fearnhead & Prangle (2012) have models to fit separately, one for each component of vector . Of course, these fittings are to be performed before ABC rejection is executed, so this is a step that anticipates ABC rejection, to provide the latter with suitable summary statistics. The parameters in each regression (1

) are estimated by fitting the model by least squares to a new set of

simulated data-parameter pairs where, same as for ABC rejection, the are generated from and the are generated from the model simulator conditionally on . To clarify the notation: is the number of data-parameter pairs used to fit the linear regression model in (1), while is the number of parameter-data pair proposals used in ABC rejection sampling. However the two sets of parameter-data pairs and are different since these serve two separate purposes. They are generated in the same way but independently of each other. After fitting (1), estimates are returned and is taken as th summary statistic, . We can then take as th component of , and similarly take . The number of summaries is therefore equal to the size of .

This approach is further developed in Jiang et al. (2017) where a MLP deep neural network regression model is employed, and replaces the linear regression model in (1). Hence, Jiang et al. (2017) has the following regression model

 θi=E(θ|yi)+ξi=fβ(yi)+ξi

where is the MLP parametrized by the weights . Jiang et al. (2017) estimate from

 minβ1NN∑i=1∥fβ(yi)−θi∥22, (2)

where are the parameter-data pairs that the network is fitted to.

The deep neuronal network with multiple hidden layers considered in

Jiang et al. (2017) offers stronger representational power to approximate (and hence learn an informative summary statistic), compared to using linear regression, if the posterior mean is a highly non-linear function of . Moreover, experiments in Jiang et al. (2017) show that indeed their MLP outperforms the linear regression approach in Fearnhead & Prangle (2012) (at least for their considered experiments), although at the price of a much larger computational effort. For this reason in our experiments we compare ABC coupled with PENs with the ABC MLP from Jiang et al. (2017).

In Creel (2017) a deep neural network regression model is used. He also introduces a pre-processing step such that instead of feeding the network with the data set , the network is fed with a set of statistics of the data . This means that, unlike in Jiang et al. (2017), in Creel (2017) the statistician must already know “some kind” of initial summary statistics, used as input, and then the network returns another set of summary statistics as output, and the latter are used for ABC inference. Our PENs do not require any initial specification of summary statistics.

## 3 Partially exchangeable networks

Even though the likelihood function is intractable in the likelihood-free setting, we may still have insights into properties of the data generating process. To that end, given our data set with units, we will exploit some of the invariance properties of its prior predictive distribution . As discussed in Section 2, the regression approach to ABC (Fearnhead & Prangle, 2012) involves to learn the regression function , where is the posterior mean. Our goal in this section is to leverage the invariances of the Bayesian model to design deep neural architectures that are fit for this purpose.

### 3.1 Exchangeability and partial exchangeability

The simplest form of model invariance is exchangeability. A model is said to be exchangeable if, for all permutations in the symmetric group , . For example, if the observations are independent and identically distributed (i.i.d.) given the parameter, then is exchangeable. A famous theorem of de Finetti (1929), which was subsequently generalized in various ways (see e.g. the review of Diaconis, 1988), remarkably shows that such conditionally i.i.d. models are essentially the only exchangeable models.

If the model is exchangeable, it is clear that the function will be permutation invariant. It is therefore desirable that a neural network used to approximate this function should also be permutation invariant. The design of neural architectures that guarantee permutation invariance have been the subject of numerous works, dating at least back to Minsky & Papert (1988) and Shawe-Taylor (1989). A renewed interest in such architectures came about recently, notably through the works of Ravanbakhsh et al. (2017), Zaheer et al. (2017), and Murphy et al. (2019)—a detailed and up-to-date overview of this rich line of work can be found in Bloem-Reddy & Teh (2019). Most relevant to our work is the DeepSets architecture of Zaheer et al. (2017) that we generalize to partial exchangeability, and the approach of Chan et al. (2018), who used permutation invariant networks for ABC.

However, the models considered in ABC are arising from intractable-likelihoods scenarios, which certainly are not limited to exchangeable data, quite the opposite, e.g. stochastic differential equations (Picchini, 2014), state-space models and beyond (Jasra, 2015). To tackle this limitation, we ask: could we use a weaker notion of invariance to propose deep architectures suitable for such models?

In this paper, we answer this question for a specific class of non-i.i.d. models: Markov chains. To this end, we make use of the notion of

partial exchangeability studied by Diaconis & Freedman (1980). This property can be seen as a weakened version of exchangeability where is only invariant to a subset of the symmetric group called block-switch transformations. Informally, for , a -block-switch transformation interchanges two given disjoint blocks of when these two blocks start with the same symbols and end with the same symbols.

###### Definition 1 (Block-switch transformation).

For increasing indices such that and , the -block-switch transformation is defined as follows: if and then

 y =y1:i−1to38.27pt\vboxto20.75pt\pgfpicture\makeatletterto0.0pt\pgfsys@beginscope\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@color@rgb@fill000\pgfsys@setlinewidth0.4pt\nullfontto0.0pt\pgfsys@beginscope\hbox{{\pgfsys@beginscope{}\definecolor{pgffillcolor}{rgb}{1,0.8,0.8% }\pgfsys@color@rgb@fill{1}{0.8}{0.8}{}{{}{}{{ {}{}}}{ {}{}} {{}{{}}}{{}{}}{}{{}{}} {\pgfsys@beginscope{}\definecolor{pgffillcolor}{rgb}{1,0.8,0.8}% \pgfsys@color@rgb@fill{1}{0.8}{0.8}{}\pgfsys@rect{-17.132789pt}{-6.412953pt}{3% 4.265579pt}{16.745846pt}\pgfsys@fill\pgfsys@invoke{ } {}{}\pgfsys@endscope}{{{{}}\pgfsys@beginscope{}\pgfsys@transformcm{1.0}{0.0}{0% .0}{1.0}{-13.799789pt}{0.0pt}{}\hbox{{\definecolor{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@rgb@stroke{0}{0}{0}{}\pgfsys@color@rgb@fill{0}{0}{0}{}\hbox{{yi:j}} }}{}{}\pgfsys@endscope}}} {}{}{}{}\pgfsys@endscope}}\pgfsys@endscopeto0.0pt\hss\pgfsys@discardpath\pgfsys@endscope\hss\endpgfpicturey(j+1):(k−1)to38.27pt\vboxto20.75pt\pgfpicture\makeatletterto0.0pt\pgfsys@beginscope\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@color@rgb@fill000\pgfsys@setlinewidth0.4pt\nullfontto0.0pt\pgfsys@beginscope\hbox{{\pgfsys@beginscope{}\definecolor{pgffillcolor}{rgb}{0.8,0.8,1% }\pgfsys@color@rgb@fill{0.8}{0.8}{1}{}{{}{}{{ {}{}}}{ {}{}} {{}{{}}}{{}{}}{}{{}{}} {\pgfsys@beginscope{}\definecolor{pgffillcolor}{rgb}{0.8,0.8,1}% \pgfsys@color@rgb@fill{0.8}{0.8}{1}{}\pgfsys@rect{-17.132789pt}{-6.412953pt}{3% 4.265579pt}{16.745846pt}\pgfsys@fill\pgfsys@invoke{ } {}{}\pgfsys@endscope}{{{{}}\pgfsys@beginscope{}\pgfsys@transformcm{1.0}{0.0}{0% .0}{1.0}{-13.799789pt}{0.0pt}{}\hbox{{\definecolor{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@rgb@stroke{0}{0}{0}{}\pgfsys@color@rgb@fill{0}{0}{0}{}\hbox{{yk:l}} }}{}{}\pgfsys@endscope}}} {}{}{}{}\pgfsys@endscope}}\pgfsys@endscopeto0.0pt\hss\pgfsys@discardpath\pgfsys@endscope\hss\endpgfpicturey(l+1):M (3) T(d)b(y) =y1:i−1to38.27pt\vboxto20.75pt\pgfpicture\makeatletterto0.0pt\pgfsys@beginscope\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@color@rgb@fill000\pgfsys@setlinewidth0.4pt\nullfontto0.0pt\pgfsys@beginscope\hbox{{\pgfsys@beginscope{}\definecolor{pgffillcolor}{rgb}{0.8,0.8,1% }\pgfsys@color@rgb@fill{0.8}{0.8}{1}{}{{}{}{{ {}{}}}{ {}{}} {{}{{}}}{{}{}}{}{{}{}} {\pgfsys@beginscope{}\definecolor{pgffillcolor}{rgb}{0.8,0.8,1}% \pgfsys@color@rgb@fill{0.8}{0.8}{1}{}\pgfsys@rect{-17.132789pt}{-6.412953pt}{3% 4.265579pt}{16.745846pt}\pgfsys@fill\pgfsys@invoke{ } {}{}\pgfsys@endscope}{{{{}}\pgfsys@beginscope{}\pgfsys@transformcm{1.0}{0.0}{0% .0}{1.0}{-13.799789pt}{0.0pt}{}\hbox{{\definecolor{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@rgb@stroke{0}{0}{0}{}\pgfsys@color@rgb@fill{0}{0}{0}{}\hbox{{yk:l}} }}{}{}\pgfsys@endscope}}} {}{}{}{}\pgfsys@endscope}}\pgfsys@endscopeto0.0pt\hss\pgfsys@discardpath\pgfsys@endscope\hss\endpgfpicturey(j+1):(k−1)to38.27pt\vboxto20.75pt\pgfpicture\makeatletterto0.0pt\pgfsys@beginscope\definecolorpgfstrokecolorrgb0,0,0\pgfsys@color@rgb@stroke000\pgfsys@color@rgb@fill000\pgfsys@setlinewidth0.4pt\nullfontto0.0pt\pgfsys@beginscope\hbox{{\pgfsys@beginscope{}\definecolor{pgffillcolor}{rgb}{1,0.8,0.8% }\pgfsys@color@rgb@fill{1}{0.8}{0.8}{}{{}{}{{ {}{}}}{ {}{}} {{}{{}}}{{}{}}{}{{}{}} {\pgfsys@beginscope{}\definecolor{pgffillcolor}{rgb}{1,0.8,0.8}% \pgfsys@color@rgb@fill{1}{0.8}{0.8}{}\pgfsys@rect{-17.132789pt}{-6.412953pt}{3% 4.265579pt}{16.745846pt}\pgfsys@fill\pgfsys@invoke{ } {}{}\pgfsys@endscope}{{{{}}\pgfsys@beginscope{}\pgfsys@transformcm{1.0}{0.0}{0% .0}{1.0}{-13.799789pt}{0.0pt}{}\hbox{{\definecolor{pgfstrokecolor}{rgb}{0,0,0}% \pgfsys@color@rgb@stroke{0}{0}{0}{}\pgfsys@color@rgb@fill{0}{0}{0}{}\hbox{{yi:j}} }}{}{}\pgfsys@endscope}}} {}{}{}{}\pgfsys@endscope}}\pgfsys@endscopeto0.0pt\hss\pgfsys@discardpath\pgfsys@endscope\hss\endpgfpicturey(l+1):M. (4)

If or then the block-switch transformation leaves unchanged: .

###### Definition 2 (Partial exchangeability).

A function is said to be -block-switch invariant if for all and for all -block-switch transformations . Similarly, a model is -partially exchangeable if for all -block-switch transformations we have .

Note that -partial exchangeability reduces to exchangeability and that all permutations are -block-switch transformations.

It is rather easy to see that, if is a Markov chain of order , then is partially exchangeable (and therefore is -block-switch invariant). In the limit of infinite data sets, Diaconis & Freedman (1980) showed that the converse was also true: any partially exchangeable distribution is conditionally Markovian. This result, which is an analogue of de Finetti’s theorem for Markov chains, justfies that

partial exchangeability is the right symmetry to invoke when dealing with Markov models

.

### 3.2 From model invariance to network architecture

When dealing with Markovian data, we therefore wish to model a regression function that is -block-switch invariant. Next theorem gives a general functional representation of such functions, in the case where is countable.

###### Theorem 1.

Let be -block-switch invariant. If is countable, then there exist two functions and such that

 (5)
###### Proof.

Let be the equivalence relation over defined by

 x∼y⟺∃b1,…,bk,y=T(d)b1∘⋯∘T(d)bk(x).

Let be the projection over the quotient set. According to the properties of the quotient set, since is -block-switch invariant, there exists a unique function such that .

Since is countable, is also countable and there exists an injective function . Consider then the function

 ν:y↦(y1:d,M−d∑i=12−c(yi:(i+d))),

which is clearly -block-switch invariant. There exists a unique function such that

We will now show that is a bijection. By construction, is clearly surjective. Let us now prove its injectivity. We thus have to show that, for all , implies . Let such that . We have therefore and

 M−d∑i=12−c(xi:(i+d))=M−d∑i=12−c(yi:(i+d)).

The uniqueness of finite binary representations then implies that . According to Diaconis & Freedman (1980, Proposition 27), those two conditions imply that , which shows that is indeed injective.

Since is a bijection, implies that which leads to . Finally, expanding this gives

 ∀y∈YM,F(y)=g∘h−1(y1:d,M−d∑i=12−c(yi:(i+d))),

which is the desired form with and . ∎

When , the representation reduces to

 F(y)=ρ(M∑i=1ϕ(yi)), (6)

and we exactly recover Theorem 2 from Zaheer et al. (2017)—which also assumes countability of —and the DeepSets representation. While an extension of our theorem to the uncountable case is not straightforward, we conjecture that a similar result holds even with uncountable . A possible way to approach this conjecture is to study the very recent and fairly general result of Bloem-Reddy & Teh (2019). We note that the experiments on an autoregressive time series model in Section 4.3, which is a Markovian process, support this conjecture.

##### Partially exchangeable networks

The result in Theorem 1 suggests how to build -block-switch invariant neural networks: we replace the functions and in Equation 5

by feed forward neural networks and denote this construction a

-partially exchangeable network (PEN- or PEN of order ). In this construction, we will call the inner network, which maps a -length subsequence into some representation , and is the outer network that maps the first symbols of the input, and the sum of the representations of all -length subsequences of the input, to the output. We note that DeepSets networks are a special case of the PENs that corresponds to PEN-0.

### 3.3 Using partially exchangeable networks for learning summary statistics for ABC

While PENs can by used for any exchangeable data, in this paper we use it for learning summary statistics in ABC. In particular, we propose the following regression model for learning the posterior mean

 θi=E(θ|yi)+ξi=ρβρ(yi1:d,M−d∑l=1ϕβϕ(yil:l+d))+ξi.

Here are the weights for the inner network, and are the weights for the outer network that maps its arguments into the posterior mean of the unknown parameters, which is the ABC summary we seek. When using PENs to learn the summary statistics we obtain the weights for the networks using the same criterion as in Equation 2, except that instead of using the MLP network we use a PEN network for the underlying regression problem.

When targeting static models we employ a PEN-0, i.e. a DeepSets network, since a static model can be viewed as a zero-order Markov model. When targeting time series models we use a PEN-, where is the order of the assumed data generating Markov process.

## 4 Experiments

We will present four experiments: two static models (g-and-k and -stable distributions), and two time series models (autoregressive and moving average models). Full specification of the experimental settings is provided as supplementary material in Appendix A. The code was written in Julia 1.0.0 (Bezanson et al., 2017) and the framework Knet (Yuret, 2016) was used to build the deep learning models. The code can be found at: https://github.com/SamuelWiqvist/PENs-and-ABC. All experiments are simulation studies and the data used can be generated from the provided code.

### 4.1 g-and-k distribution

The g-and-k distribution is a distribution defined by its quantile function via four parameters, and not by its probability density function since the latter is unavailable in closed form. This means that the likelihood function is “intractable” and as such exact inference is not possible. However, it is very simple to simulate draws from said distribution (see supplementary material in Appendix

A), which means that g-and-k models are typically used to test ABC algorithms (Prangle, 2017).

The unknown parameters are and we follow the common practice of keeping fixed to and assume and (Prangle, 2017). The prior distributions are set to , , , and (

is the Gamma distribution with shape parameter

and rate parameter ). We perform a simulation study with ground-truth parameters , , , (same ground-truth parameter values as in Allingham et al., 2009, Picchini & Anderson, 2017, Fearnhead & Prangle, 2012). Our data set comprises realizations from a g-and-k distribution.

We compare four different methods of constructing the summary statistics for ABC: (i) we use the handpicked summary statistics in Picchini & Anderson (2017), i.e. ( is the th percentile and

is the skewness); (ii) we use a MLP network; (iii) we use a MLP network with a preprocessing step where we feed the network with the empirical distribution function of the data instead of feeding it with the actual data; (iv) we use a PEN-0.

The probability density function for the g-and-k distribution can be approximated via finite differences, as implemented in the gk R package (Prangle, 2017). This allow us to sample from an almost exact posterior distribution using standard Markov chain Monte Carlo (MCMC), combined with the numerical approximation for the probability density function of the g-and-k distribution given in Prangle (2017). We evaluate the inference produced using summaries constructed from the four different methods (i–iv) by comparing the resulting ABC posteriors to the “true” posterior (computed using MCMC). Comparisons are performed via the multivariate Cramér statistic (Baringhaus & Franz, 2004) (see the supplementary material, in Appendix A, for details on how this statistic is computed). Small values of the Cramér statistic indicate that the two distributions under comparison are similar. ABC inferences are repeated over 100 independent data sets, and for a different number of training data observations for DNN models.

The results are presented in Figure 1 and we can conclude that PEN-0 generates the best results. Furthermore, PEN-0 is also more data efficient since it performs considerably better than other methods with limited number of training observations. It seems in fact that PEN-0 requires 10 times less training data than the version of MLP with preprocessing to achieve the same inference accuracy. However all methods performed poorly when too few training observations are used. The results also show that when MLP is feeded with the observations it generates poor results, but if we instead send in the empirical distribution function, in the spirit of Creel (2017), we obtain considerably better results.

### 4.2 α-stable distribution

The

-stable is a heavy-tailed distribution defined by its characteristic function (see supplementary material in Appendix

A). Its probability density function is intractable and inference is therefore challenging. Bayesian methods for the parameters can be found in e.g. (Peters et al., 2012; Ong et al., 2018). Unknown parameters are . We follow Ong et al. (2018) and transform the parameters:

 ~α=logα−1.12−α, ~β=logβ+11−β, ~γ=logγ,and ~δ=δ.

This constraints the original parameters to , , and . Independent Gaussian priors and ground-truth parameters are as in Ong et al. (2018): ; ground-truth values for the untransformed parameters are: , , , and . Observations consist of samples.

We compare methods for computing summary statistics as we did in Section 4.1 for the g-and-k distribution. However, since here the true posterior distribution is unavailable, we evaluate the different methods by comparing the root-mean square error (RMSE) between ground-truth parameter values and the mean of the ABC posteriors obtained under the different methods, see Table 1. From Table 1 we conclude that PEN-0 performs best in terms of RMSE. Similarly to the g-and-k example we also see that the MLP network with “preprocessing” (see Section 4.1 for details) performs considerably better than MLP. We also conclude that the inference results do not seem to degrade considerably when we reduce the number of training observations, at lest in terms of RMSE. We now look at the resulting posteriors. In Figure 2 five posteriors from five independent experiments are presented (here we have used training data observations). Inference results when using handpicked summary statistics are poor and for the posterior resembles the prior. Posterior inference is worst for MLP. Results for MLP with preprocessing and PEN-0 are quite similar, but for PEN-0 seems to generate better results, while is somewhat better determined by MLP with preprocessing. However, recall in this example we do not know where the true posterior lies.

### 4.3 Autoregressive time series model

An autoregressive time series model of order two (AR(2)) follows:

.

The AR(2) model is identifiable if the following are fulfilled: (Fuller, 1976). We let the resulting triangle define the uniform prior for the model. The ground-truth parameters for this simulation study are set to , and the data size is .

AR(2) is a Markov model, hence and the requirement for PEN- with is fulfilled.

We compare three methods for computing the summary statistics: (i) handpicked summary statistics, i.e. ( is autocovariance at lag

), which are reasonable summaries since autocovariances are normally employed in parameter estimation for autoregressive models, for instance when using the Yule–Walker equations; (ii) MLP, and (iii) PEN-2. For AR(2) we consider PEN-2 instead of PEN-0 since this is a time series model. Here we do not consider the MLP preprocessing method used in Section

4.1 and 4.2, since the empirical distribution function does not have any reasonable meaning for time series data.

The likelihood function for AR(2) is known and we can therefore sample from the true posterior using MCMC. We compare the approximate posteriors to the true posterior over 100 independent data sets via the multivariate Cramér statistic. To investigate the efficiency of the different DNN methods, we run several experiments by varying the size of the training data.

Results are in Figure 3. PEN-2 outperforms MLP, for example we can see that the precision achieved when PEN-2 is trained on training observations can be achieved by MLP when trained on observations, implying an improvement of a factor. Approximate and exact posteriors are in Figure 4 and we conclude that posteriors for both MLP and PEN-2 are similar to the true posterior when many training observations are used. However, the approximate posterior for MLP degrades significantly when the number of training observations is reduced and is very uninformative with observations, while for PEN-2 the quality of the approximate posterior distribution is only marginally reduced.

### 4.4 Moving average time series model

A moving average time series model (MA2) follows

The MA2 process is identifiable if: , , and . Same as in Jiang et al. (2017), we define a uniform prior over this triangle. We use the same setting as in Jiang et al. (2017) and set the ground-truth parameters for the simulation study to , the number of observations in the data set is .

The MA2 model is not Markovian, hence the Markov property required for PEN of order larger than 0 is not fulfilled, however, the quasi-Markov structure of the data might still allow us to successfully use PEN with an order larger than 0.

Once more, we compare three sets of summary statistics: (i) handpicked summaries , i.e. we follow Jiang et al. (2017); (ii) we use MLP; and finally (iii) we use PEN-10.

Also in this case the likelihood function is available, and we can compute the true posterior distribution. Once more, we compare the approximate posteriors to the true posterior over 100 different data sets, see Figure 5. We conclude that PEN-10 performs slightly better than MLP when the training data set is large, and that PEN-10 outperforms MLP when we restrict the size of the training data. Once more, we notice that PEN-10 implies a factor in terms of savings on the size of the training data.

## 5 Discussion

Simulation experiments show that our partially exchangeable networks (PENs) achieve competitive results compared to the other deep learning methods that we have considered. Moreover, PENs require much smaller training data to achieve the same inference accuracy of competitors: in our experiments a reduction factor of order to was observed.

As mentioned in Section 2, in this work we were not focused on the specific ABC algorithm used for sampling, but only on learning summary statistics for ABC. However, in future work we plan to use our approach for constructing summary statistics alongside more sophisticated variants of ABC methods, such as those which combine ABC with Markov chain Monte Carlo (Sisson & Fan, 2011) or sequential techniques (Beaumont et al., 2009).

Murphy et al. (2019) recently shed light on some limitations of the DeepSets architecture, and proposed to improve it by replacing the sum fed to the outer network by another pooling techinque called Janossy pooling. Since the drawbacks they inspect are also likely to affect our architectures, extending Janossy pooling to the PEN framework might constitute a valuable improvement.

In our all our experiments the PENs have fewer weights compared to the MLP networks. Due to this fact, it may not be surprising that PENs outperform MLPs when we reduce the number of training data observations. The main insight is that PENs by design incorporates the (partial) exchangeability property of the data in our experiments, wheres the MLPs have to learn this property. Exchangeability and partial exchangeability can in principle be expressed in an MLP, but for small data sets these properties will be difficult to learn, and we expect that the model overfit to the training data. One approach to alleviate this problem for MLPs is to perform data augmentation. However, it is not straight forward to perform data augmentation for continuous Markovian data, unless you have access to the underlying data generating process. In ABC the assumption is that we do have access to this process, but data generation may be computational expensive, and in a more general application we may not have access to the process.

Although we have applied the PEN architecture to the problem of learning summary statistics for ABC, notice that PEN is a general architecture and could be used for other applications. One example would be time series classification.

The main limitation for PEN is that observed data should be Markovian or, when considering the special case of PEN DeepSets, exchangeable. However, the MA(2) experiment shows that PEN seems to achieve good results also when applied to models with quasi-Markovian data.

## Acknowledgements

Research was partially supported by the Swedish Research Council (VR grant 2013-05167). We would also like to thank Joakim Lübeck with colleges at LUNARC, Lund University for helping out on setting up the GPU environment used for the simulations.

## References

• Allingham et al. (2009) Allingham, D., King, R., and Mengersen, K. L. Bayesian estimation of quantile distributions. Statistics and Computing, 19(2):189–201, 2009.
• Baringhaus & Franz (2004) Baringhaus, L. and Franz, C. On a new multivariate two-sample test.

Journal of multivariate analysis

, 88(1):190–206, 2004.
• Beaumont et al. (2002) Beaumont, M. A., Zhang, W., and Balding, D. J. Approximate Bayesian computation in population genetics. Genetics, 162(4):2025–2035, 2002.
• Beaumont et al. (2009) Beaumont, M. A., Cornuet, J.-M., Marin, J.-M., and Robert, C. P. Adaptive approximate Bayesian computation. Biometrika, 96(4):983–990, 2009.
• Bezanson et al. (2017) Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. Julia: A fresh approach to numerical computing. SIAM review, 59(1):65–98, 2017.
• Bloem-Reddy & Teh (2019) Bloem-Reddy, B. and Teh, Y. W. Probabilistic symmetry and invariant neural networks. arXiv:1901.06082, 2019.
• Blum et al. (2013) Blum, M. G., Nunes, M. A., Prangle, D., Sisson, S. A., et al. A comparative review of dimension reduction methods in approximate Bayesian computation. Statistical Science, 28(2):189–208, 2013.
• Chan et al. (2018) Chan, J., Perrone, V., Spence, J., Jenkins, P., Mathieson, S., and Song, Y. A likelihood-free inference framework for population genetic data using exchangeable neural networks. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 8603–8614. Curran Associates, Inc., 2018.
• Cornuet et al. (2008) Cornuet, J.-M., Santos, F., Beaumont, M. A., Robert, C. P., Marin, J.-M., Balding, D. J., Guillemaud, T., and Estoup, A. Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics, 24(23):2713–2719, 2008.
• Creel (2017) Creel, M. Neural nets for indirect inference. Econometrics and Statistics, 2:36–49, 2017.
• de Finetti (1929) de Finetti, B. Funzione caratteristica di un fenomeno aleatorio. In Atti del Congresso Internazionale dei Matematici: Bologna dal 3 al 10 di settembre 1928, pp. 179–190, 1929.
• Diaconis (1988) Diaconis, P. Recent progress on de Finetti’s notions of exchangeability. Bayesian statistics, 3:111–125, 1988.
• Diaconis & Freedman (1980) Diaconis, P. and Freedman, D. de Finetti’s theorem for Markov chains. The Annals of Probability, pp. 115–130, 1980.
• Fearnhead & Prangle (2012) Fearnhead, P. and Prangle, D. Constructing summary statistics for approximate bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B, 74(3):419–474, 2012.
• Franz (2014) Franz, C. cramer: Multivariate nonparametric Cramer-Test for the two-sample-problem, 2014. R package version 0.9-1.
• Fuller (1976) Fuller, W. A. Introduction to time series analysis. New York: John Wiley & Sons, 1976.
• Jasra (2015) Jasra, A. Approximate Bayesian computation for a class of time series models. International Statistical Review, 83(3):405–435, 2015.
• Jiang et al. (2017) Jiang, B., Wu, T.-y., Zheng, C., and Wong, W. H. Learning summary statistic for approximate Bayesian computation via deep neural network. Statistica Sinica, pp. 1595–1618, 2017.
• Marin et al. (2012) Marin, J.-M., Pudlo, P., Robert, C. P., and Ryder, R. J. Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180, 2012.
• Minsky & Papert (1988) Minsky, M. and Papert, S. Perceptrons (expanded edition) MIT Press. 1988.
• Murphy et al. (2019) Murphy, R. L., Srinivasan, B., Rao, V., and Ribeiro, B. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. In International Conference on Learning Representations, 2019.
• Ong et al. (2018) Ong, V. M., Nott, D. J., Tran, M.-N., Sisson, S. A., and Drovandi, C. C. Variational bayes with synthetic likelihood. Statistics and Computing, 28(4):971–988, 2018.
• Peters et al. (2012) Peters, G. W., Sisson, S. A., and Fan, Y.

Likelihood-free bayesian inference for

-stable models.
Computational Statistics & Data Analysis, 56(11):3743–3756, 2012.
• Picchini (2014) Picchini, U. Inference for SDE models via approximate Bayesian computation. Journal of Computational and Graphical Statistics, 23(4):1080–1100, 2014.
• Picchini & Anderson (2017) Picchini, U. and Anderson, R. Approximate maximum likelihood estimation using data-cloning ABC. Computational Statistics & Data Analysis, 105:166–183, 2017.
• Prangle (2015) Prangle, D. Summary statistics in approximate Bayesian computation. arXiv:1512.05633, 2015.
• Prangle (2017) Prangle, D. gk: An R package for the g-and-k and generalised g-and-h distributions. arXiv:1706.06889, 2017.
• Pritchard et al. (1999) Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A., and Feldman, M. W. Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular biology and evolution, 16(12):1791–1798, 1999.
• Ravanbakhsh et al. (2017) Ravanbakhsh, S., Schneider, J., and Poczos, B. Deep learning with sets and point clouds. International Conference on Learning Representations (ICLR) - workshop track, 2017.
• Shawe-Taylor (1989) Shawe-Taylor, J. Building symmetries into feedforward networks. In Artificial Neural Networks, 1989., First IEE International Conference on (Conf. Publ. No. 313), pp. 158–162. IET, 1989.
• Sisson & Fan (2011) Sisson, S. A. and Fan, Y. Handbook of Markov Chain Monte Carlo, chapter Likelihood-free Markov chain Monte Carlo. Chapman and Hall, 2011.
• Sisson et al. (2018) Sisson, S. A., Fan, Y., and Beaumont, M. Handbook of Approximate Bayesian Computation. Chapman and Hall/CRC, 2018.
• Yuret (2016) Yuret, D. Knet: beginning deep learning with 100 lines of julia. In Machine Learning Systems Workshop at NIPS, volume 2016, pp.  5, 2016.
• Zaheer et al. (2017) Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3391–3401, 2017.

## Appendix A Supplementary Material

### a.1 Approximate Bayesian computation rejection sampling

#### a.1.1 Settings for ABC rejection sampling “reference table” algorithm

For g-and-k and -stable models we consider for the th percentile, and for AR(2) and MA(2) the th percentile of all distances. The number of proposals for g-and-k and -stable models is , and for AR(2) and MA(2) .

#### a.1.2 The ABC distance function

In all our inference attempts we always used ABC rejection sampling and only needed to change the method used to compute the summary statistics. We employed the Mahalanobis distance

 Δ(s∗,sobs)=√(s∗−sobs)⊤A(s∗−sobs),

where in our case

is the identity matrix, except when using

hand-picked summary statistics for the g-and-k distribution, and in such case is a diagonal matrix with diagonal elements , with a vector with entries , as in (Picchini & Anderson, 2017).

### a.2 Multivariate Cramér statistic

To compare multivariate posterior distributions we use the multivariate Cramér statistic studied in (Baringhaus & Franz, 2004) and implemented in the cramer R package (Franz, 2014). The statistic assumes and to be independent random vectors with distribution function and respectively. Then (Baringhaus & Franz, 2004) defines the statistic which can be used to test the hypothesis vs where

 Tm,n=mnm+n(2mnm,n∑i,jψ(\normXi−Yi2)−1m2m∑i,jψ(\normXi−Xj2)−1n2n∑i,jψ(\normYi−Yj2)),

with a kernel function. In our experiments we do not perform hypothesis testing and we just use as a single measure of closeness between draws generated from different posterior distributions. In our experiments we use the kernel: (Franz, 2014).

### a.3 Regularization

We use early-stopping for all networks. The early-stopping method used is to train the network over epochs and then select the set of weights, out of the sets, that generated the lowest evaluation error.

### a.4 g-and-k distribution

• Procedure to simulate a single draw from the distribution: say that we simulate a draw

from a standard Gaussian distribution,

, then we plug into

 Q=A+B⋅(1+c⋅tanh(g⋅z/2))⋅z⋅(1+z2)k

and obtain a realization from a g-and-k distribution.

• The network settings are presented in Table 2, 3, and 4;

• Values outside of the range

are considered to be outliers and these values are replaced (at random) with values inside the data range. The data cleaning scheme is applied to both the observed and generated data;

• When computing the empirical distribution function we evaluate this function over 100 equally spaced points between 0 and 50;

• Number of training observations: , , , and . Test data observations . Evaluation data observations .

### a.5 α-stable distribution

• The characteristic function for the -stable distribution is given by (Ong et al., 2018)

 φ(x)=exp(iδt−γα|t|α(1+iβtanπα2sgn(t)(|γt|1−α−1))), α≠1,

and,

where, sgn is the sign function, i.e.

 sgn(t)=⎧⎨⎩−1if t<0,0if t=0,1if t>0.
• The network settings are presented in Table 5, 6, and 7;

• Values outside of the range are considered to be outliers and these values are replaced (at random) with values inside the data range. The data cleaning scheme is applied to both the observed and generated data;

• All data sets are standardized using the “robust scalar” method, i.e. each data point is standardized according to

 yi+Q1(y)Q3(y)−Q1(y)

where and are the first and third quantiles respectively;

• When computing the empirical distribution function we evaluate this function over 100 equally spaced points between -10 and 100;

• The root-mean-squared error (RMSE) is computed as

 RMSE= ⎷1RR∑i=1{4∑j=1(^θji−θj)2}

where are ground-truth parameter values and are ABC posterior means. is the number of independent repetitions of the inference procedure;

• Number of training observations: , , , and . Test data observations . Evaluation data observations .

### a.6 Autoregressive time series model

• The network settings are presented in Table 8 and 9;

• Number of training observations: , , , and . Test data observations . Evaluation data observations .

### a.7 Moving average time series model

• The network settings are presented in Table 10 and 11;

• Number of training observations: , , , and . Test data observations . Evaluation data observations .