In recent years, Bayesian inference has become a popular paradigm for machine learning and statistical analysis. Good introductions and references to the primary methods and philosophies of Bayesian inference can be found in texts such asPress (2003), Ghosh et al. (2006), Koch (2007), Koop et al. (2007), Robert (2007), Barber (2012), and Murphy (2012).
In this article, we are concerned with the problem of parametric, or classical Bayesian inference. For details regarding nonparametric Bayesian inference, the reader is referred to the expositions of Ghosh & Ramamoorthi (2003), Hjort et al. (2010), and Ghosh & van der Vaart (2017).
When conducting parametric Bayesian inference, we observe some realizations of the data
that are generated from some data generating process (DGP), which can be characterized by a parametric likelihood, given by a probability density function (PDF)
, determined entirely via the parameter vector. Using the information that the parameter vector
is a realization of a random variable, which arises from a DGP that can be characterized by some known prior PDF , we wish to characterize the posterior distribution
In very simple cases, such as cases when the prior PDF is a conjugate of the likelihood (cf. Robert, 2007, Sec. 3.3), the posterior distribution (1) can be expressed explicitly. In the case of more complex but still tractable pairs of likelihood and prior PDFs, one can sample from (1) via a variety of Monte Carlo methods, such as those reported in Press (2003, Ch. 6).
In cases where the likelihood function is known but not tractable, or when the likelihood function has entirely unknown form, one cannot exactly sample from (1) in an inexpensive manner, or at all. In such situations, a sample from an approximation of (1) may suffice in order to conduct the user’s desired inference. Such a sample can be drawn via the method of approximate Bayesian computation (ABC).
It is generally agreed that the ABC paradigm originated from the works of Rubin (1984), Tavaré et al. (1997), and Pritchard et al. (1999); see Tavaré (2019) for details. Stemming from the initial listed works, there are now numerous variants of ABC methods. Some good reviews of the current ABC literature can be found in the expositions of Marin et al. (2012), Voss (2014, Sec. 5.1), Lintusaari et al. (2017), and Karabatsos & Leisen (2018). The volume of Sisson et al. (2019) provides a comprehensive treatment regarding ABC methodologies.
The core philosophy of ABC is to define a quasi-posterior by comparing data with plausibly simulated replicates. The comparison is traditionally based on some summary statistics, the choice of which being regarded as a key challenge of the approach.
In recent years, data discrepancy measures bypassing the construction of summary statistics have been proposed by viewing data sets as empirical measures. Examples of such an approach is via the use of the Kullback–Leibler divergence, the Wasserstein distance, or a maximum mean discrepancy (MMD) variant.
In this article, we develop upon the discrepancy measurement approach of Jiang et al. (2018), via the importance sampling ABC (IS-ABC) approach which makes use of a weight function (see e.g., Karabatsos & Leisen, 2018). In particular, we report on a class of ABC algorithms that utilize the two-sample energy statistic (ES) of Szekely & Rizzo (2004) (see also Baringhaus & Franz, 2004, Szekely & Rizzo, 2013, and Szekely & Rizzo, 2017). Our approach is related to the maximum MMD ABC algorithms that were implemented in Park et al. (2016), Jiang et al. (2018), and Bernton et al. (2019). The MMD is a discrepancy measurement that is closely related to the ES (cf. Sejdinovic et al., 2013).
We establish new asymptotic results that have not been proved in these previous papers. In the IS-ABC setting and in the regime where both the observation sample size and the simulated data sample size increase to infinity, our theoretical result highlights how the data discrepancy measure impacts the asymptotic pseudo-posterior. More specifically, under the assumption that the data discrepancy measure converges to some asymptotic value , we show that the pseudo-posterior distribution converges almost surely to a distribution proportional to : the prior distribution times the IS weight function evaluated at , where stands for the ‘true’ parameter value associated to the DGP that generates observations . Although devised in settings where likelihoods are assumed intractible, ABC can also be cast in the setting of robustness with respect to misspecification, where the ABC posterior distribution can be viewed as a special case of a coarsened posterior distribution (cf. Miller & Dunson, 2018).
The remainder of the article proceeds as follows. In Section 2, we introduce the general IS-ABC framework. In Section 3, we introduce the two-sample ES and demonstrate how it can be incorporated into the IS-ABC framework. Theoretical results regarding the IS-ABC framework and the two-sample ES are presented in Section 4. Illustrations of the IS-ABC framework are presented in Section 5. Conclusions are drawn in Section 6.
2 Importance sampling ABC
Assume that we observe independent and identically distributed (IID) replicates of from some DGP, which we put into . We suppose that the DGP that generates is dependent on some parameter vector , a realization of from space , which is random and has prior PDF .
Denote to be the PDF of , given , and write
where is a realization of , and each is a realization of ().
If were known, then we could use (1) to write the posterior PDF
where is a constant that makes . When evaluating is prohibitive and ABC is required, then operating with is similarly difficult. We suppose that given any , we at least have the capability of sampling from the DGP with PDF . That is, we have a simulation method that allows us to feasibly sample the IID vector , for any , for a DGP with PDF
Using the simulation mechanism that generates samples and the prior distribution that generates parameters , we can simulate a set of simulations , where and is the transposition operator. Here, for each , is an observation from the DGP with joint PDF , hence each is composed of a parameter value and a datum conditional on the parameter value. We now consider how and can be combined in order to construct an approximation of (2).
Following the approach of Jiang et al. (2018), we define to be some non-negative real-valued function that outputs a small value if and are similar, and outputs a large value if and are different, in some sense. We call the data discrepancy measurement between and , and we say that is the data discrepancy function.
Next, we let be a non-negative, decreasing (in ), and bounded (importance sampling) weight function (cf. Section 3 of Karabatsos & Leisen, 2018), which takes as inputs a data discrepancy measurement and a calibration parameter . Using the weight and discrepancy functions, we can propose the following approximation for (2).
In the language of Jiang et al. (2018), we call
the quasi-posterior PDF, where
is the approximate likelihood function, and
is a normalization constant. We can use (3) to approximate (2) in the following way. For any functional of the parameter vector of interest, say, we may approximate the posterior Bayes estimator of via the expression
IS-ABC procedure for approximating .
Input: a data discrepancy function , a weight function , and a calibration parameter .
sample from the DGP with PDF ;
generate from the DGP with PDF ;
put into .
Output: and construct the estimator .
3 The energy statistic (ES)
Let define a metric and let and be two random variables that are in a metric space endowed with , where . Furthermore, let and be two random variables that have the same distributions as and , respectively. Here, , , , and are all independent of one another.
we can define the original ES of Baringhaus & Franz (2004) and Szekely & Rizzo (2004), as a function of and , via the expression , where is the metric corresponding to the (). Thus, the original ES statistic, which we shall also denote as , is defined using the Euclidean norm .
The original ES has numerous useful mathematic properties. For instance, under the assumption that , it was shown that
in Proposition 1 of Szekely & Rizzo (2013), where is the gamma function and (respectively,
) is the characteristic function of(respectively, ). Thus, we have the fact that for any , and if and only if and are identically distributed.
The result above is generalized in Proposition 3 of Szekely & Rizzo (2013), where we have the following statement. If is a continuous function and are independent random variables, then it is necessary and sufficient that is strictly negative definite (see Szekely & Rizzo, 2013 for the precise definition) for the following conclusion to hold: for any , and if and only if and are identically distributed.
We observe that there is thus an infinite variety of functions from which we can construct energy statistics. We shall concentrate on the use of the original ES, based on , since it is the most well known and popular of the varieties.
3.1 The V-statistic estimator
Suppose that we observe and , where the former is a sample containing IID replicates of , and the latter is a sample containing IID replicates of , respectively, with and being independent. In Gretton et al. (2012), it was shown that for any , upon assuming that , the so-called V-statistic estimator (cf. Serfling, 1980, Ch. 5 and Koroljuk & Borovskich, 1994)
can be proved to converge in probability to , as and , under the condition that , for some constant (see also Gretton et al., 2007).
We note that the assumption of this result is rather restrictive, since it either requires the bounding of the space or the function . In the sequel, we will present a result for the almost sure convergence of the V-statistic that depends on the satisfaction of a more realistic hypothesis.
It is noteworthy that if the ES is non-negative, then the V-statistic retains the non-negativity property of its corresponding ES (cf. Gretton et al., 2012). That is, for any continuous and negative definite function , we have .
3.2 The ES-based IS-ABC algorithm
From Algorithm 1, we observe that an IS-ABC algorithm requires three components. A data discrepancy measurement , a weighting function , and a tuning parameter . We propose the use of the ES in the place of the data discrepancy measurement , in combination with various weight functions that have been used in the literature. That is we set
in Algorithm 1.
3.3 Related methods
The ES-ABC algorithm that we have presented here is closely related to ABC algorithms based on the maximum mean discrepancy (MMD) that were implemented in Park et al. (2016), Jiang et al. (2018), and Bernton et al. (2019). For each positive definite Mercer kernel function (, the corresponding MMD is defined via the equation
where are random variable such that and are identically distributed to and , respectively.
The MMD as a statistic for testing goodness-of-fit was studied prominently in articles such as Gretton et al. (2007), Gretton et al. (2009), and Gretton et al. (2012). It is clear that if , the forms of the ES and the squared MMD are identical. More details regarding the relationship between the two classes of statistics can be found in Sejdinovic et al. (2013).
We note two shortcomings with respect to the applications of the MMD as a basis for an ABC algorithm in the previous literature. Firstly, no theoretical results regarding the consistency of the MMD-based methods have been proved. And secondly, in the application by Park et al. (2016) and Jiang et al. (2018), the MMD was implemented using the unbiased U-statistic estimator, rather than the biased V-statistic estimator. Although both estimators are consistent, in the sense that they can be proved to be convergent to the desired limiting MMD value, the U-statistic estimator has the unfortunate property of not being bounded from below by zero (cf. Gretton et al., 2012). As such, it does not meet the criteria for a data discrepancy measurement.
4 Theoretical results
4.1 General asymptotic analysis
We now establish a consistency result for the quasi-posterior density (3), when and approach infinity. Our result generalizes the main result of Jiang et al. (2018) (i.e., Theorem 1), which is the specific case when the weight function is restricted to the form
where is the Iverson bracket notation, which equals 1 when the internal statement is true, and 0, otherwise (cf. Graham et al., 1994).
The weighting function of form (8), when implemented within the IS-ABC framework, produces the common rejection ABC algorithms, that were suggested by Tavaré et al. (1997), and Pritchard et al. (1999). We extended upon the result of Jiang et al. (2018) so that we may provide theoretical guarantees for more exotic ABC procedures, such as the kernel-smoothed ABC procedure of Park et al. (2016), which implements weights of the form
for . See Karabatsos & Leisen (2018) for further discussion and examples.
In order to prove our consistency result, we require Hunt’s lemma, which is reported in Dellacherie & Meyer (1980), as Theorem 45 of Section V.5. For convenience to the reader, we present the result, below.
Let be a probability space with increasing and let . Suppose that is a sequence of random variables that is bounded from above in absolute value by some integrable random variable , and further suppose that converges almost surely to the random variable . Then, almost surely, and in mean, as .
Define the continuity set of a function as
Let and be IID samples from DGPs that can be characterized by PDFs and , respectively, with corresponding parameter vectors and . Suppose that the data discrepancy converges to some , which is a function of and , almost surely as , for some . If is piecewise continuous and decreasing in and for all and any , and if
then we have
almost surely, as .
Using the notation of Theorem 1, we set . Since , for any , we have the existence of a such that is integrable, since we can take . Since converges almost surely to , and is continuous at , we have with probability one by the extended continuous mapping theorem (cf. DasGupta, 2011, Thm. 7.10).
Now, let be the generated by the sequence . Thus, is an increasing , which approaches . We are in a position to directly apply Theorem 1. This yields
almost surely, as , where the right-hand side equals .
Notice that the left-hand side has the form
and therefore , almost surely, as . Thus, the numerator of (3) converges to
To complete the proof, it suffices to show that the denominator of (3) converges almost surely to
Since and , we obtain our desired convergence via the dominated convergence theorem, because . An application of a Slutsky-type theorem yields the almost sure convergence of the ratio between (11) and (12) to the right-hand side of (10), as . ∎
The following result and proof guarantees the applicability of Theorem 2 to rejection ABC procedures, and to kernel-smoothed ABC procedures, as used in Jiang et al. (2018) and Park et al. (2016), respectively.
For weights of form (8), we note that is continuous in at all points, other than when . Furthermore, and is hence non-negative and bounded. Thus, under the condition that , we have the desired conclusion of Theorem 2.
For weights of form (9), we note that for fixed , is continuous and positive in . Since is uniformly bounded by 1, differentiating with respect to , we obtain , which is negative for any and . Thus, (9) constitutes a weight function and satisfies the conditions of Theorem 2.
4.2 Asymptotic of the energy statistic
Let and be arbitrary elements of and , respectively. That is and arise from DGPs that can be characterized by PDFs and , respectively. Under the assumption , Proposition 1 of Szekely & Rizzo (2013) states that we can write the ES as
where is the characteristic function corresponding to the PDF .
We write . From Szekely & Rizzo (2004) we have the fact that for arbitrary ,
is the kernel of the V-statistic that is based on the function . The following result is a direct consequence of Theorem 1 of Sen (1977), when applied to V-statistics constructed from functionals that satisfy the hypothesis of Szekely & Rizzo (2013, Prop. 3).
Make the same assumptions regarding and as in Theorem 2. Let be a continuous and strictly negative definite function. If
then converges almost surely to , as , where and are arbitrary elements of and , respectively. Furthermore, if and only if and are identically distributed.
We may apply the result of Lemma 1 directly to the case of in order to provide an almost sure convergence result regarding the V-statistic .
where and . The first term on the right-hand side of (16) is equal to zero, since , whenever . Thus, we need only be concerned with bounding the second term.
For , , thus
The condition that is thus fulfilled if , which is equivalent to
by virtue of the integrability of implying the existence of
since it is defined on a bounded support.
Next, by the triangle inequality, , and hence
Since are all pairwise independent, and and are identically distributed to and , respectively, we have
which concludes the proof since is satisfied by the hypothesis and implies .
) is somewhat more intuitive and verifiable since it is concerned with the polynomial moments of norms and does not involve the piecewise function. It is also suggested in Zygmund (1951) that one may replace by if it is more convenient to do so.
Combining the result of Theorem 2 with Corollary 1 and the conclusion from Proposition 1 of Szekely & Rizzo (2013) provided in Equation (13) yields the key result below. This result justifies the use of the V-statistic estimator for the energy distance within the IS-ABC framework.
We illustrate the use of the ES on some standard models. The standard rejection ABC algorithm is employed (that is, we use Algorithm 1 with weight function of form (8)) for constructing estimators (5). The proposed ES is compared to the Kullback–Leibler divergence (KL), the Wasserstein distance (WA), and the maximum mean discrepancy (MMD). Here, the ES is applied using the Euclidean metric , the Wasserstein distance using the exponent (cf. Bernton et al., 2019) and the MMD using a Gaussian kernel . The Gaussian kernel is commonly used in the MMD literature, and was also considered for ABC in Park et al. (2016) and Jiang et al. (2018). Details regarding the use of the Kullback–Leibler divergence as a discrepancy function for ABC algorithms can be found in Jiang et al. (2018, Sec. 2).
We use to denote that the random variable has probability law . Furthermore, we denote the normal law by , where states that the DGP of
is multivariate normal distribution with mean vectorand covariance matrix . We further denote the uniform law, in the interval , for , by .
We consider examples explored in Jiang et al. (2018, Sec. 4.1). For each illustration below, we sample synthetic data of the same size as the observed data size, , whose value is specified for each model below. We consider only the rejection weight function, and the number of ABC iterations in Algorithm 1 is set to . The tuning parameter is set so that only the smallest discrepancies are kept to form ABC posterior sample. We postpone the discussion of the results of our simulation experiments to Section 5.5
The experiments were implemented in R, using in particular the winference package (Bernton et al., 2019) and the FNN package (Beygelzimer et al., 2013). The Kullback–Leibler divergence between two PDFs is computed within the -nearest neighbor framework (Boltz et al., 2009). Moreover, the -d trees is adopted for implementing the nearest neighbor search, which is the same as the method of Jiang et al. (2018). For estimating the -Wasserstein distance between two multivariate empirical measures, we propose to employ the swapping algorithm (Puccetti, 2017), which is simple to implement, and is more accurate and less computationally expensive than other algorithms commonly used in the literature (Bernton et al., 2019). Regarding the MMD, the same unbiased U-statistic estimator is adopted as given in Jiang et al. (2018) and Park et al. (2016). For reproduction of the the experimental results, the original source code can be accessed at https://github.com/hiendn/Energy_Statistics_ABC.
5.1 Bivariate Gaussian mixture model
Let be a sequence of IID random variables, such that each has mixture of Gaussian probability law
with known covariance matrices
We aim to estimate the generative parameters consisting of the mixing probability and the population means and . To this end, we perform ABC using observations, sampled from model (17) with , and
. A kernel density estimate (KDE) of the ABC posterior distribution is presented in Figure1.
5.2 Moving-average model of order 2
The moving-average model of order , MA(), is a stochastic process defined as
with being a sequence of unobserved noise error terms. Jiang et al. (2018) used a MA model for their benchmarking; namely . Each observation corresponds to a time series of length . Here, we use the same model as that proposed in Jiang et al. (2018), where follows the Student- distribution with degrees of freedom, and . The priors on the model parameters and are taken to be uniform, that is, and . We performed ABC using samples generated from a model with the true parameter values . A KDE of the ABC posterior distribution is displayed in Figure 2.
5.3 Bivariate beta model
The bivariate beta model proposed by Crackel & Flegal (2017) is defined with five positive parameters by letting
where , for , and setting and . The bivariate random variable has marginal laws and . We performed ABC using samples of size , which are generated from a DGP with true parameter values . The prior on each of the model parameters is taken to be independent . A KDE of the ABC posterior distribution is displayed in Figure 3.
5.4 Multivariate -and- distribution
A univariate -and-
distribution can be defined via its quantile function(Drovandi & Pettitt, 2011):
where parametersis the th quantile of the standard normal distribution. Given a set of parameters , it is easy to simulate observations of a DGP with quantile function (19), by generating a sequence of IID sample , where , for .
A so-called -dimensional -and- DGP can instead be defined by applying the quantile function (19) to each of the elements of a multivariate normal vector , where is a covariance matrix. In our experiment, we use a 5-dimensional -and- model with the same covariance matrix and parameter values for as that considered by Jiang et al. (2018). That is, we generate samples of size from a -and- DGP with the true parameter values and the covariance matrix
where . Marginal KDEs of the ABC posterior distributions is presented in Figure 4.
5.5 Discussion of the results and performance
For each of the four experiments and each parameter, we computed the posterior mean , posterior median , mean absolute error and mean squared error defined by
where denotes the pseudo-posterior sample and denotes the true parameter. Here since and is chosen as to retain of the samples. Each experiment was replicated ten times by keeping the same fixed (true) values for the parameters and by sampling new observed data each of the ten times. The estimated quantities , , and errors MAE and
were then averaged over the ten replications, and are reported along with standard deviationsin columns associated with each estimator and true values for each parameter in Tables 1, 2, 3 and 4.
Upon inspection, Tables 1, 2, 3 and 4 showed some advantage in performance from WA on the bivariate Gaussian mixtures, some advantage from the MMD on the bivariate beta model, and some advantage from the ES on the -and- model, while multiple methods are required to make the best inference in the case of the MA experiment. When we further take into account the standard deviations of the estimators, we observe that all four data discrepancy measures essentially perform comparatively well across the four experimental models. Thus, we may conclude that there is no universally best performing discrepancy measure, and one must choose the right method for each problem of interest. Alternatively, one may also consider some kind of averaging over the results of the different discrepancy measures. We have not committed to an investigation of such methodologies and leave it as a future research direction.