K2-ABC: Approximate Bayesian Computation with Kernel Embeddings
Complicated generative models often result in a situation where computing the likelihood of observed data is intractable, while simulating from the conditional density given a parameter value is relatively easy. Approximate Bayesian Computation (ABC) is a paradigm that enables simulation-based posterior inference in such cases by measuring the similarity between simulated and observed data in terms of a chosen set of summary statistics. However, there is no general rule to construct sufficient summary statistics for complex models. Insufficient summary statistics will "leak" information, which leads to ABC algorithms yielding samples from an incorrect (partial) posterior. In this paper, we propose a fully nonparametric ABC paradigm which circumvents the need for manually selecting summary statistics. Our approach, K2-ABC, uses maximum mean discrepancy (MMD) as a dissimilarity measure between the distributions over observed and simulated data. MMD is easily estimated as the squared difference between their empirical kernel embeddings. Experiments on a simulated scenario and a real-world biological problem illustrate the effectiveness of the proposed algorithm.READ FULL TEXT VIEW PDF
Approximate Bayesian Computation (ABC) are likelihood-free Monte Carlo
Performing exact posterior inference in complex generative models is oft...
Approximate Bayesian computation (ABC) has become an essential part of t...
We develop and apply two calibration procedures for checking the coverag...
We consider the fundamental problem of how to automatically construct su...
The distributions underlying complex datasets, such as images, text or
Approximate Bayesian Computation (ABC) enables statistical inference in
K2-ABC: Approximate Bayesian Computation with Kernel Embeddings
ABC is an approximate Bayesian inference framework for models with intractable likelihoods. Originated in population genetics, ABC has been widely used in a broad range of scientific applications including ecology, cosmology, and bioinformatics [2, 3, 4]
. The goal of ABC is to obtain an approximate posterior distribution over model parameters, which usually correspond to interpretable inherent mechanisms in natural phenomena. However, in many complex models of interest, exact posterior inference is intractable since the likelihood function, the probability of observationsfor given model parameters , is either expensive or impossible to evaluate111Note that this is different from the intractability of the marginal likelihood, which requires integrating out from the likelihood function. While marginal likelihood is often computationally intractable, this issue is readily resolved via a suite of MCMC algorithms.. This leads to a situation where the posterior density cannot be evaluated even up to a normalisation constant.
ABC resorts to an approximation of the likelihood function using simulated observations that are similar to the actual observations. Most ABC algorithms evaluate the similarity between the simulated and actual observations in terms of a pre-chosen set of summary statistics. Since the full dataset is represented in a lower-dimensional space of summary statistics, unless the selected summary statistic is sufficient, ABC results in inference on the partial posterior , rather than the desired full posterior . Therefore, a poor choice of a summary statistic can lead to an additional bias that can be difficult to quantify and the selection of summary statistics is a crucial step in ABC [5, 6]. A number of methods have been proposed for constructing informative summary statistics by appropriate transformations of a set of candidate summary statistics: a minimum-entropy approach , regression from a set of candidate summary statistics to parameters , or a partial least-squares transformation with boosting [9, 10]. A review of these methods is given in 
. However, the obtained summary statistics are optimal only with respect to a given loss function (e.g., focuses on the quadratic loss) and may not suffice for full posterior inference. In addition, they still heavily depend on the initial choice of candidate summary statistics.
Another line of research, inspired by indirect inference framework, constructs the summary statistics using auxiliary models [11, 12]. Here, the estimated parameters of an auxiliary model become the summary statistics for ABC. A thorough review of these method is given in . The biggest advantage of this framework is that the dimensionality of the summary statistic can be determined by controlling the complexity of the auxiliary model using standard model selection techniques such as Bayesian information criterion (BIC) and Akaike information criterion (AIC). However, even if the complexity can be set in a principled way, one still needs to select the right parametric form of the auxiliary model. Furthermore, unless one uses an exponential family model, the dimensionality of the sufficient statistic will increase with the sample size according to the classical Pitman-Koopman-Darmois theorem (cf. e.g. ). Thus, for many complex models, the minimal sufficient statistic is effectively the full dataset.
In this light, we introduce a method that circumvents an explicit selection of summary statistics. Our method proceeds by applying a similarity measure to data themselves, via embedding [14, 15] of the empirical data distributions into an infinite-dimensional reproducing kernel Hilbert space (RKHS), corresponding to a positive definite kernel function defined on the data space. For suitably chosen kernels called characteristic 
, these embeddings capture all possible differences between distributions, e.g. all the high-order moment information that may be missed with a set of simple summary statistics. Thus, no information loss is incurred when going from the posterior given datato that given the embedding of data, . A flexible representation of probability measures given by kernel embeddings has already been applied to nonparametric hypothesis testing , inference in graphical models  and to proposal construction in adaptive MCMC . The key quantity arising from this framework is an easily computable notion of a distance between probability measures, termed Maximum Mean Discrepancy (MMD). When the kernel used is characteristic, a property satisfied by most commonly used kernels including Gaussian, Laplacian and inverse multiquadrics, embeddings are injective, meaning that the MMD gives a metric on probability measures. Here, we adopt MMD in the context of ABC as a nonparametric distance between empirical distributions of simulated and observed data. As such, there is no need to select a summary statistic first and the kernel embedding itself plays a role of a summary statistic. In addition to the positive definite kernel used for obtaining the estimates of MMD, we apply an additional Gaussian smoothing kernel which operates on the corresponding RKHS, i.e., on embeddings themselves, to obtain a measure of similarity between simulated and observed data. For this reason, we refer to our method as double-kernel ABC, or K2-ABC. Our experimental results in section 4 demonstrate that this approach results in an effective and robust ABC method which requires no hand-crafted summary statistics.
The rest of the paper is organised as follows. In section 2, we overview classical approaches (rejection and soft ABC) as well as several relevant recent techniques (synthetic likelihood ABC, semi-automatic ABC, indirect score ABC, and kernel ABC) to which we will compare our method in section 4. In section 3, we introduce our proposed algorithm. Experimental results including comparisons with methods discussed in section 2, are presented in section 4. We explore computational tractability of K2-ABC in section 5.
We start by introducing ABC and reviewing existing algorithms. Consider a situation where it is possible to simulate a generative model and thus sample from the conditional density , given a value of parameters, while the computation of the likelihood for the observed data is intractable. Neither exact posterior inference nor posterior sampling are possible in this case, as the posterior , for a prior , cannot be computed up to a normalizing constant. ABC uses an approximation of the likelihood obtained from simulation.
The simplest form of ABC is rejection ABC. Let be a similarity threshold, and be a notion of distance, e.g., a premetric on domain of observations. The rejection ABC proceeds by sampling multiple model parameters . For each , a pseudo dataset is generated from . The parameters for which the generated are similar to the observed , as decided by , are accepted. The result is an exact sample from the approximated posterior , where and . Choice of is crucial for the design of an accurate ABC algorithm. Applying a distance directly on dataset is often challenging, when the dataset consists of a large number of (possibly multivariate) observations.
Thus, one resorts to first choosing a summary statistic and comparing them between the datasets, i.e. . Since it is generally difficult to construct sufficient statistics for complex models, this will often “leak” information, e.g., if represents first few empirical moments of dataset . It is only when the summary statistic is sufficient, that this approximation is consistent as , i.e. that the ABC posterior will converge to the full posterior. Otherwise, ABC operates on the partial posterior , rather than the full posterior .
Another interpretation of the approximate likelihood is as the convolution of the true likelihood and the “similarity” kernel . Being the indicator of the -ball computed w.r.t. , this kernel imposes a hard-constraint leading to the rejection sampling. In fact, one can use any similarity kernel parametrised by which approaches delta function as . A frequently used similarity kernel takes the form
Such construction would result in a weighted sample , where , which can be directly utilised in estimating posterior expectations. That is, for a test function , the expectation is estimated using . This is an instance of a soft ABC, where parameter samples from the prior are weighted, rather than accepted or rejected.
Introduced in 
, the synthetic likelihood ABC models simulated data in terms of their summary statistics and further assumes the summary statistics have multivariate normal distribution,, with the empirical mean and covariance defined by
denotes the vector of summary statistics of theth simulated dataset. Using the following similarity kernel to measure the distance from the summary statistics of actual observations , the resulting synthetic likelihood is given by
Relying on the synthetic likelihood, SL-ABC algorithm performs MCMC sampling based on Metropolis-Hastings accept/reject steps with the acceptance probability given by , where is a proposal distribution.
Under a quadratic loss,  shows that the optimal choice of the summary statistics is the true posterior means of the parameters – however, these cannot be calculated analytically. Thus,  proposed to use the simulated data in order to construct new summary statistics – estimates of the posterior means of the parameters – by fitting a vector-valued regression model from a set of candidate summary statistics to parameters. Namely, a linear model with the vector of candidate summary statistics used as explanatory variables (these can be simply or also include nonlinear functions of ) is fitted based on the simulated data . Here, assuming and , is a matrix of coefficients. The resulting estimates of can then be used as summary statistics in standard ABC by setting .
Based on an auxiliary model with a vector of parameters , the indirect score ABC uses a score vector as the summary statistic : When the auxiliary model parameters are set by the maximum-likelihood estimate (MLE) fitted with observed data , the score for the observed data becomes zero. Based on this fact, IS-ABC searches for the parameter values whose corresponding simulated data produce a score close to zero. The discrepancy between observed and simulated data distributions in IS-ABC is given by
where is the approximate covariance matrix of the observed score.
The use of a positive definite kernel in ABC has been explored recently in  (K-ABC) in the context of population genetics. In K-ABC, ABC is cast as a problem of estimating a conditional mean embedding operator mapping from summary statistics to corresponding parameters . The problem is equivalent to learning a regression function in the RKHSs of and induced by their respective kernels . The training set needed for learning the regression function is generated by firstly sampling from which by summarising each pseudo dataset into a summary statistic .
In effect, given a summary statistic corresponding to the observations , the learned regression function allows one to represent the embedding of the posterior distribution in the form of a weighted sum of the canonical feature maps where is a kernel associated with an RKHS . In particular, if we assume that is a linear kernel (as in ), the posterior expectation of a function is given by
where , is a kernel on , and is a regularization parameter. The use of a kernel on summary statistics implicitly transforms non-linearly, thereby increasing the representativeness of . Nevertheless, the need for summary statistics is not eliminated.
We first overview kernel MMD, a notion of distance between probability measures that is used in the proposed K2-ABC algorithm.
For a probability distributionon a domain , its kernel embedding is defined as , an element of an RKHS associated with a positive definite kernel . An embedding exists for any whenever the kernel is bounded, or if satisfies a suitable moment condition w.r.t. an unbounded kernel . For any two given probability measures and , their maximum mean discrepancy (MMD) is simply the Hilbert space distance between their embeddings:
where and . While simple kernels like polynomial of order capture differences in first moments of distributions, particularly interesting are kernels with a characteristic property 222A related notion of universality is often employed. , for which the kernel embedding is injective and thus MMD defined by such kernels gives a metric on the space of probability distributions. Examples of such kernels include widely used kernels such as Gaussian RBF and Laplacian. Being written in terms of expectations of kernel functions allows straightforward estimation of MMD on the basis of samples: given
, an unbiased estimator is given by
Further operations are possible on kernel embeddings - one can define a positive definite kernel on probability measures themselves using their representation in a Hilbert space. An example of a kernel on probability measures is ,  with . This has recently led to a thread of research tackling the problem of learning on distributions, e.g., . These insights are essential to our contribution, as we employ such kernels on probability measures in the design of the K2-ABC algorithm which we describe next.
The first component of K2-ABC is a nonparametric distance between empirical data distributions. Given two datasets and consisting of i.i.d. observations333The i.i.d. assumption can be relaxed in practice, as we demonstrate in section 4 on time series data., we use MMD to measure the distance between : , i.e. is an unbiased estimate of between probability distributions and used to generate and . This is almost the same as setting empirical kernel embedding to be the summary statistic. However, in that case would have been a biased estimate of the population . Our choice of is guaranteed to capture all possible differences (i.e. all moments) between and whenever a characteristic kernel is employed , i.e. we are operating on a full posterior and there is no loss of information due to the use of insufficient statistics.
Further, we introduce a second kernel into the ABC algorithm (summarised in Algorithm 1), the one that operates directly on probability measures, and compute the ABC posterior sample weights,
with a suitably chosen parameter . Now, the datasets are compared using the estimated similarity between their generating distributions. There are two sets of parameters in K2-ABC, parameters of kernel (on original domain) and in the kernel (on probability measures).
K2-ABC is readily applicable to non-Euclidean input objects if a kernel can be defined on them. Arguably the most common application of ABC is to genetic data. Over the past years there have been a number of works addressing the use of kernel methods for studying genetic data including  which considered genetic association analysis, and  which studied gene-gene interaction. Generic kernels for strings, graphs and other structured data have also been explored .
We start by illustrating how the choice of summary statistics can significantly affect the inference result, especially when the summary statistics are not sufficient. We consider a symmetric Dirichlet prior and a likelihood
given by a mixture of uniform distributions as
The model parameters are a vector of mixing proportions. The goal is to estimate where is generated with true parameter (see Fig. 1
A). The summary statistics are chosen to be empirical mean and variance i.e..
We compare three ABC algorithms: K2-ABC, rejection ABC, and soft ABC. Here, soft ABC refers to an ABC algorithm which uses a similarity kernel in Eq. 1 with and . For K2-ABC, a Gaussian kernel defined as is used where is set to . We test different values of on a coarse grid, and report the estimated which is closest to as measured with a Euclidean distance.
The results are shown in Fig. 1 where the top row shows the estimated from each method, associated with the best as reported in the third row. The second row of Fig. 1, from left to right, shows and realizations of drawn from obtained from the three algorithms. In all cases, the mean and variance of the drawn realizations match that of . However, since the first two moments are insufficient to characterise , there exists other that can give rise to the same , which yields inaccurate posterior means shown in the top row. In contrast, K2-ABC taking into account infinite-dimensional sufficient statistic correctly estimates the posterior mean.
As an example of statistical inference for ecological dynamic systems, we use observations on adult blowfly populations over time introduced in . The population dynamics are modelled by a discretised differential equation:
where an observation at time is denoted by which is determined by time-lagged observations and
as well as Gamma distributed noise realisationsand . Here, the parameters are . We put broad Gaussian priors on log of parameters as shown in Fig. 2A. Note that the time series data given the parameters drawn from the priors vary drastically (see Fig. 2B), and therefore inference with those data is very challenging as noted in .
The observation (black trace in Fig. 2B) is a time-series of length , where each point in time indicates how many flies survive at each time under food limitation. For SL-ABC and K-ABC, we adopted the custom 10 summary statistics used in : the log of the mean of all quantiles of (four statistics), the mean of quantiles of the first-order differences of (four statistics), and the maximal peaks of smoothed , with two different thresholds (two statistics). For IS-ABC, following 
, we use a Gaussian mixture model with three components as an auxiliary model. In addition, we ran two versions of SA-ABC algorithm on this example: SAQ regresses to simulated parameters from the corresponding simulated instances of time-series appended with the quadratic terms , i.e.,, whereas SA-custom uses the above custom 10 summary statistics from  appended with their squares as the candidate summary-statistics vector (which is thus 20-dimensional in this instance).444For SL-ABC, we used the Python implementation by the author of . For IS-ABC, we used the MATLAB implementation by the author of . For SA-ABC, we used the R package abctools written by the author of .
For setting and kernel parameters in K2-ABC, we split the data into two sets: training ( of data points) and test (the rest) sets. Using the training data, we ran each ABC algorithm given each value of and kernel parameters defined on a coarse grid, then, computed test error555We used Euclidean distance between the histogram (with 10 bins) of test data and that of predictions made by each method. We chose the difference in histogram rather than in the realisation of itself, to avoid the error due to the time shift in . to choose the optimal values of and kernel parameters in terms of the minimum prediction error. Finally, with the chosen and kernel parameters, we ran each ABC algorithm using the entire data.
We show the concentrated posterior mass after performing K2-ABC in Fig. 2A, as well as an example trajectory drawn from the inferred posterior mean in Fig. 2B 666Note that reproducing the trajectory exactly is not the main goal of this experiment. We show the example trajectory to give the readers a sense of what the trajectory looks like.. To quantify the accuracy of each method, we compute the Euclidean distance between the vector of chosen summary statistics for the observed data and for the simulated data given the estimated posterior mean of the parameters. As shown in Fig. 3, K2-ABC outperforms other methods, although SL-ABC, SA-custom, and K-ABC all explicitly operate on this vector of summary statistics while K2-ABC does not. In other words, while those methods attempt to explicitly pin down the parts of the parameter space which produce summary statistics similar to , insufficiency of these summary statistics affects the posterior mean estimates undesirably even with respect to that very metric.
In K2-ABC, given a dataset and pseudo dataset with observations each, the cost for computing is . For pseudo datasets, the total cost then becomes , which can be prohibitive for a large number of observations. Since computational tractability is among the core considerations for ABC, in this section, we examine the performance of K2-ABC with different MMD approximations which reduce the computational cost.
The unbiased linear-time MMD estimator presented in [17, section 6] reduces the total cost to at the price of a higher variance. Due to its computational advantage, the linear-time MMD has been successfully applied in large-scale two-sample testing 
as a test statistic. The original linear-time MMD is given by
Note that we have assumed the same number of observations from and . The estimator is constructed so that the independence of the summands allows derivation of its asymptotic distribution and the corresponding quantile computation needed for two-sample testing. However, since we do not require such independence, we employ a linear-time estimator with a larger number of summands, which also does not require . Without loss of generality, we assume Denote for , i.e., we allow a cyclic shift through the smaller dataset . The linear-time MMD estimator that we propose is
which is an unbiased estimator of . The total cost of the resulting K2-ABC algorithm is .
Another fast linear MMD estimator can be achieved by considering an approximation to the kernel function with an inner product of finite dimensional feature vectors where and is the number of features. Given the feature map such that, , MMD can be approximated as
A straightforward (biased) estimator is
which can be computed in ), i.e., linear in the sample size, leading to the overall cost of .
Given a kernel , there are a number of ways to obtain such that . One approach which became popular in recent years is based on random Fourier features  which can be applied to any translation invariant kernel. Assume that is translation invariant i.e., for some function . According to Bochner’s theorem , can be written as
where and due to positive-definiteness of
, its Fourier transformis nonnegative and can be treated as a probability measure. By drawing random frequencies and , can be approximated with a Monte Carlo average. It follows that and . Note that a Gaussian kernel corresponds to normal distribution .
We employ the linear-time and the random Fourier feature MMD estimators in our K2-ABC algorithm, which we call K2-lin and K2-rf, respectively, and test these variants on the blowfly data. For K2-rf, we used random features.
We investigated the feasibility of using MMD as a discrepancy measure of samples from two distributions in the context of ABC. Via embeddings of empirical data distributions into an RKHS, we effectively take into account infinitely many implicit features of these distributions as summary statistics. When tested on both simulated and real-world datasets, our approach obtained more accurate posteriors, compared to other methods that rely on hand-crafted summary statistics.
While any choice of a characteristic kernel will guarantee infinitely many features and no information loss due to the use of partial posteriors, we note that the kernel choice is nonetheless important for MMD estimation and therefore also for the efficiency of the proposed K2-ABC algorithm. As widely studied in the RKHS literature, the choice should be made to best capture characteristics of given data, i.e., by utilising domain-specific knowledge. For instance, when some data components are believed a priori to be on different scales, one can adopt the automatic relevance detemination (ARD) kernel instead of the Gaussian kernel. Formulating explicit efficiency criteria in the context of ABC and optimizing over kernel choice, similarly as in the context of two-sample testing , would be an essential extension.
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.