DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression
Performing exact posterior inference in complex generative models is often difficult or impossible due to an expensive to evaluate or intractable likelihood function. Approximate Bayesian computation (ABC) is an inference framework that constructs an approximation to the true likelihood based on the similarity between the observed and simulated data as measured by a predefined set of summary statistics. Although the choice of appropriate problem-specific summary statistics crucially influences the quality of the likelihood approximation and hence also the quality of the posterior sample in ABC, there are only few principled general-purpose approaches to the selection or construction of such summary statistics. In this paper, we develop a novel framework for this task using kernel-based distribution regression. We model the functional relationship between data distributions and the optimal choice (with respect to a loss function) of summary statistics using kernel-based distribution regression. We show that our approach can be implemented in a computationally and statistically efficient way using the random Fourier features framework for large-scale kernel learning. In addition to that, our framework shows superior performance when compared to related methods on toy and real-world problems.READ FULL TEXT VIEW PDF
Complicated generative models often result in a situation where computin...
Approximate Bayesian Computation (ABC) are likelihood-free Monte Carlo
Approximate Bayesian inference on the basis of summary statistics is
Approximate Bayesian computation (ABC) provides us with a way to infer
Approximate Bayesian computation (ABC) is a simulation-based likelihood-...
Comparing competing mathematical models of complex natural processes is ...
Approximate Bayesian Computation (ABC) now serves as one of the major
DR-ABC: Approximate Bayesian Computation with Kernel-Based Distribution Regression
Complex generative models arise in many application domains, e.g. when we are interested in modeling population dynamics in ecology Wood (2010); Lopes & Boessenkool (2010), performing phylogenetic inference and disease dynamics modeling in epidemiology Poon (2015); Tanaka et al. (2006) or modeling galaxy demographics in cosmology Weyant et al. (2013); Cameron & Pettitt (2012). In these models, it is often difficult or impossible to perform exact posterior inference due to an expensive to evaluate or intractable likelihood function. Approximate Bayesian computation (ABC) Beaumont et al. (2002)
is an inference framework that approximates the true likelihood based on the similarity between the observed and simulated data as measured by a predefined set of summary statistics. Unless the chosen summary statistics are sufficient, an information loss associated with the projection of the data onto the lower-dimensional subspace of the summary statistics occurs. This results in an approximation bias in the likelihood and subsequently in the posterior sample that is difficult to estimate. More precisely, this information loss implies that ABC performs inference on the partial posterior of the model parameters given the summary statistics of the observed datain lieu of doing it on the full posterior . Thus, the choice of appropriate problem-specific summary statistics is of crucial importance for the quality of posterior inference in ABC.
Several methods exist in the literature for the selection or construction of summary statistics. A number of these methods can be assembled around the idea of constructing summary statistics by linear or non-linear regression from the full dataset or a set of candidate statistics. In addition to considerations about the sufficiency of summary statistics, all of these methods require either expert knowledge for the selection of the set of candidate statistics, e.g.Nakagome et al. (2013), or perform complex and high-dimensional regression by using the full dataset, e.g. Fearnhead & Prangle (2012). Other examples of this approach include Blum & François (2010), Boulesteix & Strimmer (2007) and Wegmann et al. (2009).
In this paper, we develop a novel framework for the construction of appropriate problem-specific summary statistics. Following the approach of Fearnhead & Prangle (2012), we want to derive summary statistics that will allow inference about certain parameters of interest to be as accurate as possible. Thus, we study loss functions and reason about the optimality of summary statistics in terms of minimizing appropriate instances of these functions. In particular, we model the functional relationship between data distributions and the optimal choice of summary statistics using kernel-based distribution regression Szabó et al. (2015). In order to properly account for the nature of the data, we take a two-step approach to distribution regression. Additionally, we present two different variants of our framework to account for the diverse structural properties present in real-world data.
In the first variant of our framework, we assume that all aspects of the data are important for estimating the optimal summary statistics, i.e. we model the full marginal distribution of the data and regress from it into the space of optimal summary statistics. First, we embed the empirical distributions of newly simulated data via the mean embedding into a reproducing kernel Hilbert space (RKHS). For this embedding, we choose a characteristic kernel Sriperumbudur et al. (2011)
to ensure that no information from the data is lost. We then regress from these embeddings to the optimal choice of summary statistics with kernel ridge regressionFriedman et al. (2001). The space of candidate regressors is thus another RKHS of functions whose domain is the space of mean-embedded data distributions. For the construction of this RKHS, one can use a simple linear kernel or more flexible kernels defined on distributions such as those described in Christmann & Steinwart (2010). The learned regression function can then be used as the summary statistics within ABC.
For the second variant of our framework, we assume that only certain aspects of the data have a direct influence on the parameter of interest in ABC and thus we restrict our attention to modeling the functional relationship between these aspects of the data and the optimal summary statistics. In particular, we assume that the observed data can be decomposed into important and auxiliary components such that the parameter of interest depends on the auxiliary components of the data only through the family of induced conditional distributions of the important components of the data given the auxiliary ones. In order to model the functional relationship between conditional distributions and the optimal choice of summary statistics, we embed these distributions with a conditional embedding operator Song et al. (2013) into an RKHS and use kernel ridge regression to regress from the space of conditional embedding operators into the space of optimal summary statistics. The space of candidate regressors is thus another RKHS defined on the space of bounded linear operators between RKHS defined on the auxiliary and important components of the data, respectively. For the construction of this RKHS, one can use a simple linear kernel or any kernel given in terms of the Hilbert-Schmidt operator norm on the difference of operators. As before, the learned regression function can be used as the summary statistics within ABC.
In this paper, we specifically study the choice of summary statistics and use a simple rejection sampling mechanism. While more complex sampling mechanisms are possible, we take this particular approach in order to decouple the influence of these two important components of ABC on the quality of posterior inference. The rest of the paper is organized as follows. Section 2 gives an overview of related work, while section 3 introduces and discusses our framework. Experimental results on toy and real-world problems, and a comparison with related methods are given in section 4. Section 5 concludes.
Most existing methods for the selection or construction of informative summary statistics can be grouped into three categories. A first category assembles methods that first perform best subset selection in a set of candidate statistics according to various information-theoretic criteria and then use this subset as the summary statistics within ABC. In particular, optimal subsets are selected according to, e.g. a measure of sufficiency Joyce & Marjoram (2008), entropy Nunes & Balding (2010) and AIC/BIC Blum et al. (2013).
A second category consists of methods that construct summary statistics from auxiliary models. An example of this approach is indirect score ABC Gleim & Pigorsch (2013)
. Here, a score vector that is calculated from the partial derivatives of the auxiliary likelihood plays the role of the summary statistics. Motivated by the fact that the score of the observed data is zero when the auxiliary model parameters are set by maximum-likelihood estimation (MLE), the method searches the parameter space for values whose simulated data produce a score close to zero under the same auxiliary model parameters. Thus, the discrepancy measure between the observed and simulated data is defined in terms of scores of the simulated data at the parameter values estimated with MLE from the observed data. A detailed review of this class of approaches can be found inDrovandi et al. (2015).
A third, and last, category is comprised of methods that construct summary statistics using regression from either the full dataset or a set of candidate statistics, e.g. Fearnhead & Prangle (2012). Aeschbacher et al. (2012) provides a general overview of such approaches, while we discuss the aforementioned method in more detail. The semi-automatic ABC (SA-ABC) method Fearnhead & Prangle (2012) focuses on deriving summary statistics that will allow inference about certain parameters of interest to be as accurate as possible. Fearnhead & Prangle (2012) focus on the construction of summary statistics that allow inference to be accurate with respect to a specific loss function. They show that the true posterior mean of the model parameters is the optimal choice of summary statistics under the quadratic loss function. As this quantity cannot be analytically calculated, they estimate it by fitting a regression model from simulated data. In particular, given simulated data , a linear model is fitted; here, is taken to be either the identity function or a power function. The resulting estimates are used as the summary statistics in ABC.
In the literature, there are a few other methods that are not strictly aligned with the above categorization. Here, we review three such methods – synthetic likelihood ABC Wood (2010), K-ABC Nakagome et al. (2013) and K2-ABC Park et al. (2015). Synthetic likelihood ABC Wood (2010)
assumes that the summary statistics follow a multivariate normal distribution and uses plug-in estimates for the mean and covariance parameters. In order to generate posterior samples, the method utilizes MCMC with a synthetic likelihood that is derived by convolving the fitted distribution of the summary statistics with a Gaussian kernel that measures the similarity between the observed and simulated data via the fitted summary statistics. On the other hand, K-ABCNakagome et al. (2013) and K2-ABC Park et al. (2015) use the RKHS framework in connection with ABC, albeit in a different fashion than our framework. K-ABC regresses from already chosen summary statistics to posterior expectations of interest, i.e. it estimates a conditional mean embedding operator mapping from to the corresponding model parameters . While the use of a kernel on the summary statistics increases their representative power, the method does not eliminate the challenge of selecting these summary statistics. A potential solution to this shortcoming could be to choose the whole dataset to regress with, i.e. use . This differs from our approach in two ways. First, the choice of an appropriate kernel that can be defined directly on the data is not straightforward. Our approach does not suffer from this shortcoming since we treat datasets as bags of samples. Second, instead of performing regression to estimate posterior expectations, we utilize it to calculate summary statistics that can be used within ABC. This decouples the regression model from the actual ABC method and thus, does not limit the number of samples that can be used within ABC, i.e. it allows for an arbitrary large number of samples to be drawn after performing the regression step. The K-ABC method has recently been used in HIV epidemiology Poon (2015). On the other hand, K2-ABC embeds the empirical data distributions into an RKHS via the mean embedding and uses a dissimilarity measure on the space of these embeddings to assess the similarity between the observed and simulated data. In particular, the maximum mean discrepancy is used as the dissimilarity measure on the space of the mean-embedded data distributions, and an exponential smoothing kernel is utilized to compute the ABC posterior sample weights. In contrast to the methods discussed above, there is no explicit construction or selection of summary statistics, but rather the summary statistics are given implicitly as the mean embeddings of the empirical data distributions into an possibly infinite dimensional RKHS. Our framework is different from this method in that it performs an additional step and regresses from the mean embeddings to the space of summary statistics optimal with respect to a loss function.
In this section, we introduce and discuss the novel framework of ABC with kernel-based distribution regression (DR-ABC) and review some of its important building blocks.
Given a probability distributiondefined on a non-empty set , the mean embedding of , , is an element of the RKHS defined by the kernel . For two probability distributions and , the maximum mean discrepancy (MMD) between and is defined as
with and . Given samples and
, an unbiased estimate of the MMD can be computed as
Distribution Regression. The goal of distribution regression is to establish a functional relationship between probability distributions over a given set and real-valued (possibly multidimensional) responses. In particular, given data drawn i.i.d. from a meta distribution defined on the product space of responses and probability distributions on the space of observations, we are interested in capturing this data-generating mechanism with a regression model and predicting new responses given new distributions .
In this setting, one major challenge arises due to the fact that the probability distributions are not observed directly, but are available only in terms of their i.i.d. samples. In particular, the data is given as with and the underlying sample space. Thus, one is interested in predicting new responses given a new bag of samples . This particularity makes regressing directly from the space of probability distributions to the response space difficult as one has to capture the two-stage sampled nature of the data in one functional relationship.
If we take the kernel ridge regression approach, then the functional relationship between and is modeled as an element from the RKHS of functions mapping from to with the kernel defined on . In order to properly account for the two-stage sampled nature of the data, we take a two-stage approach to distribution regression. First, a distribution is mapped via the mean embedding into the RKHS defined by the kernel . Second, this result is composed with an element from the RKHS defined by the kernel , where is the image of under the mean embedding. Finally, this yields and with such that and for every , i.e. we treat every dimension of separately. Taking the classical regularization approach, the solution of kernel ridge regression can be calculated as
with , and the regularization parameter. Given a new in terms of samples , a prediction for can be calculated in the following way
Distribution Regression from Conditional Distributions. Often only certain aspects of the data are assumed to have a direct influence on the response, especially in hierarchical or spatio-temporal modeling, and thus one might be interested in modeling only these aspects of the data. This motivates a decomposition of the data as with encoding the important aspects of the data (for the inference task at hand) and describing the rest of the information that we are not interested in modeling explicitly (e.g. this could correspond to locations on a grid where the observations are recorded).111In particular, , and . In other words, we assume that depends on only through the induced conditionals , and thus the problem of distribution regression reduces to the question of modeling the functional relationship between the induced family of conditional distributions and the response , i.e. learning a map from the set of functions into the response space . In order to ensure that all the necessary mathematical objects exist and are well-defined, we make the following assumptions:
is a Polish space with the associated Borel -algebra,
is a measurable space with the associated -algebra,
kernel is bounded and can be factorized as , and
for all .
In contrast to distribution regression from joint distributions, here we are interested in simultaneously embedding whole families of conditional distributions into an RKHS. To achieve this, we model this embedding as a function
i.e. for every , the embedding of the conditional distribution of given is a function in the RKHS . Following the approach of Song et al. (2013), we encode the embedding of with the conditional embedding operator , where
Thus, the embedding of a family of conditional distributions is modeled as an operator between RKHS defined on and , respectively. Next, we regress from the space of conditional embedding operators, i.e. from the space of bounded linear operators , into using kernel ridge regression. For this purpose, we define a kernel to measure the similarity between different conditional embedding operators. Typical choices for this kernel include the linear kernel or any other kernel given in terms of with the Hilbert-Schmidt operator norm. Finally, putting all the building blocks together, the solution of kernel ridge regression can be computed as
where with , , , and is the regularization parameter. Given a new distribution in terms of samples with , a prediction for can be calculated as
Our framework, ABC with kernel-based distribution regression, provides a novel approach to the construction of appropriate problem-specific summary statistics. Motivated by Fearnhead & Prangle (2012), we study loss functions and use simulated data to construct approximations to optimal summary statistics with respect to these loss functions. While any loss function can be used in our framework,222For loss functions not admitting closed-form solutions for the argument of their minimum, numerical optimization techniques might need to be used. we focus on the quadratic loss function with the true value of the parameter of interest. Given simulated data , we regress into the space of optimal summary statistics, i.e. into the parameter space in the case of quadratic loss, with kernel-based distribution regression. As discussed in the previous section, we study two different variants of our framework – full and conditional DR-ABC – to account for the diverse structural properties present in real-world data.
Full DR-ABC: In this variant of the DR-ABC framework, we assume that all aspects of the data are important for estimating the parameter of interest, i.e. we model the complete marginal distribution of the data and regress from it into the parameter space. In particular, we first embed the empirical distributions of the simulated data via the mean embedding into the RKHS defined by the kernel . Next, we define a second RKHS via the kernel ,
with the kernel bandwidth. This kernel provides a dissimilarity measure on the space of mean embeddings. Third, we perform kernel ridge regression from into the parameter space with as the space of candidate regressors. Finally, for a particular given in terms of a sample , the approximated optimal summary statistics are equal to Equation 1, i.e. the value of the fitted distribution regression function evaluated at the empirical distribution of that sample.
Conditional DR-ABC: Unlike full DR-ABC, here we assume that only certain aspects of the data have a direct influence on the parameter of interest. Thus, we restrict our attention to modeling the functional relationship between these aspects of the data and the parameter of interest. First, we identify the important and auxiliary aspects of the data, i.e. we decompose the simulated data as . Second, for every , we encode the embedding of the induced family of conditional distributions with the conditional embedding operator , where and are RKHS defined on and , respectively, and and are the corresponding kernels. Third, we define a new RKHS via the kernel ,
This kernel defines a dissimilarity measure on the space of conditional embedding operators. Next, we perform kernel ridge regression from this space into the parameter space and use the newly constructed RKHS as the space of candidate regressors. Finally, the fitted distribution regression function can be used as a summary statistics within ABC; the approximated optimal summary statistics of a new dataset are given by Equation 2.
Application to ABC: Having learned a regression model, we can now perform ABC. First, we sample points from the prior and generate the corresponding datasets . Depending on the inference task at hand and the structural properties of the data, we may or may not perform a splitting of the data, i.e. . In order to assess the similarity between the observed and simulated data, we estimate the optimal summary statistics for each dataset and compare these approximations via a smoothing kernel that defines a dissimilarity measure on the parameter space. In particular, we calculate one of the following
depending on whether we are in the setting of full or conditional DR-ABC, respectively. Here, and , and and are the empirical data distributions and conditional embedding operators of the observed and newly simulated data, respectively.
Putting together kernel-based distribution regression and ABC as discussed above, the following algorithms summarize the two different variants of the DR-ABC framework.
Computational complexity. Assuming that both the size of the observed and simulated datasets is , the cost of computing between two bags of samples or computing for any is . Given and as the number of simulated datasets for (conditional) distribution regression and ABC, respectively, the total computational cost of fitting the regression model and running ABC is in both full and conditional DR-ABC. In order to mitigate the large computational cost of our two methods, we apply the popular large-scale kernel learning framework of random Fourier features (RFF) Rahimi & Recht (2007). This framework has successfully been applied in several contexts Chitta et al. (2012); Huang et al. (2013), extended Le et al. (2013); Yang et al. (2014) and thoroughly analyzed Bach (2015); Sutherland & Schneider (2015); Sriperumbudur & Szabo (2015). The context most similar to ours is that of Jitkrittum et al. (2015) where two layers of random Fourier features are applied in connection with distribution regression, albeit in the context of emulating Expectation Propagation messages. Using random Fourier features, we approximate the potentially infinite-dimensional feature maps that figure in the computation of kernel functions with finite-dimensional vectors. This implies that kernel evaluations can be approximated by inner products of these finite-dimensional features. Using random Fourier features in each layer of approximation, we get a significantly reduced computational cost of for full DR-ABC. For conditional DR-ABC, the computational cost can be reduced to . In our experiments, we use and the following RFF expansion
Due to a result from Bach (2015), a comparatively small number of random Fourier features can be used even for large datasets since the number of random Fourier features needed for good approximations of kernel ridge regression solutions often scales sublinearly with the number of observations. Nevertheless, the dependence of the required number of random Fourier features on the number of datasets and the number of observations within each dataset, particularly in settings such as ours where there are two layers of random Fourier features, is not yet fully understood.
Toy example. The first problem we study is the following Gaussian hierarchical model
This simple example serves as a proof of concept for our framework. In this model, the parameter of interest is , and our goal is to estimate with the observed dataset. In our experiments, we compare our two methods, full and conditional DR-ABC, against SA-ABC and K2-ABC. We specifically compare our framework against these methods as they are examples illustrating the regression and RKHS approach to the construction of summary statistics. For the performance metric, we calculate the mean square error (MSE) of the parameter of interest on synthetic data. In particular, we set and generate observations given this parameter value as ; for every newly simulated dataset, we also draw datapoints. For full DR-ABC, we take the kernels and as Gaussian kernels, while for conditional DR-ABC, is a Gaussian kernel and
is a linear kernel. The hyperparameters in the two DR-ABC methods are set via five-fold cross-validation on appropriately defined grids. For the grids of the different kernel bandwidth parameters, we multiply the respective median heuristicsReddi et al. (2014) with a set of ten equally spaced points between and . For and , we choose the grids by exponentiating to the powers given by ten equally spaced points between and . In order to account for the randomness in the generative process, we run each of the methods times and display the mean of MSE across the different runs.
Figure 1 describes the performance of our chosen methods across different numbers of particles used in the (conditional) distribution regression phase (for full and conditional DR-ABC) and in the ABC phase (for all four methods). In order to achieve comparable results, we use the same number of particles in the regression phase as in the ABC phase for SA-ABC.
K2-ABC exhibits a fairly stable reconstruction error for different numbers of ABC particles and outperforms SA-ABC only when relatively small numbers of ABC particles are used. Across the wide spectrum of the number of ABC particles, we see both conditional and full DR-ABC outperforming K2-ABC by a large margin. While full DR-ABC usually also outperforms SA-ABC, conditional DR-ABC does this consistently by a large margin.
Ecological Dynamical Systems. Many ecological problems have an intractable likelihood due to a dynamic generative process and thus rely on ABC for posterior inference. Deriving appropriate summary statistics in this setting is quite challenging as the dependence structure of the data needs to be appropriately taken into account. As an example of an ecological system with a dynamic generative process, we examine the problem of inferring the dynamics of the adult blowfly population as introduced in Wood (2010). In particular, the population dynamics are modeled by the following discretised differential equation
with denoting the observation at time which is determined by the time-lagged observations and
, and the Gamma distributed noise variablesGam and Gam. The parameters of interest in this model are . As before, we compare our framework with SA-ABC and K2-ABC. The performance metric and the kernel and hyperparameter selection is done in the same way as in the previous example.
From Figure 2, we that our methods outperform SA-ABC by a large margin across the whole spectrum of test situations. Full DR-ABC displays competitive performance to K2-ABC, even outperforming it in certain instances by a large margin. On the other hand, conditional DR-ABC outperforms K2-ABC in all test situations; in some of these situations, the performance of our method is massively superior.
Lotka-Volterra Model. Another popular ecological model in which posterior inference is difficult is the Lotka-Volterra model Lotka (1925); Volterra (1927). This model describes the dynamics of biological systems in which two species interact in a predator-prey relationship.
where are the number of prey and predators, respectively, are positive real parameters describing the interaction of the two species, denotes time and , are the respective growth rates. In addition to the dynamical nature of the generative process, the interaction between the two species makes deriving informative summary statistics even more challenging. The parameters of interest in this model are . As in the previous two experiments, we compare our framework with SA-ABC and K2-ABC and use the same performance metric, kernels and hyperparameter selection procedure.
From Figure 3
, we see that our framework outperforms competing methods by a large margin. While for small numbers of ABC particles, full DR-ABC seems to perform better, for large numbers of ABC particles, conditional DR-ABC slightly outperforms full DR-ABC with a clear downward trend in the error for higher numbers of ABC particles. As for SA-ABC, the method cannot directly be applied to this problem due to the high correlation in the observations which leads to a regression problem that is ill conditioned. In order to mitigate this, we performed principal component analysis and used an approximation of the data matrix given by the firstprincipal components in SA-ABC. The resulting errors are 1–2 orders of magnitude larger than those displayed in the figure and are thus excluded from it for clarity.
In this paper, we developed a novel framework for the construction of informative problem-specific summary statistics using the flexible and expressive setting of reproducing kernel Hilbert spaces. We introduced two different approaches based on embeddings of probability distributions and kernel-based distribution regression. Our proposed framework has several advantages over previous general-purpose and semi-automatic summary statistics construction methods. First, by using the flexible RKHS framework, we are able to regulate the kind and amount of information that is extracted from the data and thus construct more informative problem-specific summary statistics, as opposed to mandating an ad hoc selection of a limited set of candidate statistics or postulating heuristic summary statistics which inevitably leads to a hard to evaluate approximation bias in the likelihood and subsequently in the posterior sample. Moreover, our framework compactly encodes the extracted information. Second, due to the modeling flexibility of our framework, we are able to appropriately account for different structural properties present in real-world data. Third, our methods can be implemented in a computationally and statistically efficient way using the random Fourier features framework for large-scale kernel learning. Fourth, our framework can be easily extended to any object class on which the embedding kernel(s) can be defined. Examples of such object classes include genetic data Wu et al. (2010), phylogenetic trees Poon (2015), strings, graphs and other structured data Gärtner (2003). Fifth, although there are multiple sets of hyperparameters in each of our methods, their selection can be performed in a principled way via cross-validation. From experimental results on toy and real-world problems, we see that our framework substantially reduces the bias in the posterior sample achieving superior performance when compared to related methods that used for the construction of summary statistics in ABC.
Proceedings of the International Conference on Machine Learning, 2013.
Efficient approximate bayesian computation coupled with markov chain monte carlo without likelihood.Genetics, 182(4):1207–1218, 2009.