1 Introduction
Consider the problem of estimating the posterior distribution of some model parameters
given data points. This distribution has a closedform expression given by the Bayes’ theorem up to a multiplicative constant:
For many statistical models of interest, the likelihood cannot be numerically evaluated in a reasonable amount of time, which prevents the application of classical likelihoodbased approximate inference methods. Nevertheless, in various settings, even if the associated likelihood is numerically intractable, one can still generate synthetic data given any model parameter value. This generative setting gave rise to an alternative framework of likelihoodfree inference techniques. Among them, Approximate Bayesian Computation (ABC, [1, 2]) methods have become a popular choice and have proven useful in various practical applications (e.g., [3, 4, 5]
). The core idea of ABC is to bypass calculation of the likelihood by using simulations: the exact posterior is approximated by retaining the parameter values for which the synthetic data are close enough to the observations. Closeness is usually measured with a discrepancy measure between the two datasets reduced to some ‘summary statistics’ (e.g., empirical mean or empirical covariance). While summaries allow a practical and efficient implementation of ABC, especially in highdimensional data spaces, the quality of the approximate posterior distribution highly depends on them and constructing sufficient statistics is a nontrivial task. Summary statistics can be designed by hand using expert knowledge, which can be tedious especially in realworld applications, or in an automated way, for instance see
[6].Recently, discrepancy measures that view data sets as empirical probability distributions to eschew the construction of summary statistics have been proposed for ABC. Examples include the KullbackLeibler divergence (KL,
[7]), maximum mean discrepancy [8], and Wasserstein distance [9]. This latter distance emerges from the optimal transport (OT) theory and has attracted abundant attention in statistics and machine learning due to its strong theoretical properties and applications on many domains. In particular, it has the ability of making meaningful comparisons even between probability measures with nonoverlapping supports, unlike KL. However, the computational complexity of the Wasserstein distance rapidly becomes a challenge when the dimension of the observations is large. Several numerical methods have been proposed during the past few years to speedup this computation. WassersteinABC (WABC,
[9]) uses an approximation based on the Hilbert spacefilling curve and termed the Hilbert distance, which is computationally efficient but accurate for small dimensions only. Besides, under a general setting, the Wasserstein distance suffers from a curse of dimensionality in the sense that the error made when approximating it from samples grows exponentially fast with the data space dimension
[10]. These computational and statistical burdens can strongly affect the performance of WABC applied to highdimensional data.The SlicedWasserstein distance (SW, [11, 12]) is an alternative OT distance and leverages the attractive property that the Wasserstein distance between onedimensional measures has an analytical form which can be efficiently approximated. SW is defined as an average of onedimensional Wasserstein distances, and therefore has a significantly better computational complexity. Several recent studies have reported empirical success on generative modeling with SW [13, 14, 15] as well as nice asymptotic and statistical properties [16, 17, 18], making this alternative distance increasingly popular. In this paper, we develop a novel ABC framework that uses SW as the data discrepancy measure. This defines a likelihoodfree method which does not require choosing summary statistics and is efficient even with highdimensional observations. We derive asymptotical guarantees on the convergence of the resulting ABC posterior, and we illustrate the superior empirical performance of our methodology by applying it on a synthetical problem and an image denoising task.
2 Background
Consider a probability space with associated expectation operator
, on which all the random variables are defined. Let
be a sequence of independent and identically distributed (i.i.d.) random variables associated with some observations valued in . Denote by the common distribution of and by the set of probability measures on . For any , denotes the empirical distribution corresponding to observations. Consider a statistical model parametrized by . We focus on parameter inference for purely generative models: for any , we can draw i.i.d. samples from , but the numerical evaluation of the likelihood is not possible or too expensive. For any , is the empirical distribution of i.i.d. samples generated by , . We assume that , endowed with the Euclidean distance , is a Polish space (i.e., complete and separable), , endowed with the distance , is a Polish space, parameters are identifiable, i.e. implies . denotes the Borel field of . Approximate Bayesian Computation. ABC methods are used to approximate the posterior distribution in generative models when the likelihood is numerically intractable but easy to sample from. The basic and simplest ABC algorithm is an acceptancerejection method [1], which iteratively draws a candidate parameter from a prior distribution , and ‘synthetic data’ from , and keeps if is close enough to the observations . Specifically, the acceptance rule is , where is a data discrepancy measure taking nonnegative values, is a tolerance threshold, and with small is a summary statistics. The algorithm is summarized in Pseudocode 1 and returns samples of that are distributed from:denote the quantile functions of
and respectively. For empirical onedimensional distributions, (3) can be efficiently approximated by simply sorting the samples drawn from each distribution and computing the average cost between the sorted samples. This amounts to operations at worst. WassersteinABC [9] is a variant of ABC (1) that uses , between the empirical distributions of the observed and synthetic data, in place of the discrepancy measure between summaries. To make this method scalable to any dataset size, [9] introduces a new approximation of (2), the Hilbert distance, which extends the idea behind the computation of in 1D to higher dimensions, by sorting samples according to their projection obtained via the Hilbert spacefilling curve. This alternative can be computed in , but yields accurate approximations only for low dimensions [9]. They also use a second approximation, the swapping distance, based on an iterative greedy swapping algorithm. However, each iteration requires operations, and there is no guarantee of convergence to . Input: observations , number of iterations , data discrepancy measure , summary statistics , tolerance threshold . for do repeat and i.i.d. until ; return3 SlicedWasserstein ABC
SlicedWasserstein distance. The analytical expression of in (3) motivates the formulation of an alternative OT distance, called the SlicedWasserstein distance [11, 12]. SW is obtained by reducing multidimensional distributions to onedimensional representations through linear projections, and then by averaging 1D Wasserstein distances between these projected distributions. More formally, we denote by the dimensional unit sphere, and by the Euclidean innerproduct. For any , is the linear form associated with such that for any . For , the SlicedWasserstein distance of order between is defined as,
(4) 
where
is the uniform distribution on
and for any measurable function and , is the pushforward measure of by , defined as: , , with . is a distance on [16] with significantly lower computational requirements than the Wasserstein distance: in practice, the integration in (4) is approximated with a finitesample average, using a simple Monte Carlo scheme.SW also seems to have better statistical properties than the Wasserstein distance and its approximations. We illustrate it with the task of estimating the scaling factor of the covariance matrix in a multivariate Normal model, as in the supplementary material of [9]. Denote for any , the
dimensional Gaussian distribution with zeromean and covariance matrix
. Observations are assumed i.i.d. from with , and we draw the same number of i.i.d. data from for 100 values of equispaced between 0.1 and 9. We then compute and between the empirical distributions of the samples, and the swapping and Hilbert approximations presented in [9], for and 1000 observations. between two Gaussian measures has an analytical formula, which boils down in our setting to: , and we approximate the exact SW using a Monte Carlo approximation of:This formula is derived from (4) and the exact between onedimensional Gaussian distributions. We also compute KL with the estimator proposed for KLbased ABC (KLABC, [7]). Figure 1 shows the distances plotted against for each . When the dimension increases, we observe that (i) as pointed out in [9], the quality of the approximation of empirical Wasserstein returned by Hilbert and swapping rapidly deteriorates, and (ii) SW, approximated using densities or samples, is the only approximate metric that attains its minimum at . This curse of dimensionality can be a limiting factor for the performance of WABC and KLABC in high dimensions.
SlicedWasserstein ABC. Motivated by the practical success of SW regardless of the dimension value in the previous experiment, we propose a variant of ABC based on SW, referred to as the SlicedWasserstein ABC (SWABC). Our method is similar to WABC in the sense that it compares empirical distributions, but instead of , we choose the discrepancy measure to be , . The usage of SW allows the method to scale better to the data size and dimension. The resulting posterior distribution, called the SWABC posterior, is thus defined in (1) with replaced by .
4 Theoretical Study
In this section, we analyze the asymptotic behavior of the SWABC posterior under two different regimes. Our first result concerns the situation where the observations are fixed, and goes to zero. We prove that the SWABC posterior is asymptotically consistent in the sense that it converges to the true posterior, under specific assumptions on the density used to generate synthetic data.
Proposition (Asymptotic consistency when , fixed).
Let . Suppose that has a density w.r.t. the Lebesgue measure such that is continuous and there exists satisfying and . In addition, assume that there exists such that , where . Then, with fixed, the SWABC posterior converges to the true posterior as goes to 0, in the sense that, for any measurable , , where is defined by (1).
Proof of Section 4.
The proofs consists in applying [9, Proposition 3.1], which establishes the conditions for the data discrepancy measure to yield an ABC posterior that converges to the true posterior in the asymptotic regime we consider. This amounts to verify that:

[wide, labelwidth=!, labelindent=0pt,label=()]

For any and , with respective empirical distributions and , if and only if .

is continuous in the sense that, if converges to in the metric , then, for any empirical distribution , , where is the empirical measure of .
Condition 1 follows from the fact that is a distance [16, Proposition 5.1.2]. Now, let and be a continuous function such that for any with . Since converges to in the metric and is continuous, we get that . This implies that weakly converges to in [23, Definition 6.7], which, by [23, Theorem 6.8], is equivalent to . By applying the triangle inequality and [16, Proposition 5.1.3], there exists such that, for any empirical measure ,
We conclude that , making condition 2 applicable. ∎
Next, we study the limiting SWABC posterior when the value of is fixed and the number of observations increases, i.e. . We suppose the size of the synthetic dataset grows to with , such that can be written as a function of , , satisfying . We show that, under this setting and appropriate additional conditions, the resulting approximate posterior converges to the prior distribution on restricted to the region .
Proposition (Asymptotic consistency when fixed, , ).
Let , and be an increasing sequence satisfying , for . Assume that the statistical model is well specified, i.e. there exists such that , and that almost surely the following holds:
(5) 
where , denote the empirical distributions of the observations and synthetic data respectively. Then, the SWABC posterior converges to the restriction of the prior on the region as , i.e. for any ,
Proof of Section 4.
The result follows from the application of [7, Theorem 1] to and the required conditions. ∎
Remark 1.
Condition (5) is a mild assumption, e.g. is fulfilled when is compact and separable: in this case, for any and its empirical instantiation over samples, almost surely [10], and by [16, Proposition 5.1.3], almost surely. Equation 5 then follows by applying the triangle inequality.
5 Experiments
Multivariate Gaussians. As a first set of experiments, we investigate the performance of SWABC on a synthetical setting where the posterior distribution is analytically available. We consider observations i.i.d. from a dimensional Gaussian , with and . The parameter is
for which the prior distribution is assigned to be an inverse gamma distribution
. Therefore, the posterior distribution of given and is .We compare SWABC against ABC using the Euclidean distance between sample variances (EuclideanABC), WABC with the Hilbert distance, WABC with the swapping distance and KLABC. Each ABC approximation was obtained using the sequential Monte Carlo samplerbased ABC method
[24], which is more computationally efficient than vanilla ABC (1) and implemented in the package pyABC [25]. We provide our code in [26]. Figure 2 reports for , the resulting ABC posteriors and to the true posterior vs. time. Due to the poor performance of the estimator of KL between two empirical distributions proposed in [7] (see Fig. 1), KLABC fails at approximating well the posterior in these experiments. Hence, we excluded those results from Fig. 2 for clarity. EuclideanABC provides the most accurate approximation as expected since it relies on statistics that are sufficient in our setting. WABC performs poorly with highdimensional observations, contrary to SWABC, which approximates well the posterior for each dimension value and is as fast.Image denoising. We now evaluate our approach on a real application, namely image denoising. We consider a widely used algorithm for this task, the NonLocal Means algorithm (NLmeans, [27]), and we present a novel variant of it derived from SWABC.
We formally define the denoising problem: let , denote a clean graylevel image. We observe a corrupted version of this image, , where is some noise in . The goal is to restore from . We focus on denoising methods that consider ‘patchbased representations’ of images, e.g. NLmeans. Specifically, let be a patch size and the set of pixel positions. For , denotes the pixel value at position in image , and is a window in centered at : for , , where is extended to by periodicity. Let be a dictionary of positions, and such that, for , , i.e. is the position in of the most similar patch to . For , an estimator of is given by , being the uniform distribution on . In practice, it is approximated with a Monte Carlo scheme:
(6) 
where , , and , are mutually independent. Finally, we construct an estimate of as follows: for any , . The classical NLmeans estimator corresponds to the case where (thus ) and for any and , , where is a search window.
In our work, we assume that the likelihood is not available, but we observe for , i.i.d. from . By replacing in (6) by the SWABC posterior, we obtain the proposed denoising method, called the SWABC NLmeans algorithm. We provide our Python implementation in [26].
We compare our approach with the classical NLmeans. We consider one of the standard image denoising datasets (cf. [28]), namely the CBSD68 dataset [29]. It consists of colored images of size
. We first convert the images to gray scale, then manually corrupt each image with a Gaussian noise with standard deviation
, and finally try to recover the clean image. The quality of the output images is evaluated with the Peak Signal to Noise Ratio (
) measure: . In all of our experiments, we use a dictionary of patches picked uniformly at random, and we set , , , .We report the average PSNR values for different values of the noise level in Table 1. We observe that for small , NLmeans provides more accurate results, whereas when becomes larger SWABC outperforms NLmeans, thanks to the patch representation and the use of the SW distance. On the other hand, another important advantage of SWABC becomes prominent in the computation time: the proposed approach takes on a standard laptop computer per image whereas the classical NLmeans algorithm takes . Indeed, the computational complexity of SWABC NLmeans is upperbounded by , whereas it is for the naïve implementation of NLmeans, where denotes the cardinal number of the set . By taking , we can directly observe that SWABC has a lower computational complexity since in practice. We note that the computation time of NLmeans can be improved by certain acceleration techniques, which can be directly used to improve the speed of SWABC NLmeans as well. Finally, in Figure 3, we illustrate the performance of SWABC on two images for visual inspection. The results show that the injected noise is successfully removed by the proposed approach.
NLmeans  

SWABC 
, we finetuned the hyperparameters of NLmeans and reported the best result.
6 Conclusion
We proposed a novel ABC method, SWABC, based on the SW distance. We derived asymptotic guarantees for the convergence of the SWABC posterior to the true posterior under different regimes, and we evaluated our approach on a synthetical and an image denoising problem. Our results showed that SWABC provides an accurate approximation of the posterior, even with highdimensional data spaces and a small number of samples. Future work includes extending SWABC to generalized SW distances [30].
Acknowledgments
This work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX project (ANR16CE230014) and by the Data Science & AI chair from Télécom Paris. Alain Durmus acknowledges support from Polish National Science Center grant: NCN UMO2018/31/B/ST1/00253.
References
 [1] S. Tavaré, D. J. Balding, R. C. Griffiths, and P. Donnelly. Inferring coalescence times from dna sequence data. Genetics, 145(2):505–518, 1997.
 [2] M. A. Beaumont, W. Zhang, and D. J. Balding. Approximate bayesian computation in population genetics. Genetics, 162(4):2025–2035, 2002.
 [3] G. Peters and S. Sisson. Bayesian inference, monte carlo sampling and operational risk. Journal of Operational Risk, 1, 09 2006.
 [4] M. Tanaka, A. Francis, F. Luciani, and S. Sisson. Using approximate bayesian computation to estimate tuberculosis transmission parameters from genotype data. Genetics, 173:1511–20, 08 2006.
 [5] S. N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, 8 2010.
 [6] P. Fearnhead and D. Prangle. Constructing summary statistics for approximate bayesian computation: semiautomatic approximate bayesian computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):419–474, 2012.

[7]
B. Jiang, T.Y. Wu, and W.H. Wong.
Approximate bayesian computation with kullbackleibler divergence as
data discrepancy.
In Amos Storkey and Fernando PerezCruz, editors,
Proceedings of the TwentyFirst International Conference on Artificial Intelligence and Statistics
, volume 84 of Proceedings of Machine Learning Research, pages 1711–1721, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR.  [8] M. Park, W. Jitkrittum, and D. Sejdinovic. K2abc: Approximate bayesian computation with kernel embeddings. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 398–407, Cadiz, Spain, 09–11 May 2016. PMLR.
 [9] E. Bernton, P. E. Jacob, M. Gerber, and C. P. Robert. Approximate bayesian computation with the wasserstein distance. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(2):235–269, 2019.
 [10] J. Weed and F. Bach. Sharp asymptotic and finitesample rates of convergence of empirical measures in wasserstein distance. Bernoulli, 25(4A):2620–2648, 11 2019.

[11]
J. Rabin, G. Peyré, J. Delon, and M. Bernot.
Wasserstein barycenter and its application to texture mixing.
In
Scale Space and Variational Methods in Computer Vision
, pages 435–446, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.  [12] N. Bonneel, J. Rabin, G. Peyré, and H. Pfister. Sliced and Radon Wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.

[13]
I. Deshpande, Z. Zhang, and A. G. Schwing.
Generative modeling using the sliced Wasserstein distance.
In
IEEE Conference on Computer Vision and Pattern Recognition
, pages 3483–3491, 2018.  [14] A. Liutkus, U. Şimşekli, S. Majewski, A. Durmus, and F.R. Stoter. SlicedWasserstein flows: Nonparametric generative modeling via optimal transport and diffusions. In International Conference on Machine Learning, 2019.
 [15] S. Kolouri, P. E. Pope, C. E. Martin, and G. K. Rohde. Sliced Wasserstein autoencoders. In International Conference on Learning Representations, 2019.
 [16] N. Bonnotte. Unidimensional and Evolution Methods for Optimal Transportation. PhD thesis, Paris 11, 2013.
 [17] I. Deshpande, Y.T. Hu, R. Sun, A. Pyrros, N. Siddiqui, S. Koyejo, Z. Zhao, D. Forsyth, and A. Schwing. MaxSliced Wasserstein distance and its use for GANs. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
 [18] K. Nadjahi, A. Durmus, U. Şimsekli, and R. Badeau. Asymptotic Guarantees for Learning Generative Models with the SlicedWasserstein Distance. In Advances in Neural Information Processing Systems, 2019.
 [19] S. A. Sisson, Y. Fan, and M. A. Beaumont. Overview of Approximate Bayesian Computation. arXiv eprints, page arXiv:1802.09720, Feb 2018.
 [20] D. T. Frazier, G. M. Martin, C.P. Robert, and J. Rousseau. Asymptotic properties of approximate Bayesian computation. Biometrika, 105(3):593–607, 06 2018.
 [21] G. Peyré, M. Cuturi, et al. Computational optimal transport. Foundations and Trends® in Machine Learning, 11(56):355–607, 2019.
 [22] S. T. Rachev and L. Rüschendorf. Mass transportation problems. Vol. I. Probability and its Applications (New York). SpringerVerlag, New York, 1998. Theory.
 [23] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer, 2009 edition, September 2008.
 [24] T. Toni, D. Welch, N. Strelkowa, A. Ipsen, and M. Stumpf. Approximate bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society, Interface / the Royal Society, 6:187–202, 03 2009.
 [25] E. Klinger, D. Rickert, and J. Hasenauer. pyABC: distributed, likelihoodfree inference. Bioinformatics, 34(20):3591–3593, 05 2018.
 [26] K. Nadjahi, V. De Bortoli, A. Durmus, R. Badeau, and U. Şimşekli. Swabc software implementation. https://github.com/kimiandj/slicedwass_abc, https://vdeborto.github.io/publication/sw_abc.
 [27] A. Buades, B. Coll, and J.M. Morel. A nonlocal algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 2, pages 60–65. IEEE, 2005.
 [28] L. Fan, F. Zhang, H. Fan, and C. Zhang. Brief review of image denoising techniques. Visual Computing for Industry, Biomedicine, and Art, 2(1):7, 2019.
 [29] D. Martin, C. Fowlkes, D. Tal, J. Malik, et al. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. Iccv Vancouver:, 2001.
 [30] S. Kolouri, K. Nadjahi, U. Simsekli, R. Badeau, and G. K. Rohde. Generalized Sliced Wasserstein Distances. In Advances in Neural Information Processing Systems, 2019.
Comments
There are no comments yet.