Statistical inference for Bures-Wasserstein barycenters

In this work we introduce the concept of the Bures-Wasserstein barycenter $Q_*$, which is essentially a Fréchet mean of some distribution supported on a subspace of positive semi-definite Hermitian operators $\mathbb{H}_+(d)$. We allow the barycenter to be restricted to some affine subspace of $\mathbb{H}_+(d)$ and provide conditions ensuring its existence and uniqueness. We also investigate convergence and concentration properties of an empirical counterpart of $Q_*$ in both the Frobenius norm and the Bures-Wasserstein distance, and explain how the obtained results are connected to optimal transportation theory and can be applied to statistical inference in quantum mechanics.




1 Introduction

The space of finite-dimensional Hermitian operators $\mathbb{H}(d)$ provides a powerful toolbox for data representation. For instance, in quantum mechanics it is used for the mathematical description of physical properties of a quantum system, also known as observables. The reason is that the measurements obtained in a physical experiment should be associated with real-valued quantities; Hermitian operators possess a real-valued spectrum and thus satisfy this requirement. The subspace of real-valued symmetric matrices $\mathrm{Sym}(d)$ is also of great interest: points in $\mathrm{Sym}(d)$ are widely used for the description of systems in engineering applications, medical studies, neural sciences, evolutionary biology, etc. Usually such data sets are considered to be randomly sampled from an unknown distribution $\mathbb{P}$ (Goodnight and Schwartz (1997); Calsbeek and Goodnight (2009); Álvarez-Esteban et al. (2015); del Barrio et al. (2017); Gonzalez et al. (2017)), and statistical characteristics of $\mathbb{P}$ such as, in particular, the mean and the variance are of interest for further planning of experiments and for the analysis of obtained results for further development of natural science models. The current study focuses on the space of positive semi-definite Hermitian matrices $\mathbb{H}_+(d)$ and presents a possible approach to the analysis and aggregation of relevant statistical information from data sets for which the linearity assumption might be violated. In that case the classical Euclidean definitions of mean and variance are not sensitive enough to capture the effects of interest. This situation arises in many contexts. As an example, one can consider a data set represented as probability measures which belong to the same scale-location family, e.g. some astronomic measurements (Alvarez-Esteban et al. (2018), Example 4.6). Non-linearity requires the development of a novel toolbox suitable for further statistical analysis. In order to detect non-linear effects, we suggest endowing $\mathbb{H}_+(d)$ with the Bures-Wasserstein distance $d_{BW}$, recently introduced in the seminal paper Bhatia et al. (2018). It is defined as follows. For any pair of positive matrices $Q, S \in \mathbb{H}_+(d)$ it is written as:

$$d^2_{BW}(Q, S) = \operatorname{tr} Q + \operatorname{tr} S - 2 \operatorname{tr}\big(Q^{1/2} S Q^{1/2}\big)^{1/2}. \tag{1.1}$$

It is worth noting that, being restricted to the space of symmetric positive definite matrices $\mathrm{Sym}_{++}(d)$, $d_{BW}$ boils down to the classical 2-Wasserstein distance between measures that belong to the same scale-location family (see e.g. Agueh and Carlier (2011), Section 6 or Álvarez-Esteban et al. (2015)). A more detailed discussion of this particular choice of the distance is presented in Section 2.1.
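As an illustration of the distance just introduced, the following is a minimal numerical sketch, assuming the standard closed form $\operatorname{tr} Q + \operatorname{tr} S - 2 \operatorname{tr}(Q^{1/2} S Q^{1/2})^{1/2}$ of Bhatia et al. (2018). The helper names `psd_sqrt` and `bw_distance_sq` are hypothetical and not part of the original text.

```python
import numpy as np


def psd_sqrt(a):
    """Square root of a Hermitian positive semi-definite matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # remove tiny negative eigenvalues caused by round-off
    return (v * np.sqrt(w)) @ v.conj().T


def bw_distance_sq(q, s):
    """Squared Bures-Wasserstein distance tr Q + tr S - 2 tr (Q^{1/2} S Q^{1/2})^{1/2}."""
    root_q = psd_sqrt(q)
    cross = psd_sqrt(root_q @ s @ root_q)
    value = np.trace(q).real + np.trace(s).real - 2.0 * np.trace(cross).real
    return max(float(value), 0.0)  # guard against a small negative round-off


# Commuting (diagonal) example: the squared distance reduces to
# sum_j (sqrt(q_j) - sqrt(s_j))^2.
Q = np.diag([1.0, 4.0])
S = np.diag([9.0, 16.0])
print(bw_distance_sq(Q, S))
```

For commuting matrices the value can be checked by hand against the sum of squared differences of the square roots of the eigenvalues, which is a convenient sanity check for any implementation.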

After choosing a proper distance, we are now ready to introduce the model of information aggregation whose statistical properties are investigated in the current study. Let $\mathbb{P}$ be a probability distribution supported on some subset of $\mathbb{H}_+(d)$. Without loss of generality we further assume that $\mathbb{P}$ assigns positive probability to the intersection of its support with the space of positive definite Hermitian matrices $\mathbb{H}_{++}(d)$, and that the spectrum of its elements is on average bounded away from infinity:

Assumption 1.

$\mathbb{P}\big(\mathbb{H}_{++}(d)\big) > 0$ and $\mathbb{E} \operatorname{tr} S < +\infty$ for $S \sim \mathbb{P}$.

Two statistically important characteristics of $\mathbb{P}$ are the Fréchet mean and the Fréchet variance. The former can be regarded as a typical representative of the data set at hand, whereas the latter appears in the analysis of data variability, see e.g. Del Barrio et al. (2015). We briefly recall both concepts below. For an arbitrary point $Q \in \mathbb{H}_+(d)$ the Fréchet variance of $\mathbb{P}$ is defined as

$$V(Q) = \mathbb{E}\, d^2_{BW}(Q, S), \qquad S \sim \mathbb{P}.$$

The classical Fréchet mean of $\mathbb{P}$ is the set of global minimizers of $V(Q)$:

$$Q_* \in \operatorname*{argmin}_{Q \in \mathbb{H}_+(d)} V(Q). \tag{1.2}$$

However, in many cases we are interested in a minimizer that belongs to some affine subspace $\mathbb{A}$:

$$Q_* \in \operatorname*{argmin}_{Q \in \mathbb{A} \cap \mathbb{H}_+(d)} V(Q). \tag{1.3}$$

For instance, such a necessity may arise when considering a random set of quantum density operators. For an introduction to the theory of density operators one may consult Fano (1957). This example is considered in more detail in Section 3.2. Note that the setting (1.3) covers the setting (1.2), so without loss of generality we further address only (1.3). The first crucial question concerns existence and uniqueness of $Q_*$. Positive answers to both questions, along with the required conditions, are presented in Theorem 2.1. This immediately allows us to define the global Fréchet variance of $\mathbb{P}$ as $V_* = V(Q_*)$.

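Given an i.i.d. sample $S_1, \dots, S_n$, $S_i \sim \mathbb{P}$, one constructs an empirical analogue of $V(Q)$:

$$V_n(Q) = \frac{1}{n} \sum_{i=1}^{n} d^2_{BW}(Q, S_i).$$

An empirical Fréchet mean and the global empirical variance also exist and are unique:

$$Q_n = \operatorname*{argmin}_{Q \in \mathbb{A} \cap \mathbb{H}_+(d)} V_n(Q), \qquad \widehat{V}_n = V_n(Q_n). \tag{1.4}$$

These facts follow from Theorem 2.1. This work studies convergence of the estimators $Q_n$ and $\widehat{V}_n$ to $Q_*$ and $V_*$ and investigates concentration properties of both quantities. The discussion of practical applicability of the obtained results is postponed to Section 3. There we explain their relation to optimal transportation theory and present a possible application to statistical analysis in quantum mechanics.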

1.1 Contribution of the present study

Central limit theorem and concentration of $Q_n$

The first main result of this study concerns asymptotic normality of the approximation error of the population Fréchet mean by its empirical counterpart:

$$\sqrt{n}\,(Q_n - Q_*) \rightharpoonup \mathcal{N}(0, \boldsymbol{\Xi}),$$

where “$\rightharpoonup$” stands for weak convergence, and $\boldsymbol{\Xi}$ is some covariance operator acting on the linear subspace $\mathbb{M}$ associated with the affine subspace $\mathbb{A}$. From now on we use bold symbols to denote operators, whereas regular ones stand for matrices or vectors. This convergence result cannot be directly used for the construction of asymptotic confidence sets because it relies on the unknown covariance operator $\boldsymbol{\Xi}$. However, Theorem 2.2 ensures that this covariance can be replaced by its empirical counterpart $\widehat{\boldsymbol{\Xi}}_n$:

$$\sqrt{n}\,\widehat{\boldsymbol{\Xi}}_n^{-1/2}(Q_n - Q_*) \rightharpoonup \mathcal{N}(0, \boldsymbol{\mathrm{Id}}),$$

where $\boldsymbol{\mathrm{Id}}$ denotes the identity operator. Along with asymptotic normality of $Q_n$, we are interested in the limiting distribution of $d_{BW}(Q_n, Q_*)$. Corollary 2.1 shows that $\sqrt{n}\, d_{BW}(Q_n, Q_*)$ converges weakly to a limit expressed through some normally distributed vector. Data-driven asymptotic confidence sets are obtained by replacing $\boldsymbol{\Xi}$ by its empirical counterpart $\widehat{\boldsymbol{\Xi}}_n$; the quality of this replacement is measured in a metric which induces weak convergence. Furthermore, we investigate concentration properties of $Q_n$ in both the Frobenius norm and the $d_{BW}$ metric. The corresponding two bounds hold with high probability, up to a generic constant. A more detailed discussion is presented in Theorem 2.3 and Corollary 2.2 respectively. It is worth noting that the concentration results are obtained under a sub-Gaussianity assumption in the following sense:

Assumption 2 (Sub-Gaussianity of ).

Let be sub-Gaussian, i.e.

with some constants .

All the above-mentioned results are closely connected to convergence and concentration of empirical 2-Wasserstein barycenters. For the sake of transparency we postpone this discussion to Section 3.1.

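CLT and concentration for the empirical Fréchet variance

We also show asymptotic normality of the approximation error of the population Fréchet variance $V_* = V(Q_*)$ by its empirical counterpart $\widehat{V}_n = V_n(Q_n)$ and prove concentration of $\widehat{V}_n$: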

where the latter result holds with h.p., and , are constants which depend on sub-Gaussianity parameters , dimension , and parameter . See Theorem 2.4 and Theorem 2.5 respectively.

The paper is organised as follows. Section 2 explains the obtained results in more detail. Section 3 illustrates the connection to other scientific problems. Finally, Section 4 contains simulations and experiments on both artificial and real data sets.

2 Results

This section presents the obtained results in more detail. The first question we address is the particular choice of the distance.

2.1 Bures-Wasserstein distance

The original Bures metric appears in quantum mechanics in relation to the fidelity measure between two quantum states and is used for the measurement of quantum entanglement Marian and Marian (2008); Dajka et al. (2011). Let $\rho$, $\sigma$ be two quantum states. Mathematically speaking, this means that

$$\rho, \sigma \in \mathbb{H}_+(d), \qquad \operatorname{tr} \rho = \operatorname{tr} \sigma = 1. \tag{2.1}$$

The fidelity of these states is defined as $F(\rho, \sigma) = \big(\operatorname{tr}(\rho^{1/2} \sigma \rho^{1/2})^{1/2}\big)^2$. It quantifies the “closeness” of $\rho$ and $\sigma$, see Jozsa (1994). In the case of (2.1) the Bures-Wasserstein distance turns into

$$d^2_{BW}(\rho, \sigma) = 2 - 2\sqrt{F(\rho, \sigma)}. \tag{2.2}$$

It is interesting to note that the distance $d_{BW}$ is not only one of the central distances used in quantum mechanics, but also an object of extensive investigation in transportation theory Takatsu et al. (2011). Let $\mathcal{N}(0, Q)$ and $\mathcal{N}(0, S)$ be two centred Gaussian distributions. Then the 2-Wasserstein distance between them is written as

$$W_2^2\big(\mathcal{N}(0, Q), \mathcal{N}(0, S)\big) = \operatorname{tr} Q + \operatorname{tr} S - 2 \operatorname{tr}\big(Q^{1/2} S Q^{1/2}\big)^{1/2} = d^2_{BW}(Q, S).$$

The case of Gaussian measures is naturally extended to measures that belong to the same scale-location family Alvarez-Esteban et al. (2018). In the last few years the Wasserstein distance has attracted a lot of attention from data scientists and the machine learning community, as it takes into account geometrical similarities between objects, see e.g. Gramfort et al. (2015); Flamary et al. (2018); Montavon et al. (2016). Due to this fact $d_{BW}$ satisfies the requirement of taking into account the non-linearity of the data set under consideration. For more information on optimal transportation theory we recommend Villani (2009).

Following Bhatia et al. (2018), we continue to investigate properties of $d_{BW}$. The next lemma presents an alternative analytical expression for the distance.

Lemma 2.1.

Let and . Then (1.1) can be rewritten as



By we denote the pseudo-inverse matrix .

Note that in optimal transportation theory the matrix introduced in Lemma 2.1 is referred to as the optimal push-forward (optimal transportation map) between the two centred normal distributions $\mathcal{N}(0, Q)$ and $\mathcal{N}(0, S)$. Following optimal transport notation, it is denoted as $T_Q^S$.
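The role of the optimal map can be checked numerically. The sketch below assumes the standard closed form of the optimal map between centred Gaussians, $T_Q^S = S^{1/2}(S^{1/2} Q S^{1/2})^{-1/2} S^{1/2}$ with a pseudo-inverse, which is an assumption here because the displayed formula of Lemma 2.1 is not available in this copy. The function names are hypothetical.

```python
import numpy as np


def psd_sqrt(a):
    """Square root of a Hermitian positive semi-definite matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # remove tiny negative eigenvalues caused by round-off
    return (v * np.sqrt(w)) @ v.conj().T


def transport_map(q, s):
    """Assumed closed form S^{1/2} (S^{1/2} Q S^{1/2})^{-1/2} S^{1/2} (pseudo-inverse)."""
    root_s = psd_sqrt(s)
    inner = psd_sqrt(root_s @ q @ root_s)
    return root_s @ np.linalg.pinv(inner, hermitian=True) @ root_s


rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 3, 3))
Q = a @ a.T + np.eye(3)  # two well-conditioned positive definite matrices
S = b @ b.T + np.eye(3)

T = transport_map(Q, S)

# The map pushes N(0, Q) forward to N(0, S): T Q T should reproduce S.
push_forward_error = np.linalg.norm(T @ Q @ T - S)

# The squared distance computed through the map, ||(T - I) Q^{1/2}||_F^2,
# should agree with the trace formula (1.1).
root_q = psd_sqrt(Q)
dist_sq_via_map = np.linalg.norm((T - np.eye(3)) @ root_q, "fro") ** 2
dist_sq_direct = (np.trace(Q) + np.trace(S)
                  - 2.0 * np.trace(psd_sqrt(root_q @ S @ root_q)))
print(push_forward_error, dist_sq_via_map, dist_sq_direct)
```

The agreement of the two distance values follows from $\operatorname{tr}(T_Q^S Q) = \operatorname{tr}(Q^{1/2} S Q^{1/2})^{1/2}$ for positive definite arguments.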

For general notes on optimal transportation maps see Brenier (1991); for the particular case of scale-location and Gaussian families one may refer to Alvarez-Esteban et al. (2018); Takatsu et al. (2011). Lemma A.2 establishes differentiability of the optimal map $T_Q^S$, which is one of the key ingredients in the proofs of the main results of the present study. Note that for real symmetric matrices differentiability of the map was obtained in Rippl et al. (2016). More technical details on the properties of $d_{BW}$ are presented in Section A.2. However, for a better understanding of the proofs of the main results we highly recommend at least looking through Section A.1, which is dedicated to the properties of $T_Q^S$ and its differential.

2.2 Existence and uniqueness of $Q_*$ and $Q_n$

Along with the investigation of properties of the distance at hand, and before moving to more general statistical questions, one should ask whether the Fréchet mean $Q_*$ exists and, if so, whether it is unique. Let $\mathbb{M}$ be the linear subspace of $\mathbb{H}(d)$ associated with $\mathbb{A}$, i.e. the representation $\mathbb{A} = Q_0 + \mathbb{M}$ holds for some $Q_0 \in \mathbb{A}$. We further assume that $\mathbb{A}$ has a non-empty intersection with the space of positive definite operators:

Assumption 3.

$\mathbb{A} \cap \mathbb{H}_{++}(d) \neq \emptyset$.


The next theorem ensures existence and uniqueness of the Fréchet mean (1.3).

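Theorem 2.1 (Existence and uniqueness of the Fréchet mean $Q_*$).

Under Assumptions 1 and 3 there exists a unique positive-definite barycenter of $\mathbb{P}$: $Q_* \in \mathbb{A} \cap \mathbb{H}_{++}(d)$. Moreover, it is characterised as the unique solution of the equation

$$\boldsymbol{\Pi}_{\mathbb{M}}\, \mathbb{E}\, T_Q^S = \boldsymbol{\Pi}_{\mathbb{M}}\, I, \qquad Q \in \mathbb{A} \cap \mathbb{H}_{++}(d), \tag{2.4}$$

where $T_Q^S$ is the optimal map from Lemma 2.1 and $\boldsymbol{\Pi}_{\mathbb{M}}$ is the orthogonal projector onto $\mathbb{M}$.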

Note that this result generalises the result for scale-location families in the 2-Wasserstein space presented in Álvarez-Esteban et al. (2015), Theorem 3.10 and originally obtained in the seminal work Agueh and Carlier (2011), Theorem 6.1. Namely, if no affine restriction is imposed, i.e. in the setting (1.2), then $Q_*$ exists, is unique, and is characterised as the unique solution of a fixed-point equation similar to (2.4):

$$Q_* = \mathbb{E}\,\big(Q_*^{1/2} S Q_*^{1/2}\big)^{1/2}.$$

Existence, uniqueness, and measurability of the estimator $Q_n$ defined in (1.4) are a direct corollary of the above theorem. The proof of Theorem 2.1 is presented in Section A.3.
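For the unrestricted setting (1.2), the empirical barycenter can be approximated numerically. The sketch below uses a fixed-point iteration that is commonly applied to Gaussian and scale-location Wasserstein barycenters; it is a swapped-in illustration, not a method stated in this paper, and it does not handle the affine restriction of (1.3). The function name `bw_barycenter`, the starting point, and the stopping rule are assumptions.

```python
import numpy as np


def psd_sqrt(a):
    """Square root of a Hermitian positive semi-definite matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # remove tiny negative eigenvalues caused by round-off
    return (v * np.sqrt(w)) @ v.conj().T


def bw_barycenter(mats, n_iter=500, tol=1e-12):
    """Unrestricted empirical barycenter of positive definite matrices.

    Iterates Q <- Q^{-1/2} (mean_i (Q^{1/2} S_i Q^{1/2})^{1/2})^2 Q^{-1/2}.
    """
    mats = np.asarray(mats)
    q = np.mean(mats, axis=0)  # Euclidean mean: a positive definite starting point
    for _ in range(n_iter):
        root = psd_sqrt(q)
        inv_root = np.linalg.inv(root)
        mean_root = np.mean([psd_sqrt(root @ s @ root) for s in mats], axis=0)
        q_new = inv_root @ mean_root @ mean_root @ inv_root
        q_new = (q_new + q_new.conj().T) / 2.0  # re-symmetrise against round-off
        converged = np.linalg.norm(q_new - q) < tol * np.linalg.norm(q)
        q = q_new
        if converged:
            break
    return q


# Commuting example: the barycenter of diagonal matrices is diagonal with
# entries (mean of square roots)^2.
sample = [np.diag([1.0, 4.0]), np.diag([9.0, 16.0])]
Q_n = bw_barycenter(sample)
print(Q_n)
```

At a fixed point of the iteration the equation $Q = \frac{1}{n}\sum_i (Q^{1/2} S_i Q^{1/2})^{1/2}$ holds, which is the empirical version of the fixed-point characterisation above.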

2.3 Convergence of $Q_n$ and $d_{BW}(Q_n, Q_*)$

Armed with the above properties of the distance and of the population and empirical barycenters, we are now ready to introduce the main results of the current study. Theorem 2.2 presents asymptotic convergence of $Q_n$ to $Q_*$.

Theorem 2.2 (Central limit theorem for the Fréchet mean).

Under Assumptions 1 and 3 an approximation error rate of the Fréchet mean by its empirical counterpart is

where is a self-adjoint linear operator acting from to defined in (A.7). Moreover, if is non-degenerated, then

with is a data-driven empirical counterpart of defined in (A.8).

Remark 1.

Here denotes a restriction of a quadratic form to a subspace :

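We intentionally postpone the explicit definitions of the limiting covariance operator and its empirical counterpart, as they require the introduction of many technical details that would make the description of the main results less transparent. The proof of the theorem relies on the Fréchet differentiability of $T_Q^S$ in the vicinity of $Q_*$:

$$T_Q^S = T_{Q_*}^S + \boldsymbol{dT}_{Q_*}^S (Q - Q_*) + o\big(\|Q - Q_*\|\big),$$

where $\boldsymbol{dT}_{Q_*}^S$ is the differential of $T_Q^S$ at the point $Q_*$. Here we mean differentiability with respect to the lower argument $Q$.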

It is worth noting that the result (B) obtained in the CLT enables the construction of data-driven asymptotic confidence sets. However, technical problems might arise with the inversion of the empirical covariance. For instance, numerical simulations show that the empirical covariance can be degenerate if $\mathbb{P}$ is supported on a set of diagonal matrices. This immediately raises the question of developing some other methodology for confidence set construction based on re-sampling techniques, which would simplify the process from a computational point of view. We consider this a subject for further research.

Since the Bures-Wasserstein distance is the main tool for the analysis in $\mathbb{H}_+(d)$, the convergence properties of $d_{BW}(Q_n, Q_*)$ are also of great interest. The next result is an almost straightforward corollary of Theorem 2.2.

Corollary 2.1 (Asymptotic distribution of $d_{BW}(Q_n, Q_*)$).

Under conditions of Theorem 2.2 it holds


is random matrix, s.t. 

. Moreover,

where and .

To illustrate the result, we consider the case of diagonal . This setting allows us to write down the explicit form of the limiting distribution. If , then right-hand side of the above corollary for -case is:

where . All proofs are collected in Section A.3. Section 4 illustrates asymptotic behaviour of and on both artificial and real data sets.

2.4 Concentration of $Q_n$

The next important issue is the concentration properties of $Q_n$ under the sub-Gaussianity assumption (Assumption 2).

Theorem 2.3 (Concentration of $Q_n$).



It holds under Assumptions 2 and 3, that

for any and , where

operator is defined in (B.3), comes from Proposition 1, and and are defined in Lemma B.3.

Concentration of $d_{BW}(Q_n, Q_*)$ is a corollary of the above theorem.

Corollary 2.2 (Concentration of $Q_n$ in the $d_{BW}$ distance).

Under conditions of Theorem 2.3 the following result holds

Proofs are collected in Section B.

2.5 Central limit theorem and concentration for the empirical Fréchet variance

In this section we investigate properties of the empirical Fréchet variance $\widehat{V}_n = V_n(Q_n)$, defined in (1.4). The next theorem presents a central limit theorem for $\widehat{V}_n$.

Theorem 2.4 (Central limit theorem for $\widehat{V}_n$).

Let be s.t.  and . Then

The last important result of the current study describes concentration properties of the empirical Fréchet variance $\widehat{V}_n$.

Theorem 2.5 (Concentration of $\widehat{V}_n$).

Let Assumption 2 be fulfilled. Then under conditions of Theorem 2.4 it holds:


There are parameters of sub-exponential r.v. .

Proofs of these two theorems are collected in Section B.1.

3 Connection to other problems

In this section we explain the connection of the obtained results to some other problems. Section 3.1 investigates the relation between the Bures-Wasserstein barycenter and the 2-Wasserstein barycenter of some scale-location family. Section 3.2 illustrates the idea of searching for a barycenter on some affine subspace $\mathbb{A}$.

3.1 Connection to scale-location families of measures

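We first present the concept of a scale-location family of absolutely continuous measures supported on $\mathbb{R}^d$.

Definition 3.1.

Let $X$ be a random variable that follows a law $\mu_0$: $X \sim \mu_0$, $\mu_0 \in \mathcal{P}_2^{ac}(\mathbb{R}^d)$, where $\mathcal{P}_2^{ac}(\mathbb{R}^d)$ is the set of all absolutely continuous measures with finite second moment. The set of laws of all affine transformations of $X$,

$$\mathcal{F}(\mu_0) = \big\{ \mathcal{L}(AX + a) : A \in \mathrm{Sym}_{++}(d),\ a \in \mathbb{R}^d \big\},$$

is referred to as a scale-location family.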

Scale-location families attract a lot of attention in modern data analysis and appear in many practical applications, as the concept is convenient for theoretical analysis and, at the same time, possesses very high modelling power. For example, it is widely used in medical imaging Wassermann et al. (2010), modelling of molecular dynamics Gonzalez et al. (2017), clustering procedures del Barrio et al. (2017), climate modelling Mallasto and Feragen (2017), embedding of complex objects in low-dimensional spaces Muzellec and Cuturi (2018), and so on.

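A possible metric that takes into account the non-linearity of the underlying data set is the 2-Wasserstein distance $W_2$. Let $\mu$, $\nu$ be elements of a scale-location family and let $X \sim \mu$, $Y \sim \nu$. We denote their first and second moments as

$$m_\mu = \mathbb{E} X, \quad S_\mu = \mathbb{E}(X - m_\mu)(X - m_\mu)^\top, \qquad m_\nu = \mathbb{E} Y, \quad S_\nu = \mathbb{E}(Y - m_\nu)(Y - m_\nu)^\top.$$

It is a well-known fact that in the case of scale-location families $W_2$ depends only on the first and second moments of the observed measures:

$$W_2^2(\mu, \nu) = \|m_\mu - m_\nu\|^2 + d^2_{BW}(S_\mu, S_\nu).$$

For more details on the general class of optimal transportation distances we recommend the excellent books Ambrosio and Gigli (2013) or Villani (2009).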

Distribution over scale-location family

In many cases we are interested in scale-location families generated at random. Let be a probability measure supported on some . And let be a generic probability space, s.t. for any there exists an image in , where is a scaling parameter and is a shift parameter. A randomly sampled measure belongs to by construction, and its first and second moments are written as

where denote the first and the second moments of .

Fréchet variance of at any arbitrary point is written as

Given an i.i.d. sample of such random measures, we define an empirical analogue of this Fréchet variance:

Then population and empirical barycenters and are

Note, that and belong to and are uniquely characterised by their first and second moments and respectively, see e.g. Theorem 3.10 Álvarez-Esteban et al. (2015):


It is worth noting that the concept of the Wasserstein barycenter, originally presented in the seminal work of Agueh and Carlier (2011), has become a topic of extensive scientific interest in the last few years. Bigot et al. (2012) focus on the convergence of a parametric class of barycenters, while Bigot et al. (2017) investigate asymptotic properties of regularised barycenters. The most general results on the convergence of empirical barycenters are obtained in Ahidar-Coutrix et al. (2018). That work provides rates of convergence for empirical barycenters of a Borel probability measure on a metric space, either under a weak curvature constraint on the underlying space or for a non-negatively curved space on which geodesics emanating from a barycenter can be extended. Theorem 2.2 specifies the results obtained in Ahidar-Coutrix et al. (2018) for the case of scale-location families. Corollary 2.1 partially answers an (implicit) question raised by Le Gouic and Loubes (2015) concerning the rate of convergence of empirical barycenters. Namely, for the case of scale-location families it is of order $n^{-1/2}$. However, the above-mentioned work covers only the case (1.2). Kroshnin (2018) obtains an analogue of the law of large numbers for arbitrary cost functions for barycenters on some affine subspace (1.3). A result similar in spirit to Theorem 2.5 is obtained in Del Barrio et al. (2016). However, the authors there consider only the space of probability measures supported on the real line (i.e. $d = 1$) endowed with the 2-Wasserstein distance. To the best of our knowledge, there are no results similar to the concentration Theorem 2.3 and Theorem 2.2 in the case of the 2-Wasserstein distance.

3.2 Connection to quantum mechanics

This section illustrates the idea of a barycenter restricted to some affine subspace $\mathbb{A}$. We first briefly recall the concept of quantum densities. The quantum density operator is used in quantum mechanics as a possible description of the statistical state of a quantum system. It might be considered as an analogue of a phase-space density in classical statistical mechanics. The formalism was introduced by John von Neumann in 1927. In essence, a density matrix $\rho$ is a Hermitian positive semi-definite operator with unit trace: $\rho \in \mathbb{H}_+(d)$, $\operatorname{tr} \rho = 1$.

Given a random ensemble of density matrices, one is able to recover the mean by averaging in the classical Euclidean sense. However, the Bures-Wasserstein barycenter suggests an alternative way to define the “most typical” representative (1.3) in terms of the fidelity measure (2.2). We consider the following statistical setting. Let $\mathbb{P}$ be some mechanism which generates quantum states $\rho$. Given an i.i.d. sample $\rho_1, \dots, \rho_n$, we write the population and empirical variance of $\mathbb{P}$ as

$$V(Q) = \mathbb{E}\, d^2_{BW}(Q, \rho), \qquad V_n(Q) = \frac{1}{n}\sum_{i=1}^{n} d^2_{BW}(Q, \rho_i).$$

Then the population and empirical barycenters in the class of all $d$-dimensional density operators are defined as

$$Q_* = \operatorname*{argmin}_{Q \in \mathbb{H}_+(d),\ \operatorname{tr} Q = 1} V(Q), \qquad Q_n = \operatorname*{argmin}_{Q \in \mathbb{H}_+(d),\ \operatorname{tr} Q = 1} V_n(Q).$$

It can be easily shown that by “taking the global Fréchet barycenter”, or in other words neglecting the condition $\operatorname{tr} Q = 1$, we end up with the global barycenter, which is the solution of the fixed-point equation already mentioned in Section 2: $Q = \mathbb{E}\,(Q^{1/2} \rho\, Q^{1/2})^{1/2}$. This is a contraction mapping. Thus $\operatorname{tr} Q < 1$ in general, and $Q$ is not a density operator. In other words, the condition $\operatorname{tr} Q = 1$ ensures that $Q_*$ and $Q_n$ also belong to the class of density operators. Taking into account the results obtained in Section 2, $Q_n$ is a natural consistent estimator of $Q_*$ with known rate of convergence and deviation properties.

4 Experiments on simulated and real data sets

4.1 Simulated data

In this section we consider a simulated data set. So as to generate a covariance matrix ,

we generate at random an orthogonal matrix

and a diagonal matrix , s.t. . Fig. 1 - Fig. 3 illustrate the convergence results presented in Theorem 2.2, Corollary 2.1, and Theorem 2.4 respectively. The following numerical experiments were performed using . The population barycenter was computed using a sample of observed covariance matrices. A solid line depicts the density of the limiting distribution, whereas dashed lines correspond to densities for different sample sizes for the Bures-Wasserstein barycenter with . Simulations were carried out for matrices of size and .
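The sampling scheme described above (a random orthogonal matrix combined with a random diagonal matrix) can be sketched as follows. The eigenvalue range, the dimension, and the sample size are hypothetical, since the exact condition on the diagonal matrix is not recoverable from this copy; the function name `random_covariance` is likewise an assumption.

```python
import numpy as np


def random_covariance(d, rng, eig_low=0.5, eig_high=2.0):
    """Draw S = O diag(lambda) O^T with a random orthogonal O and random eigenvalues.

    The uniform eigenvalue range [eig_low, eig_high] is a hypothetical choice.
    """
    g = rng.standard_normal((d, d))
    o, r = np.linalg.qr(g)
    o = o * np.sign(np.diag(r))  # sign fix so that O is uniformly (Haar) distributed
    eigvals = rng.uniform(eig_low, eig_high, size=d)
    return (o * eigvals) @ o.T


rng = np.random.default_rng(42)
sample = [random_covariance(3, rng) for _ in range(100)]
print(sample[0])
```

Such a sample can then be passed to any routine that computes an empirical barycenter in order to study its behaviour for growing sample sizes.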

Figure 1: Densities of
Figure 2: Densities of
Figure 3: Densities of

4.2 Data aggregation in climate modelling

In this section we carry out experiments on a family of Gaussian processes, using a climate-related data set collected in Siberia (Russia) between 1930 and 2009 Bulygina and Razuvaev (2012); Tatusko (1990). We set to be a family of Gaussian curves that describe the daily minimum temperatures within one year, measured at a set of 30 randomly sampled meteorological stations. Each curve is obtained by means of regression and maximum likelihood estimation and is sampled in points. More details on this data set are provided in Mallasto and Feragen (2017). The scale-location family under consideration is written as

where is a Gaussian process, characterised by mean and covariance inherent to a year , . We let for all . A Gaussian process is the population Wasserstein barycenter of . It is characterised by

A family of approximating processes with parameters (3.3) is constructed by re-sampling with replacement from the original data set. Sample size varies in range . Fig. 4 and Fig. 5 present densities of and respectively.

Figure 4:
Figure 5:


  • Agueh and Carlier (2011) Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
  • Ahidar-Coutrix et al. (2018) Adil Ahidar-Coutrix, Thibaut Le Gouic, and Quentin Paris. On the rate of convergence of empirical barycentres in metric spaces: curvature, convexity and extendible geodesics. arXiv preprint arXiv:1806.02740, 2018.
  • Álvarez-Esteban et al. (2015) P. C. Álvarez-Esteban, E. del Barrio, J. A. Cuesta-Albertos, and C. Matrán. Wide consensus for parallelized inference. ArXiv e-prints, November 2015.
  • Alvarez-Esteban et al. (2018) Pedro C Alvarez-Esteban, Eustasio del Barrio, Juan A Cuesta-Albertos, Carlos Matrán, et al. Wide consensus aggregation in the Wasserstein space. application to location-scatter families. Bernoulli, 24(4A):3147–3179, 2018.
  • Ambrosio and Gigli (2013) Luigi Ambrosio and Nicola Gigli. A user’s guide to optimal transport. In Modelling and optimisation of flows on networks, pages 1–155. Springer, 2013.
  • Bhatia et al. (2018) Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 2018.
  • Bigot et al. (2012) Jérémie Bigot, Thierry Klein, et al. Consistent estimation of a population barycenter in the Wasserstein space. ArXiv e-prints, 2012.
  • Bigot et al. (2017) Jérémie Bigot, Elsa Cazelles, and Nicolas Papadakis. Penalized barycenters in the Wasserstein space. 2017.
  • Brenier (1991) Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics, 44(4):375–417, 1991.
  • Bulygina and Razuvaev (2012) ON Bulygina and VN Razuvaev. Daily temperature and precipitation data for 518 Russian meteorological stations. Technical report, ESS-DIVE (Environmental System Science Data Infrastructure for a Virtual Ecosystem); Oak Ridge National Lab.(ORNL), Oak Ridge, TN (United States), 2012.
  • Calsbeek and Goodnight (2009) Brittny Calsbeek and Charles J Goodnight. Empirical comparison of G matrix test statistics: finding biologically relevant change. Evolution: International Journal of Organic Evolution, 63(10):2627–2635, 2009.
  • Dajka et al. (2011) Jerzy Dajka, Jerzy Łuczka, and Peter Hänggi. Distance between quantum states in the presence of initial qubit-environment correlations: A comparative study. Physical Review A, 84(3):032120, 2011.
  • Del Barrio et al. (2016) E. Del Barrio, P. Gordaliza, H. Lescornel, and J.-M. Loubes. Central limit theorem and bootstrap procedure for Wasserstein’s variations with application to structural relationships between distributions. ArXiv e-prints, November 2016.
  • del Barrio et al. (2017) E del Barrio, JA Cuesta-Albertos, C Matrán, and A Mayo-Íscar. Robust clustering tools based on optimal transportation. Statistics and Computing, pages 1–22, 2017.
  • Del Barrio et al. (2015) Eustasio Del Barrio, Hélène Lescornel, and Jean-Michel Loubes. A statistical analysis of a deformation model with Wasserstein barycenters: estimation procedure and goodness of fit test. arXiv preprint arXiv:1508.06465, 2015.
  • Fano (1957) Ugo Fano. Description of states in quantum mechanics by density matrix and operator techniques. Reviews of Modern Physics, 29(1):74, 1957.
  • Flamary et al. (2018) Rémi Flamary, Marco Cuturi, Nicolas Courty, and Alain Rakotomamonjy. Wasserstein discriminant analysis. Machine Learning, 107(12):1923–1945, 2018.
  • Gonzalez et al. (2017) Oscar Gonzalez, Marco Pasi, Daiva Petkevičiūtė, Jaroslaw Glowacki, and JH Maddocks. Absolute versus relative entropy parameter estimation in a coarse-grain model of DNA. Multiscale Modeling & Simulation, 15(3):1073–1107, 2017.
  • Goodnight and Schwartz (1997) Charles J Goodnight and James M Schwartz. A bootstrap comparison of genetic covariance matrices. Biometrics, pages 1026–1039, 1997.
  • Gramfort et al. (2015) Alexandre Gramfort, Gabriel Peyré, and Marco Cuturi. Fast optimal transport averaging of neuroimaging data. In International Conference on Information Processing in Medical Imaging, pages 261–272. Springer, 2015.
  • Hsu et al. (2012) Daniel Hsu, Sham Kakade, Tong Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
  • Jozsa (1994) Richard Jozsa. Fidelity for mixed quantum states. Journal of modern optics, 41(12):2315–2323, 1994.
  • Koltchinskii et al. (2011) Vladimir Koltchinskii et al. Von Neumann entropy penalization and low-rank matrix estimation. The Annals of Statistics, 39(6):2936–2973, 2011.
  • Kroshnin (2018) Alexey Kroshnin. Fréchet barycenters in the Monge–Kantorovich spaces. Journal of Convex Analysis, 25(4):1371–1395, 2018.
  • Le Gouic and Loubes (2015) Thibaut Le Gouic and Jean-Michel Loubes. Existence and consistency of Wasserstein barycenters. June 2015.
  • Mallasto and Feragen (2017) Anton Mallasto and Aasa Feragen. Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5660–5670. Curran Associates, Inc., 2017.
  • Marian and Marian (2008) Paulina Marian and Tudor A Marian. Bures distance as a measure of entanglement for symmetric two-mode Gaussian states. Physical Review A, 77(6):062319, 2008.
  • Montavon et al. (2016) Grégoire Montavon, Klaus-Robert Müller, and Marco Cuturi. Wasserstein training of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, pages 3718–3726, 2016.
  • Muzellec and Cuturi (2018) Boris Muzellec and Marco Cuturi. Generalizing point embeddings using the Wasserstein space of elliptical distributions. arXiv preprint arXiv:1805.07594, 2018.
  • Rippl et al. (2016) Thomas Rippl, Axel Munk, and Anja Sturm. Limit laws of the empirical Wasserstein distance: Gaussian distributions. Journal of Multivariate Analysis, 151:90–109, 2016.
  • Takatsu et al. (2011) Asuka Takatsu et al. Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics, 48(4):1005–1026, 2011.
  • Tatusko (1990) Rene Tatusko. Cooperation in climate research: An evaluation of the activities conducted under the US–USSR agreement for environmental protection since 1974. 1990.
  • Villani (2009) Cédric Villani. Optimal Transport, volume 338 of Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2009. ISBN 978-3-540-71049-3 978-3-540-71050-9.
  • Wassermann et al. (2010) Demian Wassermann, Luke Bloy, Efstathios Kanterakis, Ragini Verma, and Rachid Deriche. Unsupervised white matter fiber clustering and tract probability map generation: Applications of a Gaussian process framework for white matter fibers. NeuroImage, 51(1):228–241, 2010.

Appendix A Proof of Central Limit Theorem

A.1 Properties of $T_Q^S$

Proof of Lemma 2.1.

First, we prove that optimal is self-adjoint. Indeed, assume the opposite, then

and thus . Therefore

If is Hermitian but not positive semi-definite, then , , hence again .

Finally, if , then it is straightforward to check that given by (2.3) and

The proof of the Central Limit theorem mainly relies on the differentiability of the map (2.3). Lemma A.2 shows that can be linearised in the vicinity of :

where is a self-adjoint negative-definite operator and stands for an operator norm. Properties of are investigated in Lemma A.3. Let us introduce some notation: if is a functional of a matrix , then we denote its differential as .

Lemma A.1.

Map is differentiable on , and its differential is given by


is the eigenvalue decomposition.


First, let us consider the map . It is smooth and its differential

is non-degenerated:

whenever . From now on denotes a scalar product associated to Frobenius norm.

Now applying the inverse function theorem we obtain that the inverse map is also smooth and its differential enjoys the following equation