1 Introduction
The space of finite-dimensional Hermitian operators provides a powerful toolbox for data representation. For instance, in quantum mechanics it is used for the mathematical description of the physical properties of a quantum system, also known as observables. The reason is that measurements obtained in a physical experiment should be associated with real-valued quantities; Hermitian operators possess a real-valued spectrum and thus satisfy this requirement. The subspace of real-valued symmetric matrices is also of great interest: points in it are widely used to describe systems in engineering applications, medical studies, neuroscience, evolutionary biology, etc. Usually such data sets are considered to be randomly sampled from an unknown distribution (Goodnight and Schwartz (1997); Calsbeek and Goodnight (2009); Álvarez-Esteban et al. (2015); del Barrio et al. (2017); Gonzalez et al. (2017)), and statistical characteristics of this distribution, such as, in particular, its mean and variance, are of interest for further planning of an experiment, analysis of the obtained results, and development of natural-science models. The current study focuses on the space of positive semi-definite Hermitian matrices and presents a possible approach to the analysis and aggregation of relevant statistical information from data sets for which the linearity assumption might be violated. This makes the classical Euclidean definitions of mean and variance not sensitive enough to capture the effects of interest. This case appears extremely often in multiple contexts. As an example one can consider a data set represented as probability measures belonging to the same scale-location family, e.g. some astronomic measurements
(see Álvarez-Esteban et al. (2018), Example 4.6). The non-linearity requires the development of a novel toolbox suitable for further statistical analysis. In order to detect non-linear effects, we suggest endowing the space with the Bures-Wasserstein distance, recently introduced in the seminal paper Bhatia et al. (2018). It is defined as follows: for any pair of positive semi-definite matrices $Q$ and $S$,
(1.1) $d^2(Q, S) = \operatorname{tr} Q + \operatorname{tr} S - 2 \operatorname{tr}\bigl(Q^{1/2} S Q^{1/2}\bigr)^{1/2}.$
It is worth noting that, restricted to the space of symmetric positive definite matrices, $d$ boils down to the classical 2-Wasserstein distance between measures belonging to the same scale-location family (see e.g. Agueh and Carlier (2011), Section 6, or Álvarez-Esteban et al. (2015)). A more detailed discussion on this particular choice of the distance is presented in Section 2.1.
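As a concrete illustration, the distance (1.1) can be evaluated with standard linear-algebra routines. The sketch below is ours (not code from the paper) and assumes NumPy and SciPy; for commuting matrices the distance reduces to the Frobenius distance between the square roots, which gives a convenient sanity check.

```python
import numpy as np
from scipy.linalg import sqrtm

def bures_wasserstein(A, B):
    """Squared Bures-Wasserstein distance between PSD matrices A and B:
    d^2(A, B) = tr(A) + tr(B) - 2 tr((A^{1/2} B A^{1/2})^{1/2})."""
    rA = sqrtm(A)
    cross = sqrtm(rA @ B @ rA)
    return np.trace(A) + np.trace(B) - 2.0 * np.real(np.trace(cross))

# Sanity checks: the distance of a matrix to itself is zero, and for
# commuting (here: diagonal) matrices the distance reduces to the squared
# Frobenius distance between the matrix square roots.
A = np.diag([1.0, 4.0])
B = np.diag([9.0, 16.0])
d2 = bures_wasserstein(A, B)
# sqrt(A) = diag(1, 2), sqrt(B) = diag(3, 4) -> d^2 = (1-3)^2 + (2-4)^2 = 8
```

For non-commuting matrices no such closed form exists and the matrix square roots are genuinely needed.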
After choosing a proper distance, we are now ready to introduce the model of information aggregation whose statistical properties are investigated in the current study. Let $P$ be a probability distribution supported on some set of positive semi-definite Hermitian matrices. Further, without loss of generality, we assume that $P$ assigns positive probability to the intersection of its support with the space of positive definite Hermitian matrices, and that the spectrum of its elements is on average bounded away from infinity:
Assumption 1. $\mathbb{E}_{S \sim P}\, \lambda_{\max}(S) < \infty$.
Two statistically important characteristics of a distribution $P$ are its Fréchet mean and Fréchet variance. The former can be regarded as a typical representative of the data set at hand, whereas the latter appears in the analysis of data variability, see e.g. Del Barrio et al. (2015). We briefly recall both concepts below. For an arbitrary point $Q$, the Fréchet variance of $P$ at $Q$ is defined as
$F(Q) = \mathbb{E}_{S \sim P}\, d^2(Q, S).$
The classical Fréchet mean of $P$ is the set of global minimizers of $F$:
(1.2) $Q^* \in \operatorname*{argmin}_{Q \succeq 0} F(Q).$
However, in many cases we are interested in a minimizer that belongs to some affine subspace $\mathbb{A}$:
(1.3) $Q^*_{\mathbb{A}} \in \operatorname*{argmin}_{Q \in \mathbb{A}} F(Q).$
For instance, such a necessity may arise while considering a random set of quantum density operators; for an introduction to the theory of density operators one may look through Fano (1957). This example is considered in more detail in Section 3.2. Note that the setting (1.3) covers the setting (1.2), so without loss of generality we further address only (1.3). Obviously, the first crucial question concerns the existence and uniqueness of $Q^*_{\mathbb{A}}$. Positive answers to both issues, along with necessary conditions, are presented in Theorem 2.1. This immediately allows us to define the global Fréchet variance of $P$ as $F(Q^*_{\mathbb{A}})$.
Given an i.i.d. sample $S_1, \dots, S_n$ drawn from $P$, one constructs an empirical analogue of the Fréchet variance:
$F_n(Q) = \frac{1}{n} \sum_{i=1}^{n} d^2(Q, S_i).$
An empirical Fréchet mean and the global empirical variance also exist and are unique:
(1.4) $\hat{Q}_{\mathbb{A}} \in \operatorname*{argmin}_{Q \in \mathbb{A}} F_n(Q), \qquad F_n\bigl(\hat{Q}_{\mathbb{A}}\bigr).$
These facts follow from Theorem 2.1.
This work studies the convergence of the empirical Fréchet mean and the empirical Fréchet variance and investigates the concentration properties of both quantities. The discussion of the practical applicability of the obtained results is postponed to Section 3, where we explain their relation to optimal transportation theory and present a possible application to statistical analysis in quantum mechanics.

1.1 Contribution of the present study
Central limit theorem and concentration of the empirical Fréchet mean
The first main result of this study concerns the asymptotic normality of the approximation error of the population Fréchet mean by its empirical counterpart:
$\sqrt{n}\,\bigl(\hat{Q} - Q^*\bigr) \rightharpoonup \mathcal{N}(0, \boldsymbol{\Xi}),$
where “$\rightharpoonup$” stands for weak convergence, $\hat{Q}$ and $Q^*$ denote the empirical and population Fréchet means, and $\boldsymbol{\Xi}$ is some covariance operator acting on the linear subspace associated with the affine subspace. From now on we use bold symbols, e.g. $\boldsymbol{\Xi}$, to denote operators, whereas classical ones stand for matrices or vectors. This convergence result cannot be directly used for the construction of asymptotic confidence sets, because it relies on the unknown covariance operator $\boldsymbol{\Xi}$. However, Theorem 2.2 ensures that $\boldsymbol{\Xi}$ can be consistently replaced by its empirical counterpart. Along with the asymptotic normality of the empirical Fréchet mean, we are interested in the limiting distribution of its Bures-Wasserstein distance to the population mean. Corollary 2.1 shows that
the suitably normalised squared distance converges to the squared norm of some normally distributed vector. Data-driven asymptotic confidence sets for the population barycenter are obtained by replacing the limiting distribution by its empirical counterpart, with closeness measured in a metric which induces weak convergence. Furthermore, we investigate the concentration properties of the empirical barycenter in both the Frobenius norm and the Bures-Wasserstein distance, and obtain high-probability bounds for both. A more detailed discussion is presented in Theorem 2.3 and Corollary 2.2, respectively. It is worth noting that the concentration results are obtained under the assumption of sub-Gaussianity of $P$ in the following sense:
Assumption 2 (Sub-Gaussianity of $P$).
Let $S \sim P$ be sub-Gaussian, i.e. satisfy a sub-Gaussian tail bound with some fixed constants.
All above-mentioned results are closely connected to the convergence and concentration of empirical Wasserstein barycenters. For the sake of transparency we postpone this discussion to Section 3.1.
CLT and concentration for the empirical Fréchet variance
2 Results
This section presents the obtained results in more detail; the first question we address is the particular choice of the distance.
2.1 Bures-Wasserstein distance
The original Bures metric appears in quantum mechanics in relation to the fidelity measure between two quantum states and is used for the measurement of quantum entanglement Marian and Marian (2008); Dajka et al. (2011). Let $\rho_1$, $\rho_2$ be two quantum states. Mathematically speaking, this means that
(2.1) $\rho_1, \rho_2 \succeq 0, \qquad \operatorname{tr} \rho_1 = \operatorname{tr} \rho_2 = 1.$
The fidelity of these states is defined as
$F(\rho_1, \rho_2) = \Bigl(\operatorname{tr}\bigl(\rho_1^{1/2} \rho_2\, \rho_1^{1/2}\bigr)^{1/2}\Bigr)^2.$
It quantifies the “closeness” of $\rho_1$ and $\rho_2$, see Jozsa (1994). It is obvious that in the case of (2.1) the Bures-Wasserstein distance turns into
(2.2) $d^2(\rho_1, \rho_2) = 2\bigl(1 - \sqrt{F(\rho_1, \rho_2)}\bigr).$
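The identity (2.2) linking fidelity and the Bures-Wasserstein distance for unit-trace states can be checked numerically. The snippet below is an illustration of ours, assuming NumPy and SciPy; the helper names `fidelity` and `bw_sq` are not from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def fidelity(rho1, rho2):
    """Fidelity of two density matrices (Jozsa, 1994)."""
    r = sqrtm(rho1)
    return np.real(np.trace(sqrtm(r @ rho2 @ r))) ** 2

def bw_sq(A, B):
    """Squared Bures-Wasserstein distance (1.1)."""
    rA = sqrtm(A)
    return np.real(np.trace(A) + np.trace(B) - 2 * np.trace(sqrtm(rA @ B @ rA)))

# Two qubit states: since tr(rho1) = tr(rho2) = 1, the squared
# Bures-Wasserstein distance equals 2 * (1 - sqrt(fidelity)).
rho1 = np.array([[0.7, 0.1], [0.1, 0.3]])
rho2 = np.array([[0.5, 0.0], [0.0, 0.5]])
lhs = bw_sq(rho1, rho2)
rhs = 2.0 * (1.0 - np.sqrt(fidelity(rho1, rho2)))
```

The two quantities agree up to floating-point error, which is exactly the reduction (2.1) → (2.2).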
It is interesting to note that this distance appears not only as one of the central distances used in quantum mechanics, but also as an object of extensive investigation in transportation theory Takatsu et al. (2011). Let $\mathcal{N}(0, Q)$ and $\mathcal{N}(0, S)$ be two centred Gaussian distributions. Then the 2-Wasserstein distance between them is written as
$W_2^2\bigl(\mathcal{N}(0, Q), \mathcal{N}(0, S)\bigr) = \operatorname{tr} Q + \operatorname{tr} S - 2 \operatorname{tr}\bigl(Q^{1/2} S Q^{1/2}\bigr)^{1/2} = d^2(Q, S).$
The case of Gaussian measures is naturally extended to measures that belong to the same scale-location family Álvarez-Esteban et al. (2018). In the last few years the Wasserstein distance has attracted a lot of attention from the data science and machine learning communities, as it takes into account geometrical similarities between objects, see e.g. Gramfort et al. (2015); Flamary et al. (2018); Montavon et al. (2016). Due to this fact it satisfies the requirement of taking into account the non-linearity of a data set under consideration. For more information on optimal transportation theory we recommend Villani (2009).

Following Bhatia et al. (2018), we continue to investigate the properties of the distance. The next lemma presents an alternative analytical expression for it.
Lemma 2.1.
Note that in optimal transportation theory this matrix is referred to as the optimal push-forward (optimal transportation map) between the two centred normal distributions $\mathcal{N}(0, Q)$ and $\mathcal{N}(0, S)$; following optimal transport notation it is denoted as $T^{Q \to S}$.
For general notes on optimal transportation maps see Brenier (1991); for the particular case of scale-location and Gaussian families one may refer to Álvarez-Esteban et al. (2018); Takatsu et al. (2011). Lemma A.2 establishes the differentiability of the optimal map, which is one of the key ingredients in the proofs of the main results of the present study. Note that differentiability of the map in a related setting is obtained in Rippl et al. (2016). More technical details on the properties of the optimal map are presented in Section A.2. However, for better understanding of the proofs of the main results we highly recommend at least looking through Section A.1, which is dedicated to the investigation of the properties of the distance and its differential.
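Lemma A.2 itself is not reproduced here, but the optimal map between centred Gaussians admits the well-known closed form $T = Q^{-1/2}\bigl(Q^{1/2} S Q^{1/2}\bigr)^{1/2} Q^{-1/2}$, see e.g. Álvarez-Esteban et al. (2018). Since the push-forward of $\mathcal{N}(0, Q)$ under a symmetric linear map $T$ is $\mathcal{N}(0, T Q T)$, one can verify $T Q T = S$ numerically; a sketch of ours assuming NumPy and SciPy:

```python
import numpy as np
from scipy.linalg import sqrtm, inv

def optimal_map(Q, S):
    """Optimal transport map T pushing N(0, Q) forward to N(0, S):
    T = Q^{-1/2} (Q^{1/2} S Q^{1/2})^{1/2} Q^{-1/2}, symmetric positive."""
    rQ = sqrtm(Q)
    irQ = inv(rQ)
    return np.real(irQ @ sqrtm(rQ @ S @ rQ) @ irQ)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 3))
Y = rng.normal(size=(3, 3))
Q = X @ X.T + 0.1 * np.eye(3)   # random positive definite matrices
S = Y @ Y.T + 0.1 * np.eye(3)
T = optimal_map(Q, S)
# The push-forward of N(0, Q) under T is N(0, T Q T), so T Q T recovers S.
```

The check `T @ Q @ T == S` (up to numerical error) is precisely the defining property of the transport map between the two Gaussians.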
2.2 Existence and uniqueness of the population and empirical Fréchet means
Along with the investigation of the properties of the distance at hand, and before moving to more general statistical questions, one should ask oneself whether the Fréchet mean exists and, if so, whether it is unique. Let $L$ be the linear subspace associated to the affine subspace $\mathbb{A}$, i.e. the representation $\mathbb{A} = Q_0 + L$ holds for some $Q_0$. We further assume that $\mathbb{A}$ has a non-empty intersection with the space of positive definite operators:
Assumption 3. The intersection of $\mathbb{A}$ with the space of positive definite operators is non-empty.
The next theorem ensures existence and uniqueness of the Fréchet mean (1.3).
Theorem 2.1 (Existence and uniqueness of the Fréchet mean).
Note that this result generalises the statement for scale-location families in 2-Wasserstein space presented in Álvarez-Esteban et al. (2015), Theorem 3.10, and originally obtained in the seminal work Agueh and Carlier (2011), Theorem 6.1. Namely, in the unconstrained case the barycenter exists, is unique, and is characterised as the unique solution of a fixed-point equation similar to (2.4).
The existence, uniqueness, and measurability of the estimator defined in (1.4) is a direct corollary of the above theorem. The proof of Theorem 2.1 is presented in Section A.3.
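Although the fixed-point equation (2.4) is not reproduced here, in the Gaussian and scale-location literature the unconstrained barycenter solves $Q = \mathbb{E}\bigl(Q^{1/2} S Q^{1/2}\bigr)^{1/2}$, which suggests the standard fixed-point iteration sketched below. This sketch is ours, following the iteration studied by Álvarez-Esteban et al.; it is not necessarily the algorithm used by the authors of this paper.

```python
import numpy as np
from scipy.linalg import sqrtm, inv

def bw_barycenter(mats, iters=100):
    """Fixed-point iteration for the unconstrained Bures-Wasserstein
    barycenter: at the fixed point, Q = mean_i (Q^{1/2} S_i Q^{1/2})^{1/2}."""
    Q = np.mean(mats, axis=0)          # start from the Euclidean mean
    for _ in range(iters):
        rQ = sqrtm(Q)
        irQ = inv(rQ)
        M = np.mean([sqrtm(rQ @ S @ rQ) for S in mats], axis=0)
        Q = np.real(irQ @ M @ M @ irQ)  # Q_{k+1} = Q_k^{-1/2} M^2 Q_k^{-1/2}
    return Q

rng = np.random.default_rng(1)
mats = []
for _ in range(5):
    X = rng.normal(size=(3, 3))
    mats.append(X @ X.T + 0.1 * np.eye(3))   # random SPD sample
Qbar = bw_barycenter(mats)

# At convergence the fixed-point residual should vanish.
rQ = np.real(sqrtm(Qbar))
residual = Qbar - np.mean([np.real(sqrtm(rQ @ S @ rQ)) for S in mats], axis=0)
```

Each sweep costs a handful of matrix square roots; for well-conditioned inputs the iteration converges quickly to the fixed point.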
2.3 Convergence of the empirical Fréchet mean and distance
Armed with the knowledge about the properties of the distance, the optimal map, and its differential, we are now equipped to introduce the main results of the current study. Theorem 2.2 presents the asymptotic convergence of the empirical barycenter to the population one.
Theorem 2.2 (Central limit theorem for the Fréchet mean).
Remark 1.
Here the limiting quadratic form is understood as restricted to the linear subspace associated with the affine constraint. We intentionally postpone the explicit definitions of the limiting covariance operators, as they require the introduction of many technical details, which would make the description of the main results less transparent. The proof of the theorem relies on the Fréchet differentiability of the squared distance in the vicinity of the barycenter, with the differential taken with respect to the lower argument.
It is worth noting that the result (B) obtained in the CLT enables the construction of data-driven asymptotic confidence sets. However, technical problems might appear with the inversion of the empirical covariance. For instance, numerical simulations show that it can be degenerate if the distribution is supported on a set of diagonal matrices. This immediately raises a question concerning the development of some other confidence-set construction methodology based on resampling techniques, which would simplify the process from a computational point of view. We consider this as a subject for further research.
Since the Bures-Wasserstein distance is the main tool for the analysis, the convergence properties of the distance between the empirical and population barycenters are also of great interest. The next statement is an almost straightforward corollary of Theorem 2.2.
Corollary 2.1 (Asymptotic distribution of the Bures-Wasserstein distance).
To illustrate the result, we consider the case of diagonal matrices; this setting allows us to write down the explicit form of the limiting distribution on the right-hand side of the above corollary. All proofs are collected in Section A.3. Section 4 illustrates the asymptotic behaviour of the empirical barycenter and the empirical variance on both artificial and real data sets.
2.4 Concentration of the empirical Fréchet mean
The next important issue is the concentration properties of the empirical Fréchet mean under the assumption of sub-Gaussianity of the underlying distribution (Assumption 2).
Theorem 2.3 (Concentration of the empirical Fréchet mean).
Concentration of the Bures-Wasserstein distance between the empirical and population barycenters is a corollary of the above theorem.
Corollary 2.2 (Concentration of the empirical Fréchet mean in the Bures-Wasserstein distance).
Under the conditions of Theorem 2.3 the following result holds:
Proofs are collected in Section B.
2.5 Central limit theorem and concentration for the empirical Fréchet variance
In this section we investigate the properties of the Fréchet variance defined in (1.4). The next theorem presents a central limit theorem for the empirical variance.
Theorem 2.4 (Central limit theorem for the empirical Fréchet variance).
Let be s.t. and . Then
The last important result of the current study describes the concentration properties of the empirical Fréchet variance.
Theorem 2.5 (Concentration of the empirical Fréchet variance).
Proofs of these two theorems are collected in Section B.1.
3 Connection to other problems
In this section we explain the connection of the obtained results to some other problems. Section 3.1 investigates the relation between the Bures-Wasserstein barycenter and the 2-Wasserstein barycenter of a scale-location family. Section 3.2 illustrates the idea of searching for a barycenter on some affine subspace.
3.1 Connection to scale-location families of measures
We first present the concept of a scale-location family of absolutely continuous measures.
Definition 3.1.
Let $X_0$ be a random variable following a law $\mu_0$ from the set of all absolutely continuous measures with finite second moment. The set of all affine transformations of $\mu_0$,
$\mathcal{F}(\mu_0) = \bigl\{ \operatorname{law}(A X_0 + b) : A \text{ positive definite},\ b \text{ a vector} \bigr\},$
is referred to as a scale-location family.
Scale-location families attract lots of attention in modern data analysis and appear in many practical applications, as the concept is user-friendly in terms of theoretical analysis and, at the same time, possesses very high modelling power. For example, it is widely used in medical imaging Wassermann et al. (2010), modelling of molecular dynamics Gonzalez et al. (2017), clustering procedures del Barrio et al. (2017), climate modelling Mallasto and Feragen (2017), embedding of complex objects in low-dimensional spaces Muzellec and Cuturi (2018), and so on.
A possible metric that takes into account the non-linearity of the underlying data set is the Wasserstein distance $W_2$. Let $\mu_1$ and $\mu_2$ be elements of a scale-location family. We denote their first and second moments as
(3.1) $m_i = \mathbb{E}_{X \sim \mu_i} X, \qquad \Sigma_i = \operatorname{Cov}(\mu_i), \qquad i = 1, 2.$
It is a well-known fact that in the case of scale-location families $W_2$ depends only on the first and second moments of the observed measures:
$W_2^2(\mu_1, \mu_2) = \|m_1 - m_2\|^2 + d^2(\Sigma_1, \Sigma_2),$
where $d$ is the Bures-Wasserstein distance (1.1). For more details on the general class of optimal transportation distances we recommend the excellent books Ambrosio and Gigli (2013) and Villani (2009).
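In the Gaussian case the moment formula can be checked directly. The sketch below is ours, assuming NumPy and SciPy; the one-dimensional identity $W_2^2(\mathcal{N}(0,1), \mathcal{N}(3,4)) = 3^2 + (2-1)^2 = 10$ (with variances 1 and 4, hence standard deviations 1 and 2) serves as a sanity check.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_sq_gaussian(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2):
    ||m1 - m2||^2 + tr(S1) + tr(S2) - 2 tr((S1^{1/2} S2 S1^{1/2})^{1/2})."""
    r1 = sqrtm(S1)
    bw = np.trace(S1) + np.trace(S2) - 2 * np.real(np.trace(sqrtm(r1 @ S2 @ r1)))
    return float(np.dot(m1 - m2, m1 - m2) + bw)

# 1-D sanity check: for scalars the Bures-Wasserstein part collapses to
# the squared difference of standard deviations.
d2 = w2_sq_gaussian(np.zeros(1), np.eye(1), 3.0 * np.ones(1), 4.0 * np.eye(1))
# d2 = 3^2 + (1 - 2)^2 = 10
```

The same routine works unchanged for any member of a scale-location family, since only the moments (3.1) enter the formula.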
Distribution over a scale-location family
In many cases we are interested in scale-location families generated at random. Let $P$ be a probability measure on a generic probability space such that each elementary outcome is mapped to a pair of a scaling parameter and a shift parameter. A randomly sampled measure belongs to the family by construction, and its first and second moments are expressed through the sampled parameters and the first and second moments of the generating measure.
The Fréchet variance of the family at an arbitrary point is the expected squared Wasserstein distance to a random member of the family. Given an i.i.d. sample from $P$, we define its empirical analogue in the same way; the population and empirical barycenters are then the minimizers of these two functionals. Note that both barycenters belong to the scale-location family and are uniquely characterised by their first and second moments, see e.g. Theorem 3.10 of Álvarez-Esteban et al. (2015):
(3.2) $\bar{m} = \mathbb{E}\, m,$
(3.3) $\bar{\Sigma} = \mathbb{E}\,\bigl(\bar{\Sigma}^{1/2}\, \Sigma\, \bar{\Sigma}^{1/2}\bigr)^{1/2},$
where $(m, \Sigma)$ denote the moments of a random member of the family.
It is worth noting that the concept of the Wasserstein barycenter, originally presented in the seminal work by Agueh and Carlier (2011), has become a topic of extensive scientific interest in the last few years. The work Bigot et al. (2012) focuses on the convergence of a parametric class of barycenters, while Bigot et al. (2017) investigate the asymptotic properties of regularised barycenters. The most general results on the limiting distribution of empirical barycenters are obtained in Ahidar-Coutrix et al. (2018). This work provides rates of convergence for empirical barycentres of a Borel probability measure on a metric space, either under a weak curvature constraint on the underlying space, or for the case of a non-negatively curved space on which geodesics emanating from a barycenter can be extended. Theorem 2.2 specifies the results obtained in Ahidar-Coutrix et al. (2018) for the case of scale-location families. Corollary 2.1 partially answers an (implicit) question raised by the work Le Gouic and Loubes (2015) concerning the rate of convergence of empirical barycenters: namely, for the case of scale-location families it is of the parametric order $n^{-1/2}$. However, the above-mentioned work covers only the unconstrained setting (1.2). The paper Kroshnin (2018) obtains an analogue of the law of large numbers for the case of arbitrary cost functions for barycenters on some affine subspace, i.e. the setting (1.3). A result similar in spirit to Theorem 2.5 is obtained in Del Barrio et al. (2016); however, there the authors consider only the space of probability measures supported on the real line endowed with the 2-Wasserstein distance. To the best of our knowledge, there are no results similar to the concentration Theorem 2.3 and Theorem 2.2 in the case of the 2-Wasserstein distance.

3.2 Connection to quantum mechanics
This section illustrates the idea of a barycenter restricted to some affine subspace. We first briefly recall the concept of quantum densities. The quantum density operator is used in quantum mechanics as a possible way of describing the statistical state of a quantum system; it might be considered as an analogue of the phase-space density in classical statistical mechanics. The formalism was introduced by John von Neumann in 1927. In essence, a density matrix is a Hermitian positive semi-definite operator with unit trace.
Given a random ensemble of density matrices, one is able to recover the mean using averaging in the classical Euclidean sense. However, the Bures-Wasserstein barycenter suggests an alternative way to define the “most typical” representative (1.3) in terms of the fidelity measure (2.2). We consider the following statistical setting. Let there be some mechanism which generates quantum states. Given an i.i.d. sample of such states, we write down the population and empirical variances, and the population and empirical barycenters in the class of all density operators of a fixed dimension are defined as their respective minimizers.
It can be easily shown that “taking the global Fréchet barycenter”, in other words neglecting the unit-trace condition, we end up with the global barycenter, which is the solution of the fixed-point equation already mentioned in Section 2; the corresponding map is a contraction. The solution of this equation need not have unit trace and is, in general, not a density operator. In other words, the unit-trace condition ensures that the population and empirical barycenters also belong to the class of density operators. Taking into account the results obtained in Section 2, the empirical barycenter is a natural consistent estimator of the population one, with known rate of convergence and deviation properties.
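The trace deficiency of the unconstrained barycenter is easy to see numerically in the commuting case: for diagonal density matrices the Bures-Wasserstein barycenter is available in closed form as the entrywise square of the averaged square roots (this simplification to the scalar case is ours, not a construction from the paper). A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(3)

# Five random diagonal density matrices (nonnegative spectra, unit trace).
spectra = rng.random((5, 4))
spectra /= spectra.sum(axis=1, keepdims=True)

# For commuting matrices the Bures-Wasserstein barycenter acts entrywise:
# lambda_bar_j = (mean_i sqrt(lambda_ij))^2.
bary_spectrum = np.mean(np.sqrt(spectra), axis=0) ** 2

# By Jensen's inequality the trace drops strictly below one whenever the
# states differ, so the unconstrained barycenter is not a density matrix.
trace_bary = bary_spectrum.sum()
```

Renormalising, or imposing the unit-trace constraint as in (1.3), restores a valid density operator.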
4 Experiments on simulated and real data sets
4.1 Simulated data
In this section we consider a simulated data set. To generate a covariance matrix, we sample at random an orthogonal matrix and a diagonal matrix with positive diagonal entries. Fig. 1–3 illustrate the convergence results presented in Theorem 2.2, Corollary 2.1, and Theorem 2.4, respectively. The population barycenter was computed using a sample of observed covariance matrices. A solid line depicts the density of the limiting distribution, whereas dashed lines correspond to densities of the empirical Bures-Wasserstein barycenter for different sample sizes. Simulations were carried out for matrices of two different sizes.

4.2 Data aggregation in climate modelling
In this section we carry out experiments on a family of Gaussian processes, using a climate-related data set collected in Siberia (Russia) between 1930 and 2009 Bulygina and Razuvaev (2012); Tatusko (1990). We consider a family of Gaussian curves that describe the daily minimum temperatures within one year, measured at a set of 30 randomly sampled meteorological stations. Each curve is obtained by means of regression and maximum likelihood estimation and is sampled at a fixed grid of points. More details on this data set are provided in Mallasto and Feragen (2017). Each member of the scale-location family under consideration is a Gaussian process characterised by the mean and covariance inherent to a given year, and all years are assigned equal weights. The population Wasserstein barycenter of the family is itself a Gaussian process. A family of approximating processes with parameters (3.3) is constructed by means of resampling with replacement from the original data set, with the sample size varying over a range of values. Fig. 4 and Fig. 5 present the densities of the corresponding approximation errors.
References
 Agueh and Carlier (2011) Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
 Ahidar-Coutrix et al. (2018) Adil Ahidar-Coutrix, Thibaut Le Gouic, and Quentin Paris. On the rate of convergence of empirical barycentres in metric spaces: curvature, convexity and extendible geodesics. arXiv preprint arXiv:1806.02740, 2018.
 Álvarez-Esteban et al. (2015) P. C. Álvarez-Esteban, E. del Barrio, J. A. Cuesta-Albertos, and C. Matrán. Wide consensus for parallelized inference. ArXiv e-prints, November 2015.
 Álvarez-Esteban et al. (2018) Pedro C. Álvarez-Esteban, Eustasio del Barrio, Juan A. Cuesta-Albertos, Carlos Matrán, et al. Wide consensus aggregation in the Wasserstein space. Application to location-scatter families. Bernoulli, 24(4A):3147–3179, 2018.
 Ambrosio and Gigli (2013) Luigi Ambrosio and Nicola Gigli. A user’s guide to optimal transport. In Modelling and optimisation of flows on networks, pages 1–155. Springer, 2013.
 Bhatia et al. (2018) Rajendra Bhatia, Tanvi Jain, and Yongdo Lim. On the Bures–Wasserstein distance between positive definite matrices. Expositiones Mathematicae, 2018.
 Bigot et al. (2012) Jérémie Bigot, Thierry Klein, et al. Consistent estimation of a population barycenter in the Wasserstein space. ArXiv e-prints, 2012.
 Bigot et al. (2017) Jérémie Bigot, Elsa Cazelles, and Nicolas Papadakis. Penalized barycenters in the Wasserstein space. 2017.
 Brenier (1991) Yann Brenier. Polar factorization and monotone rearrangement of vectorvalued functions. Communications on pure and applied mathematics, 44(4):375–417, 1991.
 Bulygina and Razuvaev (2012) ON Bulygina and VN Razuvaev. Daily temperature and precipitation data for 518 Russian meteorological stations. Technical report, ESSDIVE (Environmental System Science Data Infrastructure for a Virtual Ecosystem); Oak Ridge National Lab.(ORNL), Oak Ridge, TN (United States), 2012.

 Calsbeek and Goodnight (2009) Brittny Calsbeek and Charles J Goodnight. Empirical comparison of G matrix test statistics: finding biologically relevant change. Evolution: International Journal of Organic Evolution, 63(10):2627–2635, 2009.
 Dajka et al. (2011) Jerzy Dajka, Jerzy Łuczka, and Peter Hänggi. Distance between quantum states in the presence of initial qubit-environment correlations: A comparative study. Physical Review A, 84(3):032120, 2011.
 Del Barrio et al. (2016) E. Del Barrio, P. Gordaliza, H. Lescornel, and J.-M. Loubes. Central limit theorem and bootstrap procedure for Wasserstein's variations with application to structural relationships between distributions. ArXiv e-prints, November 2016.
 del Barrio et al. (2017) E. del Barrio, J. A. Cuesta-Albertos, C. Matrán, and A. Mayo-Íscar. Robust clustering tools based on optimal transportation. Statistics and Computing, pages 1–22, 2017.
 Del Barrio et al. (2015) Eustasio Del Barrio, Hélène Lescornel, and Jean-Michel Loubes. A statistical analysis of a deformation model with Wasserstein barycenters: estimation procedure and goodness of fit test. arXiv preprint arXiv:1508.06465, 2015.
 Fano (1957) Ugo Fano. Description of states in quantum mechanics by density matrix and operator techniques. Reviews of Modern Physics, 29(1):74, 1957.
 Flamary et al. (2018) Rémi Flamary, Marco Cuturi, Nicolas Courty, and Alain Rakotomamonjy. Wasserstein discriminant analysis. Machine Learning, 107(12):1923–1945, 2018.
 Gonzalez et al. (2017) Oscar Gonzalez, Marco Pasi, Daiva Petkevičiūtė, Jaroslaw Glowacki, and JH Maddocks. Absolute versus relative entropy parameter estimation in a coarsegrain model of DNA. Multiscale Modeling & Simulation, 15(3):1073–1107, 2017.
 Goodnight and Schwartz (1997) Charles J Goodnight and James M Schwartz. A bootstrap comparison of genetic covariance matrices. Biometrics, pages 1026–1039, 1997.
 Gramfort et al. (2015) Alexandre Gramfort, Gabriel Peyré, and Marco Cuturi. Fast optimal transport averaging of neuroimaging data. In International Conference on Information Processing in Medical Imaging, pages 261–272. Springer, 2015.
 Hsu et al. (2012) Daniel Hsu, Sham Kakade, Tong Zhang, et al. A tail inequality for quadratic forms of subgaussian random vectors. Electronic Communications in Probability, 17, 2012.
 Jozsa (1994) Richard Jozsa. Fidelity for mixed quantum states. Journal of modern optics, 41(12):2315–2323, 1994.
 Koltchinskii et al. (2011) Vladimir Koltchinskii et al. Von Neumann entropy penalization and lowrank matrix estimation. The Annals of Statistics, 39(6):2936–2973, 2011.
 Kroshnin (2018) Alexey Kroshnin. Fréchet barycenters in the Monge–Kantorovich spaces. Journal of Convex Analysis, 25(4):1371–1395, 2018.
 Le Gouic and Loubes (2015) Thibaut Le Gouic and Jean-Michel Loubes. Existence and consistency of Wasserstein barycenters. June 2015.
 Mallasto and Feragen (2017) Anton Mallasto and Aasa Feragen. Learning from uncertain curves: The 2Wasserstein metric for Gaussian processes. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5660–5670. Curran Associates, Inc., 2017.
 Marian and Marian (2008) Paulina Marian and Tudor A Marian. Bures distance as a measure of entanglement for symmetric two-mode Gaussian states. Physical Review A, 77(6):062319, 2008.

 Montavon et al. (2016) Grégoire Montavon, Klaus-Robert Müller, and Marco Cuturi. Wasserstein training of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, pages 3718–3726, 2016.
 Muzellec and Cuturi (2018) Boris Muzellec and Marco Cuturi. Generalizing point embeddings using the Wasserstein space of elliptical distributions. arXiv preprint arXiv:1805.07594, 2018.

 Rippl et al. (2016) Thomas Rippl, Axel Munk, and Anja Sturm. Limit laws of the empirical Wasserstein distance: Gaussian distributions. Journal of Multivariate Analysis, 151:90–109, 2016.
 Takatsu et al. (2011) Asuka Takatsu et al. Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics, 48(4):1005–1026, 2011.
 Tatusko (1990) Rene Tatusko. Cooperation in climate research: An evaluation of the activities conducted under the US–USSR agreement for environmental protection since 1974. 1990.
 Villani (2009) Cédric Villani. Optimal Transport, volume 338 of Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2009. ISBN 9783540710493 9783540710509.
 Wassermann et al. (2010) Demian Wassermann, Luke Bloy, Efstathios Kanterakis, Ragini Verma, and Rachid Deriche. Unsupervised white matter fiber clustering and tract probability map generation: Applications of a Gaussian process framework for white matter fibers. NeuroImage, 51(1):228–241, 2010.
Appendix A Proof of Central Limit Theorem
A.1 Properties of the distance
Proof of Lemma 2.1.
First, we prove that the optimal matrix is self-adjoint: assuming the opposite, one can strictly improve the objective, a contradiction. If the matrix is Hermitian but not positive semi-definite, the objective can again be strictly improved; hence the optimal matrix is positive semi-definite.
The proof of the Central Limit Theorem mainly relies on the differentiability of the map (2.3). Lemma A.2 shows that it can be linearised in the vicinity of the barycenter, with a self-adjoint negative definite operator as the leading term and a remainder controlled in operator norm. The properties of this operator are investigated in Lemma A.3. Let us introduce some notation: if $\Phi$ is a functional of a matrix $X$, then we denote its differential as $d\Phi_X$.
Lemma A.1.
Proof.
First, let us consider the squaring map $X \mapsto X^2$. It is smooth, and its differential
$d(X^2)[H] = XH + HX$
is non-degenerate:
$\langle XH + HX,\, H \rangle_F = 2 \operatorname{tr}\bigl(X H^2\bigr) > 0$
whenever $X \succ 0$ and $H \neq 0$. From now on $\langle \cdot, \cdot \rangle_F$ denotes the scalar product associated with the Frobenius norm. Now, applying the inverse function theorem, we obtain that the inverse map $A \mapsto A^{1/2}$ is also smooth, and its differential $H' = d\bigl(A^{1/2}\bigr)[H]$ satisfies the Sylvester equation
$A^{1/2} H' + H' A^{1/2} = H.$
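Since the differential of the squaring map at $X$ is $H \mapsto XH + HX$, the inverse function theorem identifies the differential of the square root at $A$ in direction $H$ with the solution $X$ of the Sylvester equation $A^{1/2} X + X A^{1/2} = H$. This can be verified against finite differences; a sketch of ours assuming NumPy and SciPy (`scipy.linalg.solve_sylvester` solves $AX + XB = Q$):

```python
import numpy as np
from scipy.linalg import sqrtm, solve_sylvester

# Differential of the square-root map at a positive definite A:
# d(sqrt)(A)[H] = X, where X solves sqrt(A) X + X sqrt(A) = H.
rng = np.random.default_rng(2)
G = rng.normal(size=(3, 3))
A = G @ G.T + np.eye(3)          # well-conditioned positive definite point
H = rng.normal(size=(3, 3))
H = (H + H.T) / 2.0              # symmetric perturbation direction

rA = np.real(sqrtm(A))
dsqrt = solve_sylvester(rA, rA, H)

# Central finite-difference check of the directional derivative.
eps = 1e-6
fd = (np.real(sqrtm(A + eps * H)) - np.real(sqrtm(A - eps * H))) / (2 * eps)
```

The agreement of `dsqrt` with the finite-difference quotient is precisely the smoothness statement used in the proof.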