1 Introduction
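Given a separable and complete metric space $(M,d)$, define $\mathcal{P}_2(M)$ as the set of Borel probability measures $P$ on $M$ such that
$$\int_M d(x,y)^2\,\mathrm{d}P(y) < +\infty$$
for all $x \in M$. A barycenter of $P \in \mathcal{P}_2(M)$, also called a Fréchet mean [Fr48], is any element $x^* \in M$ such that
$$x^* \in \operatorname*{arg\,min}_{x \in M} \int_M d(x,y)^2\,\mathrm{d}P(y). \tag{1.1}$$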
When it exists, a barycenter stands as a natural analog of the mean of a (square integrable) probability measure on . Alternative notions of mean value include local minimisers [Kar14], means [yokota2017], exponential barycenters [emery1991barycentre] or convex means [emery1991barycentre]. Extending the notion of mean value to the case of probability measures on spaces with no Euclidean (or Hilbert) structure has a number of applications ranging from geometry [sturm2003] and optimal transport [villani2003, villani2008optimal, santambrogio2015, cp2018]
to statistics and data science
[Pelletier2005, BLL15, Bigotandco2018, KSS19], and the context of abstract metric spaces provides a unifying framework encompassing many nonstandard settings.

Properties of barycenters, such as existence and uniqueness, happen to be closely related to geometric characteristics of the space .
These properties are addressed in the context of Riemannian manifolds in [Af11].
Many interesting examples of metric spaces, however, cannot be described as smooth manifolds because of their singularities or infinite dimensional nature.
More general geometrical structures are geodesic metric spaces which include many more examples of interest (precise definitions and necessary background on metric geometry are reported in Appendix A).
The barycenter problem has been addressed in this general setting.
The scenario where has nonpositive curvature (from here on, curvature bounds are understood in the sense of Aleksandrov) is considered in [sturm2003].
More generally, the case of metric spaces with upper bounded curvature is studied in [yokota2016] and [yokota2017].
The context of spaces with lower bounded curvature is discussed in [yokota2012rigidity] and [ohta2012barycenters].
The focus on metric spaces with nonnegative curvature may be motivated by the increasing interest in the theory of optimal transport and its applications.
Indeed, a space of central importance in this context is the Wasserstein space , equipped with the Wasserstein metric , known to be geodesic and with nonnegative curvature (see Section 7.3 in [ambrosio2008gradient]).
In this framework, the barycenter problem was first studied by [agueh2010barycenters] and has since gained considerable momentum.
Existence and uniqueness of barycenters in have further been studied in [Legouic2017].
A number of objects of interest, including barycenters as a special case, may be described as minimisers of the form
(1.2) 
for some probability measure on metric space and some functional . While we obviously recover the definition of barycenters whenever , many functionals of interest are not of this specific form. With a slight abuse of language, minimisers such as will be called generalised barycenters in the sequel. A first example we have in mind, in the context where , is the case where functional is an f-divergence, i.e.
for some convex function . Known for their importance in statistics [LeCam86, Tsyb09] and information theory [Vaj89], f-divergences have become a crucial tool in a number of other fields such as geometry and optimal transport [Sturm06I, Sturm06II, LottVillani2009]
[Goodfellow2014]. Other examples arise when the squared distance in (1.1) is replaced by a regularised version aiming at enforcing computationally friendly properties, such as convexity, while providing at the same time a sound approximation of . A significant example in this spirit is the case where functional is the entropy-regularised Wasserstein distance (also known as the Sinkhorn divergence), widely used as a proxy for in applications [cuturi2013, cp2018, Altschuler2017, Dvurechensky2018].

In this paper, our main concern is to provide rates of convergence for empirical generalised barycenters, defined as follows. Given a collection of independent and valued random variables with the same distribution , we call empirical generalised barycenter any
(1.3)
Any such provides a natural empirical counterpart of a generalised barycenter defined in (1.2).
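To make the definition concrete, the following Python sketch computes an empirical generalised barycenter by exhaustive search over a finite candidate set. It is only an illustration under stated assumptions: the names `empirical_functional` and `empirical_generalised_barycenter`, the restriction to a finite grid of candidates, and the choice of the real line with the squared distance as functional are ours and are not taken from the paper.

```python
def empirical_functional(b, samples, F):
    """Average of F(b, X_i) over the sample, i.e. the empirical objective."""
    return sum(F(b, x) for x in samples) / len(samples)


def empirical_generalised_barycenter(candidates, samples, F):
    """Return a candidate minimising the empirical objective.

    Assumption: the search is restricted to a finite list of candidates,
    which only approximates the minimisation over the whole space.
    """
    return min(candidates, key=lambda b: empirical_functional(b, samples, F))


def squared_distance(b, x):
    """Functional of the barycenter problem on the real line."""
    return (b - x) ** 2


samples = [0.0, 1.0, 2.0, 5.0]
candidates = [k / 10 for k in range(51)]  # grid on [0, 5] with step 0.1
b_n = empirical_generalised_barycenter(candidates, samples, squared_distance)
```

With the squared distance on the real line, the minimiser over this grid coincides with the sample mean of the observations. Any other functional of two arguments can be passed in place of `squared_distance`.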
The statistical properties of have been studied in a few specific scenarios.
In the case where and is a Riemannian manifold, significant contributions, establishing in particular consistency and limit distribution under general conditions, are [Bhattacharya2003, Bhattacharya2005] and [Kendall2011].
Asymptotic properties of empirical barycenters in the Wasserstein space are studied in [Legouic2017].
We are only aware of a few contributions providing finite sample bounds on the statistical performance of .
The paper [Bigotandco2018] provides upper and lower bounds on convergence rates for empirical barycenters in the context of the Wasserstein space over the real line.
Independently of the present contribution, [Sh18] studies a similar problem and provides results complementary to ours.
In addition to more transparent conditions, our results are based on the fundamental assumption that there exist constants and such that, for all ,
(1.4) 
We show that condition (1.4) provides a connection between usual assumptions in the field of empirical processes and geometric characteristics of the metric space . First, the reader familiar with the theory of empirical processes will identify in the proof of Theorems 2.1 and 2.5 that condition (1.4) implies a Bernstein condition on the class of functions indexing our empirical process, that is an equivalence between their and norms. Many authors have emphasised the role of this condition for obtaining fast rates of convergence of empirical minimisers. Major contributions in that direction are for instance [mammen1999, massart2000, blanchard2003, bartlett2005, bartlett2006, koltchinskii2006, bandm2006] and [Men15]. In particular, this assumption may be understood in our context as an analog of the Mammen-Tsybakov low-noise assumption [mammen1999] used in binary classification. Second, we show that condition (1.4) carries a strong geometrical meaning. In the context where , [sturm2003] established a tight connection between (1.4), with , and the fact that has nonpositive curvature. When , we show that (1.4) actually holds with and in geodesic spaces of nonnegative curvature under flexible conditions related to the possibility of extending geodesics emanating from a barycenter. Finally, for a general functional , we connect (1.4) to its strong convexity properties. Using terminology introduced in [sturm2003] in a slightly more specific context, we will, by extension, call (1.4) a variance inequality.

The paper is organised as follows. Section 2 provides convergence rates for generalised empirical barycenters under several assumptions on functional and two possible complexity assumptions on metric space . Section 3 investigates in detail the validity of the variance inequality (1.4) in different scenarios. In particular, we focus on studying (1.4) under curvature bounds of the metric whenever . Additional examples where our results apply are discussed in Section 4. Proofs are postponed to Section 5. Finally, Appendix A presents an overview of basic concepts and results in metric geometry for convenience.
2 Rates of convergence
In this section, we provide convergence rates for generalised empirical barycenters. Paragraph 2.1 defines our general setup and mentions our main assumptions on functional . Paragraphs 2.2 and 2.3 present rates under different assumptions on the complexity of metric space . Subsection 2.4 discusses the optimality of our results.
2.1 Setup
Let be a separable and complete metric space and a measurable function. Let be a Borel probability measure on . Suppose that, for all , the function is integrable with respect to , and let
(2.1) 
which we suppose exists. Given a collection of independent and valued random variables with same distribution , we consider an empirical minimiser
(2.2) 
The present section studies the statistical performance of under the following assumptions on .

(A1) There exists a constant such that, for all ,

(A2) There exist constants and such that, for all ,

(A3) (Variance inequality) There exist constants and such that, for all ,
Assumptions (A1) and (A2) are transparent boundedness and regularity conditions. For instance if , these assumptions are satisfied whenever is bounded with , and , by the triangle inequality. The meaning of condition (A3) is less obvious at first sight. A detailed discussion of (A3) is postponed to Section 3. For now, we mention three straightforward implications of (A3). First, note that imposing both (A2) and (A3) requires to be bounded. Indeed, plugging (A2) into (A3) yields
More importantly, (A3) implies that minimiser is unique. Finally, condition (A3) applied to minimiser reads
(2.3) 
The left hand side of this inequality is the estimation performance of . The integral, under power , may be called the learning performance of , or its excess risk. Having this comparison in mind, we will focus on controlling the learning performance of knowing that an upper bound on may be readily deduced from our results. The remainder of the section therefore presents upper bounds for the right hand side of (2.3) under specific complexity assumptions on . For that purpose, we recall the definition of covering numbers. For and , an net for is a finite subset such that
where . The covering number is the smallest integer such that there exists an net of size for in . The function will be referred to as the metric entropy of .
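As an illustration of covering numbers, the sketch below builds a net of a finite set of points greedily: a point becomes a new centre whenever it is farther than the prescribed radius from all current centres. By construction every point lies within the radius of some centre, so the size of the output is an upper bound on the covering number of the finite set with centres taken in the set. The function name `greedy_net` and the toy example on the integers 0, ..., 100 are assumptions made for illustration only.

```python
def greedy_net(points, dist, eps):
    """Greedy construction of an eps-net of a finite collection of points.

    Every point ends up within distance eps of some returned centre,
    and the centres are pairwise more than eps apart.
    """
    centres = []
    for p in points:
        # Keep p as a new centre only if no existing centre covers it.
        if all(dist(p, c) > eps for c in centres):
            centres.append(p)
    return centres


def abs_dist(a, b):
    """Usual distance on the real line."""
    return abs(a - b)


points = list(range(101))            # the integers 0, 1, ..., 100
net = greedy_net(points, abs_dist, eps=10)
covered = all(min(abs_dist(p, c) for c in net) <= 10 for p in points)
```

In this example the greedy centres are 0, 11, 22, ..., 99. The greedy net is not guaranteed to be of minimal size in general; it only provides an upper bound.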
2.2 Doubling condition
Our first complexity assumption is the following.

(B1) (Doubling condition) There exist constants such that, for all ,
Condition (B1) essentially characterises as a dimensional space and implies the following result.
Theorem 2.1.
Note that bounds in expectation may be derived from this result, using classical arguments. As described in Section 3, (A2) and (A3) hold in several interesting cases for . In this case, Theorem 2.1 exhibits an upper bound of order . A discussion on the optimality of this result is postponed to paragraph 2.4 below. Next, we briefly comment on condition (B1).
Remark 2.2.
With a slight abuse of terminology, condition (B1) is termed doubling condition. In the literature, the doubling condition usually refers to the situation where inequality
holds for some . It may be seen that this inequality implies (B1) with . Note however that (B1) is slightly less restrictive as it requires only the control of the covering numbers of balls centered at . This fact is sometimes useful, as described in Example 2.4 below.
We now give two examples where assumption (B1) holds.
Example 2.3.
Suppose there exists a positive Borel measure on such that, for some , we have
(2.4) 
for some constants . Then, for all and all ,
(2.5) 
and thus (B1) is satisfied. The proof is given in Section 5. Measures satisfying condition (2.4) are called regular or Ahlfors-David regular. Many examples of such spaces are discussed in section 12 in [GrLu00] or section 2.2 in [AT04]. Note that the present example includes the case where is a dimensional compact Riemannian manifold equipped with the volume measure .
A direct and simple consequence of Example 2.3 is that (B1) holds in any dimensional vector space equipped with any norm since the Lebesgue measure satisfies (2.4) with . While simple in essence, this observation allows one to exhibit more general parametric families satisfying (B1), as in the next example.

Example 2.4 (Location-scatter family).
Here, we detail an example of a subset of the Wasserstein space for which assumption (B1) holds. We say that is a location-scatter family if the following two requirements hold:

All elements of have a nonsingular covariance matrix.

For every two measures , with expectations and and with covariance matrices and respectively, the map
pushes forward to , i.e. for any Borel set , which we denote .
Such sets have been studied for instance in [alvarez2016fixed]. The map being the gradient of a convex function, the theory of optimal transport guarantees that the coupling is optimal, so that
(2.6) 
where denotes the trace of matrix and refers to the standard Euclidean norm. Next, we show that such a family satisfies (B1). Let be a probability measure on and denote a barycenter of with mean and covariance matrix . For any two measures , set
where and denote the mean and covariance matrix of , . The pushforward is a (possibly suboptimal) coupling between and . Therefore,
(2.7)  
where stands for the Frobenius norm. Note that defines a norm on the vector space where denotes the space of symmetric matrices of size . Then, define the function that maps each in the location-scatter family, with mean and covariance , to
Then, combining (2.6) and (2.7), it follows that
(2.8) 
with equality if or . Therefore, since is a subset of a vector space of dimension , there exists such that for all ,
Hence, (B1) holds.
The result derived in Example 2.4 may be generalised to other parametric subsets of the Wasserstein space (or more generally parametric subsets of geodesic Polish spaces with nonnegative curvature). Indeed, since the Wasserstein space over has nonnegative curvature, the support of pushed forward to the tangent cone at a barycenter is isometric to a Hilbert space (this result follows by combining Theorem 5.5 and Lemma 5.8) and its norm satisfies (2.8) (see Proposition A.13). Therefore it is enough, for (B1) to hold, to require that the image of the support of by the map (see paragraph A.5 for a definition) is included in a finite dimensional vector space.
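For the location-scatter family of Example 2.4, the squared Wasserstein distance depends only on means and covariance matrices. The sketch below evaluates the classical closed form, namely the squared Euclidean distance between the means plus the trace of S0 + S1 - 2 (S0^{1/2} S1 S0^{1/2})^{1/2}, which we take to be the content of (2.6). This reading of (2.6), the helper names `sqrtm_psd` and `location_scatter_w2_squared`, and the numerical example are assumptions made for illustration.

```python
import numpy as np


def sqrtm_psd(a):
    """Square root of a symmetric positive semi-definite matrix via eigh."""
    a = (a + a.T) / 2.0                      # symmetrise against round-off
    eigval, eigvec = np.linalg.eigh(a)
    eigval = np.clip(eigval, 0.0, None)      # remove tiny negative eigenvalues
    return (eigvec * np.sqrt(eigval)) @ eigvec.T


def location_scatter_w2_squared(m0, s0, m1, s1):
    """Squared 2-Wasserstein distance from means (m0, m1) and covariances (s0, s1)."""
    root0 = sqrtm_psd(s0)
    cross = sqrtm_psd(root0 @ s1 @ root0)
    return float(np.sum((m0 - m1) ** 2) + np.trace(s0 + s1 - 2.0 * cross))


m0, s0 = np.zeros(2), np.eye(2)
m1, s1 = np.array([3.0, 4.0]), 4.0 * np.eye(2)
w2_sq = location_scatter_w2_squared(m0, s0, m1, s1)
```

In this isotropic example the mean term equals 25 and the covariance term equals 2 (the dimension times the squared difference of the standard deviations), so the result is close to 27 up to floating-point error.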
2.3 Polynomial metric entropy
Condition (B1) essentially describes a finite dimensional behaviour and does not apply in some scenarios of interest. This paragraph addresses the situation where the complexity of set , measured by its metric entropy, is polynomial.

(B2) (Polynomial metric entropy) There exist constants such that, for all ,
Theorem 2.5.
As for Theorem 2.1, bounds in expectation may be easily derived from this result. The optimality of Theorem 2.5 is addressed in paragraph 2.4. Next is an example where assumption (B2) applies.
Example 2.6 (Wasserstein space).
Let be a closed Euclidean ball in and let be the set of square-integrable probability measures supported on equipped with the 2-Wasserstein metric . Combining the result of Appendix A in [bolley2007] with a classical bound on the covering number of Euclidean balls, it follows that for all ,
In particular, for any , there exists depending on and such that, for all ,
so that (B2) is satisfied for all .
We finally point towards Theorem 3 in [WeedBerthet19] which may be used to derive upper bounds on the covering number of subsets of the Wasserstein space composed of measures, absolutely continuous with respect to the Lebesgue measure, and with density belonging to some Besov class.
2.4 On optimality
At the level of generality considered by Theorems 2.1 and 2.5, we have not been able to assess the optimality of the given rates for all choices of functional satisfying the required assumptions and all values of constants .
In particular, it is likely that the rates displayed in Theorem 2.5 are artefacts of our proof techniques and that results may be improved in some specific scenarios using additional information on the problem at hand.
However, we discuss below some regimes where our results appear sharp and, on the contrary, settings where our results should allow for improvements.
To start our discussion, consider the barycenter problem, i.e. the case where . In the context where is a Hilbert space equipped with its usual metric and is square integrable, explicit computations reveal that is an empirical barycenter of in the sense of (2.2) and that (in the sense of the Pettis or Bochner integral) is the unique barycenter of . In addition, we check that, for all ,
(2.9) 
We notice that, under assumptions much more general than those considered in the present paper, the rate of convergence (in expectation) of empirical barycenters in a Hilbert space is of order . While this observation concerns the very special case of Hilbert spaces, we conjecture that the rate of convergence of empirical barycenters is of order in a wide family of metric spaces including Hilbert spaces as a special case. Identifying precisely this wider family remains an open question but it appears from this discussion that boundedness and complexity restrictions, such as (A1), (B1) and (B2), may be unnecessary for the barycenter problem. Whenever is equipped with a general norm, a very interesting recent contribution, connected to that question, is [LuMe19]. On a more positive note, we point towards two encouraging aspects encoded in our results in the context of the barycenter problem. First, consider the case where is equipped with the Euclidean metric and suppose that the ’s are independent with Gaussian distribution . Then, identity (2.9) reads in this case

It is known, furthermore, that corresponds (up to universal constants) to the minimax rate of estimation of in the context where the ’s are i.i.d. sub-Gaussian random variables with mean and variance proxy (see Chapter 4 in [Rig17]).
Therefore, provided is a bounded metric space (which guarantees (A1) and (A2) with ) and provided assumption (A3) holds for (which is often the case, as discussed in paragraphs 3.1 and 3.2 below), Theorem 2.1 recovers the optimal rate of convergence , up to constants, in a fairly wide context.
Finally, note that while possibly suboptimal in some cases, the rates provided by Theorems 2.1 and 2.5, combined with Examples 2.4 and 2.6 and the discussion of paragraph 3.2, provide, to our knowledge, the first rates for the Wasserstein barycenter problem at this level of generality.
An exception is the Wasserstein space over the real line (studied, for instance, in [Bigotandco2018]) which happens to be isometric to a convex subset of a Hilbert space as can be deduced for instance from combining statement (iii) of Proposition 3.5 in [sturm2003] and Proposition 4.1 in [kloeckner2010].
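The special role of the real line can be seen numerically: for two empirical measures with the same number of equally weighted atoms, the 2-Wasserstein distance reduces to a Euclidean distance between sorted samples, and a barycenter of several such measures is obtained by averaging order statistics. The sketch below relies on these classical facts; the function names `w2_squared_1d` and `barycenter_1d`, the restriction to equal sample sizes with uniform weights, and the toy data are assumptions made for illustration.

```python
import numpy as np


def w2_squared_1d(x, y):
    """Squared 2-Wasserstein distance between two uniform empirical measures
    on the real line with the same number of atoms (sorted-sample formula)."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    return float(np.mean((x - y) ** 2))


def barycenter_1d(measures):
    """Atoms of a Wasserstein barycenter of uniform empirical measures on the
    real line, obtained by averaging order statistics (quantile averaging)."""
    return np.mean([np.sort(np.asarray(m, float)) for m in measures], axis=0)


mu_1 = [2.0, 0.0]
mu_2 = [4.0, 6.0]
dist_sq = w2_squared_1d(mu_1, mu_2)
bary = barycenter_1d([mu_1, mu_2])
```

Here the sorted atoms of the two measures are (0, 2) and (4, 6), so the squared distance is 16 and the barycenter has atoms (2, 4).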
Outside the setting of the barycenter problem, not much is known on optimal rates of estimation or learning (in the sense described at the end of paragraph 2.1) of defined in (2.1). We believe this question remains mainly open. It is our impression that the rate, conjectured to hold for empirical barycenters in a wide setup, is a behavior very specific to the case . For more general functionals, we suspect that the complexity of should have an impact, as is classically the case in nonparametric statistics or learning theory. Note in particular that whenever parameters in (A2) and (A3), the rate in Theorem 2.5 becomes
which corresponds to known state of the art learning rates, under complexity assumptions in the same flavor as (B2), as displayed for instance by Theorem 2 in [RaSrTs17].
However, the exact situations under which Theorem 2.5 provides optimal rates of convergence remain unclear to us.
Finally, note that the second inequality in both Theorems 2.1 and 2.5 holds for the limiting case (with remaining finite), which corresponds to dropping assumption (A3). In the context of Theorem 2.1 (or that of Theorem 2.5 with ) the case gives rise to a bound of order
with high probability. Note however that this limiting case does not provide any bound for .
3 Variance inequalities
This section studies conditions implying the validity of (A3). The first three paragraphs below focus on the barycenter problem, i.e. the case where , and investigate (A3) in the light of curvature bounds. Aleksandrov curvature bounds of a geodesic space (see paragraph A.3) are a key concept of comparison geometry and many geometric phenomena are known to depend on whether the space has a curvature bounded from below or above. In paragraphs 3.1 and 3.2, devoted respectively to nonpositively and nonnegatively curved spaces, we show that curvature bounds also affect statistical procedures through their relation with (A3). Finally, paragraph 3.3 addresses the case of a general and connects (A3) to its convexity properties. The material presented in this section relies heavily on background in metric geometry gathered in Appendix A for convenience.
3.1 Non positive curvature
This first paragraph introduces a fundamental insight due to K.T. Sturm, in the context of geodesic spaces of nonpositive curvature, that has strongly influenced our study. To put the following result in perspective, we recall that a geodesic space is said to have nonpositive curvature ( for short) if, for any and any geodesic such that and ,
for all . Nonpositive curvature is given a probabilistic description in the next result.
Theorem 3.1 (Theorem 4.9 in [sturm2003]).
Let be a separable and complete metric space. Then, the following properties are equivalent.

is geodesic and .

Any probability measure has a unique barycenter and, for all ,
(3.1)
In words, Theorem 3.1 states in particular that (A3) holds for any possible probability measure on , with and , provided . It is worth mentioning again that (A1) and (A2) also hold, provided in addition , so that the case of bounded metric spaces with nonpositive curvature fits our basic assumptions very well. Condition is satisfied in a number of interesting examples. Such examples include (convex subsets of) Hilbert spaces or the case where is a simply connected Riemannian manifold with nonpositive sectional curvature. Other examples are metric trees and other metric constructions such as products or gluings of spaces of nonpositive curvature (see [BriHaf99], [BuBuIv01] or [AlKaPe17] for more details).
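In the Hilbert case, the variance inequality can be checked by direct computation and holds with equality. The sketch below verifies this numerically for an empirical measure in three-dimensional Euclidean space, taking the inequality in the form: squared distance from x to the barycenter is at most the average over the sample of the difference between the squared distance to x and the squared distance to the barycenter. This form of the inequality, the variable names and the simulated data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(size=(50, 3))     # toy sample: 50 points in R^3
barycenter = sample.mean(axis=0)      # barycenter of the empirical measure
x = rng.normal(size=3)                # arbitrary comparison point


def mean_squared_distance(z, points):
    """Average squared Euclidean distance from z to the sample points."""
    return float(np.mean(np.sum((points - z) ** 2, axis=1)))


# Left hand side: squared distance between x and the barycenter.
lhs = float(np.sum((x - barycenter) ** 2))
# Right hand side: excess of the objective at x over its minimal value.
rhs = mean_squared_distance(x, sample) - mean_squared_distance(barycenter, sample)
```

The two sides agree up to floating-point error, which reflects the usual bias-variance decomposition in Euclidean and Hilbert spaces.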
3.2 Non negative curvature and extendable geodesics
The present paragraph investigates the case of spaces of nonnegative curvature. Contrary to the case of spaces of nonpositive curvature, condition (A3) may not hold for every probability measure on if . Indeed, note that unlike in the case when , there might exist probability measures with more than one barycenter whenever . A simple example when , the unit Euclidean sphere in with angular metric, is the uniform measure on the equator having the north and south poles as barycenters. Since (A3) implies uniqueness of barycenter , this condition disqualifies such probability measures. Hence, establishing conditions under which (A3) holds is more delicate whenever . The next result provides an important first step in this direction.
Theorem 3.2.
Let be a separable and complete geodesic space such that . Let and be a barycenter of . Then, for all ,
(3.2) 
where, for all and all ,
(3.3) 
Therefore, satisfies (A3) with and if and only if, for all ,
(3.4) 
By definition of a barycenter, the right hand side of (3.2) is non negative. In addition implies that for all (see Proposition A.13 in appendix A). Combining these two observations with the definition of implies that
The next result identifies a condition under which a variance inequality holds.
Theorem 3.3.
Let be a separable and complete geodesic space such that . Let and be a barycenter of . Fix and suppose that the following properties hold.

For almost all , there exists a geodesic connecting to that can be extended to a function that remains a shortest path between its endpoints.

The point remains a barycenter of the measure where is defined by .
Then, for all ,
and thus (A3) holds with and .
Examples of geodesic spaces of nonnegative curvature include (convex subsets of) Hilbert spaces or the case where is a simply connected Riemannian manifold with nonnegative sectional curvature. Next is a simple example where the condition of Theorem 3.3, i.e. the ability to extend geodesics, takes a simple form.
Example 3.4 (Unit sphere).
Let be the unit Euclidean sphere in equipped with the angle metric. Let be such that it has a unique barycenter . In , a shortest path between two points is a part of a great circle and a part of a great circle is a shortest path between its endpoints if, and only if, it has length less than . Therefore, if a neighborhood of , the cut locus of , satisfies , then condition of Theorem 3.3 is satisfied for some . Note however that condition is not enough to give a variance inequality in general. Indeed, consider the uniform measure on the equator with the north and south poles for barycenters. Then, for almost all , the geodesic connecting the south pole to can be extended by a factor in the sense of in the above theorem. However, since there is no unique barycenter, no variance inequality can hold in this case. Therefore, requirement cannot be dropped in Theorem 3.3.
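The non-uniqueness phenomenon described in Example 3.4 can be observed numerically. The sketch below discretises the uniform measure on the equator of the unit sphere in three-dimensional space and evaluates the average squared angular distance at the two poles and at a point of the equator. The discretisation into 360 equally spaced points and the names `angular_distance` and `frechet_functional` are assumptions made for illustration.

```python
import numpy as np


def angular_distance(u, v):
    """Angle between two unit vectors, i.e. the geodesic distance on the sphere."""
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))


def frechet_functional(b, points):
    """Average squared angular distance from b to the given points."""
    return float(np.mean([angular_distance(b, p) ** 2 for p in points]))


# Discretised uniform measure on the equator (assumption: 360 atoms).
angles = 2.0 * np.pi * np.arange(360) / 360
equator = np.stack([np.cos(angles), np.sin(angles), np.zeros_like(angles)], axis=1)

north = np.array([0.0, 0.0, 1.0])
south = -north
equator_point = np.array([1.0, 0.0, 0.0])

value_north = frechet_functional(north, equator)
value_south = frechet_functional(south, equator)
value_equator = frechet_functional(equator_point, equator)
```

Both poles are at angular distance pi/2 from every point of the equator, so they give the same value (pi/2)^2, which is smaller than the value obtained at a point of the equator. This is consistent with the two poles being distinct barycenters.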
In the rest of this paragraph, we provide a sufficient condition for the extendable geodesics condition of Theorem 3.3 to hold in the context where is the Wasserstein space over a Hilbert space . In this case, is known to have nonnegative curvature (see Section 7.3 in [ambrosio2008gradient]). We recall the following definition. For a convex function , its subdifferential is defined by
Then we can prove the following result.
Theorem 3.5.
Let be the Wasserstein space over a Hilbert space . Let and be two elements of and let be a geodesic connecting to in . Then, can be extended by a factor (in the sense of in Theorem 3.3) if, and only if, the support of the optimal transport plan of lies in the subdifferential of a strongly convex map .
3.3 Convexity
Here, we connect (A3) to convexity properties of functional along paths in .
Definition 3.6 (convexity).
Given , and a path , a function is called convex along if the function
is convex. If is geodesic, a function is called geodesically convex if, for all , is convex along at least one geodesic connecting to .
In the sequel, we abbreviate convexity by convexity. When is geodesic, a convex function refers to a geodesically convex function unless stated otherwise. Note that convexity is a special case of uniform convexity (see Definition 1.6 in [sturm2003]). We start with a general result.
Theorem 3.7.
Let and . Suppose that, for all , there exists a path connecting to along which the function
is convex. Then, for all ,
(3.5) 
and hence (A3) holds for . In particular, the assumption of the theorem holds whenever, for all and (almost) all , there exists a path connecting to along which the function is itself convex. A special case is when is geodesic and, for (almost) all , the functional is itself geodesically convex.
The previous result is deliberately stated in a general form. This allows one to investigate several notions of convexity that may coexist, as in the space of probability measures.
Remark 3.8.
For some spaces, such as the Wasserstein space over , there exist two canonical paths between two points and . One is the geodesic defined by the fact that for
(which need not be unique). A second one is the linear interpolation between the two measures
. It may be that is strongly convex along only one of these two paths. This case is further discussed in Section 4.

We end by discussing the notion of a convex metric space. In the remainder of the paragraph, we consider .
Definition 3.9.
A geodesic space is said to be convex if, for all , the function is geodesically convex.
Using Theorem 3.7, it follows that if is convex, then (A3) holds with and . When , note that a geodesic space is convex if and only if (see Proposition A.6). When , the connection between curvature bounds and convexity of a metric space is not straightforward. In particular, convex metric spaces include many interesting spaces, for which condition does not hold. We give two examples.
Example 3.10 (Proposition 3.1. in [ohta2007]).
Let be a geodesic space with and . Then is convex for
The implications of this result can be compared to Theorem 3.3. In the context of , considered in Example 3.4, the above result states that is convex provided it is included in the interior of an th of a sphere. In comparison, Theorem 3.3 expresses that a variance inequality (A3) may hold for a measure supported on the whole sphere minus a neighbourhood of the cut locus of the barycenter.
Example 3.11 (Theorem 1 in [Bcl94]).
If is a measured space and , then is convex.
4 Further examples
We describe additional examples, different from the usual barycenter problem, where our results apply. In these examples, we focus mainly on functionals over the Wasserstein space. Paragraph 4.1 addresses the case of f-divergences. Subsection 4.2 discusses interaction energy. In 4.3 we consider several examples related to the approximation of barycenters.
4.1 f-divergences
We call f-divergence a functional of the form
(4.1) 
for some convex function . Such functionals are also known as internal energy, relative functionals or Csiszár divergences. Specific choices for function give rise to well known examples. For instance,
gives rise to the Kullback-Leibler divergence (or relative entropy). Minimisers of the average Kullback-Leibler divergence, or its symmetrised version, have been considered for instance in
[veldhuis2002centroid] for speech synthesis. Other functions like or lead respectively to the chi-squared divergence or the total variation. The next results present sufficient conditions under which (A1), (A2) and (A3) hold in this case. First, suppose where is a convex set. Note that (A1) holds if there exists and a reference measure such that all have a density with respect to this measure with values in on their support. Next, we show that (A2) holds under related conditions.

Theorem 4.1.
Suppose that all measures have a density with respect to some reference measure. Suppose there exist such that all take values in . Suppose in addition that there exists such that all are Lipschitz on . Assume finally that is differentiable and that is Lipschitz on . Then, for all ,
so that (A2) holds for and .
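To fix ideas on the functionals considered in this paragraph, the following sketch evaluates divergences of the form "sum over atoms of q times f(p/q)" for discrete distributions, with the three classical generators x log x (Kullback-Leibler), (x - 1)^2 (chi-squared) and |x - 1| / 2 (total variation). The discrete setting, the ordering of the two arguments, the requirement that q be positive everywhere, and all function names are assumptions made for illustration and may differ from the conventions of (4.1).

```python
import numpy as np


def f_divergence(p, q, f):
    """Discrete f-divergence: sum of q_i * f(p_i / q_i), assuming q_i > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any(q <= 0):
        raise ValueError("this sketch assumes a strictly positive reference q")
    return float(np.sum(q * f(p / q)))


def f_kl(x):
    """Generator x log x, with the convention 0 log 0 = 0."""
    safe = np.where(x > 0, x, 1.0)    # log(1) = 0 handles the case x = 0
    return x * np.log(safe)


def f_chi2(x):
    """Generator of the chi-squared divergence."""
    return (x - 1.0) ** 2


def f_tv(x):
    """Generator of the total variation distance."""
    return 0.5 * np.abs(x - 1.0)


p = [0.5, 0.5]
q = [0.25, 0.75]
kl = f_divergence(p, q, f_kl)
chi2 = f_divergence(p, q, f_chi2)
tv = f_divergence(p, q, f_tv)
```

For these two distributions the three values are 0.5 log(4/3), 1/3 and 1/4 respectively, up to floating-point error, and each divergence vanishes when the two arguments coincide.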
The next result is devoted to condition (A3) and is a slight modification of Theorem 9.4.12 in [ambrosio2008gradient]. Below we say that is log-concave if for some convex .
Theorem 4.2.
Consider . Suppose that is convex and that there exists such that, for all and all , . Suppose finally that is a geodesically convex subset of . Then the following holds.

For any log-concave , the functional is geodesically convex.

Let and suppose is supported on log-concave measures in . Then for any minimiser
and any ,
and thus (A3) holds with and .
Note that the requirements made on in Theorem 4.2 are compatible with for . However, these requirements exclude examples such as