On the choice of weight functions for linear representations of persistence diagrams

07/10/2018 ∙ by Divol Vincent, et al. ∙ University of California-Davis, Inria

Persistence diagrams are efficient descriptors of the topology of a point cloud. As they do not naturally belong to a Hilbert space, standard statistical methods cannot be directly applied to them. Instead, feature maps (or representations) are commonly used for the analysis. A large class of feature maps, which we call linear, depends on some weight functions, the choice of which is a critical issue. An important criterion to choose a weight function is to ensure stability of the feature maps with respect to Wasserstein distances on diagrams. We improve known results on the stability of such maps, and extend it to general weight functions. We also address the choice of the weight function by considering an asymptotic setting; assume that X_n is an i.i.d. sample from a density on [0,1]^d. For the Čech and Rips filtrations, we characterize the weight functions for which the corresponding feature maps converge as n approaches infinity, and by doing so, we prove laws of large numbers for the total persistences of such diagrams. Both approaches lead to the same simple heuristic for tuning weight functions: if the data lies near a d-dimensional manifold, then a sensible choice of weight function is the persistence to the power α with α≥ d.

1. Introduction

Topological data analysis, or TDA (see [chazal2017introduction] for a survey), is a recent field at the intersection of computational geometry, statistics and probability theory that has been successfully applied to various scientific areas, including biology [yao2009topological], chemistry [nakamura2015persistent], material science [lee2017quantifying] and the study of time series [seversky2016time]. It consists of an array of techniques aimed at understanding the topology of a $d$-dimensional manifold based on an approximating point cloud. For instance, clustering can be seen as the estimation of the connected components of a given manifold. Persistence diagrams are one of the tools used most often in TDA. They are efficient descriptors of the topology of a point cloud, each consisting of a multiset $D$ of points in $\mathbb{R}^2$ (see Section 2 for a more precise definition). The space $\mathcal{D}$ of persistence diagrams is not naturally endowed with a Hilbert or Banach space structure, making statistical inference rather awkward. A common scheme to overcome this issue is to use a representation or feature map $\Phi : \mathcal{D} \to \mathcal{B}$, where $\mathcal{B}$ is some Banach space: classical machine learning techniques are then applied to $\Phi(D_1), \dots, \Phi(D_N)$ instead of $D_1, \dots, D_N$, where it is assumed that an entire set (or sample) of persistence diagrams is observed. A natural way to create such feature maps is to consider a function $\phi : \mathbb{R}^2 \to \mathcal{B}$ and to define

$$\Phi(D) := \sum_{r \in D} \phi(r). \tag{1.1}$$

A multiset can equivalently be seen as a measure. Therefore we let $D$ also denote the measure $\sum_{r \in D} \delta_r$, with $\delta_r$ denoting the Dirac measure in $r$. With this notation, $\Phi(D)$ is equal to $D(\phi)$, the integration of $\phi$ against the measure $D$. Representations as in (1.1) are called linear as they define linear maps from the space of finite signed measures to the Banach space $\mathcal{B}$. In the following, a representation will always be considered linear. Many linear representations exist in the literature, including persistence surfaces and their variants [chen2015statistical, reininghaus2015stable, kusano2018kernel, adams2017persistence], persistence silhouettes [chazal2014stochastic] and the accumulated persistence function [biscio2016accumulated]. Notable non-linear representations include persistence landscapes [bubenik2015statistical] and sliced Wasserstein kernels [carriere2017sliced].
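To make the definition in (1.1) concrete, here is a minimal Python sketch of a linear representation. The Gaussian-bump feature `gaussian_feature`, the grid and the bandwidth `sigma` are illustrative assumptions (loosely in the spirit of persistence surfaces), not the construction of any cited paper; a finite-dimensional vector stands in for the Banach space.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]  # a diagram point r = (birth, death)
Vector = List[float]         # stand-in for an element of the Banach space


def gaussian_feature(grid: Sequence[Point], sigma: float) -> Callable[[Point], Vector]:
    """Hypothetical phi: a Gaussian bump centred at r, evaluated on a fixed grid."""
    def phi(r: Point) -> Vector:
        return [
            math.exp(-((r[0] - g[0]) ** 2 + (r[1] - g[1]) ** 2) / (2.0 * sigma ** 2))
            for g in grid
        ]
    return phi


def linear_representation(diagram: Sequence[Point],
                          phi: Callable[[Point], Vector],
                          dim: int) -> Vector:
    """Phi(D): the sum of phi(r) over the points r of the multiset D."""
    total = [0.0] * dim
    for r in diagram:
        total = [acc + value for acc, value in zip(total, phi(r))]
    return total


grid = [(0.0, 1.0), (0.5, 1.5)]
phi = gaussian_feature(grid, sigma=0.5)
d1 = [(0.0, 1.0), (0.2, 0.3)]
d2 = [(0.5, 1.5)]

# Linearity: the representation of a union of multisets is the sum of the parts.
whole = linear_representation(d1 + d2, phi, len(grid))
parts = [a + b for a, b in zip(linear_representation(d1, phi, len(grid)),
                               linear_representation(d2, phi, len(grid)))]
```

Up to floating-point rounding, `whole` and `parts` coincide, which is the additivity over multisets that makes such a map linear on measures.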

In machine learning, a possible way to circumvent the so-called "curse of dimensionality" is to assume that the data lies near some low-dimensional manifold. Under this assumption, the persistence diagram of the data set (built with the Čech filtration, for instance) is made of two different types of points: points far away from the diagonal, which estimate the diagram of the manifold, and points close to the diagonal, which are generally considered to be "topological noise" (see Figure 1). This interpretation is a consequence of the stability theorem for persistence diagrams; see [cohen2007stability]. If the relevant information lies in the structure of the manifold, then the topological noise indeed represents true noise, and representations of the form (1.1) are bound to fail if the contribution of the noise dominates that of the points far from the diagonal. A way to avoid such behaviour is to weigh the points in diagrams by means of a weighting function $w$. If $w$ is chosen properly, i.e. small enough when close to the diagonal, then one can hope that the two contributions can be separated. The weight functions are typically chosen as functions of the persistence $\mathrm{pers}(r) := r_2 - r_1$ of a point $r = (r_1, r_2)$, a choice which will be made here also. Of course, it is not clear what "small enough" really means, and there are several ways to address the issue.
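The effect of a weight function can be illustrated on a toy diagram. The sketch below assumes a weight of the form persistence to the power `alpha` and a made-up diagram with two high-persistence points and a thousand points near the diagonal; the numbers are illustrative only and do not come from the paper.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (birth, death)


def pers(r: Point) -> float:
    """Persistence (lifetime) of a diagram point."""
    return r[1] - r[0]


def weighted_mass(diagram: Sequence[Point], alpha: float) -> float:
    """Total weight of a diagram for the weight function pers ** alpha."""
    return sum(pers(r) ** alpha for r in diagram)


# Toy diagram (assumption): 2 "signal" points of persistence 1,
# and 1000 "noise" points of persistence about 0.01.
signal = [(0.0, 1.0), (0.5, 1.5)]
noise = [(0.5, 0.51)] * 1000

for alpha in (0.0, 1.0, 2.0):
    print(alpha, weighted_mass(signal, alpha), weighted_mass(noise, alpha))
```

With `alpha = 0` (no weighting) the noise mass is 1000 against 2 for the signal; with `alpha = 2` the noise mass drops to roughly 0.1 while the signal mass stays at 2, so a large enough exponent lets the signal dominate.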

A first natural answer is to look at the problem from a stability point of view. Indeed, as data are intrinsically noisy, a statistical method has to be stable with respect to some metric in order to be meaningful. Standard metrics on the space of diagrams are Wasserstein distances , which under mild assumptions (see [cohen2010lipschitz]) are known to be stable with respect to the data on which diagrams are built. The task therefore becomes to find representations that are continuous with respect to some Wasserstein distance. Recent work in [kusano2018kernel] shows that when sampling from a -dimensional manifold, a weight function of the form with ensures that a certain class of representations are Lipschitz. Our first contribution is to show that, for a general class of weight functions, a choice of is enough to make all linear representations continuous (even Hölderian of exponent ).

Figure 1. The persistence diagram for homology of degree 1 of the Rips filtration of i.i.d. points uniformly sampled on a torus. There are two distinct points in the diagram, corresponding to the two equivalence classes of one-dimensional holes of the torus.

Our second (and main) contribution is to evaluate closeness to the diagonal from an asymptotic point of view. Assume that a diagram is built on a data set of size $n$. For which weight functions does the weighted mass of the diagram not diverge? Of course, for this question to make sense, a model for the data set has to be specified. A simple model is given by a Poisson (or binomial) process of intensity $n$ in a cube of dimension $d$. We consider the corresponding diagrams built on a filtration with respect to $q$-dimensional homology, with either the Rips or Čech filtration. A precise definition is given below in Section 2. In this setting, there are no "true" topological features (other than the trivial topological feature of being connected), and thus the diagram based on the sampled data is made solely of topological noise. A first promising result is the vague convergence of the suitably rescaled diagram, seen as a measure, which was recently proven in [hiraoka2018limit] for homogeneous Poisson processes in the cube and in [goel2018strong] for binomial processes on manifolds. However, vague convergence is not enough for our purpose, as the functions we wish to integrate have no good reason to have compact support. Our main result, Theorem 4.4, extends the result of [goel2018strong], for processes on the cube, to a stronger convergence, allowing test functions to have non-compact support (while converging to $0$ near the diagonal) and to have polynomial growth. As a corollary of this general result, the convergence of the $\alpha$-th total persistence, which plays an important role in TDA, is shown. The $\alpha$-th total persistence is defined as $\mathrm{Pers}_\alpha(D) := \sum_{r \in D} \mathrm{pers}(r)^\alpha$.

Theorem 1.1.

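Let $\alpha > 0$ and let $\kappa$ be a density on $[0,1]^d$ such that $0 < \inf \kappa \le \sup \kappa < \infty$. Let $\mathbb{X}_n$ be either a binomial process with parameters $n$ and $\kappa$ or a Poisson process of intensity $n\kappa$ in the cube $[0,1]^d$. Define $\mathrm{dgm}_q(\mathbb{X}_n)$ to be the persistence diagram of $\mathbb{X}_n$ for $q$-dimensional homology, built with either the Rips or the Čech filtration. Then, with probability one, as $n \to \infty$,

$$n^{\alpha/d - 1}\, \mathrm{Pers}_\alpha\big(\mathrm{dgm}_q(\mathbb{X}_n)\big) \to \mu(\mathrm{pers}^\alpha) < \infty \tag{1.2}$$

for some non-degenerate Radon measure $\mu$ on $\mathbb{R}^2$.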

If a diagram is built on a point cloud of size $n$ on a $d$-dimensional manifold, one can expect its topological noise to behave in a similar fashion to the diagram of an $n$-sample on a $d$-dimensional cube (a manifold looking locally like a cube). Therefore, for a weight function $\mathrm{pers}^\alpha$, the weighted mass of the noise should be of order $n^{1 - \alpha/d}$: it can be expected to converge to $0$ if and only if $\alpha > d$, and to remain bounded if and only if $\alpha \ge d$. The same heuristic is found through both approaches (stability and convergence): a weight function of the form $\mathrm{pers}^\alpha$ with $\alpha \ge d$ is sensible if the data lies near a $d$-dimensional object.

Further properties of the process are also shown, namely non-asymptotic rates of decays for the number of points in said diagrams, and the absolute continuity of the marginals of with respect to the Lebesgue measure on .

1.1. Related work

Techniques used to derive the large sample results indicated above are closely related to the field of geometric probability, which is the study of geometric quantities arising naturally from point processes in $\mathbb{R}^d$. A classical result in this field, see [steele1988growth], proves the convergence of the total length of the minimum spanning tree built on $n$ i.i.d. points in the cube. This pioneering work can be seen as a $0$-dimensional special case of our general results about persistence diagrams built for homology of dimension $q$. This type of result has been extended to a large class of functionals in the works of J. E. Yukich and M. Penrose (see for instance [mcgivney1999asymptotics, yukich2000asymptotics, penroseLLN] and [penrose2003random] or [yukich2006probability] for monographs on the subject).

The study of higher dimensional properties of such processes is much more recent. Known results include convergence of Betti numbers for various models and under various asymptotics (see [Kahle2011, kahle2013limit, Yogeshwaran2017, bobrowski2017random]). The paper [bobrowski2015maximally] finds bounds on the persistence of cycles in random complexes, and [hiraoka2018limit] proves limit theorems for persistence diagrams built on homogeneous point processes. The latter is extended to non-homogeneous processes in [trinh2017remark], and to processes on manifolds in [goel2018strong]. Note that our results constitute a natural extension of [trinh2017remark]. In [skraba2017randomly], higher dimensional analogs of minimum spanning trees, called minimal spanning acycles, were introduced. Minimal spanning acycles exhibit strong links with persistence diagrams, and our main theorem can be seen as a convergence result for weighted minimal spanning acycles on geometric random complexes. [skraba2017randomly] also proves the convergence of the total -persistence for Linial-Meshulam random complexes, which are models of random simplicial complexes of a combinatorial nature rather than a geometric nature.

1.2. Notations

  •          Euclidean distance on .

  •        supremum-norm of a function.

  •       open ball of radius centered at .

  • diam    diameter of a set , defined as .

  •           total variation of a measure.

  •              cardinality of a set.

  •        Lipschitz constant of a Lipschitz function .

The rest of the paper is organized as follows. In Section 2, some background on persistent homology is briefly described. The stability results are then discussed in Section 3 whereas the convergence results related to the asymptotic behavior of the sample-based linear representations are stated in Section 4. Section 5 presents some discussion. Proofs can be found in Section 6.

2. Background on persistence diagrams

Persistent homology deals with the evolution of homology through a sequence of topological spaces. We use the field of two elements to build the homology groups. A filtration $K = (K_t)_t$ is an increasing right-continuous sequence of topological spaces: $K_s \subset K_t$ whenever $s \le t$, and $K_t = \bigcap_{s > t} K_s$. For any $s \le t$, the inclusion of spaces gives rise to linear maps between the corresponding homology groups, $H_q(K_s) \to H_q(K_t)$. The persistence diagram of the filtration is a succinct way to summarize the evolution of the homology groups. It is a multiset $D$ of points in $\mathbb{R}^2$ (see Footnote 1), so that each point $r = (r_1, r_2)$ corresponds informally to a $q$-dimensional "hole" in the filtration that appears (or is born) at $r_1$ and disappears (or dies) at $r_2$. The persistence of $r$ is defined as $\mathrm{pers}(r) := r_2 - r_1$ and is understood as the lifetime of the corresponding hole. Persistence diagrams are known to exist given mild assumptions on the filtration (see [chazal2016structure, Section 3.8]). Some basic descriptors of persistence diagrams include the $\alpha$-th total persistence of a diagram, defined as

$$\mathrm{Pers}_\alpha(D) := \sum_{r \in D} \mathrm{pers}(r)^\alpha, \tag{2.1}$$

Footnote 1: Persistence diagrams are, in all generality, multisets of points whose death coordinate may be infinite. We only consider diagrams which do not contain points "at infinity" throughout the paper.

and the persistent Betti numbers, defined as

(2.2)

Also, for , define

(2.3)
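These descriptors are straightforward to compute from a finite diagram. The sketch below is an illustration under stated assumptions: the total persistence sums persistence to the power `alpha`, and the persistent Betti number uses the usual convention of counting points born at or before `r` and still alive after `s` (which is the rank of the map induced in homology by the inclusion from time `r` to time `s`). The exact form of (2.2) and (2.3) in the paper may differ in notation or boundary conventions.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]  # (birth, death)


def total_persistence(diagram: Sequence[Point], alpha: float) -> float:
    """Sum over the diagram of persistence ** alpha."""
    return sum((death - birth) ** alpha for birth, death in diagram)


def persistent_betti(diagram: Sequence[Point], r: float, s: float) -> int:
    """Number of points born at or before r and dying strictly after s (r <= s)."""
    if r > s:
        raise ValueError("expected r <= s")
    return sum(1 for birth, death in diagram if birth <= r and death > s)


diagram = [(0.0, 1.0), (0.2, 0.4), (0.5, 2.0)]
print(total_persistence(diagram, 1.0))     # sum of the lifetimes
print(persistent_betti(diagram, 0.5, 0.9)) # classes alive throughout [0.5, 0.9]
```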

Given a subset of a metric space , standard constructions of filtrations are the Čech filtration and the Rips filtration :

(2.4)
(2.5)

where the abstract simplicial complexes on the right are identified with their geometric realizations. The dimension of a simplex is equal to . If is a simplicial complex, the set of its simplexes of dimension is denoted by .
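As an illustration of how such a filtration is built in practice, the following sketch enumerates the simplices of a Rips complex at a fixed scale `t`. It assumes the convention that a simplex enters the complex as soon as its diameter is at most `t`; other texts use `2 * t`, so the scale convention here is an assumption rather than the paper's definition. The brute-force enumeration is only meant for very small point sets.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def rips_complex(points: Sequence[Sequence[float]], t: float,
                 max_dim: int) -> List[Tuple[int, ...]]:
    """Simplices (as tuples of point indices) of dimension <= max_dim
    whose diameter is at most t."""
    n = len(points)
    close = [[math.dist(points[i], points[j]) <= t for j in range(n)]
             for i in range(n)]
    simplices = []
    for size in range(1, max_dim + 2):  # a simplex with `size` vertices has dimension size - 1
        for sigma in combinations(range(n), size):
            if all(close[i][j] for i, j in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices


triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
small = rips_complex(triangle, t=1.0, max_dim=2)  # the hypotenuse is too long
large = rips_complex(triangle, t=1.5, max_dim=2)  # the full triangle appears
```

The complexes are nested as `t` grows, which is the monotonicity required of a filtration.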

The space of persistence diagrams is the set of all finite multisets in . Wasserstein distances are standard distances on . For , they are defined as:

(2.6)

where is the diagonal of and is a bijection. The definition is extended to by

(2.7)

which is called the bottleneck distance.
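For very small diagrams, these distances can be computed by brute force, which makes the role of the diagonal explicit. The sketch below is not an efficient algorithm: it enumerates all matchings (factorial cost), and it assumes the sup norm as ground metric, so that the distance from a point to the diagonal is half its persistence. The ground norm used in (2.6) may differ.

```python
import math
from itertools import permutations
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (birth, death)


def _cost_matrix(d1: Sequence[Point], d2: Sequence[Point]) -> List[List[float]]:
    """Rows: points of d1, then len(d2) diagonal slots.
    Columns: points of d2, then len(d1) diagonal slots."""
    n1, n2 = len(d1), len(d2)
    size = n1 + n2
    cost = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i < n1 and j < n2:    # point matched to point (sup norm)
                cost[i][j] = max(abs(d1[i][0] - d2[j][0]), abs(d1[i][1] - d2[j][1]))
            elif i < n1:             # point of d1 sent to the diagonal
                cost[i][j] = (d1[i][1] - d1[i][0]) / 2.0
            elif j < n2:             # point of d2 sent to the diagonal
                cost[i][j] = (d2[j][1] - d2[j][0]) / 2.0
    return cost                      # diagonal-to-diagonal matches cost 0


def wasserstein(d1: Sequence[Point], d2: Sequence[Point], p: float) -> float:
    """p-th Wasserstein distance between two tiny diagrams; p = math.inf
    gives the bottleneck distance."""
    cost = _cost_matrix(d1, d2)
    size = len(cost)
    best = math.inf
    for perm in permutations(range(size)):
        costs = [cost[i][perm[i]] for i in range(size)]
        if p == math.inf:
            value = max(costs, default=0.0)
        else:
            value = sum(c ** p for c in costs) ** (1.0 / p)
        best = min(best, value)
    return best
```

For instance, `wasserstein([(0.0, 1.0)], [(0.0, 1.0), (0.5, 0.52)], math.inf)` is about 0.01: the extra point near the diagonal is matched to the diagonal at a cost of half its persistence.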

The use of Wasserstein distances is motivated by crucial stability properties they satisfy. Let $f, g$ be two continuous functions on a triangulable space $X$. Assuming that the persistence diagrams $\mathrm{dgm}(f)$ and $\mathrm{dgm}(g)$ of the filtrations defined by the sublevel sets of $f$ and $g$ exist and are finite (a condition called tameness; see Footnote 2), the stability property of [cohen2007stability, Main Theorem] asserts that $W_\infty(\mathrm{dgm}(f), \mathrm{dgm}(g)) \le \|f - g\|_\infty$, i.e. the diagrams are stable with respect to the functions they are built with. The functions $f$ and $g$ have to be thought of as representing the data: for instance, if the Čech filtration is built on a data set $\mathbb{X}$, then $f = d_{\mathbb{X}}$, where $d_{\mathbb{X}}$ is the distance function to $\mathbb{X}$, i.e. $d_{\mathbb{X}}(x) = \inf_{y \in \mathbb{X}} |x - y|$. When $p < \infty$, similar stability results have been proved under more restrictive conditions on the ambient space $X$, which we now detail.

Footnote 2: Tameness holds under simple conditions, see [chazal2016structure, Section 3.9], which we will always assume to hold in the following.

Definition 2.1.

A metric space is said to have bounded -th total persistence if there exists a constant such that for all tame 1-Lipschitz functions and, for all , .

This assumption holds, for instance, for a -dimensional manifold when with

(2.8)

being a constant depending only on (see [cohen2010lipschitz]). The stability theorem for the -th Wasserstein distances claims:

Theorem 2.2 (Section 3 of [cohen2010lipschitz]).

Let be a compact triangulable metric space with bounded -th total persistence for some . Let be two tame Lipschitz functions. Then, for ,

(2.9)

for , where .

3. Stability results for linear representations

In [kusano2018kernel, Corollary 12], representations of diagrams are shown to be Lipschitz with respect to the Wasserstein distance for weight functions of the form with , provided the diagrams are built with the sublevels of functions defined on a space having bounded -th total persistence. The stability result is proved for a particular function defined by , with a bounded Lipschitz kernel and the associated RKHS (short for Reproducing Kernel Hilbert Space, see [aronszajn1950theory] for a monograph on the subject). We present a generalization of the stability result to (i) general weight functions , (ii) any bounded Lipschitz function and (iii) we only require .

Consider weight functions of the form for a differentiable function satisfying , and, for some , ,

(3.1)

Examples of such functions include for and . We denote the class of such weight functions by . In contrast to [kusano2018kernel], the function does not necessarily take its values in a RKHS, but simply in a Banach space (so that its Bochner integral –see for instance [diestel1984sequences, Chapter 4]– is well defined).

Theorem 3.1.

Let be a Banach space, and let be a Lipschitz continuous function. Furthermore, for with let and for two persistence diagrams and let . Then, for and (and using the conventions and ), we have

(3.2)

The quantity can often be controlled. This is the case, for instance, if the diagrams are built with Lipschitz continuous functions and is a space having bounded -th total persistence.

Corollary 3.2.

Let and consider a compact triangulable metric space having bounded -th total persistence for some . Suppose that are two tame Lipschitz continuous functions, , and . Then, for such that , if and is the maximum persistence in the two diagrams :

(3.3)

where and .

If and , then the result is similar to Theorem 3.3 in [kusano2018kernel]. However, Corollary 3.2 implies that the representations are still continuous (actually Hölder continuous) when , and this is the novelty of the result. Indeed, for such an , one can always choose small enough so that the stability result (3.3) holds. The proofs of Theorem 3.1 and Corollary 3.2 consist of adaptations of similar proofs in [kusano2018kernel]. They can be found in Section 6.

Remark 3.3.

(a) One cannot expect to obtain an inequality of the form (3.2) without quantities (or other quantities depending on the diagrams) appearing on the right-hand side. For instance, in the case , it is clear that adding an arbitrary number of points near the diagonal will barely change the bottleneck distance between the diagrams, whereas the distance between representations can become arbitrarily large.
(b) Laws of large numbers stated in the next section (see also Theorem 1.1 already stated in the introduction), show that Theorem 3.1 is optimal: take and . If is a sample on the -dimensional cube (which has bounded -th total persistence for ), then . The quantity does not converge to for (it even diverges if ), whereas the bottleneck distance between and the empty diagram does converge to .
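Point (a) of the remark can be checked numerically. The sketch below adds `count` points of persistence `eps` to a diagram and compares two quantities: an upper bound on the bottleneck distance (sending every added point to the diagonal, which costs half its persistence under a sup-norm ground metric, an assumption here) and the change in a scalar representation with constant feature equal to 1 and weight persistence to the power `alpha`. The constant feature and the choice `eps = 1 / count` are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (birth, death)


def near_diagonal_points(count: int, eps: float) -> List[Point]:
    """`count` copies of a point of persistence eps."""
    return [(0.5, 0.5 + eps)] * count


def bottleneck_upper_bound(extra: Sequence[Point]) -> float:
    """Cost of the matching that sends every extra point to the diagonal."""
    return max((death - birth) / 2.0 for birth, death in extra)


def representation_gap(extra: Sequence[Point], alpha: float) -> float:
    """Change in the weighted mass caused by the extra points,
    for the weight pers ** alpha and the constant feature 1."""
    return sum((death - birth) ** alpha for birth, death in extra)


rows = []
for count in (10, 100, 1000):
    extra = near_diagonal_points(count, eps=1.0 / count)
    rows.append((count,
                 bottleneck_upper_bound(extra),
                 representation_gap(extra, 0.0),   # unweighted: grows like count
                 representation_gap(extra, 2.0)))  # weighted: shrinks like 1 / count
```

As `count` grows, the bottleneck bound tends to 0 while the unweighted gap grows without bound; with the exponent 2 the gap vanishes instead, which is the kind of behaviour a weight function is meant to enforce.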

The following corollary to the stability result is also a contribution to the asymptotic study of the next section. It presents rates of convergence of representations in a random setting. Let be an $n$-sample of i.i.d. points from a distribution on some manifold . We are interested in the convergence of representations to the representations . The nerve theorem asserts that for any subspace , where is the distance from to . We obtain the following corollary, whose proof is found in Section 6:

Corollary 3.4.

Consider a -dimensional compact Riemannian manifold , and let be a -sample of i.i.d. points from a distribution having a density with respect to the -dimensional Hausdorff measure on . Assume that . Let for some and let be a Lipschitz function. Then, for , and for large enough,

(3.4)

where is a constant depending on and the density .

4. Convergence of total persistence

Consider again the i.i.d. model: let be i.i.d. observations of density with respect to the $d$-dimensional Hausdorff measure on some $d$-dimensional manifold . The general question we are addressing in this section is the convergence of the observed diagrams to , with either the Rips or the Čech filtration. Of course, the question has already been answered in some sense. For instance, Theorem 2.2 affirms that the sequence of observed diagrams will always converge to for the bottleneck distance, if is the Čech filtration (see Footnote 3). However, this is not informative with respect to the convergence of the representations introduced in the previous section, which is related to a weak convergence of measures: for which functions does converge to ?

Footnote 3: A similar result states that the bottleneck distance between two diagrams, each built with the Rips filtration on some space, is controlled by the Hausdorff distance between the two spaces (see Theorem 3.1 of [chazal2009gromov]). As the Rips filtration, contrary to the Čech filtration, cannot be seen as the filtration of the sublevel sets of some function, this stability is not a consequence of Theorem 2.2.

The stability theorem for the bottleneck distance asserts that, for small and for large enough, can be decomposed into two separate sets of points: a set of fixed size that is -close to points in and the remaining part of the diagram, , usually consisting of a large number of points, which have persistence smaller than , i.e. these are the points that lie close to the diagonal. A Taylor expansion of shows that the difference between and is of the order of for some . The latter quantities are therefore of utmost interest to achieve our goal. Instead of directly studying for on a -dimensional manifold, we focus on the study of the quantity for in a cube .

Contributions to the study of quantities of the form have been made in [hiraoka2018limit], where is considered to be the restriction of a stationary process to a box of volume in . Specifically, [hiraoka2018limit] shows the vague convergence of the rescaled diagram to some Radon measure . The two recent papers [trinh2017remark, goel2018strong] prove that a similar convergence actually holds for a binomial sample on a manifold. However, vague convergence deals with continuous functions with compact support, whereas we are interested in functions of the type , which are not even bounded. Our contribution to the matter consists in proving, for samples on the cube , a stronger convergence, allowing test functions to have non-compact support and polynomial growth. As a gentle introduction to the formalism used later, we first recall some known results from geometric probability on the study of Betti numbers, and we also detail relevant results of [hiraoka2018limit, trinh2017remark, goel2018strong].

4.1. Prior work

In the following, the filtration considered is either the Čech or the Rips filtration. Let $\kappa$ be a density on $[0,1]^d$ such that:

$$0 < \inf \kappa \le \sup \kappa < \infty. \tag{4.1}$$

Note that the cube could be replaced by any compact convex body (i.e. the closure of an open bounded convex set). However, the proofs (especially the geometric arguments of Section 6.4) become much more involved in this greater generality. To keep the main ideas clear, we therefore restrict ourselves to the case of the cube. We indicate, however, when challenges arise in the more general setting.

Let $(X_i)_{i \ge 1}$ be a sequence of i.i.d. random variables sampled from the density $\kappa$ and let $(N_n)_{n \ge 1}$ be an independent sequence of Poisson variables with parameter $n$. In the following, $\mathbb{X}_n$ denotes either $\{X_1, \dots, X_n\}$, a binomial process of intensity $\kappa$ and of size $n$, or $\{X_1, \dots, X_{N_n}\}$, a Poisson process of intensity $n\kappa$. The fact that the binomial and Poisson processes are built in this fashion is not important for weak laws of large numbers (only the law of the variables $\mathbb{X}_n$ is of interest), but it is crucial for strong laws of large numbers to make sense.

The persistent Betti numbers are denoted more succinctly by . When , we use the notation .

Theorem 4.1 (Theorem 1.4 in [trinh2017remark]).

Let and . Then, with probability one, converges to some constant. The convergence also holds in expectation.

The theorem is originally stated with the Čech filtration but its generalization to the Rips filtration (or even to more general filtrations considered in [hiraoka2018limit]) is straightforward. The proof of this theorem is based on a simple, yet useful geometric lemma, which still holds for the persistent Betti numbers, as proven in [hiraoka2018limit]. Recall that for , denotes the -skeleton of the simplicial complex .

Lemma 4.2 (Lemma 2.11 in [hiraoka2018limit]).

Let be two subsets of . Then

(4.2)

In [hiraoka2018limit], this lemma was used to prove the convergence of expectations of diagrams of stationary point processes. As indicated in [goel2018strong, Remark 2.4], this lemma can also be used to prove the convergence of the expectations of diagrams for non-homogeneous binomial processes on manifolds. Let be the set of functions with compact support. We say that a sequence of measures on converges -vaguely to if , . Note that this does not include the function or the function . Vague convergence is denoted by . Set . Remark 2.4 in [goel2018strong] implies the following theorem.

Theorem 4.3 (Remark 2.4 in [goel2018strong] and Theorem 1.5 in [hiraoka2018limit]).

Let be a probability density function on a $d$-dimensional compact manifold , with for . Then, for , there exists a unique Radon measure on such that

(4.3)

and

(4.4)

The measure is called the persistence diagram of intensity for the filtration .

4.2. Main results

A function is said to vanish on the diagonal if

(4.5)

Denote by the set of all such functions. The weight functions of Section 3 all lie in . We say that a function has polynomial growth if there exist two constants , such that

(4.6)

The class of functions in with polynomial growth constitutes a reasonable class of functions one may want to build a representation with. Our goal is to extend the convergence of Theorem 4.3 to this larger class of functions. Convergence of measures to with respect to , i.e. , , is denoted by . Note that this class of functions is standard: it is for instance known to characterize -th Wasserstein convergence in optimal transport (see [villani2008optimal, Theorem 6.9]).

Theorem 4.4.

(i) For , there exists a unique Radon measure such that and, with probability one, . The measure is called the -th persistence diagram of intensity for the filtration . It does not depend on whether is a Poisson or a binomial process, and is of positive finite mass.
(ii) The convergence also holds pointwise for the distance: for all , and for all , . In particular, .

Remark 4.5.

(a) Remark 2.4 together with Theorem 1.1 in [goel2018strong] imply that the measure has the following expression:

(4.7)

where is the -th persistence diagram of uniform density on , appearing in Theorem 4.3, and the expectation is taken with respect to a random variable having a density .

(b) Assume $d = 1$ and $q = 0$. Then, the persistence diagram is simply the collection of the intervals where is the order statistics of . The measure can be explicitly computed: it converges to a measure having density with respect to the Lebesgue measure on , where has density . Take the uniform density on : one sees that this is coherent with the basic fact that the spacings of a homogeneous Poisson process on $\mathbb{R}$ are distributed according to an exponential distribution. Moreover, the expression (4.7) is found again in this special case.
(c) Theorem 1.9 in [hiraoka2018limit] states that the support of is . Using equation (4.7), the same holds for .
(d) Theorem 1.1 is a direct corollary of Theorem 4.4. Indeed, we have

a quantity which converges to . The relevance of Theorem 1.1 is illustrated in Figure 2, where Čech complexes are computed on random samples on the torus.
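The one-dimensional case of point (b) lends itself to a quick simulation. The sketch below draws `n` uniform points on the unit interval and, as an assumption about conventions, takes the persistence of each finite degree-0 class to be the gap between consecutive order statistics (a Čech convention indexed by the radius would halve these values). Since `n` times a gap is approximately exponential with mean 1, the rescaled sum of gaps to the power `alpha`, with the normalisation `n ** (alpha - 1)` (that is, `n ** (alpha / d - 1)` with `d = 1`), should be close to the Gamma function evaluated at `alpha + 1`.

```python
import math
import random


def rescaled_total_persistence(n: int, alpha: float, rng: random.Random) -> float:
    """n ** (alpha - 1) times the sum of (gap ** alpha) over the gaps
    between consecutive order statistics of n uniform points on [0, 1]."""
    xs = sorted(rng.random() for _ in range(n))
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    return n ** (alpha - 1.0) * sum(gap ** alpha for gap in gaps)


rng = random.Random(0)            # fixed seed for reproducibility
n = 20000
estimate = rescaled_total_persistence(n, alpha=2.0, rng=rng)
limit = math.gamma(3.0)           # E[E ** 2] = Gamma(3) = 2 for E exponential of mean 1
print(estimate, limit)
```

The simulated value fluctuates around the limit with a relative error of the order of a few percent for this sample size.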

The core of the proof of Theorem 4.4 consists in a control of the number of points appearing in diagrams. This bound is obtained thanks to geometric properties satisfied by the Čech and Rips filtrations. Finding good requirements to impose on a filtration for this control to hold is an interesting question. The following states some non-asymptotic controls of the number of points in diagrams which are interesting by themselves.

Figure 2. For or points uniformly sampled on the torus, persistence images [adams2017persistence] for different weight functions are displayed. For , the mass of the topological noise is far larger than the mass of the true signal, the latter being composed of the two points with high persistence. For , the two points with high persistence are clearly distinguishable. For , the noise has also disappeared, but so has one of the points with high persistence.

Proposition 4.6.

Let and define . Then, there exist constants (which can be made explicit) depending on and such that, for any ,

(4.8)

As an immediate corollary, the moments of the total mass are uniformly bounded. However, the proof of the almost sure finiteness of is much more intricate. Indeed, we are unable to control this quantity directly, and we instead prove that a majorant of satisfies concentration inequalities. The majorant arises as the number of simplicial complexes of a simpler process, whose expectation is also controlled.

It is natural to wonder whether has some density with respect to the Lebesgue measure on : it is the case for the for , and it is shown in [chazal_et_al:LIPIcs:2018:8739] that also has a density. Even if those elements are promising, it is not clear whether the limit has a density in a general setting. However, we are able to prove that the marginals of have densities.

Proposition 4.7.

Let (resp. ) be the projection on the -axis (resp. -axis). Then, for , the pushforwards and have densities with respect to the Lebesgue measure on . For , has a density.

5. Discussion

The tuning of the weight functions in the representations of persistence diagrams is a critical issue in practice. When the statistician has good reasons to believe that the data lies near a $d$-dimensional structure, we give, through two different approaches, a heuristic to tune this weight function: a weight of the form $\mathrm{pers}^\alpha$ with $\alpha \ge d$ is sensible. The study carried out in this paper allowed us to show new results on the asymptotic structure of random persistence diagrams. While the existence of a limiting measure in a weak sense was already known, we strengthen the convergence, allowing a much larger class of test functions. Some results about the properties of the limit are also shown, namely that it has a finite mass, finite moments, and that its marginals have densities with respect to the Lebesgue measure. Challenging open questions include:

  • Convergence of the rescaled diagrams with respect to some transport metric: The main issue consists in showing that one can extend, in a meaningful way, the distance to general Radon measures. This is the topic of a recent work (see Section 5.1 in [divol2019understanding]).

  • Existence of a density for the limiting measure: An approach for obtaining such results would be to control the numbers of points of a diagram in some square .

  • Convergence of the number of points in the diagrams: The number of points in the diagrams is a quantity known to be unstable (motivating the use of the bottleneck distance, which is insensitive to it). However, experiments show that this number, suitably rescaled, converges in this setting. An analog of Lemma 4.2 for the number of points in the diagrams with small persistence would be crucial to attack this problem.

  • Generalization to manifolds: While the vague convergence of the rescaled diagrams is already proven in [goel2018strong], allowing test functions without compact support seems to be a challenge. Once again, the crucial issue consists in controlling the total number of points in the diagrams.

  • Dimension estimation: We have proved that the total persistence of a diagram built on a given point cloud depends crucially on the intrinsic dimension of such a point cloud. Inferring the dependence of the total persistence with respect to the size of the point cloud (through subsampling) leads to estimators of this intrinsic dimension. Studying the properties of such estimators is the topic of an on-going work of Henry Adams and co-authors (personal communication).
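The last item can be sketched as a simple regression. Assuming that the total persistence of order `alpha` of a subsample of size `n` scales like a constant times `n ** (1 - alpha / d)`, the slope of its logarithm against `log(n)` is `1 - alpha / d`, which can be inverted to recover `d`. The function `estimate_dimension` below is a hypothetical illustration of this idea, not the estimator studied in the on-going work mentioned above; in practice the `totals` would be computed from diagrams of subsamples.

```python
import math
from typing import Sequence


def estimate_dimension(sizes: Sequence[int], totals: Sequence[float],
                       alpha: float) -> float:
    """Least-squares slope of log(total persistence) against log(sample size),
    inverted through slope = 1 - alpha / d."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in totals]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope >= 1.0:
        raise ValueError("slope must be below 1 for the scaling to be invertible")
    return alpha / (1.0 - slope)


# Synthetic check (assumption): totals following the power law exactly, with d = 3.
sizes = [100, 200, 400, 800]
totals = [5.0 * n ** (1.0 - 1.0 / 3.0) for n in sizes]
dimension = estimate_dimension(sizes, totals, alpha=1.0)
```

On noiseless synthetic input the sketch recovers the dimension used to generate the data; on real diagrams the quality of the estimate would depend on how well the power-law scaling holds at the sample sizes considered.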

6. Proofs

6.1. Proof of Theorem 3.1

We only treat the case , the proof being easily adapted to the case . Introduce for two measures of mass on , the Monge-Kantorovitch distance between and :

(6.1)

Fix two persistence diagrams and . Denote (resp. ) the measure having density with respect to (resp. ). For a matching attaining the -th Wasserstein distance between and , denote . We have

(6.2)

We bound the two terms in the sum separately. Let us first bound . The Monge-Kantorovitch distance is also the infimum of the costs of transport plans between and (see [villani2008optimal, Chapter 2] for details), so that

Define such that . As condition (3.1) implies that , the distance is bounded by

(6.3)

We now treat the first part of the sum in (6.2). For , in with , define the path with unit speed by

so that it satisfies . The quantity is bounded by

For and , using the convexity of , it is easy to see that . Define , and . We have,