Large Scale computation of Means and Clusters for Persistence Diagrams using Optimal Transport

05/22/2018 ∙ by Théo Lacombe, et al. ∙ Google Inria 0

Persistence diagrams (PDs) are now routinely used to summarize the underlying topology of sophisticated data encountered in challenging learning problems. Despite several appealing properties, integrating PDs in learning pipelines can be challenging because their natural geometry is not Hilbertian. In particular, algorithms to average a family of PDs have only been considered recently and are known to be computationally prohibitive. We propose in this article a tractable framework to carry out fundamental tasks on PDs, namely evaluating distances, computing barycenters and carrying out clustering. This framework builds upon a formulation of PD metrics as optimal transport (OT) problems, for which recent computational advances, in particular entropic regularization and its convolutional formulation on regular grids, can all be leveraged to provide efficient and (GPU) scalable computations. We demonstrate the efficiency of our approach by carrying out clustering on PDs at scales never seen before in the literature.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Topological data analysis (TDA) has been used successfully in a wide array of applications, for instance in medical (Nicolau et al., 2011) or material (Hiraoka et al., 2016)

sciences, computer vision 

(Li et al., 2014)

or to classify NBA players 

(Lum et al., 2013). The goal of TDA is to exploit and account for the complex topology (connectivity, loops, holes, etc.) seen in modern data. The tools developed in TDA are built upon persistent homology theory (Edelsbrunner et al., 2000; Zomorodian & Carlsson, 2005; Edelsbrunner & Harer, 2010) whose main output is a descriptor called a persistence diagram (PD) which encodes in a compact form—roughly speaking, a point cloud in the upper triangle of the square —the topology of a given space or object at all scales.

Statistics on PDs. Persistence diagrams have appealing properties: in particular they have been shown to be stable with respect to perturbations of the input data (Cohen-Steiner et al., 2007; Chazal et al., 2009, 2014). This stability is measured either in the so called bottleneck metric or in the

-th diagram distance, which are both distances that compute optimal partial matchings. While theoretically motivated and intuitive, these metrics are by definition very costly to compute. Furthermore, these metrics are not Hilbertian, preventing a faithful application of a large class of standard machine learning tools (

-means, PCA, SVM) on PDs.

Related work.

To circumvent the non-Hilbertian nature of the space of PDs, one can of course map diagrams onto simple feature vectors. Such features can be either finite dimensional 

(Carrière et al., 2015; Adams et al., 2017), or infinite through kernel functions (Reininghaus et al., 2015; Bubenik, 2015; Carrière et al., 2017). A known drawback of kernel approaches on a rich geometric space such as that formed by PDs is that once PDs are mapped as feature vectors, any ensuing analysis remains in the space of such features (the “inverse image” problem inherent to kernelization). They are therefore not helpful to carry out simple tasks in the space of PDs, such as that of averaging PDs, namely computing the Fréchet mean of a family of PDs. Such problems call for algorithms that are able to optimize directly in the space of PDs, and were first addressed by Mileyko et al. (2011) and Turner (2013). Turner et al. (2014) provides an algorithm that converges to a local minimum of the Fréchet function by successive iterations of the Hungarian algorithm. However, the Hungarian algorithm does not scale well with the size of diagrams, and non-convexity yields potentially convergence to bad local minima.

Contributions. We reformulate the computation of diagram metrics as an optimal transport (OT) problem, opening several perspectives, among them the ability to benefit from entropic regularization (Cuturi, 2013). We provide a new numerical scheme to bound OT metrics, and therefore diagram metrics, with additive guarantees. Unlike previous approximations of diagram metrics, ours can be parallelized and implemented efficiently on GPUs. These approximations are also differentiable, leading to a scalable method to compute barycenters of persistence diagrams. In exchange for a discretized approximation of PDs, we recover a convex problem, unlike previous formulations of the barycenter problem for PDs. We demonstrate the scalability of these two advances (accurate approximation of the diagram metric at scale and barycenter computation) by providing the first tractable implementation of the -means algorithm in the space of PDs.

Figure 1: Illustration of differences between Fréchet means with Wasserstein and Euclidean geometry. The top row represents input data, namely persistence diagrams (left), discretization of PDs as histograms (middle), and vectorization of PDs as persistence images in (right) (Adams et al., 2017). The bottom row represents the estimated barycenters (orange scale) with input data (shaded), using the approach of Turner et al. (2014) (left), our optimal tranport based approach (middle) and the arithmetic mean of persistence images (right).

Notations for matrix and vector manipulations. When applied to matrices or vectors, operators , division, are always meant element-wise. denotes element-wise multiplication (Hadamard product) while denotes the matrix-vector product of and .

2 Background on OT and TDA

OT.

Optimal transport is now widely seen as a central tool to compare probability measures 

(Villani, 2003, 2008; Santambrogio, 2015). Given a space endowed with a cost function , we consider two discrete measures and on , namely measures that can be written as positive combinations of diracs, with weight vectors satisfying and all in . The cost matrix and the transportation polytope define an optimal transport problem whose optimum

can be computed using either of two linear programs, dual to each other,

(1)

where is the Frobenius dot product and is the set of pairs of vectors in

such that their tensor sum

is smaller than , namely . Note that when and all weights and are uniform and equal, the problem above reduces to the computation of an optimal matching, that is a permutation (with a resulting optimal plan taking the form ). That problem has clear connections with diagram distances, as shown in §3.

Entropic Regularization. Solving the optimal transport problem is intractable for large data. Cuturi proposes to consider a regularized formulation of that problem using entropy, namely:

(2)
(3)

where and . Because the negentropy is 1-strongly convex, that problem has a unique solution which takes the form, using first order conditions,

(4)

where (term-wise exponentiation), and is a fixed point of the Sinkhorn map (term-wise divisions):

(5)

Note that this fixed point is the limit of any sequence , yielding a straightforward algorithm to estimate . Cuturi considers the transport cost of the optimal regularized plan, , to define a Sinkhorn divergence between (here is the term-wise multiplication). One has that as , and more precisely converges to the optimal transport plan solution of (1) with maximal entropy. That approximation can be readily applied to any problem that involves terms in , notably barycenters (Cuturi & Doucet, 2014; Solomon et al., 2015; Benamou et al., 2015).

Eulerian setting. When the set is finite with cardinality , and are entirely characterized by their probability weights and are often called histograms in a Eulerian setting. When is not discrete, as when considering the plane , we therefore have a choice of representing measures as sums of diracs, encoding their information through locations, or as discretized histograms on a planar grid of arbitrary granularity. Because the latter setting is more effective for entropic regularization (Solomon et al., 2015), this is the approach we will favor in our computations.

Figure 2: Sketch of persistent homology. and so that sublevel sets of are unions of balls centered at the points of . First (resp second) coordinate of points in the persistence diagram encodes appearance scale (resp disappearance) of cavities in the sublevel sets of . The isolated red point accounts for the presence of a persistent hole in the sublevel sets, inferring the underlying spherical geometry of the input point cloud.

Persistent homology and Persistence Diagrams. Given a topological space and a real-valued function , persistent homology provides—under mild assumptions on , taken for granted in the remaining of this article—a topological signature of built on its sublevel sets , and called a persistence diagram (PD), denoted as . In practice, it is of the form , namely a point measure with finite support included in . Each point in can be understood as a topological feature (connected component, loop, hole…) which appears at scale and disappears at scale in the sublevel sets of . Comparing the persistence diagrams of two functions measures their difference from a topological perspective: presence of some topological features, difference in appearance scales, etc. The space of PDs is naturally endowed with a partial matching metric defined as ():

(6)

where is the set of all partial matchings between points in and points in and denotes the orthogonal projection of an (unmatched) point to the diagonal . The mathematics of OT and diagram distances share a key idea, that of matching, but differ on an important aspect: diagram metrics can cope, using the diagonal as a sink, with measures that have a varying total number of points. We solve this gap by leveraging an unbalanced formulation for OT.

3 Fast estimation of diagram metrics using Optimal Transport

In the following, we start by explicitly formulating (6) as an optimal transport problem. Entropic smoothing provides us a way to approximate (6) with controllable error. In order to benefit mostly from that regularization (matrix parallel execution, convolution, GPU—as showcased in (Solomon et al., 2015)), implementation requires specific attention, as described in propositions 2, 3, 4.

PD metrics as Optimal Transport. The main differences between (6) and (1) are that PDs do not generally have the same mass, i.e. number of points (counted with multiplicity), and that the diagonal plays a special role by allowing to match any point in a given diagram with its orthogonal projection onto the diagonal. Guittet’s formulation for partial transport (2002) can be used to account for this by creating a “sink” bin corresponding to that diagonal and allowing for different total masses. The idea of representing the diagonal as a single point already appears in the bipartite graph problem of Edelsbrunner & Harer (2010) (Ch.VIII). The important aspect of the following proposition is the clarification of the partial matching problem (6) as a standard OT problem (1).

Let be extended with a unique virtual point encoding the diagonal. We introduce the linear operator which, to a finite non-negative measure supported on , associates a dirac on with mass equal to the total mass of , namely .

Proposition 1.

Let and be two persistence diagrams with respectively points and points . Let . Then:

(7)

where is the cost matrix with block structure

(8)

where , for .

The proof seamlessly relies on the fact that, when transporting point measures with the same mass (number of points counted with multiplicity), the optimal transport problem is equivalent to an optimal matching problem (see §2). Details are left to the supplementary material.

Entropic approximation of diagram distances. Following the correspondance established in Proposition 1, entropic regularization can be used to approximate the diagram distance . Given two persistence diagrams with respective masses and , let , , , and where is the output after iterations of the Sinkhorn map (5). Adapting the bounds provided by Altschuler et al. (2017), we can bound the error of approximating by :

(9)

where (that is, error on marginals).

Dvurechensky et al. (2018) prove that iterating the Sinkhorn map (5) gives a plan satisfying in iterations. Given (9), a natural choice is thus to take for a desired precision , which lead to a total of iterations in the Sinkhorn loop. These results can be used to pre-tune parameters and to control the approximation error due to smoothing. However, these are worst-case bounds, controlled by max-norms, and are often too pessimistic in practice. To overcome this phenomenon, we propose on-the-fly error control, using approximate solutions to the smoothed primal (2) and dual (3) optimal transport problems, which provide upper and lower bounds on the optimal transport cost.

Upper and Lower Bounds. The Sinkhorn algorithm, after at least one iteration (), produces feasible dual variables (see below (1) for a definition). Their objective value, as measured by , performs poorly as a lower bound of the true optimal transport cost (see Fig. 3 and §5 below) in most of our experiments. To improve on this, we compute the so called -transform of (Santambrogio, 2015, §1.6), defined as:

Applying a -transform on , we recover two vectors . One can show that for any feasible , we have that (Peyré & Cuturi, 2018, Prop 3.1)

When ’s top-left block is the squared Euclidean metric, this problem can be cast as that of computing the Moreau envelope of . In a Eulerian setting and when is a finite regular grid which we will consider, we can use either the linear-time Legendre transform or the Parabolic Envelope algorithm (Lucet, 2010, §2.2.1,§2.2.2) to compute the -transform in linear time with respect to the grid resolution .

Unlike dual iterations, the primal iterate does not belong to the transport polytope after a finite number of iterations. We use the rounding_to_feasible algorithm provided by Altschuler et al. (2017) to compute efficiently a feasible approximation of that does belong to . Putting these two elements together, we obtain

(10)

Therefore, after iterating the Sinkhorn map (5) times, we have that if is below a certain criterion , then we can guarantee that is a fortiori an -approximation of . Observe that one can also have a relative error control: if one has , then . Note that might be negative but can always be replaced by since we know has non-negative entries (and therefore ), while is always non-negative.

Figure 3: (Left) (red) and (green) as a function of , the number of iterations of the Sinkhorn map ( ranges from to , with fixed ). (Middle) Final (red) and (green) provided by Alg.1, computed for decreasing s, ranging from to . For each value of , Sinkhorn loop is run until . Note that the -axis is flipped. (Right) Influence of -transform for the Sinkhorn dual cost. (orange) The dual cost , where are Sinkhorn dual variables (before the -transform). (green) Dual cost after -transform, i.e. with . Experiment run with and iterations.

Discretization. For simplicity, we assume in the remaining that our diagrams have their support in . From a numerical perspective, encoding persistence diagrams as histograms on the square offers numerous advantages. Given a uniform grid of size on , we associate to a given diagram a matrix-shaped histogram such that is the number of points in belonging to the cell located at position in the grid (we transition to bold-faced small letters to insist on the fact that these histograms must be stored as square matrices). To account for the total mass, we add an extra dimension encoding mass on . We extend the operator to histograms, associating to a histogram its total mass on the -th coordinate. One can show that the approximation error resulting from that discretization is bounded above by (see the supplementary material).

Convolutions. In the Eulerian setting, where diagrams are matrix-shaped histograms of size , the cost matrix has size . Since we will use large values of to have low discretization error (typically ), instantiating is usually intractable. However, Solomon et al. (2015) showed that for regular grids endowed with a separable cost, each Sinkhorn iteration (as well as other key operations such as evaluating Sinkhorn’s divergence ) can be performed using Gaussian convolutions, which amounts to performing matrix multiplications of size , without having to manipulate matrices. Our framework is slightly different due to the extra dimension , but we show that equivalent computational properties hold. This observation is crucial from a numerical perspective. Our ultimate goal being to efficiently evaluate (11), (12) and (14), we provide implementation details.

Let be a pair where is a matrix-shaped histogram and is a real number accounting for the mass located on the virtual point . We denote by the column vector obtained when reshaping . The cost matrix and corresponding kernel are given by

where , . and as defined above will never be instantiated, because we can rely instead on defined as and .

Proposition 2 (Iteration of Sinkhorn map).

The application of to can be performed as:

(11)

where denotes the Froebenius dot-product in .

We now introduce and ( denotes term-wise multiplication).

Proposition 3 (Computation of ).

Let . The transport cost of can be computed as:

(12)

where the first term can be computed as:

(13)

Finally, consider two histograms , let be the rounded matrix of (see the supplementary material or (Altschuler et al., 2017)). Let denote the first and second marginal of respectively. We introduce (using term-wise min and divisions):

along with and the marginal errors:

Proposition 4 (Computation of upper bound ).

The transport cost induced by can be computed as:

(14)

Note that the first term can be computed using (12)

Parallelization and GPU. Using a Eulerian representation is particularly beneficial when applying Sinkhorn’s algorithm, as shown by Cuturi (2013). Indeed, the Sinkhorn map (5) only involves matrix-vector operations. When dealing with a large number of histograms, concatenating these histograms and running Sinkhorn’s iterations in parallel as matrix-matrix product results in significant speedup that can exploit GPGPU to compare a large number of pairs simultaneously. This makes our approach especially well-suited for large sets of persistence diagrams.

  Input: Pairs of PDs , smoothing parameter , grid step , stopping criterion, initial .
  Output: Approximation of all , upper and lower bounds if wanted.
  init Cast as histograms , on a grid
  while stopping criterion not reached do
     Iterate in parallel (5) using (11)
  end while
  Compute all using (12)
  if Want a upper bound then
     Compute in parallel using (14)
  end if
  if Want a lower bound then
     Compute using (Lucet, 2010)
  end if
Algorithm 1 Sinkhorn divergence for persistence diagrams

We can now estimate distances between persistence diagrams with Alg. 1 in parallel by performing only -sized matrix multiplications, leading to a computational scaling in where is the grid resolution parameter. Note that a standard stopping threshold in Sinkhorn iteration process is to check the error to marginals , as motivated by (9).

4 Smoothed barycenters for persistence diagrams

OT formulation for barycenters. We show in this section that the benefits of entropic regularization also apply to the computation of barycenters of PDs. As the space of PD is not Hilbertian but only a metric space, the natural definition of barycenters is to formulate them as Fréchet means for the metric, as first introduced (for PDs) in (Mileyko et al., 2011).

Definition.

Given a set of persistence diagrams , a barycenter of is any solution of the following minimization problem:

(15)

where is defined as in (8) with (but our approach adapts easily to any finite ), and denotes the set of non-negative finite measures supported on . is the energy of .

Let denotes the restriction of to the space of persistence diagrams (finite point measures). Turner et al. (2014) proved the existence of minimizers of and proposed an algorithm that converges to a local minimum of the functional, using the Hungarian algorithm as a subroutine. Their algorithm will be referred to as the B-Munkres Algorithm. The non-convexity of can be a real limitation in practice since can have arbitrarily bad local minima (see Lemma 1 in the supplementary material). Note that minimizing instead of will not give strictly better minimizers (see Proposition 6 in the supplementary material). We then apply entropic smoothing to this problem. This relaxation offers differentiability and circumvents both non-convexity and numerical scalability.

Figure 4: Barycenter estimation for different s with a simple set of PDs (red, blue and green). The smaller the , the better the estimation ( decreases, note the -axis is flipped on the right plot), at the cost of more iterations in Alg. 2. The mass appearing along the diagonal is a consequence of entropic smoothing: it does not cost much to delete while it increases the entropy of transport plans.

Entropic smoothing for PD barycenters. In addition to numerical efficiency, an advantage of smoothed optimal transport is that is differentiable. In the Eulerian setting, its gradient is given by centering the vector where is a fixed point of the Sinkhorn map (5), see (Cuturi & Doucet, 2014). This result can be adapted to our framework, namely:

Proposition 5.

Let be PDs, and the corresponding histograms on a grid. The gradient of the functional is given by

(16)

where denotes the adjoint operator and is a fixed point of the Sinkhorn map obtained while transporting onto .

  Input: PDs , learning rate , smoothing parameter , grid step .
  Output: Estimated barycenter
  Init: uniform measure above the diagonal.
  Cast each as an histogram on a grid
  while  changes do
     Iterate defined in (5) in parallel between all the pairs and , using (11).
     
     
  end while
  if Want energy then
     Compute using (12)
  end if
  Return
Algorithm 2 Smoothed approximation of PD barycenter

As in (Cuturi & Doucet, 2014), this result follows from the envelope theorem, with the added subtlety that appears in both terms depending on and . This formula can be exploited to compute barycenters via gradient descent, yielding Algorithm 2. Following (Cuturi & Doucet, 2014, §4.2), we used a multiplicative update. This is a particular case of mirror descent (Beck & Teboulle, 2003) and is equivalent to a (Bregman) projected gradient descent on the positive orthant, retaining positive coefficients throughout iterations.

As it can be seen in Fig. 4, the barycentric persistence diagrams are smeared. If one wishes to recover more spiked diagrams, quantization and/or entropic sharpening (Solomon et al., 2015, §6.1) can be applied, as well as smaller values for that impact computational speed or numerical stability. We will consider these extensions in future work.

A comparison with linear representations. When doing statistical analysis with PDs, a standard approach is to transform a diagram into a finite dimensional vector—in a stable way—and then perform statistical analysis and learning with an Euclidean structure. This approach does not preserve the Wasserstein-like geometry of the diagram space and thus loses the algebraic interpretability of PDs. Fig. 1 gives a qualitative illustration of the difference between Wasserstein barycenters (Fréchet mean) of PDs and Euclidean barycenters (linear means) of persistence images (Adams et al., 2017), a commonly used vectorization for PDs (Makarenko et al., 2016; Zeppelzauer et al., 2016; Obayashi et al., 2018).

5 Experiments

All experiments are run with , but would work with any finite . This choice is consistent with the work of Turner et al. (2014) for barycenter estimation.

Figure 5: Comparison of scalings of Hera and Sinkhorn (Alg. 1) as the number of points in diagram increases. - scale.

A large scale approximation. Iterations of Sinkhorn map (5) yield a transport cost whose value converges to the true transport cost as and the number of iterations (Cuturi, 2013). We quantify in Fig. 3 this convergence experimentally using the upper and lower bounds provided in (10) through and for decreasing . We consider a set of pairs of diagrams randomly generated with to points in each diagrams, and discretized on a grid. We run Alg. 1 for different ranging from to along with corresponding upper and lower bounds described in (10). For each pair of diagrams, we center our estimates by removing the true distance, so that the target cost is across all pairs. We plot median, top % and bottom % percentiles for both bounds. Using the -transform provides a much better lower bound in our experiments. This is however inefficient in practice: despite a theoretical complexity linear in the grid size, the sequential structure of the algorithms described in (Lucet, 2010) makes them unsuited for GPGPU to our knowledge.

We then compare the scalability of Alg. 1 with respect to the number of points in diagrams with that of Kerber et al. (2017) which provides a state-of-the-art algorithm with publicly available code—referred to as Hera—to estimate distances between diagrams. For both algorithms, we compute the average time to estimate a distance between two random diagrams having from to points where ranges from to . In order to compare their scalability, we plot in Fig. 5 the ratio of both algorithms, with in Alg. 1.

Fast barycenters and -means on large PD sets. We compare our Alg. 2 (referred to as Sinkhorn) to the combinatorial algorithm of Turner et al. (2014) (referred to as B-Munkres). We use the script munkres.py provided on the website of K.Turner for their implementation. We record in Fig. 6 running times of both algorithms on a set of diagrams having from to points, ranging from to , on Intel Xeon 2.3 GHz (CPU) and P100 (GPU, Sinkhorn only). When running Alg. 2, the gradient descent is performed until , with and . Our experiment shows that Alg. 2 drastically outperforms B-Munkres as the number of points increases. We interrupt B-Munkres at , after which computational time becomes an issue.

Figure 6: Average running times for B-Munkres (blue) and Sinkhorn (red) algorithms (log-log scale) to average 10 PDs.

Aside the computational efficiency, we highlight the benefits of operating with a convex formulation in Fig. 7. Due to non-convexity, the B-Munkres algorithm is only guaranteed to converge to a local minima, and its output depends on initialization. We illustrate on a toy set of diagrams how our algorithm avoids local minima thanks to the Eulerian approach we take.

Figure 7: Qualitative comparison of B-Munkres and our Alg 2. Input set of diagrams with points each. Output of B-Munkres when initialized on the blue diagram (orange squares) and input data (grey scale). Output of B-Munkres initialized on the green diagram. Output of Alg. 2 on a grid, , learning-rate , Sinkhorn stopping criterion (distance to marginals): , gradient descent performed until .—As one can see, localization of masses is similar. Initialization of B-Munkres is made on one of the input diagram as indicated in (Turner et al., 2014, Alg. 1), and leads to convergence to different local minima. Our convex approach (Alg. 2) performs better (lower energy). As a baseline, the energy of the naive arithmetic mean of the three diagrams is .

We now merge Alg. 1 and Alg. 2 in order to perform unsupervised clustering via -means on PDs. We work with the 3D-shape database provided by Sumner & Popović and generate diagrams in the same way as in (Carrière et al., 2015), working in practice with diagrams with to points each. The database contains classes: camel, cat, elephant, horse, head and face. In practice, this unsupervised clustering algorithm detects two main clusters: faces and heads on one hand, camels and horses on the other hand are systematically grouped together. Fig. 8 illustrates the convergence of our algorithm and the computed centroids for the aforementioned clusters.

Figure 8: Illustration of our -means algorithm. From left to right: diagrams extracted from horses and camels plot together (one color for each diagram); the centroid they are matched with provided by our algorithm; 20 diagrams of head and faces; along with their centroid; decrease of the objective function. Running time depends on many parameters along with the random initialization of -means. As an order of magnitude, it takes from to minutes with this PD dataset on a P100 GPU.

6 Conclusion

In this work, we took advantage of a link between PD metrics and optimal transport to leverage and adapt entropic regularization for persistence diagrams. Our approach relies on matrix manipulations rather than combinatorial computations, providing parallelization and efficient use of GPUs. We provide bounds to control approximation errors. We use these differentiable approximations to estimate barycenters of PDs significantly faster than existing algorithm, and showcase their application by clustering thousand diagrams built from real data. We believe this first step will open the way for new statistical tools for TDA and ambitious data analysis applications of persistence diagrams.

Acknowledgments.

We thank the anonymous reviewers for the fruitful discussion. TL was supported by the AMX, École polytechnique. MC acknowledges the support of a Chaire d’Excellence de l’Idex Paris-Saclay.

References

  • Adams et al. (2017) Adams, H., Emerson, T., Kirby, M., Neville, R., Peterson, C., Shipman, P., Chepushtanova, S., Hanson, E., Motta, F., and Ziegelmeier, L. Persistence images: a stable vector representation of persistent homology. Journal of Machine Learning Research, 18(8):1–35, 2017.
  • Agueh & Carlier (2011) Agueh, M. and Carlier, G. Barycenters in the wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.
  • Altschuler et al. (2017) Altschuler, J., Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via sinkhorn iteration. In Advances in Neural Information Processing Systems, pp. 1961–1971, 2017.
  • Anderes et al. (2016) Anderes, E., Borgwardt, S., and Miller, J. Discrete wasserstein barycenters: optimal transport for discrete data. Mathematical Methods of Operations Research, 84(2):389–409, 2016.
  • Beck & Teboulle (2003) Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
  • Benamou et al. (2015) Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., and Peyré, G. Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
  • Bubenik (2015) Bubenik, P. Statistical topological data analysis using persistence landscapes. The Journal of Machine Learning Research, 16(1):77–102, 2015.
  • Carlier et al. (2015) Carlier, G., Oberman, A., and Oudet, E. Numerical methods for matching for teams and wasserstein barycenters. ESAIM: Mathematical Modelling and Numerical Analysis, 49(6):1621–1642, 2015.
  • Carrière et al. (2015) Carrière, M., Oudot, S. Y., and Ovsjanikov, M. Stable topological signatures for points on 3d shapes. In Computer Graphics Forum, volume 34, pp. 1–12. Wiley Online Library, 2015.
  • Carrière et al. (2017) Carrière, M., Cuturi, M., and Oudot, S. Sliced wasserstein kernel for persistence diagrams. In 34th International Conference on Machine Learning, 2017.
  • Chazal et al. (2009) Chazal, F., Cohen-Steiner, D., Glisse, M., Guibas, L. J., and Oudot, S. Y. Proximity of persistence modules and their diagrams. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pp. 237–246. ACM, 2009.
  • Chazal et al. (2014) Chazal, F., De Silva, V., and Oudot, S. Persistence stability for geometric complexes. Geometriae Dedicata, 173(1):193–214, 2014.
  • Cohen-Steiner et al. (2007) Cohen-Steiner, D., Edelsbrunner, H., and Harer, J. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.
  • Cuturi (2013) Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pp. 2292–2300, 2013.
  • Cuturi & Doucet (2014) Cuturi, M. and Doucet, A. Fast computation of wasserstein barycenters. In International Conference on Machine Learning, pp. 685–693, 2014.
  • Dvurechensky et al. (2018) Dvurechensky, P., Gasnikov, A., and Kroshnin, A. Computational optimal transport: Complexity by accelerated gradient descent is better than by sinkhorn’s algorithm. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1367–1376, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/dvurechensky18a.html.
  • Edelsbrunner & Harer (2010) Edelsbrunner, H. and Harer, J. Computational topology: an introduction. American Mathematical Soc., 2010.
  • Edelsbrunner et al. (2000) Edelsbrunner, H., Letscher, D., and Zomorodian, A. Topological persistence and simplification. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pp. 454–463. IEEE, 2000.
  • Fréchet (1948) Fréchet, M. Les éléments aléatoires de nature quelconque dans un espace distancié. In Annales de l’institut Henri Poincaré, volume 10, pp. 215–310. Presses universitaires de France, 1948.
  • Guittet (2002) Guittet, K. Extended Kantorovich norms: a tool for optimization. PhD thesis, INRIA, 2002.
  • Hiraoka et al. (2016) Hiraoka, Y., Nakamura, T., Hirata, A., Escolar, E. G., Matsue, K., and Nishiura, Y. Hierarchical structures of amorphous solids characterized by persistent homology. Proceedings of the National Academy of Sciences, 113(26):7035–7040, 2016.
  • Kerber et al. (2017) Kerber, M., Morozov, D., and Nigmetov, A. Geometry helps to compare persistence diagrams. Journal of Experimental Algorithmics (JEA), 22(1):1–4, 2017.
  • Li et al. (2014) Li, C., Ovsjanikov, M., and Chazal, F. Persistence-based structural recognition. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , pp. 1995–2002, 2014.
  • Lucet (2010) Lucet, Y. What shape is your conjugate? a survey of computational convex analysis and its applications. SIAM review, 52(3):505–542, 2010.
  • Lum et al. (2013) Lum, P., Singh, G., Lehman, A., Ishkanov, T., Vejdemo-Johansson, M., Alagappan, M., Carlsson, J., and Carlsson, G. Extracting insights from the shape of complex data using topology. Scientific reports, 3:1236, 2013.
  • Makarenko et al. (2016) Makarenko, N., Kalimoldayev, M., Pak, I., and Yessenaliyeva, A. Texture recognition by the methods of topological data analysis. Open Engineering, 6(1), 2016.
  • Mileyko et al. (2011) Mileyko, Y., Mukherjee, S., and Harer, J. Probability measures on the space of persistence diagrams. Inverse Problems, 27(12):124007, 2011.
  • Moreau (1965) Moreau, J.-J. Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France, 93(2):273–299, 1965.
  • Nicolau et al. (2011) Nicolau, M., Levine, A. J., and Carlsson, G. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences, 108(17):7265–7270, 2011.
  • Obayashi et al. (2018) Obayashi, I., Hiraoka, Y., and Kimura, M. Persistence diagrams with linear machine learning models. Journal of Applied and Computational Topology, 1(3-4):421–449, 2018.
  • Peyré & Cuturi (2018) Peyré, G. and Cuturi, M. Computational Optimal Transport, 2018. URL http://arxiv.org/abs/1803.00567.
  • Reininghaus et al. (2015) Reininghaus, J., Huber, S., Bauer, U., and Kwitt, R. A stable multi-scale kernel for topological machine learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4741–4748, 2015.
  • Santambrogio (2015) Santambrogio, F. Optimal transport for applied mathematicians. Birkäuser, NY, 2015.
  • Schrijver (1998) Schrijver, A. Theory of linear and integer programming. John Wiley & Sons, 1998.
  • Solomon et al. (2015) Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., and Guibas, L. Convolutional wasserstein distances: Efficient optimal transportation on geometric domains. ACM Transactions on Graphics (TOG), 34(4):66, 2015.
  • Sumner & Popović (2004) Sumner, R. W. and Popović, J. Deformation transfer for triangle meshes. In ACM Transactions on Graphics (TOG), volume 23, pp. 399–405. ACM, 2004.
  • Turner (2013) Turner, K. Means and medians of sets of persistence diagrams. arXiv preprint arXiv:1307.8300, 2013.
  • Turner et al. (2014) Turner, K., Mileyko, Y., Mukherjee, S., and Harer, J. Fréchet means for distributions of persistence diagrams. Discrete & Computational Geometry, 52(1):44–70, 2014.
  • Villani (2003) Villani, C. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.
  • Villani (2008) Villani, C. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
  • Zeppelzauer et al. (2016) Zeppelzauer, M., Zieliński, B., Juda, M., and Seidl, M. Topological descriptors for 3d surface analysis. In International Workshop on Computational Topology in Image Context, pp. 77–87. Springer, 2016.
  • Zomorodian & Carlsson (2005) Zomorodian, A. and Carlsson, G. Computing persistent homology. Discrete & Computational Geometry, 33(2):249–274, 2005.

7 Supplementary material

7.1 Omitted proofs from Section 3

Diagram metrics as optimal transport:

We recall that we consider and two persistence diagrams with respectively points and points , , and is the cost matrix with block structure

Proof of Prop. 1.

Let and . Since are point measures, that is discrete measures of same mass with integer weights at each point of their support, finding is an assignment problem of size as introduced in §2. It is equivalent to finding an optimal matching representing some permutation for the cost matrix built from by repeating the last line in total times, the last column in total times, and replacing the lower right corner by a matrix of zeros. The optimal defines a partial matching between and , defined by iff , . Such pairs of points induce a cost , while other points (referred to as unmatched) induce a cost . Then:

Error control due to discretization:

Let be two diagrams and their respective representations as histograms. For two histograms, where are diagrams deduced from respectively by moving any mass located at to , inducing at most an error of for each point. We identify and in the following. We recall that is a distance over persistence diagrams and thus satisfy triangle inequality, leading to:

Thus, the error made is upper bounded by .

Propositions 2, 3, 4:

We keep the same notations as in the core of the article and give details regarding the iteration schemes provided in the paper.

Proof of prop 2.

Given an histogram and a mass , one can observe that (see below):

(17)

In particular, the operation can be perform by only manipulating matrices in . Indeed, observe that:

so we have:

Thus we have in our case:

where designs the Froebenius dot product between two histograms . Note that these computations only involves manipulation of matrices with size . ∎

Proof of prop 3.

Thus, we finally have:

And finally, taking the bin into considerations,

Remark: First term correspond to the cost of effective mapping (point to point) and the two others to the mass mapped to the diagonal. ∎

To address the last proof, we recall below the rounding_to_feasible algorithm introduced by Altschuler et al.; and denotes respectively the first and second marginal of a matrix .

1:  Input: , desired marginals .
2:  Output: close to .
3:  
4:  
5:  
6:  
7:  
8:  return
Algorithm 3 Rounding algorithm of Altschuler et al. (2017)
Proof of prop 4.

By straightforward computations, the first and second marginals of are given by:

Observe that and can be computed using Proposition 2.

Now, the transport cost computation is:

The first term is the transport cost induced by a rescaling of and can be computed with Prop 3. Consider now the second term. Without considering the additional bin , we have:

so when we consider our framework (with ), it comes:

Putting things together finally proves the claim. ∎

7.2 Omitted proofs from Section 4

We first observe that does not have local minimum (while does). For , we extend the Euclidean norm by the distance from to its orthogonal projection onto the diagonal . In particular, . We denote by the corresponding cost function (continuous analogue of the matrix defined in (8)).111Optimal transport between non-discrete measures was not introduced in the core of this article for the sake of concision. It is a natural extension of notions introduced in §2 (distances, primal and dual problems, barycenters). We refer the reader to (Santambrogio, 2015; Villani, 2008) for more details.

Proposition (Convexity of ).

For any two measures and , we have:

(18)
Proof.

We denote by the dual variables involved when computing the optimal transport plan between and . Note that maximum are taken over the set (with ):