1 Introduction11footnotetext: Computer Science Department, University College London, WC1E 6BT London, United Kingdom22footnotetext:
Computational Statistics and Machine Learning - Istituto Italiano di Tecnologia, 16100 Genova, Italy33footnotetext: Electrial and Electronics Engineering Department, Imperial College London, SW7 2BT, United Kingdom.
Aggregating and summarizing collections of probability measures is a key task in several machine learning scenarios. Depending on the metric adopted, the properties of the resulting average (or barycenter) of a family of probability measures vary significantly. By design, optimal transport metrics are better suited at capturing the geometry of the distribution than Euclidean distance or other -divergence [cuturi14]. In particular, Wasserstein barycenters have been successfully used in settings such as texture mixing [rabin2011wasserstein]srivastava2018scalable], imaging [gramfort2015fast], or model ensemble [dognin2019wasserstein].
The notion of barycenter in Wasserstein space was first introduced by [AguehC11] and then investigated from the computational perspective for the original Wasserstein distance [staib2017parallel, stochwassbary] as well as its entropic regularizations (e.g. Sinkhorn) [cuturi14, BenamouCCNP15, decentralized2018]. Two main challenges in this regard are: ) how to efficiently identify the support of the candidate barycenter and ) how to deal with continuous (or infinitely supported) probability measures. The first problem is typically addressed by either fixing the support of the barycenter a-priori [staib2017parallel, decentralized2018] or by adopting an alternating minimization procedure to iteratively optimize the support point locations and their weights [cuturi14, stochwassbary]. While fixed-support methods enjoy better theoretical guarantees, free-support algorithms are more memory efficient and practicable in high dimensional settings. The problem of dealing with continuous distributions has been mainly approached by adopting stochastic optimization methods to minimize the barycenter functional [stochwassbary, staib2017parallel, decentralized2018]
In this work we propose a novel method to compute the barycenter of a set of probability distributions with respect to the Sinkhorn divergence [genevay2018learning] that does not require to fix the support beforehand. We address both the cases of discrete and continuous probability measures. In contrast to previous free-support methods, our algorithm does not perform an alternate minimization between support and weights. Instead, we adopt a Frank-Wolfe (FW) procedure to populate the support by incrementally adding new points and updating their weights at each iteration, similarly to kernel herding strategies [bach2012equivalence] and conditional gradient for sparse inverse problem [bredies2013inverse, boyd2017alternating]. Upon completion of this paper, we found that an idea with similar flavor, concerning the application a Frank-Wolfe scheme in conjunction with Sinkhorn functionals has been very recently considered in distributional regression settings for the case of Sinkhorn negentropies . However, the analysis in this paper focuses on the theoretical properties of the proposed algorithm, specifically for the case of an inexact Frank-Wolfe procedure, which becomes critical in the case of continuous measures. In particular, we prove the convergence and rates of the proposed optimization method for both finitely and infinitely supported distribution settings. A central result in our analysis is the characterization of regularity properties of Sinkhorn potentials (i.e. the dual solutions of the Sinkhorn divergence problem), which extends recent work in [feydy2018interpolating, genevay2018sample] and which we consider of independent interest. We empirically evaluate the performance of the proposed algorithm.
The analysis of the proposed algorithm hinges on the following contributions: ) we show that the gradient of the Sinkhorn divergence is Lipschitz continuous on the space of probability measures with respect to the Total Variation. This grants us convergence of the barycenter algorithm in finite settings. ) We characterize the sample complexity of Sinkhorn potentials of two empirical distributions sampled from arbitrary probability measures. This latter result allows us to ) provide a concrete optimization scheme to approximately solve the barycenter problem for arbitrary probability measures with convergence guarantees. ) A byproduct of our analysis is the generalization of the FW algorithm to settings where the objective functional is defined only on a set with empty interior, which is the case for Sinkhorn divergence barycenter problem.
The rest of the paper is organized as follows: Sec. 2 reviews standard notions of optimal transport theory. Sec. 3 introduces the barycenter functional, and proves the Lipschitz continuity of its gradient. Sec. 4 describes the implementation of our algorithm and Sec. 5 studies its convergence rates. Finally, Sec. 6 evaluates the proposed methods empirically and Sec. 7 provides concluding remarks.
The aim of this section is to recall definitions and properties of Optimal Transport theory with entropic regularization. Throughout the work, we consider a compact set and a symmetric cost function . We set and denote by the space of probability measures on (positive Radon measures with mass ). For any , the Optimal Transport problem with entropic regularization is defined as follow [peyre2017computational, cuturi2013sinkhorn, genevay2016]
where is the Kullback-Leibler divergence between the candidate transport plan and the product distribution , and , with the projector onto the -th component and the push-forward operator. The case corresponds to the classic Optimal Transport problem introduced by Kantorovich [kantorovich1942transfer]. In particular, if for , then is the well-known -Wasserstein distance [villani2008optimal]. Let . Then, the dual problem of creftype 1, in the sense of Fenchel-Rockafellar, is [chizat2018scaling, feydy2018interpolating]
where denotes the space of real-valued continuous functions on , endowed with . Let . We denote by the map such that, for any ,
Pairs satisfying creftype 4 exist [knopp1968note] and are referred to as Sinkhorn potentials. They are unique - a.e. up to additive constant, i.e. is also a solution for any . In line with [genevay2018sample, feydy2018interpolating] it will be useful in the following to assume to be the Sinkhorn potentials such that: for an arbitrary anchor point and creftype 4 is satisfied pointwise on the entire domain . Then, is a fixed point of the map (analogously for ). This suggests a fixed point iteration approach to minimize creftype 2, yielding the well-known Sinkhorn-Knopp algorithm which has been shown to converge linearly in [sinkhorn1967, knopp1968note] . We recall a key result characterizing the differentiability of in terms of the Sinkhorn potentials that will be useful in the following.
Proposition 1 (Prop in [feydy2018interpolating]).
Let be such that,
Then, , the directional derivative of along is
where denotes the canonical pairing between the spaces and .
Note that is not a gradient in the standard sense. In particular note that the directional derivative in creftype 6 is not defined for any pair of signed measures, but only along feasible directions .
The fast convergence of Sinkhorn-Knopp algorithm makes (with ) preferable to from a computational perspective [cuturi2013sinkhorn]. However, when the entropic regularization introduces a bias in the optimal transport problem, since in general . To compensate for this bias, [genevay2018learning] introduced the Sinkhorn divergence
which was shown in [feydy2018interpolating] to be nonnegative, bi-convex and to metrize the convergence in law under mild assumptions. We characterize the gradient of for a fixed , which will be key to derive our optimization algorithm for computing Sinkhorn barycenters.
Let be the first component of (informally, the of the Sinkhorn potentials). As in Prop. 1, for any the gradient of is
with and the Sinkhorn potentials of and respectively.
3 Sinkhorn barycenters with Frank-Wolfe
Given and a set of weights such that , the main goal of this paper is to solve the following Sinkhorn barycenter problem
Although the objective functional is convex, its domain has empty interior in the space of finite signed measure . Hence standard notions of Fréchet or Gâteaux differentiability do not apply. This, in principle causes some difficulties in devising optimization methods. To circumvent this issue, in this work we adopt the Frank-Wolfe (FW) algorithm. Indeed, one key advantage of this method is that it is formulated in terms of directional derivatives along feasible directions (i.e., directions that locally remain inside the constraint set). Building upon [dem1967, dem1968, dunn1978conditional], which study the algorithm in Banach spaces, we show that the “weak” notion of directional differentiability of (and hence of ) in Remark 2 is sufficient to carry out the convergence analysis. While full details are provided in Appendix A, below we give an overview of the main result.
Frank-Wolfe in dual Banach spaces
Let be a real Banach space with topological dual and let be a nonempty, convex, closed and bounded set. For any denote by the set of feasible direction of at (namely with and ). Let be a convex function and assume that there exists a map (not necessarily unique) such that for every . In Alg. 1 we present a method to minimize G. The algorithm is structurally equivalent to the standard FW [dunn1978conditional, jaggi2013revisiting] and accounts for possible inaccuracies in solving the minimization in step . This will be key in Sec. 5 when studying the barycenter problem for with infinite support. The following result (see proof in Appendix A) shows that under the additional assumption that is Lipschitz-continuous and with sufficiently fast decay of the errors, the above procedure converges in value to the minimum of G with rate . Here denotes the diameter of with respect to the dual norm.
Under the assumptions above, suppose in addition that is -Lipschitz continuous with . Let be obtained according to Alg. 1. Then, for every integer ,
Frank-Wolfe Sinkhorn barycenters
Optimization domain. Let , with dual . The constraint set is convex, closed, and bounded.
Lipschitz continuity of the gradient.. This is the most critical condition and is addressed in the following theorem.
Thm. 4 is one of the main contributions of this paper. It can be rephrased by saying that the operator that maps a pair of distributions to their Sinkhorn potentials is Lipschitz continuous. This result is significantly deeper than the one given in [decentralized2018, Lemma 1], which establishes the Lipschitz continuity of the gradient in the semidiscrete case. The proof (given in Appendix D) relies on non-trivial tools from Perron-Frobenius theory for Hilbert’s metric [lemmens2012nonlinear], which is a well-established framework to study Sinkhorn potentials [peyre2017computational]. We believe this result is interesting not only for the application of FW to the Sinkhorn barycenter problem, but also for further understanding regularity properties of entropic optimal transport.
4 Algorithm: practical Sinkhorn barycenters
According to Sec. 3, FW is a valid approach to tackle the barycenter problem creftype 9. Here we describe how to implement in practice the abstract procedure of Alg. 1 to obtain a sequence of distributions minimizing . A main challenge in this sense resides in finding a minimizing feasible direction for . According to Remark 2, this amounts to solve
with not depending on . In general creftype 12 would entail a minimization over the set of all probability distributions on . However, since the objective functional is linear in and is a weakly- compact convex set, we can apply Bauer maximum principle (see e.g., [aliprantis2006, Thm. 7.69]). Hence, solutions are achieved at the extreme points of the optimization domain, namely Dirac’s deltas for the case of [choquet1969, p. 108]. Now, denote by the Dirac’s delta centered at . We have for every . Hence creftype 12 is equivalent to
Once the new support point has been obtained, the update in Alg. 1 corresponds to
In particular, if FW is initialized with a distribition with finite support, say for some , then also every further iterate will have at most support points. According to creftype 13, the inner optimization for FW consists in minimizing the functional over . In practice, having access to such functional poses already a challenge, since it requires computing the Sinkhorn potentials and , which are infinite dimensional objects. Below we discuss how to estimate these potentials when the have finite support. We then address the general setting.
Computing for probability distributions with finite support
Let , with a probability measure with finite support, with nonnegative weights summing up to . It is useful to identify with the pair , where is the matrix with -th column equal to . Let now be the pair of Sinkhorn potentials associated to and in Prop. 1, recall that . Denote by the evaluation vector
evaluation vectorof the Sinkhorn potential , with -th entry . According to the definition of in creftype 3, for any
since the integral reduces to a sum over the support of . Hence, the gradient of (i.e. the potential ), is uniquely characterized in terms of the finite dimensional vector v collecting the values of the potential on the support of . We refer as SinkhornGradient to the routine which associates to each triplet the map .
Sinkhorn barycenters: finite case
Alg. 2 summarizes FW applied to the barycenter problem creftype 9 when the ’s have finite support. Starting from a Dirac’s delta , at each iteration the algorithm proceeds by: finding the corresponding evaluation vectors ’s and p of the Sinkhorn potentials for and respectively, via the routine SinkhornKnopp (see [cuturi2013sinkhorn, feydy2018interpolating] or Algorithm B.2). This is possible since both and have finite support and therefore the problem of approximating the evaluation vectors and p reduces to an optimization problem over finite vector spaces that can be efficiently solved [cuturi2013sinkhorn]; obtain the gradients and via SinkhornGradient; minimize over to find a new point (we comment on this meta-routine Minimize below); finally update the support and weights of according to creftype 14 to obtain the new iterate .
A key feature of Alg. 2 is that the support of the candidate barycenter is updated incrementally by adding at most one point at each iteration, a procedure similar in flavor to the kernel herding strategy in [bach2012equivalence, lacoste2015sequential]. This contrasts with previous methods for barycenter estimation [cuturi14, BenamouCCNP15, staib2017parallel, decentralized2018], which require the support set, or at least its cardinality, to be fixed beforehand. However, indentifying the new support point requires solving the nonconvex problem creftype 13, a task addressed by the meta-routine Minimize. This problem is typically smooth (e.g., a linear combination of Gaussians when ) and first or second order nonlinear optimization methods can be adopted to find stationary points. We note that all free-support methods in the literature for barycenter estimation are also affected by nonconvexity since they typically require solving a bi-convex problem (alternating minimization between support points and weights) which is not jointly convex [cuturi14, stochwassbary]. We conclude by observing that if we restrict to the setting of [staib2017parallel, decentralized2018] with fixed support set, then Minimize can be solved exactly by evaluating the functional in creftype 13 on each candidate support point.
Sinkhorn barycenters: general case
When the ’s have infinite support, it is not possible to apply Sinkhorn-Knopp in practice. In line with [genevay2018sample, staib2017parallel], we can randomly sample empirical distributions from each and apply Sinkhorn-Knopp to in Alg. 1 rather than to the ideal pair . This strategy is motivated by [feydy2018interpolating, Prop 13], where it was shown that Sinkhorn potentials vary continuously with the input measures. However, it opens two questions: whether this approach is theoretically justified (consistency) and how many points should we sample from each to ensure convergence (rates). We answer these questions in Thm. 7 in the next section.
5 Convergence analysis
We finally address the convergence of FW applied to both the finite and infinite settings discussed in Sec. 4. We begin by considering the finite setting.
The result follows by the convergence result of FW in Thm. 3 applied with the Lipschitz constant computed in Thm. 4, and recalling that with respect to the Total Variation. We note that Thm. 5 assumes SinkhornKnopp and Minimize in Alg. 2 to yield exact solutions. In Appendix D we comment how approximation errors in this context affect the bound in creftype 16.
As mentioned in Sec. 4, when the ’s are not finitely supported we adopt a sampling approach. More precisely we propose to replace in Alg. 2 the ideal Sinkhorn potentials of the pairs with those of , where each is an empirical measure randomly sampled from . In other words we are performing the FW algorithm with a (possibly rough) approximation of the correct gradient of . According to Thm. 3, FW allows errors in the gradient estimation (which are captured into the precision in the statement). To this end, the following result quantifies the approximation error between and in terms of the sample size of .
Theorem 6 (Sample Complexity of Sinkhorn Potentials).
Suppose that with . Then, there exists a constant such that for any and any empirical measure of a set of points independently sampled from , we have, for every
with probability at least , where and .
Thm. 6 is one of the main results of this work. We point out that it cannot be obtained by means of the Lipschitz continuity of in Thm. 4, since empirical measures do not converge in to their target distribution [devroye1990no]. Instead, the proof consists in considering the weaker Maximum Mean Discrepancy (MMD) metric associated to a universal kernel [song2008learning], which metrizes the topology of the convergence in law of [sriperumbudur2011universality]. Empirical measures converge in MMD metric to their target distribution [song2008learning]. Therefore, by proving the Lipschitz continuity of with respect to MMD in Proposition E.5 we are able to conclude that creftype 17 holds. This latter result relies on higher regularity properties of Sinkhorn potentials, which have been recently shown [genevay2018sample, Thm.2] to be uniformly bounded in Sobolev spaces under the additional assumption . For sufficiently large , the Sobolev norm is in duality with the MMD [muandet2017kernel] and allows us to derive the required Lipschitz continuity. We conclude noting that while [genevay2018sample] studied the sample complexity of the Sinkhorn divergence, Thm. 6 is a sample complexity result for Sinkhorn potentials. In this sense, we observe that the constants appearing in the bound are tightly related to those in [genevay2018sample, Thm.3] and have similar behavior with respect to . We can now study the convergence of FW in continuous settings.
Suppose that with . Let and be empirical distributions with support points, each independently sampled from . Let be the -th iterate of Alg. 2 applied to . Then for any , the following holds with probability larger than
The proof is shown in Appendix E. A consequence of Thm. 7 is that the accuracy of FW depends simultaneously on the number of iterations and the sample size used in the approximation of the gradients: by choosing we recover the rate of the finite setting, while for we have a rate of , which is reminiscent of typical sample complexity results, highlighting the statistical nature of the problem.
Remark 8 (Incremental Sampling).
The above strategy requires sampling the empirical distributions for beforehand. A natural question is whether it is be possible to do this incrementally, sampling new points and updating accordingly, as the number of FW iterations increase. To this end, one can perform an intersection bound and see that this strategy is still consistent, but the bound in Thm. 7 worsens the logarithmic term, which becomes .
In this section we show the performance of our method in a range of experiments 111https://github.com/GiulsLu/Sinkhorn-Barycenters.
Discrete measures: barycenter of nested ellipses
We compute the barycenter of randomly generated nested ellipses on a grid similarly to [cuturi14]. We interpret each image as a probability distribution in D. The cost matrix is given by the squared Euclidean distances between pixels. Fig. 2 reports samples of the input ellipses and the barycenter obtained with Alg. 2. It shows qualitatively that our approach captures key geometric properties of the input measures.
Continuous measures: barycenter of Gaussians
We compute the barycenter of Gaussian distributions in , with mean and covariance randomly generated. We apply Alg. 2 to empirical measures obtained by sampling points from each , . Since the (Wasserstein) barycenter of Gaussian distributions can be estimated accurately (see [AguehC11]), in Fig. 2 we report both the output of our method (as a scatter plot) and the true Wasserstein barycenter (as level sets of its density). We observe that our estimator recovers both the mean and covariance of the target barycenter. See the supplementary material for additional experiments also in the case of mixtures of Gaussians.
Image “compression” via distribution matching
Similarly to [stochwassbary], we test Alg. 2 in the special case of computing the “barycenter” of a single measure . While the solution of this problem is the distribution itself, we can interpret the intermediate iterates of Alg. 2 as compressed version of the original measure. In this sense would represent the level of compression since is supported on at most points. Fig. 4 (Right) reports iteration of Alg. 2 applied to the image in Fig. 4 (Left) interpreted as a probability measure in D. We note that the number of points in the support is : indeed, Alg. 2 selects the most relevant support points points multiple times to accumulate the right amount of mass on each of them (darker color = higher weight). This shows that FW tends to greedily search for the most relevant support points, prioritizing those with higher weight
k-means on MNIST digits
We tested our algorithm on a -means clustering experiment. We consider a subset of random images from the MNIST dataset. Each image is suitably normalized to be interpreted as a probability distribution on the grid of pixels with values scaled between and . We initialize centroids according to the -means++ strategy [kmeans++]. Fig. 4 deipcts the centroids obtained by performing -means with Alg. 2. We see that the structure of the digits is successfully detected, recovering also minor details (e.g. note the difference between the centroids).
Real data: Sinkhorn propagation of weather data
We consider the problem of Sinkhorn propagation similar to the one in [Solomon:2014:WPS]. The goal is to predict the distribution of missing measurements for weather stations in the state of Texas, US by “propagating” measurements from neighboring stations in the network. The problem can be formulated as minimizing the functional over the set with: the subset of stations with missing measurements, the whole graph of the stations network, a weight inversely proportional to the geographical distance between two vertices/stations . The variable denotes the distribution of measurements at station of daily temperature and atmospheric pressure over one year. This is a generalization of the barycenter problem creftype 9 (see also [peyre2017computational]). From the total , we randomly select or to be available stations, and use Alg. 2 to propagate their measurements to the remaining “missing” ones. We compare our approach (FW) with the Dirichlet (DR) baseline in [Solomon:2014:WPS] in terms of the error between the covariance matrix of the groundtruth distribution and that of the predicted one. Here is the geodesic distance on the cone of positive definite matrices. The average prediction errors are: (FW), (DR) for , (FW), (DR) for and (FW), (DR) for . Fig. 5 qualitatively reports the improvement of our method on individual stations: a higher color intensity corresponds to a wider gap in our favor between prediction errors, from light green to red . Our approach tends to propagate the distributions to missing locations with higher accuracy.
We proposed a Frank-Wolfe-based algorithm to find the Sinkhorn barycenter of probability distributions with either finitely or infinitely many support points. Our algorithm belongs to the family of barycenter methods with free support since it adaptively identifies support points rather than fixing them a-priori. In the finite settings, we were able to guarantee convergence of the proposed algorithm by proving the Lipschitz continuity of gradient of the barycenter functional in the Total Variation sense. Then, by studying the sample complexity of Sinkhorn potential estimation, we proved the convergence of our algorithm also in the infinite case. We empirically assessed our method on a number of synthetic and real datasets, showing that it exhibits good qualitative and quantitative performance. While in this work we have considered FW iterates that are a convex combination of Dirac’s delta, models with higher regularity (e.g. mixture of Gaussians) might be more suited to approximate the barycenter of distributions with smooth density. Hence, future work will investigate how the perspective adopted in this work could be extended also to other barycenter estimators.
Below we give an overview of the structure of the supplementary material and highlight the main novel results of this work.
Appendix A: abstract Frank-Wolfe algorithm in dual Banach spaces
This section contains full details on Frank-Wolfe algorithm. The novelty stands in the relaxation of the differentiability assumptions.
Appendix B: DAD problems and convergence of Sinkhorn-Knopp algorithm
This section is a brief review of basic concepts from the nonlinear Perrom-Frobeius theory, DAD problems, and applications to the study of Sinkorn algorithm.
Appendix C: Lipschitz continuitity of the gradient of the Sinkhorn divergence with respect to Total Variation
Appendix D: Frank-Wolfe algorithm for Sinkhorn barycenters
This section contains the complete analysis of FW algorithm for Sinkhorn barycenters, which takes into account the error in the computation of Sinkhorn potentials and the error in their minimization. The main result is the convergence of the Frank-Wolfe scheme for finitely supported distributions in Theorem D.2.
Appendix E: Sample complexity of Sinkhorn potential and convergence of Alg. 2 in case of continuous measures
Appendix F: additional experiments
This section contains additional experiment on barycenters of mixture of Gaussian, barycenter of a mesh in 3D (dinosau) and additional figures on the experiment on Sinkhorn propagation described in Sec. 6.
Appendix A The Frank-Wolfe algorithm in dual Banach spaces
In this section we detail the convergence analysis of the Frank-Wolfe algorithm in abstract dual Banach spaces and under mild directional differentiablility assumptions so to cover the setting of Sinkhorn barycenters described in Sec. 3 of the paper.
Let be a real Banach space and let be its topological dual. Let be a nonempty, closed, convex, and bounded set and let be a convex function. We address the following optimization problem
assuming that the set of solutions is nonemtpy.
We recall the concept of the tangent cone of feasible directions.
Let . Then the cone of feasible directions of at is and the tangent cone of at is
is the cone generated by , and it is a convex cone. Indeed, if and , then . Moreover, if , then there exists and such that , . Thus,
So, is a closed convex cone too.
Let and . Then, the directional derivative of G at in the direction is
The above definition is well-posed. Indeed, since is a feasible direction of at , there exists and such that ; hence
Moreover, since G is convex, the function is increasing, hence
It is easy to prove that the function
is positively homogeneous and sublinear (hence convex), that is,
We make the following assumptions about G:
the function is finite, that is, .
The curvature of G is finite, that is,
For every , we have
This follows from (A.2) with and ().
The (inexact) Frank-Wolfe algorithm is detailed in Algorithm A.1.
Let be such that and, for every , (i.e., ). Let and be such that is nondecreasing. Then
Algorithm A.1 does not require the sub-problem to have solutions. Indeed it only requires computing a -minimizer of on , which always exists.
Since is weakly- compact (by Banach-Alaoglu theorem), if is weakly- continuous on , then the sub-problem admits solutions. Note that this occurs when the directional derivative is linear and can be represented in . This case is addressed in the subsequent Proposition A.7.
Let be defined according to Algorithm A.1. Then, for every integer ,