Clustering distributional data according to their spatial similarities has been a core issue in machine learning. Numerous theories and algorithms have been developed for clustering problems to help understand the structure of data and to discover homogeneous groups in their embedding spaces. Clustering algorithms also apply to unsupervised learning problems that pass information from known centroids to unknown empirical samples. Occasionally, researchers regard clustering as finding the optimal semi-discrete correspondence between distributional data, and vice versa.
Optimal transportation (OT) techniques have gained increasing popularity over the past two decades for measuring the distance between distributional data as well as for aligning them. Rooted in OT theory, several OT-based clustering algorithms have emerged in recent years as alternatives, thanks to their efficiency and robustness. In these works, researchers discovered connections between different clustering problems and the OT problem through the Wasserstein barycenter (WB) formulation, which computes a “mean” of one or multiple distributions. However, most of them deliver the results as soft assignments that need to be further discretized.
In this paper, we propose to compute the Wasserstein barycenter based on Monge OT and explore its natural connections to different clustering problems that prefer hard assignments. We base our OT solver on variational principles and coin our method variational Wasserstein barycenters. We study the metric properties of WBs and use them to explain and solve different clustering-related problems such as regularized K-means clustering, co-clustering, and vector quantization and compression. We also show its immunity to unbalanced measures and its extension to measures on spherical domains. We discuss our method from different angles through comparison with other barycenter methods. We show the advantages of Monge OT-based barycenters in solving geometric clustering problems. We are among the first to compute Monge barycenters and discover their connections to clustering problems.
2 Related Work and Our Contributions
Computational clustering algorithms date back to (lloyd1982least; forgy1965cluster) for solving K-means problems. Since then, researchers have proposed different formulations and algorithms such as spectral clustering and density-based clustering. Mixture modeling, especially Gaussian mixture modeling, is also considered a robust solution to clustering problems. Hierarchical clustering and co-clustering have also attracted much attention in the machine learning community. (xu2005survey) surveys some classic clustering algorithms. The term “geometric clustering” appeared in the early literature, such as (murtagh1983survey; quigley2000fade), referring to clustering samples into subspaces according to their location in the metric space, usually the Euclidean space. In (applegate2011unsupervised), the authors discuss the connection between K-means and another famous problem – the OT distance, or the Wasserstein distance.
The transportation problem has attracted many mathematicians since its very birth. Monge first raised the problem (monge1781memoire) as finding a measure-preserving map between probability measures; Kantorovich extended the problem to finding a joint probability measure (kantorovich1942translocation); Brenier further connected the OT problem to fluid dynamics and convex geometry (brenier1991polar). Its early applications include comparing 1D histograms for image retrieval (rubner2000earth). Thanks to efficient OT solvers, e.g., (cuturi2013sinkhorn), OT has become a popular tool in machine learning for comparing distributional data.
Meanwhile, by regarding the OT distance as a metric, we can interpolate in the space of probability measures. (mccann1997convexity) laid the foundation; (agueh2011barycenters) developed the problem into a general scenario and coined the term “Wasserstein barycenters”. (cuturi2014fast; ho2017multilevel; mi2018variational) relate WBs to K-means-like clustering problems, and (leclaire2019fast; lee2019hierarchical) explored the use of OT for hierarchical clustering. (claici2018stochastic) is among the latest work on scalable semi-discrete Wasserstein barycenters. Most of these follow Kantorovich’s static OT; few follow Monge’s, or Brenier’s, dynamic version that regards OT as a gradient flow in the probability space.
Compared to previous work, our contribution is three-fold: 1) We derive the WB based on Monge’s OT formulation and explore its connections to different clustering problems; 2) We prove the metric properties of our WB and propose it as a metric for evaluating multi-marginal clustering algorithms; 3) We explore the advantages and disadvantages of Monge WB through empirical comparison with other methods.
3 Primer on Optimal Transportation
We begin by reiterating key concepts of optimal transportation (OT), variational OT, and Wasserstein barycenters (WBs). Suppose $\mu$, $\nu$ are Borel probability distributions supported in Polish spaces $X$, $Y$, respectively. Let $\mathcal{P}(X)$ be the set of all probability distributions on $X$. Then, we denote by $\Pi(\mu, \nu)$ the set of all transportation plans between $\mu$ and $\nu$. Thus, $\pi \in \Pi(\mu, \nu)$ is also a joint distribution of $\mu$ and $\nu$, and $\pi(x, y)$ specifies the mass transported between $x \in X$ and $y \in Y$. In addition, we use $c(x, y)$ to specify the transportation cost between $x$ and $y$.
3.1 Optimal Transportation
The OT problem is to minimize the total transportation cost:
$$\pi^* = \arg\min_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y} c(x, y)^p \, d\pi(x, y),$$
where $p \geq 1$ indicates the moment of the cost function. Then, we call this minimum the $p$-Wasserstein distance:
$$W_p(\mu, \nu) = \left( \min_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y} c(x, y)^p \, d\pi(x, y) \right)^{1/p}.$$
The above is the well-known Kantorovich OT formulation, which allows a partial map that splits the measure during transportation. In Monge’s original version, each location $x$ has a unique correspondence $y = T(x)$. If we define such a map as $T: X \to Y$, then we have $\pi = (\mathrm{id}, T)_{\#}\mu$ and Monge OT:
$$T^* = \arg\min_{T_{\#}\mu = \nu} \int_X c(x, T(x))^p \, d\mu(x). \quad (1)$$
$T_{\#}\mu = \nu$ means $T$ pushes $\mu$ forward to $\nu$; more rigorously, $\nu(B) = \mu(T^{-1}(B))$ for any measurable set $B \subseteq Y$. We direct readers to (villani2003topics; peyre2019computational) for more on OT. In this paper, we focus on Monge OT. In particular, we narrow our discussion to $p = 2$, the Euclidean ground metric $c(x, y) = \|x - y\|$, and $X = Y \subseteq \mathbb{R}^d$ unless specified otherwise. Hence, we compute $W_2$.
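To make the Monge formulation concrete, here is a minimal, self-contained sketch (ours, not the paper's solver): for two uniform discrete measures with the same number of atoms, Monge OT with the quadratic cost reduces to an assignment problem, which we brute-force over permutations for tiny inputs; real solvers use the Hungarian algorithm or the variational approach discussed next.

```python
# Toy Monge OT between two uniform discrete measures via brute force.
from itertools import permutations

def monge_w2_sq(xs, ys):
    """Squared 2-Wasserstein distance and the optimal permutation (the Monge
    map) between uniform measures on the point lists xs and ys."""
    def sq(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(len(ys))):
        # Average squared cost under the assignment i -> perm[i], weights 1/n.
        cost = sum(sq(xs[i], ys[j]) for i, j in enumerate(perm)) / len(xs)
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm

# Each source point is shifted straight up by 1, so W2^2 = 1.
xs = [(0.0, 0.0), (1.0, 0.0)]
ys = [(0.0, 1.0), (1.0, 1.0)]
w2_sq, T = monge_w2_sq(xs, ys)  # w2_sq == 1.0, T == (0, 1)
```

Each source point is matched to exactly one target atom, which is precisely the hard, measure-preserving correspondence that Kantorovich's relaxation would be allowed to split.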
3.2 Variational Optimal Transportation
Directly computing a Monge map is highly intractable and variational methods have been adopted by many researchers. (de2012blue; gu2013variational; levy2015numerical) offer three variational formulations. We follow (gu2013variational) and in this paper refer to it as variational OT or VOT.
Suppose $\nu$ is supported on discrete atoms $\{y_j\}_{j=1}^{M}$. The problem becomes semi-discrete OT. VOT starts with a piece-wise linear function $u_h(x) = \max_j \{\langle x, y_j \rangle + h_j\}$. Each $y_j$ associates with a height $h_j$. The gradient $\nabla u_h(x) = y_j$, where $y_j$ induces the maximum, serves as a map from $\mu$ to $\nu$; it induces a partition of $X$ into cells $V_j = \{x : \nabla u_h(x) = y_j\}$. For simplicity, we drop the subscript and write $u$. We introduce an energy:
$$E(h) = \int_X u_h(x) \, d\mu(x) - \sum_{j=1}^{M} \nu_j h_j, \quad (2)$$
whose gradient is
$$\frac{\partial E}{\partial h_j} = \mu(V_j) - \nu_j. \quad (3)$$
Meanwhile, the Lagrangian dual of Monge OT (1) is
$$\max_{h} \; \int_X \min_j \left\{ \|x - y_j\|^2 - h_j \right\} d\mu(x) + \sum_{j=1}^{M} \nu_j h_j, \quad (4)$$
where $h = (h_1, \dots, h_M)$. (4) simplifies to the partition
$$V_j = \left\{ x : \|x - y_j\|^2 - h_j \le \|x - y_i\|^2 - h_i, \; \forall i \right\}, \quad (5)$$
which coincides with a power Voronoi diagram.
We provide detailed derivations of the above formulas in the Appendix and prove their connections there. Therefore, we “variationally” minimize $E(h)$, (2), for a height vector $h^*$ that produces a Monge map $T: V_j \mapsto y_j$.
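As an illustration of this variational scheme, the following 1D toy sketch (our simplification, not the paper's implementation; `power_assign` and `vot_heights` are hypothetical helper names) adjusts the heights by gradient descent until the power cells of the target atoms capture their prescribed measures.

```python
import random

def power_assign(x, ys, h):
    # Assign sample x to the atom minimizing the power distance |x - y_j|^2 - h_j.
    return min(range(len(ys)), key=lambda j: (x - ys[j]) ** 2 - h[j])

def vot_heights(samples, ys, nu, lr=0.5, iters=50):
    # Gradient descent on the heights: the gradient w.r.t. h_j is
    # (mass captured by cell j) - nu_j, estimated from Monte Carlo samples.
    h = [0.0] * len(ys)
    mass = [0.0] * len(ys)
    for _ in range(iters):
        mass = [0.0] * len(ys)
        for x in samples:
            mass[power_assign(x, ys, h)] += 1.0 / len(samples)
        h = [hj + lr * (nu[j] - mass[j]) for j, hj in enumerate(h)]
    return h, mass

random.seed(0)
samples = [random.uniform(0.0, 1.0) for _ in range(2000)]  # mu ~ U[0, 1]
ys, nu = [0.25, 0.75], [0.3, 0.7]                          # atoms and target measure
h, mass = vot_heights(samples, ys, nu)
# the cell of the first atom ends up holding ~0.3 of the mass, as prescribed
```

Raising $h_j$ enlarges cell $j$, so the fixed point of the iteration is exactly the balance condition that each cell carries its target measure.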
3.3 Wasserstein Barycenters
The Wasserstein distance (WD) satisfies all metric properties. The Fréchet mean of a collection of distributions w.r.t. the WD is called the Wasserstein barycenter (WB). It is the minimizer of the weighted average:
$$\bar{\mu} = \arg\min_{\mu} \sum_{k=1}^{N} \lambda_k \, W_2^2(\mu, \mu_k), \quad (6)$$
for $\lambda_k \ge 0$ and $\sum_k \lambda_k = 1$. We simplify (6) by assuming uniform weights and rewrite it as
$$\bar{\mu} = \arg\min_{\mu} \frac{1}{N} \sum_{k=1}^{N} W_2^2(\mu, \mu_k), \quad (7)$$
with each $W_2(\mu, \mu_k)$ realized by a Monge OT map $T_k$ for all $k$. Suppose the barycenter is supported on discrete atoms $\{y_j\}_{j=1}^{M}$. If we fix the measure and only allow updating the support, then readers can notice that (7) is simultaneously solving $N$ constrained K-means problems using the same set of centroids with fixed capacities $\nu_j$. Each $T_k$ serves as the optimal assignment function in its K-means problem. Note that each $T_k$ is a hard assignment that has only one target because we solve Monge OT.
To clarify notation, we use $\mu$ to denote a probability distribution, whether continuous or discrete. If it is discrete, namely a collection of Dirac measures, then we use $\{x_i\}$ and $\{\mu_i\}$ to denote its supports and measures; $x_i$ and $\mu_i$ specify the location and measure of the $i$th Dirac measure.
4 Variational Wasserstein Barycenters
Solving the WB problem relies on alternately solving OT problems and updating the barycenter. Eventually, the barycenter minimizes the average WD between the empirical distributions and itself. A discrete barycenter consists of its support and its measure. Updating both of them, e.g., (ye2017fast), is difficult and even troublesome in some cases (see Appendix). In this paper, we follow (cuturi2014fast) and only update one of them while fixing the other throughout the optimization.
4.1 Discrete Barycenters via VOT
We first solve the VOT problems (2), one per marginal $\mu_k$, each with its own height vector $h_k$. The derivative w.r.t. the VOT optimizer $h_{k,j}$ is
$$\frac{\partial E}{\partial h_{k,j}} = \mu_k(V_{k,j}) - \nu_j, \quad (8)$$
which, in practice, can be replaced by its stochastic version,
$$\frac{\partial E}{\partial h_{k,j}} \approx \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[ x_i \in V_{k,j} \right] - \nu_j,$$
where the $x_i$'s are now Monte Carlo samples of $\mu_k$. Then, we can naturally adopt the gradient descent (GD) update:
$$h_k \leftarrow h_k - \eta \, \nabla_{h_k} E. \quad (9)$$
For completeness, we give the second-order derivative in Appendix. Its computation, however, involves integrating over the Voronoi facets and thus is intractable in general.
To solve for the supports $y$, we rewrite the objective of the WB (7) as
$$\min_{y} \frac{1}{N} \sum_{k=1}^{N} \sum_{j=1}^{M} \int_{V_{k,j}} \|x - y_j\|^2 \, d\mu_k(x). \quad (10)$$
The critical point of this quadratic energy w.r.t. each $y_j$ has a closed form:
$$y_j = \frac{\sum_{k} \int_{V_{k,j}} x \, d\mu_k(x)}{\sum_{k} \mu_k(V_{k,j})} \approx \frac{\sum_{k} \sum_{x_i \in V_{k,j}} \mu_{k,i} \, x_i}{\sum_{k} \sum_{x_i \in V_{k,j}} \mu_{k,i}},$$
which is the center of mass of its correspondences across all measures. The latter expression is the “stochastic” version.
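A minimal sketch of this support update (illustrative names, not the paper's API): with the Monge assignments fixed, each barycenter support moves to the center of mass of its assigned samples pooled across all marginals.

```python
def update_supports(marginals, assignments, weights, m):
    """marginals[k]  : list of sample points (tuples) of measure k
       assignments[k]: assignments[k][i] = barycenter atom index for sample i
       weights[k]    : per-sample measures, one list per marginal
       m             : number of barycenter atoms"""
    dim = len(marginals[0][0])
    num = [[0.0] * dim for _ in range(m)]   # weighted coordinate sums per atom
    den = [0.0] * m                          # total assigned mass per atom
    for pts, asg, w in zip(marginals, assignments, weights):
        for p, j, wi in zip(pts, asg, w):
            for d in range(dim):
                num[j][d] += wi * p[d]
            den[j] += wi
    return [tuple(c / den[j] for c in num[j]) for j in range(m)]

# Two marginals, two samples each, uniform weights; sample i goes to atom i.
m0 = [(0.0, 0.0), (2.0, 0.0)]
m1 = [(0.0, 2.0), (2.0, 2.0)]
ys = update_supports([m0, m1], [[0, 1], [0, 1]], [[0.5, 0.5]] * 2, 2)
# atom 0 averages (0,0) and (0,2) -> (0.0, 1.0); atom 1 -> (2.0, 1.0)
```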
The last step is to derive the update rule for the measure $\nu$. (10) is not differentiable w.r.t. $\nu$. Still, we follow (cuturi2014fast; mi2018regularized) and give the critical point (derivation in the Appendix):
$$\nu_j = \frac{1}{N} \sum_{k=1}^{N} \mu_k(V_{k,j}).$$
It coincides with the result of Lloyd’s K-means algorithm, in which the measure on each centroid accumulates all its assigned empirical measures.
Now that we have derived the rules for updating the supports and the measure, we summarize our algorithm for computing the VWB of a collection of measures in the Appendix. As for the initial guess of the barycenter, if not specified, we either run Lloyd’s algorithm on all the measures as a whole and adopt the resulting centroids, or uniformly sample the space. The choice of the measure on the centroids depends on the specific application. A ubiquitous choice is uniform Dirac measures, i.e., $\nu_j = 1/M$. Figure 1 suggests that by regarding the WD as the metric, we can find a mean shape on the same manifold, if one exists.
Our method does converge since we follow coordinate descent and every step is convex (grippo2000convergence), given the assumptions made in 3.1. In total, there are $N \times M$ variables for computing the Monge maps, plus $M \times d$ variables for the support and $M$ variables for the measure. We implemented the VWB with PyTorch (paszke2019pytorch). The code to reproduce the figures in this paper is at https://github.com/icemiliang/pyvot.
4.2 Metric Properties of (V)WBs
In spite of extensive studies on the metric properties of OT over the past century, the metric properties of Wasserstein barycenters have not yet been fully explored. Some pioneering work includes (papadakis2019approximation; auricchio2018computing). However, most of it focuses on the barycenter of two measures. We show in the following that WBs in general induce a generalized metric (n-metric). First, let us define the total Wasserstein distance between the barycenter and all the marginal Borel measures:
$$\overline{W}(\mu_1, \dots, \mu_N) = \sum_{k=1}^{N} W_2(\bar{\mu}, \mu_k).$$
Then, we raise the following two propositions and prove them in the Appendix.
$\overline{W}$ defines a generalized metric among $\mu_1, \dots, \mu_N$. Specifically, $\overline{W}$ satisfies the following properties.
1) Non-negativity: $\overline{W}(\mu_1, \dots, \mu_N) \ge 0$.
2) Symmetry: $\overline{W}(\mu_{\sigma(1)}, \dots, \mu_{\sigma(N)}) = \overline{W}(\mu_{\tau(1)}, \dots, \mu_{\tau(N)})$, where $\sigma$ and $\tau$ are different permutations of the set $\{1, \dots, N\}$.
3) Identity: $\overline{W}(\mu_1, \dots, \mu_N) = 0$ if and only if $\mu_1 = \cdots = \mu_N$.
4) Triangle inequality: $\overline{W}(\mu_1, \dots, \mu_N) \le \sum_{i=1}^{N} \overline{W}(\mu_1, \dots, \mu_{i-1}, \eta, \mu_{i+1}, \dots, \mu_N)$ for any Borel measure $\eta$.
The bound of the triangle inequality in Proposition 2 can be tightened by a linear factor.
The VWB, as a special case of WBs, certainly inherits the metric properties, because there is no such restriction on the continuity of the barycenter. If we denote the total WD for the VWB by $\overline{W}_v$, then we have:
Corollary 1. $\overline{W}_v$ induces an n-metric over all the marginals. In particular, the equal signs in 1) non-negativity and 4) the triangle inequality hold if and only if all the marginals and the barycenter have the same number of supports with positive Dirac measures.
4.3 Approximate WDs with VWBs – Transshipment
We consider the transshipment problem as finding a Monge map from the source to the target that passes through a relay measure in the middle (see Figure 2). We solve it with VWBs. Our discussion follows directly from the conclusions in 4.2: as a special case of Corollary 1, the total WD induces a (2-)metric between the two marginals, and it is lower-bounded by the true WD between them. The Appendix contains the proof. Then, we can use a VWB to connect two measures and regard the total WD as an approximation to the true WD between them. We name it the variational Wasserstein distance, or VWD:
$$\widetilde{W}(\mu_1, \mu_2) = W_2(\mu_1, \bar{\mu}) + W_2(\bar{\mu}, \mu_2). \quad (11)$$
We use the toy data above to evaluate the approximation against the number of supports, $M$. The two Gaussian measures share the same covariance matrix; their means differ by 1. Thus, the true WD is 1. We use the results from linear programming (LP) and Sinkhorn algorithms for reference. Figure 3 shows that the VWD is still accurate with few supports. For each number of supports in the experiments, we run our algorithm 10 times with different random initial locations and draw the error band in light color. Even with few supports, our algorithm produces stable approximations that have almost zero variance.
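The reference value above can be sanity-checked without any OT solver: in 1D, the empirical W2 is given by the sorted (monotone) matching, which is itself a Monge map. The snippet below (our check, not from the paper) draws samples from two unit-variance Gaussians with means 0 and 1 and recovers a distance close to 1.

```python
import math
import random

def w2_1d(xs, ys):
    # 1D empirical W2: sort both samples and match them monotonically.
    xs, ys = sorted(xs), sorted(ys)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs))

random.seed(0)
a = [random.gauss(0.0, 1.0) for _ in range(20000)]  # N(0, 1)
b = [random.gauss(1.0, 1.0) for _ in range(20000)]  # N(1, 1), same covariance
dist = w2_1d(a, b)  # close to the closed-form value 1
```

For Gaussians with identical covariance, the optimal map is a pure translation, so W2 equals the distance between the means.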
4.4 On Unbalanced Measures
When measures are not probabilities or, more generally, their total masses are not equal, we are solving unbalanced OT. (benamou2003numerical) first explored the problem. Researchers have since offered various formulations and perspectives, e.g., (liero2018optimal) adds divergences as regularizers instead of constraints on the marginals. Here, we discuss VOT and VWBs for unbalanced measures. Without loss of generality, let us assume the target measure $\nu$ is uniform, $\nu_j = 1/M$. We denote the mass in each power Voronoi cell by $m_j = \mu(V_j)$. Inspired by the discussion in (peyre2019computational), we propose to penalize the quadratic mismatch of the mass for each cell:
$$\min \; \sum_{j=1}^{M} \int_{V_j} \|x - y_j\|^2 \, d\mu(x) + \lambda \sum_{j=1}^{M} \left( m_j - \nu_j \right)^2. \quad (12)$$
In the following, we discuss (12) in two cases: the balanced case, $\mu(X) = 1$, and a more general one, $\mu(X) = s \neq 1$.
Case 1: $\mu(X) = 1$. It is trivial to verify that minimizing the second term in (12) over the cell masses under the total-mass equality constraint yields all $m_j$'s equal to each other, i.e., $m_j = 1/M = \nu_j$. On the other hand, the gradient of the VOT energy (2) has the form $m_j - \nu_j$. The question now is whether this configuration minimizes (2). If so, then we can instead solve (2) to minimize the second term in (12).
When all the $m_j$'s are equal, the gradient becomes constant and thus $h$ is being translated at a constant rate. Certainly, translation does not modify a power Voronoi diagram as specified in (5). Therefore, the optimization saturates. For any other partition whose masses are not all equal, the penalty in (12) is strictly larger.
Case 2: $\mu(X) = s \neq 1$. It is also trivial to verify that minimizing the second term in (12) over the cell masses yields $m_j = s/M$ (replace $1$ with $s$). This configuration also triggers the convergence of VOT as in Case 1.
At this point, we claim that VOT, (2), minimizes the total transportation cost regardless of whether the total measures are equal. We leave rigorous proofs to future work. We illustrate the convergence in Figure 4: the top half shows VOT between balanced measures, and the bottom half shows unbalanced measures. Note that the gradient of the VOT and VWB, (8), correlates with the absolute measure values. Thus, we should scale the step size in (9) for each VOT according to the total measure of each marginal. Figure 4 shows that under the same (scaled) GD step size, VOT in the two cases follows the same trend.
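The claim in Cases 1 and 2 can be checked numerically. The sketch below assumes, for illustration, a uniform target measure; it verifies that, subject to a fixed total mass s, the quadratic penalty term is minimized by spreading the mass evenly, s/M per cell, and that any feasible perturbation costs more.

```python
def penalty(m, nu):
    # Quadratic mass-mismatch penalty: the second term of the unbalanced objective.
    return sum((mj - vj) ** 2 for mj, vj in zip(m, nu))

M, s = 4, 1.5                      # M cells; empirical total mass s != 1
nu = [1.0 / M] * M                 # uniform target measure
m_star = [s / M] * M               # claimed minimizer: spread the excess evenly
eps = 0.01                         # move eps of mass between two cells (total still s)
m_pert = [m_star[0] + eps, m_star[1] - eps] + m_star[2:]
# penalty(m_star, nu) < penalty(m_pert, nu) for any such perturbation
```

The minimizer follows from convexity: projecting the target measure onto the fixed-total-mass constraint shifts every coordinate by the same amount.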
4.5 On Spherical Domains
Optimal transport on geometric domains other than the Euclidean domain extends its applications (solomon2015convolutional; staib2017parallel; cui2019spherical). (cui2019spherical) relates spherical power Voronoi diagrams to OT on unit spheres. Inspired by that, we study our VWB on spherical domains and its metric properties.
Let us define a new ground metric on the unit sphere $S^{d-1}$ as the geodesic distance $c(x, y) = \arccos \langle x, y \rangle$, and the corresponding OT distance subject to the usual marginal constraints for all non-negative measures. Following (cui2019spherical), we define the power distance on a sphere and thus the power Voronoi diagram in the spherical domain; the weight of each power cell relates to the VOT minimizer $h$. Then, the derivation in 3.2 gives us the Monge map.
This ground cost does not satisfy the triangle inequality but satisfies the other three metric properties; thus, the resulting OT distance inherits those properties. We notice that the proof for Proposition 2 does not require the triangle inequality. Therefore, the n-metric properties still hold for the barycenter w.r.t. the spherical OT distance.
5 Geometric Clustering via VWBs
In this section, we further connect VWBs to several clustering problems. We consider a fixed number of clusters, the quadratic Euclidean distance as the ground metric, and mainly the spatial relations between samples. We refer to this scenario as geometric clustering. From now on, we discretize the measures and, unless stated otherwise, assume the empirical samples carry equal weights.
5.1 Regularized K-Means Clustering
In light of the discovery of VWBs for unbalanced measures in 4.4, we now introduce a relaxed version of the constrained K-means problem. We call it regularized K-means.
The classic K-means problem has the following objective:
$$\min_{y, T} \sum_{i=1}^{n} \left\| x_i - y_{T(i)} \right\|^2, \quad (14)$$
where $n$ is the number of samples and $T(i)$ assigns sample $x_i$ to a centroid. By adding the marginal constraint with pre-defined, fixed measures on the centroids, we turn (14) into the constrained K-means problem (bradley2000constrained; cuturi2014fast), or the Wasserstein means problem coined in (ho2017multilevel). As discussed in Section 4.4, when the total measures are not equal, such constraints instead become regularizers. Then, we define the objective of the regularized K-means clustering problem as:
$$\min_{y, T} \sum_{i=1}^{n} \mu_i \left\| x_i - y_{T(i)} \right\|^2 + \lambda \sum_{j=1}^{M} \left( m_j - \nu_j \right)^2, \quad (15)$$
where $m_j$ is the mass assigned to centroid $y_j$. If $\lambda = 0$, (15) becomes K-means; if $\lambda \to \infty$, (15) becomes Monge OT. As practiced in (cuturi2014fast; mi2018variational), we can alternately solve for the assignments and the centroids. The energy (15) monotonically decreases and eventually converges to a fixed point (a cycle of length one). Figure 7 illustrates the regularized K-means result, which informally looks like an interpolation between K-means and constrained K-means.
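To make objective (15) concrete, here is a toy evaluation (our reading of the objective; the function name is illustrative): a data term plus lam times the squared mismatch between the realized cluster masses and the target measure. A labeling that violates the target capacities pays both a larger data term and a mass penalty.

```python
def reg_kmeans_objective(points, labels, centers, nu, lam):
    # Data term: average squared distance of each sample to its centroid.
    data = sum((p - centers[l]) ** 2 for p, l in zip(points, labels)) / len(points)
    # Mass term: squared mismatch between realized and target cluster masses.
    mass = [0.0] * len(centers)
    for l in labels:
        mass[l] += 1.0 / len(points)
    return data + lam * sum((m - v) ** 2 for m, v in zip(mass, nu))

pts = [0.0, 0.1, 0.9, 1.0]
centers, nu = [0.05, 0.95], [0.5, 0.5]
balanced = reg_kmeans_objective(pts, [0, 0, 1, 1], centers, nu, lam=10.0)
skewed   = reg_kmeans_objective(pts, [0, 0, 0, 1], centers, nu, lam=10.0)
# the skewed labeling pays both a larger data term and a mass penalty
```

Setting `lam=0` recovers the plain K-means score, while a very large `lam` effectively enforces the capacities, matching the two limits discussed above.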
5.2 Co-clustering Spatial Features in
Extending the Wasserstein clustering procedure to multiple targets induces the co-clustering problem. In this section, we discuss the connection between co-clustering problems and VWBs. In particular, because we use quadratic Euclidean distances as the ground metric, we focus on co-clustering spatial features embedded in the Euclidean space.
Given multiple distributional data, the goal of co-clustering is to simultaneously partition each domain so as to 1) minimize the pairwise variance within the same cluster and 2) minimize the pairwise variance of each cluster across domains. We assume all samples reside in the same Euclidean space equipped with the quadratic distance; then we obtain the objective (16), where $n_{k,j}$ denotes the number of samples in cluster $j$ of domain $k$, and the shared cluster indices specify the correspondence of the clusters across different domains. As in K-means, we can simplify the pairwise variance with the mean of each cluster at each domain, $y_{k,j}$, the cluster center for each cluster at each domain. The first term of (16) solves a K-means problem within each domain; the second term solves K-means problems, but with the cluster centroids at the other domains. Thus, up to an additive constant, we convert co-clustering to K-means problems sharing the same set of centroids.
Then, we can naturally impose a constraint on the weights, i.e., fixed cluster capacities, to turn the problem into a VWB problem, which is also a constrained K-means problem. Note that it is trivial to extend it into a generalized VWB problem by instead inserting the weighted constraint into the main objective as we did in 5.1.
5.2.1 Regularized VWBs for Co-Clustering
In addition to purely clustering feature domains according to Wasserstein losses, we can regularize the correspondences based on prior knowledge. Inspired by (alvarezmelis2019towards; mi2018regularized), we regularize the correspondence by global invariances. Directly regularizing Monge correspondences is highly intractable because Monge maps are essentially binary permutations and thus not differentiable. Therefore, we instead regularize the centroid update process.
To this end, instead of using the average of the centroids as we did in (18), we estimate the rigid transformation (isometry) between the VWB and the centroids of each domain, i.e., a rotation composed with a translation, by minimizing the squared alignment error. This can be done by singular value decomposition (SVD) with minimal computational cost. After that, we average all the transformations by separately averaging the rotations and the translations. With an abuse of notation, we simply denote the process as averaging but, as noted, we need to factorize the rotations into quaternions before averaging them. The final location of the supports is given by applying the averaged transformation.
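The SVD step above is the classic Kabsch/orthogonal-Procrustes solution. Below is a small self-contained sketch (a standard textbook routine, not the paper's code) that recovers a planted rotation and translation from matched point sets.

```python
import numpy as np

def kabsch(a, b):
    """Rotation R and translation t minimizing sum_i |R a_i + t - b_i|^2.
    a, b: (n, d) arrays of corresponding points; returns R, t with
    b ~ a @ R.T + t."""
    ca, cb = a.mean(0), b.mean(0)
    H = (a - ca).T @ (b - cb)                  # cross-covariance of centered sets
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    D = np.diag([1.0] * (a.shape[1] - 1) + [d])
    R = Vt.T @ D @ U.T
    t = cb - R @ ca
    return R, t

theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
t_true = np.array([0.5, -0.2])
a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 2.0]])
b = a @ R_true.T + t_true                      # planted rigid motion
R, t = kabsch(a, b)                            # recovers R_true and t_true
```

Averaging several such transformations, as described above, requires converting the rotations to quaternions first, since rotation matrices do not average componentwise.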
5.3 Vector Quantization and Data Compression
Lloyd’s K-means algorithm was initially proposed for vector quantization and has been a fundamental choice for data compression. It centers on using fewer samples to approximate the entire distribution. In light of the connection between VWBs and K-means, we raise the problem of compressing multiple distributional data as a whole with Wasserstein barycenters and propose the VWB as a natural choice. It shares the same objective as the WB. Intuitively, we use the sum of WDs to measure the compression error.
By using VOT, we obtain a surjection from each domain to the barycenter. Because we optimize over the height vector (9), given the empirical samples and the barycenter, we can fully recover the surjection by using only the heights, at the negligible expense of computing the power distance as in (5). In this way, for a barycenter of $M$ supports and $N$ empirical distributions each having $n$ samples, we reduce the storage burden from $O(NnM)$, as it would be for Sinkhorn distance-based methods that store dense transport plans, to $O(NM)$. This is particularly useful when the number of samples is large and when we need to store multiple interpolations between marginals.
Furthermore, with the VWB, we do not even need the original distributions to parameterize the compression maps: our method is based on the geometry of the data, and given the height vector and the barycenter supports, we can uniquely partition each original domain with a power Voronoi diagram (5) or, equivalently, the graph of the piece-wise linear function $u$.
We demonstrate the use of VWBs with point cloud interpolation and image compression.
6.1 Point Cloud Interpolation with Global Invariance
Shape interpolation is a typical application of Wasserstein barycenter techniques. We compute the barycenter that has the minimum weighted average WD to all the marginal shapes. When the marginals are congruent to each other, we can leverage the congruency to regularize the process of updating the barycenter. We adopt the approach in 5.2.1 and compute the VWB that has the minimum VWD to two marginal shapes. The correspondences are regularized by a rigid transformation in order to preserve the global structure of the shape. Ideally, we obtain a “mean” shape that lies in the middle of the marginals, and the rotations to the marginals share the same angle but in opposite directions. Figure 8 shows the result, which verifies our hypothesis.
In this experiment, we are given two Kittens off by an unknown rigid transformation. Our goal is to interpolate, by computing a regularized Wasserstein barycenter, a new Kitten in between that is related to the original Kittens by rigid transformations whose translation and rotation are linear in the weights of the two original Kittens.
The marginal Kittens each have the same number of sample points, all with equal weights. They are apart from each other by a rigid transformation composed of a random translation vector and a random rotation matrix. The barycenter Kitten w.r.t. the VWD (variational Wasserstein distance) has a fixed number of supporting Dirac measures and a fixed regularization strength. One post-processing option to transport all the samples from the marginals is, for each sample, to find its nearest 3 or more cluster centers and use inverse barycentric coordinates to find its new location on the target Kitten in the middle.
6.2 Image Compression
We demonstrate the use of our method for data compression by quantizing the RGB colors of an image into a fixed number of clusters. See Figure 9 for the results. The top row shows the original images. We embed all the pixels into the RGB color space. Our goal is to compute, for example, 16 centroids that partition all the pixels into their clusters. In this way, we compress the storage for each pixel from 24 bits to 4 bits. The second row in Figure 9 shows the resulting images from Lloyd’s K-means(++) algorithm, and the third row shows the results of using our VOT solver. Compared to Lloyd’s, VOT distributes the centroids well across the pixel domain, resulting in a smoother transition from color to color. The correspondences in the color space, shown in the Appendix, also confirm this. Finally, we simultaneously merge and compress the colors from all three images by using the VWB. The last row shows the resulting images sharing the same color distribution that consists of only 16 discrete centroids; it has the same WD to each original color distribution (marginal). In the Appendix, we further show results in which the centroids carry different weights w.r.t. each marginal, as in (11). We show the RGB color distribution of each image in the Appendix.
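The quantization step itself is plain Lloyd K-means on RGB triples; the following toy sketch (with K = 2 and four hand-picked pixels instead of 16 centroids on real images) shows the mechanics of collapsing 24-bit colors onto a small codebook of centroid indices.

```python
def kmeans(points, centers, iters=10):
    # Plain Lloyd iterations: assign each pixel to its nearest centroid,
    # then move each centroid to the mean of its assigned pixels.
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(len(centers)),
                      key=lambda j: sum((p - c) ** 2 for p, c in zip(pt, centers[j])))
                  for pt in points]
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

pixels = [(250, 10, 10), (240, 0, 0), (10, 10, 250), (0, 0, 240)]
centers, labels = kmeans(pixels, [(255, 0, 0), (0, 0, 255)])
# reds collapse onto one centroid, blues onto the other
```

After quantization, each pixel stores only the index of its centroid, which is where the 24-bit to 4-bit reduction for a 16-color codebook comes from.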
We conclude by discussing the advantages and disadvantages of VWBs and several future directions.
Algorithms solving K-means-like clustering problems are in general sensitive to initialization. Common solutions include using a subset of samples and spreading the seeds across the domain, e.g., K-means++. We tried the results from K-means++ as the initial guess for our barycenters, and also tried a pre-defined Gaussian distribution whose mean is the average of the means of the marginals as prior knowledge. We did not find visible differences.
Monge maps between discrete measures may not exist, e.g., transporting 3 Dirac points to 2 Dirac points. In this case, splitting the mass becomes necessary (wang2013linear). Moreover, there might be multiple solutions, and variational solvers cannot recover any of them: two one-to-one maps may exist, yet VOT can recover neither, because the target measures cannot be distinguished by the piece-wise linear function $u$ in 3.2. Therefore, when dealing with stochastic GD, having sufficient samples to represent the domain is key to stabilizing VWBs. Luckily, increasing the number of empirical samples adds little computational burden if we update the correspondence for each empirical sample in parallel according to its nearest centroid. On the other hand, Sinkhorn iteration-based OT methods produce soft correspondences that unavoidably result from the entropic regularization, making them robust for discrete measures. Occasionally, the soft correspondences are even desirable because they make the correspondences differentiable (cuturi2019differentiable); Monge correspondences, however, are essentially permutations, which are not differentiable. In summary, our VWB, which produces Monge maps, is suitable for clustering or partitioning problems that require binary, sparse correspondences, while Sinkhorn distance-based barycenters have been tested in numerous machine learning applications for producing robust interpolations.
There are several future directions: 1) In the current implementation, we use exhaustive search to find the nearest centroid for each empirical sample, which takes a substantial portion of our run time. A faster alternative for nearest neighbor search based on the power distance, which is not a Minkowski distance, would significantly reduce the run time of the VWB; 2) Whether VWBs or WBs for unbalanced measures still induce a generalized metric deserves an answer; 3) Whether our discussion still holds for other moments $p$ and other ground metrics $c$ deserves an answer; 4) Another branch of computing Monge OT is the multi-scale approach, e.g., (merigot2011multiscale; schmitzer2016sparse; gerber2017multiscale), which also partitions the target domain into sub-domains. Computing barycenters with multi-scale OT for clustering purposes is worth exploring.