## 1 Introduction

The Earth Mover’s Distance (EMD) is a metric that is widely used in fields such as image retrieval

[rubner2000], shape matching [grauman2004, memoli2009, su2015] and mesh reconstruction [degoes2011]. It models two sets and as distributions of mass, and takes their distance to be the minimum cost of transforming one distribution into the other, where cost is measured by the amount of mass moved multiplied by the distance over which it is moved. More formally,where is the set of all mappings of mass between and and is any metric. In the case where and are sets of (weighted) points, we can rewrite this as

For unweighted point sets, the solution can be obtained by solving an assignment problem; for weighted point sets, this is an instance of a minimum cost flow problem.

Recently, much attention has been devoted to computing the Earth Mover’s Distance when both and are sets of points [agarwal2017, fox2020, khesin2019, sharathkumar2012]. In this paper we expand on this by letting and be sets of points, line segments, triangles or -dimensional simplices in . We describe a unified framework for calculating the EMD between points and segments, points and triangles, points and simplices, segments and segments, triangles and triangles and simplices and simplices. Our approach provides polynomial-time algorithms that give a -approximation to the Earth Mover’s Distance between and , for some arbitrarily small . Moreover, our algorithms produce an assignment of mass that realises this cost. For triangles and simplices, the running time also depends on the largest edge length (note that we normalise the total area/volume of each set to one, so we cannot improve the running time by scaling the input). When neither set contains points, there is a small extra additive term in our approximation. For all our algorithms, our approach is to subdivide the elements of the input into sufficiently small pieces, and then approximate each piece by a point. The approximate optimal transport plan can then be obtained by solving a transport problem on these points. Our results are summarised in Table 1

. Note that all our algorithms give the solution with high probability; this is simply a consequence of using Fox and Lu’s algorithm

[fox2020] to solve the optimal transport problem on points. Substituting a deterministic algorithm here would make our results deterministic as well.To our knowledge, these are the first combinatorial algorithms with a provable approximation ratio for the Earth Mover’s Distance when the objects are continuous rather than discrete points. We give algorithms for moving mass from points to segments (Section 5), points to triangles (Section 6), points to simplices (Section 9.1), segments to segments (Section 7), triangles to triangles (Section 8) and simplices to simplices (Section 9.2).

Objects | Running time | Additive term |
---|---|---|

Points to segments | - | |

Points to triangles | - | |

Points to simplices | - | |

Segments to segments | ||

Triangles to triangles | ||

Simplices to simplices |

## 2 Related work

The general problem of optimally moving a distribution of mass was first described by Monge in 1781 [monge1781], and was reformulated by Kantorovich in 1942 [kantorovich1942]. It is known as the Earth Mover’s Distance due to the analogy of moving piles of dirt around; it is also known as the 1-Wasserstein distance, and is a special case of the more general optimal transport problem. For a full treatment of the problem’s history and connections to other areas of mathematics, the reader is referred to Villani’s book [villani2008].

The Earth Mover’s Distance has been studied in many geometric contexts. Agarwal et al. [agarwal2017] give both exact and approximation algorithms for the case where both sets are points under some metric. When both sets are weighted points, Khesin at al. [khesin2019] give two -approximation algorithms running in and time, where is the dimension, the aspect ratio of the input, and the total mass. However, their algorithm assumes that the point weights are integers, whereas our weights can be arbitrary real numbers, as they correspond to lengths and areas. This result was improved by Fox and Lu [fox2020], using a similar method to obtain, with high probability, a -approximation in time. The EMD was also studied when the input sets may be transformed: Cabello et al. [cabello2008] give a -approximation to minimising the EMD between two point sets under rigid transformations.

For continuous distributions, rather than discrete point sets, many numerical algorithms are known (see e.g. De Goes et al. [degoes2012], Lavenant et al. [lavenant2018], Mérigot [merigot2011, merigot2018] and Solomon et al. [solomon2015]). For the case where one set contains weighted points and the other is a bounded set , Geiß et al. [geiss2013] give a geometric proof that there exists an additively weighted Voronoi diagram such that transporting mass from each point to the part of contained in its Voronoi cell is optimal. The weights of this Voronoi diagram can be determined numerically.

De Goes et al. [degoes2011]

discuss a problem similar to our own, but in the context of the reconstruction and simplification of 2D shapes. Given a set of points, they want to reconstruct a simplicial complex of a given number of vertices that closely represents the shape of the point set. They start with computing the Delaunay triangulation of the point set, then iteratively collapse the edge that minimises the increase in the EMD between the point set and the triangulation. They use a variant of the EMD in which the cost is proportional to the square of the distance (2-Wasserstein distance). This allows them to calculate this variant of the EMD between a given set of points and a given edge of the triangulation exactly, as the squared distance can be decomposed into a normal and a tangential component. However, they determine the assignment of points to edges heuristically. In this work, we show how to obtain a

-approximation to the true optimal solution.## 3 Preliminaries

We are given a set of points in the plane with weights and a set of geometric objects , with lengths, areas or volumes . It is given that . We assume the mass associated with an object is distributed uniformly over the object, and that all objects have the same mass density. For convenience, and without loss of generality, we scale the input such that the total mass in either set is one. We want to compute a “transport plan” of mass from to that minimises the cost according to the Earth Mover’s Distance. We define for each pair a function , that describes the density of mass being moved from to the point . All these functions together describe the function used in the definition of . Such a set of functions needs to satisfy the following conditions to be a valid transport plan:

We can then define the cost of a given transport plan as

where is any metric. Our problem is to find a transport plan with minimal cost.

In the following section, we give an exact algorithm to calculate an optimal transport plan between a set of weighted points and line segments when is the metric. However, the approach we use does not seem to generalise to Euclidean distances, objects with areas, or even two sets of segments. This motivated us to look towards approximation algorithms for more general versions of the EMD problem. In the rest of this paper, we describe approximation algorithms and only consider the case where is the metric.

## 4 Points to segments under the metric

When is a set of line segments and distances are measured by the metric, we can solve the problem exactly by a convex quadratic program. We first subdivide all segments on the - and -coordinates of the points; call the set of subdivided segments . Note that the horizontal and vertical strip induced by each segment is now empty of points, while the axis-aligned quadrants starting at each corner of its bounding box may contain points; see Figure 1 for an illustration. Let be the set of segments in with slope between and , and let be .

We can now label the regions of points for each segment: let and be the regions to the left of , with being the region starting at the leftmost endpoint of , and being the other. Similarly, let be the region starting at the rightmost endpoint of , and let be the other region on the right. In case of a horizontal or vertical segment, and are simply merged into and , and it does not matter if is the top or bottom region.

For all points the distance to any point on the a segment is the same as the distance via one of the corners of the axis-aligned bounding box of . Therefore, for each region, we can separately consider the cost to reach the bounding box of with a certain amount of mass, and the cost to spread that mass out over the segment. Furthermore, the order in which a segment with a given slope receives mass from the different regions in an optimal solution is fixed depending on its slope.

Let be variables containing the amount of mass moved from to , let be the precomputed distance from to the bounding box of , let and be the width and height of the bounding box of , and let and be constants such that is the difference in -coordinate when moving a distance of along segment , and is the difference in -coordinate. Writing as for convenience, for a given segment we can write the cost for moving the mass from the corner of the region to the segment as follows:

Here we use the fact that under the distance, the cost of sending mass to some connected region of a segment is the same as the cost of sending everything to the midpoint of the connected region; see Figure 2 for an illustration of the calculations of the distances to these midpoints. Note that we omit the cost of sending the mass from the points to the corners of the bounding box; this will be accounted for later. Further note that is always positive here. Symmetrically, the costs for a given segment are as follows:

Note that is always positive here. We can now formalise our problem as follows:

subject to | |||||||

Since all , and are non-negative, this is a quadratic program with a convex objective function, and as such it can be solved in (weakly) polynomial time [kozlov1980].

###### Theorem 1.

Let be a set of weighted points and be a set of line segments with equal total weight. It is possible to construct an exact optimal transport plan between and under the metric in weakly polynomial time.

## 5 Points to segments

We now describe a polynomial-time algorithm that finds a transport plan with a cost that is at most times the cost of the optimal transport plan when one set contains points and the other contains line segments. The main idea is to reduce our instance to a transport problem on two weighted sets of points. Our strategy is as follows: we subdivide each segment such that for each subsegment the ratio of the distance to the closest and furthest point on for every is at most for some appropriate choice of . We then approximate a minimum cost flow problem on a bipartite graph between and the subdivided segments, where the cost of any edge is equal to the shortest distance between a point and a subsegment. Finally, we use the solution to this flow problem to build a discrete transport plan. For an appropriate choice of , this gives a -approximation.

The naive approach to subdividing the segments would be to make all the pieces some equal, appropriately small length. However, we can reduce the number of subsegments required by subdividing the segments as follows^{†}^{†}†This reduces the total number of subsegments required from to .. We repeatedly perform the following procedure for each subsegment. If there exists a point in such that the entire subsegment lies within distance of that point, do nothing. Otherwise, if there is a point in for which the ratio of the longest and shortest distance between that point and the current subsegment is more than , cut the subsegment in half. Call the resulting set of subsegments ; see Figure 3 for an example.

We now define a complete bipartite graph , with edges between each point-subsegment pair (note that this graph is used for analysis only; our algorithm does not construct it). The cost of each edge will simply be the shortest distance between the point and segment it connects. A solution to a flow problem in can be transformed into a transport plan by assigning a piece of subsegment to a point with length equal to the amount of flow along the corresponding edge. We will show that the EMD between and is approximated by the cost of any transport plan derived from a minimum cost flow in .

First note the following general lower bound on the cost of an optimal solution:

###### Lemma 2.

The Earth Mover’s Distance between and is bounded from below by the cost of a minimum cost flow in .

###### Proof.

Consider any transport plan that minimises the Earth Mover’s Distance. If instead of spreading the mass moved from each point equally over the whole segment, we move all the mass to the point on the segment closest to , we obtain a plan with cost . Such a plan is a solution to a flow problem in , as it moves all available mass. It follows that the cost of a minimum cost flow in satisfies . ∎

We also note the following lower bound on the value of :

###### Lemma 3.

.

###### Proof.

For a given point-segment pair , consider the segments in derived from that have a point within distance of . By construction, such a segment has its furthest point at distance at most . Therefore, the total length of these segments is at most for a given and . Over all point-segment pairs, this gives a total length of at most . This means the total length of segments in with distance at least is at least . The cost of a minimum flow in is therefore at least . ∎

We calculate a transport plan between and as follows. First, we approximate each segment by a point somewhere on that segment with weight equal to the length of ; call this set of points . We obtain by calculating an optimal transport plan between and , and then spreading the mass sent to each point evenly over the segment in that point was derived from. We now bound the cost of in terms of :

###### Lemma 4.

.

###### Proof.

We first bound the cost of . In , we measured all the distances to the closest point on each subsegment. Imagine that we picked all the points in to be the furthest point on the subsegment. For the subsegments with furthest distance at least , the ratio of these distances is at most by construction. We can therefore bound all the parts of where the furthest distance is at least by . The total mass being moved over distance at most in is at most , giving a cost of at most . The total cost of is therefore at most .

Now consider the extra cost incurred when transforming into by spreading the mass out evenly over all the segments. We can use the same argument as before: for the parts of with distance at least , the cost increases by a factor of at most , and the total cost of the part within distance is at most . We can therefore bound the cost of by .

We now obtain the upper bound stated in the lemma by plugging the bound on into the bound on . The lower bound follows directly from the fact that none of the distances in are smaller than the distances between the same objects in . ∎

We now show that approximates .

###### Theorem 5.

is a -approximation to the Earth Mover’s Distance between and for .

###### Proof.

By Lemma 4, we know that

is also a lower bound on ; the ratio between the upper and lower bound is

This ratio is the largest for small values of , so we plug in the lower bound from Lemma 3:

(assuming ) |

As is also a lower bound for (Lemma 2), and can obviously not have lower cost than the optimal transport plan, this gives a -approximation. ∎

Setting gives a -approximation. Note that the bound on is not restrictive: for any constant that would require a larger value of , we can simply use the value at the cost of a constant factor in the running time of our algorithm.

### 5.1 Running time analysis

We now analyse the number of subsegments in . We will count the number of subsegments in a different subdivision of , and then show that has at most a constant factor more subsegments. The alternative subdivision of each will be as follows: project each onto the supporting line of , call this point . We construct the one-dimensional Voronoi diagram of all along the supporting line of ; let be the part of inside the Voronoi cell of . From each , we subdivide into both directions. Up to a distance of , we make subsegments of size . Moving outward, we double the size of the subsegments whenever their ratio of distances to would still be below . Let be the resulting subdivision; see Figure 4 for an example.

###### Lemma 6.

has subsegments.

###### Proof.

We define and . In the following, we only analyse the case where is on ; if it lies outside, the number of subsegments will be smaller, as the size of the subsegments increases with distance. The length covered as we add subsegments on can be written as

where is the number of times we double the size of the subsegments, and is the number of subsegments with a size that has been doubled times. The number of subsegments can then be calculated by finding the values of and . We start with , which can be found by considering the distance at which the next cell could be double the size:

We want to take the values of as small as possible, so we take . Next, we can show by induction that all are equal:

Knowing that all are equal to , we can determine the value of :

This gives a total number of subsegments of for each point-segment pair. The sum over all pairs is largest when all are equal, i.e. . This gives us a total number of subsegments for all pairs of .. ∎

###### Lemma 7.

has subsegments.

###### Proof.

Consider any subsegment . Any subsegment that overlaps with has : otherwise was subdivided unnecessarily. As the subsegments in are disjoint, it follows that can overlap with at most 5 subsegments in . As such, contains at most 5 times more subsegments than , which, by Lemma 6, is . ∎

Putting everything together, we obtain the following result:

###### Theorem 8.

Let be a set of weighted points and be a set of line segments with equal total weight, let be the cost of an optimal transport plan between them, and let be any constant . Given an algorithm that constructs a -approximation between weighted sets of points in time, we can construct a transport plan between and with cost in time.

###### Proof.

In Theorem 5, we prove that an optimal transport plan between and approximates . However, we may be able to compute a -approximation to faster than we are able to compute it exactly. It remains to be shown that this approximation also suffices.

Plugging in a -approximation to , rather than the exact value, we obtain the ratio

Following the same strategy as in the proof of Theorem 5, we derive the approximation ratio as follows:

(assuming ) |

As such, using an approximation of still gives us an approximation of , albeit with a somewhat worse dependency on . ∎

To our knowledge, the current fastest algorithm to calculate a -approximation to is that by Fox and Lu [fox2020], which runs in time. Setting , this gives the following corollary to the previous theorem:

###### Corollary 9.

For any constant , a transport plan between and with cost can be constructed in time with high probability.

## 6 Points to triangles

We consider the case where is a set of weighted points with total weight one and is a set of triangles with total area one. We denote the longest edge of any triangle by . Our strategy is similar to before: we subdivide the triangles such that for each subregion, the ratio between its shortest and longest distance to each point is at most for some appropriate choice of . We then show that a solution based on an optimal transport plan between and some points inside the subregions approximates an optimal solution.

We first overlay a uniform grid onto our triangles with grid cells of size . We can identify the cells of this grid that contain a triangle in time using point-location in a compressed quadtree where the smallest cell size is [vanderhoog2018]. As each triangle can intersect at most four cells, the total size of this set of cells is . We now recursively subdivide each cell as follows: if there is a point in such that the whole cell is within distance of it, we stop; otherwise, if for any point the ratio of distances to the furthest and closest point in this cell is more than , we subdivide this cell into four cells of one quarter the area. If the ratio holds for all points, we stop. Call the resulting set of cells ; see Figure 5 for an example.

During each subdivision, we keep track of the total area of triangles contained inside that cell. We can then once again build a complete bipartite graph , with the capacity of each vertex set to the weight of the corresponding point or the total area of triangles contained in the corresponding cell, and the weight of each edge equal to the shortest distance between the point and the cell it connects. The cost of a minimum cost flow is now once again a lower bound to the EMD, exactly as in Lemma 2. In an analogous way to Lemma 3, we obtain a lower bound on the cost of :

###### Lemma 10.

.

###### Proof.

For a given point-triangle pair , consider the cells in intersecting that have a point within distance of . By construction, such a cell has its furthest point at distance at most . Therefore, the total area of these cells is at most . Over all point-triangle pairs, this gives a total area of at most . This leaves with distance at least in . The cost is therefore at least . ∎

We now once again approximate by reducing the flow problem to a transportation problem between two sets of weighted points. Again, we pick any point in each cell and give it a weight equal to the area of triangles contained in ; call this set of points .

###### Lemma 11.

###### Proof.

Let be an optimal transport plan between and , and let be its cost. We can upper bound by measuring all distances to the furthest point in each cell. We constructed such that the ratio of the closest and furthest distance between any point-cell pair is when the furthest distance is at least . We can therefore bound all parts of where the distance is at least by . The total mass being moved over a distance at most in is at most , giving a cost of . The total cost when measuring to the furthest point is therefore .

We now turn into a transport plan between and by spreading the mass sent to each point out evenly over the parts of the triangles in the cell in that was derived from. By construction, for cells with a distance of at least , this increases the cost by at most a factor . We can therefore bound the cost of this part of by . The remaining part has a total mass of at most , giving a cost of . The total cost of is then bound by .

Plugging in the bound on obtained above, we obtain an upper bound of . The lower bound follows directly from the fact that none of the distance in are smaller than the distances between the same objects in . ∎

Putting this all together, we can show that approximates .

###### Theorem 12.

is a -approximation to the Earth Mover’s Distance between and for .

###### Proof.

By Lemma 11 have that

is also a lower bound on ; the ratio between the upper and lower bound is

This ratio is largest for small values of , so we plug in the lower bound from Lemma 10:

(assuming ) | ||||

As is also a lower bound for (Lemma 2), and can obviously not have lower cost than the optimal transport plan, this gives a -approximation. ∎

Setting gives a -approximation.

### 6.1 Running time analysis

Our analysis will be the same as in Section 5.1; we just need to determine the size of . We will once again make an alternative subdivision of each , count the number of cells in that subdivision, and then argue that differs by at most a constant factor. Our alternative subdivision is a direct adaptation of the one used in Section 5.1 to two dimensions: for each point and triangle , we fill a square with side length centred on with cells of size . From there, we add rings of cells of side length around the square, until the next full ring could have cells double the size without violating the ratio of between the shortest and longest distance to for any cell in the ring. We repeat this process until we have covered a square of size . Let be the resulting set of cells; see Figure 6 for an example. The proof is similar to Lemma 6.

###### Lemma 13.

has cells.

###### Proof.

We define and . In the following, we only analyse the case where is inside ; if it lies outside, the number of cells will be smaller, as the size of the cells increases with distance. We also analyse the number of cells in one quadrant only; the total number is simply four times as many. See Figure 6 for an illustration of a quadrant. The number of cells created as we add rings of cells on can then be written as

where is the number of times we double the size of the cells, and is the number of rings containing cells of a size that has been doubled times. The number of cells can then be calculated by finding the values of and . We take the values of to be the same as in Lemma 6 (i.e. ): along a horizontal or vertical line through these values are exact, and cells not on this line can be made to have the correct ratio through one extra subdivision.

Let

be the vector from

to the closest point on the cell. Assume w.l.o.g. that ; the other cases are symmetrical. By construction of our subdivision, we know that the cell has size at most . We will now show that by dividing the cell on extra time (i.e. to a size of ), the furthest point will have the desired ratio irrespective of the value of . See Figure 7 for an illustration.As , and the other terms on the right-hand side are positive, the inequality holds. As such, the construction described can be turned into one where all cells have the desired ratio with one extra subdivision.

Plugging the values of , , into our initial formula, we can obtain the number of cells as a function of :

We can directly calculate the value of by considering the number of doublings needed to cover a horizontal line segment of length starting at :

This gives a total number of cells of per point-triangle pair. Over all pairs, we obtain a total number of cells of . ∎

###### Lemma 14.

has cells.

###### Proof.

This leads tot the following result:

###### Theorem 15.

Let be a set of weighted points and be a set of triangles with equal total weight, let be the longest edge length in after normalising its total area to one, let be the cost of an optimal transport plan between and , and let be any constant . Given an algorithm that constructs a -approximation between weighted sets of points in time, we can construct a transport plan between and with cost can be constructed in time.

We can again calculate a -approximation to in time using the algorithm by Fox and Lu [fox2020], giving the following corollary to the previous theorem:

###### Corollary 16.

For any constant , a transport plan between and with cost can be constructed in time with high probability.

## 7 Segments to segments

In the previous section, we considered the case when one of our two input sets contains points. We now describe an algorithm to compute the EMD between two sets of line segments. Here, we cannot directly apply our general approach of subdividing: the optimal transport plan may have a cost arbitrarily close to zero. As such, if we disregard everything within some radius of one of the sets, there may be nothing left. We solve this by introducing an additive term into the approximation. The cost of a plan generated by our algorithm is , for some value depending on . This allows us to greedily match parts of the input within a small distance of each other, and then solve the remainder with our previous approach.

Let and be sets of line segments with equal total length. Our algorithm is then as follows. First, we greedily match equal-length pieces of and that are within distance of each other, until no such pieces remain; we describe this process in more detail later. Let and be the remaining parts of and , respectively. We subdivide and as before: for every , if there is an such that the ratio between the closest and furthest distance is more than , cut in half; after processing , do the same for . Call the resulting sets and . We then choose a point on each and , with a weight equal to the length of the subsegment, and solve an optimal transport problem between these two point sets. Our final transport plan is then obtained by spreading the mass moved between any two points evenly over the segments they were chosen on.

We first prove that greedily matching parts of the input within distance increases the cost of an optimal solution by at most an additive term. The proof for the approximation algorithm then follows the same structure as in the previous sections. Let be a transport plan between the parts of the input that were greedily matched, in which the longest distance is at most , and let be an optimal transport plan for the remainder of the input.

###### Lemma 17.

Let and be two subsegments with length of and , respectively. If all mass from can be transported to with distance at most , then a transport plan between and has cost at most .

###### Proof.

We will construct a transport plan in which and are removed, having cost at most . The cost of an optimal solution on the remainder is then not higher.

Let be a function describing, for a point , where its mass comes from or goes to (recall that each point sends or receives mass density one), let be defined as , and let be a mapping of points on to points on such that for all , . The cost of the part of involving segments and can then be written as

We modify by removing and , and moving all mass that each point receives in to where moved it in ; see Figure 8. We can distribute this mass in any way we like, as the total incoming and outgoing mass is one by definition. This gives a transport plan with cost

By the triangle inequality, . It follows that

∎ |

We can now bound the costs of and .

###### Lemma 18.

.

###### Proof.

Any subsegments of and with length that are greedily matched increase the cost of an optimal solution in the remaining part by at most (Lemma 17). The total length that can be greedily matched is at most one, so the total extra cost is at most . ∎

###### Lemma 19.

.

###### Proof.

By construction, the distance over which any mass is transported in is at most . The total mass transported is at most one, giving the bound. ∎

For each segment , we can straightforwardly compute a maximal subset that can be transported over distance at most . Consider each segment : the supporting lines of and can intersect inside or , outside both, or not at all. If they don’t intersect (i.e. are parallel), computation of the parts that can be transported within the required distance is trivial. If they intersect outside both, we can find the points on and furthest from the intersection point that are within the required distance, then find the largest distance we can move towards the intersection point while staying within the required distance. If they intersect inside one or both of the segments, we split the segments at the intersection point and handle both sides using the case for intersections outside the segments.

Let and be the parts of and that remain after the greedy matching, with . We subdivide and into and as described above. As before, we define a complete bipartite graph , where the weight of each edge is equal to the shortest distance between the two subsegments it connects, and the capacity of each vertex is equal to the length of the subsegment it represents. Let be a minimum cost flow in ; we observe the following lower bound on its cost:

###### Lemma 20.

.

###### Proof.

By construction, the distances in are at least . As the total mass moved is , we obtain the bound stated in the lemma. ∎

Lemma 2 also still applies to the part of the input that remains after greedy matching. We now approximate

Comments

There are no comments yet.