Log In Sign Up

On a linearization of quadratic Wasserstein distance

by   Philip Greengard, et al.

This paper studies the problem of computing a linear approximation of quadratic Wasserstein distance W_2. In particular, we compute an approximation of the negative homogeneous weighted Sobolev norm whose connection to Wasserstein distance follows from a classic linearization of a general Monge-Ampére equation. Our contribution is threefold. First, we provide expository material on this classic linearization of Wasserstein distance including a quantitative error estimate. econd, we reduce the computational problem to solving a elliptic boundary value problem involving the Witten Laplacian, which is a Schrödinger operator of the form H = -Δ + V, and describe an associated embedding. Third, for the case of probability distributions on the unit square [0,1]^2 represented by n × n arrays we present a fast code demonstrating our approach. Several numerical examples are presented.


Two-sample Test using Projected Wasserstein Distance: Breaking the Curse of Dimensionality

We develop a projected Wasserstein distance for the two-sample test, a f...

A continuation multiple shooting method for Wasserstein geodesic equation

In this paper, we propose a numerical method to solve the classic L^2-op...

Wasserstein Distance Measure Machines

This paper presents a distance-based discriminative framework for learni...

Optimal Transportation for Electrical Impedance Tomography

This work establishes a framework for solving inverse boundary problems ...

Hybrid Wasserstein Distance and Fast Distribution Clustering

We define a modified Wasserstein distance for distribution clustering wh...

The Equivalence of Fourier-based and Wasserstein Metrics on Imaging Problems

We investigate properties of some extensions of a class of Fourier-based...

1. Introduction

1.1. Introduction

Let and be probability measures supported on a bounded convex set . The quadratic Wasserstein distance is defined by


where the infimum is taken over all transference plans (probability measures on such that and for all measurable sets ). Computing the quadratic Wasserstein distance is a nonlinear problem. In this paper, we consider the case where and have smooth positive densities and with respect to Lebesgue measure: and . In this case, there is a classic local linearization of based on a weighted negative homogeneous Sobolev norm, which is derived from linearizing a general Monge–Ampére equation, see §2. In particular, the weighted negative homogeneous Sobolev norm is defined for functions such that by




That is, the space is the dual space of . Under fairly general conditions, if denotes a perturbation of , then (informally speaking) we have

see for example [Theorem 7.2.6 [29]] for a precise statement. Under stronger assumptions, if , then the error term can be shown to be , see §3.5 for details. In this paper, we study how to leverage the connection between the -norm and the metric for computational purposes. In particular, our computational approach is based on a connection between the -norm and the Witten Laplacian, which is a Schrödinger operator of the form

where is a potential that depends on , see Figure 1. We show that this connection provides a method of computation whose computational cost can be controlled by the amount of regularization used when defining the potential, see §4 for details.

Figure 1. Function (left) and its regularized potential (right), see §5.4 for details about this example.

For the case of probability distributions on the unit square represented by arrays, we present a code for computing this linearization of based on the Witten Laplacian, and present a number of numerical examples, see §5. This Witten Laplacian perspective leads to several potential applications; in particular, a method of defining an embedding discussed and methods of smoothing discussed in §6.

1.2. Background

The quadratic Wasserstein distance is an instance of Monge –Kantorovich optimal transport whose study was initiated by Monge [18] in 1781 and generalized by Kantorovich [12] in 1942. The theory of optimal transport has been developed by many authors; for a summary see the book by Villani [29]

. Recently, due to new applications in data science and machine learning, developing methods to compute and approximate optimal transport distances has become an important area of research in applied mathematics, see the surveys by Peyré and Cuturi

[22] and Santambrogio [25].

In this paper, we focus on the connection between the metric and the -norm, which can be used to approximate , see §2.3 and §3.5. The connection between and follows from the work of Brenier [3] in 1987 who discovered that under appropriate conditions the solution to Monge–Kantorovich optimal transport with a quadratic cost can be expressed using a map that pushes forward one measure to the other; moreover, this map is the gradient of a convex function, see §2.1. Brenier’s theorem reduces the problem of computing Wasserstein distance to solving a generalized Monge–Ampére equation. Linearizing this general Monge–Ampére equation gives rise to an elliptic equation, which corresponds to a weighted negative homogeneous Sobolev norm, see §2.3 below or see [§4.1.2, §7.6 of [29]]. The connection between Wasserstein distance and negative homogeneous Sobolev norms has been considered by several different authors as is discussed in the following section.

1.3. Related work

Several authors have considered the connection between negative homogeneous Sobolev norms and the metric in several different contexts. In analysis, this connection has been used to establish estimates by many authors, see for example [13, 15, 23, 26, 27]. Moreover, this connection also arises in the study of how measures change under heat diffusion, see [5, 21, 30]. The weighted negative homogeneous Sobolev norm has also been considered in connection to maximum mean discrepancy (MMD) which is a technique that can be used to compute a distance between point clouds, and has many applications in machine learning, see [1, 19, 20]. Authors have also considered this connection in papers focused on computing the metric, see [4, 9, 14]. Moreover, the negative homogeneous Sobolev norm has been considered in several applications to seismic image and image processing, see [8, 10, 31].

We emphasize three related works. First, Peyre [23] establishes estimates for the metric in terms of the unweighted negative homogeneous Sobolev norm . Let and be probability measures on with densities and with respect to Lebesgue measure. The result of Peyre says that if , then

a discussion and concise proof of this result can be found in [§5.5.2 of [25]]. Informally, this result says that if and have their mass spread out over , then the Wasserstein distance is equivalent to the negative unweighted homogeneous Sobolev norm of ; as a consequence, the weighted Sobolev norm is most interesting for probability distributions that have regions of high and low density.

Second, Engquist, Ren, and Yang [9] study the application of Wasserstein distance to inverse data matching. In particular, they compare the effectiveness of several different Sobolev norms for their applications; we note that the definitions of the norms they consider differ from the norm (2) that we consider, see the discussion in §3.6 below, but their results do indicate that negative homogeneous Sobolev norms may not be an appropriate substitute for the metric for some applications.

Third, Yang, Hu, and Lou [31] consider the implicit regularization effects of Sobolev norms in image processing. In §6 we mention a similar potential applications to image smoothing; our perspective is slightly different, but the underlying idea of this potential application is the same as [31].

2. Preliminaries and motivation

In this section, we briefly summarize material from Chapters 0.1, 0.2, 2.1, 2.3, 4.1, 4.2, and 7.6 of the book by Villani [29]. In order to make these preliminaries as concise as possible we state all definitions and theorems for our special case of interest; in particular, we assume that all probability measures have smooth densities with respect to Lebesgue measure, and restrict our attention to transport with respect to a quadratic cost function. This section is organized as follows: we consider the Monge-Kantorovich transport problem and Brenier’s theorem in §2.1, the Monge-Ampére equation in §2.2, and then discuss linearization of the metric in §2.3. The main purpose of these preliminaries is to provide background for (13), stated at the end of §2.3, which clarifies the statement that the -norm is a linearization of the metric.

2.1. Monge-Kantorovich optimal transport and Brenier’s theorem

Let be a bounded convex set, and and be probability measures on that have positive smooth densities and , respectively, with respect to the Lebesgue measure. A transference plan is a probability measure on such that

for all measurable subsets of . We denote the set of all transference plans by , and define the transportation cost with respect to the quadratic cost function by

In this case, the Monge-Kantorovich optimal transport cost is defined by


Since we are considering a quadratic transportation cost, the Monge-Kantorovich optimal transport cost is the square of the metric:


Under the above conditions, Brenier’s theorem states that the Monge-Kantorovich optimization problem (4) has a unique solution satisfying


where is a convex function such that . Here denotes the push-forward, and denotes a Dirac distribution. To be clear, given , the push-forward is defined by the relation for all measurable sets , or equivalently, by the relation

for all continuous functions on . Combining (4), (5), and (6) gives


2.2. Monge-Ampére equation

Recall that by assumption and have densities and , respectively, with respect to the Lebesgue measure. We can express (6) as

for all bounded continuous functions on . Changing variables on the right hand side and using the fact that is arbitrary gives


Upon rearranging this equation as

it is clear that is an instance of the general Monge-Ampére equation

which has been studied by many authors, see for example [16, 17, 28].

2.3. Linearization of general Monge-Ampére equation

To linearize this general Monge-Ampére equation we assume that is positive, is close to the identity, and . More precisely, assume




for some . Substituting (9) into (7) gives


which expresses in terms of . Substituting (9) and (10) into (8) gives

where are remainder functions depending on , . It follows from rearranging terms that



and denotes some remainder function depending on , , and . Thus, if satisfies , then by (11) and (12) we expect that


which, roughly speaking, says that is a linearization of the quadratic Wasserstein optimal transport problem, see §3.5 for a more precise version of (13).

3. Characterizing the operator

So far we have presented background material that motivates why the quadratic Wasserstein distance is related to the operator defined by

In this section, we discuss the connection between the operator , the negative weighted homogeneous Sobolev norm , and the quadratic Wasserstein distance in detail. The section is organized as follows: We start, in §3.1, by stating and proving a version of Green’s first identity that satisfies. Second, in §3.2 we state a result connecting the -norm to the solution of an elliptic boundary value problem involving . Third, in §3.3, we give a characterization of the -norm in terms of a divergence optimization problem. Fourth, in §3.4 we consider a characterization of the -norm involving the Witten Laplacian, which can be derived from by a change of variables. Fifth, in §3.5 we provide a more precise version of the statement that the -norm is a linearization of the metric. Finally, in §3.6

, we discuss weighted Sobolev norms in relation to the Fourier transform.

3.1. Green’s first identity analog for

Let be a bounded convex domain in , be a once differentiable function, and be a twice differentiable function. Green’s first identity states that


where , and is an exterior unit normal to the surface element .

Proposition 3.1.

Suppose that is a measure on which has a density with respect to Lebesgue measure: . Further assume that is once differentiable, on , and . Then,


whenever on or on .


By the definition of and the fact that we have

Since we assumed or vanishes on it follows from (14) that

Finally, observe that expanding with the product rule and canceling terms gives:

which establishes (15). ∎

3.2. Elliptic boundary value problem

Let be a bounded convex domain in , and suppose that is a measure on which has a density with respect to Lebesgue measure: . Assume that is once differentiable and on . For functions such that we define the -norm by


The following result characterizes the -norm in terms of an elliptic boundary value problem involving the operator .

Proposition 3.2.

We have


where is a solution to the elliptic boundary value problem

Proof of Proposition 3.2.

The second and third equalities in (16) are a direct consequence of the definition of and Proposition 3.1, so we only need to show that . Assume that is a solution to (17) substituting . It follows from (15) that


Using the Cauchy-Schwarz inequality gives

which implies that

On the other hand, from (18) we have

so we conclude that as was to be shown. ∎

3.3. Divergence formulation


-norm can also be formulated as an optimization problem over vector fields satisfying a divergence condition. We note that our computational approach does not directly use this divergence formulation, and that our purpose of stating the following result is for completeness and its connection to other methods.

Proposition 3.3.

We have

where the minimum is taken over vector fields with continuous first-order partial derivatives that satisfy


Note that it will become clear from the proof that it would suffice to assume that (19) holds in a weak sense.


First, observe that if is a solution to the elliptic boundary value problem (17), then is admissible to the minimization since

and . And from Proposition 3.2 we have

To complete the proof it suffices to show that

If we define , then satisfies

Expanding gives

To complete the proof we will show that . Observe that

The first integral on the right hand side is zero since in . The second integral on the right hand side is zero since by the Gauss divergence theorem

and by assumption. This completes the proof. ∎

3.4. Witten Laplacian formulation

The -norm can also be defined in terms of a boundary value problem involving the Witten Laplacian , which is a Schrödinger operator of the form


where is a potential depending on defined by

Alternatively, the Witten Laplacian can be defined by the similarity transform


which symmetrizes in the sense that the resulting operator is self-adjoint with respect to ; for a discussion of the Witten Laplacian and some spectral estimates see [6]. In the following Proposition, we give an elliptic equation involving that can be used to compute the -norm; this proposition is an immediate consequence of the fact that the Schrödinger operator definition (20) is consistent with the similarity transform definition (21).

Proposition 3.4.

Using the notation and we have

where is the solution to the elliptic boundary value problem:


This formulation is an immediate consequence of the identity

which is straightforward to verify: expanding gives

and after canceling terms we have

From the above calculation, it is also clear that

since all terms involving cancel. ∎

Remark 3.1 (Alternate form of the potential).

Using the fact that

we can write as

where .

3.5. Linearization remainder estimate

Recall that previously in §2.3 we gave the informal estimate

where satisfies

The purpose of this section, is to make this informal statement more precise; we emphasize that the result proved in this section is for illustrative purposes: results involving weaker assumptions and weaker regularity conditions are possible.

Let and be probability measures with densities and with respect to the Lebesgue measure: and . Assume that

where is a smooth function satisfying on . Further, assume that

where is a smooth function. Assume that satisfies the nonlinear equation


Let be the solution to the elliptic boundary value problem

We have the following result.

Proposition 3.5.

Under the assumptions of §3.5 we have

where can be chosen in terms of almost everywhere upper bounds on:

where denotes the magnitude of the gradient of , denotes the operator norm of the Hessian of , denotes the measure of , and

denotes the smallest positive eigenvalue of


We demonstrate this result numerically in §5.3.

Proof of Proposition 3.5.

By the Lagrange remainder formulation of Taylor’s Theorem, and (22) we have


where the remainder functions can be expressed by

where are points on the line segment between and . It follows that


where the remainder function consists of all terms in the expansion of the right hand side of (23) that include to power at least . By the definition of and we can rewrite (24) as

Multiplying both sides of this equation by and integrating over gives

Using (15) to rewrite the left hand side gives

By (9) and (16) it follows that


We can bound the norm of by the norm of and the inverse of the smallest positive eigenvalue of ; therefore, we can complete the proof by using Cauchy-Schwarz and almost everywhere bounds on all other quantities. ∎

Remark 3.2.

We note that it is possible to obtain various estimates on from (25) in terms of norms of the quantities and instead of almost everywhere bounds. Moreover, results that guarantee bounds on the solution of general Monge-Ampére equations could be used to provide bounds for and in terms of and , for example see [17].

3.6. Fourier transform and weighted Sobolev norms

In this section, we discuss a family of weighted Sobolev norms defined by Engquist, Ren, and Yang [9] using the Fourier transform

In [9] the authors compare the metric to a family of weighted Sobolev norms defined in Fourier domain for the purpose of inverse data matching; in particular, in [Eq. 8 and Remark 2.1 of [9]] they define


where and denote the Fourier transform of and , respectively, and denotes convolution

Here we refer to the norm defined in (26) as the -norm to avoid confusion with the -norm, which we defined in (3) by

and the -norm defined in (2) by

also see [§7.6 of [29]]. First, observe that if and is the constant function, then is the Dirac delta distribution and

where the final inequality follows from the Plancherel theorem. However, if and is arbitrary, then in general

where denotes the inverse Fourier transform of , and is assumed to have density with respect to Lebesgue measure. It does follow from the Plancherel theorem that

However, in general, the functions and are not equal, and thus in general their integrals against are not equal; in particular, their integrals against can be very different when is localized in space. Roughly speaking, the issue is that taking the absolute value of does not commute with taking the convolution. It is possible to define the -norm in Fourier domain by defining ; however, this does not seem to lead to a viable way to compute the dual -norm except in dimension ; we note that quadratic Wasserstein distance also has a simple characterization in -dimension, see Remark 3.3.

Remark 3.3 ( in -dimension).

Let and be measures on with densities and with respect to Lebesgue measure: and . Let

denote the cumulative distribution functions:

If and are the pseudo-inverse of and defined by


see for example [Remark 2.30 of [22]].

4. Computation and regularization

In this section, we consider the connection between the -norm and the Witten Laplacian, see §3.4, from a computational point of view. Recall that the Witten Laplacian is a Schrödinger operator of the form

where is a potential. Roughly speaking, the advantage of considering this formulation is that all the complexity of the problem has been distilled into the potential , which can be regularized to manage the computational cost. This section is organized as follows. First, in §4.1 we consider a spectral decomposition of

by its Neumann eigenfunctions and define the fractional Laplacian

. Second, in §4.2, we change variables using fractional Laplacians to precondition our elliptic equation involving . Third, in §4.3, we observe how using the heat equation to define a smoothed version of can control the condition number of our problem. Finally, in §4.4 we discuss the computational cost of the described method.

4.1. Spectral decomposition of the Laplacian

Suppose that is a bounded convex domain. Recall that is a Neumann eigenvalue of the Laplacian on if there is a corresponding eigenfunction such that

where denotes the boundary of , and denotes an exterior unit normal to the boundary. The Neumann eigenvalues of the Laplacian are nonnegative real numbers that satisfy

and the corresponding eigenfunctions form an orthogonal basis of square integrable functions on . We can use this basis of Neumann eigenfunctions to define the fractional Laplacian for and by

where we include the constant term , independent of , so that the operator is well-defined and invertible for both positive and negative . With this definition, the operator is invertible, which will become relevant in the following section; in particular, we will use the invertible operator to precondition the elliptic equation .

4.2. Preconditioning the elliptic equation

Recall that our goal is to solve the elliptic equation


with Neumann boundary conditions; note that to simplify notation we dispense with the tilde notation and from §3.4 and just write and . In order to precondition (27) we define and by

It follows that



where denotes the projection onto the space of constant functions,

and denotes the identity operator. We remark that the projection is necessary in the definition of since we have defined to preserve constant functions, while the Laplacian destroys constant functions.

Since is invertible the dimension of the null space of is the same as the dimension of the null space of , which is -dimensional. In particular, we have

Let and denote the smallest positive eigenvalue of and , respectively. If

is a normalized eigenvector associated with

, then it follows from the Courant-Fisher Theorem that

where . In the following, we treat as a fixed constant, which empirically we find is the case. Under this assumption, the condition number of on the space of functions orthogonal to is bounded by the operator norm of , which satisfies

If is an arbitrary smooth positive function, then could still take very large values, which could make our problem ill-conditioned. In the following section, we introduce a definition of that includes smoothing which can be used to control its -norm.

4.3. Smoothing when defining the potential

Recall that the potential in the definition of can be defined by

The basic idea is to run the heat equation on and use the resulting smoothed function to define the potential . Given a function , we define a -parameter family of norms parameterized by as follows. Let and denote the Neumann Laplacian eigenvalues and eigenfunctions, see §4.1. We define the Neumann heat kernel for a function by

Next, we define the smoothed potential by


and define the corresponding operator by

By §3.4, the operator defines a -norm, where is the measure with density . Observe that if then , while when then . In particular, we have