This paper is concerned with inverse problems where we intend to match certain given data to data predicted by a (discrete or continuous) mathematical model, often called the forward model. To set up the problem, we denote by a function () the input of the mathematical model that we are interested in reconstructing from a given datum . We denote by the forward operator that maps the unknown quantity to the datum , that is
where the operator is assumed to be nonlinear in general. We denote by the linearization of the operator at the background . With a bit of abuse of notation, we write to denote a linear inverse problem of the form (1) where . The space of functions where we take our unknown object , denoted by , and datum , denoted by , as well as the exact form of the forward operator , will be given later when we study specific problems.
Inverse problems for (1) are mostly solved computationally due to the lack of analytic inversion formulas. Numerical methods often reformulate the problem as a data matching problem where one takes the solution as the function that minimizes the data discrepancy, measured in a metric , between the model prediction and the measured datum . That is,
The most popular metric used in the past to measure the prediction-data discrepancy is the metric due to its mathematical and computational simplicity. Moreover, it is often the case that a regularization term is added to the mismatch functional to impose extra prior knowledge on the unknown (besides of the fact that it is an element of ) to be reconstructed.
In recent years, the quadratic Wasserstein metric [1, 27, 32] is proposed as an alternative for the metric in solving such inverse data matching problems [6, 7, 13, 18, 20, 19, 22, 23, 29, 34]. Numerical experiments suggest that the quadratic Wasserstein metric has attractive properties for some inverse data matching problems that the classical metric does not have [14, 35]. The objective of this work is trying to understand mathematically these numerical observations reported in the literature. More precisely, we attempt to characterize the numerical inversion of (1) based on the quadratic Wasserstein metric and compare that with the inversion based on the classical metric.
In the rest of the paper, we first briefly review some background materials on the quadratic Wasserstein metric and its connection to inverse data match problems in Section 2. We then study in Section 3 the Fourier domain behavior of the solutions to (1) in the asymptotic regime of the Wasserstein metric: the regime where the model prediction and the datum are sufficiently close. We show that in the asymptotic regime, the Wasserstein inverse solution tends to be smoother than the based inverse solution. We then show in Section 4 that this smoothing effect of the Wasserstein metric also exists in the non-asymptotic regime, but in a less explicit way. In Section 5, we demonstrate, in perhaps overly simplified settings, some advantages of the Wasserstein metric over the metric in solving some important inverse matching problems: inverse transportation, back-projection from (possibly partial) data, and deconvolution of highly concentrated sources. Numerical simulations are shown in Section 6 to demonstrate the main findings of our study. Concluding remarks are offered in Section 7.
2 Background and problem setup
be two probability densities onthat have the same total mass. The square of the quadratic Wasserstein distance between and , denoted as , is defined as
where is the set of measure-preserving maps from to . The map that achieves the infimum is called the optimal transport map between and . In the context of (1), the probability density depends on the unknown function . Therefore, can be viewed as a mismatch functional of for solving the inverse problem. Since the quadratic Wasserstein distance is only defined between probability measures of the same total mass, one has to normalize and and turn them into probability densities when applying them to solve inverse matching problems where and
cannot be interpreted as nonnegative probability density functions. This introduces new issues that need to be addressed.
It is well-known by now that the quadratic Wasserstein distance between and is connected to a weighted distance between them; see [32, Section 7.6] and [21, 25]. For any , let be the space of functions
denotes the Fourier transform ofand . When , is the usual Hilbert space of functions with square integrable derivatives, and . The space with is understood as the dual of . We also introduce the space , , with the (semi-) norm defined through the relation
The space is defined as the dual of via the norm
It was shown [32, Section 7.6] that asymptotically is equivalent to , where the subscript indicates that the space is defined with respect to the reference probability measure . To be precise, if is the probability measure and is an infinitesimal perturbation that has zero total mass, then
This fact allows one to show that, for two positive probability measures and with densities and that are sufficiently regular, we have the following non-asymptotic equivalence between and :
for some constants and . The second inequality is generic with [25, Theorem 1] while the first inequality, proved in [21, Proposition 2.8] and [25, Theorem 5] independently, requires further that and be bounded from above.
In the rest of this paper, we study numerical solutions to the inverse data matching problem for (1) under three different mismatch functionals:
where , denotes the convolution operation, and
Our main goal is to analyze the differences between the Fourier contents of the inverse matching results, a main motivation for us to choose the Fourier domain definition of the norms. These norms allow us to systematically study: (i) the differences between the (i.e. the special case of of ) and the , with a positive or negative , inversion results; (ii) the differences between and inversion results caused by the weight ; and (iii) the similarities and differences between and inversion results. This is our roadmap toward better understandings of the differences between -based and -based inverse data matching.
Note that since the norm is only a shift away from the corresponding norm in the Fourier representation, by replacing with , we do not introduce extra mismatch functionals for those (semi-) norms. We will, however, discuss inversions when we study the corresponding inversions.
In the definition of the objective function, we take the weight function such that is consistent with the linearization of .
We refer interested readers to [32, 21, 25] for technical discussions on the results in (5) and (6) (under more general settings than what we present here) that connect with . For our purpose, these results say that: (i) in the asymptotic regime when two signals and , both being probability density functions, are sufficiently close to each other, their distance can be well approximated by their distance; and (ii) if , then and vice versa, that is, the exact matching solutions to the model (1), if exists, are global minimizers to both and . However, let us emphasize that the non-asymptotic equivalence in (6) does NOT imply that the functional and (if we define one) have exactly the same optimization landscape. In fact, numerical evidences show that the two functionals have different optimization landscapes that are both quite different from that of the mismatch functional ; see for instance Section (5) for analytical and numerical evidences.
3 Frequency responses in asymptotic regime
We first study the Fourier-domain behavior of the solutions to (1) obtained through the minimizing the functionals we introduced in the previous section. At the solution, and are sufficiently close to each other. Therefore their distance can be replaced with their distance according to (5). In the leading order, the solution is simply the solution in this regime.
3.1 -based inverse matching for linear problems
Let us start with a linear inverse matching problem given by the model:
where denotes the datum in (1) polluted by an additive noise introduced in the measuring process. The subscript is used to denote the size (in appropriate norms to be specified soon) of the noise, that is, the size of . Besides, we assume that is still in the range of the operator . When the model is viewed as the linearization of the nonlinear model (1), should be regarded as the perturbation of the background . The model perturbation is also often denoted as . We assume that the linear operator is diagonal in the Fourier domain, that is, it has the symbol,
for some . This assumption is to make some of the calculations more concise but is not essential as we will comment on later. When the exponent , the operator is “smoothing”, in the sense that it maps a given to an output with better regularity than itself. The inverse matching problem of solving for in (10), on the other hand, is ill-conditioned (so would be the corresponding nonlinear inverse problem if is regarded as the linearization of ). The size of , to some extent, can describe the degree of ill-conditionedness of the inverse matching problem.
We assume a priori that for some . Therefore, could be viewed as an operator . We now look at the inversion of the problem under the () framework.
We seek the solution of the inverse problem as the minimizer of the functional , defined as in (7) with and replaced with . We verify that the Fréchet derivative of at in the direction is given by
where we used to denote the adjoint of the operator . The minimizer of is located at the place where its Fréchet derivative vanishes. Therefore the minimizer solves the following (modified) normal equation at frequency :
The solution at frequency is therefore
We can then perform an inverse Fourier transform to find the solution in the physical space. The result is
where the operator is defined through the relation
being the inverse Fourier transform, being the usual Laplacian operator, and being the identity operator.
Let us first remark that the calculations above can be carried out in the same way if the norm is replaced with the norm. The only changes are that should be replaced with and the operator in has to be replaced with .
When , assuming that , the above solution reduces to the classical least-square solution . Moreover, when is invertible (so will be ), the solution can be simplified to , using the fact that , which is simply the true solution to the original problem (10). Therefore, in the same manner, as the classical least-square method, the least-square method based on the norm does not change the solution to the original inverse problem when it is uniquely solvable. This is, however, not the case for the inversion in general. For instance, inversion only matches the derivative of the predicted data to the measured data.
When , is a differential operator. Applying to the datum amplifies high-frequency components of the datum. When , is a (smoothing) integral operator. Applying to the datum suppresses high-frequency components of the datum. Even though the presence of the operator in will un-do the effect of on the datum in a perfect world (when is invertible, and all calculations are done with arbitrary precision), when operated under a given accuracy, inversion with is less sensitive to high-frequency noise in the data while inversion with is more sensitive to high-frequency noise in the data, compared to the case of (that is, the classical least-square inversion). Therefore, inversion with can be seen as a “preconditioned” (by the operator ) least-square inversion.
3.2 Resolution analysis of linear inverse matching
Let be given as in (11) and an approximation to defined through its symbol:
Let be the norm of the additive noise in . Then the reconstruction error , with obtained as the minimizer of , is bounded by
when we select
Following classical results in , it is straightforward to verify that the difference between the true solution and the approximated noisy solution is
We then verify that operators and have the following norms respectively
This allows us to conclude that
One main message carried by the theorem above is that reconstruction based on the mismatch has a spatial resolution
under the conditions in the theorem. At a fixed noise level , for fixed and , the optimal resolution of the inverse matching result degenerates when gets smaller. The case of corresponds to the usual reconstruction in the framework. The optimal resolution one could get in this case is decided by . When (), the best resolution one could get is better than the case in a perfect world. When , the reconstructions in the framework provides an optimal resolution that is worse than the case. In other words, the reconstructions based on the negative norms appear to be smoother than optimal reconstructions in this case. See Section 6 for numerical examples that illustrate this resolution analysis.
However, we should emphasize that the above simple calculation only provides the best-case scenarios. It does not tell us exactly how to achieve the best results in a general setup (when the symbol of
, i.e. the singular value decomposition ofin the discrete case, is not precisely known). Nevertheless, the guiding principle of the analysis is well demonstrated: least-square with a stronger (than the ) norm yield higher resolution reconstructions while least-square with a weaker norm yield lower (again compared to the case) resolution reconstructions in the best case.
3.3 -based inverse matching for linear problems
Inverse matching with the weighted norm can be analyzed in the same manner to study the impact of the weight on the inverse matching result. The solution to (10) in this case is sought as the minimizer of the functional defined in (8) with and . This means that the weight in our definition of the objective function.
Following the same calculation as in the previous subsection, we find that the minimizer of the functional solves the following normal equation at frequency :
where is the adjoint of the operator defined through the relation .
We first observe that the right-hand side of (17) and that of (12) are different. In (12), the -th Fourier mode of the datum is amplified or attenuated, depending on the sign of , by a factor of . While in (17), this mode is further convoluted with other modes of the datum after the amplification/attenuation. The convolution induces mixing between different modes of the datum. Therefore, inverse matching with the weighted norm cannot be done mode by mode as what we did for the unweighted norm, even when we assume that the forward operator is diagonal. However, main effect of the norm on the inversion, the smoothing/sharpening effect introduced by the factor (half of which come from the factor in front of while the other half come from the factor in ), are the same in both the unweighted and the weighted norms.
The inverse matching solution, in this case, written in physical space, is:
We can again compare this with the unweighted solution in (13). The only difference is the introduction of the inhomogeneity, which depends on the datum , in the preconditioning operator by replacing it with . When (), and are (local) differential operators. Roughly speaking, compared to , emphasizes places where is small, be reminded that , or the -th order derivative of is large. At those locations, amplifies the same modes of the datum more than does. When , and are non-local operators. The impact of is more global (as we have seen in the previous paragraph in the Fourier domain). It is hard to precisely characterize the impact of without knowing its form explicitly. However, we can still see, for instance, from (17), that the smoother is, the smoother the inverse matching result will be (since has fast decay and the convolution will further smooth out ). If is very rough, say that it behaves like Gaussian noise, then decays very slow. The convolution, in this case, will not smooth out as much as in the previous case. The main effect of on the inverse matching result in this case mainly comes from the norm, not the weight.
Thanks to the asymptotic equivalence between and in (5), the smoothing effect we observe in this section for the inverse matching (and therefore inverse matching since is only different from on the zeroth moment in the Fourier domain) is also what we observe in the
on the zeroth moment in the Fourier domain) is also what we observe in theinverse matching. This observation will be demonstrated in more detail in our numerical simulations in Section 6.
3.4 Iterative solution of nonlinear inverse matching
The simple analysis in the previous sections based on the linearized quadratic Wasserstein metric, i.e., a weighted norm, on the inverse matching of linear model (10) does not translate directly to the case of inverse matching with the fully nonlinear model (1). Nevertheless, the analysis does provide us some insights.
Let us consider an iterative matching algorithm for the nonlinear problem, starting with a given initial guess , characterized by the following iteration:
where is a chosen descent direction of the objective functional at iteration , and is the step length at this iteration. For simplicity, let us take the steepest descent method where the descent direction is taken as the negative gradient direction. Following the calculations in the previous section, we verify that the Fréchet derivative of at the current iteration in the direction is given by
assuming that the forward model is Fréchet differentiable at the with derivative . This leads to the following descent direction chosen by a gradient descent method:
Let us compare this with the descent direction resulted from the least-square functional:
We see that the iterative process of the inverse matching can be viewed as a preconditioned version of the corresponding iteration. The preconditioning operator, , depends on the datum but is independent of the iteration. When the iteration is stopped after a finite step, the effect we observed for linear problems, that is, the smoothing effect of in the case of or its de-smoothing effect in the case of , is carried to the solution of nonlinear problems.
Wasserstein smoothing in the asymptotic regime.
To summarize, when the model predictions and the measured data are sufficiently close to each other, inverse matching with the quadratic Wasserstein metric, or equivalently the metric, can be viewed as a preconditioned -based inverse matching. The preconditioning operator is roughly the inverse Laplacian operator with a coefficient given by the datum. The optimal resolution of the inversion result from the Wasserstein metric, with data at a given noise level is roughly of the order ( being the order of the operator at the optimal solution and ) instead of as given in the case. The shape of the datum distorts the Wasserstein matching result slightly from the inverse matching result with a regular (semi-) norm.
4 Wasserstein iterations in non-asymptotic regime
As we have seen from the previous sections, in the asymptotic regime, the smoothing effect of the quadratic Wasserstein metric in solving inverse matching problems can be characterized relatively precise thanks to the equivalence between and given in (5). The demonstrated smoothing effect makes -based inverse matching very robust against high-frequency noise in the measured data. This phenomenon has been reported in the numerical results published in recent years [6, 13, 14, 34, 35] and is one of the main reasons that is considered as a good alternative for -based matching methods. In this section, we argue that the smoothing effect of can also be observed in the non-asymptotic regime, that is, a regime where signals and are sufficiently far away from each other. The smoothing effect in the non-asymptotic regime implies that the landscape of the objective functional is smoother than that of the classical objective functional.
To see the smoothing effect of in non-asymptotic regime, we analyze the inverse matching procedure described by the iterative scheme (19) for the objective functional , defined in (9). For the sake of being technically correct, we assume that the data we are comparing in this section are sufficiently regular. More precisely, we assume that and for some . We also assume that the map () is Fréchet differentiable at any admissible and denote by the derivative in direction . We can then write down the Fréchet derivative of at the current iteration in the direction following the differentiability result of with respect to . It is,
where denotes the optimal transport map at iteration (that is, for ), and denotes the derivative of with respect to (not ) in the direction .
Following the optimal transport theorem of Brenier , the optimal transport map at the current iteration , , is given as where is the unique (up to a constant) convex solution to the Monge-Ampère equation:
Here is the Hessian operator defined through the Hessian matrix (with the notation ).
Let be the Fréchet derivative of at in the direction , we then verify that solves the following second-order elliptic equation to the leading order:
where while depend on the dimension. When , () and (). When , we have
Let be the solution to the (adjoint) equation:
It is then straightforward to verify, following standard adjoint state calculations , that update direction can be written as
where denotes the adjoint of the operator .
We first observe that unlike in the classical case where is applied directly to the residual , that is, , the descent direction here depends on the model prediction and the datum only implicitly through the transfer map and its derivative with respect to . Only in the asymptotic regime of being very close to can we make the connection between and the normalized residual. This is where the approximation to comes from.
From Caffarelli’s regularity theory (c.g. [32, Theorem 4.14]), which states that when and we have that the Monge-Ampère solution , we see that is at least . Therefore the solution to the adjoint problem, , is in
by standard theory for elliptic partial differential equations when the problem is not degenerate and inif it is degenerate. Therefore, is one derivative smoother than and (and therefore the residual). This is exactly what the preconditioning operator (with ) did to the residual in the asymptotic regime, for instance, as shown in (13). This shows that inverse matching has smoothing effect even in the non-asymptotic regime.
In one-dimensional case, we can see the smoothing effect more explicitly since we are allowed to construct the optimal map explicitly in this case. Let and be the cumulative density functions for and respectively. The optimal transportation theorem in one-dimensional setting (c.g. [32, Theorem 2.18]) then says that the optimal transportation map from to is given by . This allows us to verify that, the gradient of at in direction , given in (23), is simplified to:
where the function is defined as . Therefore the descent direction (27) simplifies to
It is clear from (29) that the Fréchet derivative of at iteration depends only on the anti-derivatives of , and , through and . Therefore, it is smoother than the Fréchet derivative of in general, whether or not the signals and are close to each other. This shows why the smoothing effect of exists also in non-asymptotic regime.
To provide some numerical evidences, we show in Figure 1 and Figure 2 some gradients of the and objective functions, with respect to the absorption coefficient (i.e. ), for the inverse diffusion problem we study in Section 6.2, in one- and two-dimensional domains and respectively. The synthetic data, generated by applying the forward operator to the true absorption coefficient and then adding multiplicative random noise, contains roughly of random noise. We intentionally choose initial guesses to be relatively far away from the true coefficient so that the model prediction is far from the data to be matched. We are not interested in a direct quantitative comparison between the gradient of the Wasserstein objective function and that of the objective function since we do not have a good basis for the comparison. However, it is evident from these numerical results that the gradient of the Wasserstein functional is smoother, or contains fewer frequencies to be more precise, compared to the corresponding case.
5 Wasserstein inverse matching for transportation and convolution
Its robustness against high-frequency noise in data, resulted from its smoothing effect we demonstrated in the previous two sections, is not the only reason why is thought as better than for many inverse data matching problems. We show in this section another advantage of the distance in studying inverse matching problems: its convexity with respect to translation and dilation of signals.
5.1 convexity with respect to affine transformations
For a given probability density function with finite moments and , we define:
where with symmetric and positive-definite, and . This is simply a translation (by ) and dilation (by ) of the function . We verify that .
Let be generated from with symmetric and positive-definite. Then we check that the optimal transport map from to is given by . In other words, the function is a convex solution to the Monge-Ampère equation (24) with this pair. This observation allows us to find that,
This calculation shows that is convex with respect to and for rather general probability density function .