Geophysical Inversion and Optimal Transport

We propose a new approach to measuring the agreement between two oscillatory time series, such as seismic waveforms, and demonstrate that it can be employed effectively in inverse problems. Our approach is based on Optimal Transport theory and the Wasserstein distance, with a novel transformation of the time series to ensure that necessary normalisation and positivity conditions are met. Our measure is differentiable, and can readily be employed within an optimization framework. We demonstrate performance with a variety of synthetic examples, including seismic source inversion, and observe substantially better convergence properties than achieved with conventional L_2 misfits. We also briefly discuss the relationship between Optimal Transport and Bayesian inference.



page 5

page 10

page 13

page 19

page 20

page 23

page 27

page 42


Transform-based particle filtering for elliptic Bayesian inverse problems

We introduce optimal transport based resampling in adaptive SMC. We cons...

Optimal Transport Based Seismic Inversion: Beyond Cycle Skipping

Full waveform inversion (FWI) is today a standard process for the invers...

Optimal Copula Transport for Clustering Multivariate Time Series

This paper presents a new methodology for clustering multivariate time s...

An optimal transport approach to data compression in distributionally robust control

We consider the problem of controlling a stochastic linear time-invarian...

Autoregressive Optimal Transport Models

Series of distributions indexed by equally spaced time points are ubiqui...

Bayesian Learning with Wasserstein Barycenters

In this work we introduce a novel paradigm for Bayesian learning based o...

Remote measurement of sea ice dynamics with regularized optimal transport

As Arctic conditions rapidly change, human activity in the Arctic will c...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In seismological studies—at global, regional and exploration scales—we often set out to find the earth structure and/or seismic source parameters that can explain observed seismic waveforms. In order to do so, we must first introduce some measure of the similarity between two waveforms, and hence define what it means for a synthetic seismogram to ‘agree with’ observed data. The choice we make here can have far-reaching consequences, influencing not only the characteristics of the solution we obtain, but also the efficiency of the algorithms by which we obtain it.

By far the most commonly-used measure of similarity (or ‘misfit function’) is the squared difference between the two waveforms, integrated over time: in other words, the square of the Euclidean (i.e., ) distance between the two. This is conceptually straightforward, mirroring our everyday notion of ‘distance’. Moreover, in cases where the waveforms are linear in the parameters sought (or may be approximated as such), this choice enables a highly-efficient solution strategy—the well-known ‘least-squares algorithm’. However, in many cases the relationship between Earth parameters and seismic waveform amplitudes is known to be non-linear; perhaps highly so. As a result, the Euclidean misfit can become complex with multiple minima, leading to difficult optimization problems.

One pragmatic approach may be to consider only small perturbations in parameters, allowing problems to be ‘linearized’, and almost all seismic waveform inversion is to some extent based on linearization. However, a key issue is whether one has sufficient information about the unknown Earth or source parameters to allow an initial guess that is close enough to the best fit solution to render linearization useful in practice. In particular, for gradient-based optimization to converge, the predicted waveforms must be closely-aligned with the corresponding phase of the observed waveforms. This is illustrated in Fig. 1: in the top panel two double Ricker wavelets are sufficiently close that minimization of will likely result in an optimal alignment, whereas in the second panel the first peak on the blue waveform is closer to the second peak of the orange waveform, which suggests that the same approach might result in the incorrect alignment of peaks.

Various strategies have been developed to mitigate potential alignment issues, typically through preprocessing steps. Examples include first fitting arrival times of phases in body wave studies, or fitting seismic envelopes in surface wave studies. In reflection seismology it is usually necessary to first obtain the correct long wavelength velocity structure of a region, e.g. by travel time tomography, in order to align phases, as a prelude to full waveform inversion (see e.g. Sirgue & Pratt, 2004). Without such alignment, gradient based methods that minimize the norm of the waveform difference can fail to converge, or converge to local minima in the misfit function. In exploration seismology this is commonly known as cycle skipping, the solution of which has been a long-standing area of research with many innovative approaches proposed, typically involving alternative strategies for measuring waveform misfit (see e.g. Luo & Schuster, 1991), or multiple fitting procedures using different criteria. For a discussion of these issues in the context for full waveform inversion in exploration seismology see Gauthier et al. (1986); Virieux & Operto (2009), and for a recent review of proposed solutions see Pladys et al. (2021).

Figure 1: Two scenarios for alignments of double Ricker wavelets. In the top panel the blue predicted wavelet is within half of one period of the noisy orange target so that local minimization of a squared waveform misfit for the time offset results in global alignment. Here the orange curve has Gaussian noise added with correlation length 0.03s and amplitude of of the maximum signal height. In the second panel the mis-alignment is sufficiently large to cause convergence to a secondary minimum, where the first peak of the blue is closer to the second peak of the orange. The bottom two panels show the squared waveform, misfit and two types of Wasserstein distance, and (see text), as a function of offset. The time offsets of the waveforms in the top panels are indicated by coloured dots on the lower two panels. The waveform misfit has many local minima while both the Wasserstein distances have a single global minimum.

In this paper we explore a different measure of waveform similarity, based on Optimal Transport (Villani, 2003, 2008), which we believe can reduce or eliminate the need for such ad hoc approaches. The theory of Optimal Transport can be traced back to the work of Monge (1781), who sought to understand the most efficient way to distribute raw materials from sources to sinks; over the past two decades, it has undergone a renaissance and received much attention in the mathematics literature (e.g. Ambrosio, 2003; Santambrogio, 2015). It was first introduced into geophysics by Engquist & Froese (2014) who pioneered its use in exploration seismology. Their work demonstrated its efficacy as a seismic misfit function for full waveform inversion, addressing the cycle skipping problem for high-frequency reflection waveforms. Optimal Transport-based misfit functions were found to increase the effective convexity of the objective function and hence the range of time shifts over which gradient-based algorithms would converge (Engquist & Froese, 2014; Engquist et al., 2016; Yang & Engquist, 2018; Yang et al., 2018). These results encouraged other authors to adopt the idea and develop variations on the theme in applications to full waveform imaging (Métivier et al., 2016a, b, c, d). Although methods and details vary between studies, these contributions have now established Optimal Transport as a powerful new approach for the construction of novel seismic waveform misfit functions in reflection seismology. More recently, the application of Optimal Transport has been explored for other geophysical problems including seismic receiver function inversion (Hedjazian et al., 2019), gravity inversion (Huang et al., 2019), and simulation of broad-band ground motions (Okazaki et al., 2021). A number of useful surveys of the mathematical and computational aspects of Optimal Transport have been published, including those of Ambrosio (2003), Santambrogio (2015), Kolouri et al. (2017), Levy & Schwindt (2018) and Peyré & Cuturi (2019).

Inspired by this work, the present paper proposes a fairly general framework for seismic waveform fitting using Optimal Transport, which we believe is particularly suited to regional or global settings, where observed and predicted waveform windows may be significantly mis-aligned in time and have large differences in amplitude. The approach may also have applications to other time series fitting problems that arise in the geosciences. We depart from previous methods developed for reflection seismology in both how Optimal Transport is applied to measure waveform discrepancy, and also the approach to solving the resulting Optimal Transport problem itself. Here the emphasis is on utilizing computationally efficient analytical solutions and calculating exact derivatives for use in nonlinear inversion.

In the following section we provide a brief introduction to Optimal Transport, particularly highlighting its application in 1D problems where convenient analytical solutions exist. The focus is on algorithms for exact calculation of Optimal Transport misfit functions, known as Wasserstein distances. In subsequent sections we present our new formulation for seismic waveform fitting together with our approach to Wasserstein distance calculation including its derivatives. In order to give the reader a concise description of our proposed algorithm, we focus on the main steps in the body of the paper, and leave many of the mathematical details to appendices. We illustrate our approach through application to two waveform fitting problems. The first, motivated by earlier work, is a toy problem where the amplitude, time shift and dominant frequency of double-Ricker wavelets are solved for by minimizing different Wasserstein distances between a noisy and noiseless signal. The second is an example drawn from coupled seismic source location and moment tensor inversion of high rate GPS displacement waveforms, previously studied by

O’Toole & Woodhouse (2011); O’Toole et al. (2012).

2 Optimal Transport

The original formulation of Optimal Transport by Monge (1781) focussed on finding efficient strategies for distributing resources (for example, coal) from sources (i.e. mines) to consumers. We introduce one 1D density function, , to represent the initial distribution, and another, , to represent the distribution of consumers—also known as the target distribution. We assume that the overall volume of production is matched to consumption, and hence (with an appropriate choice of units) we can write


Monge assumed that all the material initially in an interval about point would be transported to some interval about another point , with all . Moreover, he assumed that this Transport Map, , is one-to-one, i.e. no splitting of mass can occur. If this is to satisfy the needs of consumers, the Transport Map needs to be chosen such that


This expression represents the conservation of mass in transforming one density, to another, . In general, there may be many maps satisfying this constraint. To find the optimal strategy, we need to determine the cost associated with any particular choice. We introduce to represent the cost of transporting a unit volume of material from point to point ; thus, the overall cost associated with distributing resources according to map is given by


Monge focused on the case where the cost function has general form , and sought the optimal Transport Map that minimised the total cost functional, .

Subsequent work has proven that for the choice , there is always a unique optimal Transport Map provided and are absolutely continuous with respect to the Lebesgue measure (Brenier, 1991). In general, continuous formulations of this kind require difficult non-linear functional optimization. Several methods for the solution of Optimal Transport problems exist, depending on the choice of the distance function and whether the density functions and

are restricted to 1D (as in our description here), or 2D (as required for many practical problems). One class of approach is via the numerical solution of the Monge-Ampere partial differential equation

(De Philippis & Figalli, 2014; Benamou & Brenier, 2000). This also corresponds to and was used extensively in the earliest studies in geophysics (Engquist & Froese, 2014; Engquist et al., 2016; Yang & Engquist, 2018; Yang et al., 2018).

Following Monge’s initial formulation, Kantorovich (1942) posed a more general problem by allowing mass to be split and combined during transport. Thus, Kantorovich’s work is based upon the concept of a transport plan, , chosen such that


Again—as illustrated in Fig. 2—there are may be multiple plans that satisfy these requirements. The transport plan can be thought of as a recipe for redistribution of mass from onto , where an infinitesimal region at gets spread along the axis according to the corresponding conditional distribution, . The overall cost of the plan is given by


and again one may seek the plan with minimal cost.

Figure 2: Two examples of transport plans, , that have the same 1D marginal densities, and, . The transport plan redistributes the mass from the source to the target . In the left panel the 2D density, , is almost a single line, corresponding to a transport map, while on the right, mass is distributed from each point in to multiple regions along the axis, corresponding to a transport plan.

If and are functions of continuous variables , and we henceforth refer to this as the continuous case, then it can be shown that the optimal plan, i.e. , becomes a one to one map, for all between and . This is illustrated in the left-hand panel of Fig. 2 and this corresponds to Monge’s Transport Map, . Hence they each lead to the same result. For the discrete case Kantorovich (1942) showed that, for any , the problem of finding

reduces to one of Linear Programming, details of which appear in Appendix

A. This was important because the existence of a solution in Monge’s nonlinear approach, (3), is not guaranteed, whereas a solution can always be found with Kantorovich’s Linear Programming (LP) formulation. This approach is also completely general, in that it can find solutions for any value with and in any number of dimensions. The disadvantage of the LP formulation is that it becomes computationally prohibitive as the number of discrete elements in and increases, not least because the number of unknowns to be solved for in the transport matrix, , is equal to the product of the number of elements in and . The computational cost can be reduced somewhat with primal-dual optimization methods. For this is equivalent to minimization of the Kantorovich-Rubenstein norm between density functions (Lellmann et al., 2014), which has also been extensively applied in seismic full waveform inversion of reflection data (Métivier et al., 2016b, c, a, d) and more recently to receiver function inversion (Hedjazian et al., 2019) as well as body and surface waves in land seismics (He et al., 2019). Another method for 2D discrete cases with involves using a regularized Linear Programming formulation through addition of an entropy term. This introduces an element of smoothing to the transport plan but allows more efficient algorithms (Cuturi, 2013; Solomon et al., 2015; Solomon, 2015), and is popular within the computer science literature.

Overall, we see a variety of different approaches to solving Optimal Transport problems have been developed, reflecting different choices regarding the value of , the dimension of the density functions, and whether the problems are continuous or discrete. In this paper our formulation of the problem for time series means that we only need to solve Optimal Transport problems in 1D. We now briefly describe some results for the 1D case which we will use later as the basis of an efficient discrete solution algorithm.

2.1 Analytical solutions in 1D

When and represent 1D densities, and and are real variables, a unique Transport Map solution exists for all distance functions of the form , for (Villani, 2003, 2008). Typically this is expressed through the -Wasserstein distance, which we have loosely referred to above, and may now be defined from (3) as


where represents the space of all transport plans satisfying (2). The Wasserstein distance, , is a measure of the work done in redistributing to . Substituting in the general cost functions we have


where we use to denote the optimal Transport Map. Villani (2003) has shown that for the 1D case there exists a simple analytical solution for with two alternative forms


or equivalently,


where and

are the cumulative distribution functions of

and respectively


and and are the inverse cumulative distribution functions of the variable , where .

The expression on the right hand side of (9) is the integral of the absolute difference between the inverse cumulative distributions to the power of . For the integral (9) corresponds to the area between the two curves, and since the integration may be performed along either axis then we also have an equivalent expression


which avoids the calculation of the inverse cumulative distributions. Furthermore by comparing (7) and (8) one can see that for the 1D continuous case the Transport map is given under any -norm by


3 Application to time series

We will shortly discuss how the preceding results may be implemented as numerical algorithms. However, it is convenient to first discuss the problem of applying Optimal Transport to the comparison of two arbitrary time series. As the discussion above shows, Optimal Transport is conceived as a method for transforming between density functions, which are always positive and integrate to one. Time series are generally oscillatory, with both positive and negative values; they may or may not integrate to any known value. Some scheme is therefore required to transform time series into a format amenable to application of Optimal Transport theory, and previous authors have adopted a variety of strategies. This is a key issue in the effectiveness of any approach, and Mainini (2012) critically evaluates the merits of various transforms from time series to density-like functions. The simplest option is to add an arbitrary scalar to the amplitudes of all waveforms to ensure positivity. However, this is unsuitable for an inversion application, since the amplitude of the predicted seismic waveforms depend on unknown Earth model parameters and cannot be known in advance.

Two other popular choices, both considered by Engquist & Froese (2014), are firstly taking the absolute value of waveforms; and secondly, the separate transport of positive to positive and negative to negative components between pairs of waveforms. Drawbacks of the first approach include a loss of polarity information in the waveform, and a lack of differentiability at times where the time series crosses the zero axis, both of which may be detrimental in an inversion context. In the second transform two Wasserstein distances are calculated, i.e. one between positive parts of observed and predicted waveforms and a second between negative parts of the waveforms. This has the undesirable quality of creating an artificial decorrelation between positive and negative parts of the signal. It also does not preserve mass, i.e. integrals of the waveform over the time window (Mainini, 2012).

Of the other transforms reviewed by Mainini (2012) the most popular seems to be the global strategy that performs transport between two composite signals, constructed by combining the positive part of each waveform with the (inverted) negative part of the other. This is popular among some authors because it is both symmetric and preserves mass, but it has the disadvantage (in the context of waveform fitting for inversion) that it blurs the distinction between observed and predicted waveforms. In some circumstances—such as if the original waveforms happen to be separated significantly in time—it is possible that transport will occur between the positive and negative components of the same signal, i.e. the Wasserstein distance of misfit might reflect the transport of the positive parts of the predicted waveform to the negative parts of the predicted waveform rather than the observed waveform. It can be shown that this global strategy underlies the primal-dual optimization () approach employed in the full waveform inversion studies of Métivier et al. (2016b, c, a, d). In another, more recent contribution, Sun & Alkhalifah (2019) have proposed using Optimal Transport to compare a matching filter, calculated from the deconvolution of a predicted and observed waveform, with a representation of the Dirac delta function. In this case the objects being compared are both naturally positive and hence lend themselves to application of Optimal Transport.

It is clear that there is no universally accepted procedure for imposing positivity on time series, or more generally creating density functions for use with Optimal Transport. Different studies use differing approaches and each have drawbacks. In a departure from earlier trends, Métivier et al. (2018, 2019) proposed a novel idea called ‘Graph space OT’, or GSOT. In their study they assign a series of delta functions along both the observed and predicted waveforms and perform transport between the 2D density functions formed by their sum in the time-amplitude plane. In this way the positivity issue is circumvented by recasting the waveform transport problem as one of integer assignment between two point clouds in 2D, which can be solved with the ‘Auction’ algorithm of Bertsekas & Castanon (1989). Projecting the problem into 2D is appealing because the positive or negative amplitudes of the original waveforms are treated equally and merely determine the location of points in time-amplitude space. It does, however, depend on choices made in defining a Euclidean distance in time-amplitude space. This approach continues to be applied to an increasing range of full waveform inversion problems (see Górszczyk et al., 2021).

In our approach below, we also recast the transport problem from 1D to 2D, but rather than using a discrete point cloud we build a 2D density function across the time-amplitude plane and use this as the basis of Wasserstein calculations. As will be seen, the shape of our 2D density function extends the character of a waveform across the time-amplitude plane without the need to treat positive and negative amplitudes separately. For computational convenience we project our 2D density function transport problem back to 1D through marginalisation, which allows use of the exact Optimal Transport solutions given in section 2. To aid clarity, we describe our step-by-step procedure below, with some mathematical details confined to the appendices. When these steps are combined, they form a procedure for calculating the Wasserstein distances between an observed waveform, , and its synthetic counterpart,

, which exist in independent time-amplitude windows that may, or may not, overlap. Importantly, each step is designed to be a differentiable transform of the waveforms, so that application of the chain rule allows evaluation of the gradient of the Wasserstein distance with respect to model parameters—as is required if the misfit function is to be the basis of any gradient-based optimisation procedure.

3.1 Time-Amplitude windows

The first step of our approach is to define a finite time-amplitude window for both the observed and predicted waveforms, as shown in Fig. 3. We assume that independent time windows, and , are constructed for observed and predicted waveforms respectively, which may or may not be overlapping. In this case a simple affine transform in time will suffice to produce a non-dimensional time parameter . Specifically we have for the observation window


where , and clearly for the observation window, and for the predicted window. Similarly, it is straightforward to define an amplitude window that is sufficient to encompass the observed waveform, . However, as previously discussed, it cannot be guaranteed that the predicted waveform will also lie in this amplitude range, as this will depend on the model parameters chosen. We therefore wish to apply a transformation that maps the entire amplitude axis, , to a finite range, i.e. , while retaining maximum sensitivity within the interval . A convenient choice is based on the inverse tangent function,




In practice we might choose to extend the amplitude parameters beyond the amplitude range of the observed waveform, , by about , e.g. , where . Combining these two transforms produces the desired outcome of a pair of time-amplitude windows, with a common amplitude range, guaranteed to contain both observed and predicted waveforms, and with maximum sensitivity about the desired interval of the observed waveform.

In most seismic inversion applications, derivatives of waveform amplitudes with respect to Earth model, or seismic source, parameters will be required and we assume these are available from a given forward model. Since (14)-(15) transform the original waveform amplitudes, , to non-dimensional amplitudes, , then we will require the derivative of the transformed amplitude with respect to the untransformed amplitude. This is straightforward and given by

Figure 3: a) & b) show two example double Ricker wavelets, and , respectively, with differing dominant frequency and amplitudes, and windows displaced in time by 7s. The wavelet, , has Gaussian noise added with correlation length 0.03s and amplitude of of the maximum signal height. For each window we have , and the Time-Amplitude transform (13) - (15) produces non-dimensional variables and a window for the observed waveform. c) & d) show 2D nearest distance fields calculated for each curve as described in the main text. These have the appearance of fingerprints and represent the information content of the waveform extended across the time-amplitude plane. The shaded density functions on the amplitude and time axis for panels c) & d) are 1D marginals, derived from 2-D density fields described in section 3.3, which are used to represent misfit between the two curves. In the experiments of section 4.1 the best time shift, amplitude and frequency parameters are found by minimizing the Wasserstein distances between ‘observed’ and ‘predicted’ marginals.

3.2 Seismogram fingerprints

The second step of our approach is to convert each 1D transformed waveform, , into a representative 2D function, over its respective transformed time-amplitude window. A convenient choice is the distance from each point within the time-amplitude window to the nearest point on the waveform, . In practice this calculation is discretised onto a regular grid and represented as . Figs. 3c & d and 4 show examples of a contoured nearest distance field for double Ricker wavelets and a seismic receiver function, respectively. As can be seen, this has the appearance of a ‘fingerprint’ that contains within it the original waveforms, as the zero distance contour, but spreads the influence of each amplitude variation across the entire plane. The contours of the 2D nearest distance field contain the full information content of the oscillatory waveform in a higher dimensional object, and avoid the need to specifically manipulate negative parts of the waveform. This idea is motivated by the work of Sethian (1999) who found that using nearest distance fields constructed from handwriting examples, helped improve the success of optical character recognition algorithms. In our case, it is appealing because the information content previously confined to a complex 1D object (a waveform) has been converted to a 2D image. Since the distance field, , is calculated in the transformed time-amplitude plane, then clearly the relative influence of amplitude versus time variations on the shape of its contours (Fig. 3c & d) will depend on the choice of the ratio .

It transpires that to find the nearest distance field, , it is not necessary to exhaustively search over the complete discretized waveform for the minimum distance to each grid point, as there are several more efficient methods available. Indeed this calculation may be familiar to a seismologist because it is identical to that of constructing a seismic travel time field with geometric ray theory. In particular, precisely the same distance contours are obtained if we consider the receiver function in Fig. 4 as a seismic wavefront at zero-time, and then let it propagate both in the positive and negative amplitude directions across the plane in a homogenous medium with unit velocity. Methods based on the Fast Marching Algorithm (Sethian, 1996; Rawlinson & Sambridge, 2004) are available for this task. In our study we represent the waveform as a series of linear segments, rather than discrete points, and so is the distance from the th grid point to the nearest point on the segmented wavefront. For this higher order representation Fast Marching is not ideal and the method we employ is described in Appendix C. For this second step we also require derivatives of with respect to the non-dimensional amplitude of the waveform, i.e. , where . Given the equivalence to geometric ray theory, it turns out that the distance only depends on the waveform amplitudes, , of the segment containing the nearest point. Hence a maximum of two derivatives are non-zero for each . Furthermore, analytical expressions for the derivatives are easily obtained since they are closely related to well known expressions for derivatives of travel times with respect to seismic source location in a homogeneous medium. Full details are given in Appendix C.

3.3 Density fields

The diligent reader will have spotted that the distance field is zero on the original ray and increases away from it. Since our aim is to create a density function for use with Optimal Transport, it seems reasonable to presume that the highest density should be on the waveform, and decrease away from it, rather than the reverse. The third step of our approach is then simply to convert the distance field to a density, by taking its negative exponential. Specifically we write the unnormalised density at the grid point, , as


where is an arbitrary scale factor. Together with the ratio , the scale factor, , determines the distance metric controlling the influence of the original waveform on the density amplitudes, . The lower panel of Fig. 4 shows an example of the 2D density function formed from the receiver function in the top panel for the case . To satisfy the mass conservation conditions necessary for Optimal Transport (1), we normalise the density amplitudes


which yields a form suitable for input to a Wasserstein distance calculation. Choices other than (17) are of course possible, e.g. using the square of the distance or other function. We have not explored these possibilities, and merely content ourselves that the resulting 2D density function shown in Fig. 4 looks to be reasonable, in that it has maximum weight on the original waveform and propagates the influence of the waveform shape across the time-amplitude plane. As in the earlier steps we need to keep track of derivatives, in this case of the normalised (or unnormalised) density function amplitudes with respect to the distance field, which means we require and . Again these may be obtained analytically in a straightforward fashion, and full details are given in Appendix C.

Figure 4: Top panel shows a seismic receiver function and its associated nearest distance field contours. The lower panel shows the 2D density function created by the distance field using (17) with a scale factor of . Larger values create a broader function about the original waveform that forms a ‘ridge’ with the highest density. 2D density functions can be constructed for both observed and predicted waveforms and the Wasserstein distance between them calculated using Optimal Transport, which becomes the misfit function in our waveform fitting algorithm.

3.4 Marginal Wasserstein

The final step of our procedure is to calculate Wasserstein distances between observed and predicted 2D density functions, as in Fig. 4. While there are several methods available for 2D Wasserstein calculations, each involve the numerical solution of either a partial differential equation, as in the Monge-Ampere solvers, or numerical solution of an optimization problem, as for Linear Programming based solvers and primal-dual methods. Recently, much effort has been devoted to developing computationally and memory efficient 2D methods based on entropically-smoothed Wasserstein measures (Cuturi, 2013; Solomon et al., 2015; Solomon, 2015). However these involve the Optimal Transport between 2D density functions like that in Fig. 4, each of which is over a grid of discrete elements. Typically the computational costs of many of these methods will scale with the cube of the size of the input, hence in the worst case. This is unattractive for our purpose, where applications are likely to involve computing misfits between many pairs of waveforms. Rather than calculate the Wasserstein distance in 2D, our solution is to recast the problem as two 1D Optimal Transport problems by replacing the density function with its amplitude and time marginals. Specifically, we integrate the 2D density function in (17) along the amplitude axis to produce a 1D time marginal, , and also along the time axis to produce a 1D amplitude marginal, . We have




Fig. 3c & d, shows examples of these marginals shaded along side each axis, for the double-Ricker wavelet case. The shape of each marginal is in general dependent on both time and amplitude variations in the original waveform.

In this way Wasserstein distance calculations between observed and predicted counterparts, which we write as and , require only the solution of 1D Optimal Transport problems. This significantly reduces the computational complexity of the task, and allows us to exploit the analytical solution given by (9). Our algorithm for evaluating 1D Wasserstein distances is described in the following section.

It is clear that some information must be lost through use of marginalization, but as we shall show, this does not appear to be a disadvantage because in all cases considered here, we observe that alignment of the pair of 1D marginals uniquely corresponds to alignment of the original waveforms and importantly therefore no significant non-uniqueness is introduced in these cases. This suggests that 1D marginals provide an effective representation of the information content of the 2D density function.

Since our procedure results in two Wasserstein distances, we have the option to minimize their weighted sum with respect to Earth model parameters, which we write as


where the subscript in denotes the order- Wasserstein distance, the superscript indicates that it is to the power of , the first superscript in and indicates either the time or amplitude marginal, and reflects the choice of relative weighting, . For inversion, two sets of derivatives are relevant here to connect to the previous step. Firstly, derivatives of the 1D marginal amplitudes with respect to the 2D density amplitudes, , which are given in appendix C, and secondly the derivatives of the Wasserstein distances with respect to the marginal amplitudes, and . Calculation of Wasserstein distances between 1D density profiles is discussed in the next section, and the relevant derivatives in appendix D. Again, in all cases we are able to derive analytical expressions that may be evaluated efficiently and without approximation.

3.5 An algorithm for Wasserstein distances between density profiles

Figure 5: The top panels illustrate cumulative distribution functions (CDFs), together with their inverses, for two 1-D discrete densities, and , each consisting of a sum of 6 delta functions at locations and , respectively, where and . These correspond to eqns. (22a & b) with weights and . We make use of a variable to represent the range of the CDF, or equivalently the domain of the inverse CDF function. In the top right panel the change points in the piecewise constant inverse CDFs are indicated on the -axis by the . Note that is always unity. The lower panels show the difference between the inverse CDFs (left) as a function of and the optimal Transport Map between and (right).

In order to implement the final step of our approach we require an efficient and exact method for solving Optimal Transport problems in 1D. Our algorithm for doing so is based on evaluation of (9) for the case where and may be represented as a weighted sum of delta functions, i.e.


with the delta function representing a unit point mass at position . Here we consider the general case where , although in our application, they will both be equal to either for the time marginal, or , for the amplitude marginal. Similarly we do not assume that the locations of the delta functions, ( and ), coincide. To satisfy the normalisation constraint (1), we require


These densities have cumulative distribution functions, which take the form of piecewise continuous ‘staircases’,


where and . An example is illustrated in Fig. 5, for the case with masses.

The corresponding inverse cumulative distribution functions are also piecewise constant, as can be seen in the top right panel of Fig. 5. Note that the figure in the top right panel in Fig. 5 is the same as that in the top left with the axes swapped. Specifically we have


where is the range variable of the cumulative distribution function (CDF) of and , . Evaluation of the expression for the Wasserstein distance in (9) in this case requires the integral of the difference between two piecewise constant functions, , which is itself another piecewise constant function (see the bottom left panel of Fig. 5). We write the ordered set of points along the axis corresponding to the locations of the point masses as , (). For ease of notation we assume that the two functions do not share any discontinuities, and . The -axis of the top right panel in Fig. 5 shows an example of the relationship between and , while the bottom left panel shows the piecewise constant function and its relation to the values for the same case. For each value we introduce the variable , defined by


where the values of correspond to one of the point mass locations, , and similarly for with respect to . For convenience, we use the functions and to describe the mapping of indices and have


Inspection of Fig 5 will help make these relationships clearer. The -Wasserstein distance for the point masses from (9) corresponds to the area under the curve shown in the bottom left panel of Fig. 5, which can then be conveniently expressed in terms of the values of and . We have


where . We therefore see that for the case where and are represented by 1D point masses, evaluation of the -Wasserstein distance reduces to the dot product


where , depends only on the point mass locations, , and depends only on the point mass weights, ; and is the matrix with unit diagonal and -1 on the sub-diagonal. Note that (30) is exact for this problem and involves only one ordering of values and a dot product. The computational cost is therefore , which is extremely efficient in comparison to a Linear Programming solution that is typically . We are not aware of this result appearing elsewhere in the literature, and its computational convenience underpins the efficiency of our approach.

3.6 Calculating Transport Maps

In evaluating the Wasserstein distance for 1D problems, it has not been necessary to first calculate the Transport Map that relocates onto . However it can be constructed from the solutions already obtained. For the point mass case an analytical solution for the 1D transport plan may be obtained by comparing (29) with the Linear Programming equivalent (eq. 39) which by inspection shows that the transport matrix is given by


where the functions and indicate the non-zero components of the transport matrix defined by (28). For all other elements . In this case the transport plan also becomes piecewise constant and its non-zero elements are represented in the bottom right panel of Fig. 5. As can be seen, the particles in the distribution at give mass to multiple particles in the corresponding interval along the vertical sections of the curve. Similarly, the particles at receive mass from multiple particles in the interval of the horizontal section. This is an example of mass splitting that is common for discrete problems. By combining previous results it can be shown that the Wasserstein distance may be written in terms of the transport plan as


We see then that Wasserstein distance in the 1D discrete case may be conveniently evaluated for any using (29) or (30), while the transport plan is given by (31). In what follows we use these expressions to calculate the marginal Wasserstein distances and .

3.7 Combining derivatives

For completeness we briefly discuss the combination of the various derivative terms using the chain rule. Overall, we require the derivative of the Wasserstein distance between marginals, with respect to model parameters, . From (21) we have


Using the chain rule over all intermediate variables in the four stages we have




where expressions have been simplified by using unnormalised fingerprint density amplitudes as intermediate variables. While the multiple nested summations suggest their evaluation may become computational inefficient, in practice the calculation of the derivatives benefits from the fact that each density amplitude, is at most dependent on two waveform amplitudes , and hence all other derivatives are zero. In addition, each marginal amplitude, and is only a function of the density amplitudes in the corresponding th column or th row, respectively. Again all other derivatives are zero. This simplifies the implementation of the chain rule and in practice the Wasserstein derivatives are straightforward to calculate, while also being exact for all terms once derivatives of the forward problem are known.

3.8 Combining the four stages

Having described all for steps in our calculation of Wasserstein distances, it should now be clear to the reader that each set of output variables at one stage becomes a set of input variables for the next, starting with waveform amplitudes, , and ending with Wasserstein distances for each pair of marginals, and . While algebraically tedious, this process is nevertheless very convenient, in that it involves only the combination of exactly evaluable expressions. In addition, since we also have analytical expressions for all derivatives they too may be exactly evaluated for each stage and then combined via the chain rule of partial differentiation. Fig. 6 shows a flow map of the four stages, from right to left, mirroring the order of the chain rule expressions (34) and (35). The far right hand box represents the calculation of the predicted waveforms from model parameters, the derivatives of which are assumed to be available and depend on the forward model. Each subsequent coloured box represents one of the four stages described above, resulting in Wasserstein distances for each marginal with respect to model parameters. In each box the relevant partial derivatives are also highlighted.

In summary the parameter changes from right to left read as follows:

  1. Model parameters to waveform: The solution to the forward problem takes the model parameters, and produces a seismic waveform, . Partial derivatives of the waveform amplitudes with respect to the model parameters are also calculated . The details of these calculations depend on the nature of the forward problem and are assumed to be given.

  2. Waveform to non-dimensional space: In the second box observed and predicted waveforms, , are transformed to using the time-amplitude transformation given by (13)-(15).

  3. Non-dimensional waveform to 2D Fingerprint: In the third box the waveform, is used to calculate the nearest distance to each grid node in the time-amplitude window. Calculation details and derivatives appear above and in appendix C.

  4. Fingerprint function to Fingerprint density: In the fourth box the 2D distance field is converted to a density and normalized using (17) to (18). Here the 1D time and amplitude marginals, and are also calculated from the 2D densities. Details and derivatives appear above and in appendices B and C.

  5. Fingerprint densities to Wasserstein distance: In the final box the Wasserstein distances between 1D marginals are calculated. Details of the Wasserstein calculation are given in section 3.5, with derivatives in appendix D.

Figure 6: A flow map from right to left for the calculation of the Wasserstein distance and its derivatives showing the four stages with their intermediate variables. This ordering mirrors that of the chain rule connecting derivatives in (34) and (35). The box on the far right represents the forward problem applied to the model unknowns, . The next (purple) box represents a time-amplitude window transformation to non-dimensional variables. In each box a new set of variables is calculated from the old, culminating in the Wasserstein distance between an observed and predicted waveform. Derivatives between intermediate variables are noted in each box. Details of each step in the process appear in the main text. This procedure is used in the examples in section 4.

The reader will observe that nothing in our approach to Optimal Transport of seismic waveforms is dependent on the details of the forward model. We simply assume that one exists and derivatives of waveform amplitudes with respect to model parameters are available. Hence the same formulation may be applied to any optimization problem where predicted time series depend on model parameters and must be fit to observations. Such problems occur across the physical sciences and so the formulation here may have general applicability. Furthermore, we note that derivatives of waveform amplitudes with respect to model parameters may not be available in all cases. Indeed it is common in many large-scale problems to use adjoint methods to directly evaluate derivatives of a waveform misfit function with respect to Earth model parameters, e.g. in global waveform inversion (Tromp et al., 2005; Sieminski et al., 2007). Adjoint methods have previously been used for derivative calculations in Optimal Transport, most notably in exploration seismology (Engquist & Froese, 2014; Engquist et al., 2016), although, as noted at the outset, those formulations different from the one proposed here and so it can not be assumed that methods simply carry over. Formulation of an adjoint approach for our marginal Wasserstein distances is a straightforward extension.

Figure 7: Misfit surfaces between double Ricker wavelets as a function of amplitude and time shift parameters (see text). Top panel shows the multimodal least squares case. The middle and lower panels show the Wasserstein and misfits respectively. These were calculated using the marginal algorithm described in the text, and are globally piecewise linear and quadratic respectively. The red dashed line is the global minimum and misfit contours are plotted on the base plane.
Figure 8: a) shows the best fit of the noisy Ricker wavelets found from minimization of the Wasserstein distance, . The initial position of the predicted wavelet is shown in Fig. 3b, and has had its time, amplitude and frequency parameters optimally adjusted to fit the target waveform, Fig. 3a, by gradient minimization. Recall that 5% Gaussian noise is added to the orange curve. Panels b) and c) show the Wasserstein and the least squares misfits, respectively, at each iteration of the gradient-based minimization of . As the Wasserstein distance is progressively minimized the misfit values, of the same models, show a jump over a local maximum, demonstrating the benefit of the Wasserstein distance function that has only a global minimum for this problem (as seen in Fig. 7). Note the logarithmic scale in (b) and (c).

4 Examples

4.1 A toy problem: Double Ricker wavelet fitting

To demonstrate the utility of this marginal Wasserstein waveform misfit function, we first consider a simple toy problem that is a modification of one studied by Engquist & Froese (2014), consisting of the fitting of a double Ricker wavelet with added correlated Gaussian noise. We create a three parameter waveform fitting problem by writing a double-Ricker function as


where , , and is the total time window. The three controlling parameters are the amplitude scale factor, , the time shift, , and the reference frequency, . Together these specify the shape and position of the wavelet. The top panel of Fig. 1 (orange curve) and Fig. 3a show examples with Gaussian random noise added. For the latter case the corresponding parameter values are , , which serves as a reference ‘observed’ waveform in our synthetic tests, that is to be fit by locating the optimal values of the triplet . An example of a sub-optimal fit is the blue curve in the second panel of Fig. 1, which has the correct amplitude, and frequency, but incorrect time shift, . Another is seen in Fig. 3b which has too low an and , but too large .

We perform three separate experiments, the results of which are summarized in Figs. 1, 7 and 8, respectively. The first mirrors the problem studied by Engquist & Froese (2014), but uses our marginal Wasserstein distance algorithm as described in section 3. We sweep through a single axis in our three parameter model space, by varying the time shift parameter, , with the amplitude and frequency parameters fixed at their true values. The third panel in Fig. 1 shows the usual summed squared difference in waveform amplitudes, , as a function of . As one might expect this misfit function has multiple local minima (at least 11 in this case), caused by incorrectly aligned peaks in the double-Ricker wavelets. The bottom panel in Fig. 1 shows the marginal Wasserstein distances, and calculated with our algorithm, using a value of in (21). The parameters used are, s, with , a 20% increase of the amplitude range as above; for the fingerprint grid, and for the distance scale factor. As can be seen both the and misfit functions have a single global minimum at the true solution, with the former having a discontinuous gradient at the optimum. They are reminiscent of a simple and norm measure in model space, which is very encouraging. A figure very similar to the curve was obtained by Engquist & Froese (2014) for their Wasserstein distance tests and we can conclude here that our algorithm reproduces the same features in this case, even in the presence of noise on the signal.

As a second experiment we extend this model space sweep by comparing misfit surfaces for , and as a function of amplitude and time shift parameters. These are displayed in Fig. 7. The surface exhibits ridges of local maxima and troughs of local minima extending the local minima seen in Fig. 1 along the amplitude axis. The Wasserstein misfit surfaces for and show global uniformity for this problem, with the former again being similar to a series of linear segmented planes, and the latter apparently a global quadratic-like surface. As one would expect, the global minimum is not exactly at the true solution, given by the red dashed line, due to the presence of Gaussian noise added to the ‘observed’ Ricker function.

As a third experiment, we performed a derivative based optimization of both the and the Wasserstein measure, , for all three model parameters . For the case we calculated exact derivatives by differentiating (36) and using the usual definition of a squared waveform difference. For the Wasserstein case we used the same Ricker wavelet derivatives and combined them with (34) and (35) to get derivatives of the summed marginal Wasserstein distances (33) with respect to the three unknowns. Fig. 8a shows the best fit curves after 8 iterations minimizing with a Limited-Memory Bound-Constrained gradient descent algorithm (L-BFGS-B)(Byrd et al., 1995; Zhu et al., 1997; Virtanen et al., 2020). The starting Ricker waveform for this minimization is shown in Fig. 3b. Fig. 8b shows the reduction on at each iteration. Clearly, the solution is a close fit and convergence to the global minimum has been achieved in all three parameters. The same experiment was performed with the misfit, which converged to a secondary minimum. To allow some comparison between and we repeated the optimization and at each iteration calculated both the and misfits of waveforms. These curves are shown in Fig. 8b–c. Notice that the curve initially increases before decreasing, indicating that it has been necessary to traverse over one of the ridges in Fig. 7 to find the global minimum, which explains why the optimization failed; the optimization does not encounter this ridge and converges smoothly.

The three Ricker waveform fitting experiments provide an encouraging set of results, albeit in a simple problem, suggesting the potential of Wasserstein distance for waveform fitting generally. The following experiment also provides a demonstration of the utility of the derivative calculations described in section 3.

4.2 Coupled source and moment tensor inversion of displacement seismograms

For a more comprehensive—and geophysical—test we applied our algorithm to the problem of coupled moment tensor and source location inversion. Here the aim is to compare performance of both and in a more complex seismic waveform fitting problem, where exact solutions are known. To this end we replicate the problem of fitting high rate GPS displacement waveforms studied by O’Toole et al. (2012) for the moment tensor and spatial location of the 6.6 2005 Fukuoka earthquake.

Figure 9: Station network (triangles) used in the source inversion experiments. These correspond to a Cartesian projection of the actual GEONET GPS receivers used by O’Toole et al. (2012) to invert displacement waveforms of 2005 Fukuoka earthquake (red star).

Fig. 9 shows the location of 11 GEONET stations used in our experiment, relative to a source (shown as a star at , , ) in a local Cartesian framework. For each station we calculated the three component displacement seismograms shown in Fig. 10a with 1s sampling rate for 60s duration and origin time at s. Waveforms were calculated with an implementation of the algorithm of O’Toole & Woodhouse (2011) which produces numerically stable computation of complete synthetic seismograms including the static displacement in plane layered media (Valentine & Sambridge, 2021). All seismograms were calculated using the moment tensor solution obtained by O’Toole et al. (2012) and in the six layered seismic velocity model used by Kobayashi et al. (2006) to study this earthquake (see Table 1 of O’Toole et al., 2012). As shown in Fig. 10a, the solid red, blue and green component traces have Gaussian correlated random noise added, with amplitude equal to 6% of the maximum seismogram amplitude in each window, and a temporal correlation length of 5s (i.e. 1/12 of the window length). Note the static offset present in many cases. These are taken as ‘observed’ seismograms for our synthetic experiments.

In all experiments an waveform misfit was calculated by summing the squared amplitude difference between all 33 pairs of observed and predicted waveforms. Time-amplitude windows use the parameters % of the amplitude range of the observed waveform in each window. We calculated the two marginal Wasserstein distances, , i.e. with in (21) and with in (21), with the fingerprint grid, and distance scale factor . The method of O’Toole et al. (2012) was used to calculate exact derivatives of displacement seismogram amplitudes with respect to source location, structural parameters or moment tensor components. Together with the theory of section 3 this means that derivatives of both the waveform misfit and may be calculated exactly for all source locations and moment tensors.

Figure 10: a) shows ‘observed’ three component noisy displacement waveforms (solid, coloured) from a source at location and predicted displacement waveforms (solid, gray) for an initial guess location 56 km away at . Six of the eleven receivers are shown, numbered 1-6 in Fig. 9. b) shows ‘observed’ and best fit displacement waveforms for the same receivers obtained from the least squares misfit (dashed, black) and Wasserstein misfit (solid, gray). For the Wasserstein misfit the minimization converges to the global minimum, which is within 2.1 km from the true location and all waveforms are fit well, whereas the least squares case converges to a location 67 km from the true location and discrepancies can be seen for many receiver components pairs.

We performed three experiments with this realisation. The first was to calculate 2D planar cross sections of and Wasserstein misfit functions for different depths and inspect them for local minima. Fig. 11 shows four depth slices for both and across a km region about the true source location (marked with a black dot). These experiments were performed for a source depth at km, with the origin time and moment tensor set to the true value. One can see that the true source location does not lie at the global minimum of either misfit function due to the random noise added to the observed waveforms. The misfit contours are characterised by being predominantly a single global minimum below the true source depth of km, although there is a broad secondary minimum in the North of the region which is more apparent at shallower depths. Multiple local minima are clearly observable in and above the km layer, especially at the shallowest depth, where more than a dozen minima are apparent. This suggests that gradient based optimization of the misfit function may be prone to entrapment in local minima, depending on starting location of an iterative sequence. By contrast, the Wasserstein misfit function is similar in all four depth layers, with a clear dominant global minimum in each case and no significant local minima.

In an attempt to quantify the reduction in the number of local minima observed in the Wasserstein surface, we calculated the values of and over a 100x100 grid across the two shallowest layers shown in Fig. 11, and identified all points for which the misfit increased in each of the North, South, East and West directions. For , 37 and 20 points met this criterion in the first two layers respectively, i.e. Figs. 11a and 11b, while for , 5 points had this property in each of the corresponding surfaces, i.e. Figs. 11e and 11f. This is encouraging, and suggests that optimization of the Wasserstein based misfit function may be more successful than of the standard norm in terms of avoiding entrapment in local minima. This is also consistent with the observations of Hedjazian et al. (2019) in their comparison of Wasserstein misfits with for seismic receiver functions.

In the next experiment we performed multiple gradient based inversions with both and , with , from different initial source locations, while retaining the correct moment tensor parameters. Fig. 10a shows the initial displacement waveforms, in gray, for the starting source. One can see that the initial phases of the predicted waveforms are not aligned with those of the observed, indicating that the starting locations are relatively distant from the true source, and that gradient based methods may suffer from ‘cycle skipping’. In addition, since these are displacement rather than velocity seismograms and include static offset, we can see that most of the predicted waveforms from this source also have differences in their static offset. The optimisation therefore needs to fit amplitudes by both aligning phases and correcting static offsets.

Location x y z
True 1 1 20
Start 40 40 10 56.1
0.6 1.9 21.9 2.1
5.5 58.2 54.8 67.1
Table 1: Source location results for test problem in Fig. 10. is the distance to the true source location. All values in kms.

Fig. 10b shows the same observed seismograms together with the predictions from the best fit solution produced by optimising (black dashed lines) and (coloured dashed lines), respectively. While the former has aligned the waveforms somewhat better than the initial set, the fits remain quite poor in many cases, both in the phase alignment and the static offset, whereas minimization of the Wasserstein misfit has resulted in good alignment of all phases and static offsets. In this case the Wasserstein minimization converged to a location km from the true source, consistent with being at the global minimum of the misfit function in Fig. 11f. Minimization of on the other hand converged to a location km from the true source suggesting entrapment in a local minima. Full results are given in Table 1.

This test was repeated using a suite of 48 different starting locations, spread regularly across the 3D volume comprising km laterally and km in depth. At each of the four depths and km, 6 starting locations were arranged along both NW-SE and NE-SW diagonals at co-ordinates of km respectively, and these are shown as white circles in Fig. 11d. Results of the suite of repeat minimizations of both and from the 48 initial locations are summarized in Fig. 12a. Here we plot the solution error for minimization of (x-axis) versus (y-axis). We observe that minimization of the Wasserstein misfit results in many more cases of convergence to an effective global minimum than the L case from the same starting location. Here we define convergence as meaning the final location is within km of the true location, which is indicated as dashed lines in Fig. 12a. We found that 77% of the trials converged to the global minimum using the Wasserstein misfit and 41% with the misfit. There are also cases where both converged, or neither converged, but none where the converged and the did not. In assessing these results, it must be remembered that the failure of minimization algorithms to converge to a global minimum may be due to several factors including the accuracy of the derivatives provided and the details of the optimization procedure itself. We are confident that the first count is not significant here as independent experiments show that accurate derivatives are calculated in both cases. On the second count, we merely rely on the fact that we use the same L-BFGS-B algorithm in each case.

In the final experiment we extend the previous result to the case where moment tensor components are also included. These six moment tensor parameters are linearly related to displacement waveform amplitudes and so this addition increases the size of the model space and creates a mixed linear-nonlinear problem. We repeat the misfit minimizations for both and from the 48 source initial locations with initial moment tensor parameters determined by linear inversion. Some preconditioning of the gradient was required in both cases to balance the relative influence of source location and moment tensor components. Fig. 12b shows the results in a similar format to the location-only optimization. Here the global minimum is about km from the true solution in the co-ordinates, and we define convergence as meaning the final location is within km of the true location, which is indicated as dashed lines in Fig. 12b. As with the earlier experiment, the results show the clear improvement in convergence when using over the . For this case 54% of the trials converged to the global minimum using the Wasserstein misfit and 35% with the misfit. This case shows the least improvement of Wasserstein over , and may reflect the poorer performance of the optimisation method in the larger dimensional space. In addition it may be worthwhile exploring the influence of the various scaling parameters, , on the character of Wasserstein misfit, which we leave for future study.

Figure 11: Contoured slices through Least squares, , (panels a-d) and Wasserstein misfits, , (panels e-h) for the source relocation experiments. For each misfit four depth slices are shown (10-40 km) over a km lateral range about the true source, given by the black circle. White circles in panel d show the lateral location of initial sources used in the repeat optimization test. Colour scales span misfit values in each slice, given by the legend. The true source is at 20 km depth. Overall the Least squares has few secondary minima below the source depth, but has 11 or more local minima at or above the source depth. The Wasserstein exhibits less variation between depths, a more quadratic appearance and fewer local minima.
Figure 12: Errors in source location obtained by minimizing a least squares waveform misfit, , compared to a Wasserstein distance, , for 48 cases with different starting locations up to 86 km laterally and 20 km in depth from the true source location. a) is for source location keeping moment tensor fixed at the true value, while b) is for coupled source location and moment tensor inversion. Dashed lines indicate a 2.5 km (a) and 1 km (b) error from true location, respectively, which are used as criteria for convergence in the main text. In both cases the Wasserstein results in convergence to the global minimum in a greater number of cases than the least squares misfit.

An area not yet addressed is the computational efficiency of our approach. In our experiments the calculation of the 2D density function (the ‘fingerprint’) for a single displacement waveform in Fig. 10 took 0.016s of CPU. This used a grid of and was performed with a Python 3 script on a Macbook Pro with 2.9Ghz Quad-Core Intel Core i7 processor and 16Gb memory. In the same experiment, the cost of an evaluation of all 33 displacement waveforms including derivatives and the misfit using our (Python) implementation of the waveform forward solver of O’Toole et al. (2012), was 2.37s. This compares to the same calculation for the Wasserstein misfit of 2.89s. While these values depend on implementation details, their ratio is perhaps more robust, and so we observe just a 22% increase for the Wasserstein calculation over that of including derivatives. Given that this involves all 5 stages in Fig. 6, it does not seem a significant burden. These relative computation costs carry over to a complete execution of the optimization algorithm. For example, in the source location experiment shown in Fig. 10b the minimization of the misfit took 37.6s to perform 6 iterations of the L-BFGS-B algorithm, which required 13 misfit and Jacobian evaluations, whereas the equivalent Wasserstein minimization took 51.2 s to perform 9 iterations with 17 misfit and Jacobian evaluations, an increase of 36%. We also compared the compute cost of our Wasserstein misfit, utilizing marginals, with a fully 2D Wasserstein calculation using the entropically smoothed Sinkhorn algorithm (Cuturi, 2013; Flamary et al., 2021). Here our marginal Wasserstein algorithm took 5666s to compute the 3D misfit grid shown in Fig. 11

while the same calculation using the Sinkhorn algorithm was estimated at 11.8 days based on the cpu time for a single Wasserstein distance averaged over 20 trials. Hence there is more than two orders of magnitude in their relative compute costs. In our experiments, we can therefore conclude that in terms of computational costs, optimization of the marginal Wasserstein is not a significant increase over that of

and that our marginal Wasserstein algorithm is extremely efficient compared to an exact 2D OT calculation.

Overall we see a consistent picture that the Wasserstein misfit performs better, in that it permits a standard optimization algorithm to more reliably converge to the correct solution without significant additional computational cost. In our experiments it would be feasible, if cumbersome, to have used either misfit together with an enumerative search to find the global minimum. For both misfit functions the global minimum is consistently near the true solution, as one would expect. Our experiments suggest that when enumerative search is infeasible, due to costs of forward modelling as the number of unknowns increases, then gradient based optimization is apparently more reliable with the Wasserstein misfit. We take this as an encouraging result, and as an indication that gradient-based methods for the nonlinear fitting of other time series might show similar results.

5 Discussion and Conclusions

In this paper we have proposed a new approach to the use of Optimal Transport based waveform misfit functions for inversion, and developed computational algorithms to enable this. Our method allows seismic waveforms to be compared in independent, non-overlapping time windows with arbitrary amplitudes. This general setting can be used for regional or global waveform fitting problems, and in principle for any case where two time series need to be compared. The central idea is to avoid the common practice of directly transforming an oscillatory waveform to a positive one, and instead build a 2D density function about each waveform and then calculate Wasserstein distances between their marginals. We have shown how this approach leads to an entirely analytical formulation, allowing use of 1D Optimal Transport theory and exact calculation of derivatives of the Wasserstein distances with respect to waveform amplitudes.

A first experiment was performed on a toy problem problem fitting two double-Ricker wavelets as a function of time shift, amplitude and frequency which verify that both the proposed and Wasserstein misfits have global linear and quadratic shapes respectively, in this simple case. A second experiment was performed involving the fitting of high rate GPS displacement waveforms for seismic source and moment tensor parameters. Here the Wasserstein misfit was more complicated than in the earlier experiment and the presence of local minima could not be ruled out. Nevertheless it was clear that the Wasserstein misfit surface again showed less complex structure, without the many local minima observable in the corresponding misfit surface. This conclusion was supported by a significantly improved convergence rate of gradient based optimization algorithms and other analyses. Computational efficiency of the new approach has been investigated and results are also encouraging. They suggest that Optimal Transport based misfit measures for time series need not result in significantly increased computational demands. Together, we feel these experiments help illustrate the power of Optimal Transport based misfit functions and build confidence for their implementation in regional and global seismological applications.

Our approach has been to compare the proposed marginal Wasserstein misfit algorithm for waveforms to a standard misfit for synthetic cases where we know the true solution. As noted at the outset, there is a long history of authors proposing alternate misfit measures for exploration and regional seismic waveform inversion. Many of these are aimed at reducing the influence of non linearities and over-coming cycle skipping. We have not attempted a comparison of our proposed algorithm with the plethora of alternatives in the literature, and so make no comment on how it would compare. We note, however, that recently a review of misfit functions for full seismic waveform inversion has appeared (Pladys et al., 2021). These authors also perform detailed comparison of the performance of two formulations of Optimal Transport, namely the minimization of the KR-norm (Métivier et al., 2016b) and Graph Space OT (Métivier et al., 2018, 2019), with other prominent waveform misfit functions, including those based on instantaneous phase (Bracewell, 1965; Taner et al., 1979; Bozdağ et al., 2011; Wu et al., 2014), the Gabor Transform (Fichtner et al., 2008), the Normalised Integration Method (Donno et al., 2013), and Adaptive Waveform Inversion (Warner & Guasch, 2016), all of which have shown efficacy in overcoming cycle skipping problems.

Here Wasserstein distances are used as an alternative measure for the data misfit part of an objective function only. If an inversion setting also required addition of one or more regularization terms to an objective function, then these can be minimized together in the usual way, as data misfit and regularization terms are typically independent calculations. However it is conceivable in some cases that the theory of Optimal Transport might also have application to the calculation of regularization terms as well, which are themselves usually a distance measure in model space, often based on an norm. A recent study by Engquist et al. (2020), suggests that use of the as a data misfit measure itself has a smoothing effect on the inversion process.

We conclude with some remarks on the relationships between the Optimal Transport approach set out here, and other approaches to geophysical inversion. First, we emphasise that there is no universal solution to an inverse problem: the choice of misfit function defines the sense in which we set out to ‘explain’ observational data, and the ‘best-fitting’ model by one measure need not be optimal by another. In the examples presented here, it happens that the and global minima are similar, but this is not necessarily universally true. Moreover, readers will note that our construction of the Wasserstein misfit does not allow for any statistical treatment of noise, such as covariance information. This may be an avenue for future development, but it implies that in our current formulation the Wasserstein misfit cannot be used to define a true Likelihood.

The concept of a transport map as a means of connecting two distributions has a range of potential applications. In particular, it may offer a route towards simplifying computations in probabilistic settings: operations may be performed on one distribution and then transported to another. For example, if a transport map were to be constructed between the prior and posterior distributions for a given problem, it would then be possible to generate samples from the (often complex) posterior by sampling the (simpler) prior and then mapping these samples appropriately. This could avoid the need for expensive Monte Carlo sampling, and suggests possible connections to the ‘prior sampling’ framework described by Käufl et al. (2016).

This also points towards an alternative strategy for posing and solving Bayesian inference problems: one could write a posterior in terms of the prior and a parameterised representation of a transport map. Then, one could formulate and solve a variational inference problem to find parameters such that Bayes’ Theorem is satisfied. This possibility has been explored by

El Moselhy & Marzouk (2012), but does not seem to have received widespread attention.

These are interesting avenues that we believe are worthy of further investigation, and that may provide theoretical and conceptual connections to support the development of the Optimal Transport approach developed in this paper. It is clear that this theory is a powerful approach for quantifying the similarity of density functions, and it can be applied to any inversion or optimisation problem that has this form. In this paper, we have focussed particularly on time series data, and proposed a strategy for converting these into a form amenable to application of Optimal Transport approaches. This enables waveform similarity to be assessed in terms of a measure of the ‘work’ required to transform one signal into the other, and we have shown that this can perform well in optimisation settings compared to conventional misfits. While we bear in mind the famous ‘No Free Lunch’ theorem, which states that all ‘blind’ optimisation algorithms have similar performance when averaged over all conceivable problems, we are optimistic that the Optimal Transport concept may have useful application across a range of inverse problems in geophysics and beyond.


We thank the Commonwealth Scientific Industrial Research Organisation Future Science Platform for Deep Earth Imaging for support. APV acknowledges support from the Australian Research Council through a Discovery Early Career Research Award (grant number DE180100040) and a Discovery Project (grant number DP200100053). This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme grant agreement no. 833848 (UEMHP). The genesis of this work occurred when MS was on a sabbatical visit to the Institute for Geophysics, ETH Zürich. The generous support of a Visiting Professorship is gratefully acknowledged. MS would also like to thank Mr. Jared Magyar for conversations that have contributed to his understanding of this subject. The authors are grateful for some constructive reviews, on an earlier version of this manuscript, received from Stan Dosso and Klaus Mosegaard.

Data Availability statement

Several open source software libraries exist for solution of Optimal Transport problems. One is the

Python Optimal Transport library of Flamary et al. (2021), available from, while another is the Optimal Transport Tools of Cuturi et al. (2022), available from, both of which are in Python. The authors’ Optimal Transport Python library implementing calculations in this study is available from The synthetic seismogram software used in this study, pyprop8, is available from


  • Ambrosio (2003) Ambrosio, L., 2003. Lecture Notes on Optimal Transport Problems, pp. 1–52, Springer Berlin Heidelberg, Berlin, Heidelberg.
  • Benamou & Brenier (2000) Benamou, J. & Brenier, Y., 2000. A computational fluid mechanics solution to the monge-kantorovich mass transfer problem, Numer. Math., 84, 375–393, doi:10.1007/s002110050002.
  • Bertsekas & Castanon (1989) Bertsekas, D. & Castanon, D., 1989. The auction algorithm for the transportation problem, Ann. Oper. Res., 20, 67–96.
  • Bozdağ et al. (2011) Bozdağ, E., Trampert, J., & Tromp, J., 2011. Misfit functions for full waveform inversion based on instantaneous phase and envelope measurements, Geophysical Journal International, 185(2), 845–870.
  • Bracewell (1965) Bracewell, R., 1965.

    The Fourier transformation and its applications

    , McGraw-Hill, New York.
  • Brenier (1991) Brenier, Y., 1991.

    Polar factorization and monotone rearrangement of vector-valued functions,

    Communications on Pure and Applied Mathematics, 44(4), 375–417.
  • Byrd et al. (1995) Byrd, R., Lu, P., Nocedal, J., & Zhu, C., 1995. A limited memory algorithm for bound constrained optimization, SIAM Journal on Scientific Computing, 16(5), 1190–1208.
  • Cuturi (2013) Cuturi, M., 2013. Sinkhorn distances: Lightspeed computation of optimal transport, in Advances in Neural Information Processing Systems 26, pp. 2292–2300, eds Burges, C. J. C., Bottou, L., Welling, M., Ghahramani, Z., & Weinberger, K. Q., Curran Associates, Inc.
  • Cuturi et al. (2022) Cuturi, M., Meng-Papaxanthos, L., Tian, Y., Bunne, C., Davis, G., & Teboul, O., 2022. Optimal Transport Tools (OTT): A JAX Toolbox for all things Wasserstein, arXiv preprint arXiv:2201.12324.
  • De Philippis & Figalli (2014) De Philippis, G. & Figalli, A., 2014. The monge-ampère equation and its link to optimal transportation, Bulletin of the American Mathematical Society, 51(4), 527 – 580.
  • Donno et al. (2013) Donno, D., Chauris, H., & Calandra, H., 2013. Estimating the background velocity model with the normalized integration method, in 75th Annual International Conference and Exhibition, Tu0704, EAGE, Extended Abstracts.
  • El Moselhy & Marzouk (2012) El Moselhy, T. A. & Marzouk, Y. M., 2012. Bayesian inference with optimal maps, Journal of Computational Physics, 231(23), 7815 – 7850.
  • Engquist & Froese (2014) Engquist, B. & Froese, B., 2014. Application of the Wasserstein metric to seismic signals, Communications in Mathematical Sciences, 12(5), 979–988.
  • Engquist et al. (2016) Engquist, B., Hamfeldt, B., & Yang, Y., 2016. Optimal transport for seismic full waveform inversion, Communications in Mathematical Sciences, 14.
  • Engquist et al. (2020) Engquist, B., Ren, K., & Yang, Y., 2020. The quadratic wasserstein metric for inverse data matching, Inverse Problems, 36(5), 055001.
  • Fichtner et al. (2008) Fichtner, A., Kennett, B. L. N., Igel, H., & Bunge, H.-P., 2008.

    Theoretical background for continental- and global-scale full-waveform inversion in the time–frequency domain,

    Geophysical Journal International, 175(2), 665–685.
  • Flamary et al. (2021) Flamary, R., Courty, N., Gramfort, A., Alaya, M. Z., Boisbunon, A., Chambon, S., Chapel, L., Corenflos, A., Fatras, K., Fournier, N., Gautheron, L., Gayraud, N. T., Janati, H., Rakotomamonjy, A., Redko, I., Rolet, A., Schutz, A., Seguy, V., Sutherland, D. J., Tavenard, R., Tong, A., & Vayer, T., 2021. POT: Python Optimal Transport,

    Journal of Machine Learning Research

    , 22(78), 1–8.
  • Gauthier et al. (1986) Gauthier, O., Virieux, J., & Tarantola, A., 1986. Two-dimensional nonlinear inversion of seismic waveforms: Numerical results, Geophysics, 51(7), 1387–1403.
  • Górszczyk et al. (2021) Górszczyk, A., Brossier, R., & Métivier, L., 2021. Graph-space optimal transport concept for time-domain full-waveform inversion of ocean-bottom seismometer data: Nankai trough velocity structure reconstructed from a 1d model, Journal of Geophysical Research: Solid Earth, 126(5), e2020JB021504, e2020JB021504 2020JB021504.
  • He et al. (2019) He, W., Brossier, R., Métivier, L., & Plessix, R.-E., 2019. Land seismic multiparameter full waveform inversion in elastic VTI media by simultaneously interpreting body waves and surface waves with an optimal transport based objective function, Geophysical Journal International, 219(3), 1970–1988.
  • Hedjazian et al. (2019) Hedjazian, N., Bodin, T., & Métivier, L., 2019. An optimal transport approach to linearized inversion of receiver functions, Geophysical Journal International, 216, 130–147.
  • Huang et al. (2019) Huang, G., Zhang, X., & Qian, J., 2019. Kantorovich-Rubinstein misfit for inverting gravity-gradient data by the level-set method, Geophysics, 84, 1–115.
  • Kantorovich (1942) Kantorovich, L. V., 1942. On translocation of masses, Dokl. Acad. Nauk. USSR, 37, 7–8,227–229 (in Russian). [ English translation: J. Math. Sci., vol 133, 4 (2006), 1381–1382.].
  • Karmarkar (1984) Karmarkar, N., 1984. A new polynomial-time algorithm for linear programming, Combinatorica, 4, 373–395.
  • Käufl et al. (2016) Käufl, P., Valentine, A., de Wit, R., & Trampert, J., 2016. Solving probabilistic inverse problems rapidly with prior samples, Geophysical Journal International, 205, 1710–1728.
  • Kobayashi et al. (2006) Kobayashi, R., Miyazaki, S., & Koketsu, K., 2006. Source processes of the 2005 west off fukuoka prefecture earthquake and its largest aftershock inferred from strong motion and 1-hz gps data, Earth Planets and Space, 58, 57 – 62.
  • Kolouri et al. (2017) Kolouri, S., Park, S. R., Thorpe, M., Slepčev, D., & Rohde, G. K., 2017. Optimal mass transport: Signal processing and machine-learning applications, IEEE Signal Processing Magazine, 34(4), 43–59.
  • Lellmann et al. (2014) Lellmann, J., Lorenz, D. A., Schönlieb, C., & Valkonen, T., 2014. Imaging with Kantorovich-Rubinstein discrepancy, SIAM J. on Imag. Sci, 7 (4), 2833–2859.
  • Levy & Schwindt (2018) Levy, B. & Schwindt, E., 2018. Notions of optimal transport theory and how to implement them on a computer, Computers and Graphics, 72, 135–148.
  • Luo & Schuster (1991) Luo, Y. & Schuster, G. T., 1991. Wave‐equation traveltime inversion, Geophysics, 56(5), 645–653.
  • Mainini (2012) Mainini, E., 2012. A description of transport cost for signed measures, J. Math. Sci., 181(6), 837–855.
  • Métivier et al. (2016a) Métivier, L., Brossier, R., Mérigot, Q., Oudet, e., & Virieux, J., 2016a. An optimal transport approach for seismic tomography: application to 3d full waveform inversion, Inverse Problems, 32(11), 115008.
  • Métivier et al. (2016b) Métivier, L., Brossier, R., Mérigot, Q., Oudet, E., & Virieux, J., 2016b. Increasing the robustness and applicability of full-waveform inversion: An optimal transport distance strategy, The Leading Edge, 35, 1060–1067.
  • Métivier et al. (2016c) Métivier, L., Brossier, R., Mérigot, Q., Oudet, E., & Virieux, J., 2016c. Measuring the misfit between seismograms using an optimal transport distance: application to full waveform inversion, Geophysical Journal International, 205(1), 345–377.
  • Métivier et al. (2016d) Métivier, L., Brossier, R., Oudet, E., Mérigot, Q., & Virieux, J., 2016d. An optimal transport distance for full-waveform inversion: Application to the 2014 chevron benchmark data set, SEG Technical Program Expanded Abstracts:, pp. 1278–1283.
  • Métivier et al. (2018) Métivier, L., Allain, A., Brossier, R., Mérigot, Q., Oudet, E., & Virieux, J., 2018. Optimal transport for mitigating cycle skipping in full-waveform inversion: A graph-space transform approach, Geophysics, 83, R515–R540.
  • Métivier et al. (2019) Métivier, L., Brossier, R., Mérigot, Q., & Oudet, E., 2019. A graph space optimal transport distance as a generalization of distances: application to a seismic imaging inverse problem, Inverse Problems, 35(8), 085001.
  • Monge (1781) Monge, G., 1781. Mémoire sur la théorie des déblais et des remblais, De l’Imprimerie Royale.
  • Okazaki et al. (2021) Okazaki, T., Hachiya, H., Iwaki, A., Maeda, T., Fujiwara, H., & Ueda, N., 2021.

    Simulation of broad-band ground motions with consistent long-period and short-period components using the Wasserstein interpolation of acceleration envelopes,

    Geophysical Journal International, 227(1), 333–349.
  • O’Toole & Woodhouse (2011) O’Toole, T. & Woodhouse, J., 2011. Numerically stable computation of complete synthetic seismograms including the static displacement in plane layered media, Geophysical Journal International, 187, 1516–1536.
  • O’Toole et al. (2012) O’Toole, T., Valentine, A., & Woodhouse, J., 2012. Centroid–moment tensor inversions using high-rate GPS waveforms, Geophysical Journal International, 191, 257–270.
  • Peyré & Cuturi (2019) Peyré, G. & Cuturi, M., 2019. Computational optimal transport, Foundations and Trends in Machine Learning, 11(5-6), 355–607.
  • Pladys et al. (2021) Pladys, A., Brossier, R., Youbing, L., & Métivier, L., 2021. On cycle-skipping and misfit function modification for full-wave inversion: Comparison of five recent approaches, Geophysics, 86(4), R563–R587.
  • Rawlinson & Sambridge (2004) Rawlinson, N. & Sambridge, M., 2004. Wavefront evolution in strongly heterogeneous layered media using the Fast Marching Method, Geophys. J. Int., 156, 631–647, doi: 10.1111/j.1365–246X.2004.02153.x.
  • Santambrogio (2015) Santambrogio, F., 2015. Optimal Transport for Applied Mathematicians. Calculus of Variations, PDEs and Modeling, Birkhäuser.
  • Sethian (1996) Sethian, J. A., 1996. A fast marching level set method for monotonically advancing fronts, Proc. Nat. Acad. Sci., 93, 1591–1595.
  • Sethian (1999) Sethian, J. A., 1999. Level Set Methods and Fast Marching Methods, Cambridge University Press, Cambridge.
  • Sethian & Popovici (1999) Sethian, J. A. & Popovici, A. M., 1999. 3-D traveltime computation using the fast marching method, Geophysics, 64, 516–523.
  • Sieminski et al. (2007) Sieminski, A., Liu, Q., Trampert, J., & Tromp, J., 2007. Finite-frequency sensitivity of surface waves to anisotropy based on adjoint methods, Geophys. J. Int., 168, 1153–1174.
  • Sirgue & Pratt (2004) Sirgue, L. & Pratt, R. G., 2004. Efficient waveform inversion and imaging: A strategy for selecting temporal frequencies, Geophysics, 69(1), 231–248.
  • Solomon (2015) Solomon, J., 2015. Transportation Techniques for Geometric Data Processing, Ph.D. thesis, Stanford Department of Computer Science.
  • Solomon et al. (2015) Solomon, J., De Goes, F., Peyré, G., Cuturi, M., Butscher, A., Nguyen, A., Du, T., & Guibas, L., 2015. Convolutional Wasserstein distances: Efficient optimal transportation on geometric domains, ACM Transactions on Graphics, 34(4), 66:1–66:11.
  • Sun & Alkhalifah (2019) Sun, B. & Alkhalifah, T., 2019. The application of an optimal transport to a preconditioned data matching function for robust waveform inversion, Geophysics, 84(6), R923–R945.
  • Taner et al. (1979) Taner, M. T., Koehler, F., & Sheriff, R., 1979. Complex seismic trace analysis, Geophysics, 44(6), 1041–1063.
  • Tromp et al. (2005) Tromp, J., Tape, C., & Liu, Q., 2005. Seismic tomography, adjoint methods, time reversal, and banana-donut kernels, Geophys. J. Int., 160, 195–216, doi: 10.1111/j.1365–246X.2004.02456.x.
  • Valentine & Sambridge (2021) Valentine, A. P. & Sambridge, M., 2021. pyprop8: A lightweight code to simulate seismic observables in a layered half-space, Journal of Open Source Software, 6(66), 3858.
  • Villani (2003) Villani, C., 2003. Topics in Optimal Transportation, Graduate studies in mathematics, American Mathematical Society.
  • Villani (2008) Villani, C., 2008. Optimal Transport: Old and New, Grundlehren der mathematischen Wissenschaften, Springer Berlin Heidelberg.
  • Virieux & Operto (2009) Virieux, J. & Operto, S., 2009. An overview of full-waveform inversion in exploration geophysics, Geophysics, 74(6), WCC1–WCC26.
  • Virtanen et al. (2020) Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., & SciPy 1.0 Contributors, 2020. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python, Nature Methods, 17, 261–272.
  • Warner & Guasch (2016) Warner, M. & Guasch, L., 2016. Adaptive waveform inversion: Theory, Geophysics, 81, R429–R445.
  • Wu et al. (2014) Wu, R.-S., Luo, J., & Wu, B., 2014. Seismic envelope inversion and modulation signal model, Geophysics, 79(3), WA13–WA24.
  • Yang & Engquist (2018) Yang, Y. & Engquist, B., 2018. Analysis of optimal transport and related misfit functions in full-waveform inversion, Geophysics, 83(1), A7–A12.
  • Yang et al. (2018) Yang, Y., Engquist, B., Sun, J., & Hamfeldt, B. F., 2018. Application of optimal transport and the quadratic Wasserstein metric to full-waveform inversion , Geophysics, 83(1), R43–R62.
  • Zhu et al. (1997) Zhu, C., Byrd, R., Lu, P., & Nocedal, J., 1997. Algorithm 778: L-BFGS-B: Fortran Subroutines for Large-Scale Bound-Constrained Optimization, ACM Transactions on Mathematical Software, 23(4), 550–560.

Appendix A Optimal Transport by Linear Programming

Kantorovich (1942) recast Optimal Transport as a Linear Programming constrained optimization problem. The discrete point mass representation of and in (22) means that the transport plan can be represented by the matrix , where the transport from to is given by the matrix vector product in (LABEL:eq:discreteT0)


where and