1 Introduction
Seismic velocity model is of great importance for the seismic exploration, especially the migration, imaging and interpretation processes. Accurate estimation of velocity model plays an essential role in the investigation of earth’s structures and natural resource. In the last decades, such velocity information can be derived by some classical imaging methods such as velocity analysis [berkhout1997pushing], tomography [iyer1993seismic], and full waveform inversion (FWI) [tarantola1984inversion, virieux2009overview, brossier2009seismic]. Compared to other traditional techniques, FWI can obtain highresolution inversion results by iteratively matching the simulated seismograms, which obtained by numerical simulation based on the forward modeling equation, and the observed seismograms. Usually, the minimization of objective function of FWI is solved by adjointstate method [plessix2006review], in which the gradients of the cost with respect to (w.r.t.) velocity model is computed explicitly. This leads to challenges and inconvenience when faced with the changes of the forward operator. Moreover, classical FWI is quite effective when a relatively good initial model and optimized gradientbased algorithm are given [virieux2009overview], because the nonlinearity involved in FWI is strong. It’s application is limited in some configurations owing to a number of problems, such as computational inefficiency and human intervention.
Recently, deeplearning (DL) techniques
[lecun2015deep], in particular, convolutional neural network (CNN)
[lecun2010convolutional], have achieved the stateoftheart performance in a variety of applications. They surpass the conventional approaches in many research fields of inverse problem, for instance, image reconstruction [mccann2017convolutional, yang2021robust], xray computed tomography [jin2017deep], and optical diffraction tomography [sun2018efficient, yang2020deep]. To some extent, this development results from the wide availability of domainspecific languages (DSL) for DL, such as Tensorflow
[abadi2016tensorflow]or PyTorch
[paszke2017automatic], used in academia and industry.Generative adversarial networks (GANs) [goodfellow2014generative]
are powerful DL models capable of generating realistic images, videos, and voice. Rooted in game theory, one neural network, termed as discriminator, attempts to judge whether the information is real or fake, the other neural network, referred as generator, strives to produce data that the discriminator assumes is real. Wasserstein GAN (WGAN)
[arjovsky2017wasserstein]is an extension to the GAN, with it’s usage not only the training stability is improved but also the loss function is designed to correlate with the quality of generated images. This conceptual shift is motivated mathematically by an argument that training the GAN that generator should seek to minimize the distance between the data distribution observed in the training dataset and the distribution observed in the generated examples. The argument contrasts the EarthMover distance, also called Wasserstein1 distance
[vallender1974calculation], to achieve this goal.In this paper, we propose a promising alternative of physicsinformed trainingfree frameworks for twodimensional (2D) FWI, termed as FWIGAN. Inspired by [gupta2021cryogan], we integrated a physics generator with adversarial learning to solve the illposed nonlinear inverse problem. To the best of our knowledge, it is the first time that the WGAN with gradient penalty (WGANGP) [gulrajani2017improved] is adopted for FWI, in which the estimation of velocity model is obtained in a distributional sense. A key feature of the proposed framework is that it can learn the noise distribution included in the noisy observed seismograms so that the “denoised" inversion results can be well estimated. In addition, FWIGAN is an appealing solution to the peculiar challenges of FWI, such as the usage of an undesired initial model or contaminated measurements. Beyond that, we further illustrate the prospective ability of FWIGAN to simultaneously estimate an uncomplicated source wavelet in FWI workflow. To summarise, the main contributions of our work are as follows.

We propose an unsupervised learning framework to handle the inversion task of 2D velocity model from observed seismograms. This paradigm needs no labelled training dataset nor pretraining of networks, and is flexible on account of automatic differentiation (AD) of DSL to estimate seismic wavelet synchronously. It requires minimal user interaction such that make this pipeline more available for research.

Inspired by the work of WGAN, we incorporate the acoustic wave equation and replace the generator by a wave equation. Unlike the traditional FWI optimization, we recover the unknown variables through solving a minmax game by competitive learning. Taking advantages of Wasserstein distance and modeldata adjointdriven fashions, our method has the capability to produce high quality and physically consistent results in a likelihoodfree manner.

Thanks to the theoretical characterization of WGANGP, the proposed method is able to learn the distribution of additive white Gaussian noise (AWGN) when the observed data is noisy. Consequently, it bypasses the perturbation of noise and leads to the “denoised" inversion results. The substantial improvement makes it more robust and versatile to tackle the realistic applications.

More specifically, the preliminary results demonstrate that our paradigm is robust and general enough to accommodate either an undesired initial model guess or the contaminated seismic data for the handling of localminima events.
In Section 2
, the related works concerning traditional approaches for FWI, supervised learning approaches for waveform inversion and unsupervised learning approaches for FWI are presented. In Section
3, we introduce the physical model of FWI and formulate the computational problem. In Section 4, we describe the details of the proposed method, it consists of review of WGANGP, mathematical framework of our method and the architecture of the critic. In Section 5, we extensively compare our method against FWI solved by using Adam optimizer with minibatches data to minimize the leastsquares function, for Marmousi, Marmousi2, and Overthrust models with noisefree and noisy data. The quantitative evaluation is illustrated to show the effectiveness and generalization of FWIGAN. In Section 6, we further discuss the details about the estimation of source wavelet, sensitivity to different initial models, and the possible future works. Finally, we summarize our study in Section 7.2 Related Work
2.1 Traditional Approaches for FWI
FWI with leastsquares criterion was originally introduced by Tarantola [tarantola1984inversion]. It is an effective framework to infer the underground structures based on seismic waveforms, but is known to suffer from the localminima problem, one typical phenomenon of cycleskipping, and is sensitive to the lack of low frequencies, data noise, and poor starting model. Over the past decades, many researchers have designed various methods by considering the multiple data components [bunks1995multiscale] or adding regularization terms [asnaashari2013regularized] to overcome this limitation.
For the multiscale frameworks, FWI can be implemented in the time domain or frequency domain. Bunks
et al. [bunks1995multiscale] firstly proposed successive inversion of subset data in time domain by increasing highfrequency, because the low frequencies are less sensitive to artifacts. By contrast, the methods in frequency domain lead to a more natural approach for multiscale schemes by performing successive inversions of increasing frequencies.With regard to the misfit function, the goal of the leastsquares regularization term is to penalize the roughness of the model, hence defining the socalled Tikhonov regularization [golub1999tikhonov]. Total variation (TV) regularization is usually implemented by minimizing the norm of the model misfit. It was applied to FWI in [askan2008full] and the weighted regularization was applied to frequencydomain FWI in [hu2009simultaneous]. Beyond that, the quadratic Wasserstein metric () from optimal transport has been proven as a better preference for mitigating the trapping of cycleskipping problem [engquist2016optimal, yang2018application], since this metric exhibits advantageous mathematical properties, including convexity w.r.t. time shifts and insensitivity to noise.
2.2 Supervised Learning Approaches for Waveform Inversion
As we all know, a fully connected neural network with one hidden layer can approximate any continuous function arbitrarily well, when its hidden layer is large enough [hornik1991approximation]. In seismic exploration, DLbased works for fault detection [xiong2018seismic, wang2018automatic, wu2019faultseg3d], random noise attenuation [yu2019deep, saad2020deep], interpretation [wang2019deep, zhang2020can] and others [waldeland2018convolutional, yu2021deep] have been rapidly developed. To address the 2D velocity model building problem from seismic data, there are many approaches based on DL have been proposed. In [yang2019deep], the authors introduced a framework applying an UNet [ronneberger2015u] to predict the velocity models from prestack multishot seismic data. The numerical experiments compared with FWI showed that the complicated nonlinear mapping from seismic data to velocity model can be well approximated by the convolutional neural network (CNN). Similar works were described in [araya2019deep, wu2019inversionnet, wang2020velocity, zhang2020data], which share the same purpose by using the universal approximation of different CNN architecture. Most of the aforementioned methods depend on supervised learning to learn the mapping between the paired input and groundtruth. This pipeline needs a large representative training dataset composed of seismic data and the corresponding groundtruth velocity models, which may not be available in many practical applications. In addition, the results obtained by direct feedforward networks might be inconsistent with the observed measurements due to a lack of equation constraints. As a result, the reliability and generalization of this kind methods can not be guaranteed.
Instead of training a deep neural networks (NNs) with a lot of data pairs, a relative smallsize NN can be learned from additional information obtained by enforcing the physical laws to efficiently solve the physical equations. A new NNbased framework, known as physicsinformed neural networks (PINNs), was proposed by Raissi et al. [raissi2019physics] to solve the forward and inverse problems involving nonlinear partial differential equations (PDEs) by integrating available data and mathematical models. Aiming at solving the inversion problem of wave equation, PINN was encoded with acoustic wave equation (AWE) to produce an accurate velocity model from analytical solution to an initial boundary value or seismic data [xu2019physics]. The notable results on 2D synthetic model indicated that PINN is able to infer the velocity distribution through the minimization of composed loss function. In the same spirit, Karimpouli and Tahmasebi addressed 1D wave velocity (P and Swave) and density inversion problem via PINN constrained by seismic wave equation [karimpouli2020physics]. By contrast, both elastic and acoustic wave inversion were considered and settled with PINN in [rashtbehesht2020application]. The motivation for developing those algorithms is that such prior knowledge or constraints can yield more interpretable methods based on NN, and can provide accurate and physically consistent predictions. However, the effectiveness of this pipeline still relies on the training set, architecture of the NN, and the components of the loss function.
2.3 Unsupervised Learning Approaches for FWI
To bridge the gap between supervised learning approaches and seismic waveform inversion, unsupervised learning which does not need labels is a promising paradigm. In [wu2019parametric], a CNNdomain FWI method which uses the concept of deep image prior (DIP) introduced by Ulyanov et al. [ulyanov2018deep] was proposed. Under the assumption that the velocity model can be well represented by a generative NN, the authors integrated the CNN with FWI and minimized the misfit between simulated data and observed data by updating the parameters of the CNN. Similarly, Zhu et al. [zhu2020integrating] described a framework, named NNFWI, to utilize the NN to generate a physical velocity model, which is then embedded in wave equation to simulate the seismic data. The comparable performance of above works demonstrated that the combination of NNs and PDEs is an appealing solution to build velocity model distribution.
Except for the aforementioned methods, another popular labelfree variant to solve timedependent PDEs is recurrent neural networks (RNNs)
[mikolov2011extensions]. RNNs are a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. RNNs can use the previous outputs, which are regarded as current inputs, as well as the internal states (memories) to process variable length sequences of inputs. RNN has also been applied to deal with seismic data reconstruction [yoon2020seismic], normal moveout velocity estimation [fabien2020seismic], and impedance inversion [alfarraj2019semisupervised]. Thanks to the AD [baydin2018automatic] of DSL and the fact that the feedforward computation graph of an RNN seems like the iterative process of timedomain wave propagation, FWI can be well implemented by a suitable RNN. In [richardson2018seismic], Richardson indicated that an RNN is able to carry out the timedomain finite difference scheme to simulate the forward modeling of AWE. The numerical experiments showed that the Adam optimizer [KingmaB14]with minibatches of seismic data produces quicker convergence than stochastic gradient descent
[bottou2010large] and than LBFGSB [zhu1997algorithm] with entire data. Moreover, the gradient of cost w.r.t. velocity model obtained by using reversemode AD is the same as that computed in the adjoint state method. Yet other similar approaches were proposed in [sun2020theory, fabien2020seismicv, ren2020physics]. In [wang2021elastic], the authors improved upon [richardson2018seismic] to perform inversions simultaneously for multiple parameters in elastic isotropic and anisotropic media.Those novel frameworks based on untrained NNs showed remarkable performance compared to traditional methods and supervised learning approaches for solving FWI. However, the methods based on DIP require earlystopping rules, otherwise the reconstruction quality will be degraded. In addition, it’s hard to understand what priors does the NN capture to represent the final solution. By contrast, the paradigms applying RNN can be well understood. It is worthy to mention that both DIP and RNNbased FWI algorithms need a good starting of initial model and enough capacity of GPU memory to store the necessary parameters. Therefore it’s difficult to overcome the instinct challenges of FWI, especially when the observed data is contaminated by noise or the initial model is an unexpected candidate.
3 Problem Formulation
In order to mimic the wave propagation, we consider the 2D constant density AWE in time domain, in which only Pwave velocity model is an unknown parameter, as the control equation. Let represents the support of the velocity models, more formally, the AWE can be expressed as
(1) 
where denotes the position, is the wavefield amplitude, indicates the Pwave velocity distribution, is the source term with the position constraint , and is the time coordinate. denotes the Laplace operator and is the Dirac delta function.
For a given velocity model and source term , the simulated data can be expressed as , where denotes the extraction operator that outputs the wavefield at the receiver positions . is the solution of Eq (1). We denote the true observed data corresponding to the aforementioned parameters is written as , so the classical FWI ( leastsquares FWI) usually measures the residual defined as the sum of the squared differences between the simulated data and the true observed data. We can invert the velocity models by minimizing the cost function
(2) 
where is the maximum travel time, and and represent the number of sources and receivers, respectively. Using an adjointstate algorithm [plessix2006review], the gradient of Eq (2) w.r.t. velocity variable yiels the following formalism
(3) 
where represents the adjoint wavefield that is the solution of the following adjoint wave equation on the domain from time to 0
(4) 
The adjoint wavefield is propagating along a reversal time direction. Then the velocity model can be updated by it’s gradient information iteratively
(5) 
where refers to the number of iteration, denotes the optimal step length of th iteration, and is the modification of the gradient like conjugate gradient.
4 Methodology
Generally, the approaches to solve an illposed inverse problems can be classified into two categories: modelbased and learningbased methods. The first type frameworks aim at obtaining the acceptable solutions with some modelconstraint optimization algorithms, while the second kind pipelines usually train a network or atoms through an optimization of a loss function based on the paired input and groundtruth. The former is flexible and general to address diverse inverse problems but is timeconsuming due to the sophisticated algorithms. In contrast, the later is restricted by specific tasks while is capable of handling challenging issues and fast for prediction. Inspired by
[gupta2021cryogan], we combine these two ways which leverage their respective merits to handle the nonlinear inversion task of FWI.4.1 Review of WGANGP
According to the work of WGAN, the critic, whose role is similar as that of discriminator of GAN but is not trained to classify, should be in the set of 1Lipschitz functions. To enforce this constraint, Arjovsky et al. [arjovsky2017wasserstein] proposed to clip the weights of the critic to lie in a compact space . However the weights clipping strategy leads to optimization difficulties thus even the deep WAGN critics fail to converge sometimes. In [gulrajani2017improved], the authors introduced an alternative way, gradientpenalty WGAN, to solve this issue. Consider the fact that a differentiable function is 1Lipschitz if and only if it has gradients with norm at most 1 everywhere, Gulrajan et al. [gulrajani2017improved] introduced a novel strategy with gradient penalty, the loss of WGANGP is denoted by
(6) 
where and are the distribution of true data and generated samples, respectively. is the random sample satisfies the distribution . denotes the set of 1Lipschitz functions and represents the expectation. is a suitable penalty weight.
By contrast, WGANGP makes the training more stable and easier to train, so that we can use a more complex network to implement the tasks.
4.2 Mathematical Framework of FWIGAN
Let the region of interest be discretized into pixels, a general formulation of Eq (1) reads as
(7) 
where denotes the discretized velocity model, is the simulated seismic data of size , and represents the wave equation operator. In addition, is the noisefree observed data.
Let be a compact metric set and Prob(
) denote the space of probability measures defined on
. Prob() are two distributions. The EarthMover distance or Wasserstein1 is given by(8) 
where is the set of all adjoint distribution whose marginals are and , respectively. Using the KantorovichRubinstein duality [villani2009optimal], we can simplify the calculation to
(9) 
where the supremum is over all the 1Lipschitz functions .
The goal of our task is to estimate the velocity distribution , and source peak frequency if needed, whose simulated seismic data computed according to Eq (7) is consistent with the observed data of the true velocity distribution
. We assume the conditional probability density function of seismic data
given a velocity model is defined as . Therefore the inversion task can be formulated as the optimization problem(10) 
where denotes some distance metric between two distributions. By using the Wasserstein1 distance ( Eq (8)), the minimization of Eq (10) translates as
(11) 
where and
. Note that it is necessary to do the normalization to turn seismic signals into probability measures such that the theory of optimal transport applies. We adopt the strategy of data normalization via a linear transformation and rescaling proposed in
[yang2018application] as follows(12) 
where is a constant such that . In practice, we prefer to choose approximate 1.1 times absolute value of the minimum value of the observed data. This prepsocessing can ensure the positivity of both simulated and observed seismic signals. After that, we rescale all signals to unit mass 1. The normalized data will be the candidate fed into the critic.
According to the theory of WGANGP, we can estimate the velocity model by optimizing the cost function
(13) 
where denotes the critic ( a neural network) with parameters . The Lipschitz constraint is enforced by penalizing the norm of the gradient of output w.r.t. its input. is implicitly defined as sampling uniformly along straight lines between pairs of points sampled from the real data distribution and the generated data distribution , with uniformly sampled between 0 and 1. This is motivated by the fact that the optimal critic contains straight lines with gradient norm 1 connecting coupled points from and .
By replacing the expectation by the empirical values, the minmax optimization problem of Eq (13) leads to the following final formulation
(14) 
Here, denotes either entire commonshot seismograms or minibatches sampled from full . is the normalized real observed data according to Eq (12) and is the normalized simulated data obtained from the current velocity model and source peak frequency by physics generator.
The schematic diagram of FWIGAN is shown in Figure 1. The workflow resembles that of original WGAN, with a mutation that the AWE operator, named as physics generator, substitute for the generative neural network. FWIGAN is driven by the competitive “training" of two sides: the critic discriminates the difference between the observed data and the generated data
, yet the physics generator aims to physically simulate seismic data from current velocity model. Gradients from the critic backpropagates to
to update it directly at each optimization step. Unlike the traditional FWI algorithms, FWIGAN attempts to reconstruct the variable by playing a minmax game to make the simulated data most closely match the real data. Thanks to the adversarial learning strategy, FWIGAN is able to produce highresolution and physicsconstraint predictions. The process is described in Algorithm 1.Note that the aforementioned observed data is assumed noisefree, which means it is generated from the groundtruth velocity model by using the same forward modeling algorithm of Eq (7) as simulated data. As with any FWI implementation, you are unlikely to get such nice results on real data. Therefore, we further explore the feasibility of FWIGAN for velocity inversion with noisy seismic data. Under assumption that the noise exists in the noisy observed data is AWGN that satisfy the distribution , so the noisy seismograms can be expressed as . Accordingly Eq (7) can be reformulated as
(15) 
where denotes the AWGN satisfy the distribution . For the consistency, we shall hence use to express the AWGN in Eq (15), therefore . We denote the conditional probability density function for the noisy scenario as
, the probability distribution of AWGN in noisy observed and simulated data are
and , respectively. Then given ( ), the following holds(16) 
Because the AWGN is statistically independent of the seismic signal, then the distribution of the summation of two random variables has the following equality
(17) 
That is, the probability distribution of the sum of two or more independent random variables is equal to the convolution (denoted by ) of their individual distributions [hogg2005introduction]. This implies that
(18) 
where
is the Fourier transform. We assume
has a full support in Fourier domain, , thus we can obtain(19) 
From this, it is easy to see that Eq (16) holds when .
Taking advantages of AD, we can also parameterize the noise level and estimate its value during the training process. Although the critic distinguishes the difference between the noisy real data and the noisy generated data, the framework of FWIGAN is still able to capture the “denoised" velocity distribution if approximates to . The modified scheme is shown in Figure 2. It is worthy to note that the variable would be added to the unknown parameters when we also restore the noise level. Hence, the loss function we optimize corresponds to the noisy configuration is
(20) 
where is the uniformed sampling corresponding to the noisy observed and simulated data.
4.3 The Critic of FWIGAN
Indeed, the critic of FWIGAN is similar to those used in the standard WGANs. It plays a role which is close to the discriminator of GANs, but without the sigmoid function and outputs a scalar score rather than a probability. This score can be interpreted as how real the input images are through measuring the difference between real data and fake data generated by the physics generator. Driven by the adversarial learning scheme, the parameters of critic will be adjusted according to the backpropagation of the loss. Meanwhile, it will provide the feedback information to update the velocity model to fit the fake data with real data well.
We design a CNN based on that of WGAN, the network consists of repeated applications of six convolutional blocks and two fully connected layers. The components of one block are

A () maxpooling layer with stride ()

A leaky ReLU function with negative slope of 0.1
The maxpooling layer leads to downsampling such that the size of the feature map is halved. The feature map of last convolutional block is flatten to a vector, then fed into a fully connected layer with neurons of 2000 followed by the leaky ReLU function. The final output is obtained in the form of a scalar. Note that the batch normalization is omitted to safeguard the gradient penalty strategy. Furthermore, the number of channels in first convolutional block is 32, then is doubled in the next block successively. We set the input channels is equal to the number of the shots of minibatch seismic data. The architecture of the critic adopted for FWI is described in Figure
3.5 Numerical Experiments
In this section, we present experiments that validate our proposed method (FWIGAN) on 2D synthetic models. We compare FWIGAN against the stateoftheart FWI method which is implemented by using Adam optimizer to minimize the leastsquares function. Since source excitation is generally unknown and should be considered as an estimated parameter in seismic inversion like timedomain FWI, we did the tasks of velocity inversion and source wavelet estimation ( reconstruction of dominant frequency of wavelet) for three models simultaneously. In the inversion process, we avoid the use of techniques such as adding regularization. The optimization is performed on a desktop workstation (Nvidia GeForce GPU, Ubuntu operating system) and implemented on PyTorch, especially the toolbox named Deepwave^{1}^{1}1The link for Deepwave is available from https://github.com/ar4/deepwave [richardson2018seismic]. It provides wave propagation modules and allows to automatically reconstruct the unknown variables owing to the chain operations and backpropagation.
5.1 Data Preparation
To illustrate the availability of our proposed method, we test the inversion performance for three wellknown 2D synthetic models which are suitable for a wide variety of geophysical research, that is Marmousi model [versteeg1994marmousi], Marmousi2 model [martin2006marmousi2], and Overthrust model [aminzadeh1994seg]. With the current limitation of computational equipment, we downsampled the model size and aim to restore the rescaled models.
Marmousi Model: One of the most famous and standard acoustic velocity models in exploration geophysics, was created by the French Institute of Petroleum in 1988 and its geometry is designed based on seismic profiles in the Cuanza Basin, Angola. We resampled the Pwave velocity model, subsequently, the size is with a spatial grid increment of 0.03 km. The velocity ranges from 1472 m/s to 5772 m/s. The true model is shown in Figure 5.
Marmousi2 Model: An updated version based on the original Marmousi structure, but is extended in width and depth, and is made fully elastic. The original Marmousi model is close to the center of Marmousi2. By contrast, Marmousi2 is better to simulate longoffset acquisition in a deepwater setting. In our work we use only the Pwave velocity for data simulation and inversion. The downsampled model size is and its velocity is ranging from 1140 m/s to 4700 m/s. The first 16 grid in depth represent the structure of water with a velocity value of 1500 m/s. We set the spacial increment as 0.03 km and the true model is illustrated in Figure 6.
Overthrust Model: A 3D geological model that was created as part of a multiphase collaboration between the SEG and EAGE, we take 2D slice of original model and attempt to reconstruct it. The model is overlain by a flat sedimentary layer underneath the sea bed. It represents a varying degree of complexity, with a central thrust faulted anticline, and an external monocline and flat zone. The resampled model size is with a spacial increments of 0.03 km. The velocity value is in the range of 2360 m/s  6000 m/s. The true model is present in Figure 7.
5.2 Parameter Setting
Forward Modeling: In order to simulate the wave propagation constraint by AWE, a regular grid and finite difference scheme [strikwerda2004finite], with secondorder accuracy in time domain and forthorder accuracy in space domain, were applied for the forward modeling of FWI and FWIGAN. To prevent unwanted reflections from the edges of the simulation domain, we used a perfectly matched layer (PML) [komatitsch2003perfectly] absorbing boundary condition. A Ricker wavelet with a peak frequency of 7 Hz was chosen as the true source wavelet. The time interval for forward simulation was 3 ms. In terms of three different models, the total recording time was set as 6s, 7.5s, and 3.6s, and thirty sources were equally placed on the surface () with a horizontal interval of 0.3 km, 0.329 km, and 0.29 km, respectively. Accordingly, receivers were evenly arranged on the surface with a interval of 0.29 km. The number of receivers is equal to the width of the model. Three normalized commonsource gathers generated from Overthrust model are shown in Figure 4.
In addition, to make the observation more realistic, we added AWGN to the noisefree seismic data . The noise was added such that the input SNR = 10dB, where SNR. Three examples are also displayed in Figure 4. Note that the seismograms generated from true velocity models by using aforementioned forward modeling algorithm are regraded as the true observed data for inversion task. Same simulation scheme is adopted to act as the physics generator to produce the simulated data. Similar for the noisy case.
FWI with norm: Be benefit from the automatic differentiation of PyTorch, it’s easy to automatically adjust the parameters when the differentiated functions are given. Seismic FWI can be well carried out by using Adam optimizer with minibatches data. The initial model guess is obtained by Gaussian smoothed function with the true model (see Figure 5Figure 7). According to the guidelines provided by the researchers, we did normalization processing that consists of division of maximum absolute value and rescaling for both observed and simulated minibatches data. The batch size was set as 1 and the learning rate we used for updating velocity models was 50. The algorithm was run for 800 epochs while the learning rate was decreased by 0.5 at 100th and 200th epoch. The minimum value constraint was adopted to improve the inversion performance.
5.3 Optimization Strategy of FWIGAN
For the implementation of FWIGAN, we did data normalization according to Eq (12) for each minibatches data during the training process. The initial model guess and true observed data are as same as those of FWI. The optimization was realize by using the same Adam optimizer. We selected the learning rate of 5 for updating the velocity model and for changing the parameters of critic. The weight for the gradient penalty term was kept to
. Similarly, we decreased the learning rate with a decay of 0.5 at every 100 epochs and the training process was run for 300 epochs totally. The hyperparameters of the critic were initialized to default values by PyTorch. It was trained 6 times (
) for each updating of model. The batch size was set as 5 which means thirty sources were split into 6 batches and each batch had 5 commonshot gathers. In our experiments, the number of the batch size did not significantly impact the inversion performance once the training converges.For the backpropagation, the value of the gradients for velocity model were clipped to a maximum value of 10 for noisefree case and for noisy case. Meanwhile, the norm of the gradients for critic were also clipped to a maximum value of for noisefree case and for noisy case.
When the observed data is noisy, we can learn the distribution of noise during the training. The initial guess of SNR was 20dB and the learning rate for updating it was 1. The other optimized strategies were as same as those of velocity model.
5.4 Quantitative Evaluation
In order to assess the quality of our proposed method, we quantitatively evaluate the quality of the inverted velocity model w.r.t. the groundtruth . We compute the structural similarity (SSIM) defined as
(21) 
where , , , , and
are the local means, standard deviations, and crosscovariance for images
, , respectively. The regularization constants and evade instabilities over image regions where the local mean or standard deviation is vanishing.In addition, we compute the relative error between the inverted velocity model and groundtruth defined as
(22) 
where denotes the norm. The higher SSIM and lower ERROR indicate the better reconstruction.
5.5 Results of Marmousi Model
The comparable inversion results by FWI with leastsquares minimization and the proposed method (FWIGAN) are illustrated in Figure 5. As expected, with a relatively good starting model, FWI yields a reasonable result (see second row of Figure 5). However, it produces underestimated velocity value in the deep area due to a lack of useful information of deep structures contained in the observation. On the contrary, our framework with adversarial learning strategy faithfully restore the velocity model for this scenerio, especially the deep structures.
To further assess the capacity of FWIGAN to invert the models when the seismic data is contaminated with noise, we proposed to learn the distribution of noise beside the framework simultaneously. The performance is visually present in Figure 5 (see third row). The estimated velocity by FWI seems well. By contrast, FWIGAN is more robust to obtain the denoised inversion result. In addition, we summarize in Table 1 and Table 2 the SSIM and ERROR values which indicate our observation. The estimated SNR (see Table 3) is 10.36 dB that is close to the true SNR. It shows that our method is able to handle the realistic applications.
Metric  Method  Marmousi  Marmousi2  Overthrust 

SSIM  FWI  0.7072  0.6518  0.7802 
FWIGAN  0.8545  0.8099  0.7966  
ERROR  FWI  0.1362  0.1067  0.0355 
FWIGAN  0.0876  0.0678  0.0269 
Metric  Method  Marmousi  Marmousi2  Overthrust 

SSIM  FWI  0.6817  0.6381  0.7611 
FWIGAN  0.8357  0.7324  0.7722  
ERROR  FWI  0.1366  0.1068  0.0360 
FWIGAN  0.0825  0.0855  0.0310 
Marmousi  Marmousi2  Overthrust  
SNR (dB)  10.36  10.11  10.29 
Estimated signaltonoise ratio (SNR) by using the proposed method (FWIGAN) when the seismic data is contaminated with additive white Gaussian noise. The true SNR is 10dB.
5.6 Results of Marmousi2 Model
Additionally, we assessed how the proposed framework performs for Marmousi2 model. The performance of reconstructions with different approaches is shown in Figure 6. Notably, the solution found by FWI exhibits inaccurate velocity values and smooth structures which do not fit the true model. On the contrary, FWIGAN generally obtains satisfying result that is consistent with ground truth. In particular, the structures below the sand at the center and the hull in deep layers.
5.7 Results of Overthrust Model
In Figure 7, we give the inversion results by applying FWI and FWIGAN for Overthrust model. The SSIM and ERROR are also shown in Table 1 and Table 2 for both noisefree and noisy cases. Similarly, FWI leads to an acceptable result but our method offers a slight increase in the quality of the reconstruction for any configuration. These results suggest that the proposed scheme is robust to the complexity of geological structures and the missing information of deep layers. Meanwhile, it’s able to attenuate the noise so that the denoised inversion results can be well produced.
6 Discussion
We introduced an robust and general pipeline for FWI motivated by distribution matching. Based on our experiments, it appears that the results of FWIGAN are generally superior to those of FWI no matter the measurement is clean or noisy, which make the proposed framework of interest for practitioners. To further demonstrate the flexibility and feasibility of FWIGAN, the discussion about estimation of source wavelet, sensitivity to different initial models, and future works are provided in this section.
6.1 Estimation of source wavelet
As FWI applies a wave equation to delineate the subsurface velocity distribution according to the best fit between the simulated and observed data, the source wavelet estimation is one of the essential ingredients that may strongly influence the seismic inversion results [hu2017demodulation]. Usually, the wavelet amplitude and phase spectra are estimated statistically from seismic data. Last decades, a great number of works have been introduced to extract the source wavelet. In [frazer1998new], the authors described a new objective function to invert sonic waveforms with an unknown source by using the receiver functions to interpret welllogged sonic data. Since the mapping between the source and the seismic wavefield is linear, the source function can be derived in FWI once the incident wavefield have been modeled. Then the medium and source are updated alternatively according to the cost [virieux2009overview]. To shy away the troublesome source inversion, developed frameworks have been designed both in frequency and time domains so the velocity inversion is independent from the source [lee2003source, choi2011source]. Another alternative of source wavelet estimation is to solve a nonlinear inverse problem by utilizing 1D wave equation and 2D prestack seismic data. This approach avoid any kind of seismic processing due to the source wavelet is the parameter to be reconstructed by optimization. Aside from these methods, diverse new strategies are adopted to improve the performance.
In practical situations, the source signature of observed data can be obtained either by windowing a part of the early arrivals [yu2014application] or by predicting along with the velocity structure in the inversion algorithm. However, it is hard to invert a good source wavelet without knowing the exact nearsurface velocity distribution. Therefore the assessed source signature can deviate fundamentally from the real source and may results in the final imaging is severely polluted with unacceptable artifacts.
Similar to reconstruct the velocity model by using the gradientdescent optimization strategy, the source wavelet can be forecast simultaneously in FWI workflow. It is worthy to note that inversion tasks of both velocity models and source wavelet is strenuous in FWI, for the sake of simplifying the inversion problem, we assume that the initial source guess is shifted in frequency from the true one and both true and predicted source wavelet are Ricker wavelet [ricker1951form]. Consider the new inversion task, we can rewrite the cost function of classical FWI as
(23) 
where is the peak frequency of the estimated source wavelet. Similarly, this variable will be adjusted like velocity model during the training process in the paradigm of FWIGAN.
For all experiments present in Section 5, we actually estimated the source wavelet at the same time. We used different learning rates for the model and source inversions as they have very different scales. In practice, the learning rate for source inversion was set as and was decreased by the same strategy of model. The Adam optimizer was chosen for optimization. After the whole training, FWIGAN is able to return a reasonably inversion of the model and source. In Figure 8, we show the source wavelet estimation performance of FWI and our method. By contrast, the proposed method leads to restoration that fit the true source better than that by FWI, especially the peak value. These results demonstrate that FWIGAN is capable of achieving multiparameters inversion that general in piratical applications.
6.2 Sensitivity to Initial Model
While the aforementioned results are encouraging, unfortunately, FWI at present can be attacked through local optimization algorithms so that building an accurate starting model for FWI remains one of the most topical issues. To explore the sensitivity of FWIGAN to the accuracy of different initial models, we extensively test the proposed framework with different smoothed models and linear model.
It is worthy to note that in our work, we attempt to introduce a novel paradigm that integrating deep learning with physics equation to achieve the task of seismic inversion even when the conditions are difficult, instead of improving the conventional optimization algorithms for FWI. Therefore the comparisons with developed approaches such as multiscale frequency inversion for FWI with norm or other misfit function will be fixed in a future.
6.2.1 Inversion Results with Smoothed Model
We give in Figure 9 the inversion performance of Marmousi2 model by applying FWI and FWIGAN. The starting model is built by Gaussian smoothed function with a higher standard deviation, which is highly smoothed and far from the true model. Notably, FWI with norm dramatically underestimates the velocity distribution and fails to invert the accurate shape of the geological structures for these two cases. The results have spurious highfrequency artifacts which is a typical localminima issue. On the contrary, our method obtains the highquality reconstructions that match the true model better. It faithfully recovers most features of Marmousi2 model with lower ERROR and higher SSIM values (see Table 4). These illustrative examples suggest that our proposed method can alleviate the trapping of localminima despite the starting model is an undesired candidate.
Metric  Method  Initial model 1  Initial model 2 

SSIM  FWI  0.1772  0.1442 
FWIGAN  0.7939  0.7567  
ERROR  FWI  0.1870  0.1929 
FWIGAN  0.0789  0.1032 
6.2.2 Inversion Results with Linear Model
Furthermore, we validate the effectiveness of the proposed framework with the linear initial model. The velocity value is increasing linearly in depth as , where is the starting velocity value on the surface, denotes the average increments, and is the depth. As shown in Figure 10, FWI produces blocky artifacts in all scenarios. It always fails to recover the correct structures, which points out the sensitivity of norm to the mismatch between the initial model and ground truth. However, our method still recovers reconstructions with more accurate structures and velocity values. It performs well for most area except few parts at the bottom where are hard to precisely invert usually.
For clear comparison, the velocity profiles of true, initial, FWI, and FWIGAN inverted models at two horizontal locations for three models are visually illustrated in Figure 11. By contrast, our proposed method leads to remarkable improvements such that the estimated velocities fit the true velocities better than those by FWI. The SSIM and ERROR measures shown in Table 5 demonstrate our observation as well.
We emphasize that it may not perform well for all types of initial models, but these experiments show that the inversion by FWIGAN may still be satisfactory and reasonable when the starting model is far from the true model.
Metric  Method  Marmousi  Marmousi2  Overthrust 

SSIM  FWI  0.3836  0.1466  0.2690 
FWIGAN  0.7600  0.6527  0.7648  
ERROR  FWI  0.1970  0.1996  0.1137 
FWIGAN  0.1476  0.1281  0.0414 
6.3 Future Works
Currently, the forward wavefield need to be saved in memory for use during backpropagation. It means the realistic 2D largescale and 3D surveys will probably require more memory than is available. In addition, the design of the critic with complex architecture needs much memory as well. This computational limitation hinders the application of the proposed framework for practical models. The way to overcome this difficulty is one important direction of the future works.
Based on our experiments, we observe that the paradigms that rely on the DSL has the capacity of reconstructing multiparameters owing to the AD and chain operations. Therefore it’s easy to target at elastic full waveform inversion which is more accurate to simulate the wave propagation in isotropic data. However, the required storing memory for elastic wave equation is much more than that of acoustic wave equation, so the strategy to reduce the computational memory is a significant event.
Additionally, in terms of misfit function for FWI, adding regularization term is one usual way to improve the results and helps the optimization algorithm to overcome the localminima problem. At present, we only adopt the Wasserstein1 distance that is mostly used in DL field. For a future research, we will further explore different distance measure combined with regularization to tackle the difficulties of FWI.
7 Conclusion
We proposed a promising unsupervised learning paradigm FWIGAN that takes advantages of partial differential equations and deep learning techniques for twodimensional seismic full waveform inversion. Motivated by the competitive training of Wasserstein generative adversarial learning, we integrate the acoustic wave equation with a neural network such that the physics generator is responsible for generating the physically constraint wavefield from current velocity estimation, and the critic discriminates the quality between observed and simulated data via distribution matching way. The goal of our framework is to recover the velocity model by making the distribution of simulated data as close as possible to that of observed data. It needs no ground truth nor pretraining of networks, and is flexible owing to the automatic differentiation so that requires minimal user interaction. We have validated our approach on wellknown synthetic Marmousi, Marmousi2, and Overthrust models with diverse challenging conditions such as an undesired initial model and noisy measurements. Numerical experiments have demonstrated that the proposed method outperforms classical FWI algorithm with leastsquares minimization in all configurations. Moreover, this modeldata adjointlearning pipeline is able to achieve multiparameters inversion simultaneously in FWI workflow. The preliminary results suggest that our method is an appealing solution to alleviate the localminima issues which make it of interest for practitioners.