## 1 Introduction

The choice of misfit function is one of the key ingredients for a successful application of full-waveform inversion (FWI) in practice. The conventional $L_2$-norm misfit compares the predicted and measured data sample by sample and is therefore prone to cycle skipping. State-of-the-art misfit functions, including optimal transport [Opt_cls1], the matching filter [Warner], or a combination of the two, i.e., the optimal transport of the matching filter (OTMF) [MF_OTMF], compare the data in a more global way to mitigate the cycle-skipping issue.

All of these misfit functions are hand-crafted. Although they have been applied successfully, their form is fixed, and they may fail for specific datasets. Modifying these methods or improving their performance tends to be difficult, as it requires an in-depth understanding of the algorithms as well as the data. To ease these difficulties, in this abstract we seek to learn a misfit function for FWI automatically using machine learning (ML); we refer to it as the ML-misfit. The proposed ML-misfit has three distinguishing features: 1) Rather than formulating the misfit function as a full black box, we design a specific neural network (NN) architecture for the ML-misfit, which in principle mimics reducing the mean and variance of the resulting matching filter distribution in the OTMF approach; 2) We introduce symmetry and hinge loss regularization terms to shape the resulting ML-misfit to be a metric (distance); 3) We use meta-learning (a kind of semi-supervised learning method) to train the resulting ML-misfit. In the following, we first introduce the architecture of the NN and then describe the training procedure using meta-learning. Afterward, we learn a convex misfit function for time-shifted signals and then invert for the well-known Marmousi model using the ML-misfit trained on randomly generated 2D horizontally layered models.

## 2 The Method

### 2.1 The architecture for the neural network

To better constrain the function space and stabilize the training of the NN, we use the following architecture for the ML-misfit $J_{\theta}(p, d)$:

$$J_{\theta}(p, d) = \frac{1}{2}\left\| \Phi(p, d; \theta) - \Phi(d, d; \theta) \right\|_2^2 + \frac{1}{2}\left\| \Phi(d, p; \theta) - \Phi(p, p; \theta) \right\|_2^2, \tag{1}$$

where $\Phi(p, d; \theta)$ represents an NN with inputs $p$ and $d$ (the predicted and measured data) in vector form (single trace), and its output is a vector of length two. Here, $\theta$ represents the neural network parameters, which we will train later. The form of the ML-misfit in Equation 1 is inspired by the OTMF misfit function [MF_OTMF].

The misfit function of Equation 1 consists of two terms. Due to symmetry, let us focus only on the first term, $\Phi(p, d; \theta) - \Phi(d, d; \theta)$. The network $\Phi$ takes two traces of data as input, $p$ and $d$ or $d$ and $d$, and outputs a two-dimensional vector, which is intuitively expected to resemble the mean and variance in the OTMF approach. We assume that the evaluation of $\Phi(d, d; \theta)$ results in zero mean and variance values when the two input traces are identical, while $\Phi(p, d; \theta)$ outputs different mean and variance values indicating the dissimilarity between the input traces $p$ and $d$. Of course, such outputs reflect our simplified assumption that the NN might mimic the OTMF objective function; as the network adapts to the data, the outputs could be more complex, with larger dimensions. Refer to [MF_OTMF] for more details regarding the OTMF misfit.

With the form introduced in Equation 1, we can verify that the ML-misfit satisfies the following rules for a metric (distance): $J_{\theta}(p, q) \ge 0$, $J_{\theta}(p, p) = 0$, and $J_{\theta}(p, q) = J_{\theta}(q, p)$, where $p$ and $q$ are arbitrary input vectors. The remaining important requirement for a metric is the "triangle inequality" rule:

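$$J_{\theta}(p, q) \le J_{\theta}(p, n) + J_{\theta}(n, q), \tag{2}$$

where $J_{\theta}(\cdot, \cdot)$ denotes the ML-misfit of Equation 1, $p$ and $q$ are input traces, and $n$ is an arbitrary input vector. The ML-misfit given by Equation 1 does not fulfill the "triangle inequality" rule automatically. Thus, we introduce a hinge loss regularization that penalizes violations of this condition:

$$R_{HL}(\theta) = \max\left(0,\; J_{\theta}(p, q) - J_{\theta}(p, n) - J_{\theta}(n, q)\right). \tag{3}$$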
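As an illustration only, the structure of the ML-misfit and of the hinge penalty can be sketched in plain Python. The sketch assumes the two-term form in which each term compares the network output for a pair of traces with its output for a pair of identical traces. The tiny fully connected network `phi`, its random weights, and all function names are hypothetical stand-ins, not the convolutional network used in this abstract.

```python
import math
import random


def make_params(n_samples, n_hidden=8, seed=0):
    """Random weights for a hypothetical two-layer stand-in network."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.3) for _ in range(2 * n_samples)]
          for _ in range(n_hidden)]
    w2 = [[rng.gauss(0.0, 0.3) for _ in range(n_hidden)] for _ in range(2)]
    return w1, w2


def phi(p, d, params):
    """Map two traces to a length-two output vector (stand-in for the NN)."""
    w1, w2 = params
    x = list(p) + list(d)  # the two traces act as one concatenated input
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def sq_dist(u, v):
    """Squared Euclidean distance between two short vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def ml_misfit(p, d, params):
    """Two-term symmetric misfit: zero when p == d, never negative."""
    first = sq_dist(phi(p, d, params), phi(d, d, params))
    second = sq_dist(phi(d, p, params), phi(p, p, params))
    return 0.5 * first + 0.5 * second


def hinge_triangle(p, q, n, params):
    """Positive only when the triangle inequality is violated for (p, q, n)."""
    violation = (ml_misfit(p, q, params)
                 - ml_misfit(p, n, params)
                 - ml_misfit(n, q, params))
    return max(0.0, violation)
```

With this construction, non-negativity, symmetry, and a zero value for identical traces hold for any weights, whereas the triangle inequality does not, which is why it enters the training as a penalty term.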

### 2.2 Training the neural network

In this section, we describe how to train the neural network of the ML-misfit defined in the previous section using meta-learning. In meta-learning, the training dataset is a series of tasks rather than the labeled data used in supervised learning problems such as classification. The training loss, referred to as the meta-loss function, measures how well the current neural network performs on those tasks. In our ML-misfit learning problem, the tasks are FWI applications, and we run many of them, e.g., using different models. Since the true models are available during training, we define the meta-loss as the normalized $L_2$ norm of the difference between the true and inverted models plus the hinge loss regularization, e.g.,

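$$\mathcal{L}(\theta) = \sum_{i'=i+1}^{i+k} E\left(m_{i'}, m_{\text{true}}\right) + \lambda R_{HL}(\theta), \tag{4}$$

where $E(m_{i'}, m_{\text{true}})$ denotes the normalized $L_2$ norm of the difference between the inverted model $m_{i'}$ and the true model $m_{\text{true}}$, $R_{HL}$ is the hinge loss regularization of Equation 3, $\lambda$ is the weighting parameter, and $k$ is the unroll integer, meaning that we update the NN parameters every $k$ iteration steps. The first term asks the ML-misfit to converge fast with the least model residuals and can thus mitigate the cycle-skipping problem. The second term seeks an ML-misfit that complies with the "triangle inequality" rule. We accumulate these loss values and backpropagate the residuals to update the neural network parameters.

Given the model $m_i$ at the current iteration $i$, we perform forward modeling to obtain the predicted data $p_i$. The derivative of the ML-misfit with respect to the predicted data gives the adjoint source (data residual):

$$\delta s_i = \frac{\partial J_{\theta}(p_i, d)}{\partial p_i}.$$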

We backpropagate the adjoint source to obtain the model perturbation $g_i$ for updating the model:

$$m_{i+1} = m_i - \gamma g_i,$$

where $m_i$ is the model at iteration $i$ and $\gamma$ is the step length. Using the updated model, we can simulate the predicted data again and iteratively repeat this process. The predicted data $p_{i+1}$ depend on the NN parameters $\theta$ through the model $m_{i+1}$, and $m_{i+1}$ depends on $\theta$ through the gradient $g_i$ and, further, through the adjoint source. By tracing these dependencies, we can compute the associated meta-loss of Equation 4 at each iteration and obtain the gradient for updating the NN parameters $\theta$ accordingly.

## 3 Learning a convex misfit function for the travel-time shifted signals

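Since the biggest issue with objective functions is cycle skipping, for initial testing we formulate a simplified "FWI" that optimizes a single parameter, i.e., the travel-time shift $\tau$. An assumed forward modeling produces a shifted Ricker wavelet representing the predicted data $p(t; \tau)$:

$$p(t; \tau) = \left[1 - 2\pi^2 f^2 (t - \tau)^2\right] e^{-\pi^2 f^2 (t - \tau)^2},$$

where $f$ is the dominant frequency. The meta-loss function of Equation 4 for training the NN is modified accordingly:

$$\mathcal{L}(\theta) = \sum_{i'=i+1}^{i+k} E\left(\tau_{i'}, \tau_{\text{true}}\right) + \lambda R_{HL}(\theta), \tag{5}$$

where $E(\tau_{i'}, \tau_{\text{true}})$ is the normalized difference between the inverted travel-time shift $\tau_{i'}$ and the true one $\tau_{\text{true}}$, and $\lambda$, $k$, and $R_{HL}$ are the weighting parameter, the unroll integer, and the hinge loss regularization, as in Equation 4.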
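The mechanics of this single-parameter problem can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code: the $L_2$ misfit is used as a stand-in where the ML-misfit would be plugged in, the gradient with respect to the travel-time shift is taken by finite differences instead of backpropagation, the model term of the meta-loss is left unnormalized, and the function names, step length, and starting values are hypothetical.

```python
import math


def ricker(t, tau, freq):
    """Ricker wavelet centred at travel time tau, dominant frequency freq."""
    a = (math.pi * freq * (t - tau)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


def forward(tau, freq, n_samples=70, dt=0.03):
    """Toy forward modelling: one trace holding a shifted Ricker wavelet."""
    return [ricker(j * dt, tau, freq) for j in range(n_samples)]


def l2_misfit(p, d):
    """Stand-in misfit; a learned misfit would replace this function."""
    return 0.5 * sum((pj - dj) ** 2 for pj, dj in zip(p, d))


def unrolled_inversion(tau0, tau_true, freq, misfit=l2_misfit,
                       n_unroll=10, step=2e-4, h=1e-4):
    """Run n_unroll gradient steps on tau and accumulate the model residual."""
    d = forward(tau_true, freq)
    tau = tau0
    meta_loss = 0.0
    for _ in range(n_unroll):
        # Central finite-difference gradient of the misfit w.r.t. tau.
        j_plus = misfit(forward(tau + h, freq), d)
        j_minus = misfit(forward(tau - h, freq), d)
        grad = (j_plus - j_minus) / (2.0 * h)
        tau -= step * grad
        # Model term of the meta-loss (unnormalized in this sketch).
        meta_loss += (tau - tau_true) ** 2
    return tau, meta_loss


# A start within half a cycle converges; a start far away does not move.
tau_near, loss_near = unrolled_inversion(1.05, 1.0, 6.0)
tau_far, loss_far = unrolled_inversion(1.30, 1.0, 6.0)
```

With the $L_2$ stand-in, the accumulated residual stays large for the distant start, which is the behaviour a learned misfit is trained to avoid by minimizing this quantity over many such tasks.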

In this example, we discretize the waveform of the record using 70 samples with a sampling interval of 0.03 s. The NN has five convolution layers and one fully connected layer. Its inputs are two vectors, i.e., one trace each of the predicted and measured data, which are treated as two channels for the first convolution layer. Each convolution layer is followed by a LeakyReLU activation function and then a MaxPooling layer. The channel number and kernel size are set individually for each of the five convolution layers. Both the kernel and stride sizes of the MaxPooling are set to two. The fully connected layer takes the flattened output of the last MaxPooling layer and outputs a vector of size two (which is supposed to represent the mean and variance).
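The trace length that reaches the fully connected layer follows from the five pooling stages. A minimal sketch, assuming that each convolution preserves the trace length (for example through "same" padding, which the text does not specify) so that only the MaxPooling layers shorten the trace; the function name is hypothetical:

```python
def length_after_pooling(n_samples, n_layers=5, pool_kernel=2, pool_stride=2):
    """Trace length after n_layers of unpadded max pooling."""
    length = n_samples
    for _ in range(n_layers):
        # Standard output-length rule for pooling without padding.
        length = (length - pool_kernel) // pool_stride + 1
    return length


# Samples left per channel for the 70-sample traces of this example.
samples_left = length_after_pooling(70)
```

Under this assumption, a 70-sample trace is reduced to two samples per channel, so the flattened input of the fully connected layer has twice as many entries as the last convolution layer has channels.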

In each epoch, we randomly generate 600 true and initial travel times between 0.4 s and 1.6 s, and the dominant frequency is randomly drawn between 3 and 10 Hz. The batch size for training is 60, i.e., we invert for 60 travel times simultaneously. We run 100 iterations for each inversion and update the neural network parameters every ten iterations (the unroll parameter $k = 10$). We use the RMSprop algorithm for training the neural network, with a constant learning rate of 1.0e-6. We create another 60 inversion problems for testing; in the testing dataset, the true travel times and the initial travel times for starting the inversion are also randomly generated between 0.4 s and 1.6 s and are kept fixed during the training.

Figure 1a shows the normalized meta-loss of Equation 5 over 250 epochs of training for the training and testing tasks. The continuous reduction of the loss for both sets of tasks indicates that the training converges.

To evaluate the convexity of the ML-misfit with respect to the travel-time shift, we compute the ML-misfit between a target signal and its shifted version for varying time shifts and compare it with the $L_2$ norm. In the computation, we set the dominant frequency to 6 Hz and the travel time of the target signal to 1.0 s, while the time shift with respect to the target signal varies from -0.6 s to 0.6 s. Figure 1b shows the resulting misfit curves. The $L_2$ norm shows obvious local minima. Although the initial, untrained ML-misfit (black curve) shows random behavior, after 250 epochs of training the ML-misfit (blue curve) shows considerably improved convexity with respect to the time shift. This demonstrates that our meta-learning scheme successfully learned a convex ML-misfit for travel-time shifted signals, which, as we expect, will reduce the corresponding cycle-skipping issue.
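The local minima of the $L_2$ curve can be reproduced with a few lines of Python. The sketch below uses the parameters quoted in this section (70 samples, 0.03 s sampling interval, 6 Hz, target at 1.0 s) and a standard Ricker wavelet. The function names, the restriction of the shifts to whole samples, and the tolerance used to detect minima are assumptions made for illustration, and the sketch does not evaluate the trained ML-misfit.

```python
import math


def ricker(t, tau, freq):
    """Ricker wavelet centred at travel time tau, dominant frequency freq."""
    a = (math.pi * freq * (t - tau)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


def l2_misfit_curve(freq=6.0, tau_target=1.0, n_samples=70, dt=0.03,
                    max_lag=20):
    """L2 misfit between a target wavelet and copies shifted by whole samples."""
    times = [j * dt for j in range(n_samples)]
    target = [ricker(t, tau_target, freq) for t in times]
    shifts, values = [], []
    for lag in range(-max_lag, max_lag + 1):  # -0.6 s to 0.6 s for dt = 0.03 s
        shift = lag * dt
        shifted = [ricker(t, tau_target + shift, freq) for t in times]
        shifts.append(shift)
        values.append(0.5 * sum((s - g) ** 2 for s, g in zip(shifted, target)))
    return shifts, values


def local_minima(shifts, values, tol=1e-6):
    """Shifts whose misfit lies clearly below that of both neighbours."""
    margin = tol * max(values)
    return [shifts[i] for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] - margin
            and values[i] < values[i + 1] - margin]


shifts, values = l2_misfit_curve()
minima = local_minima(shifts, values)
```

Besides the global minimum at zero shift, `minima` contains side minima where the shifted wavelet partially realigns with the target; a gradient-based inversion starting beyond them is attracted to the wrong shift.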

## 4 The Marmousi model example

We train the ML-misfit on randomly generated 2D horizontally layered models and apply the resulting learned misfit to the Marmousi model. For training, the model extends 2 km in depth and 8 km in distance, with a sampling interval of 40 m in both directions. We mimic a marine geologic setup with velocities between 1500 and 4200 m/s. The 2D layered models are generated randomly with a general increase of velocity with depth. The initial model for FWI is a highly smoothed version of the true model. We randomly generate 64 models and keep them for testing, and in each epoch we randomly generate 256 models for training. The NN architecture is the same as that in the time-shifted signal example. The input trace length for the neural network is 160 samples, which is the entire length of the record. To accommodate the complexity of seismic data, such as those corresponding to the Marmousi model, we increase the number of channels of the five convolution layers and modify the kernel size of each layer correspondingly.
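One way to draw such training models is sketched below. The text does not describe the actual generator, so everything beyond the stated model extent, grid interval, and velocity range is an assumption: the 50 by 200 grid (2 km and 8 km divided by 40 m), the number of layers, the linear background trend, the size of the random perturbation, and the function name are all hypothetical.

```python
import random


def random_layered_model(nz=50, nx=200, v_min=1500.0, v_max=4200.0,
                         n_layers=8, jitter=200.0, seed=None):
    """Horizontally layered velocity model (m/s) as a list of nz rows."""
    rng = random.Random(seed)
    # Distinct interface depths, expressed as grid indices below the surface.
    interfaces = sorted(rng.sample(range(1, nz), n_layers - 1))
    velocities = []
    for k in range(n_layers):
        # Linear increase with depth plus a bounded random perturbation.
        trend = v_min + (v_max - v_min) * k / (n_layers - 1)
        value = trend + rng.uniform(-jitter, jitter)
        velocities.append(min(v_max, max(v_min, value)))
    model = []
    layer = 0
    for iz in range(nz):
        # Move to the next layer once its top interface is reached.
        if layer < len(interfaces) and iz >= interfaces[layer]:
            layer += 1
        model.append([velocities[layer]] * nx)  # constant along distance
    return model


# One hypothetical training model; a new seed gives a new model.
model = random_layered_model(seed=1)
```

The perturbation allows local velocity reversals between neighbouring layers while keeping the overall increase with depth.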

The meta-loss function for updating the NN parameters is the same as that defined in Equation 4, and we update the parameters of the ML-misfit every 10 FWI iterations (unroll integer $k = 10$). We train the NN for 100 epochs. Figure 2a shows the normalized meta-loss over iterations; the reduction of the loss value suggests that the training converges.

We apply this trained ML-misfit to the modified Marmousi model. As in training, the Marmousi model extends 2 km in depth and 8 km in distance. We simulate 80 shots, and 400 receivers are spread evenly on the surface. In training, a 6.5 Hz central frequency is used in the inversion. To demonstrate that the learned ML-misfit can mitigate cycle skipping without low frequencies, we mute frequencies below 3 Hz. The true Marmousi model is shown in Figure 2b and the initial model in Figure 2c. To use the ML-misfit trained on the horizontally layered models, the size of the input must be consistent with that used in training. Thus, we simulate records up to 4.8 s, the same as in the training step.

Figures 3a and 3b show the inversion results using the $L_2$ norm and the ML-misfit objective functions, respectively. The result obtained with the $L_2$ norm shows obvious cycle-skipping features, while the ML-misfit yields a considerably improved result, recovering the low-wavenumber components of the model without cycle skipping, for example in the left part of the model.

## 5 Conclusions

We proposed to learn a robust misfit function, entitled ML-misfit, for FWI using meta-learning. A specific neural network architecture, as well as a hinge loss function, is used to shape the resulting ML-misfit to be a metric. We demonstrated the basic principle and the ability of the ML-misfit to learn a convex objective function, for example for simple travel-time shifted signals. Trained on randomly generated 2D horizontally layered models, the resulting ML-misfit can invert for the Marmousi model free of cycle skipping using data without frequencies below 3 Hz.