1 Introduction
In our attempt to discover the Earth by inverting measurements of its physical properties we have developed handcrafted algorithms to perform various tasks. These handcrafted (fixed) algorithms are independent of the specific features in the data, and usually, new research is required to update these algorithms to adapt to any evolution in the data to perform a specific task. Recently, datadriven approaches that have taken center stage in algorithms are replaced by dynamic neural network (NN) models trained to perform specific tasks. Machine learning (ML) is gaining a lot of traction as a tool to help us solve outstanding problems in seismic waveform data processing and interpretation. Most of the applications in our field have relied on supervised training of neural network (NN) models, where the labels (answers) are available Wrona et al. (2018); ArayaPolo et al. (2018); Ovcharenko et al. (2019); Di and AlRegib (2020). These answers are often available for synthetic data as we numerically control the experiment, or they are determined using human interpretation or human crafted algorithms applied to real data. The challenge in training our NN models on synthetic data is the generalization of the trained models to real data, as that process requires careful identification of the training set and the inclusion of realistic noise and other variables between synthetic and real data. In other words, the synthetic and real data are usually far from being drawn from the same distribution, which is essential for the success of a trained NN model Kouw (2018). Thus, many synthetically trained NN models have performed poorly on real data. On the other hand, training on real data provides models that are often, at best, as good as the accuracy of the labels that were determined by humans or humancrafted algorithms (weak supervision). So the datadriven feature of machine learning, in this case, will be highly weakened Zhou (2017).
Modeling is a general term encompassing any forward operation in which we have a model and we seek to obtain the corresponding data. The process can be as easy as a matrixvector multiplication (linear) that can be used, for example, to mute potentially missing traces from common shot gathers or image data for interpolation objectives, where the shot gathers or the images represent the model. Modeling can also include more complex operations like solving the wave equation using any numerical method
Aki and Richards (2002). For seismic inversion, modeling, in its simplest form, is often given by the convolution of the source wavelet with a reflectivity model Wang (2016). Modeling is a deterministic process, and thus, the inputoutput combination is often unique and can be determined numerically. As such, using modeling and simulation, we can generate ML training data with their corresponding labels in a straightforward manner. Usually, in this case, the synthetically generated data (like shot gathers with missing traces) represent the input to the NN network in an attempt to have it learn to output the model (which is correspondingly the full shot gather). For the trained NN models on such synthetic data to generalize to real (application) data, the synthetic training set has to include the features embedded in the real data as much as possible Kouw (2018). For one, the training dataset (inputs and labels) should be represented by distributions that include the input and expected labels for the real data. However, this requirement, especially with respect to the input data to the network, is hard to achieve considering the simplified assumptions we use in modeling and simulation. It requires that we accurately reproduce the correlated and uncorrelated noise () present in the real data, where here is the time (or any vertical axis like depth). More importantly, it also requires that we properly represent the source wavelet () and the reflectivity (), present in the real data. In other words, for the synthetic data in our simplified Earth model, the distributions of these three functions should cover those of the real data, and this is very hard to accomplish Birnie (2018). Here stands for the convolution process.The concept of trying to bridge the gap between the training and application data in machine learning is referred to as domain adaptation Kouw (2018); Lemberger and Panico (2020)
. In this case, the training dataset is assumed to belong to the source domain and the application/testing data are assumed to belong to the target domain, the target of our training. The classic theory of machine learning assumes that the application (target) data of a trained model should come from the same general population (sampled from the same distribution) as the training (source) set. So we need the probability distribution of the synthetic (source) dataset,
, where are the inputs (i.e. synthetic waveform data), andare the labels (i.e. traveltime picks or horizons) for the source set, to equal the probability distribution of the real (target) dataset,
, where are the inputs (i.e. real seismic data), and are the labels for the target set that we often seek to predict. One category of data adaptation is referred to as subspace mapping (or more generally, alignment) in which we find a transformation, , that results in the distribution of the training (source) input data to equal that of the application (target) input data Fernando et al. (2013). Specifically,. This can be accomplished by projecting the source and target data to the eigenvectors of the two subspaces, then finding a transformation between these projected spaces. Such projections can be achieved by Neural Network embedding aimed to find the weights of the embedding that minimizes the distance between the distribution of the source samples and the target ones. There are many ways to constrain the transformation or weights to make the distributions similar including the use of optimal transport
Villani (2008). In this category, even cycle Generative Adversarial Networks (GANs) are used for the purpose of learning a generator to map target data to source data
Gupta and Booher (2019). However, these methods become more difficult to apply when the dimensions of the data are large, as is the case with waveform (including seismic and ultrasound) data. The method proposed in this paper shares the general framework of subspace alignment implemented in an empirical fashion.For seismic applications, as well as many other applications, we often acquire waveform data with sensors placed on the surface of the investigated body. In imaging applications, such data often loosely constrain the vertical (depth) axis (mostly described by the behavior of features along recording channels), and thus, we face issues in imaging events accurately in depth. In this case, wells are often used to correct the depth misties as wells are considered ground truth markers Luo et al. (2019)
. Nevertheless, the structural information in these images is provided by the seismic (waveform) data. In applications, not requiring precise vertical dimension labeling (scalewise), crosscorrelation of the input data with a reference trace from the same data has recently been suggested to help reduce the variance in the distributions of the training and testing data, which ultimately helped with the application of ML for direct waveform microseismic event location
Wang and Alkhalifah (2021). Prior to that, Choi and Alkhalifah (2011) used the process of convolving a reference trace from the observed data with the synthetic data, and conversely convolving a reference trace from the synthetic data with the observed data in a waveform optimization problem to invert for the velocity model. This process helped reduce the difference between the terms of the objective function, and especially in mitigating the waveform source effect. In the sprite of these two processes, we propose a combination that could help us adapt the NN model training on synthetic data to work on real data.An objective of a neural network model is to provide us with an output for a given input. The output that a trained model will give is based on the training it experienced, and that depends mostly on the source training set and its distribution. A trained neural network model generalizes well when the target data are represented, as much as possible, in the source data set. To help accomplish that when the application (target) data are real (field) data, we propose, here, to inject as much of the real data features into the synthetic data training as possible. This can be accomplished by utilizing a combination of linear operations including crosscorrelation, autocorrelation, and convolution between the synthetic and field data. These operations will bring the distributions of the training (source) synthetic dataset, and the (target) real dataset closer to each other, which will help the trained model generalize better on real data. One real data generalization example we use here to test the approach is an NN model dedicated to locating microseismic sources directly from recorded waveform data. We also test the approach on an NN model trained to predict lowfrequency data from highfrequency data to ultimately help full waveform inversion (FWI) converge better.
2 An empirical Data projection
The method proposed here is applicable mainly to supervised learning. Also, we assume that the vertical axis of the input sections (images or shot gathers) are not crucial in absolute values to the task at hand. Only the relative relation between events matters along that dimension. The reason for this assumption will be clear later.
2.1 Setup
We assume we do not have labels for the application data, and thus, we cannot perform transfer learning (a form of domain adaptation). In our case, the source synthetic data are labeled, but the target real data are not. This form of domain adaptation is often addressed with unsupervised ML methods. In such domain adaptation, another important assumption is implied, and that is the target labels are drawn from the same distribution as the source labels (
). This is an important assumption for the application of synthetically trained NN models on real data. Of course, this requires that the task we plan to perform on the field data is represented by the synthetic training data. Specifically, . In physical terms, this implies that the modeling we do for our synthetic data generation represents the actual physical behavior. In other words, the assumptions used to model the synthetic training set represent the object of application (like the Earth). Thus, the issue we are addressing here is the case when the input distributions for the source and target data are not the same, specifically .This form of difference between the training and application datasets distributions in machine learning circles is referred to as a covariate shift Fernando et al. (2013). There are many ways to measure such a shift, including using the KullbackLebler (KL) divergence metric. In domain adaptation, and similar to error bounds defined for machine learning in general, we can define an error bound on the application of a trained network model. This error bound is given by the combination of the error in the training and a term related to the complexity of the NN model (like its size). This, however, assumes that the training and application data come from the same distribution. For the case of a covariate shift, we have a similar, more complicated bound, given by BenDavid et al. (2007); Lemberger and Panico (2020):
(1) 
where is the bound on the training error, and is the distance between the marginal distributions of source and target datasets. Here, represents the optimal joint errors of a neural network model between the source and target datasets. So the upper bound of the application error is guided by these three terms.
So our objective is to devise a transformation that minimizes the difference measure in equation 1 between the distribution of the training (source) and testing/application (target) datasets. Specifically, we aim to find the transformation on the source data set and on the target data set so that the probability distributions . Figure LABEL:fig:diagramD summarizes this goal within the framework of the training process. So the input for the training of the Neural network () model is , in which the model parameters are optimized to match the labels
using a loss function (
). On the other hand, the input during inference is . We will discuss the transformations and in the next section, where the input are waveform (i.e. seismic) data .diagramDwidth=0.95The workflow of the proposed data adaptation in which the training (source) dataset might have a different distribution than the application (target) dataset, where transformations and will help reduce such differences and provide new data as input to the neural network function () to train the network to reduce loss , and then apply to real data. Here, are the probability distributions and we show schematic versions of them for the source and target datasets.
2.2 Conditioning synthetic data for training
A trace in the seismic data can be represented by a combination of reflectively, source wavelet and noise, as follows:
(2) 
where is the index of the trace, and is the index of the section in which the trace belongs to whether the section corresponds to a shot gather or a seismic image. Depending on the data, all three components can vary over traces and sections. Here, may represent time or depth depending on whether the inputs are shot gathers or depth images, respectively. For a shot gather for example, often changes with moveout, and of course, changes from one trace to another. In training a neural network model to work properly on d(t), we often generate synthetic data, that hopefully includes a proper representation of these components.
To do that, we use transformations applied on the vertical axis (time or depth), and thus, as mentioned earlier, we assume that the scale of the vertical axis is not crucial to the task. This assumption, though limiting the application of this approach, will allow us to apply the necessary transformation for domain adaptation. We will elaborate in the discussion section on the effect that this assumption may have on the application of this approach. To migrate the components described in equation 2 from the real data to the synthetic ones, we use linear operations, and thus, we define new training data (source data transformed by the proposed approach) as follows:
(3) 
where is the index of a reference trace from the synthetic input data (fixed for both synthetic and real data), and is the index of a section from the real data whether the section corresponds to a shot gather or a seismic image. The operator represents crosscorrelation, and in this equation, we have a crosscorrelation between the input synthetic section and a reference trace from that section convolved with a randomly drawn () autocorrelated section from the real data . The reference trace can be a near offset trace in the case of a shot gather input, or it can be any trace from the input for a seismic image. The index of the reference trace should not change between sections to maintain the relative relation between sections. The randomly picked
index for autocorrelated real data varies in the training per epoch to allow for proper representation of the real data imprint on the training set. Assuming we only have noise in the real data, the autocorrelation of random noise yields a quasi delta function at zero lag proportional to the energy of the noise. A convolution with such a function will incorporate that energy into the synthetic data so that the signaltonoise ratio (SNR) in the transformed synthetic data would be comparable to that of the autocorrelated real data.
To allow for a transformed source data to have a similar distribution to the application data, we also transform the target (application) data in a similar fashion in which the features of the source synthetic data are incorporated in the target data. To achieve that, we apply the following transformation, , to the real data during the inference stage:
(4) 
which includes the same operations as in equation 3 with the role of the real and synthetic data reversed. In this case, represents the number of synthetic sections in the training set, and we actually convolve with the mean of the autocorrelated synthetic data, instead of a randomly drawn sample, like in the training. We will refer to the transformations in equations 3 and 4 as {MLReal transformations}. The idea of having two instances of each data (synthetic and real) in equations 3 and 4 is to balance their contribution to the new training and testing data sets. This way, we match the properties of the synthetic and real data used for training and inference, respectively. Note that the convolution operation that connects equal amounts of the synthetic and real data contributions is commutative. We will see the value of this operation more in the next section.
Meanwhile, Figure LABEL:fig:traintestData2FF (left) demonstrates the process of applying equation 3, where the synthetic data were generated for the training of a NN model to predict low frequencies Ovcharenko et al. (2019). The objective was to improve the full waveform inversion convergence for real marine data by adding lowfrequency content into the data. On the other hand, Figure LABEL:fig:traintestData2FF (right) demonstrates the process of applying equation 4 on real data for an input to the trained model. Note that the resulting shot gathers look similar for the two processes, and especially in the distribution of energy along the channels. They, however, still maintain the characteristics (the moveout) of the original data (synthetic or real). However, they contain more energy and more features in which the NN model can utilize. We will show the results of applying the proposed process for this task in the examples.
traintestData2FFwidth=1.05The workflow chart for the MLReal method applied to marine data. Left side: The proposed process used for producing the training data; Right side: the proposed process used for producing the testing/application data. The circled cross symbol denotes a crosscorrelation operation, and the star symbol denotes a convolution operation. The dashed vertical yellow line in the input traces indicates the location of the reference trace.
2.3 A frequency domain analysis
If we transform the real data to the frequency (or vertical wavenumber) domain, considering the basic laws of the Fourier representation of crosscorrelation and convolution, equation 2 can be written as
(5) 
where is the angular frequency, and
are the amplitude and phase, respectively, of the complexvalued data. All capital letters represent the frequency domain form of the reflectivity, source, and noise functions in equation
5, given respectively. As a result, we can write equation 3 in the frequency domain as(6)  
where the overstrike, , symbol stands for the complex conjugate.
The application of the model on real data will involve an input to the model given by equation 4, which can be represented in the frequency domain by
(7)  
Note that the frequency content of the input to the training and the application on real data will include a squared amplitude () of the real data in an even manner. Also, the key here is that equations 6 and 7 share a similar magnitude of noise, reflectivity, and source signature, or in more general terms, similar energy. Since the two data sets, after transformation, share the same elements sampled from their corresponding distributions, then the distributions of the two data sets should be close, which allows us to satisfy the requirement for the generalization of the NN model. In other words, the second term in equation 1 will be small.
2.4 The distributions of the training and application data sets
We next evaluate how much did we manage to bring the training (source) and testing/application (target) datasets closer to each other with the proposed transformations. So the underlying assumption in supervised learning, which is a statistical exercise in mapping inputs to outputs, where inputs and outputs correspond to particular distributions, is that the training and application data (at least inputs) should fall within the same feature space, or in other words, come from the same distribution. In machine learning, domain adaptation is often handled through machine learning procedures, including cycle GAN and transfer learning. Here, considering we are dealing with waveforms in twodimensional (even 3D) space. we perform the transformation through
The key operation used here to tie the two data is the convolution process, which involves a multiplications of elements and since the multiplication process is cumulative, then the distribution of the new data is given by
(8) 
3 A microseismic data example
We will first show the impact of the above operations on real data acquired as part of monitoring microseismic events. The passive seismic acquisition was performed using a star configuration of sensors, as shown in Figure LABEL:fig:receivergeometry, to monitor a hydraulic fracturing stimulation of a shale gas reservoir in the Arkoma Basin in the United States. Figure LABEL:fig:raw_field_data shows the real data for one microseismic event. The shot gather section includes ten segments from the various lines (azimuths), shown in Figure LABEL:fig:raw_field_data, plotted here side by side. We were provided a total of 75 of these microseismic event recordings and the corresponding locations of the events were determined using conventional methods (we will use here only 10 of them). These labels (event locations) will help us evaluate the accuracy of our trained NN model. For more details on the data, we refer you to Staněk and Eisner (2017). We were also given a velocity model for the area, which is shown in Figure LABEL:fig:velocitymodelvp. Using this velocity model, we employ a secondorder in time and fourthorder in space, finite difference approximation to solve the acoustic wave equation and simulate wavefields from 5000 randomly placed seismic sources within the region of interest (the region we expect the real events to be located). The resulting 5000 synthetically recorded sections, using the layout in Figure LABEL:fig:receivergeometry
, and the corresponding event location (labels) are split in a random manner into a training set (4000 samples) and a validation set (1000 samples). An example synthetic data section, for an event near the one estimated for the real data section in Figure
LABEL:fig:raw_field_data, is shown in Figure LABEL:fig:raw_synthetic_data. If we compare this synthetically generated section with the field one, we can appreciate the large difference between the two data despite them sharing similar general shapes as they originate from nearby sources. We do not have the source time information, which explains the shift between the events in the two sections. We trained the neural network model on such synthetic data (used the crosscorrelation operation with a reference trace to mitigate the shift Wang and Alkhalifah (2021), and because of the large differences in the data between training and testing data, the accuracy of the location of the 10 events, as we will see later, is low.2receivergeometry,velocitymodelvpwidth=0.65a) The passive seismic acquisition lines. b) The velocity model estimated in the region and used here to generate the synthetic data.
2raw_field_data,raw_synthetic_datawidth=0.45a) The field recorded data for a single microseismic event along the 10 lines plotted side by side. b) The synthetic data along the same lines from a source near the field data one, which was provided for this event. The time of the source is unknown, which explains the shift.
3.1 Data transformations
We first calculate the crosscorrelation of the input section with a reference (fixed location for all data synthetic and real) trace from the section. The reference trace in this application is given by the first trace of each input section, or in other words, the first trace of the first line in the section. An example result of this operation applied to the sections shown in Figures LABEL:fig:raw_field_data and LABEL:fig:raw_synthetic_data, is given by Figures LABEL:fig:field_data_correlate_with_reference_trace and LABEL:fig:synthetic_data_correlate_with_reference_trace, respectively. This operation has largely mitigated the time shift between the field and synthetic data, as the vertical axes are now given by the time lag. We then calculate the autocorrelation of the field and synthetic data, a sample of which are shown in Figures LABEL:fig:auto_correlation_field_data and LABEL:fig:auto_correlation_synthetic_data corresponding to the sections shown in Figures LABEL:fig:raw_field_data and LABEL:fig:raw_synthetic_data, respectively. We can appreciate how much the autocorrelation of the field section carries information representing the source wavelet and the noise, in a zerophase fashion. Applying the operations involved in the MLReal transformations (equations 3 and 4) on the synthetic and real data, we obtain the sections shown in Figures LABEL:fig:field_data_convolve_with_mean_autocorrelation2 and LABEL:fig:synthetic_data_convolve_with_mean_autocorrelation2, respectively. We window the part around zero lag to reduce the size of the inputtothenetwork data. The two sections look much more alike than those in Figures LABEL:fig:raw_field_data and LABEL:fig:raw_synthetic_data.
2field_data_correlate_with_reference_trace,synthetic_data_correlate_with_reference_tracewidth=0.45a) The crosscorrelation of the field recorded data shown in Figure LABEL:fig:raw_field_data with the first trace in the section. b) The crosscorrelation of the synthetic data shown in Figure LABEL:fig:raw_synthetic_data with the first trace in the section.
2auto_correlation_field_data,auto_correlation_synthetic_datawidth=0.45a) The autocorrelation of the field recorded data shown in Figure LABEL:fig:raw_field_data. b) The autocorrelation of the synthetic data shown in Figure LABEL:fig:raw_synthetic_data.
2field_data_convolve_with_mean_autocorrelation2,synthetic_data_convolve_with_mean_autocorrelation2width=0.45a) An example section testing application data after applying equation 4 (the proposed transformation) for the section shown in Figure LABEL:fig:raw_field_data. b) The training input data corresponding to the section shown in Figure LABEL:fig:raw_synthetic_data after injecting it with real data information using the proposed method, and specifically equation 3.
3.2 The training
Using sections like that shown in Figure LABEL:fig:synthetic_data_convolve_with_mean_autocorrelation2
in which the location of the source (as the label) is known from modeling, we train a 14layer convolutional neural network to predict the location of the microseismic source (
, , and ). The training included 4000 input sections (samples, modeled synthetically) and their corresponding labels (the training set), and the training was executed over 5000 epochs, single batch, using an Adam optimizer Kingma and Ba (2014). The convolution with randomly selected auto correlated sections from the real data is done at every one of the 5000 epochs. The random selection allows for more variance in the information extracted from the real data to be present in the training samples. The loss function for the training and validation, shown in Figures LABEL:fig:loss_training and LABEL:fig:loss_validation, respectively, show good convergence. As we may expect due to the data being recorded at the surface, the error in the horizontal location of an event is far less than that for the vertical location. Thus, the lateral resolution in locating the event is expected to be higher.2loss_training,loss_validationwidth=0.8a) The training loss for the separate coordinate components as well as the total loss (distance). b) The validation losses.
3.3 Results
Then we input 10 real data sections, after applying the operations in equation 4, into the trained model to evaluate the accuracy of the prediction. Figure LABEL:fig:event_location_3D_given_vs_autocor_vs_no_autocor_new shows the predicted locations (in blue) and the provided ones (in red) in the region of investigation. We also show the predicted without convolving with the autocorrelation, just the crosscorrelation with a reference trace Wang and Alkhalifah (2021). The differences are generally small and can be caused by many factors. For one, the NN model is known to have a bias toward smoothing the output Rahaman et al. (2019). A very small network Wang et al. (2020) will levitate towards the mean of the training labels. Of course, a cure for that is to increase the network size. Another reason for the difference could be the simplified assumptions used in our data simulation for the training of our NN model compared to the modeling approach potentially used in determining the location of the events by the data providers. To further evaluate the improvements in data coherency between training and application using the proposed MLReal transformations, we compute the normalized Euclidean distance (NED) between the new testing (real) sections and the corresponding synthetic ones. Figure LABEL:fig:LD_vs_NED shows NED for sections in which only the crosscorrelation with a reference trace is used to mitigate the shift (dashed blue), and those for sections in which MLReal transformations were used (solid blue) in the training. Though NED values can range between 0 to 1, the values are reasonably higher for our proposed method, compared to just crosscorrelating with a reference. We also plot in Figure LABEL:fig:LD_vs_NED the distances between the predicted sources and provided ones for the same 10 events. For our proposed method, these values average around 15 meters. Considering the quality of the data and the layout of the sensors on the surface, the differences for the proposed method are very reasonable. Meanwhile, with only the crosscorrelation with a reference trace, which was developed earlier Wang and Alkhalifah (2021), the location difference averaged around 45 meters (Figure LABEL:fig:LD_vs_NED, in black). So the transformations (our approach for domain adaptation) helped a lot in this example.
event_location_3D_given_vs_autocor_vs_no_autocor_newwidth=0.95a) A 3D plot of the locations of the predicted (blue dots) and provided (red dots) source locations for 10 field data events, as well as the trained and predicted without the convolution with the autocorrelation of the other data (grey dots). The thin lines between the dots connect both predictions to the provided locations from both predictions.
2event_location_XY_given_vs_autocor_vs_no_autocor_new,event_location_XZ_given_vs_autocor_vs_no_autocor_new,event_location_YZ_given_vs_autocor_vs_no_autocor_newwidth=0.45The locations of the predicted (blue dots) and provided (red dots) source locations for 10 field data events, as well as the trained and predicted without the convolution with the autocorrelation of the other data (black dots), all of which are projected on to a) the plane, b) the plane, and c) the plane. The thin lines between the dots connects both predictions to the provided locations.
LD_vs_NEDwidth=0.95The normalized euclidean distance (blue) between the field data and modeled data from the predicted source locations (blue), as well as the actual source location (the label) difference between the provided locations and the predicted locations using the proposed method (solid lines) and using the same network trained without the convolution with the autocorrelation (dashed lines).
4 Lowfrequency prediction example
In this example, we apply the approach to the task of lowfrequency extrapolation. In particular, we reconstruct lowwavelength components, 5 Hz, for a complete shot gather from available highfrequency representation, 4 Hz, of this shot gather. The predicted lowfrequency data aim to help fullwaveform inversion (FWI) converge to the global minimum by mitigating the cycleskipping problem Ovcharenko et al. (2019). Since low frequencies are often not available in realworld field data or are contaminated with noise, we train a deep neural network on synthetic data and run inference on a bandlimited marine dataset acquired offshore northwest Australia. An example shot gather is shown in Figure LABEL:fig:traintestData2FF.
Similar to an inverse problem, where the soughtafter data are often unknown, available lowfrequencies in the field dataset are insufficient to enable unsupervised training for bandwidth extrapolation. For this reason, we train the deep learning model on synthetic waveforms simulated in a variable range of elastic media initializations. Specifically, we extract the acquisition parameters and source signature from the field data and use these for numerical modeling in a set of synthetic subsurface initializations. The source wavelet should have a broad spectrum to allow us to bandpass the modeled shot gather to a highfrequency shot gather (similar band to the field data) and a lowfrequency band shot gather (our objective). Common shot gathers, generated using such a setup, provide inputs and labels for the training of the deep learning model.
4.1 Data transformations
For training on synthetic data, we only apply the MLReal transformations on the inputs to the network assuming that we seek targets similar to those from the synthetic data distribution. In other words, we assume that the velocities used in generating the synthetic data are close to the true Earth one, which will mitigate the limitation we mentioned with regard to the vertical dimension of the input. To set up the adaptation workflow for the input highfrequency data, we select an arbitrary trace from the highfrequency partition of the synthetic dataset and use this trace as a constant amplitude reference throughout the application. During training, for a given sample of input highfrequency data, we first crosscorrelate it tracewise with the trace selected earlier (Figure LABEL:fig:traintestData2FF). Then we draw a random highfrequency data sample from the field dataset and convolve its autocorrelation with the result of the previous operation, as demonstrated in Figure LABEL:fig:traintestData2FF. The effect of such transformations is shown in Figure LABEL:fig:input_target. The transformed synthetic and real data lookalike to the point they share similar energy distribution as a function of offset. The synthetic transformed data are used to train the network without the need to apply any alterations to the output synthetic lowfrequency shot gather, as the NN needs to learn such mapping since the transformed real data have similar features.
1input_targetwidth=The original input and target data from synthetic and field datasets (first and last columns; the input data transformed by the proposed MLReal transformations (central column).
4.2 Deep learning framework
We approach the frequency bandwidth extrapolation for the entire shot gathers as an imagetoimage translation task. The design of a neural network should account for the need to capture longwavelength patterns in the data since these are the datadomain projection of lowfrequencies. For this reason, we follow
Wang et al. (2018) to build a network that extracts multiscale representations from the input volume by using dilated convolutions. A similar note about the need for widespan convolutional kernels was made by Sun and Demanet (2019). In particular, the proposed architecture (Figure LABEL:fig:arch_low) includes a threecolumn encoder featuring dilated convolutions with kernel sizes of 3, 5, and 7, followed by a convolutional decoder. In total, there are around 900k trainable parameters.1arch_lowwidth=The multicolumn architecture for lowfrequency extrapolation neural network model. This translates the highfrequency data, HF, into its lowfrequency counterpart, LF.
The training strategy implements the superconvergence concept by scheduling the learning rate according to Smith and Topin (2019), with the minimum and maximum learning rates of and , respectively. We also find it helpful to initialize the network following He et al. (2016), as well as to average predictions within an ensemble (Chollet, 2017) of 5 network initializations to reduce noise in the output data. One more note about training includes using the batch size of 4 since smaller batch sizes are prone to lead the inversion to nonsharp local minima (Keskar et al., 2016).
4.3 Results
Each shot gather used as input and output of the network is a singlechannel image of , measuring the number of receivers in the marine streamer and the number of time samples, respectively. Receivers are spaced 25 m apart while the temporal sampling is coarsened to 8 ms. We create the training dataset of 3072 shot gathers by modeling 3 shots in each of 1024 random subsurface realizations. These shots are then split into partitions of 2765, 154, and 153 samples for training, validation, and testing, respectively. The set of random subsurface models is derived by distorting the layered models using elastic transforms, similar to Kazei et al. (2021). The optimization by Adam (Kingma and Ba, 2014) stagnates after 80 epochs, delivering sufficient coveragence of the proposed training.
2syn_data,field_datawidth=0.45Prediction results for synthetic (a) and field data samples (b). The top row contains the input data processed by the proposed transformations and the corresponding predictions of data below 5 Hz. The bottom row contains the same predicted data lowpassed below 3 Hz.
The predicted lowfrequencies 5 Hz indicate a good match for both synthetic (Figure LABEL:fig:syn_data) and field datasets (Figure LABEL:fig:field_data). This is expected due to the intentional overlap from 4 to 5 Hz between input and target frequency bands used to recover the amplitude of the signal when needed. We also lowpass the predicted data below 3 Hz to explore the regression capability of the trained network in the part of frequency spectra where data were not present. For the synthetic test, while weak reflections in the lowpassed data are missing in the predictions, the general shape of the wavelet, as well as the early arrivals, are fairly well reconstructed. On the other hand, since real data suffers from a low signal to noise ratio at low frequency, like below 3 Hz, the predictions admitted higher signaltonoise ratio, and reasonable smooth wavefields that can ultimately benefit waveform inversion. The high signaltonoise ratio is courtesy of the synthetic data training where the output is free of noise and includes very low frequency.
In this example, we mainly conduct a proofofconcept study of a geophysical application of our domain adaptation approach given by MLReal transformations (equations 3 and 4), rather than seek the stateoftheart bandwidth extrapolation accuracy. Realistic inference results by the same trained network on synthetic and field datasets suggest the viability of the approach for supervised regression tasks where only input data from the field dataset is available.
5 Discussions
The reason for the crosscorrelation with a reference trace is to balance the contributions from synthetic and real data to the input data to the network, without affecting the general features of the input data. In the microseismic example, this operation also can help reduce the input data size. The fact that the inputtothenetwork is effectively different from the original data is not an issue for NN models, as they adapt to any data we deem as input. What matters is whether the input is consistent between the training (source) and application (target) data. Actually, the crosscorrelation operation can enhance the data with features (from the crosstalk between unrelated events) that will further enrich the training data with information that can help in identifying the corresponding labels. For example, if a shot gather includes two reflections, the crosscorrelation of that shot gather with a reference trace from it will result in two additional events from the crosstalk of the two events, and these added events will be different from one shot gather to another depending on the distance between the two events and their moveouts. Such additional events will add to the information embedded in the input data that the NN model can use to learn to identify the corresponding labels.
However, the application of the MLReal transforms in equations 3 and 4
, and specifically, the crosscorrelation with a reference trace will alter the vertical axis of the input data, moving the energy closer to zero lag, or in other words, retaining only information on the relative location of events in the vertical dimension (whether the vertical dimension is time or depth), not their absolute locations. If the task is a classification of features not related to the absolute value of the vertical axis, like in the microseismic source location task we shared, this process will not affect the ability of trained models to classify the real data, since the real data are exposed to the same operations. This holds for any task that does not depend on the actual scale of the vertical axis (time or depth) and relies more on the relative location of events vertically, and on lateral behavior of features. This limitation should not be a critical obstacle for seismic data recorded on the Earth’s surface as our resolution of depth (the vertical axis) has always been limited. This limitation often manifests itself in the misties we face between seismic and well depths. In other words, the vertical scale, in many seismic tasks, is poorly represented in the data.
Nevertheless, there is a relatively straightforward remedy for such a limitation that would allow us to apply the proposed approach to a wider range of tasks. If the task involves the same size data (in the vertical dimension) for the input and output to the network, like in denoising, superresolution, and even interpolation of missing traces, the crosscorrelation operation can be applied to both the input and output data. The reference trace should be taken from the same channel location from the input data (as the input data are available for training and application). For prediction, in this case, we will need to apply an inverse crosscorrelation on the predicted data using the same location reference trace used in the training, but taken from the input to the prediction. The inverse crosscorrelation will transform the output of the prediction back to the form of the section we desire. Figure
LABEL:fig:diagramOuput outlines these steps. The correlation and inverse correlation steps could be done in the Fourier domain, as the inverse correlation, in this case, implies dividing each output trace by the reference trace. Kinematically, this implies that in the training, we subtract the phase of the reference trace from the input data (crosscorrelation) and then we add it back again to the output of the NN model in the inference stage (inverse crosscorrelation). If the amplitudes of events are not critical, the inverse crosscorrelation can be replaced with convolution, which has exactly the same effect on the phase.diagramOuputwidth=0.95A diagram describing the steps applied to the output of the NN network to allow the proposed approach to work for applications like denoising and missing trace recovery when the training is performed on synthetic data. For training (top row), the original label is correlated with a reference trace from a fixed location from the input data, as the dashed line indicates, to provide the output for the training . This same reference trace is used in the transform given by equation 3. For inference (bottom row), the output of the NN model,, is inversely correlated with a fixed location reference trace from the input, as the dashed line indicates, to provide the final output.
Finally, there is nothing in the proposed MLReal transformations that can prevent their applications to waveforms (seismic or electromagnetic) at any scale. The proposed transformations are applied to the vertical dimension of a multi dimensional input to the network, and in the shared examples the vertical dimension was the time axis. The microseismic example can be considered as a miniature test of a potential Earthquake location task. So if the objective is to study and monitor earthquake locations in a certain region using a set of seismic stations Boschi et al. (2018); Theunissen et al. (2017), we can use a global model Lei et al. (2020)
to simulate synthetic data at the seismic stations, albeit elastic (multi component), from random sources (with random source moment tensors) at the monitored area of interest. In this case, one of the stations are picked as the reference station for crosscorrelation and fixed for both synthetic and real data. Then the MLReal transformations will migrate the real data features (events not reproduced by the global model) to the synthetic training set, and vice versa, as shown for the microseismic example in Figures
LABEL:fig:field_data_convolve_with_mean_autocorrelation2 and LABEL:fig:synthetic_data_convolve_with_mean_autocorrelation2 .Training with welllog velocities
Though we shared two specific applications, the proposed approach is meant to help with any supervised machine learning task in which the task is insensitive to the scale of the vertical axis of the input data. The vertical scale in waveform applications in which data are acquired on the input data are given by multidimensional waveforms. This may include tasks in first arrival picking, horizon picking, facies segmentation and classification, fault detection, semblance analysis, velocity inversion, and many more. The key common denominator in these tasks is defined by the behavior of the features over multiple traces. The initial crosscorrelation of the data with a reference trace will not affect the majority of these applications. For traveltime picking, the reference trace should be fixed for all shots. The same holds for horizon picking and segmentation tasks. However, in applications where the vertical axis (time or depth) does not influence the labels, like direct velocity inversion, the reference trace can be taken from each input section separately.
In applications where the vertical axis is not crucial to the task, which
The only part of the proposed data adaptation operations here that considerably affects the input features (structures) is the crosscorrelation of the input with a reference trace. The choice of a reference trace can be crucial in applications involving first arrival picking, segmentation, horizon, and fault detection. Since for these tasks, the vertical location is a crucial feature, the reference trace should be consistent between training and application.
In some applications, like those given by predicting low frequency, the outputs in the training might have to be exposed to the same operations as the input.
In applications where the vertical axis of the input is crucial to the task, a crosscorrelation between the reference trace used in the training and that used in the inference will provide the shift
6 Conclusions
We proposed a novel technique to precondition the synthetic training data set for a supervised neural network optimization so that the trained model works better on real data. The concept is based on incorporating as much information from the real data into the training without harming the synthetic data features crucial for the prediction. Considering the two data domains (synthetic and real), we specifically crosscorrelate an input section from one domain of data with a reference trace from that data followed by convolution with an autocorrelated section from the other domain. For training the NN model, the input section is from the synthetic data domain, and for the application (inference) of the NN model, the input section is from the real data domain. A test of this approach on a microseismic source location task using input waveforms helped us improve the application of the NN model on real data. Another application, in which we train an NN to predict low frequencies, the preconditioning helped improve the prediction on real data.
7 Acknowledgments
We thank Umair bin Waheed from KFUPM, Frantisek Stanek from Seismik, and Claire Birnie and Yuanyuan Li from KAUST for helpful discussions. We thank Microseismic, Inc. and Newfield Exploration MidContinent, Inc. for graciously supplying the data. We thank CGG for the marine dataset. We also thank KAUST for its support and the seismic wave analysis group (SWAG) for constructive discussions.
References
 Quantitative seismology. Cited by: §1.
 Deeplearning tomography. The Leading Edge 37 (1), pp. 58–66. Cited by: §1.
 Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, B. Schölkopf, J. Platt, and T. Hoffman (Eds.), Vol. 19, pp. . External Links: Link Cited by: §2.1.
 Statistical methods for ambient noise characterisation, modelling and suppression: theory and applications for surface microseismic monitoring.. University of Leeds, University of Leeds. External Links: Link Cited by: §1.
 A simple method for earthquake location by surfacewave time reversal. Geophysical Journal International 215 (1), pp. 1–21. External Links: ISSN 0956540X, Document, Link, https://academic.oup.com/gji/articlepdf/215/1/1/25166409/ggy261.pdf Cited by: §5.
 Sourceindependent timedomain waveform inversion using convolved wavefields: application to the encoded multisource waveform inversion. GEOPHYSICS 76 (5), pp. R125–R134. Cited by: §1.
 Deep learning with python. Simon and Schuster. Cited by: §4.2.
 A comparison of seismic saltbody interpretation via neural networks at sample and pattern levels. Geophysical Prospecting 68 (2), pp. 521–535. Cited by: §1.

Unsupervised visual domain adaptation using subspace alignment.
2013 IEEE International Conference on Computer Vision
, pp. 2960–2967. Cited by: §1, §2.1.  CycleGAN for sim 2 real domain adaptation. Cited by: §1.

Deep residual learning for image recognition.
In
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, Vol. , pp. 770–778. External Links: Document Cited by: §4.2.  Mapping full seismic waveforms to vertical velocity profiles by deep learning. Geophysics 86 (5), pp. 1–50. Cited by: §4.3.
 On largebatch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836. Cited by: §4.2.
 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.2, §4.3.
 An introduction to domain adaptation and transfer learning. ArXiv abs/1812.11806. Cited by: §1, §1, §1.
 Global adjoint tomography—model GLADM25. Geophysical Journal International 223 (1), pp. 1–21. External Links: ISSN 0956540X, Document, Link, https://academic.oup.com/gji/articlepdf/223/1/1/33476599/ggaa253.pdf Cited by: §5.
 A primer on domain adaptation. ArXiv abs/2001.09994. Cited by: §1, §2.1.
 A welltoseismic calibration method for seismic data in depth domain. In SEG Technical Program Expanded Abstracts 2019, pp. 1958–1962. External Links: Document, Link, https://library.seg.org/doi/pdf/10.1190/segam20193211214.1 Cited by: §1.
 Deep learning for lowfrequency extrapolation from multioffset seismic data. GEOPHYSICS 84 (6), pp. R989–R1001. Cited by: §1, §2.2, §4.
 On the spectral bias of neural networks. External Links: 1806.08734 Cited by: §3.3.
 Superconvergence: very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for MultiDomain Operations Applications, Vol. 11006, pp. 1100612. Cited by: §4.2.
 Seismicity induced by hydraulic fracturing in shales: a bedding plane slip model. Journal of Geophysical Research: Solid Earth 122 (10), pp. 7912–7926. Cited by: §3.
 Extrapolated full waveform inversion with deep learning. arXiv preprint arXiv:1909.11536. Cited by: §4.2.
 Absolute earthquake locations using 3D versus 1D velocity models below a local seismic network: example from the Pyrenees. Geophysical Journal International 212 (3), pp. 1806–1828. External Links: ISSN 0956540X, Document, Link, https://academic.oup.com/gji/articlepdf/212/3/1806/23507795/ggx472.pdf Cited by: §5.
 Optimal transport: old and new. Grundlehren der mathematischen Wissenschaften, Springer Berlin Heidelberg. External Links: ISBN 9783540710509, LCCN 2008932183, Link Cited by: §1.
 Predict passive seismic events with a convolutional neural network. SEG Technical Program Expanded Abstracts 2020, pp. 2140–2145. Cited by: §3.3.
 Direct microseismic event location and characterization from passive seismic data using convolutional neural networks. GEOPHYSICS 86, pp. 0–0. Cited by: §1, §3.3, §3.
 Seismic inversion: theory and applications. John Wiley & Sons. Cited by: §1.
 Image inpainting via generative multicolumn convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 331–340. Cited by: §4.2.
 Seismic facies analysis using machine learning. GEOPHYSICS 83 (5), pp. O83–O95. Cited by: §1.
 A brief introduction to weakly supervised learning. National Science Review 5 (1), pp. 44–53. External Links: ISSN 20955138, Document, Link, https://academic.oup.com/nsr/articlepdf/5/1/44/31567770/nwx106.pdf Cited by: §1.