1 Introduction
In seismic data acquisition, the geophones record reflected seismic signals as well as random and coherence noises. Random noise is caused by environmental interferences. Coherent noise, such as linear noise (groundroll waves), and multiple reflection waves, are generated by the sources. Noises lead to undesired artifacts in seismic migration and inversion; therefore, noise attenuation must be applied before subsequent seismic processing steps.
Traditional random noise attenuation methods are typically based on filtering techniques. Generally, noise is assumed as Gaussian distributed while the data can satisfy different assumptions, such as linear events (
Spitz, 1991; Naghizadeh, 2012), sparsity (Zhang and Ulrych, 2003; Hennenfent and Herrmann, 2006; Fomel and Liu, 2013; Yu et al., 2015), and lowrank assumptions (Trickett, 2008; Kreimer and Sacchi, 2012), such that the data can be distinguished from the noise with specially designed algorithms.Coherent noise attenuation methods are based on two technologies: filtering and prediction. The ground roll travels along the land surface, with high amplitude, low frequency, and low velocity. Shallow reflections may be masked by strong ground rolls, which must be removed first. For removing ground roll, filtering methods based on the combinations of its frequency and velocity properties (Corso et al., 2003; Zhang et al., 2010; Liu and Fomel, 2013) have been proposed, as well as prediction methods based on modeldriven techniques (Yarham et al., 2006) and datadriven techniques (Herman and Perkins, 2006).
Multiples are recorded signals that are scattered or reflected more than once. Multiples cannot be migrated to the correct positions with most of stateoftheart imaging methods (e.g., reversetime migration) (Weglein, 2016). Many algorithms have been proposed based on two properties of multiples (Berkhout and Verschuur, 2006): the moveout differences between multiples and primaries (Hampson, 1986; Herrmann et al., 2000; Trad et al., 2003), and the predictability of multiples (Robinson, 1957; Berkhout, 1982; Verschuur, 1992).
Although many methods are proposed for random and coherent noise attenuation, the algorithms still encounter two bottlenecks: inaccurate assumptions and improper parameter settings. The seismic data models are still approximations of the field data. For example, the sparsetransformbased methods used currently assume that seismic data can be represented sparsely with a specially designed transform; however, this assumption is not suitable for field data. Adaptive dictionary learning methods (Yu et al., 2015)
train an adaptive sparse transform from the dataset, rather than using a predesigned transform. However, dictionary learning is still based on the sparsity assumption. Furthermore, the fine tuning of parameters by experience is required for good denoising quality. For example, the noise level is unknown in field data random noise attenuation, such that repeated numerical tests with different variances are required, causing low efficiency. Noise estimation is used for reducing labor and uncertainty
(Liu et al., 2013), which is still an approximate procedure.Can we have a uniform framework for general noise attenuation? Rather than establishing the model and estimating the parameters in advance, we introduce deep learning (DL), an advanced machine learning method as an alternative seismic noise attenuation method with little prior knowledge of the data or noise. DL achieves “intelligent denoising” by exploiting the hidden relationship between corrupted data and clean data from a large amount of existing datasets.
Before describing DL in detail, we first introduce its original: machine learning (ML). ML algorithms are designed to learn the features and relationships hidden in large number of datasets automatically. ML is used primarily in the regression, prediction, and classification of large datasets, such as face detection
(Rowley et al., 1998) and medical diagnosis (Kononenko, 2001). ML has been also used in seismic exploration.Zhang et al. (2014) proposed using a kernelregularized least squares (KRLS) (Evgeniou et al., 2000) method for fault detection from seismic records. The authors used toy velocity models to generate records, and set the velocity models and records as inputs and outputs of a network, respectively. They used KRLS to construct and optimize the network and obtain meaningful results. Jia and Ma (2017)
proposed using supported vector regression (SVR)
(Cortes and Vapnik, 1995)for seismic interpolation. The authors used linear interpolated data and original data as the inputs and outputs, respectively, and they used SVR to obtain the relation between the inputs and outputs. They claimed that no assumptions are imposed on the data and that no parameters required tuning for interpolation. Another popular ML method is artificial neural network (NN), which has been widely explored in seismic data processing, such as timetodepth conversion
(Röth and Tarantola, 1994), event picking (Glinsky et al., 1996), tomography (Nath et al., 1999), parameters determination and pattern detection (Huang et al., 2006), and faciesclassification (Ross and Cole, 2017), among others.As the rapid development of computer hardware, especially graphic processing units (GPUs), deep neural network has been a popular topic since 2010, such as the deep belief network
(Hinton et al., 2006), stacked autoencoder
(Vincent et al., 2010), and deep convolutional neural network (CNN) (Lecun et al., 1998). CNN uses a shared local convolutional filter bank designed for images, which contains much less parameters compared to a fully connected multilayer neural network (FCNN). The FCNN encounters computational and storage problems owing to the massive parameters when the network becomes deeper, or the size of the inputs becomes larger. The FCNN ignores the structure of the input entirely. Seismic data have a strong local structure: neighbor samples are highly correlated. CNN utilizes correlation using a shared local convolutional filter, thus avoiding the use of numerous parameters.Lecun et al. (1998) proved that CNN with less parameters provided much better classification result on the MNIST set compared to the FCNN. The CNN was developed rapidly since 2010 for image classification and segmentation, such as VGGNet (Simonyan and Zisserman, 2015), and AlexNet (Krizhevsky et al., 2012). The CNN was also used in image denoising (Jain and Seung, 2008; Zhang et al., 2017)
and super resolution
(Dong et al., 2014; Cheong and Park, 2017). A CNN with 17 convolutional layers was used in Zhang et al. (2017) for image denoising. The authors used noise as the outputs rather than clean data, i.e., residual learning. They claimed that residual learning accelerated the training process as well as improved the denoising performance. Compared to dictionary learning with singlelayer decomposition, DL (if not specified, DL refers to deep learning with CNN herein) with deeper layers enables the exploration of rich structures in seismic data training set with different levels of abstraction.The applications of DL in image processing provide new ideas and techniques to geophysicists. We herein propose using DL for seismic noise attenuation. Our contributions are as follows: (i) we construct three training sets for synthetic noise attenuation and two training sets for field noise attenuation, (ii) we obtain automation, higher efficiency, and higher quality compared to traditional methods, and (iii) we present an indirect visualization of the trained convolutional filters and discuss the hyperparameter tuning of the CNN. The second part introduces the theory, including the design of the CNN, optimization algorithm, inverting the CNN for visualization, and transfer learning for field data training. The third part introduces the preparation of the training datasets and presents the numerical tests on synthetic and field datasets. Subsequently, we discuss the performance, parameters, and other aspects of DL. Finally, we conclude the paper.
2 Method
The seismic data acquisition model is
(1) 
where is the clean seismic data, is the contaminated data, and is the noise. The basic idea of ML based on NN is to establish a relationship between and using the following formula:
(2) 
where Net stands for the NN architecture, which is equivalent to a denoising operator, and are the network parameters, including the weight matrix and bias . In specific applications, the residual may be used as the output:
(3) 
where R stands for residual learning. In this section, we will introduce the network architectures, and the optimization of , inverting CNN and transfer learning.
2.1 Network architecture and optimization
In this section, we begin by introducing a simple NN, i.e., fully connected NN (FCNN) with a single hidden layer. First, we present the formula of the FCNN:
(4) 
where are the weight matrices and biases of the hidden layer and output layer, respectively, and
introduces the nonlinearity, such as the rectified linear units (ReLUs), defined as
. Figure 1 shows the sketch of the FCNN with one hidden layer. Here, “fully connected” implies that every two nodes in adjacent layers are connected.The CNN uses convolutional filters; it is equivalent to the FCNN where the elements in the weight matrices are shared across different local windows. in the CNN is expressed explicitly as
(5) 
where is the number of convolutional layers, are convolutional filters, are the biases, ‘’ stands for the convolutional operator. and
is the intermediate output, named activation. The definitions of ReLU and BN (Batch Normalization) are shown in Table
1. The sketch of the CNN network architecture is shown in Figure 2.Layers  Description  Definition 

Conv  Convolutional layer  
BN  Batch normalization  normalization, scale and shift 
ReLU  Rectified linear units 
In the CNN, , and
are written in the tensor form.
, where and are the dimensions of the input, is the size of the convolutional filter, is the number of channels in layer , and is the number of convolutional filters in layer . Each threedimensional (3D) convolutional filter is applied on the 3D input tensor to produce one output channel, as shown in Figure 3. A convolutional filter causes each unit in a layer to receive inputs from a set of units located in a small neighborhood in the previous layer, as shown in Figure 3. The region of the neighborhood is named the local receptive field (Lecun et al., 1998).A traditional method of introducing nonlinearity to a network is to use the sigmoid or tanh function. However, in gradient descent, these nonlinearities are much slower than ReLU, because the gradient is small in the nonlinear regions (Krizhevsky et al., 2012). The comparison among sigmoid, tanh, and ReLU is shown in Figure 3. BN is introduced in Ioffe and Szegedy (2015) and used to accelerate training.
To optimize in equation 3
, a loss function is defined as
(6) 
where are training pairs. D measures the discrepancy between the desired output and the network output, which in our case is simply chosen as the mean square error, i.e., D, where stands for the Frobenious norm. Because is extremely large, computing the gradient of numerically is impractical. Therefore, a minibatch stochastic gradient descent (SGD) (Yann, 1998) is used to minimize . In every iteration, only a small subset of
is used to approximate the gradient. One pass through the whole training set is defined as an epoch. The training samples are first shuffled into a random order, and subsequently chosen sequentially in minibatches to ensure a whole pass.
A tradeoff exists for the batch size in SGD. A small batch size allows for frequent updates per epoch, feasibility in GPU memory, and acceleration of convergence. However, a smaller batch size uses totally random gradients, leading to low efficiency. Smaller batch sizes also does not utilize the acceleration of parallel matrix–matrix products. According to Bengio (2012), the sweet region of the batch size is between 1 and a few hundred. We select the batch size according to Zhang et al. (2017) and under the restriction of GPU memory in different types of noise attenuation.
A tradeoff also exists for the number of epochs of SGD. If the number of epochs is small, a satisfactory solution would not be obtained. However, too large a number will cause overfitting. We decide the number of epochs by observing the training curve of the validation loss function (loss function of the validation set). If the loss function appears stable, we end the training after 10–20 more epochs. This empirical principle is called early stopping, which also avoids strong overfitting (Bengio, 2012).
2.2 Inverting CNN for indirect visualization of convolutional filters
In the dictionary learning method (Yu et al., 2015), we wish to observe the images of the dictionaries for a better understanding of their operation. In DL, after the network is trained, we also wish to observe the convolutional filters, especially in different layers. However, the filter size is 33, which is extremely small for visualization. Therefore, an inverting CNN (Mahendran and Vedaldi, 2015)
is used to visualize the filters in the data domain by backpropagating the corresponding activations to the first layer.
The inverting CNN first sets the activation to a given and other activations to zeros, which corresponds to , i.e., the th filter in the th hidden layer. Subsequently, we obtain input in the data domain, which contributes most to . Therefore, will contain the information from , which is an indirect visualization method. We can compute , which activates the most by solving
(7) 
where is a balancing parameter, is a regularization term, such as the Tikhonov regularization or total variation, in case of multi solutions. The elements in can be set as ones. The optimization problem (7) can be solved by the gradient descent method. The details of parameter setting and implementation are referred to inMahendran and Vedaldi (2015).
2.3 Transfer learning
The trained network can also be used as an initialization for learning a new network, i.e., transfer learning (Donahue et al., 2014). Fine tuning a network with transfer learning is typically much faster and easier than training a network with randomly initialized weights from scratch. With a smaller number of training samples, we can promptly transfer the learned features to a new task, such as from random noise attenuation to coherent noise attenuation, or a new dataset, such as from a synthetic dataset to a real dataset. This work will focus on the latter, i.e., real dataset random noise attenuation. The relationship between the transferability of features, and the distance between different tasks and different datasets are discussed in Yosinski et al. (2014) and Oquab et al. (2014), respectively.
3 Numerical Results
In this section, we first introduce how the training set is generated. Subsequently, we use the generated training sets to train three CNNs for the attenuations of random noise, linear noise, and surfacerelated multiples. Further, we test the CNN on field random noise and scattered ground roll attenuation. The networks are trained on an HP Z840 workstation with one Tesla K40 GPU, 32 Core Xeon CPU, 128GB RAM, and Ubuntu operating system.
3.1 Datasets preparation
To train an effective network, a good training set for DL should be well labeled, large, and diverse. For example, ImageNet
(Deng et al., 2009) was designed for natural images, and contains images of human beings, animals, plants, scenes, etc. Seismic data typically consist of events with smoothly varying slopes, which differ completely from the structures of natural images. However, no open training set for seismic data applications is available currently, to the authors’ knowledge. Three factors render the establishment of the seismic training sets difficult: (i) private companies do not share seismic data, (ii) labeling a large number of training samples requires intense human labor, and (iii) certain tasks, such as denoising or inversion, are difficult to be labeled by humans. Therefore, we concentrate on the pseudo training set herein, implying that the clean dataset is either synthetic or obtained by the existing denoising methods. Three pseudo training sets are first established to prove the suitability of DL for seismic data processing.A dataset for training a neural network typically consists of three subsets: a training set, a validation set, and a test set. The training set is used to fit the parameters in the network. The validation set is used to tune the hyperparameters. The test set is used to test the performance of the trained network. The training, validation, and test sets are generated with the same rule but are independent of each other. The sizes of the validation and test sets are approximately 25 of the training set. The training loss, validation loss, and test loss are defined as the average loss in equation 6 during training, validation, and testing.
3.1.1 Random noise
In a random noise situation, we can use clean signals as the outputs, and manually add Gaussian noise to the clean signals to serve as the inputs (we use the added noise as the outputs for residual learning). Further, the inputs can be split into small patches rather than using the whole sections directly for saving memory, because random noises are locally incoherent with useful data.
Although the network architecture does not change, the sizes of the intermediate outputs changes according to the input size. Suppose that the network contains 17 convolutional layers with 64 convolutional filters in each layer, and the input size is 10001000 (time sample and space sample), and the size of the intermediate outputs after convolutional layers is (4.05 GB in single precise). If the input is split into patches of size , subsequently the total intermediate size is (0.41 MB), which is only 0.01 of the original one. It is noteworthy that the denoised result is not an overlapping of small patches. Although we used small patches in training, the input can be of any size in testing. Each unit of one layer is computed by the weighted sum of the units in its corresponding receptive field in the previous layer.
The training set is downloaded from SEG (Society of Exploration Geophysicists) open datasets (https://wiki.seg.org/wiki/Open_data#2D_synthetic_seismic_data). Four of the datasets are shown in Figure 4. To improve the diversity of the training set, we select the prestack data, poststack data, twodimensional data and 3D data. In 3D data, we select sections with intervals of 20 shots, because the adjacent shot sections are similar.
Some portions of the seismic data are almost all zeros, which does not contribute to the training process. We introduce a Monte Carlo strategy (Yu et al., 2016)
while generating the training set. For each training sample, we compare its variance with a uniformly distributed random number. We maintain the sample if its variance is greater than the random number. Finally, the training set contains approximately 50,000 samples. The size of each training sample is
by referring to Zhang et al. (2017). To achieve blind denoising, we add Gaussian noise with different levels of variances ( of the maximum amplitude, uniformly distributed) to the data. Figure 5 shows eight examples from the training set. It is noteworthy that the samples contain different types of structures and noise variances.3.1.2 Linear noise
Splitting is not valid in coherent noise attenuation, because the coherent noise is difficult to be distinguished in the local scale. Therefore, the size of the training data is larger than in the random noise situation. The linear noise we consider herein are the events with random time shifts and slopes, rather than ground rolls. The number of linear events is three. The desired signals are three hyperbolic events with random curvatures and random zero offset arrival times. The amplitudes of the desired signals and linear noises are the same. We set the size of the training sample as and the size of the training set as 8000. Figure 6 shows three training samples and one test sample with linear noise.
3.1.3 Multiple
We generate models of the same size (150 75), containing three interfaces to simulate the water bottom and various interfaces. The medium above the first interface is water with a Pwave velocity 1500 . The three underlying layers contain Pwave velocities of 2000, 2500, and 3000 , respectively. The size of the training set is approximately 900.
To obtain the maximum diversity for the training set, each interface is placed randomly with uniform distribution. The first interface contains no dip to produce obvious firstorder multiples. The second and third interfaces are dipping, with local slopes chosen randomly (Gaussian distribution followed by average smoothing). Figure 7 shows a sample from the generated velocity models. We use an eighth order in space, and second order in a timefinite difference algorithm to solve the acoustic wave equation and generate synthetic seismograms on the surface. One fixed source is placed at () = (0.375, 0.005) km for each model. Additionally, 75 receivers are placed evenly below the surface at a depth of 5 m with 10 m spacing. A convolutional perfectly matched layer (CPML) absorbing boundary conditions (Komatitsch and Martin, 2007) are used on the left, right, and bottom grid edges to reduce unwanted reflections.
For each velocity model, two synthetic forward models are performed to generate the seismograms. The first one has a freesurface boundary on the top, and the data generated contains both primary reflections and surfacerelated multiples, including ghosts and water bottom reverberations. Thus, the generated seismogram can serve as the input of the training set. The other forward modeling has an absorbing zone at the top boundary to avoid generating any surfacerelated multiples, and the differences between the two generated seismogram is used as the output for training. Figure 8 shows four seismograms with multiple reflections and the corresponding clean signals from the training set. The first arrivals are removed. The events on the top of Figure 8a are caused by the ghost source rather than the first arrivals.
3.2 Dataset tests
3.2.1 Random noise attenuation
In a random noise situation, the size of the convolutional filters in each layer is set to be , with 64 filters referring to Simonyan and Zisserman (2015) and Zhang et al. (2017)
. The batch size is 128. The intermediate inputs are padded with zeros to maintain the output size. The number of convolutional layers is 17. The parameters used in SGD are the same as in DnCNN
(Zhang et al., 2017). The number of epochs is set as 50.We used deconvolution (Canales, 1984), curvelet (Hennenfent and Herrmann, 2006), and nonlocal mean (NLM) (Bonar and Sacchi, 2012) methods for comparison. For these three methods, we tested different parameters for the best denoising quality. This test was performed on the dataset shown in Figure 10b The output S/N versus input S/N are shown in Figure 9. After the network is trained, the parameters need not be tuned for the testing datasets, and the CNN still achieves the best denoising quality (approximately 2 dB higher than the second) among the tested methods. However, unlike the other methods, we cannot control the harshness of the CNN manually without tuning the parameters. In fact, we can assume that the harshness is controlled adaptively by the CNN itself according to the input, which avoids human interventions and achieves intelligent denoising. Figure 10 and 11 show the denoising results and the error sections. The weak events marked by the arrows are much better preserved with the CNN method than with the other methods, but the diffractions are also smoothed while denoising. In traditional methods, we can tune the parameters to preserve the diffractions along with the noise. In the CNN, datasets with diffractions must be included to train the network. Figure 12 shows the 80th trace from the denoised result and the corresponding difference with respect to the original trace. The differences are multiplied by two. It is clear that the CNN preserves the amplitude best and produces the smallest difference. The training time is approximately nine hours. The elapsed time for the denoising test, shown in Figure 10, of each method on the CPU is: 0.08 s ( deconvolution), 0.24 s (curvelet), 5.46 s (NLM), and 1.68 s (CNN). The CNN on the GPU requires 0.008 s, which is ten times faster than the deconvolution method. Figure 13 shows the average training loss and average validation loss per epoch versus the number of epochs.
As shown in Figure 14, the inverting CNN method (Mahendran and Vedaldi, 2015) is used to visualize the activations in the image domain. The activations corresponding to every convolutional filter in layers 7, 12, and 17 are set to ones. In each subfigure, each block represents the input that activates the given activations best. The blocks show obvious texture features in different scales corresponding to different layers. The textures represent how the CNN understands seismic datasets, which is currently difficult to explain.
It is challenging to analyze how the CNN operates in theory because many layers and nonlinearities exist. From the numerical perspective, we present the intermediate outputs of six ReLU layers (out of 16) in Figure 15. In each subfigure, outputs are presented from the first 16 channels (out of 64), with each corresponding to one convolutional filter. The subfigures are sorted according to the direction from the input to output. This test is performed on the data in Figure 10. We observed that the seismic events are removed gradually and the random noise is retained (we use noise as the outputs).
3.2.2 Linear noise attenuation
For the linear noise situation, we used the CNN with the same parameters as in the random noise situation. The batch size is 64. Figure 16 shows three denoising results from the test set. In each subfigure, the left side represents the data with linear noise, and the right side represents the denoised data. The linear noise is removed and hardly observed in the prediction section. DL shows the ability to learn knowledge from the training set and to use the knowledge on new data. Figure 17 shows the loss function versus the number of epochs. The training time is approximately five hours, while the denoising time on the input with size 100 is 0.13 s on the CPU and 0.008 s on the GPU.
3.2.3 Multiple attenuation
The CNN architecture is the same as in the previous tests, which indicates a potential generalization of the CNN for different kinds of noise attenuation. The batch size 16. The number of epochs is 200. We tested the trained network with seismograms from the test set. The results are shown in Figure 19 with the direct waves removed. The events on the top are caused by the ghost source rather than the direct waves. In each subfigure, the left to right represent the following: input, synthetics, output, and residual. The results show the clean removal of the multiples without affecting the primaries. The residuals between the predicted seismogram and synthetics are acceptable. Figure 20 shows that the loss function. The training time is approximately six hours, while the denoising time on the input of size is 0.34 s on the CPU and 0.007 s on the GPU.
3.2.4 Real datasets random noise attenuation
The CNN trained from synthetic datasets in random noise attenuation cannot be applied on real datasets directly, because the real noises are not Gaussian distributed. However, training a new network with randomly initialized weights may be inefficient and the training set may be insufficient to train a valid network. We assume that the field data exhibit local event structures similar to the synthetic data. Therefore, transfer learning is used in real dataset training. The trained network with synthetic datasets is used to initialize the network for real datasets. The training set contains noisy datasets as inputs, and denoised results with the curvelet method (Hennenfent and Herrmann, 2006) as outputs. One pair of the training input and label is shown in Figure 21a and b, respectively. Figure 21c shows a dataset for testing. The training set contains four large sections, which are divided into 17152 subsections of size of . The network and training parameters are the same as in the synthetic situation. Figure 22 shows the denoised results with the curvelet and CNN methods, separately. Figure 23 shows the corresponding noise sections.
In curvelet denoising, we tested three choices of the thresholding parameter sigma, which controls the harshness of denoising. No parameter required tuning in the CNN method and the denoising result is similar to that in the curvelet method, with sigma = 0.3. When sigma = 0.2, the events appear undersmoothed. When sigma = 0.4, the events appear oversmoothed. The CNN method achieves noise attenuation adaptively without human intervention. The training time is approximately seven hours, the denoising time on the input of size is 11.95 s on the CPU and 0.02 s on the GPU. Figure 24 shows the benefit of transfer learning on CNN training when the number of training samples is small. This test is performed on the same dataset generated previously. After the number of training samples reaches a certain point, a random initialization can provide better training.
3.2.5 Real datasets scattered ground roll attenuation
Another field example is performed on scattered ground roll attenuation. Scattered ground roll is caused by the scattering of ground roll when the near surface is laterally heterogeneous. Scattered ground roll is difficult to remove as it occupies the same domain as reflected signals. Figure 25 shows 12 from 40 pairs of training datasets. The denoised datasets are obtained by the carefully designed filtering. The datasets are split into patches to fit in the memory of the GPU. The network architecture is the same as in the previous tests. Figure 26 shows a test dataset, the denoised result of the CNN, and the filtering. The CNN can remove the energy of the ground roll, implying that the ground roll is separable on the patch scale. Figure 27 shows the frequency spectrum of the center traces from the data in Figure 26.
4 Discussion
Many unanswered questions arise in DL, such as the selection of hyperparameters. The introduction of DL may raise more questions than it solves. We discuss several interesting topics in DL in this section.
4.1 Can DL perform better than industrial standard tools?
We discuss two situations: the clean dataset is generated by one existing method, and by more than one existing method. For the former, we use the random noise attenuation of real datasets as an example. If we use the industrial standard tools, such as deconvolution, to obtain a clean dataset, can we obtain a CNN that performs better than deconvolution? For the clean dataset, we can tune the parameters manually, such as the length of the auto regression operator, to obtain an optimal denoising result. After the network is trained with these optimized inputs and outputs, the network can handle the denoising of new datasets adaptively without the tuning parameters. Meanwhile, if we use deconvolution for denoising the new datasets and tune the parameters manually, insufficient experience or time may lead to poor results. In the real datasets tests in Figure 22, if the thresholding parameter is not properly selected, the curvelet method fails to achieve satisfying denoising results. The CNN can achieve intelligent denoising without the tuning parameters. Such automation in DL is mostly missing in traditional methods.
For the second situation, it is natural to establish a more diverse training set with denoising results obtained from different methods suitable for different type of data. Subsequently, the trained network can benefit from different methods and outperform any independent existing method. It is equivalent to training a network that can choose a suitable method adaptively to a certain type of dataset.
4.2 How does number of training samples contribute
We are interested in how many training samples are needed to train a fullyspecified neural network. In principle, the more the number of training samples, the more powerful is a network. However, a longer time is required to prepare the training set and train the network. We treat the number of training samples as a hyperparameter for training the network. This test is performed on the synthetic random noise training set. Figure 28 shows the training loss and validation loss versus the number of epochs and number of training samples. It is noteworthy that we used more epochs for small numbers of samples to ensure that the total number of samples used in each optimization are the same. The epochs are normalized to 50. We found that if the number of training samples is extremely small (8448), the training loss is small but the validation loss is large, thus causing overfitting. Early stopping and a larger number of training samples can avoid overfitting. The validation loss is almost the same for 44220 and 50688 training samples. After the number of training samples reaches a certain level, the extra training datasets contribute little to the validation loss.
4.3 Why “deep” matters
Deeper layers incorporate a larger receptive field and more nonlinearities. Receptive field represents the number of elements involved in one convolution operation. Larger receptive field involves more information from the input layer. Figure 3 shows that the receptive field size of a deeper layer is larger than that of a shallow layer in two dimensions. If the convolutional filter size is , subsequently the receptive field of layer 2 with respect to layer 1 is , while the receptive field of layer 3 with respect to layer 1 is . It is clear that more units in the input are enrolled in computing the units in the deeper layers.
Meanwhile, if the other parameters are fixed, the CNN with deep layers incorporate more nonlinear layers, resulting in a more discriminative decision function (Simonyan and Zisserman, 2015), and providing more parameters to fit the complex mappings. Figure 29 shows how the different numbers of layers (1, 3, 5, 7, 9) affect the network performance on the test set in a random noise situation. The training and testing sets are the same in different setups. In each epoch, we calculate the validation loss. The validation loss reduces as the layer increases. However, it is not always optimal to use extremely deep networks, as it may lead to low efficiency and overfitting. Hence, techniques for avoiding overfitting must be considered, such as dropout (Srivastava et al., 2014).
4.4 Hyperparameters
Hyperparameters are parameters that are set before the training begins. Other than the batch size and number of epochs, the CNN also contains numerous of hyperparameters in optimization, such as the learning rate, and momentum. Another type of hyperparameters pertains to the model architecture, such as the number of hidden layers, and the number of filters in each layer. Although parameter tuning is not required after training, hyperparameter tuning is required while designing the network architecture and optimization. Even though guidance has been proposed for recommending the hyperparameters (Bengio, 2012), it is difficult for readers who are not experts in DL and only wish to use DL rather than optimizing every hyperparameter. Fortunately, most of the hyperparameters in optimization, such as the batch size, typically impact the training time rather than the test performance (Bengio, 2012). Therefore, except for the number of epochs, most hyperparameters in optimization can be borrowed from image denoising, such as Zhang et al. (2017).
If the number of model hyperparameters is small, such as 1 or 2, we can use the grid search method in the regulardomain or logdomain. Different configurations can be computed in parallel, and the one that achieves the minimum loss on the validation set is selected. The grid search method scales exponentially with the number of hyperparameters. When the number of model hyperparameter becomes larger, a practical solution is the random sampling of hyperparameters. More information on grid search and random sampling can be found in Bengio (2012). In our work, the number of layers is determined with the grid search method. Other parameters, such as the number of filters are determined by referring to Zhang et al. (2017).
In DnCNN, the authors used a network with 17 convolutional layers. Therefore, we tested 13, 15, 17, 19, and 21 convolutional layers and chose the one that achieved the least validation loss. This test was performed on the synthetic random noise training set. Figure 30 shows how the validation loss changes with the number of epochs and number of layers. We used 13 layers as a baseline and show the differences between the other layers and 13 layers. The validation loss changes little for 13, 15, and 17 layers, but is much larger for 19 and 21 layers, which may be caused by overfitting when increasing the number of parameters in the network. When 17 layers were chosen, the least validation loss was shown in most cases.
4.5 Distance between training and test set
The simplest method to measure the distances between two samples is using Euclid distance. For the multiple datasets, we use a test sample and compute the distances between the test sample and all training samples. We present the training samples with the smallest and largest distance in Figure 31(a). We found that the one that achieves the smallest distance contains the same first arrivals as the test sample but with different multiples. The distances are plotted in Figure 31(b). The test sample does not copy any training sample.
For the dataset containing linear noise, we plot the distribution of slopes and time shifts of the linear events on a twodimensional plane in Figure 32. Further, 400 of the training samples are plotted in Figure 33. Theoretically, because we use a uniform distribution for generating the slope and time shift, different samples cannot be exactly the same. However, the plane may be crowded, especially when many samples are present. The red dots indicate the training samples and the blue circles indicate the testing samples. The blue circles locate among the red dots. If we memorize the red dots, we cannot obtain the blue circles by linear interpolation. Adding two events with slope and cannot yield an event with slope . The training sample and test samples are generated independently by the same rule. Therefore, the network learns the relationship between the input and output rather than memorizing each training sample.
4.6 Failure situations
In general, DL will fail in the following scenarios: insufficient training samples (Figure 28), improper hyperparameter setting (Figure 30), large input size (discussed in random noise dataset preparation section), or a test set that is completely different from the training set. We use the trained network from the synthetic random noise situation and apply it on the field data in Figure 21. The denoised result is shown in Figure 34, where the noise is hardly removed because the noise distribution in the field data is different from that in the synthetic dataset.
5 Conclusion
We propose a first attempt to apply DL in seismic noise attenuation. With DL, we achieve several superiorities over traditional methods: (i) denoising quality in synthetic random noise attenuation, (ii) automation (no requirement of parameter tuning) and high efficiency on GPU after training.
In realdataset random noise attenuation, CNN and transfer learning enable automated/intelligent denoising without human intervention, where much labor and time can be saved. The training progress is slightly time consuming, which is approximately six hours. However, testing on a GPU is extremely fast (0.02 s on a 1501333 input). In field scattering ground roll attenuation, although the patching strategy is used, the ground roll is removed successfully.
Largescale and welllabeled field datasets are essential for realdata processing, which requires intense human intervention. We proposed two possible directions for future studies. The first is to combine a large number of synthetic datasets generated by complex velocity models, e.g., the EAGE model, and a small number of labeled field datasets to form a new training set, which can be used to train a general network for seismic data processing. The second is unsupervised learning, where no clean signals are needed and DL may separate the coherent noise and useful signals by exploiting the hidden information in the training set. Apart from insufficient training samples, DL also encounters several general problems, such as difficulties in understanding complicated network, empirical hyperparameter setting, and intense computation while training. DL can also be used for other seismic data processing tasks, such as erratic noise attenuation, provide that a training set is properly constructed.
6 Acknowledgement
The work was supported in part by the National Key Research and Development Program of China under Grant 2017YFB0202900, and NSFC under Grant 41625017, Grant 41374121 and Grant 91730306. We thank the Northeast Research Institute of Petroleum Exploration and Development for providing the scattered ground roll dataset.
References
 Bengio (2012) Bengio, Y., 2012, Practical recommendations for gradientbased training of deep architectures: arXiv preprint arXiv:1206.5533, 437–478.
 Berkhout (1982) Berkhout, A. J., 1982, Seismic migration, imaging of acoustic energy by wavefield extrapolation, A: Theoretical aspects: Elsevier Science Publ. Co., Inc.
 Berkhout and Verschuur (2006) Berkhout, A. J., and D. J. Verschuur, 2006, Imaging of multiple reflections: Geophysics, 71, SI209–SI220.
 Bonar and Sacchi (2012) Bonar, D., and M. Sacchi, 2012, Denoising seismic data using the nonlocal means algorithm: Geophysics, 77, A5–A8.
 Canales (1984) Canales, L. L., 1984, Random noise reduction: SEG Technical Program Expanded Abstracts, 3, 329–329.
 Cheong and Park (2017) Cheong, J. Y., and I. K. Park, 2017, Deep cnnbased superresolution using external and internal examples: IEEE Signal Processing Letters, 24, 1252–1256.
 Corso et al. (2003) Corso, G., P. S. Kuhn, L. S. Lucena, and Z. D. Thome, 2003, Seismic ground roll time–frequency filtering using the Gaussian wavelet transform: Physica A: Statistical Mechanics and Its Applications, 318, 551–561.
 Cortes and Vapnik (1995) Cortes, C., and V. Vapnik, 1995, Support–vector networks: Machine Learning, 20, 273–297.

Deng et al. (2009)
Deng, J., W. Dong, R. Socher, L. J. Li, K. Li, and F. F. Li, 2009, Imagenet: A large–scale hierarchical image database: IEEE Conference on Computer Vision and Pattern Recognition, 248–255.
 Donahue et al. (2014) Donahue, J., Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, 2014, Decaf: A deep convolutional activation feature for generic visual recognition: International Conference on Machine Learning, 647–655.
 Dong et al. (2014) Dong, C., C. C. Loy, K. He, and X. Tang, 2014, Learning a deep convolutional network for image super–resolution: European Conference on Computer Vision, Springer, 184–199.

Evgeniou et al. (2000)
Evgeniou, T., M. Pontil, and T. Poggio, 2000, Regularization networks and support vector machines: Advances in Computational Mathematics,
13, 1–50.  Fomel and Liu (2013) Fomel, S., and Y. Liu, 2013, Seislet transform and seislet frame: Geophysics, 75, V25–V38.
 Glinsky et al. (1996) Glinsky, M. E., G. A. Clark, and P. K. Z. Cheng, 1996, Automatic event picking in prestack migrated gathers using a probabilistic neural network: Geophysics, 66, 1488–1496.
 Hampson (1986) Hampson, D., 1986, Inverse velocity stacking for multiple elimination: Journal of the Canadian Society of Exploration Geophysicists, 22, 44–55.
 Hennenfent and Herrmann (2006) Hennenfent, G., and F. J. Herrmann, 2006, Seismic denoising with nonuniformly sampled curvelets: Computing in Science and Engineering, 8, 16–25.
 Herman and Perkins (2006) Herman, G., and C. Perkins, 2006, Predictive removal of scattered noise: Geophysics, 71, V41–V49.
 Herrmann et al. (2000) Herrmann, P., T. Mojesky, M. Magasan, and P. Hugonnet, 2000, De–aliased high–resolution Radon transforms: 70th Annual International Meeting, Expanded Abstracts, SEG, 1953–1956.
 Hinton et al. (2006) Hinton, G. E., S. Osindero, and Y. W. Teh, 2006, A fast learning algorithm for deep belief nets: Neural Computation, 18, 1527–1554.
 Huang et al. (2006) Huang, K. Y., J. D. You, K. J. Chen, H. L. Lai, and A. J. Don, 2006, Neural network for parameters determination and seismic pattern detection: SEG Technical Program Expanded Abstracts, 25, 2285–2289.
 Ioffe and Szegedy (2015) Ioffe, S., and C. Szegedy, 2015, Batch normalization: Accelerating deep network training by reducing internal covariate shift: international conference on machine learning, 448–456.
 Jain and Seung (2008) Jain, V., and H. S. Seung, 2008, Natural image denoising with convolutional networks: International Conference on Neural Information Processing Systems, 769–776.
 Jia and Ma (2017) Jia, Y., and J. Ma, 2017, What can machine learning do for seismic data processing? an interpolation application: Geophysics, 82, V163–V177.
 Komatitsch and Martin (2007) Komatitsch, D., and R. Martin, 2007, An unsplit convolutional perfectly matched layer improved at grazing incidence for the seismic wave equation: Geophysics, 72, SM155–SM167.

Kononenko (2001)
Kononenko, I., 2001, Machine learning for medical diagnosis: history, state of the art and perspective: Artificial Intelligence in Medicine,
23, 89–109. 
Kreimer and Sacchi (2012)
Kreimer, N., and M. D. Sacchi, 2012, A tensor higher–order singular value decomposition for prestack seismic data noise reduction and interpolation: Geophysics,
77, V113–V122.  Krizhevsky et al. (2012) Krizhevsky, A., I. Sutskever, and G. E. Hinton, 2012, Imagenet classification with deep convolutional neural networks: Advances in Neural Information Processing Systems 25, 1097–1105.
 Lecun et al. (1998) Lecun, Y., L. Bottou, Y. Bengio, and P. Haffner, 1998, Gradient–based learning applied to document recognition: Proceedings of the IEEE, 86, 2278–2324.
 Liu et al. (2013) Liu, X., M. Tanaka, and M. Okutomi, 2013, Single–image noise level estimation for blind denoising.: IEEE Transactions on Image Processing, 22, 5226–5237.
 Liu and Fomel (2013) Liu, Y., and S. Fomel, 2013, Seismic data analysis using local time‐frequency decomposition: Geophysical Prospecting, 61, 516–525.
 Mahendran and Vedaldi (2015) Mahendran, A., and A. Vedaldi, 2015, Understanding deep image representations by inverting them: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5188–5196.
 Naghizadeh (2012) Naghizadeh, M., 2012, Seismic data interpolation and denoising in the frequency–wavenumber domain: Geophysics, 77, V71–V80.

Nath et al. (1999)
Nath, S. K., S. Chakraborty, S. K. Singh, and N. Ganguly, 1999, Velocity inversion in crosshole seismic tomography by counterpropagation neural network, genetic algorithm and evolutionary programming techniques: Geophysical Journal International,
138, 108–124.  Oquab et al. (2014) Oquab, M., L. Bottou, I. Laptev, and J. Sivic, 2014, Learning and transferring midlevel image representations using convolutional neural networks: CVPR ’14 Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 1717–1724.
 Robinson (1957) Robinson, E. A., 1957, Predictive decomposition of seismic traces: Geophysics, 22, 767–778.
 Ross and Cole (2017) Ross, C. P., and D. M. Cole, 2017, A comparison of popular neural network facies–classification schemes: Geophysics, 36, 340–349.
 Rowley et al. (1998) Rowley, H. A., S. Baluja, and T. Kanade, 1998, Neural network–based face detection: IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 23–38.
 Röth and Tarantola (1994) Röth, G., and A. Tarantola, 1994, Neural networks and inversion of seismic data: Journal of Geophysical Research, 99, 6753–6768.
 Simonyan and Zisserman (2015) Simonyan, K., and A. Zisserman, 2015, Very deep convolutional networks for large–scale image recognition: International Conference on Learning Representations.
 Spitz (1991) Spitz, S., 1991, Seismic trace interpolation in the F–X domain: Geophysics, 56, 785–794.
 Srivastava et al. (2014) Srivastava, N., G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, 2014, Dropout: a simple way to prevent neural networks from overfitting: Journal of Machine Learning Research, 15, 1929–1958.
 Trad et al. (2003) Trad, D. O., T. Ulrych, and M. Sacchi, 2003, Latest views of the sparse Radon transform: Geophysics, 68, 386–399.
 Trickett (2008) Trickett, S., 2008, F–XY Cadzow noise suppression: SEG Technical Program Expanded Abstracts, 2586–2590.
 Verschuur (1992) Verschuur, D. J., 1992, Surface–related multiple elimination in terms of Huygens’ sources: Journal of Seismic Exploration, 1, 49–59.

Vincent et al. (2010)
Vincent, P., H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol, 2010, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion: Journal of Machine Learning Research,
11, 3371–3408.  Weglein (2016) Weglein, A. B., 2016, Multiples: Signal or noise: Geophysics, 81, V283–V302.
 Yann (1998) Yann, L., 1998, Efficient backprop: Neural Networks Tricks of the Trade, 1524, 9–50.
 Yarham et al. (2006) Yarham, C., U. Boeniger, and F. J. Herrmann, 2006, Curvelet–based ground roll removal: SEG Technical Program Expanded Abstracts, 2777–2782.
 Yosinski et al. (2014) Yosinski, J., J. Clune, Y. Bengio, and H. Lipson, 2014, How transferable are features in deep neural networks: neural information processing systems, 3320–3328.
 Yu et al. (2016) Yu, S., J. Ma, and S. Osher, 2016, Monte Carlo data–driven tight frame for seismic data recovery: Geophysics, 81, V327–V340.
 Yu et al. (2015) Yu, S., J. Ma, X. Zhang, and M. D. Sacchi, 2015, Interpolation and denoising of highdimensional seismic data by learning a tight frame: Geophysics, 80.
 Zhang et al. (2014) Zhang, C., C. Frogner, M. ArayaPolo, and D. Hohl, 2014, Machine–learning based automated fault detection in seismic traces: Presented at the 76th EAGE Conference and Exhibition 2014.
 Zhang et al. (2017) Zhang, K., Y. Chen, Y. Chen, D. Meng, and L. Zhang, 2017, Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising: IEEE Transactions on Image Processing, 26, 3142–3155.
 Zhang and Ulrych (2003) Zhang, R., and T. J. Ulrych, 2003, Physical wavelet frame denoising: Geophysics, 68, 225–231.
 Zhang et al. (2010) Zhang, Z., Y. Sun, and K. Berteussen, 2010, Analysis of surface waves in shallow water environment of the Persian Gulf using S and t‐f‐k transform: SEG Technical Program Expanded, 3723–3727.
Comments
There are no comments yet.