A classic challenge in geophysics is accurately estimating the characteristics of the Earth’s subsurface from measurements acquired by sensors on the surface. Seismic reflection is one of the most widely used methods. It involves generating seismic waves with controlled active sources on the surface (e.g., dynamite explosions in land acquisition or air guns in marine acquisition) and then recording the reflected waves with sensors located above the area [duarte2014seismic]. The term shot refers to a firing by one of these sources. Grouping the seismic signals recorded by the sensors for the same shot into a common-shot domain, called the shot gather, makes it possible to produce an image that represents the subsurface of that area [yilmaz2001seismic].
Seismic shot-gather quality classification is relevant to the early stages of seismic processing and noise removal. However, seismic quality control is challenging: it is traditionally subjective and time-consuming, and it relies on the valuable time of skilled, experienced experts. Applying machine learning to this problem is therefore useful, as it reduces turnaround time by quickly identifying bad-quality shot lines instead of relying on cumbersome visual quality-control techniques.
Some previous works investigate applications of neural networks to geophysical signal classification. Valentine and Woodhouse [valentine2010approaches], for instance, automatically select high-quality seismic data to improve the results of tomographic inversion. To accomplish that, they train an Artificial Neural Network (ANN) to recognize the frequency-domain characteristics of high- and low-quality waveforms. We share the goal of selecting seismic data by quality, but instead of classifying waveforms individually, we take advantage of the typical shot-gather structure and build a classifier that makes an overall decision based on the correlated information present across multiple traces. Jain et al. [similarity2019] use a Convolutional Neural Network (CNN) to perform similarity-based classification [similarity2019, McBrearty2019]. They define an objective similarity function based on the Triplet Network [hoffer2014deep], a deep learning technique widely used in computer vision for face recognition. To apply this procedure, the authors build a Temporal Convolutional Network (TCN) and measure pairwise similarities between seismograms using the triplet loss function. They evaluate their approach with the Receiver Operating Characteristic (ROC) curve and report an area under the curve (AUC) of 87% for K22A, the most active station in the test set from the USArray database (http://www.usarray.org/).
The main contributions of our work are twofold: (1) describing the construction of a dataset for shot-gather image quality classification; and (2) presenting a comparative study of three deep learning-based approaches. The first uses state-of-the-art CNNs as feature extractors whose outputs are fed to an SVM [boser1992training] classifier. The second introduces our own CNN architecture for shot-gather image quality classification. The third tunes the hyperparameters of that CNN. Our proposed CNN architecture achieves the best results, with a weighted F1-score of 94.91% in a 10-fold cross-validation experiment.
2 Seismic shot-gather quality dataset
Our dataset consists of offshore towed-streamer data from a targeted region, comprising shot gathers with 8 cables each, thus containing a total of shot-gather images.
In a common shot gather, the horizontal axis represents the position of each sensor relative to the shot position, a displacement known as the offset distance. The vertical axis represents the recording time of the signal: the longer this time, the deeper the signal traveled into the subsurface.
Out of the total generated images, were chosen and manually classified by a geophysicist as good, bad, or ugly, based on a visual inspection of artifacts related to swell noise and anomalous recorded amplitudes. This process took roughly ten hours of human labor. Figure 1 shows one example per class of shot-gather images classified by the geophysicist. The good label represents clean images, while the ugly label represents images with an intense presence of noise. The bad label represents images that are neither fully clean nor as noisy as those in the ugly class.
Table 1 shows the number and proportion of shot gathers for each of the three labels. Good images dominate the dataset (66.68%), followed by bad (31.54%) and ugly (1.76%), so the good and bad labels together account for almost the entire acquisition. This class imbalance is very challenging for machine learning algorithms, since they are data-intensive and we have only a few ugly shot-gather images. The training set contains 79% of the images and the test set 21%.
3 Minception Network
A typical CNN has a stack of convolutional layers with 3x3 or 5x5 kernels followed by a pooling layer. Because shot-gather noise varies in shape and scale, selecting an appropriate kernel size is a key challenge. The shot-gather image in Figure 2 shows examples of noise variations acquired in a marine seismic survey [elboth2009attenuation]. Larger kernels are better suited to detecting large-scale noise, while smaller kernels work better for thin noise.
Inspired by GoogLeNet’s Inception block [szegedy2015going], we propose using both 3x3 and 5x5 kernels in the same layer. In this way, the network can learn during training which kernel sizes are relevant for capturing the required information.
Figure 3(A) shows the standard Minception block, a simplified version of the Inception block with fewer parameters. Instead of a single convolutional layer, the block applies 3x3 and 5x5 convolutions in parallel, concatenates their outputs into a single tensor, and feeds that tensor to a 3x3 convolution that creates new filters by correlating the outputs of both branches.
In Table 2, the hyperparameter N indicates the number of kernels per layer in each block, equal to 1, 2, 4, and 8, respectively. A final downsampling step, a 5x5 max pooling with stride 2, performs spatial dimensionality reduction. The last two layers are fully connected with 32 units each and feed a softmax layer with 3 outputs, one per class.
|Layer||Output size|
|Down-sampling (area interpolation)|| |
|Minception block 1 (N=1)||97x97x2|
|Minception block 2 (N=2)||45x45x4|
|Minception block 3 (N=4)||19x19x8|
|Minception block 4 (N=8)||6x6x16|
|Fully connected (ReLU)||32|
|Fully connected (Linear)||32|
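The output sizes in Table 2 can be reproduced with simple arithmetic. The sketch below is only one configuration consistent with the table, and every choice in it is our assumption rather than something stated in the text: the area-interpolation step resizes the input to 99x99, the parallel 3x3 and 5x5 convolutions use 'same' padding so their outputs can be concatenated, the merging 3x3 convolution uses no padding, and a 5x5 max pooling with stride 2 and no padding separates consecutive blocks.

```python
# Hypothetical shape trace for the Minception stack in Table 2.
# Assumptions (not stated in the text): 99x99 input after area-interpolation
# downsampling, 'same' padding in the parallel 3x3/5x5 branches, 'valid'
# padding in the merging 3x3 conv, and a 5x5/stride-2 max pool between blocks.


def valid_out(size: int, kernel: int, stride: int = 1) -> int:
    """Output length of an unpadded ('valid') convolution or pooling window."""
    return (size - kernel) // stride + 1


def minception_block_shape(size: int, n: int) -> tuple[int, int, int]:
    """Return (height, width, channels) after one Minception block with N=n."""
    # 'same' padding keeps both branches at `size`; concatenating n + n filters
    # gives 2n channels, and the merging 3x3 conv also has 2n filters.
    channels = 2 * n
    merged = valid_out(size, kernel=3)
    return merged, merged, channels


def trace(input_size: int = 99, ns=(1, 2, 4, 8)) -> list[tuple[int, int, int]]:
    shapes = []
    size = input_size
    for i, n in enumerate(ns):
        if i > 0:
            size = valid_out(size, kernel=5, stride=2)  # 5x5 max pool, stride 2
        h, w, c = minception_block_shape(size, n)
        shapes.append((h, w, c))
        size = h
    return shapes


for n, (h, w, c) in zip((1, 2, 4, 8), trace()):
    print(f"N={n}: {h}x{w}x{c}")
```

Under these assumptions the trace prints 97x97x2, 45x45x4, 19x19x8, and 6x6x16, matching Table 2; other padding or pooling choices would give different sizes.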
We also explored three variants of the Minception block that borrow mechanisms from benchmark image-classification networks. The first, (B) Residual-Minception, uses the skip connection proposed by ResNet, which was designed to ease the training of very deep neural networks. As described in [He2016DeepRL], it is empirically known that, as the depth of a network increases, accuracy saturates and then degrades rapidly. To address this, we add a skip connection to the standard Minception block. To sum the two paths properly, the skip connection first passes through a 1x1 convolution with 2N filters, so that its output matches the shape of the output of the block's final 3x3 convolution, which also has 2N filters.
The second variant, (C) Attention-Minception, uses the self-attention mechanism of [Woo2018CBAMCB]. It is similar to the previous block in that it has a skip connection with a convolutional block, but here that block uses the sigmoid activation function instead of the standard ReLU, and its output is then multiplied element-wise with the output of the main path.
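The gating step of the Attention-Minception block can be sketched as follows. This is a minimal illustration under our own assumptions: a single-channel feature map stored as nested lists, gate logits that stand in for the (not shown) output of the convolutional block on the skip path, and the gate being applied to the output of the main path, which is our reading of the description above.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def attention_gate(features, gate_logits):
    """Element-wise product of a feature map and sigmoid(gate_logits).

    features:    H x W nested list, output of the main Minception path (assumed).
    gate_logits: H x W nested list, pre-activation output of the convolutional
                 block on the skip path (hypothetical input; the conv is not shown).
    """
    return [
        [f * sigmoid(g) for f, g in zip(f_row, g_row)]
        for f_row, g_row in zip(features, gate_logits)
    ]


# Gate values near 0 suppress a position; values near 1 keep it almost unchanged.
print(attention_gate([[2.0, 4.0], [1.0, 3.0]], [[0.0, 10.0], [-10.0, 0.0]]))
```

Because the sigmoid maps every logit into (0, 1), the gate can only attenuate the main-path features, whereas a ReLU at the same position would pass unbounded positive values.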
The last variant, (D) SE-Minception, incorporates the channel attention mechanism of the Squeeze-and-Excitation (SE) module from SE-Net [Hu18]. Squeeze-and-Excitation improves the quality of the representations produced by the network by learning global information from channels and dynamically emphasizing informative features. SE consists of three operations: (1) squeeze spatial information into a channel descriptor, (2) capture channel-wise dependencies to focus on important features, and (3) scale the original channels by channel-wise multiplication. The first operation is performed by global average pooling, which produces a summary of channel-wise statistics. The second is performed by two consecutive fully connected (FC) layers: the first compresses the channel dimension by a given reduction ratio, and the second expands it back to the input channel size.
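The three SE operations can be sketched in plain Python for a single sample. The ReLU after the first FC layer and the sigmoid after the second follow the original SE-Net design rather than anything stated above; the weights are hand-picked placeholders, not learned values, and the function names are our own.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dense(vec, weights, biases):
    """Fully connected layer: weights[j][i] connects input i to output j."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, biases)]


def squeeze_excite(feature_map, w1, b1, w2, b2):
    """Apply an SE module to one sample.

    feature_map: list of C channels, each an H x W nested list.
    w1, b1: first FC layer, C -> C // r (r is the reduction ratio).
    w2, b2: second FC layer, C // r -> C.
    Returns (rescaled feature map, per-channel scales).
    """
    # (1) Squeeze: global average pooling gives one statistic per channel.
    descriptor = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # (2) Excitation: reduce with ReLU, expand with sigmoid (as in SE-Net).
    hidden = [max(0.0, h) for h in dense(descriptor, w1, b1)]
    scales = [sigmoid(s) for s in dense(hidden, w2, b2)]
    # (3) Scale: multiply every value of channel c by scales[c].
    rescaled = [
        [[v * s for v in row] for row in channel]
        for channel, s in zip(feature_map, scales)
    ]
    return rescaled, scales


# Hand-picked weights: C = 2 channels, reduction ratio r = 2 (hidden size 1).
fmap = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
_, scales = squeeze_excite(fmap, w1=[[1.0, 1.0]], b1=[0.0], w2=[[1.0], [-1.0]], b2=[0.0, 0.0])
print(scales)  # first channel emphasized (~0.98), second suppressed (~0.02)
```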
4 Experimental Evaluation
In this section, we evaluate the effectiveness of the proposed approaches for seismic shot-gather image quality classification. First, we report the performance of the SVM classifier combined with transfer learning. Next, we investigate the performance of our end-to-end CNN, MinceptionNet. Finally, we tune MinceptionNet, exploring the trade-offs of two hyperparameters: a width multiplier and a resolution multiplier.
We use Adam as the optimizer with a learning rate of 0.001 and a batch size of 64, and train for up to 100 epochs with early stopping (patience of 10 epochs), always keeping the model with the best accuracy on the validation set.
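The early-stopping rule can be sketched in plain Python. The helper below is hypothetical: it takes a precomputed list of per-epoch validation accuracies instead of running training, treats only a strictly higher accuracy as an improvement, and returns the epoch whose weights would be kept. Only the settings in the configuration dictionary come from the text.

```python
# Training settings stated in the text; the dict itself is only illustrative.
TRAINING_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "batch_size": 64,
    "max_epochs": 100,
    "patience": 10,
}


def run_early_stopping(val_accuracies, max_epochs=100, patience=10):
    """Return (best_epoch, best_accuracy, last_epoch_run).

    val_accuracies: validation accuracy measured after each epoch (0-based).
    Training stops once `patience` consecutive epochs fail to improve on the
    best accuracy; the weights from `best_epoch` are the ones to keep.
    """
    best_acc = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs]):
        if acc > best_acc:  # strict improvement resets the counter
            best_acc, best_epoch = acc, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, best_acc, epoch
    return best_epoch, best_acc, min(len(val_accuracies), max_epochs) - 1


history = [0.50, 0.60, 0.70] + [0.65] * 20
print(run_early_stopping(history, TRAINING_CONFIG["max_epochs"], TRAINING_CONFIG["patience"]))
# -> (2, 0.7, 12): best at epoch 2, stopped after 10 epochs without improvement
```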
The dataset is imbalanced, with the ugly class accounting for less than 2% of the images. We perform a 10-fold cross-validation and evaluate the models with the F1-score per class and the weighted F1-score (F1-W), which is the sum of the per-class F1-scores weighted by each class's proportion of samples.
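Written out, with symbols introduced here for illustration ($P_c$ and $R_c$ are the precision and recall of class $c$, $n_c$ is its number of samples in the evaluation fold, and $n$ is the total), the per-class and weighted scores are:

$$
\mathrm{F1}_c = \frac{2\,P_c\,R_c}{P_c + R_c},
\qquad
\text{F1-W} = \sum_{c \in \{\mathrm{good},\,\mathrm{bad},\,\mathrm{ugly}\}} \frac{n_c}{n}\,\mathrm{F1}_c,
\qquad
n = \sum_{c} n_c .
$$

Because the weights are class proportions, F1-W is dominated by the good and bad classes: with the ugly class at about 2% of the samples, a 10-point change in F1-U moves F1-W by only about 0.2 points, which is why F1-U is also reported separately.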
For the baseline method, we use an SVM [boser1992training] classifier with input features obtained by transfer learning: the output of a chosen hidden layer of a CNN pre-trained for object recognition. We use the inner layers of VGG16, VGG19 [simonyan2014very], and InceptionV3 [szegedy2016rethinking], with weights trained on the ImageNet classification dataset [deng2009imagenet]. Figure 4 illustrates the workflow of this approach with the InceptionV3 network.
When trained, CNNs tend to learn first-layer features that resemble either Gabor filters or color blobs. In the intermediate and final layers, combinations of these filters extract increasingly complex patterns from the images [yosinski2015understanding]. After extracting features from each layer, we compare the performance of the SVM classifier trained on each of them.
Tables 3, 4, and 5 summarize the results for VGG16, VGG19, and InceptionV3, respectively. The best model uses features extracted from VGG19’s block 2, which produces an F1-W of 90.82%. VGG16’s block 2 is the second-best model, with an F1-W of 88.99%. Block 2 is the best layer for both VGG networks, which also obtain their best F1-Good and F1-Bad with it. VGG16 and VGG19 perform similarly, with VGG19 slightly better. InceptionV3 is the worst-performing model: its best F1-W, 86.02% with block 0, is below the best results of both VGG networks.
|Block||F1-W (%)||F1-G (%)||F1-B (%)||F1-U (%)|
|Block||F1-W (%)||F1-G (%)||F1-B (%)||F1-U (%)|
|Block||F1-W (%)||F1-G (%)||F1-B (%)||F1-U (%)|
4.3 Minception Network
Table 6 shows the results for each MinceptionNet variant. The best model is SE-MinceptionNet (D), which produces an F1-W of 93.84%. Residual MinceptionNet (B) is the second-best model, with an F1-W of 93.76%. Our basic architecture, Standard MinceptionNet (A), is third, with an F1-W of 92.90%. Attention MinceptionNet (C) achieves an F1-W of only 91.09%. SE-MinceptionNet also reaches the best F1-Good and F1-Ugly among the MinceptionNet variants.
|Model||F1-W (%)||F1-G (%)||F1-B (%)||F1-U (%)|
|Attention MinceptionNet (C)||91.09||95.20||85.27||40.13|
|Residual MinceptionNet (B)||93.76||96.55||90.52||46.88|
|Standard MinceptionNet (A)||92.90||96.00||89.20||42.60|
|SE-MinceptionNet (D)||93.84||96.58||90.51||50.26|
4.4 Minception Tuning
In this phase, we tune the best MinceptionNet variant, SE-MinceptionNet, by adjusting our initial network. Specifically, we search over the kernel quantity multiplier α. With α = 1, the network is exactly the default SE-MinceptionNet; for larger values, the initial number of filters in each convolution is multiplied by α. We expect that increasing the number of filters per stage, and thus the capacity of the network, can improve performance. However, larger values of this hyperparameter also increase the computational effort and, consequently, the computational budget, so there is a trade-off between performance and computational cost.
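The cost side of this trade-off can be estimated by counting convolutional parameters as a function of α. This is a rough sketch under our own assumptions: a single-channel input, a bias in every convolution, the block structure described in Section 3 (parallel 3x3 and 5x5 convolutions with αN filters each, followed by a 3x3 merge convolution with 2αN filters), and no SE layers or fully connected head in the count. The helper names are hypothetical.

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights plus one bias per output channel for a kernel x kernel convolution."""
    return (kernel * kernel * c_in + 1) * c_out


def minception_block_params(c_in: int, n: int) -> int:
    """Parallel 3x3 and 5x5 convs (n filters each), then a 3x3 merge conv with 2n filters."""
    return conv_params(3, c_in, n) + conv_params(5, c_in, n) + conv_params(3, 2 * n, 2 * n)


def stack_params(alpha: int, base_ns=(1, 2, 4, 8), input_channels: int = 1) -> int:
    """Convolutional parameters of the four-block stack when every N is scaled by alpha."""
    total, c_in = 0, input_channels
    for n in base_ns:
        n_scaled = alpha * n
        total += minception_block_params(c_in, n_scaled)
        c_in = 2 * n_scaled  # the merge conv output feeds the next block
    return total


for alpha in (1, 2, 4, 8):
    print(alpha, stack_params(alpha))
```

Under these assumptions, α = 1 gives 6,010 convolutional parameters and α = 8 gives 379,376, roughly 63 times more: the merge convolution maps 2αN to 2αN channels, so its cost grows with α².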
To find the best configuration, we perform a grid search over the kernel quantity multiplier α, from 1 to 10. Table 7 shows the results of SE-MinceptionNet for each value of α. The best value is α = 8, which produces an F1-W of 94.91%, F1-G of 97.27%, F1-B of 92.08%, and F1-U of 56.01%.
|α||F1-W (%)||F1-G (%)||F1-B (%)||F1-U (%)|
|1||93.84 ± 0.25||96.58 ± 0.24||90.51 ± 0.55||50.26 ± 7.24|
|2||94.11 ± 0.21||96.68 ± 0.29||90.82 ± 0.57||55.72 ± 4.86|
|3||94.38 ± 0.26||96.88 ± 0.30||91.29 ± 0.65||55.15 ± 5.19|
|4||94.42 ± 0.32||96.94 ± 0.39||91.32 ± 0.85||54.83 ± 5.35|
|5||94.37 ± 0.34||96.88 ± 0.26||91.20 ± 0.60||56.64 ± 5.84|
|6||94.60 ± 0.25||97.09 ± 0.24||91.73 ± 0.43||51.92 ± 6.17|
|7||94.61 ± 0.36||97.14 ± 0.31||91.71 ± 0.67||51.15 ± 8.10|
|8||94.91 ± 0.29||97.27 ± 0.21||92.08 ± 0.51||56.01 ± 5.59|
|9||94.52 ± 0.33||97.16 ± 0.34||91.34 ± 0.95||51.80 ± 7.09|
|10||94.59 ± 0.21||97.05 ± 0.29||91.62 ± 0.75||54.70 ± 5.49|
4.5 Model Results
In Table 8, we summarize our empirical findings for the five examined models. The best model is SE-MinceptionNet with α = 8, which produces an F1-W of 94.91%. The standard SE-MinceptionNet (α = 1) is the second-best model, with an F1-W of 93.84%. Our baselines VGG19, VGG16, and InceptionV3 rank third, fourth, and fifth, with F1-W values of 90.82%, 88.99%, and 86.02%, respectively.
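|Model||F1-W (%)||F1-G (%)||F1-B (%)||F1-U (%)|
|SE-MinceptionNet (α = 8)||94.91||97.27||92.08||56.01|
|SE-MinceptionNet (α = 1)||93.84||96.58||90.51||50.26|
|VGG19’s block 2 + SVM||90.82||95.64||83.10||47.21|
|VGG16’s block 2 + SVM||88.99||95.08||81.48||49.96|
|InceptionV3’s block 0 + SVM||86.02||93.23||73.07||45.56|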
In Table 9, we show the results of SE-MinceptionNet with α = 8 on the test set, where it achieves an F1-W of 93.56%, F1-G of 95.94%, F1-B of 86.80%, and F1-U of 28.57%. The model performs best on the good class and shows high recall values for all classes: Recall-G of 92.73%, Recall-B of 93.82%, and Recall-U of 100%. In contrast, F1-B and F1-U drop relative to cross-validation, due to the low Precision-B (80.76%) and Precision-U (16.67%).
|F1 (%)||Recall (%)||Precision (%)|
The confusion matrix on the test set shows that the model correctly identified 970 good images, 319 bad images, and 3 ugly images. Most of its errors occur between adjacent classes: it misclassifies 76 good images as bad and 15 bad images as ugly. As strong points, the model achieves high recall values and high precision for the good class. However, the Recall-U of 100% should be read with caution, since the test set contains only 3 ugly images (the ugly class makes up just 1.76% of the labeled dataset). The model's weak point is its low precision for the bad and ugly classes: many images it assigns to these classes actually belong to a cleaner class.
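The test-set metrics can be recomputed from a confusion matrix. The text gives the diagonal (970, 319, 3) and two off-diagonal counts (76 good predicted as bad, 15 bad predicted as ugly); the remaining cells below (6 bad images predicted as good, zeros elsewhere) are our inference from the reported recall and precision values, not counts stated by the authors. With that reconstruction, the sketch reproduces the reported per-class figures and the F1-W of 93.56%.

```python
CLASSES = ("good", "bad", "ugly")

# Rows: true class; columns: predicted class (good, bad, ugly).
# 970, 76, 319, 15 and 3 come from the text; the other cells are INFERRED
# from the reported recall/precision values and are not given explicitly.
CONFUSION = [
    [970, 76, 0],
    [6, 319, 15],
    [0, 0, 3],
]


def per_class_metrics(cm):
    """Map class index -> (precision, recall, f1, support)."""
    metrics = {}
    k = len(cm)
    for c in range(k):
        tp = cm[c][c]
        support = sum(cm[c])                       # true samples of class c
        predicted = sum(cm[r][c] for r in range(k))  # samples predicted as c
        recall = tp / support if support else 0.0
        precision = tp / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = (precision, recall, f1, support)
    return metrics


def weighted_f1(cm):
    """Per-class F1 weighted by each class's share of the true samples."""
    metrics = per_class_metrics(cm)
    total = sum(m[3] for m in metrics.values())
    return sum(m[2] * m[3] for m in metrics.values()) / total


for c, (p, r, f1, n) in per_class_metrics(CONFUSION).items():
    print(f"{CLASSES[c]:>5}: P={p:.2%} R={r:.2%} F1={f1:.2%} (n={n})")
print(f"F1-W = {weighted_f1(CONFUSION):.2%}")
```

The ugly row also illustrates why Recall-U is fragile: with a support of 3, a single missed ugly image would drop Recall-U from 100% to about 67%.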
In this work, we examine the application of deep learning to classifying the quality of noisy shot-gather images. To that end, we build a dataset of seismic shot-gather images and manually classify them as good, bad, or ugly, based on a visual inspection of artifacts related to swell noise and anomalous recorded amplitudes. We propose MinceptionNet, a network architecture inspired by GoogLeNet’s Inception block but with fewer parameters. We also propose three Minception variants that incorporate residual connections (Residual Minception), attention (Attention Minception), and squeeze-and-excitation (SE-Minception).
As baselines in a 10-fold cross-validation experiment, we use three CNNs (InceptionV3, VGG16, and VGG19) to extract features from shot-gather images and feed these features to an SVM classifier. The best baseline result is obtained with the second block of VGG19, with an F1-W of 90.82%. Among the Minception variants, SE-Minception achieves the best result, with an F1-W of 93.84%. We then optimize SE-MinceptionNet by performing a search over α, a parameter that multiplies the number of initial kernels in the standard Minception block. With this search, we look for a balance between the depth and the width of the network. The validation results show that setting α to 8 raises the F1-W to 94.91%. On the test set, SE-MinceptionNet achieves a final F1-W of 93.56%.
As future work, we plan to use deep learning models for data augmentation to increase the number of ugly shot-gather images available for training. We expect this to yield a more reliable recall estimate and a higher F1-score for the ugly class.
The authors would like to thank Petrobrás S.A. for the collaboration and for providing the dataset. The authors would also like to thank Raphael Rocha, Luis Felipe Müller, Matheus Cabral, Miguel de Brito and Ivan Pereira from the Pontifical Catholic University of Rio de Janeiro.