1 Introduction
Noise consistently appears as an unwanted companion to the desired signal in seismic recordings. As such, noise suppression is a fundamental step in all seismic processing workflows [1]. Arising from local site conditions, as well as being excited by a seismic source, the total noise field can be seen as the sum of many noise components arising from different sources, each with their own characteristics [2]. Typically, noise suppression procedures identify a defining property that easily distinguishes the targeted noise from the desired signal and leverage that to separate the former from the latter. In this paper, we consider the random component of the noise field and leverage its nonpredictable nature to build a suppression procedure.
Random noise suppression has been extensively investigated by the seismic community with the majority of the proposed techniques falling into one of the following categories: prediction, transformation and decompositionbased. Predictionbased approaches typically employ predictionfilters which aim to leverage the predictable nature of the coherent signal and therefore act as noise suppressors. Examples of such approaches include tx predictive filtering and fx deconvolution both of which can be applied in a stationary or nonstationary manner (e.g., [3, 4, 5, 6]). Transformationbased approaches transform the data into a domain in which the signal and noise can be easily distinguished due to their individual characteristics. By exploiting the sparse nature of seismic data in the curvelet domain, the curvelet transform is an example of a commonly used transformationbased denoising procedure (e.g., [7, 8, 9]). Similarly, other transformationbased methods have been proposed in the literature that use different transforms, for example, the wavelet [10], shearlet [11], and seislettransforms [12], among others. Finally, decompositionbased procedures express the seismic data as the composition of weighted basis functions and suppress those associated to the noise components. Such decomposition procedures utilise the likes of spectral decomposition [13], Emperical Mode Decomposition [14]
, and Singular Value Decomposition
[15], among others.With the increased interest in the use of Machine Learning (ML) in geophysics, a new class of random noise suppression procedures have been proposed. The majority of these approaches fall into the realm of Deep Learning (DL) and use a supervised training approach, which requires clean data for training to accompany the noisy input data. As is prevalent across seismic applications of DL, a number of studies have considered the use of synthetic seismic datasets for training a Convolutional Neural Network (CNN) (e.g.,
[16, 17, 18]). Whilst these experiments have shown promising denoising results on synthetic datasets they struggle on field data [19]. Alternatively, others have used conventional denoising procedures to create their ‘clean’ counterparts to the noisy input data for their CNN denoising procedures (e.g., [20]). However, as no suppression procedure is perfect, noise residual is likely to exist within the clean counterparts and is likely to hamper the networks denoising capabilities. Unsupervised DL procedures have no such requirements of noisyclean pairs for training. Recently, [19] illustrated how an encoderdecoder network could be trained on noisy seismic data for random noise attenuation whilst [21] detailed how an alternative convolutional network architecture can be used without any requirement on windowing the data. Both these approaches were shown to outperform the FXdeconvolution noise suppression procedure.Considering the broader scientific community, most DL approaches for random noise suppression of images, or imagelike data, are typically supervised and therefore have the requirement of paired noisyclean datasets for training [22]. This is often an unrealistic requirement  not just in seismology but across many other fields where there is no monitoring technique in which a clean dataset can be collected. In 2018, [22] proposed Noise2Noise which illustrated how, under the assumption of a stationary signal, a Neural Network (NN) could be trained to denoise an image based on training over two noisy samples. Whilst this removes the requirement of noisyclean pairs, it requires noisynoisy pairs in which the signal is consistent but the noise is varying within each pair  a problematic requirement for many monitoring applications. Building on this, [23] proposed Noise2Void (N2V) which requires only a single noisy image for training. Under the assumption that noise is statistically independent between samples, a blindspot network is used to predict a central sample’s value based on neighbouring samples. As the noise is independent between samples, the noise’s contribution to the sample’s value cannot be predicted and therefore only the signal’s contribution is predicted, resulting in an efficient denoising procedure. Whilst N2V is an ML approach, it can also be considered as a predictionbased approach; wherein, it leverages the ability to predict the signal and the inability to predict the noise resulting in a denoised image.
Previously applied to natural images and microscopy data among others, in this paper we investigate the adaptation of the N2V workflow to handle the highly oscillatory seismic signal and pseudorandom noise. Through an extensive hyperparameter analysis, we identify the optimum hyperparameters for the seismic denoising scenario considering both immediate improvements in the image domain and those observed for downtheline tasks, such as seismic inversion. The paper concludes by illustrating the potential of N2V through an application to a field dataset.
2 Theory: Blindspot networks
Noise2Void [23] is based on the concept of blindspot NNs, which aim to predict a central pixel value based on neighbouring pixels. Operating on patches of a single image , N2V works by replacing a set of nonadjacent pixels from each patch, herein referred to as active pixels, with randomly selected neighbouring pixels, that pertain to the receptive field of the chosen network , as illustrated in Figure 1. The corrupted patches become the input to a NN whilst the corresponding original patches represents the target values. In theory, the NN architecture, denoted as where refers to the trainable parameters, could be anything that can realistically map between the input and target values. In this paper, we follow the original N2V NN architecture: a 2layer UNet styled after [24], as illustrated in Figure 1
. As opposed to standard NN image processing tasks, the loss function here is not computed on every pixel in the image, instead it is only evaluated for the active pixels, i.e., those that were corrupted in the input image:
(1) 
where refers to the norm used in the loss  Mean Absolute Error (MAE) or Mean Squared Error (MSE), respectively  and is the number of available training samples (i.e., patches extracted from the image). [23] illustrated how MSE is the preferred choice for denoising additive WGN with respect to N2V. However, MAE is a prevalent choice for seismic deep learning applications. In Appendix 1, we provide a mathematical formulation that explains under which circumstances MSE and MAE should be used, respectively. Finally, we also highlight that no theoretical guarantee can be provided in the case of correlated noise therefore, we decided to experiment with both loss functions in the following numerical examples.
Once trained, the model is applied directly to the full seismic data. Note that at this stage, windowing of the seismic data is not required due to the ability of CNNs to handle dynamically varying input sizes. However, in the scenario where the data dimensions are not compatible with the down/upsampling of the UNet, then the input data are zeropadded to achieve an acceptable input data size.
2.1 Performance metrics
Whilst theoretically, N2V can be applied at any processing stage where random noise is observed in the seismic data, in this paper we focus on seismic images after time migration. A common challenge with many denoising procedures is that as part of the noise suppression process not only is the noise suppressed but also the signal is significantly damaged; something that may have a negative effect on downtheline tasks. With this in mind, three performance metrics are considered:

[topsep=0pt,itemsep=1ex,partopsep=1ex,parsep=1ex]

Image Peak SignaltoNoise Ratio (PSNR),

Frequency correlation, and

Poststack inversion PSNR.
The image PSNR is calculated as
(2) 
where represents the clean data, the modified data (either noisy or denoised), and and represent the number of time samples and receivers, respectively. To quantify the effect of the N2V denoising, we compute the percent change in the PSNR between the noisy and denoised images. This can be written as,
(3) 
where and are the values computed from the noisy and denoised images, respectively.
The change in frequency is quantified using the sample Pearson’s correlation coefficient, where the aim is to return the noisy data’s amplitude spectra to that of the clean data. This is computed as:
(4) 
where and are the amplitude spectra of the clean and denoised data, respectively, averaged over the spatialaxis, and are the sample means of and , respectively, and is the number of samples in the spectra. Similarly to the image PSNR, to analyse the effect of N2V we compute the percent change in the sample Pearson’s correlation coefficient when computed with the noisy versus denoised data,
(5) 
where is the correlation coefficient between the clean and noisy data and is the correlation coefficient between the clean and denoised data.
As a final metric of comparison, the denoised data are used as input for a standard downtheline task, namely poststack inversion [25]. By doing so, we can inspect the effect of denoising and possible signal damage on our ability to estimate an acoustic impedance model of the subsurface. Poststack inversion assumes that each trace of the poststack seismic data can be represented by a simple convolutional model as follows:
(6) 
where is the acoustic impedance profile in the time domain and
is the time domain seismic wavelet. We rewrite this expression for the entire set of traces in the poststack data in a compact matrixvector notation,
, where d and m are vectorized seismic data and the natural logarithm of the acoustic impedance model, is a convolution operator and is a first derivative operator. The model vector is then estimated by means of L2regularized inversion using the PyLops computation framework [26]. The PSNR values for the inverted models are calculated as in equation 4 where is the true model and is the inverted model. As with the above two performance metrics, for the inversion PSNR we calculate the percent change between the inversions for the noisy and denoised inversions.Finally, for the field dataset there is no ‘clean’ dataset for comparison and as such, the above metrics cannot be easily computed. In this situation, we perform qualitative comparison of the raw and denoised images, frequency content, and inversion products. In addition to this, we benchmark the N2V approach against two commonly used techniques for random noise suppression: FXdeconvolution and sparsitypromoting inversion with Curvelet transform. FXdeconvolution [27] is based on the concept that the coherent part of a seismic trace can be predicted in the FX domain from the previous traces via spatial predictive filters. Our results are based on the Madagascar implementation of FXdeconvolution [28] with the same parameters used by [29] for this dataset, namely a window of 20 traces with a filter length of 4 traces. The second method is instead based on the principle that seismic data have a sparse representation in the Curvelet domain [8] whilst random noise maps incoherently across the Curvelet domain. We employ sparsitypromoting inversion with the Fast Iterative ShrinkageThresholding Algorithm (FISTA) solver [30] and softthresholding to attenuate random noise in our seismic data.
3 Data
To be able to perform a quantitative analysis of our denoising procedure the majority of this study is performed on synthetically generated data. Utilising the SEG Hess VTI model and a 30 Hz Ricker wavelet, 2D slices are created in order to mimic the multiple inlines or crosslines that would be available from a 3D survey. Two different synthetic datasets are created with this base waveform data: one with White, Gaussian Noise (WGN) as illustrated in Figure 2(a,d) and one with 5100 Hz bandpassed noise as illustrated in Figure 2(b,e).
The paper concludes with an application of the N2V denoising procedure on a field dataset from a land acquisition in China that is heavily contaminated by random noise. Previously analysed by [6] and [29], a single 2D line from the poststack volume was released under the Madagascar framework as part of a continued effort towards reproducible research. Figure 2(c,f) illustrates the seismic image and its respective amplitude spectra. The sampling rate for the datasets (synthetic and field) is 2 ms.
3.1 Data preparation
Following the procedure of N2V, patches are randomly extracted from the seismic lines to form the training dataset. To further increase the size of such dataset, common data augmentation techniques of both rotation and polarity reversal are employed. This results in the data increasing 8fold as illustrated in Figure 3. The number and size of the training patches varies between the examples in this paper and are detailed in Table 1.
Exp.1: WGN  Exp.2: BP Noise  Exp.3: Land Data  
TrainValidate  4500500  4500500  4500500 
Epochs  150  25  15 
Batch size  128  64  128 
Patch size  64x64  32x32  32x32 
% Active pixels  0.2  25  33 
Neighborhood radius  5  15  15 
Loss  MSE  MAE  MAE 
UNet depth  2  2  2 
Kernel size  3x3  3x3  3x3 
4 Numerical Examples
4.1 Synthetic with WGN
The initial example portrays a layman’s application of N2V onto seismic data utilising the same noise properties (i.e., WGN) and hyperparameters as detailed in the original N2V study on natural images [23] and displayed in Table 1. Figure 4 shows the progression of the loss (equation 1) during the training period for all the active pixels in the 4500 training patches and the 500 validation patches. An application of the trained model to a 2D line from the synthetic dataset is illustrated in Figure 5. The training took minutes on an Nvidia Quadro RTX 4000 while the application on the 2D line composed of 198 traces of 453 time samples took
milliseconds. In the image domain, the PSNR has increased by 73% whilst in the frequency domain there is a much higher similarity between the spectra of the noisefree wavefield (black) and the denoised data (green), as opposed to the noisy data (red). When inversion is performed on the data, it is clear that the inversion on the denoised data produces an acoustic impedance model that is closer to the true model than that of the noisy data, with significantly fewer artefacts. Overall, for seismic data contaminated by WGN, it is shown that N2V can accurately learn to reproduce the seismic signal from surrounding samples without recreating the noise. This significantly improves the data quality both for current tasks (in the image domain) and downtheline tasks, such as inversion.
4.2 Synthetic with bandpassed noise
The second example focuses on providing a more realistic example using bandpass filtered noise to identify the potential of N2V for seismic applications. In this example, we have performed a hyperparameter sweep to identify the optimum hyperparameters for denoising of bandpass filtered random noise. Over 860 parameter combinations were considered from the values detailed in Table 2. Due to the computational cost of training, only one model is generated per hyperparameter combination however, 100 additional synthetic datasets are generated to analyse each models’ performance. Figure 6 illustrates the performance of a subset of the hyperparameters for a fixed windowsize () and fixed loss function ().
Parameter value options  

Epochs  [5, 10, 25, 50, 75, 100, 150] 
Batch size  [32, 64, 128] 
Patch size  [32x32, 64x64] 
% Active pixels  [2, 10, 25, 33] 
Neighborhood radius  [5, 15,30] 
Loss  [MSE, MAE] 
TrainValidate  4500500 
UNet depth  2 
Kernel size  3x3 
As detailed above, the models are evaluated on the PSNR gain in both the image and acoustic impedance domains, as well as the increase in the correlation with the amplitude spectra of the noisefree data. Considering Figure 6, the general trend of the experiments shows that as training progresses, i.e. the number of epochs increases, the PSNR gain in the image domain (top row) slightly decreases whilst the PSNR gain in the acoustic impedance domain (bottom row) moderately increases. Finally, the batch size is shown to have a limited influence in comparison to the other hyperparameters. The optimum hyperparameter combination is a sum of the ranking of each combination across all three scoring criteria.
The optimum hyperparameter combination, as detailed in the middle column of Table 1, is used to train the bandpassed N2V model with Figure 7 illustrating the loss function progression during training. Figure 8 shows the result of the trained network applied to a synthetic slice contaminated by bandpass filtered noise. Similar to the WGN results, a PSNR increase is observed in the image domain, alongside an increase in the similarity with the amplitude spectra of the noise free data. However, more substantial signal leakage is observed around the central salt body in comparison to the WGN results, as well as some noise residual remaining in the denoised result. Despite this, the inversion on the denoised image results in a cleaner subsurface model than that from the noisy image.
4.3 Field data application
To conclude, the N2V workflow is applied to a land dataset, which is known to be contaminated by random noise. Trained using patches, the model is applied to the full 2D line. The resulting denoised image is shown in Figure 9 and we compare it with results from the conventional FXdeconvolution and Curvelet denoising procedures. Figure 10 provides a zoomedin comparison for three areas of interest spanning the model’s depth range. Considering the difference between the original and denoised datasets (bottom row of Figure 9), the Curvelet approach has removed the most noise however we argue that it has possibly oversmoothed the data (effectively reducing the resolution) as well as introduced some linear artefacts (particularly noticable in the top row of Figure 10). This is likely due to the fact that since the Curvelet transform explains an image as the superposition of localised oriented wave packets, the denoising process may have slightly corrupted the relative weighting of the different wave packets whilst trying to suppress the noise. On the other hand, while the FX approach has also removed more energy than the N2V approach, the resulting denoised image is still heavily contaminated by noise, which is particularly observable in the closeups of Figure 10. In addition to this, all approaches have resulted in a certain amount of signal leakage, even though of different nature from method to method.
Supporting what is observed in the image domain, Figure 11 illustrates the differences in the amplitude spectra between the different denoised datasets. Both the FXdeconvolution and Curvelet domain results have reduced the energy across all bandwidths with the Curvelet approach outperforming the FX approach between 60100 Hz. The N2V results show less reduction in the bandwidths around which the signal is expected with a reduction in energy being observed above Hz. However, even at higher frequencies where signal is likely not to be present, the N2V approach does not reduce the amplitude spectrum to the levels of the other two procedures.
Finally, Figure 12 shows the inversion products for the original data and the denoised results. Similar to the observations in the image domain, the N2V results seem to have more details than the overlysmoothed Curvelet results. Conversely, the partial attenuation of genuine signal in the FX data leads to lower contrasts between features in the inverted acoustic impedance model.
5 Discussion
Blindspot networks offer a solution to the predicament of requiring noisyclean pairs of training images for deep learning denoising procedures. Previously utilised in applications on the likes of natural images [23], Computed Tomography (CT) images [31] and Synthetic Aperture Radar (SAR) images [32], we have shown that under the right circumstances, N2V can also be a powerful denoiser for seismic data. N2V relies on the assumption that noise is statistically independent between pixels, or, as in the seismic case, between each spatiotemporal sample. In reality, noise in seismic data is always correlated to some extent. Despite this, we have shown that, whilst providing the best results on WGN, N2V can still efficiently denoise both synthetic bandpass filtered noise as well as recorded noise from a field acquisition. It was observed that for seismic denoising the number of epochs had to be significantly reduced in comparison to the initial N2V applications, whilst the number of active pixels had to be increased. The reduction of epochs hinders the network from learning to replicate mildly correlated noise whilst still providing adequate training time to learn the dominant signals in the data. Whereas, increasing the number of active pixels acts as a regulariser to the training procedure by introducing additional corruption into the training dataset.
Typically in seismic applications, denoising is performed either prior to interpretation or as preparation for downtheline tasks such as inversion. In this paper, we took a backseat approach and choose a hyperparameter selection that was a compromise between the three performance metrics. This resulted in a PSNR gain of 39.28% in the image domain and 1.32% in the inversion domain for the realistic synthetic example (Hess VTI model with bandpass filtered noise). However, if the denoising was being performed on data only for direct interpretation, we could have selected the best hyperparameters for this task, which would have resulted in an image domain PSNR gain of 50.24%. Similar can be said for inversion, where the inversion PSNR gain would have been 6.27% for the optimum hyperparameter combination (as illustrated in Figure 6). Typically DL procedures are accompanied by a lengthy training time, often rendering the approaches significantly more computationally expensive than conventional procedures [33]. However, due to the small number of epochs required, the N2V approach can be trained and subsequently applied within a matter of minutes for the field data  minutes for our field data experiment training. Where poststack volumes are available, an extension to 3D denoising would be possible through the adaption from 2D convolutional layers in the NN to 3D convolutional blocks. This would likely further improve the denoising procedure at the price of increasing the computational cost and memory requirements of the network.
The potential of N2V was illustrated using poststack seismic data, however there is no limitation on the processing stage at which blindspot networks could be applied for denoising. The poststack scenario was used due to the availability of a field dataset that is known to be contaminated by random noise and that has been extensively investigated by others as a benchmark dataset for random noise suppression procedures (e.g., [6, 29]. However, in theory the technique could equally be applied to shotgathers, receiver gathers, or even passive seismic data, assuming each of these are contaminated by random noise. One known limitation of N2V is the assumption of statistical independence between samples. [34] proposed an extension to the N2V workflow to adapt the approach for structured noise suppression. Structured N2V utilises selective masking to minimise any contribution of correlated noise into the prediction of the active pixels value. This extension suggests the potential of utilising selfsupervised networks for the suppression of correlated noise signals in seismic data.
Finally, in the initial publication of N2V, the authors acknowledge that: “Intuitively, N2V cannot be expected to outperform methods that have more information available during training.". In other words, N2V is likely to be outmatched by well trained, supervised denoising networks. However, as noted above, creating seismic datasets with the noisyclean pairs required for training traditional supervised procedures is not trivial. When noisyclean pairs are generated using a previous denoising technique, such as by the Curvelet transform, the inclusion of DL can only serve to speed up the original denoising procedure. As the network learns from the training samples provided then it cannot outperform the denoising technique which was used to generate the training data. Alternatively, when synthetic data is generated to act as the training dataset, the clean image will definitely not contain any noise residual. However, generating synthetic data that accurately represent field data is a wellknown challenge [35]. Therefore, whilst certain steps can be taken to reduce the syntheticfield data gap for DL applications [36], there is no guarantee that a synthetically trained network will be as effective when applied to field data. Recently, [37] proposed the first blindspot network procedure that was shown to perform onpar with, and sometimes outperform, supervised denoising approaches for natural images contaminated by independent and identically distributed additive Gaussian noise. Future studies will consider circumstances under which selfsupervised networks, of varying architectures and training procedures, can outperform supervised networks trained on synthetic seismic data, and vice versa.
6 Conclusion
We have shown how blindspot networks can be applied to accurately predict seismic signals without replicating noise, and as such, provide a powerful random noise suppression procedure. As a selflearning procedure, no additional data are required for training, removing the common barriers of most deep learning denoising procedures that often require a ‘clean’ training dataset. The Noise2Void method has been successfully applied on two synthetic and one field datasets. Whilst originally developed for random, additive, white noise, our numerical results show that such networks can be successfully trained to also remove partially correlated noise provided that the number of training iterations is reduced whilst the number of corrupted pixels is increased.
7 Acknowledgements
The authors thank A. Krull, T.O. Buchholz, and F. Jug for opensourcing their TensorFlow implementation of Noise2Void. For computer time, this research used the resources of the Supercomputing Laboratory at King Abdullah University of Science & Technology (KAUST) in Thuwal, Saudi Arabia.
References
 [1] Özdoğan Yilmaz. Seismic data analysis, volume 1. Society of Exploration Geophysicists Tulsa, 2001.
 [2] Claire Birnie, Kit Chambers, Doug Angus, and Anna Stork. Analysis and models of preinjection surface seismic array noise recorded at the aquistore carbon storage site. Geophysical Journal International, 2016.
 [3] Michael K Chase. Random noise reduction by fxy prediction filtering. Exploration Geophysics, 23(2):51–56, 1992.
 [4] Ray Abma and Jon Claerbout. Lateral prediction for noise attenuation by tx and fx techniques. Geophysics, 60(6):1887–1896, 1995.
 [5] Necati Gülünay. Noncausal spatial prediction filtering for random noise reduction on 3d poststack data. Geophysics, 65(5):1641–1653, 2000.
 [6] Guochang Liu and Xiaohong Chen. Noncausal f–x–y regularized nonstationary prediction filtering for random noise attenuation on 3d seismic data. Journal of Applied Geophysics, 93:60–66, 2013.
 [7] Gilles Hennenfent and Felix J Herrmann. Seismic denoising with nonuniformly sampled curvelets. Computing in Science & Engineering, 8(3):16–25, 2006.
 [8] Ramesh Neelamani, Anatoly I Baumstein, Dominique G Gillard, Mohamed T Hadidi, and William L Soroka. Coherent and random noise attenuation using the curvelet transform. The Leading Edge, 27(2):240–248, 2008.
 [9] Shan Lianyu, Fu Jinrong, Zhang Junhua, Zheng Xugang, and Miao Yanshu. Curvelet transform and its application in seismic data denoising. In 2009 International Conference on Information Technology and Computer Science, volume 1, pages 396–399. IEEE, 2009.
 [10] Rongfeng Zhang and Tadeusz J Ulrych. Physical wavelet frame denoising. Geophysics, 68(1):225–231, 2003.
 [11] Amine Merouane, Orhan Yilmaz, and Edip Baysal. Random noise attenuation using 2dimensional shearlet transform. In SEG Technical Program Expanded Abstracts 2015, pages 4770–4774. Society of Exploration Geophysicists, 2015.
 [12] Sergey Fomel and Yang Liu. Seislet transform and seislet frame. Geophysics, 75(3):V25–V38, 2010.
 [13] Sergey Fomel. Seismic data decomposition into spectral components using regularized nonstationary autoregressionregularized nonstationary autoregression. Geophysics, 78(6):O69–O76, 2013.
 [14] Maiza Bekara and Mirko Van der Baan. Random and coherent noise attenuation by empirical mode decomposition. Geophysics, 74(5):V89–V98, 2009.

[15]
Maïza Bekara and Mirko Van der Baan.
Local singular value decomposition for signal enhancement of seismic data.
Geophysics, 72(2):V59–V65, 2007.  [16] Xu Si, Yijun Yuan, Tinghua Si, and Shiwen Gao. Attenuation of random noise using denoising convolutional neural networks. Interpretation, 7(3):SE269–SE280, 2019.
 [17] Yuji Kim, Robert Hardisty, and Kurt J Marfurt. Seismic random noise attenuation in fx domain using complexvalued residual convolutional neural network. In SEG Technical Program Expanded Abstracts 2019, pages 2579–2583. Society of Exploration Geophysicists, 2019.
 [18] Feng Wang and Shengchang Chen. Residual learning of deep convolutional neural network for seismic random noise attenuation. IEEE Geoscience and Remote Sensing Letters, 16(8):1314–1318, 2019.
 [19] Mi Zhang, Yang Liu, and Yangkang Chen. Unsupervised seismic random noise attenuation based on deep convolutional neural network. IEEE Access, 7:179810–179822, 2019.
 [20] Sara Mandelli, Vincenzo Lipari, Paolo Bestagini, and Stefano Tubaro. Interpolation and denoising of seismic data using convolutional neural networks. arXiv preprint arXiv:1901.07927, 2019.
 [21] Chenyu Qiu, Bangyu Wu, Naihao Liu, Xu Zhu, and Haoran Ren. Deep learning prior model for unsupervised seismic data random noise attenuation. IEEE Geoscience and Remote Sensing Letters, 2021.
 [22] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala, and Timo Aila. Noise2noise: Learning image restoration without clean data. arXiv preprint arXiv:1803.04189, 2018.

[23]
Alexander Krull, TimOliver Buchholz, and Florian Jug.
Noise2voidlearning denoising from single noisy images.
In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pages 2129–2137, 2019.  [24] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. Unet: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computerassisted intervention, pages 234–241. Springer, 2015.
 [25] Paul Veeken and M Silva. Seismic inversion methods and some of their constraints. First Break, 22, 2004.
 [26] M. Ravasi and I. Vasconcelos. PyLops  A LinearOperator Python Library for scalable algebra and optimization. SoftwareX, 11:100361, 2020. doi: 10.1016/j.softx.2019.100361.
 [27] Luis L Canales et al. Random noise reduction. In 1984 SEG Annual Meeting. Society of Exploration Geophysicists, 1984.
 [28] Necati Gulunay. Fxdecon and complex wiener prediction filter. In SEG Technical Program Expanded Abstracts 1986, pages 279–281. Society of Exploration Geophysicists, 1986.
 [29] Yang Liu and Bingxiu Li. Streaming orthogonal prediction filter in the tx domain for random noise attenuation. Geophysics, 83(4):F41–F48, 2018.
 [30] Amir Beck and Marc Teboulle. A fast iterative shrinkagethresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
 [31] Kaichao Liang, Li Zhang, and Yuxiang Xing. Training a lowdose ct denoising network with only lowdose ct dataset: comparison of ddln and noise2void. In Medical Imaging 2021: Physics of Medical Imaging, volume 11595, page 115950I. International Society for Optics and Photonics, 2021.
 [32] Andrea Bordone Molini, Diego Valsesia, Giulia Fracastoro, and Enrico Magli. Speckle2void: Deep selfsupervised sar despeckling with blindspot convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 2021.
 [33] Claire Birnie, Haithem Jarraya, and Fredrik Hansteen. An introduction to distributed training of deep neural networks for segmentation tasks with large seismic datasets. arXiv preprint arXiv:2102.13003, 2021.
 [34] Coleman Broaddus, Alexander Krull, Martin Weigert, Uwe Schmidt, and Gene Myers. Removing structured noise with selfsupervised blindspot networks. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pages 159–163. IEEE, 2020.
 [35] Claire Birnie, Kit Chambers, Doug Angus, and Anna L Stork. On the importance of benchmarking algorithms under realistic noise conditions. Geophysical Journal International, 221(1):504–520, 2020.
 [36] T. Alkhalifah, H. Wang, and O. Ovcharenko. Mlreal: Bridging the gap between training on synthetic data and real data applications in machine learning. In 82nd EAGE Annual Conference & Exhibition, volume 2021, pages 1–5. European Association of Geoscientists & Engineers, 2021.
 [37] Samuli Laine, Tero Karras, Jaakko Lehtinen, and Timo Aila. Highquality selfsupervised deep image denoising. Advances in Neural Information Processing Systems, 32:6970–6980, 2019.
 [38] Joshua Batson and Loic Royer. Noise2self: Blind denoising by selfsupervision. In International Conference on Machine Learning, pages 524–533. PMLR, 2019.
Appendix: A statistical interpretation of blindspot networks
In this Appendix, we provide a statistical interpretation of the blindspot networks used in this work following a derivation similar to that of [37] and [38].
First of all, we recall the Maximum Likelihood Estimator (MLE) that is generally used as the starting point for the derivation of supervised learning training strategies:
(7) 
where and are the distributions of the input and target data, respectively. Such distributions are generally unknown, but a set of samples are available. Under the assumption that such samples are drawn independently from the underlying distributions, the MLE can expressed as:
(8) 
where the trainable parameters are obtained by maximizing the mean of the loglikelihood evaluated over all available pairs of inputs and targets.
In the context of denoising, the noisy image is expressed as the clean images corrupted by some noise with possibly known statistical properties i.e., . However, as discussed in the main text, availability of clean images is not always possible: therefore, blindspot networks assume that the unknown clean values depend on the context of neighboring (noisy) pixels denoted as . Moreover, when the noise can be assumed to be uncorrelated from pixel to pixel, the noisy image is used as target under the assumption that the network will only be able to reproduce its coherent part and fail to recreate the noise component, i.e. . In mathematical terms the MLE can be rewritten as:
(9) 
where we consider here for simplicity the training patch. Summing over all available patches leads to the loss function in equation 1.
Let’s now consider the two most commonly used statistical distributions for the noise field in seismic data and identify the corresponding MLE estimator:

White Gaussian noise: . The corresponding noisy image is distributed as
. Given the probability density function:
(10) its corresponding training loss function becomes the wellknow Mean Square Error (i.e., p=2 in equation 1):
(11) 
Laplace noise: . The corresponding noisy image is distributed as
. Given the probability density function:
(12) its corresponding training loss function becomes the wellknow Mean Absolute Error (i.e., p=1 in equation 1):
(13)
A numerical validation of the correspondence between statistical models for the additive noise and the choice of the training loss function is finally provided in Figure 13. MSE and MAE provide the best denoising performance in terms of PSNR for Gaussian and Laplace noise, respectively.
Finally, as noise in seismic data is generally correlated in time, space, or both, we observe that neither of the above defined models is correct. On the other hand, if we assume the noise to have a certain correlation length in either time and/or space, we can express the noise within the correlation window as where is the covariance matrix of the noise. To take into account such correlation we must therefore write where we have grouped the nearby correlated pixels to form the vectors x and n. The corresponding probability density function becomes:
(14) 
and the training loss can be written as:
(15) 
We observe that if the covariance matrix is unknown, neither MAE nor MSE can correctly approximate this loss function. In this case, empirical evidence in Figure 13 supports our choice of using MAE in the case of mildly correlated noise, although more sophisticated denosing models that take into account noise correlation will be investigated in the context of seismic data denoising.