Penobscot Dataset: Fostering Machine Learning Development for Seismic Interpretation

03/21/2019 ∙ by Lais Baroni, et al. ∙ IBM

We have seen in the past years the flourishing of machine and deep learning algorithms in several applications such as image classification and segmentation, object detection and recognition, among many others. This was only possible, in part, because datasets like ImageNet, with more than 14 million labeled images, were created and made publicly available, providing researchers with a common ground to compare their advances and extend the state-of-the-art. Although we have seen an increasing interest in machine learning in geosciences as well, we will only be able to achieve a significant impact in our community if we collaborate to build such a common basis. This is even more difficult when it comes to the Oil & Gas industry, in which confidentiality and commercial interests often hinder the sharing of datasets with others. In this letter, we present the Penobscot interpretation dataset, our contribution to the development of machine learning in geosciences, more specifically in seismic interpretation. The Penobscot 3D seismic dataset was acquired on the Scotian Shelf, offshore Nova Scotia, Canada. The data is publicly available and comprises pre- and post-stack data, 5 horizons and well logs of 2 wells. However, for the dataset to be of practical use for our tasks, we had to reinterpret the seismic, generating 7 horizons separating different seismic facies intervals. The interpreted horizons were used to generate more than 100,000 labeled images for inlines and crosslines. To demonstrate the utility of our dataset, results of two experiments are presented.




I Introduction

The seismic reflection method is paramount for the location of possible hydrocarbon accumulation in the subsurface. Besides providing structural imaging and high area coverage in a short time, this geophysical method, in conjunction with additional data, provides invaluable information about rock properties, fluid content and lithology.

In this context, using the knowledge provided by the seismic method, one may evaluate not only the location of possible reservoirs but also the economic viability with reasonable accuracy, therefore reducing risk and potential losses. However, the interpretation procedure of the seismic data is a human-intensive and time-consuming task, performed by geoscientists who are continually dealing with tight deadlines and the ever-increasing size of datasets [1].

In this scenario, researchers have proposed the application of computer-aided systems to assist geoscientists in several tasks involving seismic interpretation. For example, [2, 3, 4, 5] aim at the identification of specific structures on seismic images and [6, 7] propose techniques to automate part of the seismic facies analysis process.

Other domains facing similar problems are using neural networks and machine/deep learning techniques with great success to support tasks that deal with high volumes of data and are considered human-centered, for instance, image classification and segmentation [8, 9], and object detection and recognition [10, 11]. Nevertheless, these methods require training datasets with enough data to support the testing and validation of the proposed methodology. For the machine learning community, it is common practice to make these datasets publicly available. Some examples are MNIST [12] (60k images), PASCAL-VOC [13] (40k images), MS-COCO [14] (330k images), and ImageNet [15] (14 million images). These datasets allow researchers to compare their advances, extend the state-of-the-art, and find new possible applications.

Although we have seen an increasing interest in machine learning in geosciences, to the best of our knowledge, there is no public labeled dataset targeting the seismic interpretation task. To fill this gap, we propose a new, publicly available dataset. The Penobscot interpretation dataset consists of 7 horizons and more than 100,000 labeled seismic images derived from the Penobscot 3D seismic data [16], which is already in the public domain. The seismic lines were segmented into different portions based on their seismic facies. The proposed dataset has already been used in some applications, for example, the works of Chevitarese et al. [17, 18, 19], which will be discussed in Section VI.

The present paper is organized as follows: in the next section we describe the regional geology where the Penobscot survey lies. In Sections III and IV we present the Penobscot 3D seismic dataset and our interpretation procedure. Section V presents the proposed dataset and describes its main characteristics. Section VI briefly discusses two works in which the proposed dataset was employed, and we conclude our work in Section VII.

II Geological Settings

The Penobscot 3D dataset [16] was acquired in the Scotian Basin, located on the Scotian Shelf, offshore Nova Scotia, Canada (Figure 1). The basin was formed during the break-up of Pangaea (the separation of the North American and African plates) and covers an area of 300,000 km² with a maximum sediment thickness of 18 km [20]. The rifting process that took place from the Triassic to the Early Jurassic developed several sub-basins (Shelburne, Sable, Abenaki, South Whale, and Orpheus Graben) and, later, in a passive margin configuration, two plateaus (Banquereau and La Have) [21, 22].

Figure 1: Location of the Penobscot 3D survey in the Scotian Basin, offshore Nova Scotia, Canada.

During the break-up phase, which began in the Middle Triassic, the main infill of the several interconnected sub-basins was fluvial red bed sediments, along with volcanic rocks associated with the rifting process. In the Late Triassic, a shallow marine environment was established, with the development of the Eurydice Formation, primarily formed by siliciclastic and carbonate sediments. Favorable climatic conditions promoted the evaporation of the marine waters, depositing the salt layers that correspond to the Argo Formation [23]. Subsequently, during the Late Triassic and Early Jurassic, the rifting process continued until the Break-Up Unconformity and the beginning of the proto-ocean.

Following the Jurassic succession lithostratigraphy, a shallow marine environment allowed the deposition of tidally influenced dolomites with anhydrites and siliciclastics [24]. Afterwards, the Mohican Formation sediments were deposited, which comprise muds and shales derived from a fluviomarine environment. The formation is overlain by the Abenaki Formation, deposited in the Jurassic–Early Cretaceous during the spreading of the sea floor. This formation consists of mudstones and thick carbonate beds, predominantly limestones and dolomites, the latter resulting from a carbonate platform that developed along the basin margin.

In the Late Jurassic, the Mic-Mac, Verril Canyon, and Mohawk Formations were deposited. These formations primarily consist of sands interbedded with shales and intercalated with carbonates, marking the initial phase of uplift and delta progradation [23]. In the Early Cretaceous, the Scotian Basin underwent a marine regression phase, resulting in the progradation and deposition of the thick fluvio-deltaic sediments of the Mississauga Formation. Subsequently, thick shale packages with sand beds were deposited by intense deltaic sedimentation during a transgression phase, forming the Logan Canyon Formation. Still in the retrogradation phase, the Dawson Canyon Formation was deposited; it primarily consists of deep marine shales and some limestones located across the Scotian Shelf. This sedimentation is calcareous or marly at the top, becoming shaley and silty towards its base [25].

The cessation of the deltaic sedimentation during the Cretaceous allowed the deposition of the Wyandot Formation. This deposit is the most distinctive and recognized lithologic unit on the Scotian Shelf, consisting of chalky carbonates grading from pure chalk to marl [25]. Overlying the Wyandot Formation, the Banquereau Formation consists of Tertiary marine shelf mudstones, shelf sands, and conglomerates. This formation has a varying thickness, reaching more than 4 km beneath the continental slope, due to halokinesis [23].

The overview of the Scotian Basin lithostratigraphy and additional information about eustatic variation and possible hydrocarbon reservoirs are displayed in Figure 2.

Figure 2: Chronostratigraphic chart of the Scotian Basin. (Modified from [26].) (Eustatic curve from [27]).

III Seismic Data

The seismic data used for the generation of the proposed dataset is a public 3D seismic survey called Penobscot 3D [16], contributed by the Nova Scotia Department of Energy and the Canada Nova Scotia Offshore Petroleum Board, and managed by the dGB Earth Sciences Open Seismic Repository [28]. The dataset consists of 87 km² of time-migrated 3D seismic data, with 601 inlines and 482 crosslines, located offshore Nova Scotia, Canada (Figure 1).

The seismic data has a time range of 6,000 ms, with a sampling interval of 2 ms. The signal has low resolution below 3,000 ms (approximately 5 km) and follows the SEG standard polarity. The acquisition parameters are a 12.5 m × 25 m bin size (inline × crossline) with 60-fold coverage. Along with the 3D seismic data, the repository also provides 2 wells, L-30 and B-41 (with some markers and no geophysical logs), pre-stack gathers, 2D seismic data, stacking velocity, and 5 interpreted horizons.
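As a rough illustration of what these numbers imply for array sizes, the sketch below converts two-way time to a sample index and derives the cube shape. It assumes the record starts at 0 ms and includes both end samples; the function and constant names are hypothetical and not part of the dataset.

```python
# Values taken from the text; names are illustrative only.
RECORD_LENGTH_MS = 6000    # time range of the survey
SAMPLE_INTERVAL_MS = 2     # sampling interval
N_INLINES, N_CROSSLINES = 601, 482


def time_to_sample(t_ms, dt_ms=SAMPLE_INTERVAL_MS):
    """Index of the sample at two-way time t_ms (assumes 0 ms is sample 0)."""
    return int(round(t_ms / dt_ms))


# Assumption: each trace holds both the 0 ms and the 6,000 ms samples.
n_samples = time_to_sample(RECORD_LENGTH_MS) + 1
cube_shape = (N_INLINES, N_CROSSLINES, n_samples)

# Sample index where the low-resolution zone (below 3,000 ms) begins.
low_resolution_start = time_to_sample(3000)
```

Under these assumptions, roughly the lower half of every trace falls in the low-resolution zone.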

IV Seismic Interpretation

The Penobscot 3D seismic dataset was imported into OpendTect [29] and then interpreted by two geoscientists. It is important to note that although other data are available in the repository, only the 3D seismic data were used to produce the interpretation. The interpretation was performed disregarding the horizons provided by the Open Seismic Repository [16], since the intervals they delimit sometimes comprise more than one significant texture, which could hinder the performance of the machine learning algorithms.

Seven horizons were interpreted: H1, H2, H3, H4, H5, H6, and H7, numbered from the deepest to the shallowest. They divide the seismic cube into eight intervals with different pattern configurations. Figure 3 shows the 7 interpreted horizons along with two seismic lines. We emphasize that these horizons do not necessarily have a direct relationship with the geological settings. This means that the surfaces may not correspond to the top of formations or stratal interfaces. Four faults were also interpreted to assist the horizon interpretation. For the purposes of machine/deep learning, horizon surfaces were created to separate different seismic facies intervals.

Figure 3: Seven interpreted horizons shown along with two seismic lines.

The analysis of seismic facies consists of the identification of seismic reflection parameters, based primarily on configuration patterns that indicate geological factors like lithology, stratification, depositional systems, etc. [30]. In the following list we explain briefly the seismic facies of each of the horizon intervals based on the amplitude and continuity of reflectors.

  • H1: the facies unit below H1 is characterized primarily by parallel, concordant, high-amplitude reflectors. It is also possible to identify chaotic reflectors, but that may be a consequence of the decrease of seismic frequency with depth.

  • H2-H1: the facies unit is characterized by parallel to subparallel, continuous, high-amplitude reflectors. Parallel/subparallel configuration reflects the uniform deposition of fluvio-deltaic sediments of the Mississauga Formation.

  • H3-H2: facies unit containing parallel to subparallel reflectors, like the previous interval, but less continuous.

  • H4-H3: reflectors below this horizon are continuous but have low amplitude, which makes it difficult to identify them. This is expected since the sedimentary package consists of deep marine shales and limestone showing little lithological contrast.

  • H5-H4: reflectors are predominantly subparallel and present varying amplitude.

  • H6-H5: the package consists mostly of parallel, high-amplitude reflectors. A facies unit with chaotic seismic reflectors is also noticed and may be associated with marine slump deposits.

  • H7-H6: the facies unit is composed of high-amplitude reflectors. Although most of the reflectors are continuous, some are dipping and others are truncated, evidencing a high-energy environment.

V Penobscot Interpretation Dataset

The Penobscot interpretation dataset consists primarily of 7 interpreted horizons in XYZ format and 2,166 images (1,083 seismic lines in TIFF format and 1,083 labeled images in PNG format). To create the labeled images, we took the intersection between the horizon surfaces and each seismic line and labeled the pixels from 0 to 7, following each horizon interval. Figure 4 shows a pair of an inline image (cropped in the figure) and its respective labels. In this paper, we present two applications: a classification and a semantic segmentation of seismic images. For the user’s convenience, we provide the image tiles used in the classification task along with the dataset (see footnote 1).

Footnote 1: The Penobscot interpretation dataset is available at:
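The labeling rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each horizon has already been converted to a row (sample) index per trace, and that label 0 is the interval above the shallowest horizon while label 7 is the interval below the deepest one. The text does not state the ordering, and the function name `label_line` is hypothetical.

```python
import numpy as np


def label_line(horizon_rows, n_rows):
    """Label every pixel of one seismic line by its horizon interval.

    horizon_rows: (n_horizons, n_traces) row index where each horizon
                  crosses each trace.
    Returns an (n_rows, n_traces) uint8 image whose value is the number
    of horizons lying at or above the pixel (0 .. n_horizons).
    """
    horizon_rows = np.asarray(horizon_rows)
    rows = np.arange(n_rows)[None, :, None]        # (1, n_rows, 1)
    at_or_below = rows >= horizon_rows[:, None, :]  # (n_h, n_rows, n_traces)
    return at_or_below.sum(axis=0).astype(np.uint8)


# Toy line: 7 gently dipping horizons across 4 traces and 80 samples.
toy_horizons = np.arange(10, 80, 10)[:, None] + np.arange(4)[None, :]
toy_labels = label_line(toy_horizons, n_rows=80)
```

Counting the horizons above a pixel avoids looping over intervals and works for any number of non-crossing horizons.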

To produce the classification dataset, we break the seismic images into tiles of 40×40 pixels. Each tile has the majority of its area belonging to only one class [31]. In the provided dataset, we allow 30% of interference from other classes, as discussed in [17, 31]. The entire process of creating tiles from a seismic image comprises the following steps:

V-1 Split image files into training and test sets

All tiles from a single image belong to either the training or the test set. This approach makes the training and test sets more distinct [17, 31].

V-2 Shuffle the selected images

Although there are many alternatives [17, 31] to split training and test data, we decided to simply shuffle the images, which gives a good balance between randomness and simplicity.

V-3 Process the images, removing extreme amplitudes and re-scaling values between 0 and 255

By doing this, we reduce the search space and make the data smaller.

V-4 Generate tiles from the processed images

Each tile is associated with its respective label. Notice that labels 2 and 3 were merged during tile generation, following the discussion in [31, 17].

V-5 Balance the training and test datasets

Although seismic images are imbalanced regarding the areas of each seismic facies unit, machine learning methods usually rely on a uniform class distribution to optimize their parameters. By balancing the classes, we make the training process simpler, allowing us, for example, to use well-known metrics such as accuracy and precision.
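The first four steps can be sketched with NumPy as below. This is an illustration under stated assumptions, not the published pipeline: the percentile clipping, the test fraction, the non-overlapping tiling, and the renumbering of labels after merging 2 and 3 are all choices made here for the example, and every function name is hypothetical.

```python
import numpy as np


def split_lines(n_lines, test_fraction=0.25, seed=0):
    """Shuffle whole lines and split them, so no line feeds both sets."""
    order = np.random.default_rng(seed).permutation(n_lines)
    n_test = int(round(n_lines * test_fraction))
    return order[n_test:], order[:n_test]  # (train ids, test ids)


def rescale_amplitudes(seismic, low_pct=1.0, high_pct=99.0):
    """Clip extreme amplitudes at two percentiles, then map to 0..255."""
    lo, hi = np.percentile(seismic, [low_pct, high_pct])
    if hi <= lo:  # constant image: nothing to stretch
        return np.zeros(seismic.shape, dtype=np.uint8)
    clipped = np.clip(seismic, lo, hi)
    return np.round((clipped - lo) / (hi - lo) * 255.0).astype(np.uint8)


def merge_labels_2_and_3(labels):
    """Merge label 3 into 2 and shift higher labels down (assumed scheme)."""
    labels = np.asarray(labels).astype(np.int64)
    return np.where(labels >= 3, labels - 1, labels)


def make_tiles(image, labels, tile=40, max_interference=0.30):
    """Cut non-overlapping tiles and keep those dominated by one class."""
    tiles, classes = [], []
    for r in range(0, image.shape[0] - tile + 1, tile):
        for c in range(0, image.shape[1] - tile + 1, tile):
            patch = labels[r:r + tile, c:c + tile]
            counts = np.bincount(patch.ravel())
            dominant = int(counts.argmax())
            # Keep the tile if other classes cover at most 30% of it.
            if counts[dominant] / patch.size >= 1.0 - max_interference:
                tiles.append(image[r:r + tile, c:c + tile])
                classes.append(dominant)
    return np.array(tiles), np.array(classes)
```

Splitting by line before tiling is what keeps tiles of one image from appearing in both sets; tiling first and splitting afterwards would not give that guarantee.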

The provided classification dataset comprises 6,124 crossline and 4,706 inline seismic tiles per class along with their respective labels. Table I describes the files in the dataset.
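One simple way to obtain equal tile counts per class is random undersampling to the size of the smallest class, sketched below. The text does not say which balancing strategy was used, so this is an assumption, and `balance_by_undersampling` is a hypothetical name. The last lines check that the per-class counts quoted above, taken over the 7 classes left after merging labels 2 and 3, add up to the number of training tiles listed in Table I.

```python
import numpy as np


def balance_by_undersampling(classes, seed=0):
    """Return sorted tile indices keeping the same count for every class."""
    classes = np.asarray(classes)
    rng = np.random.default_rng(seed)
    values, counts = np.unique(classes, return_counts=True)
    per_class = counts.min()
    keep = [
        rng.choice(np.flatnonzero(classes == v), size=per_class, replace=False)
        for v in values
    ]
    return np.sort(np.concatenate(keep))


# Consistency check on the figures quoted in the text.
tiles_per_class = 6124 + 4706        # crossline + inline tiles per class
n_classes = 7                        # labels 0..7 with 2 and 3 merged
total_tiles = tiles_per_class * n_classes
```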

Figure 4: Example of a cropped inline and its respective labels.

File                      Format   # Files   Total size (MB)
H1-H7                     XYZ      7         87.5
Seismic inlines           TIF      601       1,700
Seismic crosslines        TIF      481       1,700
Labeled inlines           PNG      601       4.9
Labeled crosslines        PNG      481       3.9
Seismic tiles (train)     PNG      75,810    116
Seismic labels (train)    JSON     2         1.5
Seismic tiles (test)      PNG      28,000    116
Seismic labels (test)     JSON     2         0.5

Table I: Details of the Penobscot interpretation dataset

VI Experiments

In this section, we present two applications that demonstrate the utility of the Penobscot dataset for seismic classification and segmentation using deep learning.

VI-A Seismic Facies Classification

The first application is the classification of seismic facies presented in [31] and [17]. The authors successfully trained deep neural networks to discriminate the seismic facies present in the Penobscot dataset. They assume that one may distinguish different classes (facies) by their textural features as discussed in [5, 32].

For the presented classification task, it was necessary to break the seismic images into smaller parts (tiles), so that the majority of a tile’s area belongs to only one class. Tiles are the input of the deep neural network that classifies each tile as one of the possible classes.

The authors tested multiple tile sizes, numbers of examples per class, and interference percentages, among many other parameters. The results in [31] show that one can train a neural network in 4 minutes using 25 inline slices and still obtain 89% accuracy. Moreover, they reached up to 97% accuracy in 30 minutes using 276 inline slices. Figure 5 shows the impact of the number of slices on the final classification accuracy.

Figure 5: Accuracy curve for different number of slices (1, 3, 5, 9, 13, 25, 50, 100, 200, and 276) – plot extracted from [31].

VI-B Seismic Facies Segmentation

The second application in which the Penobscot interpretation dataset was used is the semantic segmentation of seismic facies. In [18], the authors trained a deep neural network for the classification task. Then, they modified the final part of the model to produce pixel-wise predictions. Finally, they trained the resulting model using the Penobscot interpretation dataset.

For the segmentation task, the authors also divided the input seismic images into tiles. However, the tiles are larger than the ones used for the classification task, since they need to comprise more than one class. Next, they applied the network throughout the image to generate the final prediction. By doing this, they achieved an intersection over union (IoU) of more than 97%. Figure 6 shows that the model produced masks very close to the actual interpretation, with very little discontinuity. Notice that the authors in [18] joined classes 2 and 3, and 4 and 5.
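For reference, the IoU of a class is the number of pixels where prediction and interpretation agree on that class, divided by the number of pixels where either assigns it. The sketch below averages it over classes; whether [18] reports a per-class mean or another aggregate is not stated here, so the averaging is an assumption. The lookup table shows one possible renumbering after joining classes 2 and 3, and 4 and 5; it is also an assumption, and all names are hypothetical.

```python
import numpy as np

# Assumed renumbering: original labels 0..7 map to 6 merged classes.
MERGE_LUT = np.array([0, 1, 2, 2, 3, 3, 4, 5])


def mean_iou(pred, target, n_classes):
    """Mean intersection over union across classes present in either mask."""
    pred, target = np.asarray(pred), np.asarray(target)
    ious = []
    for k in range(n_classes):
        p, t = pred == k, target == k
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both masks
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")


# Toy masks: one pixel of class 1 is predicted as class 0.
toy_target = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])
toy_pred = np.array([[0, 0, 0, 1], [0, 0, 1, 1]])
toy_score = mean_iou(toy_pred, toy_target, n_classes=2)
```

In the toy case a single wrong pixel out of eight lowers both per-class scores, which is why IoU is stricter than plain pixel accuracy.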

Figure 6: Semantic segmentation of inlines 1200, 1520, 1090 and 1460 (left to right) from Penobscot dataset at the pixel level. In the output, each pixel receives an overlaid color representing a class. The white lines represent the seismic horizons.

VII Conclusions

We argued in this letter that the expansion of machine learning in other fields was only possible, in part, because of the number of public datasets that have been made available in the past years. The Penobscot interpretation dataset is our contribution to foster the development of machine learning in seismic interpretation, a field which, in our view, has attracted increasing interest but still needs to build this common basis. With our dataset, we provide geoscientists, and machine learning practitioners working in the field, with more than 100,000 labeled images that can be used to develop their methods and compare their results in an easier and faster way.

In the experiments presented, the authors were able to successfully apply state-of-the-art deep learning techniques to the proposed dataset and reach high-accuracy results for seismic facies classification and segmentation. However, these are only two among many possible applications that could benefit from the dataset, such as clustering, retrieval, and transfer learning. As future work, we intend to build a similar dataset for another public seismic dataset, Netherlands F3. These two datasets together will provide valuable data for developing and testing machine learning methods for seismic interpretation.


  • [1] T. Randen, E. Monsen, C. Signer, A. Abrahamsen, J. O. Hansen, T. Sæter, and J. Schlaf, “Three-dimensional texture attributes for seismic data analysis,” in SEG Technical Program Expanded Abstracts 2000, pp. 668–671, SEG, 2000.
  • [2] P. Guillen, G. Larrazabal, G. González, D. Boumber, and R. Vilalta, “Supervised learning to detect salt body,” in SEG Technical Program Expanded Abstracts 2015, pp. 1826–1829, SEG, 2015.
  • [3] C. Zhang, C. Frogner, M. Araya-Polo, and D. Hohl, “Machine-learning based automated fault detection in seismic traces,” in 76th EAGE ACE, 2014.
  • [4] D. Gao, “Latest developments in seismic texture analysis for subsurface structure, facies, and reservoir characterization: A review,” Geophysics, vol. 76, no. 2, pp. W1–W13, 2011.
  • [5] A. B. Mattos, R. S. Ferreira, R. M. D. G. Silva, M. Riva, and E. V. Brazil, “Assessing texture descriptors for seismic image retrieval,” in 2017 30th SIBGRAPI, pp. 292–299, Oct 2017.
  • [6] B. P. West, S. R. May, J. E. Eastwood, and C. Rossen, “Interactive seismic facies classification using textural attributes and neural networks,” The Leading Edge, vol. 21, no. 10, pp. 1042–1049, 2002.
  • [7] C. Song, Z. Liu, H. Cai, Y. Wang, X. Li, and G. Hu, “Unsupervised seismic facies analysis with spatial constraints using regularized fuzzy c-means,” Journal of Geophysics and Engineering, vol. 14, no. 6, p. 1535, 2017.
  • [8] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation,” nov 2015.
  • [9] E. Shelhamer, J. Long, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,”
  • [10] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in CVPR, June 2016.
  • [11] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in Advances in NIPS 28 (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 91–99, Curran Associates, Inc., 2015.
  • [12] Y. LeCun, C. Cortes, and C. J. Burges, “MNIST handwritten digit database,” AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, 2010.
  • [13] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” IJCV, vol. 88, no. 2, pp. 303–338, 2010.
  • [14] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, pp. 740–755, Springer, 2014.
  • [15] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, pp. 248–255, IEEE, 2009.
  • [16] O. S. R. dGB Earth Sciences, “Penobscot 3d - survey,” 2017.
  • [17] D. S. Chevitarese, D. Szwarcman, R. M. G. e Silva, and E. V. Brazil, “Deep learning applied to seismic facies classification: a methodology for training,” in EAGE Saint Petersburg, 2018.
  • [18] D. Chevitarese, D. Szwarcman, R. M. D. Silva, and E. V. Brazil, “Seismic facies segmentation using deep learning,” in AAPG ACE 2018.
  • [19] D. Chevitarese, D. Szwarcman, R. M. D. Silva, and E. V. Brazil, “Transfer learning applied to seismic images classification,” in AAPG ACE 2018.
  • [20] D. M. Hansen, J. W. Shimeld, M. A. Williamson, and H. Lykke-Andersen, “Development of a major polygonal fault system in upper cretaceous chalk and cenozoic mudrocks of the sable subbasin, canadian atlantic margin,” Marine and Petroleum Geology, vol. 21, no. 9, pp. 1205–1219, 2004.
  • [21] M. Albertz, C. Beaumont, J. W. Shimeld, S. J. Ings, and S. Gradmann, “An investigation of salt tectonic structural styles in the scotian basin, offshore atlantic canada: 1. comparison of observations with geometrically simple numerical models,” Tectonics, vol. 29, no. 4, 2010.
  • [22] Y. A. Kettanah, “Hydrocarbon fluid inclusions in the argo salt, offshore canadian atlantic margin,” Canadian Journal of Earth Sciences, vol. 50, no. 6, pp. 607–635, 2013.
  • [23] C.-N. S. O. P. Board, Technical Summaries of Scotian Shelf Significant and Commercial Discoveries. Canada-Nova Scotia Offshore Petroleum Board, 1997.
  • [24] F. Qayyum, O. Catuneanu, and C. E. Bouanga, “Sequence stratigraphy of a mixed siliciclastic-carbonate setting, scotian shelf, canada,” Interpretation, vol. 3, no. 2, pp. SN21–SN37, 2015.
  • [25] A. Mandal and E. Srivastava, “Enhanced structural interpretation from 3d seismic data using hybrid attributes: New insights into fault visualization and displacement in cretaceous formations of the scotian basin, offshore nova scotia,” Marine and Petroleum Geology, vol. 89, pp. 464–478, 2018.
  • [26] C.-N. S. O. P. Board, “Regional geology overview,” 2017.
  • [27] B. U. Haq, J. Hardenbol, and P. R. Vail, “Mesozoic and cenozoic chronostratigraphy and cycles of sea-level change,” 1988.
  • [28] dGB Earth Sciences, “Open seismic repository,” 2018.
  • [29] dGB Earth Sciences, “Seismic interpretation software & services,” 2018.
  • [30] L. F. Brown Jr, “Seismic stratigraphic interpretation and petroleum exploration,” Course Notes AAPG, no. 16, p. 181p, 1980.
  • [31] D. S. Chevitarese, D. Szwarcman, E. V. Brazil, and B. Zadrozny, “Efficient classification of seismic textures,” in 2018 IJCNN, pp. 2984–2991, July 2018.
  • [32] S. Chopra and V. Alexeev, “Applications of texture attribute analysis to 3d seismic data,” The Leading Edge, vol. 25, no. 8, pp. 934–940, 2006.