OpenFWI: Benchmark Seismic Datasets for Machine Learning-Based Full Waveform Inversion

11/04/2021
by Chengyuan Deng, et al.
Los Alamos National Laboratory

We present OpenFWI, a collection of large-scale open-source benchmark datasets for seismic full waveform inversion (FWI). OpenFWI is the first of its kind in the geoscience and machine learning community, designed to facilitate diversified, rigorous, and reproducible research on machine learning-based FWI. OpenFWI includes datasets of multiple scales, encompasses diverse domains, and covers various levels of model complexity. Along with the datasets, we also report an empirical study on each dataset with a fully convolutional deep learning model. OpenFWI is meticulously maintained and will be regularly updated with new data and experimental results. We welcome input from the community to help us further improve OpenFWI. In the current version, we publish seven datasets, of which one is designed for 3D FWI and the rest are for 2D scenarios. All datasets and related information can be accessed through our website at https://openfwi.github.io/.


1 Introduction

The subsurface geophysical properties (velocity, density, bulk modulus, etc.) are critical to a myriad of subsurface applications, such as carbon sequestration, earthquake detection and early warning, etc. [Virieux-2009-Overview]. These important properties can be inferred by solving seismic full waveform inversion (FWI), illustrated in Figure 2. With the goal of minimizing the residual between predicted and observed seismic signals, FWI falls into the family of non-convex optimization problems with PDE constraints. Such problems have been intensively studied in the paradigm of physics-driven approaches [fichtner2010full, zhang2012wave, ma2012image, zhang2013double, feng2019transmission+, feng2021mpi]. Well-known complications of these approaches include high computational cost, cycle-skipping, and ill-posedness. Despite substantial efforts to mitigate these problems by imposing various regularization terms [lin2014acoustic, lin2015quantifying, hu2009simultaneous, guitton2012blocky, chen2020multiscale], expensive computation remains unavoidable in the physics-driven framework.

With the advance of deep learning techniques, researchers have been actively exploring end-to-end solutions for complicated scientific computational imaging problems [Deep-2021-Adler, willard2020integrating, zhu2019applications, mehta2019high, yang2019deep]. These methods are usually regarded as data-driven approaches, since they depend heavily on data in terms of quantity, diversity, authenticity, and so on. In recent years, promising results have been demonstrated for data-driven FWI. The early attempt by [araya2018deep] introduced an 8-layer neural network model to obtain velocity maps from shot gathers. Inspired by the success of convolutional neural networks in computer vision, numerous works based on encoder-decoder fully convolutional networks have been proposed [yang2019deep, wu2019inversionnet, wang2018velocity, feng2021multiscale]. These models focus on 2D solutions. Very recently, the first solution for 3D imaging was demonstrated by [zeng2021inversionnet3d], which employs group convolution in the encoder and invertible layers in the decoder for high efficiency and scalability. All the aforementioned models are supervised, meaning that both velocity maps and seismic data are required for training. In practice, however, it is unrealistic to obtain such a large volume of labeled data in advance. To alleviate the heavy reliance on labels, [Jin-2021-Unsupervised] recently developed an unsupervised inversion model incorporating both the InversionNet model and the governing wave equation. For a thorough review of data-driven FWI, please refer to the survey [adler2021deep].

Despite this celebrated progress, we make a crucial observation: all experiments are conducted on individual datasets that are rarely published! This immediately has several detrimental implications. (1) Unified evaluation: how do we confirm the empirical superiority of one machine learning solution over another? The experimental protocol has to be identical, which is not possible when the data are not published. A unified evaluation also gives our community a global perspective on what has been achieved and what lies ahead. (2) Further improvement: what if motivated researchers would like to further improve a proposed algorithm but have no data to test with? Denying access to data only strangles valuable ideas in the cradle. (3) Reproducibility and integrity: although it sounds like a pathological scenario, people may remain skeptical about the integrity of such a blossoming of results, as implementing deep neural networks becomes increasingly routine. We also remark that the lack of open-source data is understandable given the difficulty of data acquisition: one must either collect data from real-world field studies, which requires arduous human labor, or synthesize velocity maps and then generate seismic data by forward modeling. Because of these concerns and conditions, we believe our community has arrived at the right junction to establish an open-source data and benchmark platform.

Figure 2: Illustration of Seismic Full Waveform Inversion (FWI) and Forward Modelling.

We present OpenFWI, the first collection of large-scale seismic FWI datasets. OpenFWI is dedicated to facilitating diversified, rigorous, and reproducible research on data-driven FWI. Seven benchmark datasets will be released in OpenFWI (more are pending approval by Los Alamos National Laboratory and the U.S. Department of Energy and will be released chronologically upon approval), together with baseline experimental results from InversionNet [wu2019inversionnet], an encoder-decoder-based deep learning solution for FWI. The benchmark datasets in OpenFWI demonstrate the following favorable characteristics:

  1. Multi-scale: OpenFWI covers multiple scales of data samples. The smallest dataset has around 20K data samples and fits into the memory of a single GPU, which supports training without massive computational resources. The larger datasets contain about 70K data samples and are usually trained in a distributed setting, further expediting the development of scalable algorithms for deep learning-driven FWI.

  2. Diverse Domains: OpenFWI supports both 2D and 3D scenarios of FWI. The 3D Kimberlina dataset, synthesized very recently, is the first of its kind. The 2D datasets include velocity models that are representative of realistic subsurface applications. The Kimberlina CO2 datasets are accompanied by timestamps and can be used for time-lapse imaging.

  3. Various levels of model complexity: Depending on the subsurface modeling, OpenFWI encompasses a wide range of model complexity. We provide an estimate of the learning difficulty of each dataset based on its synthesis process and our training experience, so researchers can start with simple datasets and then move on to harder ones.

The rest of the paper is organized as follows: Section 2 introduces the background of the governing equation and the method used for data synthesis. Section 3 provides a detailed description of each dataset with illustrations. Section 4 presents experimental results as baselines for each dataset. Section 5 concludes the paper and discusses future work.

2 Background

2.1 Acoustic Forward Modelling

The governing equation of acoustic forward modeling is the wave equation

\nabla \cdot \left( \frac{1}{\rho(\mathbf{r})} \nabla p(\mathbf{r}, t) \right) - \frac{1}{K(\mathbf{r})} \frac{\partial^2 p(\mathbf{r}, t)}{\partial t^2} = s(\mathbf{r}, t),    (1)

where \mathbf{r} represents the spatial location in Cartesian coordinates, t denotes time, K(\mathbf{r}) and \rho(\mathbf{r}) are the material bulk modulus and density, respectively, p(\mathbf{r}, t) represents the pressure wavefield, and s(\mathbf{r}, t) is the source term that specifies the location and time history of the source. The compressional (P-) velocity can then be represented as v_p = \sqrt{K / \rho}.

Forward wave propagation modeling entails solving Equation 1 given the boundary conditions and the source, as well as the subsurface geophysical parameters, including the velocity v_p and density \rho. For simplicity, we can denote the forward modeling problem as

\mathbf{d} = f(\mathbf{m}),    (2)

where f represents the highly nonlinear forward mapping, \mathbf{d} denotes the resulting seismic data, and \mathbf{m} is the subsurface geophysical parameter vector.

In the simulations for the following datasets, we use the finite-difference method [moczo2007finite] with 2nd-order accuracy in time and 8th-order accuracy in space. Absorbing boundary conditions [engquist1977absorbing] are applied to all boundaries.
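To make the forward-modeling pipeline concrete, the following is a minimal sketch of 2D acoustic finite-difference propagation in Python. It is deliberately simplified relative to our simulator: constant density, 2nd-order accuracy in space instead of 8th, and no absorbing boundaries; all function and parameter names are our own.

```python
import numpy as np

def ricker(f0, nt, dt):
    """Ricker source wavelet with central frequency f0 (Hz)."""
    t = np.arange(nt) * dt - 1.0 / f0
    a = (np.pi * f0 * t) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

def forward_model(vel, dt, dx, nt, src_iz, src_ix, f0=15.0):
    """Propagate one point source through a 2D velocity map vel (m/s)
    and record the pressure wavefield at every surface grid point."""
    nz, nx = vel.shape
    src = ricker(f0, nt, dt)
    p_prev = np.zeros((nz, nx))
    p_curr = np.zeros((nz, nx))
    record = np.zeros((nt, nx))                # receivers on the surface row
    c2 = (vel * dt / dx) ** 2                  # squared Courant factor
    for it in range(nt):
        lap = np.zeros((nz, nx))               # 2nd-order 5-point Laplacian
        lap[1:-1, 1:-1] = (p_curr[2:, 1:-1] + p_curr[:-2, 1:-1] +
                           p_curr[1:-1, 2:] + p_curr[1:-1, :-2] -
                           4.0 * p_curr[1:-1, 1:-1])
        p_next = 2.0 * p_curr - p_prev + c2 * lap
        p_next[src_iz, src_ix] += src[it]      # inject the source term s
        p_prev, p_curr = p_curr, p_next
        record[it] = p_curr[0]                 # sample all surface receivers
    return record

# Example: a 70 x 70 two-layer map on a 15 m grid, recorded for 1 s at 1 ms.
vel = np.full((70, 70), 3000.0)
vel[35:, :] = 4500.0
gather = forward_model(vel, dt=1e-3, dx=15.0, nt=1000, src_iz=0, src_ix=35)
print(gather.shape)   # (1000, 70)
```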

3 OpenFWI Datasets

3.1 The FlatVEL & CurvedVEL Dataset

3.1.1 Data Description

FlatVEL and CurvedVEL are two large-scale geophysical datasets, each of which consists of 50K pairs of seismic data and velocity maps. FlatVEL contains velocity maps with flat layers, and CurvedVEL contains velocity maps with curved layers. In each dataset, the seismic data and velocity maps are split 45K/5K for training and testing, respectively. Samples are shown in Figure 3 and Figure 4.

The velocity maps in both FlatVEL and CurvedVEL are 100 × 150 grid points in size, with a grid spacing of 15 meters in both the x and z directions. Each velocity map contains 2 to 5 layers, and the thickness of each layer ranges from 5 to 80 grids. The layers in CurvedVEL follow a sine function. Compared with the flat layers in FlatVEL, the curved velocity maps yield much more irregular geological structures, making inversion more challenging. The velocity in each layer is randomly sampled from a uniform distribution between 1.5 km/s and 3.5 km/s. The velocity is designed to increase with depth to be more physically realistic. We also add geological faults to every velocity map. The faults shift from 0 to 80 grids, and the tilting angle ranges from 25° to 165°.

To synthesize the seismic data, three sources are placed on the surface at 0.975 km, 1.125 km, and 1.275 km. The seismic traces are recorded by 150 receivers positioned at each grid point with an interval of 15 meters. The source is a Ricker wavelet [wang2015frequencies] with a central frequency of 15 Hz. Each receiver records time-series data for 2 seconds, and we use a 1 millisecond sampling rate to generate 2,000 timesteps. Therefore, the dimensions of the seismic data are 3 × 2000 × 150.

Figure 3: Examples of velocity maps and their corresponding seismic measurements in the FlatVEL dataset.
Figure 4: Examples of velocity maps and their corresponding seismic measurements in the CurvedVEL dataset.

3.1.2 Data Generation

The synthetic FlatVEL and CurvedVEL datasets are generated by iteratively deforming randomly generated flat subsurface structures. There are four major steps to generate a new subsurface sample (a simplified code sketch follows the list):

  1. Generate a random flat subsurface structure by randomly setting each layer's thickness and velocity. The layer velocities are sorted in descending order from the bottom to the top.

  2. Deform the flat image with a sine function, setting its period and magnitude randomly to mimic the extrusion effect.

  3. Divide the image into two parts with a randomly generated line and shift one of the parts to mimic a fault, making sure it does not intersect any fault generated earlier in the current iteration.

  4. Discard the unshifted part and go back to step 2 to fill the empty region, repeating until the whole image is filled.
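Below is a minimal sketch of the first two steps (flat layers plus a sine deformation); the iterative fault insertion of steps 3 and 4 is omitted, and all function names and parameter ranges beyond those stated above are our own assumptions.

```python
import numpy as np

def make_layered_map(nz=100, nx=150, rng=np.random.default_rng(0)):
    """Build one velocity map (km/s): flat layers, then a sine deformation."""
    n_layers = rng.integers(2, 6)                          # 2 to 5 layers
    depths = np.sort(rng.integers(5, nz - 5, size=n_layers - 1))
    vels = np.sort(rng.uniform(1.5, 3.5, size=n_layers))   # increase with depth
    vel = np.full((nz, nx), vels[0])
    for d, v in zip(depths, vels[1:]):
        vel[d:, :] = v                                     # flat layer below depth d
    # Step 2: shift every column vertically along a sine curve.
    period = rng.uniform(0.5, 2.0)                         # cycles across the map
    amp = rng.integers(0, 15)                              # shift magnitude in grids
    shift = (amp * np.sin(2 * np.pi * period * np.arange(nx) / nx)).astype(int)
    curved = np.empty_like(vel)
    for ix in range(nx):
        curved[:, ix] = np.roll(vel[:, ix], shift[ix])
    return curved

vel_map = make_layered_map()
print(vel_map.shape)   # (100, 150)
```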

3.1.3 File Format and naming

Format: All samples of the FlatVEL and CurvedVEL datasets are stored in .npz format. Velocity maps and seismic data are stored in separate sub-directories of each dataset. Each file contains a single NumPy array holding one sample. The shapes of the arrays in seismic data files and velocity map files are (1, 3, 150, 100) and (1, 150, 100), respectively. It is worth mentioning that .npz is simply a compressed version of .npy; we converted the data only to reduce storage consumption.

Naming: The naming of files follows the format {data}_{n} for seismic data and {model}_{n} for velocity maps, where n denotes the index of a file (starting from 0). Notice that for the same n, the data and model files form a pair. Here are some examples, followed by a short loading sketch:

  • data_1.npz is a file containing seismic data; it holds the sample with index 1.

  • model_99.npz is the file containing the velocity map with index 99, which pairs with data_99.npz.
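As a usage sketch (the directory layout and the stored array key are assumptions; only the file naming above comes from the dataset description), one sample pair can be loaded with NumPy as follows:

```python
import numpy as np

# Load one seismic/velocity pair from FlatVEL (paths are hypothetical).
seis_file = np.load("FlatVEL/data/data_1.npz")
vel_file = np.load("FlatVEL/model/model_1.npz")

# An .npz archive behaves like a dictionary; take the first stored array.
seismic = seis_file[seis_file.files[0]]
velocity = vel_file[vel_file.files[0]]
print(seismic.shape, velocity.shape)
```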

3.2 The Kimberlina-CO2 Dataset

3.2.1 Data Description

The Kimberlina dataset contains 991 CO2 leakage scenarios, each simulated over a duration of 200 years, with 20 leakage velocity maps provided (i.e., one every 10 years) for each scenario [jordan2017characterizing]. Excluding missing values, it has 19,430 pairs of seismic data and velocity maps in total. The data are split into 15K pairs for training and 4,430 for testing. Samples are shown in Figure 5.

Figure 5: Examples of velocity maps and their corresponding seismic measurements in the Kimberlina-CO2 dataset.

The velocity maps in the Kimberlina-CO2 dataset are 401 × 141 grid points in size, with a grid spacing of 10 meters in both the x and z directions. To synthesize the seismic data, 9 sources are evenly distributed along the top of the model at a depth of 5 m. The seismic traces are recorded by 141 receivers positioned at each grid point with an interval of 10 meters. The source is a Ricker wavelet with a central frequency of 25 Hz. Each receiver records time-series data for 2 seconds with a 0.5 ms sampling rate to generate 2,000 timesteps, and the seismic data are then downsampled to 1251 × 101 to save memory and computation cost for data-driven FWI. Therefore, the dimensions of the seismic data become 9 × 1251 × 101.

3.2.2 File Format and naming

All samples of the Kimberlina-CO2 dataset are stored in .npz format. Velocity maps and seismic data are stored in separate sub-directories of the dataset. Each file contains a single NumPy array holding one sample. The shapes of the arrays in seismic data files and velocity map files are (9, 1251, 101) and (401, 141), respectively. It is worth mentioning that .npz is simply a compressed version of .npy; we converted the data only to reduce storage consumption.

Naming: The naming of files follows the format {data}_sim{n}_t{m} for seismic data and {label}_sim{n}_t{m} for velocity maps, where n denotes the 4-digit index of a simulation (starting from 0), and m represents the simulated time in years, from 0 to 190 (every 10 years). Notice that for the same n and m, the data and label files form a pair. Here are some examples:

  • data_sim0000_t0.npz is the file containing the seismic data of the first simulation at year 0.

  • label_sim0991_t160.npz is the file containing the velocity map of scenario 0991 at year 160.

3.3 The Style-transfer Dataset

3.3.1 Data Description

The Style-transfer dataset is a large-scale geophysical dataset built with a style-transfer method. It contains seismic data for two types of velocity maps: high-resolution maps and low-resolution maps. Each type has 67K pairs of seismic data and velocity maps. The data are split into 65K pairs for training and 2K for testing. Samples are shown in Figure 6.

The COCO dataset [lin2014microsoft] provides the content images, and the Marmousi model is used as the style image. An image transfer network is trained to transfer the COCO images into velocity perturbations, and a 1D velocity model is then added to the perturbations to construct the velocity maps. The velocity maps are smoothed by a Gaussian filter with a random standard deviation between 0 and 5 to build the high-resolution velocity maps, and between 5 and 10 to build the low-resolution velocity maps. This dataset can be applied to both data-driven FWI and traveltime inversion; more details about the velocity building and the applications of this dataset can be found in [feng2020physically] and [feng2021multiscale].

The velocity maps in the Style-transfer dataset are 200 × 200 grid points in size, with a grid spacing of 5 meters in both the x and z directions. To synthesize the seismic data, 10 sources are evenly distributed on the surface with a spacing of 100 m. The seismic traces are recorded by 200 receivers positioned at each grid point with an interval of 5 meters. The source is a Ricker wavelet with a central frequency of 15 Hz. Each receiver records time-series data for 2 seconds with a 0.0002 s sampling rate to generate 5,000 timesteps, and the seismic data are then downsampled from 5000 × 200 to 200 × 200 to save memory and computation cost for data-driven FWI. Therefore, the dimensions of the seismic data become 10 × 200 × 200.

3.3.2 File Format and naming

All samples of the Style-transfer dataset are stored in .npz format. Velocity maps and seismic data are stored in separate sub-directories of the dataset, with the low-resolution velocity maps and their seismic data kept apart from the high-resolution velocity maps and their seismic data. It is worth mentioning that .npz is simply a compressed version of .npy; we converted the data only to reduce storage consumption.

Naming: The naming of files follows the format {data}_{n} for seismic data and {model}_{n} for velocity maps, where n denotes the index of a file (starting from 0). Notice that for the same n, the data and model files form a pair. Here are some examples:

  • data_1.npz is a file containing seismic data; it holds the sample with index 1.

  • model_99.npz is the file containing the velocity map with index 99, which pairs with data_99.npz.

Figure 6: Examples of velocity maps and their corresponding seismic measurements in the Style-transfer dataset. The top three rows are samples of low-resolution maps, and the bottom three rows are samples of high-resolution maps.

3.4 The FlatFault & CurvedFault Dataset

3.4.1 Data Description

FlatFault and CurvedFault are two large-scale geophysics datasets for FWI, each of which consists of 54K seismic data samples: 30K with paired velocity maps (labeled) and 24K unlabeled. Velocity maps in FlatFault and CurvedFault contain flat layers and curved layers, respectively. The 30K labeled pairs of seismic data and velocity maps are split into a training set (24K), a validation set (3K), and a testing set (3K). Samples are shown in Figure 7.

The CurvedFault dataset aims to better validate the effectiveness of FWI methods on curved topography, where the shape of the curves follows a sine function. All velocity maps in FlatFault and CurvedFault contain 2 to 4 layers, and the velocity in each layer is uniformly sampled between 3,000 m/s and 6,000 m/s. The dimensions of each velocity map are 70 × 70 grid points, and the grid spacing is 15 meters in both directions. The layer thickness ranges from 15 to 35 grids. The velocity is designed to increase with depth to be more physically realistic. Geological faults are also added to every velocity map. The faults shift from 10 to 20 grids, and the tilting angle ranges from -123° to 123°.

Five Ricker wavelet [wang2015frequencies] sources with a central frequency of 25 Hz are used for seismic data synthesis; they are evenly distributed on the surface with a spacing of 255 meters. The seismic traces are recorded by 70 receivers positioned at each grid point with an interval of 15 meters. Each receiver records time-series data for 1 second, and we use a 1 millisecond sampling rate to generate 1,000 timesteps. Therefore, the dimensions of the seismic data become 5 × 1000 × 70.

3.4.2 File Format and naming

Format: All samples of the FlatFault and CurvedFault datasets are stored in .npy format. Velocity maps and seismic data are stored in separate files. Each file contains a single NumPy array of 500 samples. The shapes of the arrays in velocity map files and seismic data files are (500, 1, 70, 70) and (500, 5, 1000, 70), respectively.

Naming: The naming of files can be described as {vel|seis}_{n}_1_{i}.npy, where vel and seis specify whether a file stores velocity maps or seismic data, n denotes the number of layers in the (corresponding) velocity maps, and i is the index of a file (starting from 0) among the ones with the same n. Here are several examples, followed by a short loading sketch:


  • vel_2_1_3.npy is the file that contains velocity maps with two layers, and it is the fourth file among all the files with two-layer velocity maps.

  • vel_4_1_0.npy is the file that contains velocity maps with four layers, and it is the first file among all the files with four-layer velocity maps.

  • seis_4_1_0.npy is the file that contains the seismic data corresponding to the velocity maps in vel_4_1_0.npy.
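As a usage sketch (the directory layout is an assumption; the file names and array shapes come from the description above), one batch of samples can be loaded and paired as follows:

```python
import numpy as np

# Load one batch of 500 four-layer samples (directory layout is hypothetical).
velocity = np.load("FlatFault/vel_4_1_0.npy")    # shape (500, 1, 70, 70)
seismic = np.load("FlatFault/seis_4_1_0.npy")    # shape (500, 5, 1000, 70)

# Sample i in the seismic array pairs with sample i in the velocity array.
vel_map, shot_gathers = velocity[0, 0], seismic[0]
print(vel_map.shape, shot_gathers.shape)          # (70, 70) (5, 1000, 70)
```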

Figure 7: Examples of velocity maps (left column) and their corresponding seismic measurements in five channels (remaining columns) in the FlatFault dataset (rows 1 to 3) and the CurvedFault dataset (rows 4 to 6).

3.5 Dataset Statistics

We summarize the basic statistics of all datasets included in OpenFWI in table 1 below.

Dataset | Size | Training Set | Testing Set | Seismic Data Size | Velocity Map Size | Note
FlatVEL | 247 GB | 45K | 5K | 3 × 2000 × 150 | 100 × 150 | Simple situation with flat layers.
CurvedVEL | 247 GB | 45K | 5K | 3 × 2000 × 150 | 100 × 150 | Simple situation with curved layers.
Kimberlina-CO2 | 96 GB | 15K | 4,430 | 9 × 1251 × 101 | 401 × 141 | Simulation of CO2 leakage in 991 scenarios over a duration of 200 years.
Style-transfer | 172 GB | 65K | 2K | 10 × 200 × 200 | 200 × 200 | Velocity maps transferred from real-life images; contains two resolutions.
FlatFault | 96 GB | 24K + 24K | 6K | 5 × 1000 × 70 | 70 × 70 | Besides 30K labeled data pairs, contains 24K unlabeled seismic data.
CurvedFault | 96 GB | 24K + 24K | 6K | 5 × 1000 × 70 | 70 × 70 | Besides 30K labeled data pairs, contains 24K unlabeled seismic data.
3D Kimberlina-V1 | 1.4 TB | 1,664 | 163 | — | — | Experimental version generated from a limited number of full-size velocity models; simulations non-uniformly cover a time range of 200 years.
Table 1: Dataset summary.

4 OpenFWI Benchmarks

In this section, we present experimental results on each dataset obtained with InversionNet [wu2019inversionnet] and InversionNet3D [zeng2021inversionnet3d]. InversionNet is an encoder-decoder based fully convolutional neural network model and has shown promising results on data-driven FWI [wu2019inversionnet, feng2021multiscale, zhang2020data, rojas2020physics]. Due to the distinct data size of each dataset, we slightly adjust the network architecture and parameters per dataset; the details are summarized in the supplementary materials.

4.1 Experimental Settings

We perform an extensive empirical study and report the best results for each dataset. The two most commonly used loss functions are adopted: the ℓ1-loss and the ℓ2-loss. Notice that the optimization method and other training details vary between datasets and are therefore introduced separately for each dataset. For a comprehensive evaluation, we report three metrics: MAE, MSE, and structural similarity (SSIM) [wang2004image], for models trained with both loss functions. MAE and MSE capture the numerical difference between the predicted and ground-truth velocity maps; in our experiments, we normalize the data entries to [0, 1] for convenience during training. SSIM, in contrast, measures the perceptual similarity between two images.
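As a minimal sketch of how these metrics can be computed for one prediction (function and variable names are our own; the exact evaluation code may differ):

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def evaluate(pred, true):
    """Return MAE, MSE, and SSIM between two velocity maps in [0, 1]."""
    mae = np.mean(np.abs(pred - true))
    mse = np.mean((pred - true) ** 2)
    sim = ssim(true, pred, data_range=1.0)   # data_range matches the [0, 1] normalization
    return mae, mse, sim

# Example with random 70 x 70 maps standing in for a FlatFault prediction.
rng = np.random.default_rng(0)
true = rng.random((70, 70))
pred = np.clip(true + 0.01 * rng.standard_normal((70, 70)), 0.0, 1.0)
print(evaluate(pred, true))
```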

4.2 Experimental Results

4.2.1 Style-transfer Dataset

We test the Style-transfer dataset with InversionNet. The network is composed of 6 encoder layers and 6 decoder layers. Each encoder layer consists of 2 convolutional layers with strides of 1 and 2, respectively. Each decoder layer consists of a convolutional layer with a stride of 1 and a transposed convolutional layer with a stride of 2. There are 32 channels in the first encoder layer; the number of channels is doubled in each subsequent encoder layer and then halved in each subsequent decoder layer. In the last decoder layer, an output activation function is applied to produce the final result. We train for 12 epochs with the ℓ1 and ℓ2 loss functions, respectively. Examples of the predicted velocity maps are given in Figure 8, and the quantitative results are given in Table 2.
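The following is a rough sketch of such an encoder-decoder network in PyTorch; the kernel sizes, activation choices, channel cap, and the final resizing are our assumptions rather than the exact InversionNet configuration used for this benchmark.

```python
import torch
import torch.nn as nn

def encoder_layer(c_in, c_out):
    # Two convolutions: one stride-1, one stride-2 (halves the spatial size).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, stride=2, padding=1), nn.ReLU(inplace=True),
    )

def decoder_layer(c_in, c_out):
    # One stride-1 convolution followed by a stride-2 transposed convolution.
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1), nn.ReLU(inplace=True),
    )

class EncoderDecoder(nn.Module):
    """6 encoder + 6 decoder layers; 32 channels, doubled then halved."""
    def __init__(self, in_channels=10, widths=(32, 64, 128, 256, 512, 1024)):
        super().__init__()
        enc, c = [], in_channels
        for w in widths:
            enc.append(encoder_layer(c, w))
            c = w
        dec = []
        for w in reversed(widths[:-1]):
            dec.append(decoder_layer(c, w))
            c = w
        dec.append(decoder_layer(c, 1))      # last layer outputs one velocity channel
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Style-transfer seismic input: 10 sources x 200 timesteps x 200 receivers.
net = EncoderDecoder()
out = net(torch.zeros(1, 10, 200, 200))
print(out.shape)   # torch.Size([1, 1, 256, 256]); crop/resize to 200 x 200 would follow
```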

Figure 8: Examples of predicted velocity maps for the Style-transfer dataset under the ℓ1 and ℓ2 loss settings. The first row shows velocity maps predicted with the ℓ1 loss, the second row shows velocity maps predicted with the ℓ2 loss, and the third row shows the true velocity maps.
Dataset | Loss | MAE | MSE | SSIM
Style-transfer | ℓ1 | 0.1423 | 0.0496 | 0.7786
Style-transfer | ℓ2 | 0.1327 | 0.0401 | 0.7898
Table 2: Quantitative results on the Style-transfer dataset with the ℓ1 and ℓ2 loss settings in terms of MAE, MSE, and SSIM.

4.2.2 FlatFault & CurvedFault Dataset

Table 3 shows the quantitative results of InversionNet trained with the ℓ1 loss function and the ℓ2 loss function, respectively. Since the number of receivers and the number of timesteps in the seismic data are unbalanced, we modify the encoder part of InversionNet to first extract temporal features, until the temporal dimension is close to the spatial dimension, and then extract spatial-temporal features. The dimension of the bottleneck latent feature is 512. We also keep the center-crop layer in the decoder part to transform feature maps into the desired dimensions.
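A minimal sketch of this temporal-first idea is shown below; the kernel sizes, strides, and layer counts are our assumptions, not the exact modified InversionNet encoder. Temporal-only convolutions shrink the 1000-step time axis until it is comparable to the 70 receivers, after which square kernels extract spatial-temporal features down to a 512-dimensional bottleneck.

```python
import torch
import torch.nn as nn

# Seismic input for FlatFault/CurvedFault: 5 sources x 1000 timesteps x 70 receivers.
temporal_then_spatial = nn.Sequential(
    # Temporal-only convolutions: kernels and strides act on the time axis only.
    nn.Conv2d(5, 32, kernel_size=(7, 1), stride=(2, 1), padding=(3, 0)), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=(7, 1), stride=(2, 1), padding=(3, 0)), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=(7, 1), stride=(2, 1), padding=(3, 0)), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=(7, 1), stride=(2, 1), padding=(3, 0)), nn.ReLU(),
    # Time axis is now ~63, close to the 70 receivers; switch to square kernels.
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    # Collapse what remains into a 512-dimensional bottleneck feature.
    nn.AdaptiveAvgPool2d(1),
)

feat = temporal_then_spatial(torch.zeros(1, 5, 1000, 70))
print(feat.shape)   # torch.Size([1, 512, 1, 1])
```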

Dataset | Loss | MAE | MSE | SSIM
FlatFault | ℓ1 | 0.0111 | 0.0012 | 0.9799
FlatFault | ℓ2 | 0.0106 | 0.0007 | 0.9866
CurvedFault | ℓ1 | 0.0174 | 0.0029 | 0.9625
CurvedFault | ℓ2 | 0.0177 | 0.0021 | 0.9676
Table 3: Quantitative results on the FlatFault and CurvedFault datasets with the ℓ1 and ℓ2 loss settings in terms of MAE, MSE, and SSIM.

5 Conclusion

In this paper we present OpenFWI, the first open-source dataset and benchmark platform for data-driven seismic FWI. OpenFWI aims to facilitate research in both the geoscience and machine learning communities with abundant, realistic, and diverse data resources, along with detailed, comparable, and reproducible experimental results. The seven benchmark datasets encompass FWI in 2D and 3D scenarios, cover both supervised and unsupervised learning settings, and are distinct in terms of size and model complexity. We expect OpenFWI to help advance the frontier of data-driven FWI research.

References