1 Introduction
The subsurface geophysical properties (velocity, density, bulk modulus, etc.) are critical to a myriad of subsurface applications, such as carbon sequestration, earthquake detection and earlyalarming, etc [Virieux2009Overview]. These important properties can be inferred by solving seismic full waveform inversion (FWI), which is illustrated in figure 2. With the target of minimizing the residual between predicted and observed seismic signals, FWI falls into the family of nonconvex optimization problems with PDE constraints. Such problems have been intensively studied in the paradigm of physicsdriven approaches [fichtner2010full, zhang2012wave, ma2012image, zhang2013double, feng2019transmission+, feng2021mpi]. Notorious complications of these approaches include extravagant computational power consumption, cycleskipping, and illposedness issues. In spite of substantial efforts by imposing various regularization terms [lin2014acoustic, lin2015quantifying, hu2009simultaneous, guitton2012blocky, chen2020multiscale] to mitigate the aforementioned problems, the issue of expensive computation is still inevitable in the physicsdriven framework.
With the advance of deep learning techniques, researchers have been actively exploring endtoend solutions for complicated scientific computational imaging problems [Deep2021Adler, willard2020integrating, zhu2019applications, mehta2019high, yang2019deep]. These methods are usually regarded as Datadriven approaches, since they yield a hefty data dependency, just to name a few: quantity, diversity, authenticity, etc. In recent years, promising results have been demonstrated for datadriven FWI. The early attempt by [araya2018deep]
introduced an 8layer fully neural network model to obtain velocity maps from shot gathers. Inspired by the renowned results of using convolutional neural networks in Computer Vision, prolific works have been proposed based on encoderdecoder fully convolutional networks
[yang2019deep, wu2019inversionnet, wang2018velocity, feng2021multiscale]. These models focus on 2D solutions. Very recently, the first solution for 3D imaging has been demonstrated by [zeng2021inversionnet3d], which employs group convolution in the encoder and invertible layers in the decoder for high efficiency and scalability. For all the aforementioned methods, the models are supervised, meaning that both velocity maps and seismic data are required for the training purpose. But in practice, it is unrealistic to obtain such a large volume of labeled data in advance. To alleviate the heavy reliance on labels, [Jin2021Unsupervised] recently developed an unsupervised inversion model incorporating both InversionNet model and the governing wave equation. For a thorough view of the research line of datadriven FWI, please refer to a survey paper [adler2021deep].Undeterred by the delights on this celebrated progress, we have a crucial observation: All experiments are implemented on individual datasets scarcely published! This immediately leads to several defective implications on (1) Unified Evaluation: How do we confirm the empirical superiority of one machine learning solution to another? The experimental protocol has to be identical, which is not possible with data not being published. A unified evaluation also helps our community with a global perspective of what we have achieved and what lies ahead. (2) Further Improvement: What if motivated researchers would like to further improve the proposed algorithm but have no data to test with? The data access denial only strangles valuable inspirations in the cradle. (3) Reproducibility and Integrity: Although it sounds a pathological scenario, people may still preserve the skeptics on the integrity of such a blossom of results, as implementing deep neural networks becomes more trivial. We also remark that the lack of opensourced data is reasonable due to the difficulty of data acquisition. We either collect data from a realworld field study, which requires arduous human labor, or synthesize velocity maps and then generate seismic data by forward modeling. Because of these concerns and conditions, we believe our community has arrived at a complete and sound junction to establish an opensource data and benchmark platform.
We present OpenFWI, the first collection of largescale seismic FWI datasets. OpenFWI is dedicated to facilitating diversified, rigorous, and reproducible research on datadriven FWI. Seven benchmark datasets will be released ^{2}^{2}2More pending approval by Los Alamos National Laboratory and U.S. Department of Energy, will be released chronologically upon approval. in OpenFWI, together with baseline experimental results by InversionNet[wu2019inversionnet], an encoderdecoder based deep learning solution for FWI. The benchmark datasets in OpenFWI demonstrate the following favorable characteristics:

Multiscale: OpenFWI covers multiple scales of data samples. The smallest one has around 20K data samples and can fit into the memory of a single GPU, which supports training without massive computation power. The large datasets contain about 70K data samples, which are usually trained in the distributed setting, further expediting the development of scalable algorithms for deep learningdriven FWI.

Diverse Domains: OpenFWI empowers both 2D and 3D scenarios of FWI. The 3D Kimberlina dataset is the first 3D dataset synthesized very recently. The 2D datasets include velocity models that are representative of realistic subsurface applications. The Kimberlina CO datasets are accompanied by timestamps and can be used for timelapse imaging.

Various levels of model complexity:
Depending on the subsurface modeling, OpenFWI encompasses a wide range of ’model complexity’. We provide an estimation on the difficulty of learning based on the synthesis process and our training experience. As a result, OpenFWI supports researchers to start with simple datasets and refine for the harder ones.
The rest of the paper is organized as follows: Section 2 introduces the background of the governing equation, and the method used for data synthesis. Section 3 provides a detailed description for each dataset with illustrations. Section 4 includes experimental results as baselines for each dataset. Section 5 concludes the paper and discusses future works.
2 Backgrounds
2.1 Acoustic Forward Modelling
The governing equation of the acoustic wave forward modeling is the wave equation,
(1) 
where represents the spatial location in Cartesian coordinates, denotes time, and are the material bulk modulus and density, respectively. represents the pressure wavefield, and is the source term that specifies the location and time history of the source. Then, the compressional (P) velocity can be represented as .
Forward wave propagation modeling entails calculating equation 1 given the boundary conditions and source, as well as subsurface geophysical parameters, including velocity and density . For simplify, We can denote the forward modeling problems expression as
(2) 
where represents the highly nonlinear forward mapping and
is the subsurface geophysical parameter vector.
In the simulation of the following datasets, we use the finitedifference method [moczo2007finite] with 2ndorder accurate in time and 8thorder accurate in space. The absorbing boundary conditions [engquist1977absorbing] are applied to all the boundaries.
3 OpenFWI Datasets
3.1 The FlatVEL & CurVEL Dataset
3.1.1 Data Description
FlatVEL and CurvedVEL are two largescale geophysical dataset, which each consists of 50K pairs of seismic data and velocity maps. FlatVEL are velocity maps with flat layers and CurvedVEL are velocity maps with curved layers. The seismic data and velocity maps are splitted as 45K/5K for training, validation and testing respectively. Samples are shown in figure 3 and figure 4.
The size of the velocity maps in FlatVEL and CurvedVEL are both 100
150 grid points and the grid size is 15 meter in both x and z direction. Each velocity map contains 2 to 5 layers and the thickness of each layer ranges from 5 to 80 grids. The layers in the CurvedVEL follows a sine function. Compared to the maps with flat layers in FlatVEL, curved velocity maps yield much more irregular geological structures, making inversion more challenging. The velocity in each layer is randomly sampled from a uniform distribution between 1.5
and 3.5 . The velocity is designed to increase with depth to be more physically realistic. We also add geological faults to every velocity map. The faults shift from 0 grids to 80 grids, and the tilting angle ranges from 25 to 165.To synthesize the seismic data, three sources are distributed on the surface at 0.975km, 1.125km and 1.275km. The seismic traces are recorded by 150 receivers positioned at each grid with an interval of 15 meter. The source is a Ricker wavelet (Wang, 2015) with a central frequency of 15 Hz. Each receiver records timeseries data for 2 second, and we use a 1 millisecond sample rate to generate 2,000 timesteps. Therefore, the dimensions of seismic data become 3×2000×150.
3.1.2 Data Generation
The generation of the synthetic FlatVEL and CurveVEL is based on the iterative deformation of randomly generated flatten subsurface structures. There are four major steps to generate a new subsurface sample:

Generate random flatten subsurface structures by randomly setting each layer’s width and velocity. The velocity of each layer is sorted in descending order from the bottom to the top.

Deform the flatten image by function by setting the period and the magnitude of the randomly to mimic the extrusion effect.

Divide the image into two parts by generating a line randomly and shifting one of the parts to mimic the fault. Make sure there is no intersection with previous faults in the current iteration.

Discard the part without shifting and go back to step 2 to fill the empty region until the whole image is fulfilled.
3.1.3 File Format and naming
Format: All samples of FlatVEL and CurvedVEL datasets are stored in .npz format. Velocity maps and seismic data are stored as subdirectories in each dataset. Each file contains a single NumPy array of one sample. The shapes of the arrays in velocity map files and seismic data files are (1, 3, 150, 100) and (1, 150, 100) respectively. It’s worth mentioning that .npz is nothing more than a compressed format of .npy, we converted the data only for the sake of space consumption reducement.
Naming: The naming of files follows the format of {data}_{n} for seismic data and {model}_{n} for velocity maps, n denotes the index of a file (starting from 0). Notice that for the same n, data and model becomes a pair. Here are some examples:

data_1.npz is the file that contains seismic data, is the first file among all.

model_99.npz is the file containing the 99th velocity map.
3.2 The KimberlinaCO Dataset
3.2.1 Data Description
The Kimberlina dataset contains 991 CO2 leakage scenarios, each simulated over a duration of 200 years, with 20 leakage velocity maps provided (i.e., at every 10 years) for each scenario [jordan2017characterizing]. Excluding missing values, in total, it has 19430 pairs of seismic data and velocity maps. The data are splitted as 15K/4430 for training and testing. Samples are shown in Figure figure 5.
The size of the velocity maps in KimberlinaCO dataset are 401141 grid points and the grid size is 10 meter in both x and z direction. To synthesize the seismic data, 9 sources are evenly distributed along the top of the model, with depths of 5 . The seismic traces are recorded by 141 receivers positioned at each grid with an interval of 10 meter. The source is a Ricker wavelet with a central frequency of 25 Hz. Each receiver records timeseries data for 2 second, and we use a 0.5 ms sample rate to generate 2,000 timesteps, then the seismic data is downsampled to 1251101 to save the memory and the computation cost for datadriven FWI. Therefore, the dimensions of seismic data become 91251101.
3.2.2 File Format and naming
All samples of KimberlinaCO datasets are stored in .npz format. Velocity maps and seismic data are stored as subdirectories in each dataset. ach file contains a single NumPy array of one sample. The shapes of the arrays in velocity map files and seismic data files are (9, 1251, 101) and (401, 141) respectively. It’s worth mentioning that .npz is nothing more than a compressed format of .npy, we converted the data only for the sake of space consumption reducement.
Naming: The naming of files follows the format of {data}_sim{n}_t{m} for seismic data and {label}_sim{n}_t{m} for velocity maps, n denotes the 4 digital index of a simulation (starting from 0), and m represents the timesteps from 0 to 190 (at every 10 years). Notice that for the same n and m, data and model becomes a pair. Here are some examples:

data_sim0000_t0.npz is the file that contains seismic data, is the first file among all.

label_sim0991_t160.npz is the file containing the velocity map of the 991th scenarios at the 160th year.
3.3 The Styletransfer Dataset
3.3.1 Data Description
Styletransfer dataset is a largescale geophysical dataset built by styletransfer method, which contains seismic data for two types of velocity maps: highresolution maps and lowresolution maps. Each type has 67K pairs of seismic data and velocity maps. The data are splitted as 65K/2K for training and testing. Samples are shown in Figure 6.
The COCO dataset
[lin2014microsoft] is set as the content images and The Marmousi model is set as the style image. A image transfer network is trained to transfer the COCO dataset to velocity perturbations. Then a 1D velocity model is added to the velocity perturbations to construct the velocity maps. Then the velocity maps are smoothed by a Gaussian filter with random standard derivation from 0 to 5 to build the highresolution velocity maps and from 5 to 10 to build the lowresolution velocity maps. This dataset can be applied to the datadriven FWI and traveltime inversion, more details about the velocity building and the applications of this dataset can be found in[feng2020physically] and [feng2021multiscale].The size of the velocity maps in Styletransfer dataset are 200200 grid points and the grid size is 5 meter in both x and z direction. To synthesize the seismic data, 10 sources are evenly distributed on the surface with a spacing of 100 . The seismic traces are recorded by 200 receivers positioned at each grid with an interval of 5 meter. The source is a Ricker wavelet with a central frequency of 15 Hz. Each receiver records timeseries data for 2 second, and we use a 0.0002s sample rate to generate 5,000 timesteps, then the seismic data is downsampled from 5000200 to 200200 to save the memory and the computation cost for datadriven FWI. Therefore, the dimensions of seismic data become 10×200×200.
3.3.2 File Format and naming
All samples of Styletransfer datasets are stored in .npz format. Velocity maps and seismic data are stored as subdirectories in each dataset. and indicate the low resolution velocity maps and their seismic data while and indicate the high resolution velocity maps and their seismic data. It’s worth mentioning that .npz is nothing more than a compressed format of .npy, we converted the data only for the sake of space consumption reducement.
Naming: The naming of files follows the format of {data}_{n} for seismic data and {model}_{n} for velocity maps, n denotes the index of a file (starting from 0). Notice that for the same n, data and model becomes a pair. Here are some examples:

data_1.npz is the file that contains seismic data, is the first file among all.

model_99.npz is the file containing the 99th velocity map.
3.4 The FlatFault & CurvedFault Dataset
3.4.1 Data Description
FlatFault and CurvedFault are two largescale geophysics dataset for FWI, each of which consists of 54K seismic data including 30K with paired velocity maps (labeled) and 24K unlabeled. Velocity maps in FlatFault and CurvedFault contain flat layers and curved layers, respectively. The 30K labeled pairs of seismic data and velocity maps are splitted into training set (24K), validation set (3K) and testing set (3K). Samples are shown in Figure 7.
The aim of CurvedFault dataset is to better validate effectiveness of FWI methods on curved topography, and the shape of those curves follows a sine function. All velocity maps in FlatFault and CurvedFault contain 2 to 4 layers, and the velocity in each layer is uniformly sampled between 3,000 meter/second and 6,000 meter/second. The dimensions of each velocity map are 7070 grid points, and the grid size is 15 meter in both directions. The layer thickness ranges from 15 to 35 grids. The velocity is designed to increase with depth to be more physically realistic. Geological faults are also added to every velocity map. The faults shift from 10 to 20 grids, and the tilting angle ranges from 123 to 123.
Five Ricker [wang2015frequencies] sources with a central frequency of 25 Hz are used for seismic data synthesis, and they are evenly distributed on the surface with a spacing of 255 meter. The seismic traces are recorded by 70 receivers positioned at each grid with an interval of 15 meter. Each receiver records timeseries data for 1 second, and we use a 1 millisecond sample rate to generate 1,000 timesteps. Therefore, the dimensions of seismic data become 5100070.
3.4.2 File Format and naming
Format: All samples in OpenFWI are stored in .npy format. Velocity maps and seismic data are stored in separate files. Each file contains a single NumPy array of 500 samples. The shapes of the arrays in velocity map files and seismic data files are (500, 1, 70, 70) and (500, 5, 1000, 70), respectively.
Naming: The naming of files can be described as {velseis}_{n}_1_{i}.npy, where vel and seis specify whether a file stores velocity maps or seismic data, n denotes the number of layers in (corresponding) velocity maps and i is the index of a file (start from 0) among the ones with the same n. Here are several examples:

[itemsep=0.2em, topsep=0em, leftmargin=*]

vel_2_1_3.npy is the file that contains velocity maps with two layers, and it is the fourth file among all the files with twolayer velocity maps.

vel_4_1_0.npy is the file that contains velocity maps with four layers, and it is the first file among all the files with fourlayer velocity maps.

seis_4_1_0.npy is the file that contains the seismic data corresponding to the velocity maps in vel_4_1_0.npy.
3.5 Dataset Statistics
We summarize the basic statistics of all datasets included in OpenFWI in table 1 below.
Dataset  Size  Training Set  Testing Set  Seismic Data Size  Velocity Map Size  


FlatVEL  247GB  45K  5K  
Note  Simple situation with flat layers.  


CurvedVEL  247GB  45K  5K  
Note  Simple situation with curved layers.  


KimberlinaCO2  96GB  15K  4430  
Note  Simulation of co2 leakage in 991 scenarios over a duration of 200 years.  


Styletransfer  172GB  65K  2K  
Note  Velocity maps transfer from reallife images, and contains two resolutions.  


FlatFault  96GB  24K + 24K  6K  
Note  Except for 30K labeled data pairs, It contains 24K unlabeled seicmic data.  


CurvedFault  96GB  24K + 24K  6K  
Note  Except for 30K labeled data pairs, It contains 24K unlabeled seicmic data.  


3D KimberlinaV1  1.4TB  1664  163  
Note 

4 OpenFWI Benchmarks
In this section, we demonstrate experimental results on each dataset InversionNet [wu2019inversionnet] and InversionNet3D [zeng2021inversionnet3d]. InversionNet is an encoderdecoder based fullyconvolutional neural network model, and has shown promising results on datadriven FWI [wu2019inversionnet, feng2021multiscale, zhang2020data, rojas2020physics]. Due to the distinct data size of each dataset, we change the network architecture and parameters slightly, which are summarized in the supplementary materials.
4.1 Experimental Settings
We perform extensive empirical study and provide the best results for each dataset. Two most commonly used loss functions are adopted:
loss and loss. Notice that the optimization method and other training details vary between datasets, thus are introduced separately for each dataset. To illustrate a comprehensive evaluation, we introduce three metrics: MAE loss, MSE loss and structral similarity (SSIM) [wang2004image] for results obtained by both loss functions. MAE loss and MSE loss both capture the numerical difference between the predicted and ground truth velocity maps. In our experiments, we normalize the data entries to 01 for the convenience during training. The SSIM, however, measures the perceptual similarity between two images.4.2 Experimental Results
4.2.1 Styletransfer Dataset
We have tested the Styletransfer dataset with the InversionNet. The network is composed of 6 encoder layers and 6 decoder layers. Each encoder layer is composed of 2 convolutional layer with stride equal to 1 and 2, respectively. Each decoder layer is composed of a convolutional layer with stride equal to 1 and a transposed convolutional layer with stride equal to 2. There are 32 channels in the first encoder layer and the number of channels is doubled in the latter encoder layers and then halved in the latter decoder layers. In the last decoder layer, a
activation function is applied to give the final result. We have run 12 epochs with and loss function. The examples of the predicted velocity maps are given in figure 8 and the quantitative results are given in table 2.Dataset  Loss  Velocity Error  
loss  loss  MAE  MSE  SSIM  
Styletransfer  ✓  0.1423  0.0496  0.7786  
✓  0.1327  0.0401  0.7898 
4.2.2 FlatFault & CurvedFault Dataset
Table 3 shows the quantitative results of InversionNet trained with loss function and loss function respectively. Since the number of receivers and the number of timesteps in seismic data are unbalanced, we modify the encoder part in InversionNet to first extract temporal features until the temporal dimension is close to the spatial dimension and then extract spatialtemporal features. The dimension of the bottleneck latent feature is 512. We also keep the centercrop layer in the decoder part to transform feature maps into desired dimensions.
Dataset  Loss  Velocity Error  
loss  loss  MAE  MSE  SSIM  
FlatFault  ✓  0.0111  0.0012  0.9799  
✓  0.0106  0.0007  0.9866  
CurvedFault  ✓  0.0174  0.0029  0.9625  
✓  0.0177  0.0021  0.9676 
5 Conclusion
In this paper we present OpenFWI, the first opensource dataset and benchmark platform for datadriven seismic FWI. OpenFWI aims to facilitate research in both geoscience and machine learning community with plenteous, realistic and diverse data resources, along with detailed, comparable and reproducible experimental results. The seven benchmark datasets encompass FWI in 2D and 3D scenarios, cover both supervised and unsupervised learning setting, and are distinct in terms of size and model complexity. OpenFWI is expected to contribute to the advancement of the frontier of the datadriven FWI research.