Log In Sign Up

InversionNet3D: Efficient and Scalable Learning for 3D Full Waveform Inversion

Recent progress in the use of deep learning for Full Waveform Inversion (FWI) has demonstrated the advantage of data-driven methods over traditional physics-based approaches in terms of reconstruction accuracy and computational efficiency. However, due to high computational complexity and large memory consumption, the reconstruction of 3D high-resolution velocity maps via deep networks is still a great challenge. In this paper, we present InversionNet3D, an efficient and scalable encoder-decoder network for 3D FWI. The proposed method employs group convolution in the encoder to establish an effective hierarchy for learning information from multiple sources while cutting down unnecessary parameters and operations at the same time. The introduction of invertible layers further reduces the memory consumption of intermediate features during training and thus enables the development of deeper networks with more layers and higher capacity as required by different application scenarios. Experiments on the 3D Kimberlina dataset demonstrate that InversionNet3D achieves state-of-the-art reconstruction performance with lower computational cost and lower memory footprint compared to the baseline.


page 1

page 2

page 8

page 9

page 10

page 11

page 12


Physics-Consistent Data-driven Waveform Inversion with Adaptive Data Augmentation

Seismic full-waveform inversion (FWI) is a nonlinear computational imagi...

Coupling Deep Learning with Full Waveform Inversion

Full waveform inversion (FWI) aims at reconstructing unknown physical co...

An Intriguing Property of Geophysics Inversion

Inversion techniques are widely used to reconstruct subsurface physical ...

Data-driven Full-waveform Inversion Surrogate using Conditional Generative Adversarial Networks

In the Oil and Gas industry, estimating a subsurface velocity field is a...

Deep Learning for Simultaneous Inference of Hydraulic and Transport Properties

Identifying the heterogeneous conductivity field and reconstructing the ...

A data-driven choice of misfit function for FWI using reinforcement learning

In the workflow of Full-Waveform Inversion (FWI), we often tune the para...

Implicit Full Waveform Inversion with Deep Neural Representation

Full waveform inversion (FWI) commonly stands for the state-of-the-art a...

I Introduction

Fig. 1: Network Architecture of InversionNet3D.

The shape of displayed tensors is adjusted for better perception and thus does not reflect the real size of data. The number above each tensor represents the number of channels. There are 25 available seismic data records per sample in our experiment, each resulted from one seismic source and they are considered as different channels in the input. We extract one of the channels for display. We crop out a sub-volume from seismic data and velocity map, respectively, for clearer visualization of their internal structure.

Full waveform inversion (FWI) is a high-resolution seismic imaging technique that reconstructs the velocity maps by making use of seismic waveform information [tarantola2005inverse]. Given an initial velocity map, FWI aims to find an improved map by minimizing the residual between the predicted and observed seismic data [feng2019transmission+]. Key ingredients of FWI are an efficient forward-modeling engine and a local differential approach to calculate the gradient and the Hessian operators [virieux2009overview]. As 3D surveys become widely implemented, 3D FWI are required to solve complex geological cases [warner2013anisotropic, vigh2014elastic, fu2020multiscale]. However, both the computational cost of forward-modeling and the gradient calculation are much higher in the 3D cases than the 2D cases [zheng20183d, wang20193d]. Thus 3D FWI is still a challenging problem due to the high computational cost.

FWI and many image generation tasks in computer vision such as style transfer

[gatys2015, gatys2016, luan2017, johnson2016]

, image-to-image translation

[isola2017, zhu2017b, zhu2017c], and cross-modal image synthesis [reed2016, zhang2017b, zhang2019a, xu2018] share similar problem background and learning objectives. Due to the advances in image generation based on deep learning, some researchers have made attempts in using deep networks to reconstruct velocity maps from seismic data in an end-to-end manner. Deep models with recurrent network [fabien-ouellet2019a], encoder-decoder architecture [li2020c, wang2018, yang2019, wu2020a]

or generative adversarial networks (GAN)

[zhang2019c] achieve acceptable performance in FWI. However, the limitation of these prior methods is also obvious since they only address 2D FWI problems. Although 2D FWI can be useful in some scenarios, real-world applications often require 3D velocity map reconstruction for a more detailed and more comprehensive analysis of subsurface structures.

Nevertheless, simple inflation of 2D convolution kernels into their 3D counterparts is not adequate for solving the problem. 3D networks typically involve a large number of parameters, which brings about extra obstacles in optimization. Considering that current 3D FWI datasets [aminzadeh1996three, vigh2014elastic, guo2017elastic] contain very limited numbers of samples, large-scale 3D networks are very likely to overfit to the dataset during training. Moreover, 3D FWI leads to much larger memory consumption due to input, output and intermediate features all being volumetric. This further hinders building deeper networks or establishing models with special components. It has previously been demonstrated that deeper networks with larger capacity can serve as better feature extractors [simonyan2015, he2016]. Some complementary methods such as attention mechanism [xu2018], multi-task learning [li2020b], and multi-scale processing [leong2019] have also been proved to help generate image of higher quality. These techniques may also lead to better reconstruction performance in 3D FWI. However, when a basic encoder-decoder network already occupies most of the memory, adding extra layers or components becomes nearly impossible. Therefore, it is essential to have a compact, efficient, and scalable backbone network in order to address complex challenges in 3D FWI.

In this paper, we propose InversionNet3D (abbreviated as InvNet3D), an efficient and scalable network to solve 3D FWI. InvNet3D employs group convolution in its encoder to reduce unnecessary global operations on channel dimension and establishes a hierarchical framework of learning and aggregating channel information. Compared to prior networks with conventional convolution with a global receptive field over channels, InvNet3D better utilizes information from multiple sources with fewer parameters and lower computational complexity. Furthermore, in order to control the memory footprint resulting from intermediate representations, InvNet3D adopts invertible layers in every convolutional block, which transforms the model into a partially reversible network. Input tensors to invertible layers no longer need to be stored during inference since they can be easily restored from the output in the backward pass, which saves memory during training at an affordable cost of training time, and enables the introduction of additional layers and structures to the model according to a specific environment and task requirements.

Our paper is organized as follows. Section II briefly reviews related works. Section III provides detailed description of the baseline model (Section III-A), the Channel-Separated Encoder built upon group convolution (Section III-B), and Invertible Modules (Section III-C). Section IV introduces InversionNet3D, compares it to the baseline models through experiments, and analyzes the improvement in detail. Section V conducts ablation studies on the influence of different strategies of temporal subsampling and source selection on the performance. Concluding remarks are presented in Section VI.

Ii Related Works

Ii-a Deep Learning for Full Waveform Inversion

The success of deep networks in many image generation tasks indicates their great potential in handling the ill-posedness of seismic inversion. Compared to traditional FWI methods, deep models are much more efficient since the inference can be done within seconds after a few hours’ training. Many prior explorations focused on data-driven end-to-end reconstruction, which regarded FWI as a pure image-to-image translation problem. GeoDNN [araya-polo2018] made an early attempt at using a fully connected network with 8 layers for FWI. InversionNet [wu2020a]

then introduced a encoder-decoder network with modern convolutional neural network (CNN) features. ModifiedFCN

[wang2018] employed a similar network design and SeisInvNet [li2020c] enhanced each seismic trace with auxiliary knowledge from neighborhood traces for better spatial correspondence. VelocityGAN [zhang2019c]

achieved improved performance by using generative adversarial networks (GANs). Long Short-Term Memory (LSTM) 


was also used for decoding representations of seismic data into velocity vectors 

[fabien-ouellet2019a]. Some researchers have explored the combination of deep learning and traditional FWI pipelines, which takes the physical background into consideration. NNFWI [zhu2020a] used deep models to generate a physical velocity model, which is then fed to a PDE solver to simulate seismic waveforms. A thorough review on deep learning for FWI can be found in [Adler2021Deep]

. Although these effective neural networks are inspiring to the area and demonstrated the advantage of deep learning in performance and efficiency, they all focused on 2D FWI, which limits their application in real-world problems. To the best of our knowledge, InversionNet3D, proposed in this paper, is the first deep-learning-based solution in addressing 3D FWI.

Ii-B Invertible Networks

Invertible networks are a special kind of neural networks that allow the input to be reconstructed from the output. The invertibility is typically obtained via coupling layers [dinh2015a, dinh2017, kingma2018, gomez2017, jacobsen2018]

, autoregressive models

[papamkarios2017, kingma2016], or numerical inversion [chen2018, behrmann2019, chen2019]. Invertible networks has been demonstrated to provide desirable performance in discriminative tasks [dinh2015a, dinh2017, kingma2018], generative tasks [gomez2017, jacobsen2018], and solving both problems with one model [nalisnick2019]. Moreover, a full-invertible networks built upon a sequence of invertible operations alleviates the need to store intermediate activations in memory for gradient calculation during back-propagation [rumelhart1986]. Partial invertibility could also lead to substantial memory saving [gomez2017, brugger2019] when non-invertible operators, such as downsampling and deconvolution, exist in certain stage of the network.

Layer Output Shape InvNet3D-Encoder conv1_x ,

, stride

, conv2_x , , stride , conv3_x , , stride , conv4_x , , stride , conv5_x , , stride , conv6_x , , stride , conv7 , , stride GAP global average pooling
Layer Output Shape InvNet3D-Decoder conv1_x , , stride , conv2_x , , stride , conv3_x , , stride , conv4_x , , stride , conv5_x , , stride , conv6_x , , stride , conv7 , Crop center crop
TABLE I: InversionNet3D-Simple (InvNet3DS) Architecture. The properties of each convolutional/deconvolutional layer are listed as kernel size (), number of kernels, and stride (). Layers with kernel size colored in blue are convolutional layers while the others with kernel size colored in green are deconvolutional layers. If not specified, the kernel is with a default stride of

. All the layers are followed by a batch normalization

[ioffe] layer and a LeakyReLU activation [maas2013], except layer conv_7 in the decoder, which is followed by a batch normalization layer and a hyperbolic tangent activation.

Ii-C Efficient Deep Learning

Deep models with an enormous number of parameters trained on large-scale dataset achieve superior performance on many tasks. However, restricted computing resources and limited labeled data in real-world applications often impede the deployment of deep models. In order to adapt these over-parameterized, data-hungry, and computationally intensive models to a real-world environment, recent attempts have been made to reduce the dependency of deep networks on computing hardware and large-scale datasets. 3D FWI is a typical case where the input and output data are very large and are very expensive to acquire. Therefore, it is desirable that deep models designed for 3D FWI are light-weight and memory-friendly and can be trained with as little data as possible. The proposed method in this paper mainly focuses on achieving the first two goals. Several approaches have been proposed to reduce the number of parameters and/or the number of operations through group convolution [howard2017, sandler2018, zhang2018b, ma2018a], or neural architecture search [tan2019b, tan2019a], or the replacement of normal operator with hardware-friendly equivalent ones [wu2018a, he2019, lin2019]. The reduction of memory footprint in a memory-friendly model has been obtained via domain-specific network design [liu2019a] or storing fewer activations during training [chen2016a, gomez2017]. Similar to Partially Reversible U-Net [brugger2019] for medical image segmentation, our model based on invertible layers falls within the second category.

Iii Proposed Methods

Iii-a Baseline

In order to demonstrate the effectiveness of the proposed methods, we first establish a baseline model with simple network structures. We inflate all the 2D convolutions in InversionNet [wu2020a] (InvNet) into corresponding 3D counterparts and denote this baseline model as InversionNet3D-Simple (InvNet3DS). The detailed network architecture of InvNet3DS is provided in Table I.

Although this baseline model is designed with reference to InvNet and thus has the same number of layers and similar topological structures, we made extra adjustments on necessary layer parameters due to the large memory consumption of 3D FWI. As the decoder network gradually magnifies the input vector via deconvolution, the intermediate representations also become larger in deeper layers and occupy more GPU memory. In order to alleviate the heavy burden of memory usage, InvNet3DS presents a much faster shrinking of the size of the channel dimension. Convolutional blocks that are close to the output layer have many fewer filters compared to InvNet and thus the number of large volumetric output features becomes smaller. We also place the layers with a larger amplification factor between the size of output and input feature at the deeper stage of the decoder, leading to lower overall memory consumption. We found empirically that our computational environment supports a maximum of 16 filters in conv6_x. However, for the following experiments, the number of filters in this block is set to 4 due to the need to compare with deeper network architectures.

Considering the temporal length of input seismic data is typically much larger than its width and height, the encoder network of InvNet3DS perform more aggressive downsampling on temporal dimension than on spatial dimensions. In addition, unlike InvNet which uses a convolutional layer with a large kernel for final spatio-temporal downsampling, InvNet3DS utilizes global average pooling after conv_7 in the encoder network to generate a 512-dimensional compact representation for seismic data. This enables the network to handle input data of different spatial sizes and temporal lengths without modifying the structure of preceding layers.

Iii-B Channel-Separated Encoder

Conventional convolution aggregates information on channel dimension in a fully-connected manner. Every convolutional filter receives the input of all the channels. However, these dense connections between input and output feature maps on channel dimension lead to high computational cost and larger model size. Consider a convolutional layer with the input of size and filter of size , then the number of parameters and floating-point operations (FLOPs) can be calculated as


By dividing filters of a conventional convolutional layer into group with each group corresponding to channels in the input, a group convolutional layer also divides #Params and FLOPs by since the channel size of every filter becomes . Note that a group convolutional layer with the group size of is equivalent to a conventional convolutional layer. Although group convolution makes the network smaller and more efficient, stacked group convolutional layers with the same group size actually form several sub-networks without interactions in between. This strong regularization and restriction on information exchange may lead to inferior performance. To enforce the interaction between sub-networks, channel shuffle [zhang2018b] is introduced to swap feature maps on channel dimension after one group convolutional layer so that every group in the next group convolutional layer perceives information from different proceeding sub-networks. An illustration of a GConv-CS-GConv block is given in Figure (a)a, where GConv and CS stand for group convolution and channel shuffle, respectively. We also provide an example of the computing workflow of such a group convolutional block in Figure (b)b. It can be observed that, with the group size equal to the channel size of the input, the first group convolutional layer in the block generate representations for each input channel and every channel in the final output feature maps represents certain information generated from channels in the input.

Fig. 2: Illustration of a two-layer group convolution block with channel shuffle. (a) The conversion of a convolution block into a group convolution block with channel shuffle. Three components in the new block share the same group size. (b) Dependencies and reorganization of information on channel dimension. The example describes the computational workflow with a 4-channel input, an 8-channel output and a group size of 4.

By converting all the convolutional blocks in the encoder into group convolutional blocks, we obtain a more light-weighted and more efficient encoder. We denote the result of such conversion as a Channel-Separated Encoder. Channel-Separated Encoder in seismic data processing, offers extra benefits in addition to higher efficiency. Typically, a complete set of seismic data consists of multiple records resulting from sources placed at different locations on the testing area. In prior research [li2020c, wang2018, yang2019, wu2020a, zhang2019c], these records are placed along the channel dimension of the input tensor. However, unlike RGB channels in natural images, which represent different color components, channels in seismic input correspond to different spatial information. Conventional convolutional layers, in the context of FWI, implement a sub-optimal strategy of fusing information from these channels since the semantic-level relationship and correspondence between records could hardly be found at shallow layers in the encoder. Feeding every filter with all the channels of raw input or low-level features may also lead to information loss due to the limited amount of filters at shallow layers. A progressive channel fusion scheme should serve as a better solution, which is what a Channel-Separated Encoder implements. Consider stacking the aforementioned GConv-CS-GConv block illustrated in Figure (a)a. Different components generated from each input channel will be gradually aggregated, leading to a hierarchical learning scheme where lower-level filters are dedicated to extracting useful information from local channels and higher-level filters fuse local representations to form a global description for seismic records of multiple sources. In particular, layer conv_7 in this encoder is with a group size of 512, which is referred to as depthwise convolution [chollet2017].

Iii-C Invertible Module

Fig. 3: Computational workflow for an invertible layer in forward pass (a) and backward pass (b). and in our models are convolutional layers that preserve the shape of input tensor.

Since back-propagation [rumelhart1986] is now the standard method of weight updating in modern neural networks, the inputs to all the layers have to be stored during inference for future gradient calculation in the backward pass. Therefore, in the training stage, most of the memory consumption of a deep network results from a large number of intermediate activations so that minimizing the need to store activations can substantially reduce memory usage. Gradient Checkpointing [chen2016a] involves a strategy of only storing activations at certain layers and recomputing the missing ones through additional partial forward pass when needed. Invertible networks make a step further by enabling the reconstruction of input from the output at all the invertible layers.

Invertible layers can be realized in many ways, as reviewed in Section II. Similar to [dinh2015a], our implementation is based on additive coupling. This kind of invertible layer consists of a pair of operators and entangled with each other in a way that is also used in Lifting Scheme [sweldens1998] for fast in-place calculation of the wavelet transform. In the forward pass, the input will be divided into two parts along channel dimension, denoted as and and their corresponding output and can be calculated via


The complete output can be obtained by simple concatenation of and on channel dimension. In the backward pass, the input can be figured out in the same way based on and as


Figure 3 provides an intuitive illustration of the computation above. Generally speaking, and

can be arbitrary types of operators; in our context, we would expect them to be both convolutional layers followed by an activation function. However, it can be observed from Equations (

3) and (4) that convolutional operators in an invertible layer must have a stride of 1 because otherwise the layer discards information and generates an output of different shape from the input. Considering the network architecture of InvNet3DS, the replacement of conventional layers with invertible layers can only take place at every second layer in each convolutional or deconvolutional block, e.g., conv1_2, as illustrated in Figure 4. This means an entire model with invertible layers is partially reversible since activations at non-invertible layers have to be preserved.

By stacking several invertible layers, we obtain an Invertible Module. Compared to a Non-invertible Module with stacked conventional layers, an Invertible Module of the same scale reduces the internal memory consumption from to since back-propagation could work as long as the activations and their derivatives for the top layer are given and thus all the memory taken by intermediate representation could be freed up. Invertible Modules introduce extra computational overhead during training due to another equivalent forward pass at each layer in gradient calculation. However, since the inference speed of a modern network is typically high, slightly longer training time is generally affordable. Note that Invertible Modules do not lead to slower inference and thus would not affect efficiency in a testing environment.

It is worth noticing that Invertible Modules implicitly introduce group convolution with a group size of 2. Therefore, Invertible Modules would also lead to a lighter-weighted network and potential inference acceleration. Furthermore, Invertible Modules could work with the proposed Channel-Separated Encoder as long as the latter is built with even group size. Suppose a Channel-Separated Encoder is with a group size of , in order to replace its every second convolutional layer with an Invertible Module, we simply need to convert the sub-layers, and , in every invertible layer into group convolutional layers with a group size of .

Fig. 4: Illustration of a convolution/deconvolution block with an invertible module. (a) and (b) illustrate the replacement of the second convolutional layer in the corresponding block with an invertible module, which may consist of multiple invertible layers. The second convolutional layers are typically established with the stride of 1, leading to an unchanged shape between input and output features, which makes the direct replacement possible.

Iv Experiments

Iv-a Experimental Settings

Iv-A1 Dataset

The 3D Kimberlina dataset [wagoner2009] is generated from a hypothetical numerical model built on the geologic structure of a commercial-scale geologic carbon sequestration (GCS) reservoir at the Kimberlina site in the southern San Joaquin Basin, 30 km northwest of Bakersfield, CA, USA. Particularly, the 3D geological model is described in [wagoner2009]. The details of reservoir models are provided in [Wainwright2013Modeling] and [Birkholzer2011Sensitivity]. The geophysical modeling approach used to generate the 3D P-wave velocity maps is further given in [Modeling-2018-Wang]. A portion of the Kimberlina Simulations can be downloaded from the DOE-EDX platform [NETL-2018-Kimberlina].

Our experimental dataset consists of 1,827 pairs of seismic data and 3D velocity maps, which we randomly split into a training subset and a validation subset with the ratio of 1664:163. Seismic data are simulated using the finite-difference method with 1,600 receivers uniformly distributed over the 2D earth surface and each of them captures vibration signals as time-series data of length 5,001 with a time spacing of 0.001s. In order to simplify the evaluation of our models, the seismic data in this dataset only involves acoustic-wave signals. However, our model is designed without any restrictions on input seismic wave types and thus should work with elastic-wave signals as well. There are 25 sources serving as stimulus placed evenly on the 2D spatial grid over the surface, each of which leads to one seismic data record. Therefore, the shape of raw seismic data is

(). The corresponding velocity map as output, reflecting subsurface structures beneath the testing area, is with the size of ().

Iv-A2 Training Configurations

We found that it is not necessary to include all 25 channels in the input due to the redundancy among seismic data records, so our input only contains 8 channels (6, 7, 8, 11, 13, 16, 17, 18). See Figure V-B for an explanation of these serial numbers. We provide detailed discussion on channel selection in Section V-B. Moreover, since the temporal size of the raw seismic data is too large for a normal spatio-temporal encoder and the information on temporal dimension is visually redundant, we uniformly sample 896 frames from the raw sequence. The temporal downsampling strategy is further discussed in Section V-A. Therefore, the default input seismic data has the shape . The ground truth velocity maps are rescaled into via min-max normalization in order to be compatible with the final output layer in the decoder with a activation.

We implement our models with PyTorch

[paszke2019] and train the models on 32 NVIDIA Tesla P100 GPUs with 16GB memory each. Batch Normalization [ioffe] layers are synchronized across processes. We adopt the AdamW [loshchilov2019] optimizer in all of our experiments with the initial learning rate of and a weight decay of

. We use the first 10 epochs for warm-up

[he2016] and divide the learning rate by 10 at epoch 40, 60, and 70. A typical training cycle consists of 80 epochs.

Iv-A3 Evaluation Metrics

Since the output velocity map represents the transmission velocity of the seismic waves at each spatial position, we adopt Mean Absolute Error (MAE) as the main metric to measure the prediction error. We also employ Root Mean Squared Error (RMSE), which implicitly gives a higher weight to large errors, as a complementary metric. Since we found most large errors occur at the location where complex subsurface layers are present, RMSE should better reflect the prediction accuracy in important high-frequency areas. In addition, considering the velocity maps are often displayed as images for visual analysis, we also employ mean Structural Similarity (SSIM) [zhouwang2004] for perceptual similarity evaluation. Note that MSE and RMSE are calculated with denormalized prediction and ground truth, i.e., velocity maps in their original data scale, while SSIM is computed in normalized scale as required by the algorithm.

Iv-B Experimental Results

In order to demonstrate the effectiveness of the proposed model, we first analyze the influence of Channel-Separated Encoder and Invertible Module on the reconstruction performance in Section IV-B1. Then we expand the Invertible Module to construct deeper networks and compare them with the models with stacked conventional convolution of the same scale in Section IV-B2 for a demonstration of the model’s scalability resulted from partial invertibility.

Model Channel-Separated Encoder Invertible Module
TABLE II: Naming convention of networks involved in experiments. A checked box indicates the network contains corresponding component(s).

Iv-B1 Comparisons with the baseline

figureTraining and validation loss of models with . tablePerformance Comparison between the baseline model (InvNet3DS) and the model with a channel-separated encoder (InvNet3DG). Computational complexity measured by floating point operations (FLOPs) is calculated during inference. Model #Blocks #Params GFLOPs MAE RMSE SSIM InvNet3DS 1 35.95M 3062.90 10.20 26.95 0.9820 InvNet3DI 30.97M 2953.02 10.38 27.36 0.9809 InvNet3DG 15.60M 2760.88 9.82 26.00 0.9831 InvNet3D 14.42M 2734.54 9.83 26.11 0.9826
Fig. 5: Histograms of pixel-wise absolute error. (a) Comparison between InvNet3DS and InvNet3DG. (b) Comparison between InvNet3DI and InvNet3D. Statistical data obtained from evaluations on the whole validation subset.
Fig. 6: Visualization of ground truth 3D velocity map (first row), a sampled 2D slice for comparison (second row), corresponding 2D slices extracted from predicted velocity map generated by InvNet3DS, InvNet3DI, InvNet3DG, and InvNet3D (third row to sixth row). The seventh row displays vertical velocity profiles (velocity - depth) at the location of the white cursor on the 2D slices. Rectangles with white dashed borderline highlight the region of interest in comparison. All the models are built with 26 layers. Each column presents one sample.

We introduced two components, Channel-Separated Encoder and Invertible Module, to improve the baseline, InvNet3DS. The proposed InvNet3D is equipped with both components. For a deeper analysis of the impact of these components, our experiments also involve the comparison with models inserted with either of them. We denote the model with only a Channel-Separated Encoder as InvNet3DG and the model with only Invertible Modules as InvNet3DI. Table II provides a summary for the naming convention. In this experiment, for a fair comparison, all the models have the same number of layers as the baseline model InvNet3DS, described in Table I, i.e., 13 layers in encoder and 13-layer in decoder. Therefore, the number of layers in Invertible Modules is set to be for InvNet3DI and InvNet3D.

Table IV-B1 shows the evaluation results. It is clear that InvNet3D outperforms InvNet3DS. The former is also 60% smaller and 10.7% faster than the latter. The results reported in this table can be further analyzed along two axes.

[font=, leftmargin=0pt, style=sameline]

The impact of Channel-Separated Encoder

Comparing InvNet3DS and InvNet3DI with InvNet3DG and InvNet3D, performance measured by all the three metrics is improved due to the use of group convolution. This improvement can probably be attributed to the reduced number of pixels with high prediction error, as displayed in Figure

5. Models with Channel-Separated Encoder obviously have a much smaller maximum pixel-wise error compared to their counterparts with an ordinary encoder, which demonstrates that the former models are better at utilizing multiple seismic data records and generating representations of higher quality. Furthermore, the former ones are also much lighter-weight and computationally efficient due to smaller filters in the group convolution.

The impact of Invertible Module

It can be observed that there are only minor differences between the performance of the models with invertible layers and that of those without invertible layers, which demonstrates that Invertible Modules, though learning representations in a workflow other than conventional convolution, would not heavily influence the capacity of models. Considering invertible layers implicitly introduce group convolution with a group size of 2, we believe the inferior performance of InvNet3DI compared to InvNet3DS results from such a non-ideal way of channel separation. In the comparison between InvNet3DG and InvNet3D, the influence of non-ideal channel separation becomes very insignificant since invertible layers in InvNet3D are enforced to consist of two convolutional layers with a group size of 4 in order to be compatible with the overall framework built upon group convolution with a group size of 8.

The visualizations of the generated velocity maps provided in Figure 6 intuitively illustrate the impact of the proposed components. These displayed samples mainly aim at comparing the models’ ability to reconstruct the detailed subsurface structure. For a better visibility, we enclose the distinct region of interest by rectangles with a white dashed borderline. It is worth mentioning that within this region, there are several low-velocity reflectors, which correspond to the shale layers existed in the actual geologic formation at the Kimberlina site [wagoner2009].

Iv-B2 Experiments with deeper networks

While deeper models can be expected to reconstruct velocity maps of higher quality due to the increased model capacity, there is an associated computational cost because the memory consumption by intermediate representations grows linearly with the number of layers. Invertible networks, whose memory consumption is irrelevant to the volume of activations, can be made very deep if needed. The same is also true for partially reversible networks like InvNet3D. For a demonstration of the benefits of invertibility and deeper networks, we expand InvNet3D by stacking more invertible layers in its Invertible Module. More precisely, we train and test InvNet3D models with the number of layers () in all Invertible Modules of 2, 3, and 4, respectively and compare them with the minimum version with . Accordingly, we also establish expanded InvNet3DG models, which employ Channel-Separated Encoder but are not partially reversible, for a fair comparison. In the following experiments, we will denote these models by ”[model]x[]”.

Fig. 7: Training and validation loss of InversionNet3D with #blocks varying from one to four.
tablePerformance Comparison between InvNet3DG and InvNet3D. Memory column presents maximum memory consumption in training stage. Model #Blocks #Params GFLOPs Memory 111Maximum memory consumption is obtained via torch.cuda.max_memory_allocated(), which provides statistics of memory used by tensors. Actual memory usage would be larger than the reported value due to memory reserved by caching memory allocator and context manager. MAE RMSE SSIM InvNet3DG 1 15.60M 2760.88 7.50GB 9.82 26.00 0.9831 2 17.99M 2824.14 9.77GB 9.45 25.43 0.9841 3 20.38M 2885.42 12.03GB 11.25 32.55 0.9798 4 22.77M 2946.68 14.31GB - - - InvNet3D 1 14.42M 2734.54 9.93GB 9.83 26.11 0.9826 2 15.63M 2767.48 9.93GB 9.52 25.35 0.9838 3 16.85M 2800.40 9.95GB 9.74 26.25 0.9826 4 18.06M 2833.34 9.98GB 10.33 28.01 0.9804

Table 7 shows the evaluation results. Since P100-16GB GPUs do not support the training of InvNet3DGx4 due to memory limitations, the model’s performance is not reported. In contrast, InvNet3Dx4 can be successfully trained on the same hardware. Although the maximum memory consumption InvNet3Dx1/2 is larger than that of InvNet3DGx1/2 since backward pass in the former requires extra memory for temporary results, deeper-version InvNet3DGs consumes much more memory than InvNet3Ds of the same scale. The growth of peak memory usage of InvNet3D is fairly stable as the extra memory is taken by model parameters instead of activations. Both models achieve the best performance with but start to perform worse when becomes larger than 2. We believe it is because the limited scale of our dataset (with only 1664 training samples in total) and the relatively homogeneous geologic structure described by these data make it difficult to train very deep networks. Should there be a dataset with more samples and more complex or diversified subsurface layers, the advantage of deeper networks could be better demonstrated and the tendency of improvement could continue with . That being said, InvNet3Dx3 still outperforms InvNet3DGx3, which should result from the reduced due to the use of Invertible Module.

We also visualize the predicted velocity map in Figure 8. It is clear that the highest-quality prediction is obtained with and deterioration starts at , which follows the same tendency as numerical results.

Fig. 8: Visualization of predicted velocity map samples generated by InvNet3Dx1, InvNet3Dx2, InvNet3Dx3 and InvNet3Dx4 (first row to fourth row). The fifth row displays vertical velocity profiles (velocity - depth) at the location of the white cursor on 2D slices. Rectangles with white dashed borderline highlight the region of interest in comparison. Each column presents one sample. 2D Ground truth velocity maps are provided in the second row of Figure 6.

V Ablation Study

In section IV, we briefly describe the subsampling strategy on the temporal and channel dimensions of input seismic data. We provide more detailed analysis of the influence of different strategies and seek more insight into the proposed model. For simplicity, all the following experiments are conducted with a InvNet3Dx1 model. All the training and testing configurations other than the testing factor are the same with the default option described in Section IV.

V-a Influence of Input Temporal Resolution

In prior research on 2D FWI [wu2020a, zhang2019c], the temporal length () of input data was set to 1000. In this paper, we adopt a similar value, 896 as the default configuration. Although the raw seismic data in 3D Kimberlina dataset have a temporal length of 5,001 samples, which means uniformly extracting 896 slices would lead to more than information on the temporal dimension being discarded, the extracted data sequences are still visibly redundant. In order to understand how sparse the seismic data could be subsampled temporally without affecting the model’s performance, we continue to decrease the temporal subsampling rate and the results are shown in Table III.

It can be observed that the model with performs very similar to that with , which indicates that there is high redundancy on temporal dimension even when only fewer than frames being sampled. However, leads to performance deterioration, though the influence of such highly sparsely sampled input seems to be even slighter than that of the model itself, considering the results shown in Table IV-B1 and 7. Visualization in Figure 9 reflects the same tendency. Due to the temporal downsampling scheme of InvNet3D, feeding the network with the input of is infeasible, but from these results, we could already conclude that the raw seismic data does have high redundancy on temporal dimension and the performance of our model is stable within a wide range of temporal subsampling rate.

896 9.83 26.11 0.9826
448 9.86 25.99 0.9832
224 10.12 26.50 0.9823
TABLE III: The influence of input temporal length () on reconstruction performance. All the experiments are implemented with an InvNet3Dx1 model.
Fig. 9: Visualization of predicted velocity map samples generated by InvNet3Dx1 with the input temporal length of 896, 448, and 224 (first row to third row). Rectangles with white dashed borderline highlight the region of interest in comparison. Each column presents one sample. Ground truth velocity maps are provided in the second row of Figure 6.

V-B Selection of Sources

There are 25 seismic records per sample in the 3D Kimberlina dataset. Figure V-B displays the placement of corresponding seismic sources on a 2D grid. Our default training and testing configurations make use of 8 out of 25 records, namely records 6, 7, 8, 11, 13, 16, 17, and 18, generated by seismic sources in the middle annular region. This strategy is implemented based on the consideration of the capacity and characteristics of Channel-Separated Encoder, for the reasons outlined below.

We first study the impact of the number of selected channels () by training and testing InvNet3Dx1 with 5 randomly generated strategies of channel selection for every in . Note that when , InvNet3Dx1 is equivalent to InvNet3DGx1. The averaged results for each value are reported in Table IV. It is clear that the model performs the best with . It can be easily understood that a smaller number of records contain fewer information of subsurface structure, thereby resulting in inferior performance. However, the model with also performs worse. This result can be explained based on the workflow of Channel-Separated Encoder. Since we always set the group size equal to , the number of filters corresponding to each input channel would be larger when using a smaller , which means more features per record will be extracted at the very first layer and fewer low-level information would be discarded.

The choice of also influence the stage in the network where filters start to possess a global receptive field over all the raw input channels, which will further impact the learning of high-level representations from the input. When becomes excessively large, filters receiving information from all the raw input channels exist in very deeper layers and thus there may not be sufficient subsequent layers for processing global information on channel dimension. Increasing the number of filters may help to improve the performance when or even larger, but it will also enlarge the model capacity as well. Therefore, for a given model capacity, more seismic data records being sampled for input do not necessarily contribute to higher performance.

We further analyze the influence of the detailed strategy of channel selection. For in , we select two typical sampling plans each, which are intuitively described in Figure V-B. Strategies in the same column sample the same number of sources, while those in the first-row sample sources with an inner-ward arrangement compared to those in the second row. Performance evaluation results are listed below the corresponding figure. With regard to the influence of , these detailed experiments lead to conclusions in accordance with Table IV. However, there are some surprising results in terms of the placement of chosen sources. It can be observed that models trained and tested with achieve better performance when sources are selected more inward, but a completely contradictory tendency can be found when . We believe this phenomenon implies that the reconstruction performance at the boundary area is sensitive to the coverage of seismic sources. When is large, critical information for reconstructing this marginal can be well presented. However, when becomes smaller, outward strategies have to be adopted in order to alleviate the deterioration of prediction for this area. This can be also inferred from the higher stability of performance with outer-ward strategies.

16 10.73 28.49 0.9798
8 10.41 27.57 0.9811
4 10.93 29.12 0.9795
1 11.25 29.94 0.9786
TABLE IV: The influence of number of selected sources on reconstruction performance. The reported results are averaged from five experiments with different channel selection strategy of each #channels configuration. All the experiments are implemented with an InvNet3Dx1 model.
figureSpatial Placement of Sources. Each source independently stimulates seismic wave propagating in subsurface layers. The direct wave and the reflected wave are then captured by receivers on the ground and recorded as seismic data of one channel in the input. 10.21 / 27.38 / 0.9809 9.83 / 26.11 / 0.9826 11.35 / 30.18 / 0.9773 11.35 / 30.21 / 0.9777 10.43 / 27.76 / 0.9805 10.01 / 27.15 / 0.9816 10.36 / 27.58 / 0.9813 10.41 / 27.61 / 0.9810 figureInfluence of different source selection strategy on reconstruction performance. The grid with a red dot indicates the seismic data generated by the source placed at the topological location is selected as one of the input channels for training and testing InvNet3D. The numbers below each figure provides the performance resulted from the corresponding strategy, listed in the format of MAE / RMSE / SSIM.

Vi Conclusion

We developed an efficient and scalable encoder-decoder network, InversionNet3D, for 3D full-waveform inversion. This model is partially reversible due to the use of invertible modules, which enables training very deep networks on limited memory. The encoder of this model is built upon group convolution and thus is able to learn representations from seismic data resulted from multiple sources with better hierarchical information flow, a smaller amount of parameters, and lower computational cost. We evaluate the model on the 3D Kimberlina dataset and demonstrate its superior performance over the baseline. We believe our InversionNet3D model not only serves as an effective solution for 3D full-waveform inversion, but also implies more possibilities of deep-learning-based approaches in this area.

Appendix A Comparison with Physics-based Methods

In order to compare the data-driven FWI with the physics-based FWI, we apply 3D physics-based FWI on the Kimberlina data. The velocity map without CO storage is used as the initial model and the propagating wavefields are simulated using finite difference method [carcione2002seismic]. The hybrid parallel implementation using MPI [mpi1993] and OpenMPI [gabriel04] are applied to increase the computational efficiency. However, the snapshot of the forward wavefields must be stored in the local file system due to the limitation of memory; thus the cross-correlation of the forward and backward wavefields heavily relies on the I/O (input/output) system [imbert2011tips]. The iterative updating of the velocity map is achieved by the conjugated gradient method and each iteration costs around 33 minutes. As the comparison with predicted velocity maps generated by InvNet3Dx1, the inverted results after 20 iterations are given in Figure 10. Different from the InversionNet3D, which only require the data from a few sources, the 3D physics-based FWI require dense acquisition configuration to remove the acquisition footprint. However, only 25 sources are aligned on the surface in the Kimberlina data; thus there are clear footprint artifacts near the surface in the inverted result. Moreover, only the center parts of the velocity map can be updated by the physics-based FWI due to the illumination of the ray paths.

Fig. 10: Visualization of ground truth 2D slice sample (first and second rows), corresponding prediction generated by InvNet3Dx1 and physics-based method (second and third row). Each column presents one sample.


This work was co-funded by the Laboratory Directed Research and Development program of LANL under project numbers 20210542MFR and 20200061DR, and by the U.S. DOE Office of Fossil Energy SMART Program.