I. Introduction
Full waveform inversion (FWI) is a high-resolution seismic imaging technique that reconstructs velocity maps by making use of seismic waveform information [tarantola2005inverse]. Given an initial velocity map, FWI aims to find an improved map by minimizing the residual between the predicted and observed seismic data [feng2019transmission+]. Key ingredients of FWI are an efficient forward-modeling engine and a local differential approach to calculate the gradient and the Hessian operators [virieux2009overview]. As 3D surveys become widely implemented, 3D FWI is required to solve complex geological cases [warner2013anisotropic, vigh2014elastic, fu2020multiscale]. However, both forward modeling and gradient calculation are far more computationally expensive in the 3D case than in the 2D case [zheng20183d, wang20193d]. Thus, 3D FWI remains a challenging problem due to its high computational cost.
FWI and many image generation tasks in computer vision such as style transfer
[gatys2015, gatys2016, luan2017, johnson2016], image-to-image translation [isola2017, zhu2017b, zhu2017c], and cross-modal image synthesis [reed2016, zhang2017b, zhang2019a, xu2018] share a similar problem background and similar learning objectives. Owing to the advances in deep-learning-based image generation, some researchers have attempted to use deep networks to reconstruct velocity maps from seismic data in an end-to-end manner. Deep models with recurrent networks [fabienouellet2019a], encoder-decoder architectures [li2020c, wang2018, yang2019, wu2020a], or generative adversarial networks (GANs)
[zhang2019c] achieve acceptable performance in FWI. However, the limitation of these prior methods is obvious: they only address 2D FWI problems. Although 2D FWI can be useful in some scenarios, real-world applications often require 3D velocity map reconstruction for a more detailed and comprehensive analysis of subsurface structures. Nevertheless, a simple inflation of 2D convolution kernels into their 3D counterparts is not adequate for solving the problem. 3D networks typically involve a large number of parameters, which brings extra obstacles in optimization. Considering that current 3D FWI datasets [aminzadeh1996three, vigh2014elastic, guo2017elastic] contain very limited numbers of samples, large-scale 3D networks are very likely to overfit the dataset during training. Moreover, 3D FWI leads to much larger memory consumption because the input, output, and intermediate features are all volumetric. This further hinders building deeper networks or establishing models with special components. It has previously been demonstrated that deeper networks with larger capacity can serve as better feature extractors [simonyan2015, he2016]. Complementary methods such as the attention mechanism [xu2018], multi-task learning [li2020b], and multi-scale processing [leong2019] have also been proven to help generate images of higher quality. These techniques may also lead to better reconstruction performance in 3D FWI. However, when a basic encoder-decoder network already occupies most of the available memory, adding extra layers or components becomes nearly impossible. Therefore, a compact, efficient, and scalable backbone network is essential for addressing complex challenges in 3D FWI.
In this paper, we propose InversionNet3D (abbreviated as InvNet3D), an efficient and scalable network to solve 3D FWI. InvNet3D employs group convolution in its encoder to reduce unnecessary global operations on channel dimension and establishes a hierarchical framework of learning and aggregating channel information. Compared to prior networks with conventional convolution with a global receptive field over channels, InvNet3D better utilizes information from multiple sources with fewer parameters and lower computational complexity. Furthermore, in order to control the memory footprint resulting from intermediate representations, InvNet3D adopts invertible layers in every convolutional block, which transforms the model into a partially reversible network. Input tensors to invertible layers no longer need to be stored during inference since they can be easily restored from the output in the backward pass, which saves memory during training at an affordable cost of training time, and enables the introduction of additional layers and structures to the model according to a specific environment and task requirements.
Our paper is organized as follows. Section II briefly reviews related works. Section III provides a detailed description of the baseline model (Section III-A), the Channel-Separated Encoder built upon group convolution (Section III-B), and Invertible Modules (Section III-C). Section IV introduces InversionNet3D, compares it to the baseline models through experiments, and analyzes the improvement in detail. Section V conducts ablation studies on the influence of different temporal subsampling and source selection strategies on performance. Concluding remarks are presented in Section VI.
II. Related Works
II-A Deep Learning for Full Waveform Inversion
The success of deep networks in many image generation tasks indicates their great potential in handling the ill-posedness of seismic inversion. Compared to traditional FWI methods, deep models are much more efficient since inference can be done within seconds after a few hours of training. Many prior explorations focused on data-driven end-to-end reconstruction, which regards FWI as a pure image-to-image translation problem. GeoDNN [arayapolo2018] made an early attempt at using an 8-layer fully connected network for FWI. InversionNet [wu2020a]
then introduced an encoder-decoder network with modern convolutional neural network (CNN) features. Modified-FCN
[wang2018] employed a similar network design, and SeisInvNet [li2020c] enhanced each seismic trace with auxiliary knowledge from neighboring traces for better spatial correspondence. VelocityGAN [zhang2019c] achieved improved performance by using generative adversarial networks (GANs). Long Short-Term Memory (LSTM)
[hochreiter1997] was also used for decoding representations of seismic data into velocity vectors
[fabienouellet2019a]. Some researchers have explored the combination of deep learning and traditional FWI pipelines, which takes the physical background into consideration. NNFWI [zhu2020a] used deep models to generate a physical velocity model, which is then fed to a PDE solver to simulate seismic waveforms. A thorough review of deep learning for FWI can be found in [Adler2021Deep]. Although these effective neural networks are inspiring to the area and demonstrate the advantages of deep learning in performance and efficiency, they all focus on 2D FWI, which limits their application to real-world problems. To the best of our knowledge, InversionNet3D, proposed in this paper, is the first deep-learning-based solution addressing 3D FWI.
II-B Invertible Networks
Invertible networks are a special kind of neural network that allows the input to be reconstructed from the output. The invertibility is typically obtained via coupling layers [dinh2015a, dinh2017, kingma2018, gomez2017, jacobsen2018][papamkarios2017, kingma2016] or numerical inversion [chen2018, behrmann2019, chen2019]. Invertible networks have been demonstrated to provide desirable performance in discriminative tasks [dinh2015a, dinh2017, kingma2018], generative tasks [gomez2017, jacobsen2018], and solving both problems with one model [nalisnick2019]. Moreover, a fully invertible network built upon a sequence of invertible operations alleviates the need to store intermediate activations in memory for gradient calculation during backpropagation [rumelhart1986]. Partial invertibility can also lead to substantial memory savings [gomez2017, brugger2019] when non-invertible operators, such as downsampling and deconvolution, exist at certain stages of the network.

II-C Efficient Deep Learning
Deep models with an enormous number of parameters trained on large-scale datasets achieve superior performance on many tasks. However, restricted computing resources and limited labeled data in real-world applications often impede the deployment of deep models. In order to adapt these over-parameterized, data-hungry, and computationally intensive models to a real-world environment, recent attempts have been made to reduce the dependency of deep networks on computing hardware and large-scale datasets. 3D FWI is a typical case where the input and output data are very large and very expensive to acquire. Therefore, it is desirable that deep models designed for 3D FWI be lightweight and memory-friendly, and trainable with as little data as possible. The method proposed in this paper mainly focuses on achieving the first two goals. Several approaches have been proposed to reduce the number of parameters and/or operations through group convolution [howard2017, sandler2018, zhang2018b, ma2018a], neural architecture search [tan2019b, tan2019a], or the replacement of normal operators with hardware-friendly equivalents [wu2018a, he2019, lin2019]. The reduction of the memory footprint in a memory-friendly model has been obtained via domain-specific network design [liu2019a] or by storing fewer activations during training [chen2016a, gomez2017]. Similar to the Partially Reversible U-Net [brugger2019] for medical image segmentation, our model based on invertible layers falls within the second category.
III. Proposed Methods
III-A Baseline
In order to demonstrate the effectiveness of the proposed methods, we first establish a baseline model with a simple network structure. We inflate all the 2D convolutions in InversionNet [wu2020a] (InvNet) into their 3D counterparts and denote this baseline model as InversionNet3D-Simple (InvNet3D-S). The detailed network architecture of InvNet3D-S is provided in Table I.
Although this baseline model is designed with reference to InvNet and thus has the same number of layers and a similar topological structure, we made extra adjustments to necessary layer parameters due to the large memory consumption of 3D FWI. As the decoder network gradually magnifies the input vector via deconvolution, the intermediate representations become larger in deeper layers and occupy more GPU memory. In order to alleviate this heavy memory burden, InvNet3D-S shrinks the channel dimension much faster. Convolutional blocks that are close to the output layer have far fewer filters than in InvNet, so the number of large volumetric output features becomes smaller. We also place the layers with a larger amplification factor between the output and input feature sizes at the deeper stages of the decoder, leading to lower overall memory consumption. We found empirically that our computational environment supports a maximum of 16 filters in conv6_x. However, for the following experiments, the number of filters in this block is set to 4 due to the need to compare with deeper network architectures.
Considering that the temporal length of the input seismic data is typically much larger than its width and height, the encoder network of InvNet3D-S performs more aggressive downsampling on the temporal dimension than on the spatial dimensions. In addition, unlike InvNet, which uses a convolutional layer with a large kernel for final spatiotemporal downsampling, InvNet3D-S utilizes global average pooling after conv_7 in the encoder network to generate a 512-dimensional compact representation of the seismic data. This enables the network to handle input data of different spatial sizes and temporal lengths without modifying the structure of the preceding layers.
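As a minimal sketch of why global average pooling yields a size-independent code (an illustrative example, not the InvNet3D implementation): averaging each channel's spatiotemporal volume produces one number per channel, so any input size maps to a fixed-length vector.

```python
# Illustrative sketch: global average pooling over flattened channel volumes.
# Each channel is reduced to a single scalar, so the output length depends
# only on the channel count, never on the spatial/temporal input size.

def global_avg_pool(feature):
    """feature: list of C channels, each a flattened list of voxel values."""
    return [sum(ch) / len(ch) for ch in feature]

small = [[1.0, 3.0], [2.0, 2.0]]        # C = 2, 2 voxels per channel
large = [[1.0] * 1000, [5.0] * 1000]    # C = 2, 1000 voxels per channel

# Both inputs collapse to a 2-dimensional representation.
print(global_avg_pool(small))  # -> [2.0, 2.0]
assert len(global_avg_pool(small)) == len(global_avg_pool(large)) == 2
```

This is the property that lets the encoder accept different spatial sizes and temporal lengths without structural changes.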
III-B Channel-Separated Encoder
Conventional convolution aggregates information on the channel dimension in a fully-connected manner: every convolutional filter receives the input of all the channels. However, these dense connections between input and output feature maps on the channel dimension lead to high computational cost and a larger model size. Consider a 3D convolutional layer with an input of size $C_{in} \times D \times H \times W$ and $C_{out}$ filters of size $C_{in} \times k \times k \times k$; the number of parameters and floating-point operations (FLOPs) can then be calculated as

\#\text{Params} = C_{out} \times C_{in} \times k^3,   (1)

\text{FLOPs} = C_{out} \times C_{in} \times k^3 \times D' \times H' \times W',   (2)

where $D' \times H' \times W'$ denotes the spatial size of the output feature maps.
By dividing the filters of a conventional convolutional layer into $g$ groups, with each group corresponding to $C_{in}/g$ channels in the input, a group convolutional layer also divides #Params and FLOPs by $g$, since the channel size of every filter becomes $C_{in}/g$. Note that a group convolutional layer with a group size of 1 is equivalent to a conventional convolutional layer. Although group convolution makes the network smaller and more efficient, stacked group convolutional layers with the same group size actually form several sub-networks without interactions in between. This strong regularization and restriction on information exchange may lead to inferior performance. To enforce interaction between sub-networks, channel shuffle [zhang2018b] is introduced to swap feature maps on the channel dimension after a group convolutional layer, so that every group in the next group convolutional layer perceives information from different preceding sub-networks. An illustration of a GConv-CS-GConv block, where GConv and CS stand for group convolution and channel shuffle respectively, is given in panel (a) of the corresponding figure, and panel (b) provides an example of the computing workflow of such a block. It can be observed that, with the group size equal to the channel size of the input, the first group convolutional layer in the block generates representations for each individual input channel, and every channel in the final output feature maps aggregates certain information from channels in the input.
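The counts in Equations (1) and (2) and the channel-shuffle permutation can be sketched in a few lines of plain Python (the helper names are ours, for illustration only; biases are omitted from the parameter count):

```python
# Hypothetical sketch: parameter/FLOP counts for a grouped 3D convolution,
# and the channel-shuffle permutation used between two grouped layers.

def conv3d_params(c_in, c_out, k, groups=1):
    """#Params of a 3D conv layer (bias omitted): C_out * (C_in/g) * k^3."""
    assert c_in % groups == 0 and c_out % groups == 0
    return c_out * (c_in // groups) * k ** 3

def conv3d_flops(c_in, c_out, k, out_voxels, groups=1):
    """Multiply-accumulate count: #Params times the output spatial size."""
    return conv3d_params(c_in, c_out, k, groups) * out_voxels

def channel_shuffle(channels, groups):
    """Reorder channels so each new group mixes all previous groups:
    reshape (g, C/g) -> transpose -> flatten."""
    per = len(channels) // groups
    return [channels[j * per + i] for i in range(per) for j in range(groups)]

# A grouped layer with g groups uses 1/g of the parameters and FLOPs:
dense = conv3d_params(64, 64, 3, groups=1)
grouped = conv3d_params(64, 64, 3, groups=8)
print(dense // grouped)  # -> 8

print(channel_shuffle(list(range(6)), groups=2))  # -> [0, 3, 1, 4, 2, 5]
```

The shuffled ordering interleaves channels from different groups, which is what lets the next grouped layer see information from every preceding sub-network.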
By converting all the convolutional blocks in the encoder into group convolutional blocks, we obtain a lighter and more efficient encoder. We denote the result of such a conversion as a Channel-Separated Encoder. In seismic data processing, a Channel-Separated Encoder offers extra benefits in addition to higher efficiency. Typically, a complete set of seismic data consists of multiple records resulting from sources placed at different locations in the testing area. In prior research [li2020c, wang2018, yang2019, wu2020a, zhang2019c], these records are placed along the channel dimension of the input tensor. However, unlike the RGB channels in natural images, which represent different color components, channels in the seismic input correspond to different spatial information. Conventional convolutional layers, in the context of FWI, implement a suboptimal strategy of fusing information from these channels, since the semantic-level relationships and correspondences between records can hardly be found at shallow layers in the encoder. Feeding every filter with all the channels of raw input or low-level features may also lead to information loss due to the limited number of filters at shallow layers. A progressive channel fusion scheme should serve as a better solution, which is exactly what a Channel-Separated Encoder implements. Consider stacking the aforementioned GConv-CS-GConv blocks: different components generated from each input channel will be gradually aggregated, leading to a hierarchical learning scheme where lower-level filters are dedicated to extracting useful information from local channels and higher-level filters fuse local representations to form a global description of seismic records from multiple sources. In particular, layer conv_7 in this encoder has a group size of 512, which is referred to as depthwise convolution [chollet2017].
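The progressive fusion above can be quantified with a small back-of-the-envelope sketch (an assumed idealized layer layout with perfect shuffling, not the exact InvNet3D encoder): with grouped layers interleaved with channel shuffle, the number of distinct input channels visible to one output channel grows by a factor of the channels-per-group at every layer.

```python
# Illustrative sketch: how many stacked GConv + channel-shuffle layers are
# needed before a single filter has (indirectly) seen all input channels,
# assuming an ideal shuffle and a fixed channels-per-group at each layer.

def layers_until_global(c_in, channels_per_group):
    """Stacked grouped layers needed before one filter sees all c_in inputs."""
    seen, layers = 1, 0
    while seen < c_in:
        seen *= channels_per_group
        layers += 1
    return layers

# 8 input records, each filter mixing 2 channels per layer: 2 -> 4 -> 8.
print(layers_until_global(8, 2))  # -> 3
```

This illustrates the hierarchical scheme: channel coverage widens layer by layer instead of being global from the first convolution.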
III-C Invertible Module
Since backpropagation [rumelhart1986] is now the standard method of weight updating in modern neural networks, the inputs to all the layers have to be stored during inference for future gradient calculation in the backward pass. Therefore, in the training stage, most of the memory consumption of a deep network results from the large number of intermediate activations, so minimizing the need to store activations can substantially reduce memory usage. Gradient Checkpointing [chen2016a] follows a strategy of only storing activations at certain layers and recomputing the missing ones through an additional partial forward pass when needed. Invertible networks go a step further by enabling the reconstruction of the input from the output at all invertible layers.
Invertible layers can be realized in many ways, as reviewed in Section II. Similar to [dinh2015a], our implementation is based on additive coupling. This kind of invertible layer consists of a pair of operators $\mathcal{F}$ and $\mathcal{G}$ entangled with each other in a way that is also used in the Lifting Scheme [sweldens1998] for fast in-place calculation of the wavelet transform. In the forward pass, the input is divided into two parts along the channel dimension, denoted as $x_1$ and $x_2$, and their corresponding outputs $y_1$ and $y_2$ can be calculated via

y_1 = x_1 + \mathcal{F}(x_2), \quad y_2 = x_2 + \mathcal{G}(y_1).   (3)

The complete output can be obtained by simple concatenation of $y_1$ and $y_2$ on the channel dimension. In the backward pass, the input can be recovered in the same way from $y_1$ and $y_2$ as

x_2 = y_2 - \mathcal{G}(y_1), \quad x_1 = y_1 - \mathcal{F}(x_2).   (4)
Figure 3 provides an intuitive illustration of the computation above. Generally speaking, $\mathcal{F}$ and $\mathcal{G}$ can be arbitrary operators; in our context, we expect them both to be convolutional layers followed by an activation function. However, it can be observed from Equations (3) and (4) that the convolutional operators in an invertible layer must have a stride of 1, because otherwise the layer discards information and generates an output of a different shape from the input. Considering the network architecture of InvNet3D-S, the replacement of conventional layers with invertible layers can only take place at every second layer in each convolutional or deconvolutional block, e.g., conv1_2, as illustrated in Figure 4. This means an entire model with invertible layers is partially reversible, since activations at non-invertible layers have to be preserved.

By stacking several invertible layers, we obtain an Invertible Module. Compared to a non-invertible module of stacked conventional layers, an Invertible Module of the same scale with $L$ layers reduces the internal memory consumption from $O(L)$ to $O(1)$, since backpropagation works as long as the activations and their derivatives for the top layer are given, and thus all the memory taken by intermediate representations can be freed up. Invertible Modules introduce extra computational overhead during training due to another equivalent forward pass at each layer for gradient calculation. However, since the inference speed of a modern network is typically high, a slightly longer training time is generally affordable. Note that Invertible Modules do not slow down inference and thus do not affect efficiency in a testing environment.
It is worth noting that Invertible Modules implicitly introduce group convolution with a group size of 2. Therefore, Invertible Modules also lead to a lighter network and potential inference acceleration. Furthermore, Invertible Modules can work together with the proposed Channel-Separated Encoder as long as the latter is built with an even group size. Suppose a Channel-Separated Encoder has a group size of $g$; in order to replace its every second convolutional layer with an Invertible Module, we simply need to convert the sub-layers $\mathcal{F}$ and $\mathcal{G}$ in every invertible layer into group convolutional layers with a group size of $g/2$.
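The additive-coupling scheme of Equations (3) and (4) can be sketched with simple stand-in functions in place of the convolutional sub-layers (a minimal illustration; the stand-ins for $\mathcal{F}$ and $\mathcal{G}$ are arbitrary and not the paper's layers):

```python
# Minimal sketch of an additive-coupling invertible layer: the inverse
# reconstructs the input exactly, so activations need not be stored.

def forward(x1, x2, F, G):
    y1 = [a + b for a, b in zip(x1, F(x2))]   # y1 = x1 + F(x2)
    y2 = [a + b for a, b in zip(x2, G(y1))]   # y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2, F, G):
    x2 = [a - b for a, b in zip(y2, G(y1))]   # x2 = y2 - G(y1)
    x1 = [a - b for a, b in zip(y1, F(x2))]   # x1 = y1 - F(x2)
    return x1, x2

# Arbitrary stand-ins for "convolution + activation" (illustration only):
F = lambda v: [3 * t + 1 for t in v]
G = lambda v: [2 * t - 5 for t in v]

x1, x2 = [1.0, 2.0], [3.0, 4.0]
y1, y2 = forward(x1, x2, F, G)
assert inverse(y1, y2, F, G) == (x1, x2)  # input exactly reconstructed
```

Note that invertibility holds for any choice of F and G, which is why they are free to be learned convolutional layers; only the stride-1, shape-preserving constraint discussed above applies.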
IV. Experiments
IV-A Experimental Settings
IV-A1 Dataset
The 3D Kimberlina dataset [wagoner2009] is generated from a hypothetical numerical model built on the geologic structure of a commercial-scale geologic carbon sequestration (GCS) reservoir at the Kimberlina site in the southern San Joaquin Basin, 30 km northwest of Bakersfield, CA, USA. In particular, the 3D geological model is described in [wagoner2009]. The details of the reservoir models are provided in [Wainwright2013Modeling] and [Birkholzer2011Sensitivity]. The geophysical modeling approach used to generate the 3D P-wave velocity maps is further described in [Modeling2018Wang]. A portion of the Kimberlina simulations can be downloaded from the DOE-EDX platform [NETL2018Kimberlina].
Our experimental dataset consists of 1,827 pairs of seismic data and 3D velocity maps, which we randomly split into a training subset and a validation subset of 1,664 and 163 samples, respectively. Seismic data are simulated using the finite-difference method with 1,600 receivers uniformly distributed over the 2D earth surface, each of which captures vibration signals as time-series data of length 5,001 with a time spacing of 0.001 s. In order to simplify the evaluation of our models, the seismic data in this dataset only involve acoustic-wave signals. However, our model is designed without any restrictions on input seismic wave types and thus should work with elastic-wave signals as well. There are 25 sources serving as stimuli placed evenly on the 2D spatial grid over the surface, each of which leads to one seismic data record; the raw seismic data therefore comprises 25 such records. The corresponding velocity map as output is a 3D volume reflecting the subsurface structures beneath the testing area.

IV-A2 Training Configurations
We found that it is not necessary to include all 25 channels in the input due to the redundancy among seismic data records, so our input only contains 8 channels (records 6, 7, 8, 11, 13, 16, 17, and 18; see the source-placement figure in Section V-B for an explanation of these serial numbers). We provide a detailed discussion of channel selection in Section V-B. Moreover, since the temporal size of the raw seismic data is too large for a normal spatiotemporal encoder and the information on the temporal dimension is visually redundant, we uniformly sample 896 frames from the raw sequence. The temporal downsampling strategy is further discussed in Section V-A. Therefore, the default input seismic data consists of 8 channels with a temporal length of 896. The ground truth velocity maps are rescaled into $[-1, 1]$ via min-max normalization in order to be compatible with the final output layer of the decoder, which uses a $\tanh$ activation.
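The preprocessing described above can be sketched as follows (an illustrative sketch with assumed details: simple rounded uniform index sampling and min-max rescaling to the [-1, 1] range of a tanh output; the helper names are ours):

```python
# Illustrative preprocessing sketch: uniform temporal subsampling of the
# 5,001-sample traces down to 896 frames, and min-max normalization of
# velocity values into [-1, 1].

def uniform_indices(raw_len, target_len):
    """target_len indices spread uniformly over a sequence of raw_len."""
    return [round(i * (raw_len - 1) / (target_len - 1))
            for i in range(target_len)]

def minmax_to_unit_range(values):
    """Rescale values into [-1, 1] via min-max normalization."""
    lo, hi = min(values), max(values)
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]

idx = uniform_indices(5001, 896)
assert len(idx) == 896 and idx[0] == 0 and idx[-1] == 5000

v = minmax_to_unit_range([1500.0, 3000.0, 4500.0])
print(v)  # -> [-1.0, 0.0, 1.0]
```

The same index list would be applied to every trace and every channel, so the first and last time samples are always retained.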
We implement our models with PyTorch
[paszke2019] and train the models on 32 NVIDIA Tesla P100 GPUs with 16 GB of memory each. Batch Normalization [ioffe] layers are synchronized across processes. We adopt the AdamW [loshchilov2019] optimizer with weight decay in all of our experiments. We use the first 10 epochs for warmup
[he2016] and divide the learning rate by 10 at epochs 40, 60, and 70. A typical training cycle consists of 80 epochs.

IV-A3 Evaluation Metrics
Since the output velocity map represents the transmission velocity of the seismic waves at each spatial position, we adopt Mean Absolute Error (MAE) as the main metric to measure the prediction error. We also employ Root Mean Squared Error (RMSE), which implicitly gives a higher weight to large errors, as a complementary metric. Since we found that most large errors occur at locations where complex subsurface layers are present, RMSE should better reflect the prediction accuracy in important high-frequency areas. In addition, considering that velocity maps are often displayed as images for visual analysis, we also employ mean Structural Similarity (SSIM) [zhouwang2004] for perceptual similarity evaluation. Note that MAE and RMSE are calculated with denormalized predictions and ground truth, i.e., velocity maps in their original data scale, while SSIM is computed in the normalized scale as required by the algorithm.
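For concreteness, MAE and RMSE over denormalized velocity values can be written as (a minimal sketch over flattened value lists; the sample numbers are made up for illustration):

```python
# Sketch of the error metrics above, computed on denormalized velocities.
import math

def mae(pred, gt):
    """Mean Absolute Error."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def rmse(pred, gt):
    """Root Mean Squared Error: squaring weights large errors more heavily."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

pred = [2000.0, 2500.0, 3100.0]  # made-up velocities (m/s)
gt   = [2010.0, 2490.0, 3140.0]
print(mae(pred, gt))   # -> 20.0
print(rmse(pred, gt))  # larger than MAE: the single 40 m/s error dominates
```

RMSE is never smaller than MAE; the gap between the two grows with the spread of the error distribution, which is why RMSE highlights the complex-layer regions mentioned above.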
IV-B Experimental Results
In order to demonstrate the effectiveness of the proposed model, we first analyze the influence of the Channel-Separated Encoder and the Invertible Module on reconstruction performance in Section IV-B1. Then we expand the Invertible Module to construct deeper networks and compare them with models built from stacked conventional convolutions of the same scale in Section IV-B2, demonstrating the model's scalability resulting from partial invertibility.
Model  Channel-Separated Encoder  Invertible Module
InvNet3D-S  –  –
InvNet3D-I  –  ✓
InvNet3D-G  ✓  –
InvNet3D  ✓  ✓
IV-B1 Comparisons with the Baseline
We introduced two components, the Channel-Separated Encoder and the Invertible Module, to improve the baseline InvNet3D-S. The proposed InvNet3D is equipped with both components. For a deeper analysis of the impact of these components, our experiments also include models equipped with only one of them. We denote the model with only a Channel-Separated Encoder as InvNet3D-G and the model with only Invertible Modules as InvNet3D-I. Table II provides a summary of this naming convention. In this experiment, for a fair comparison, all the models have the same number of layers as the baseline model InvNet3D-S described in Table I, i.e., 13 layers in the encoder and 13 in the decoder. Therefore, the number of layers in the Invertible Modules is set to 1 for InvNet3D-I and InvNet3D.
Table IV-B1 shows the evaluation results. It is clear that InvNet3D outperforms InvNet3D-S. The former is also 60% smaller and 10.7% faster than the latter. The results reported in this table can be further analyzed along two axes.
• The impact of the Channel-Separated Encoder:
Comparing InvNet3D-S and InvNet3D-I with InvNet3D-G and InvNet3D, performance measured by all three metrics improves due to the use of group convolution. This improvement can probably be attributed to the reduced number of pixels with high prediction error, as displayed in Figure 5. Models with a Channel-Separated Encoder have a much smaller maximum pixel-wise error than their counterparts with an ordinary encoder, which demonstrates that the former are better at utilizing multiple seismic data records and generating representations of higher quality. Furthermore, the former are also much lighter and more computationally efficient due to the smaller filters in the group convolution.

• The impact of the Invertible Module:

It can be observed that there are only minor differences between the performance of the models with invertible layers and that of those without, which demonstrates that Invertible Modules, though learning representations through a workflow different from conventional convolution, do not heavily affect model capacity. Considering that invertible layers implicitly introduce group convolution with a group size of 2, we believe the inferior performance of InvNet3D-I compared to InvNet3D-S results from this non-ideal form of channel separation. In the comparison between InvNet3D-G and InvNet3D, the influence of non-ideal channel separation becomes insignificant, since the invertible layers in InvNet3D are enforced to consist of two convolutional layers with a group size of 4 in order to be compatible with the overall framework built upon group convolution with a group size of 8.
The visualizations of the generated velocity maps provided in Figure 6 intuitively illustrate the impact of the proposed components. The displayed samples mainly aim at comparing the models' ability to reconstruct detailed subsurface structure. For better visibility, we enclose the distinct region of interest in rectangles with a white dashed borderline. It is worth mentioning that within this region there are several low-velocity reflectors, which correspond to the shale layers that exist in the actual geologic formation at the Kimberlina site [wagoner2009].
IV-B2 Experiments with Deeper Networks
While deeper models can be expected to reconstruct velocity maps of higher quality due to their increased capacity, there is an associated computational cost, because the memory consumed by intermediate representations grows linearly with the number of layers. Invertible networks, whose activation memory consumption is independent of the number of layers, can be made very deep if needed. The same is also true for partially reversible networks like InvNet3D. To demonstrate the benefits of invertibility and deeper networks, we expand InvNet3D by stacking more invertible layers in its Invertible Modules. More precisely, we train and test InvNet3D models with the number of layers $N$ in all Invertible Modules set to 2, 3, and 4, respectively, and compare them with the minimum version with $N = 1$. Accordingly, we also establish expanded InvNet3D-G models, which employ the Channel-Separated Encoder but are not partially reversible, for a fair comparison. In the following experiments, we denote these models by "[model]-x[$N$]".
Table 7 shows the evaluation results. Since P100-16GB GPUs do not support the training of InvNet3D-G-x4 due to memory limitations, that model's performance is not reported. In contrast, InvNet3D-x4 can be successfully trained on the same hardware. Although the maximum memory consumption of InvNet3D-x1/x2 is larger than that of InvNet3D-G-x1/x2, since the backward pass in the former requires extra memory for temporary results, the deeper InvNet3D-G variants consume much more memory than InvNet3D models of the same scale. The growth of the peak memory usage of InvNet3D is fairly stable, as the extra memory is taken by model parameters instead of activations. Both models achieve their best performance with $N = 2$ but start to perform worse when $N$ becomes larger than 2. We believe this is because the limited scale of our dataset (with only 1,664 training samples in total) and the relatively homogeneous geologic structure described by these data make it difficult to train very deep networks. Should there be a dataset with more samples and more complex or diversified subsurface layers, the advantage of deeper networks could be better demonstrated and the tendency of improvement could continue for larger $N$. That being said, InvNet3D-x3 still outperforms InvNet3D-G-x3, which we attribute to the parameter reduction introduced by the Invertible Module.
We also visualize the predicted velocity maps in Figure 8. It is clear that the highest-quality prediction is obtained with $N = 2$ and deterioration starts at $N = 3$, which follows the same tendency as the numerical results.
V. Ablation Study
In Section IV, we briefly described the subsampling strategies on the temporal and channel dimensions of the input seismic data. Here we provide a more detailed analysis of the influence of different strategies and seek more insight into the proposed model. For simplicity, all the following experiments are conducted with an InvNet3D-x1 model. All training and testing configurations other than the factor under test are the same as the default options described in Section IV.
V-A Influence of Input Temporal Resolution
In prior research on 2D FWI [wu2020a, zhang2019c], the temporal length $T$ of the input data was set to 1,000. In this paper, we adopt a similar value, 896, as the default configuration. Although the raw seismic data in the 3D Kimberlina dataset have a temporal length of 5,001 samples, which means uniformly extracting 896 slices discards more than 80% of the information on the temporal dimension, the extracted data sequences are still visibly redundant. In order to understand how sparsely the seismic data can be subsampled temporally without affecting the model's performance, we continue to decrease the temporal subsampling rate; the results are shown in Table III.
It can be observed that the model with $T = 448$ performs very similarly to that with $T = 896$, which indicates that there is high redundancy on the temporal dimension even when fewer than 10% of the frames are sampled. However, $T = 224$ leads to performance deterioration, though the influence of such highly sparsely sampled input appears even slighter than that of the model architecture itself, considering the results reported in Section IV-B. The visualization in Figure 9 reflects the same tendency. Due to the temporal downsampling scheme of InvNet3D, feeding the network with input of even lower temporal resolution is infeasible, but from these results we can already conclude that the raw seismic data do have high redundancy on the temporal dimension and that the performance of our model is stable within a wide range of temporal subsampling rates.
$T$  MAE  RMSE  SSIM
896  9.83  26.11  0.9826
448  9.86  25.99  0.9832
224  10.12  26.50  0.9823
V-B Selection of Sources
There are 25 seismic records per sample in the 3D Kimberlina dataset. The corresponding figure displays the placement of the associated seismic sources on a 2D grid. Our default training and testing configurations make use of 8 out of the 25 records, namely records 6, 7, 8, 11, 13, 16, 17, and 18, generated by seismic sources in the middle annular region. This strategy is motivated by the capacity and characteristics of the Channel-Separated Encoder, for the reasons outlined below.
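The default selection can be visualized with a short sketch. Assuming the 25 sources are indexed 0-24 in row-major order on a 5x5 grid (the indexing convention is an assumption for illustration), the eight default records form a ring in the middle of the grid:

```python
# Hypothetical layout: sources indexed 0-24, row-major on a 5x5 grid.
selected = {6, 7, 8, 11, 13, 16, 17, 18}  # default records from the text
rows = [" ".join("x" if r * 5 + c in selected else "." for c in range(5))
        for r in range(5)]
print("\n".join(rows))
# . . . . .
# . x x x .
# . x . x .
# . x x x .
# . . . . .
```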
We first study the impact of the number of selected channels (C) by training and testing InvNet3Dx1 with 5 randomly generated channel-selection strategies for every C in {1, 4, 8, 16}. Note that when C = 1, InvNet3Dx1 is equivalent to InvNet3DGx1. The averaged results for each value of C are reported in Table IV. It is clear that the model performs best with C = 8. It is easy to understand that a smaller number of records contains less information about the subsurface structure, thereby resulting in inferior performance. However, the model with C = 16 also performs worse. This result can be explained by the workflow of the Channel-Separated Encoder. Since we always set the number of groups equal to C, the number of filters corresponding to each input channel is larger when using a smaller C, which means more features per record are extracted at the very first layer and less low-level information is discarded.
The choice of C also influences the stage in the network at which filters start to possess a global receptive field over all the raw input channels, which further impacts the learning of high-level representations from the input. When C becomes excessively large, filters receiving information from all the raw input channels exist only in much deeper layers, and thus there may not be sufficient subsequent layers for processing global information on the channel dimension. Increasing the number of filters may help improve the performance when C = 16 or even larger, but it will also enlarge the model capacity. Therefore, for a given model capacity, sampling more seismic records for the input does not necessarily contribute to higher performance.
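The parameter accounting behind this trade-off can be sketched as below. The filter count of 32 at the first layer is an assumption for illustration; what matters is that with the number of groups tied to C, the grouped layer's parameter count stays fixed while the filters devoted to each record shrink as C grows, and at C = 1 the grouped layer coincides with a standard convolution:

```python
def grouped_conv3d_params(c_in, c_out, k, groups):
    """Weight count of a 3D conv with `groups` groups (bias ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * k ** 3 * c_out

n_filters = 32  # assumed first-layer filter count, for illustration only
for C in (1, 4, 8, 16):
    per_record = n_filters // C                        # filters per input record
    grouped = grouped_conv3d_params(C, n_filters, 3, C)  # channel-separated layer
    full = grouped_conv3d_params(C, n_filters, 3, 1)     # standard convolution
    print(C, per_record, grouped, full)
```

For C = 1 the two parameter counts coincide (864 weights each under these assumptions), while for C = 16 the grouped layer leaves only 2 filters per record.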
We further analyze the influence of the detailed channel-selection strategy. For each value of C, we select two typical sampling plans, which are depicted in the corresponding figure. Strategies in the same column sample the same number of sources, while those in the first row sample sources with a more inward arrangement compared to those in the second row. Performance evaluation results are listed below each subfigure. With regard to the influence of C, these detailed experiments lead to conclusions in accordance with Table IV. However, there are some surprising results in terms of the placement of the chosen sources. It can be observed that models trained and tested with a larger C achieve better performance when sources are selected more inward, but a completely contradictory tendency is found with a smaller C. We believe this phenomenon implies that the reconstruction performance in the boundary area is sensitive to the coverage of seismic sources. When C is large, critical information for reconstructing this marginal area can be well presented. However, when C becomes smaller, outward strategies have to be adopted in order to alleviate the deterioration of the predictions for this area. This can also be inferred from the higher stability of performance with outward strategies.
#Channels (C)  MAE  RMSE  SSIM
16  10.73  28.49  0.9798 
8  10.41  27.57  0.9811 
4  10.93  29.12  0.9795 
1  11.25  29.94  0.9786 
VI Conclusion
We developed an efficient and scalable encoder-decoder network, InversionNet3D, for 3D full-waveform inversion. The model is partially reversible thanks to the use of invertible modules, which enables training very deep networks with limited memory. The encoder is built upon group convolution and is thus able to learn representations from seismic data resulting from multiple sources with better hierarchical information flow, fewer parameters, and lower computational cost. We evaluated the model on the 3D Kimberlina dataset and demonstrated its superior performance over the baseline. We believe our InversionNet3D model not only serves as an effective solution for 3D full-waveform inversion but also opens up more possibilities for deep-learning-based approaches in this area.
Appendix A Comparison with Physics-Based Methods
In order to compare data-driven FWI with physics-based FWI, we apply 3D physics-based FWI to the Kimberlina data. The velocity map without CO2 storage is used as the initial model, and the propagating wavefields are simulated using the finite-difference method [carcione2002seismic]. A hybrid parallel implementation using MPI [mpi1993] and Open MPI [gabriel04] is applied to increase the computational efficiency. However, the snapshots of the forward wavefields must be stored in the local file system due to memory limitations; thus the cross-correlation of the forward and backward wavefields relies heavily on the I/O (input/output) system [imbert2011tips]. The iterative updating of the velocity map is achieved by the conjugate gradient method, and each iteration costs around 33 minutes. For comparison with the predicted velocity maps generated by InvNet3Dx1, the inverted results after 20 iterations are given in Figure 10. Different from InversionNet3D, which requires data from only a few sources, 3D physics-based FWI requires a dense acquisition configuration to remove the acquisition footprint. However, only 25 sources are aligned on the surface in the Kimberlina data; thus there are clear footprint artifacts near the surface in the inverted result. Moreover, only the center parts of the velocity map can be updated by physics-based FWI due to the limited illumination of the ray paths.
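The cross-correlation step described above can be sketched with the common zero-lag imaging condition, in which the gradient at each spatial point is the time integral of the second time-derivative of the forward wavefield multiplied by the adjoint (backward) wavefield. The wavefields and sizes below are synthetic placeholders, not the actual Kimberlina simulation:

```python
import numpy as np

def fwi_gradient(u_fwd, u_adj, dt):
    """Zero-lag cross-correlation of forward and adjoint wavefields.

    u_fwd, u_adj: arrays of shape (time, space); dt: time step.
    Returns one gradient value per spatial point.
    """
    # second time-derivative of the forward wavefield
    d2u = np.gradient(np.gradient(u_fwd, dt, axis=0), dt, axis=0)
    # correlate with the adjoint wavefield and integrate over time
    return (d2u * u_adj).sum(axis=0) * dt

T, N = 100, 8                                   # toy time and space sizes
t = np.linspace(0.0, 1.0, T)
u_fwd = np.sin(2 * np.pi * 5.0 * t)[:, None] * np.ones((T, N))
u_adj = np.ones((T, N))                         # placeholder adjoint field
g = fwi_gradient(u_fwd, u_adj, t[1] - t[0])
print(g.shape)                                  # (8,)
```

Because `d2u` requires the full forward wavefield history, in the 3D case these snapshots must be written to disk and read back, which is why the step is I/O-bound in practice.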
Acknowledgment
This work was co-funded by the Laboratory Directed Research and Development program of LANL under project numbers 20210542MFR and 20200061DR, and by the U.S. DOE Office of Fossil Energy SMART Program.