Fast Neural Representations for Direct Volume Rendering
Despite the potential of neural scene representations to effectively compress 3D scalar fields at high reconstruction quality, the computational complexity of the training and data reconstruction step using scene representation networks limits their use in practical applications. In this paper, we analyze whether scene representation networks can be modified to reduce these limitations and whether such architectures can also be used for temporal reconstruction tasks. We propose a novel design of scene representation networks using GPU tensor cores to integrate the reconstruction seamlessly into on-chip raytracing kernels, and compare the quality and performance of this network to alternative network- and non-network-based compression schemes. The results indicate competitive quality of our design at high compression rates, and significantly faster decoding times and lower memory consumption during data reconstruction. We investigate how density gradients can be computed using the network and show an extension where density, gradient and curvature are predicted jointly. As an alternative to spatial super-resolution approaches for time-varying fields, we propose a solution that builds upon latent-space interpolation to enable random access reconstruction at arbitrary granularity. We summarize our findings in the form of an assessment of the strengths and limitations of scene representation networks for compression domain volume rendering, and outline future research directions.
Learning-based lossy compression schemes for 3D scalar fields using neural networks have been proposed recently. While first approaches have leveraged the capabilities of such networks to learn general properties of scientific fields and use this knowledge for spatial and temporal super-resolution [ZHW17, HW20, GYH20, HW19], Lu et al. [LJLB21] have focused on the use of Scene Representation Networks (SRNs) [PFS19, SZW19, MST20] that overfit to a specific dataset to achieve improved compression rates.
SRNs were introduced as a compact encoding of (colored) surface models. They replace the initial model representation with a learned function that maps from domain locations to surface points. SRNs are modeled as fully connected networks where the scene is encoded in the weights of the hidden layers. This scene encoding—the so-called latent-space representation—can be trained from images of the initial object via a differentiable ray-marcher, or in object-space using sampled points that are classified as inside or outside the surface. Since SRNs allow for direct access of the encoded model at arbitrary domain points, ray-marching can work on the compact representation without having to decode the initial object.
Lu et al. [LJLB21] introduced neurcomp, an SRN where the mapping function has been trained to yield density samples instead of surface points. We subsequently refer to an SRN that predicts density samples as Volume Representation Network (V-SRN). By using a V-SRN, a ray-marcher can sample directly from the compact latent-space representation, and does not need to decode the initial volume beforehand. However, at every sample point along the view-rays, a deep network is called to infer the density sample.
Since SRNs are implemented using generic frameworks like PyTorch or Tensorflow where the basic building block is a network layer, intermediate states of each layer need to be written to global memory to make them available to the next layer. Thus, the evaluation becomes heavily memory-bound when deep networks consisting of multiple layers are used. For this reason, direct volume rendering using V-SRN is currently limited to non-interactive applications, with framerates that are significantly below what can be achieved on the initial data. Furthermore, the size of the networks that are used to generate the model representation drastically increases the training times.
Fig. 1 demonstrates the aforementioned properties for different datasets and a given memory budget of roughly to of the memory that is required by the original dataset. Given the internal format of the network weights, which is set to 16-bit half-precision floating-point values in the current examples, V-SRN automatically determines the internal network layout so that the memory budget is not exceeded. Compared to the initial datasets in (a), V-SRN in (c) with 18 layers (8 residual blocks) and 128 channels shows high reconstruction fidelity at the given compression rate. Compared to low-pass filtered versions of the original datasets (b), which are resampled to a resolution that matches the memory budget, even fine structures are well preserved. However, the rendering times range from 41 seconds to multiple minutes, and training times range from 39 minutes to multiple hours.
In this work, we demonstrate that the efficiency of V-SRN for volume rendering can be significantly improved, both with respect to training and data reconstruction. We achieve this by a novel compact network design called fV-SRN, which effectively utilizes the GPU TensorCores and uses a trained volumetric grid of latent features as additional network inputs. This enables fast training and significantly faster rendering from the compressed representation than prior work.
We compare fV-SRN to neurcomp by Lu et al. [LJLB21] as well as the non-network-based compression schemes TThresh [BRLP19] and cudaCompress [TRAW12, TBR12] regarding compression ratio and reconstruction speed. The results indicate that fV-SRN is significantly faster than neurcomp at similar compression rates, achieves similar compression rates as TThresh at significantly lower reconstruction times, and significantly outperforms cudaCompress in terms of compression ratio at similar decoding speed. Furthermore, since fV-SRN can render directly from the compressed representation, no additional memory is required at rendering time. Building upon the strengths of fV-SRN, we introduce an extension of fV-SRN that predicts scalar field values as well as derived quantities like gradients and curvature estimates.
We further demonstrate the use of fV-SRN for temporal super-resolution tasks, to perform smooth, yet structure-preserving interpolation between given volumetric datasets at two consecutive timesteps. This makes it possible to reduce the number of timesteps that need to be stored from a running simulation. We analyze the possibility of latent-space interpolation to perform this task, and demonstrate that the restriction of available super-resolution schemes like TSR [HW19] and STNet [HZCW21] to obtain interpolations only at a pre-defined discrete set of timesteps can be overcome. Our specific contributions are:
The design and implementation of a fast variant of V-SRN (fV-SRN) using a volumetric latent grid and running completely on fast on-chip memory.
An extension of fV-SRN to jointly predict a scalar quantity as well as the gradient and curvature at the given input position.
A temporal super-resolution fV-SRN using latent-space interpolation as a means for feature-preserving reconstruction of time-sequences at arbitrary temporal resolution.
In an ablation study we shed light on the design decisions and training methodology, and we perform a number of experiments to demonstrate the specific properties of fV-SRN. Its quality and performance are compared to state-of-the-art compression schemes targeting direct volume rendering applications. Our experiments include qualitative and quantitative evaluations, which indicate high compression quality even when small networks are used. fV-SRN can be integrated seamlessly into ray- and path-tracing kernels, and, compared to neurcomp, improves the rendering performance by about two orders of magnitude (between and ) and the performance of the training process by about a factor of , see Fig. 1. Due to the use of a low-resolution latent grid, temporal super-resolution between given instances in time can be used even for large time-varying sequences (Sec. 8).
There is a vast body of literature on compression schemes for volumetric fields and scene representation networks, and a comprehensive review is beyond the scope of this paper. However, for thorough overviews and discussions of the most recent works in these fields let us refer to the articles by Balsa Rodríguez et al. [RGG13], Beyer et al. [BHP14], Hoang et al. [HSB21], and Tewari et al. [TFT20].
Our approach, since it attempts to further improve the compressive neural volume representation by Lu et al. [LJLB21], falls into the category of lossy compression schemes for volumetric scalar fields. Previous studies in this field have utilized quantization schemes to represent contiguous data blocks by a single index or a sparse combination of learned representative values [SW03, FM07, GIGM12, GG16], or lossy curve fittings like the popular SZ compression algorithm [DC16, ZDL20]. Transform coding-based schemes [YL95, Wes95, LCA08] make use, in particular, of the discrete cosine and wavelet transforms. They try to transform the data into a basis in which only few coefficients are relevant while many others can be removed. More recently, Ballester-Ripoll et al. [BRLP19] introduce tensor decomposition to achieve extremely high compression rates exceeding 1:1000.
For interactive applications, methods combining transform coding-based schemes and other techniques listed above are often applied brick-wise and embedded into streaming pipelines [TBR12, RTW13, DMG20]. They achieve significantly smaller compression ratios than, e.g., TThresh, for the sake of efficient GPU decoding. For example, Marton et al. [MAG19] present a rendering pipeline capable of decompressing over 10Gvoxels/second while reporting a compression ratio of 1:64 (0.5 bits per sample on floating point data). In our work we target the high compression rates achieved by offline schemes like TThresh, while still being able to render images of large volumes from the compressed representation within a second.
Fixed-rate texture compression formats such as ASTC, S3TC and variants [INH99, Fen03, NLP12] are implemented directly by the graphics hardware. This means that rendering, including hardware-supported interpolation, is possible directly from the compressed stream. However, the fixed-rate stage allows little or no control over the quality vs. compression rate trade-off.
With the success of convolutional neural networks, deep learning methods have started to see applications in visualization tasks. Early works use super-resolution networks to upscale the data if either storing the high-resolution data is too expensive (3D spatial data [ZHW17, HW20, GYH20], temporal data [HW19], spatiotemporal data [HZCW21]) or the rendering process is too expensive (2D data [WCTW19, WITW20]). Sahoo and Berger [SB21], Lu et al. [LJLB21], and Wurster et al. [WSG21] utilize SRNs to learn a compact mapping from domain positions to scalar field values.
Berger et al. [BLL19] and Gavrilescu [Gav20] avoid the rendering process completely and train a network that directly predicts the rendered image from camera and transfer function parameters. This results in a compact representation of the data in the network weights from which the image can be directly predicted, but is limited concerning the generalization to new views or transfer functions if the training data does not provide this specific combination. Super-resolution methods for 3D spatial data or temporal data, on the other hand, are fixed on a regular grid in space (or time) due to the use of convolutional (recurrent) networks. Therefore, they do not allow for free interpolation and require the decompression of a whole block before rendering.
Scene Representation Networks (SRNs) address the above issues. By directly mapping a spatiotemporal position to the data value, random access is possible, as opposed to grid-based super-resolution methods. This also allows the camera to be moved freely during testing.
SRNs were first introduced for representing 3D opaque meshes, either as occupancy grids [MON19, MLL21] or signed distance fields [TLY21, DNJ20, LJM21, CLI20]. In these methods, the networks were trained in world-space, that is, from pairs of position to data value. This principle was also adopted in the VIS area by Lu et al. [LJLB21]. The large network used in this work, however, makes interactive rendering infeasible. For image-space training, that is, training from images through the rendering process, SRNs were first introduced for 3D reconstruction [SZW19], including NeRF by Mildenhall et al. [MST20, TSM20]. Further improvements to these methods include reduction of aliasing artifacts (Mip-NeRF) [BMT21], amortized speed improvements by caching already evaluated samples (FastNeRF) [GKJ21], or incorporation of lighting effects [SDZ21].
Let us remark that in the mentioned scenarios the networks are trained to predict single surfaces from images, i.e., a computer vision task. This allows for significantly larger step sizes, or the use of sparse latent grids where only regions close to the surface are resolved at high resolution [TLY21, YLT21]. In the extreme case, a network is completely replaced by a learned sparse voxel representation [YFKT21]. These approaches cannot be transferred to the direct volume rendering scenario addressed in our work, where a surface might not be given or is permanently changed by interactively selecting iso-contours in the volumetric scalar field. Nevertheless, we see potential for future transfer of techniques between both worlds, for example, by integrating the proposed custom TensorCore kernels into computer vision tasks, or extending fV-SRN with aliasing-reducing techniques inspired by Mip-NeRF.
Regarding dynamic scenes, Park et al. [PFS19] (modeling SDFs) and Chen and Zhang [CZ19] (modeling occupancy grids) introduce a latent vector that allows interpolating between different models. This is the basis for the time interpolation described in Sec. 8. Alternatively, Pumarola et al. [PCPMMN21] introduce a second network that models affine transformations from a base model.
Let $V: \mathbb{R}^3 \rightarrow \mathbb{R}^D$ be a 3D multi-parameter field, i.e., a mapping that assigns to each point in a given domain a set of $D$ dependent parameters. In this work, we focus on 3D scalar fields ($D = 1$) and color fields (), where at each domain point either a scalar density value is given or an RGB sample has been generated via a transfer function (TF) mapping.
SRNs [MST20, TSM20] encode and compress the field via a neural network comprised of fully-connected layers. The network takes a domain position as input and predicts the density or color at that position, i.e., a mapping $\Phi: \mathbb{R}^3 \rightarrow \mathbb{R}^D$. In detail, let $x_0 = p \in \mathbb{R}^3$ be the input position. Then, layer $i$ of the network is computed as $x_{i+1} = a(W_i x_i + b_i)$, where $W_i$ is the layer weight matrix, $b_i$ the bias vector and $a$ the element-wise activation function. The number of layers is denoted by $l$. The output of the last layer, $x_l$, is the final network output. The intermediate states $x_i$ are of size $c$, with $c$ being the number of hidden channels of the network. The matrices $W_i$ and bias vectors $b_i$ are the trainable parameters of the network.
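The layer recurrence described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`make_layer`, `linear`, `srn_forward`), the random initialization, the sine activation, the layer sizes, and the choice to leave the last layer without activation are all hypothetical and not taken from the paper.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights (n_out x n_in) and zero bias; stand-ins for trained values."""
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def linear(W, b, x):
    """Compute W x + b for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def srn_forward(layers, position, activation=math.sin):
    """Evaluate the fully connected network at one 3D position."""
    x = list(position)
    for i, (W, b) in enumerate(layers):
        x = linear(W, b, x)
        if i < len(layers) - 1:  # assumption: the output layer stays linear
            x = [activation(v) for v in x]
    return x

rng = random.Random(0)
sizes = [3, 32, 32, 32, 1]  # hypothetical: 3D input, 32 hidden channels, 1 density
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
density = srn_forward(layers, (0.25, 0.5, 0.75))
```

Each position is evaluated independently of all others, which is the property that later allows sampling the field only where a ray needs it.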
Since the network processes each input position independently of the other inputs, the volumetric field can be decoded at arbitrary positions only where needed, i.e., along the ray during direct volume raycasting. In practice, batches of thousands of positions are processed in parallel. In the spirit of previous SRNs [CZ19, MON19, PFS19], neurcomp by Lu et al. [LJLB21] is trained in world-space, using training pairs of position and density. We refer to Sec. 5 for a study of the most relevant network parameters and the details of the training method.
As shown by Mildenhall et al. [MST20] and Tancik et al. [TSM20], when the SRN is trained only with positional input and corresponding density or color output, the mapping function cannot faithfully represent high-frequency features in the data. To avoid this shortcoming, so-called Fourier features are used to lift 3D positions to a higher-dimensional space before sending the input to the network. In this way, the spread between spatially close positions is increased, and positional variations of the output values are emphasized.
Let $p \in \mathbb{R}^3$ and $m \in \mathbb{N}$, respectively, be the input positions to the network and the desired number of Fourier features that should be used (see Sec. 4.1 for a discussion of how to choose $m$). Then, a matrix $F \in \mathbb{R}^{m \times 3}$ – the so-called Fourier matrix – is defined. Mildenhall et al. [MST20] propose to construct the Fourier matrix based on diagonal matrices of powers of two, i.e.,
$$F = \begin{bmatrix} 2^0 I_3 \\ 2^1 I_3 \\ \vdots \\ 2^{L-1} I_3 \end{bmatrix},$$
where $L = m/3$ and $I_3$ denotes the $3 \times 3$ identity matrix. The matrix is fixed before the training process and not part of the trainable parameters. The inputs to the network are then enriched via vector concatenation as
$$\hat{p} = p \oplus \sin(2\pi F p) \oplus \cos(2\pi F p),$$
where $\oplus$ indicates the concatenation operation.
Alternatively, Tancik et al. [TSM20] reported better reconstruction quality when using random Fourier features, where the entries of the matrix $F$ are sampled from $\mathcal{N}(0, \sigma^2)$ using the dataset-dependent hyperparameter $\sigma$. In our experiments (Sec. 5.3), however, we could not observe these improvements and, therefore, follow the construction proposed by Mildenhall et al. [MST20]. In contrast to neurcomp, which does not make use of Fourier features, we observed a significant enhancement of the networks’ learning skills when incorporating these features.
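To make the two constructions concrete, the following sketch builds a power-of-two Fourier matrix and a random one, and applies the sine/cosine lifting to a single position. It is a minimal illustration, not the authors' code: the function names are hypothetical, and the $2\pi$ scaling inside the sine and cosine as well as the ordering of the concatenated parts are assumptions.

```python
import math
import random

def nerf_fourier_matrix(m):
    """Stack m/3 scaled 3x3 identity blocks: block k contains 2^k on its diagonal."""
    assert m % 3 == 0, "m is assumed to be a multiple of 3"
    rows = []
    for k in range(m // 3):
        for axis in range(3):
            row = [0.0, 0.0, 0.0]
            row[axis] = float(2 ** k)
            rows.append(row)
    return rows

def random_fourier_matrix(m, sigma, rng):
    """Entries drawn from a zero-mean normal distribution with std. dev. sigma."""
    return [[rng.gauss(0.0, sigma) for _ in range(3)] for _ in range(m)]

def encode(p, F):
    """Concatenate p with sin and cos of the projected position (2*pi is assumed)."""
    proj = [2.0 * math.pi * sum(f * x for f, x in zip(row, p)) for row in F]
    return list(p) + [math.sin(v) for v in proj] + [math.cos(v) for v in proj]

F = nerf_fourier_matrix(6)
features = encode((0.1, 0.2, 0.3), F)          # 3 + 2 * 6 values
F_rand = random_fourier_matrix(6, 1.0, random.Random(0))
features_rand = encode((0.1, 0.2, 0.3), F_rand)
```

With this layout the network input grows from 3 to 3 + 2m values, which matters for the layer-size constraints of a fused kernel.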
When using SRNs, the main computational bottleneck is the evaluation of the network to infer a data sample at a given domain location. In frameworks like PyTorch or Tensorflow, the basic building block a network is composed of is a single linear layer. On recent GPUs, when a layer is evaluated, the inputs and weights are loaded from global memory, updated, and the results are written back to global memory to make them available to the next layer. In direct volume rendering, if 100 steps along a ray are taken and the SRN consists of 7 layers, this amounts to 700 layer invocations and global memory read and write operations. In the following, we show that it is possible to completely avoid loading and storing the intermediate results to global memory by fusing the network into a single CUDA kernel and following certain size constraints as detailed below. This idea was previously applied for radiance caching in Monte-Carlo path tracing [MRNK21], but is extended here by the latent grid (see Sec. 4.2) and by avoiding all global memory access within the network layers (see Sec. 4.1). This gives rise to a speedup of up to of our custom CUDA TensorCore implementation, compared to a native PyTorch [PGM19] implementation, for the same V-SRN network architecture, see Sec. 5.1.
NVIDIA GPUs expose 64kB of fast on-chip memory per multiprocessor that is orders of magnitude faster than global memory (GBs of memory shared across all multiprocessors). These 64kB are divided into 48kB freely accessible shared memory and 16kB of L1-cache. Furthermore, the tensor core (TC) units on modern GPUs provide warp-synchronous operations to speed up matrix-matrix multiplications by a factor of [MDCL18]. A warp is a group of 32 threads that are executed in lock-step on a single multiprocessor. The core operation of the TC units is – for our purpose – a matrix-matrix multiplication with accumulation, $D = AB + C$, of $16 \times 16$ matrices of 16-bit half-precision floats. Each thread holds a part of the input and the output matrices in registers and computes a part of the matrix multiplication. The TC API comes with three main limitations: (a) matrix sizes must be a multiple of 16, (b) inputs and outputs can only be loaded from and stored to shared or global memory, not registers, (c) all 32 threads of the warp must execute the same code.
When evaluating an SRN, each layer computes $x_{out} = W x_{in} + b$, where $W$ is the weight matrix, $x_{in}$ the input state vector, $b$ the bias vector, and $x_{out}$ the output state. To use the TC units as described above and regarding constraint (a), however, $x_{in}$ must be a matrix with 16 columns. The first idea is to batch the evaluation so that the 32 threads per warp calculate 16 rays. Then, however, half of the threads are idle in operations like TF evaluation. Therefore, we map one thread to one ray, and block the matrix multiplication in the following way (exemplary for 48 channels per layer): The matrix $W$ is split into $3 \times 3$ blocks, and $x_{in}$ and $x_{out}$ are split into $3 \times 2$ blocks, each block of shape $16 \times 16$. The bias is broadcast from the bias vector $b$ by setting the column stride to zero. In total, a layer evaluation of $c$ channels with 32 rays requires $2 (c/16)^2$ invocations to the TC units.
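The blocking scheme can be emulated on the CPU to check the bookkeeping. The sketch below assumes 16 x 16 tiles, one ray per thread (32 columns), and a tile operation of the form D = A B + C with the bias used as the initial accumulator. It is plain Python, not CUDA, and all names are hypothetical.

```python
TILE = 16  # assumed edge length of one tensor-core tile

def tile_mma(A, B, C):
    """Emulates one tile operation: D = A @ B + C on TILE x TILE matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(TILE)) + C[i][j]
             for j in range(TILE)] for i in range(TILE)]

def get_tile(M, bi, bj):
    """Extract the (bi, bj) tile of a matrix stored as a list of rows."""
    return [row[bj * TILE:(bj + 1) * TILE] for row in M[bi * TILE:(bi + 1) * TILE]]

def layer_blocked(W, X, b):
    """Compute W @ X + b tile by tile and count the tile operations."""
    c, rays = len(W), len(X[0])
    out = [[0.0] * rays for _ in range(c)]
    calls = 0
    for bi in range(c // TILE):
        for bj in range(rays // TILE):
            # accumulator starts as the bias, broadcast over all columns
            acc = [[b[bi * TILE + i]] * TILE for i in range(TILE)]
            for bk in range(c // TILE):
                acc = tile_mma(get_tile(W, bi, bk), get_tile(X, bk, bj), acc)
                calls += 1
            for i in range(TILE):
                out[bi * TILE + i][bj * TILE:(bj + 1) * TILE] = acc[i]
    return out, calls

c, rays = 48, 32  # 48 channels, one ray per thread of a warp
W = [[float((7 * i + 3 * j) % 5 - 2) for j in range(c)] for i in range(c)]
X = [[float((k + 2 * r) % 3 - 1) for r in range(rays)] for k in range(c)]
b = [float(i % 4) for i in range(c)]
out, calls = layer_blocked(W, X, b)
reference = [[sum(W[i][k] * X[k][r] for k in range(c)) + b[i] for r in range(rays)]
             for i in range(c)]
```

For 48 channels the loop performs 3 x 2 x 3 tile operations, and the blocked result matches the unblocked reference for these integer-valued test matrices.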
Constraint (b) indicates that the weights and biases of the hidden layers, as well as the layer outputs, must fit into shared memory for optimal performance. In contrast, Müller et al. [MRNK21] reload the weights from global memory in every layer evaluation per warp. As an example, consider a network with layers, each with channels. Then the weights and biases require bytes and bytes, respectively, for a total of bytes that have to be stored once. Additionally, each thread stores the layer outputs, leading to bytes per warp. Therefore, with the limitation of shared memory, warps can fit into memory. More warps – up to the hardware limit of 32 – are advantageous, as they allow pipeline latency to be hidden by switching between the warps per multiprocessor. Note that the first (last) layer has to be handled separately, as the input (output) dimension differs. This results in the maximal network configurations given in Tab. 1.
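The shared-memory budget can be estimated with simple arithmetic. The sketch below is a simplification under explicit assumptions: all layers are treated as square (the special first and last layers are ignored), every value takes two bytes, each thread keeps one state vector, and nothing else occupies shared memory. The function name and the example configuration are hypothetical and are not the values from the paper.

```python
BYTES_PER_HALF = 2          # 16-bit half-precision value
THREADS_PER_WARP = 32
SHARED_BYTES = 48 * 1024    # freely accessible shared memory per multiprocessor
MAX_WARPS = 32              # hardware limit mentioned in the text

def shared_memory_budget(num_layers, channels):
    """Return (weight bytes, bias bytes, bytes per warp, warps that fit)."""
    weight_bytes = BYTES_PER_HALF * num_layers * channels * channels
    bias_bytes = BYTES_PER_HALF * num_layers * channels
    per_warp = BYTES_PER_HALF * THREADS_PER_WARP * channels  # one state per thread
    free = SHARED_BYTES - weight_bytes - bias_bytes
    warps = max(0, min(MAX_WARPS, free // per_warp))
    return weight_bytes, bias_bytes, per_warp, warps

# Hypothetical configuration: 4 layers with 32 channels each.
budget = shared_memory_budget(4, 32)
```

The estimate shows the trade-off directly: the weight storage grows quadratically with the channel count, so wide networks leave room for only a few warps or none at all.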
When using V-SRN with a network configuration that is small enough to enable interactive volume rendering, we observe a significant drop in the networks’ prediction skills. The reason lies in the loss of expressive power of the network when relying solely on the few network weights to encode the volume. To circumvent this limitation, we borrow an idea proposed by Takikawa et al. [TLY21] for representing an implicit surface that is encoded as a signed distance function via an SRN. The proposed architecture employs a sparse voxel octree, which stores latent vectors at the nodes instead of distance values. Each octree node stores a trainable -dimensional vector that is interpolated across space and passed as additional input to the SRN. Since the SRN learns to predict a single surface, an adaptive voxel octree with a finer resolution near the surface is used. We adopt this approach of a volumetric latent space but use a dense 3D grid instead of a sparse voxel octree. In particular, since in direct volume rendering it is desirable to change the TF mapping of density values to colors after training, refining adaptively toward a single surface is not suitable.
Let $G$ be a regular 3D grid with $C_G$ channels, i.e., $C_G$ parameters per grid vertex, and a resolution of $R$ vertices along each axis. In the interior of each cell, the values are tri-linearly interpolated to obtain a continuous field. When evaluating the SRN at position $p$, the grid is interpolated at $p$ and the resulting latent vector is passed as additional input – alongside the Fourier features – to the network. The contents of $G$ are trained jointly with the network weights and biases.
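The latent-grid lookup amounts to a trilinear interpolation of feature vectors. The sketch below assumes a vertex-centered grid over the unit cube, indexed as `grid[z][y][x]`, with at least two vertices per axis; this convention and the function name are assumptions for illustration, and hardware texture units may use a different (cell-centered) addressing.

```python
def trilinear(grid, R, p):
    """Interpolate a grid of feature vectors (R^3 vertices) at p in [0, 1]^3."""
    coords = []
    for v in p:
        t = min(max(v, 0.0), 1.0) * (R - 1)  # continuous vertex coordinate
        i0 = min(int(t), R - 2)              # lower vertex index of the cell
        coords.append((i0, t - i0))
    (x0, fx), (y0, fy), (z0, fz) = coords
    num_features = len(grid[0][0][0])
    out = [0.0] * num_features
    for dz, wz in ((0, 1.0 - fz), (1, fz)):
        for dy, wy in ((0, 1.0 - fy), (1, fy)):
            for dx, wx in ((0, 1.0 - fx), (1, fx)):
                w = wx * wy * wz
                vertex = grid[z0 + dz][y0 + dy][x0 + dx]
                for c in range(num_features):
                    out[c] += w * vertex[c]
    return out

# Toy grid with two features per vertex that vary linearly in space.
R = 4
grid = [[[[x / 3.0, (y + z) / 3.0] for x in range(R)] for y in range(R)]
        for z in range(R)]
latent = trilinear(grid, R, (0.5, 0.25, 0.75))
```

The interpolated vector would then be concatenated with the positional (Fourier) inputs before the first network layer.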
With this approach, we can keep the network small enough to enable fast inference, down to networks of only two layers à 32 channels, while maintaining the reconstruction quality of V-SRN. For the evaluation of the network and grid configurations, we refer to Sec. 5.2. We found that the best compromise between speed, quality, and compression rate is achieved with a network of four layers à 32 channels and a latent grid of resolution and features. This configuration is used in Fig. 1(d) and as default in the ablation studies below. The basic network architecture is illustrated in Fig. 2. Compared to neurcomp, the proposed network leads to a speedup of up to for training and for rendering. The latent grid is stored in four 3D CUDA textures with four channels each, so that hardware-supported trilinear texture interpolation can be exploited.
By using a latent grid, most of the parameters are stored in the grid instead of the network. For the configuration described above, the grid requires MB of memory (four bytes per voxel and channel), whereas the network consumes only around kB of memory. Note that in all specifications of the memory consumption of fV-SRN given in this paper, the memory consumed by the latent grid and the network weights is included. We further avoid storing the latent grid in a float-texture, by using a CUDA feature that enables the use of 8-bit integers per entry that are then linearly mapped to $[0, 1]$ in hardware. Thus, we compute the minimal and maximal grid value for each channel, and use these values to first map the grid values to $[0, 1]$ and then uniformly discretize them into 8-bit values. This reduces the memory footprint of the latent grid representation to a quarter of the size, while reducing the rendering times only slightly by roughly 5 due to reduced memory bandwidth. At the same time, the quality of the rendered images is slightly decreased by a factor of up to 2 of the reference SSIM and LPIPS statistics. Visually, however, the discretization does not introduce any perceptual differences, and is used in all of our experiments.
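The per-channel discretization can be sketched as a min/max normalization followed by rounding to 8 bits. The function names and the handling of constant channels below are assumptions for illustration; the decoding step mimics the linear mapping back from the unit interval.

```python
def quantize_channel(values):
    """Map one channel to [0, 1] via its min/max, then to integers in 0..255."""
    lo, hi = min(values), max(values)
    span = hi - lo if hi > lo else 1.0  # guard against constant channels
    q = [round((v - lo) / span * 255) for v in values]
    return q, lo, hi

def dequantize_channel(q, lo, hi):
    """Inverse mapping: 8-bit integer -> [0, 1] -> original value range."""
    return [lo + (k / 255.0) * (hi - lo) for k in q]

channel = [-1.5, 0.0, 0.25, 2.5]
q, lo, hi = quantize_channel(channel)
restored = dequantize_channel(q, lo, hi)
max_error = max(abs(a - b) for a, b in zip(channel, restored))
```

The reconstruction error per entry is bounded by half a quantization step, i.e. (max - min) / 510 for each channel, which is consistent with the small quality loss reported in the text.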
To select the network architecture with the best reconstruction quality from the possible configurations within the hardware limitations, we trained different networks on three different datasets (see Fig. 3): The ScalarFlow dataset [EUT19]—a smoke plume simulation with 500 timesteps, the Ejecta dataset—a supernova simulation with 100 timesteps, and the RM dataset—a Richtmyer-Meshkov simulation with timesteps. All datasets are given on Cartesian voxel grids, and they are internally represented with 8 Bits per voxel. All timings are obtained on a system running Windows 10, an Intel Xeon CPU with 3.60GHz, and an NVIDIA GeForce RTX 2070.
Unless otherwise noted, we analyze the capabilities of fV-SRN using world-space training on position-density encodings. The networks are trained on randomly sampled positions, with a batch size of
positions over 200 epochs, anloss function on the predicted outputs, and the Adam optimizer with a learning rate of . We use a modified Snake [LHU20] activation function with enhanced overall slope, i.e.,
which results in slight improvements of the reconstruction quality, see supplementary material. After training, the networks are evaluated by rendering images of resolution from different views of the objects. The quality of the rendered images is measured using the image statistics SSIM [WBSS04] and LPIPS [ZIE18] using renderings of the initial volumes as references. For training from rendered images, we refer to the supplementary material.
First, we compare the performance of the proposed TC implementation to a native PyTorch implementation of the same architecture. Performance measures include the time to access the latent grid and to evaluate fV-SRN with the positional information augmented by Fourier features. Fig. 4 shows the timings for rather lightweight networks as well as the largest possible networks within the TC hardware constraints.
As can be seen, the largest speedup of () over a 32-bit (16-bit) PyTorch implementation is achieved for a medium-sized network of 6 layers and 48 channels. For very small networks of 2 or 4 layers with 32 channels, the speedup goes down to (). For larger networks, e.g. two layers à 128 channels, the network evaluation becomes computation bound and the reduction of memory access operations as achieved by our solution becomes less significant. However, also in these cases, a speedup of () against 32-bit (16-bit) PyTorch can still be achieved. We notice, however, that fast renderings with 5-10 FPS are only possible with small networks.
Next, we investigate the effect of the volumetric latent grid on reconstruction quality. For Ejecta with many fine-scale details, Fig. 5 shows quantitative results for different resolutions of the latent grid and different network configurations. As one can see, the reconstruction quality drastically increases with increasing latent grid resolution. At finer grids, i.e. and higher, the choice of the network has a rather limited effect on the overall reconstruction quality. The differences between networks of four and six layers are not noticeable. In these cases, a small network of only four layers à 32 hidden channels is sufficient to achieve good reconstruction quality. Only when using small grids – or no latent grid at all – can larger networks improve the overall quality.
To confirm that the quality improvement is not solely due to the volumetric latent grid while the SRN is superfluous, we compare the rendered images to images that were rendered from a low-pass filtered density grid with the same memory consumption as the latent grid (row "off" in Fig. 5). For example, for a latent grid of resolution and features, the original volume is first low-pass filtered and then down-sampled to a grid resolution of . The width of the low-pass filter is selected according to the sub-sampling frequency. As one can see from Fig. 5, and evidenced by the qualitative assessment in Fig. 6, by using a latent grid in combination with the SRN even small-scale structures are maintained. In the low-resolution density grid, many of these structures are lost.
Notably, since the latent grid can be trained very efficiently and takes the burden from the SRN to train a huge number of parameters, the training times of fV-SRN are up to a factor of faster than those of V-SRN with the same total number of parameters, see Fig. 1. Especially because SRNs overfit to a certain dataset and training has to be repeated for each new dataset, we believe that this reduction of the training times is mandatory to make SRNs applicable.
[Figure: renderings of the Ejecta dataset. Panels: reference; l32x2-G8C8; l32x2-G32C16; l0x0-G8C8; l0x0-G32C16.]
In the following, we shed light on the effects of Fourier features on the overall reconstruction quality of networks that were trained in world-space for density prediction. We compare the construction of Fourier features according to Mildenhall et al. [MST20], denoted “NeRF”, and Tancik et al. [TSM20] with standard deviation $\sigma$ as hyperparameter. In addition, we evaluate the reconstruction quality when Fourier features are not used. Networks were trained for multiple values of $\sigma$ and three different numbers of Fourier features. The results can be found in Fig. 7. They demonstrate the general improvements due to the use of Fourier features, and furthermore indicate the superiority of “NeRF” over the random Fourier features by Tancik et al. in combination with our network design.
Next, we shed light on the reconstruction quality for networks that predict densities that are then mapped to colors via a user-defined TF (the approach we have followed so far), and networks that directly predict colors at a certain domain location. In the latter case, the network encodes colors dependent on positions, and the loss function considers the differences between the encoded colors and the colors that are obtained by post-shading at the interpolated input positions.
Density prediction allows the TF to be changed after a sample has been reconstructed, without retraining. When predicting colors, however, the network needs to be re-trained whenever the TF is changed. Hence, this approach seems to be less useful in practice, yet it is interesting to analyze how well a network can adapt its learning skills to those regions emphasized by a TF mapping. Possibly, the network can learn to spend its capacities on those regions that are actually visible after the TF has been applied, which may result in improved reconstruction quality.
We trained four instances of fV-SRN: One that predicts densities and three networks that predict colors that have been generated via three different TFs on ScalarFlow. Reference images for the three TFs are shown in Fig. 8, combined with the achieved reconstruction quality. For the first two TFs, there are almost no differences in image quality between density and color prediction. For the third TF with two narrow peaks, however, color prediction performs considerably worse, even though the network needs to learn significantly fewer positions at which a non-transparent color is assigned. We hypothesize that especially narrow peaks in the TF make the prediction difficult. In such cases, the absorption changes rapidly over a short interval, so that the network training on uniformly distributed locations cannot adequately learn these high frequencies.
To force the network to consider more positional samples in regions where the TF mapping generates a color, we propose the following adaptive resampling scheme: After each -th epoch ( in our experiments), we evaluate the prediction error over a coarse voxel grid of resolution . Per voxel, positions are sampled and evaluated as an approximation of the average prediction error per voxel. is then used to sample new training data for the next epochs, where the number of sampled positions is made proportional to the values in . This allows the training process to focus on regions with high prediction error and samples the volumetric field in more detail in these regions.
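The resampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code used for the experiments: `error_fn` is a hypothetical callable that returns the prediction error at a position in the unit cube, and `res`, `samples_per_voxel`, and `num_samples` are placeholder parameters. The fallback to uniform sampling when all errors are zero is also an assumption.

```python
import random


def random_point_in_voxel(voxel, res, rng):
    """Uniformly random position inside one voxel of a res^3 grid on [0,1]^3."""
    return tuple((c + rng.random()) / res for c in voxel)


def estimate_voxel_errors(error_fn, res, samples_per_voxel, rng):
    """Approximate the average prediction error per voxel by random sampling."""
    errors = {}
    for i in range(res):
        for j in range(res):
            for k in range(res):
                voxel = (i, j, k)
                total = 0.0
                for _ in range(samples_per_voxel):
                    total += error_fn(random_point_in_voxel(voxel, res, rng))
                errors[voxel] = total / samples_per_voxel
    return errors


def resample_positions(errors, res, num_samples, rng):
    """Draw training positions; a voxel is picked proportionally to its error."""
    voxels = list(errors.keys())
    weights = [errors[v] for v in voxels]
    if sum(weights) <= 0.0:
        # Assumption: fall back to uniform sampling if no error is measured.
        weights = [1.0] * len(voxels)
    chosen = rng.choices(voxels, weights=weights, k=num_samples)
    return [random_point_in_voxel(v, res, rng) for v in chosen]
```

In a training loop, `estimate_voxel_errors` would be called at the chosen epoch interval and the positions returned by `resample_positions` would replace the uniformly sampled training positions for the following epochs.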
By using the proposed adaptive sampling scheme, e.g., the LPIPS score for TF 3 for the color-predicting network is improved from to . Even so, we do not believe that the quality of color prediction can match the quality of density prediction when rather sharp TFs are used. Thus, and also due to the restriction of color prediction to a specific TF, we consider this option to be useful only when a color volume is given initially.
In the following, we shed light on the use of fV-SRN to learn a mapping that not only predicts a scalar field value at a given position but also the gradient and even higher order derivatives at that position. The gradient is important in volume rendering to apply gradient-magnitude-based opacity and color selection via TF mappings [Lev88], and, since the gradient at a certain position is the normal vector of the isosurface passing through this position, to illuminate the point, e.g., via Phong lighting.
In particular, we evaluate different strategies to estimate gradients in network-based scalar field reconstruction: Using finite differences by calling the network multiple times (FD), using the adjoint method (Adjoint), and training fV-SRN to predict the gradients alongside the density. As we will show, the latter improves the rendering performance by over finite differences and over the adjoint method, while reducing the quality only slightly, see Fig. 9. All three methods are implemented using the proposed TC kernel (Sec. 4.1).
The common method to compute gradients during volume ray-casting is to use FD, more concretely central differences, between trilinearly interpolated scalar values with a step size of one voxel size [Lev88]. Compared to computing analytical derivatives of the trilinear interpolant, the use of central differences avoids discontinuities at the voxel borders. Since the use of FD (with fV-SRN) introduces a bias if the same method (w/o network) is used as reference, FD with fV-SRN leads to the best prediction in general, see Fig. 9b. This method, however, introduces a large computation overhead as seven network evaluations are required to compute the density and gradients. Therefore, in previous work the adjoint method (Adjoint) was proposed as an alternative [DNJ20, LJLB21]. Adjoint uses backpropagation through the trained reconstruction network to predict the change of the scalar value depending on changes of the position, see Fig. 9c.
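The cost of the FD variant follows directly from the central-difference stencil, as the sketch below illustrates. Here `f` is a hypothetical stand-in for one evaluation of the reconstruction network at a 3D position, and `h` is the step size, for which the text above uses one voxel size.

```python
def central_difference_gradient(f, p, h):
    """Return (value, gradient) of f at p using 1 + 6 evaluations of f."""
    value = f(p)  # one evaluation for the density itself
    grad = []
    for axis in range(3):
        plus = list(p)
        minus = list(p)
        plus[axis] += h
        minus[axis] -= h
        # two further evaluations per axis, six in total
        grad.append((f(tuple(plus)) - f(tuple(minus))) / (2.0 * h))
    return value, tuple(grad)
```

Each call evaluates `f` seven times, once for the density and twice per axis for the gradient, which matches the seven network evaluations mentioned above.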
The fastest prediction is achieved by extending the output of fV-SRN to predict scalar values and gradients, see Fig. 9d. However, as the network now needs to predict four outputs – density plus gradient – instead of one within the same network weights and latent grid size, the quality is slightly reduced. For an extended comparison on further datasets, including implicit functions with analytical gradients, and a detailed study on how to design the loss function to include the gradients, we refer to the supplementary material.
If higher-order derivatives are required, e.g., for TFs incorporating curvature measures [KWTM03], FD and Adjoint become increasingly intractable. For finite differences, Kindlmann et al. [KWTM03] propose a stencil with a support of samples. This would require network evaluations per sample along the ray. Similarly, the adjoint method requires an additional adjoint pass per row in the Hessian matrix. As an outlook for future research, we show that the SRNs can be trained to jointly predict densities, gradients, and also curvature estimates as a multi-valued output. First results using the shading proposed by Kindlmann et al. [KWTM03] on isosurface renderings are given in Fig. 10.
In the following, the quality and performance of fV-SRN is compared to neurcomp [LJLB21], TThresh [BRLP19], and cudaCompress [TRAW12, TBR12]. We compare to TThresh because of the extreme compression rates it can achieve, and to cudaCompress because of its decoding efficiency. The publicly available implementations of TThresh (running on the CPU) and cudaCompress (running on the GPU) are used. For the comparison, we chose Jet with Phong shading, as introduced in Sec. 6, and Ejecta at a resolution of , see Fig. 1. To achieve a given compression ratio, fV-SRN changes the latent grid resolution, neurcomp adapts the number of hidden channels, TThresh modifies the bitplane cutoff, and cudaCompress adapts the stepsize for quantizing discrete wavelet coefficients. Further results on additional datasets are given in the supplementary material.
For a quantitative evaluation, compression ratios of the four methods are plotted against a) the peak CPU and GPU memory required for decoding, b,c) the time it requires to reconstruct random locations as well as the resulting PSNR, d,e) the time it requires to render an image of resolution with 2 samples per voxel on average as well as the resulting SSIM statistics, see Fig. 12. All timings are measured on an Intel Xeon CPU with 8 cores at 3.60 GHz, equipped with an NVIDIA GeForce RTX 2070 GPU.
Regarding PSNR and SSIM, fV-SRN, neurcomp and TThresh are almost on par. For high compression ratios, the network-based approaches slightly outperform TThresh, while the opposite is true at low compression ratios. However, both TThresh and cudaCompress require additional temporary memory, as they need to decode the volume before rendering. For TThresh and cudaCompress, respectively, the temporarily required memory can grow up to GB and GB. neurcomp requires temporary memory to store the hidden states during network evaluation, computed here for evaluating rays in parallel. As shown in Sec. 4.1, fV-SRN runs completely in shared memory and requires no additional temporary memory for evaluation, besides storing the latent-space representation including network weights and latent grid – for sampling and rendering.
Treib et al. [TRAW12, TBR12] propose bricked decompression and rendering in combination with cudaCompress. In our case, we use a brick size of . This drastically reduces the memory requirements from GB to around MB, while increasing the rendering time by roughly for the Ejecta dataset. Note that this bricked rendering is only possible for the regular access pattern during rendering. For random access, the whole volume still needs to be decompressed. In a similar fashion, we applied a bricked TThresh, where each brick is compressed independently. For Ejecta, this also reduces the memory requirement from GB to around GB, but also drastically reduces the achieved compression ratio.
A qualitative comparison of the errors introduced by all compression schemes is given in Fig. 12. cudaCompress quantizes the values, which introduces large errors with narrow TFs. TThresh introduces slight grid artifacts, and both fV-SRN and neurcomp blur the dataset at higher compression ratios.
We now analyze the extension of fV-SRNs to interpolate between different instances in time of a scalar field. The interpolation should smoothly transition between the instances to create plausible intermediate fields, and topological changes should be handled. The proposed approach is inspired by previous works by Park et al. [PFS19] and Chen and Zhang [CZ19], where latent vectors representing different objects are interpolated to morph one object into another one in a feature-preserving manner. To achieve the aforementioned goals, we extend the volumetric latent spaces, see Sec. 5.2, to include the time domain.
Let be the indices of the timesteps that are available in the dataset. To save memory, the volumetric latent space is provided only at certain timesteps that we call keyframes. Let be the timestep indices of the keyframes and the volumetric latent space is then indexed as . For timesteps that are between two keyframes, the volumetric latent space is linearly interpolated in time and passed to the network. During training, timesteps from are used.
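The interpolation between keyframes can be sketched as follows. For brevity, the volumetric latent space at a keyframe is flattened into a plain list of floats, and `keyframes` is a hypothetical dictionary that maps a keyframe timestep to its latent values. Clamping to the first or last keyframe for timesteps outside the keyframe range is an assumption of this sketch.

```python
def interpolate_latent(keyframes, t):
    """Linearly interpolate the latent values of the two keyframes enclosing t."""
    times = sorted(keyframes)
    # Assumption: clamp to the nearest keyframe outside the covered range.
    if t <= times[0]:
        return list(keyframes[times[0]])
    if t >= times[-1]:
        return list(keyframes[times[-1]])
    for t0, t1 in zip(times, times[1:]):
        if t0 <= t <= t1:
            a = (t - t0) / (t1 - t0)  # interpolation weight in [0, 1]
            return [(1.0 - a) * x0 + a * x1
                    for x0, x1 in zip(keyframes[t0], keyframes[t1])]
```

The interpolated latent values are then passed to the network together with the position, so only the keyframe latent spaces need to be stored.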
In addition to the time-dependent latent space, we evaluate four options to encode the time dimension in the network, so that plausible interpolation is achieved: no extra input (“latent only”); time as an additional scalar input (“direct”); time modulated by Fourier features based on Mildenhall et al. with , see Sec. 5.3 (“fourier”); time as scalar input and Fourier features (“both”). Quantitative results are given in Fig. 14 on the ScalarFlow dataset with a keyframe every 10th timestep for timesteps 30 to 100. For training, every 5th timestep (Fig. 14a) or every 2nd timestep (Fig. 14b) was used. For timesteps 60 to 70, Fig. 14 shows the qualitative results.
We found that “latent only” and “direct” lead to good generalization for in-between timesteps that were never seen during training, with no noticeable difference between both methods (Fig. 14 blue, Fig. 14b). Those two architectures lead to a semantically plausible interpolation, which becomes especially noticeable when compared against a baseline (Fig. 14 green, Fig. 14d) where the original grid is used at the keyframes and then linearly interpolated in time.
The options including Fourier features in the time domain (“fourier” and “both”), however, show chaotic behavior for in-between timesteps (Fig. 14 yellow). As opposed to Fourier features in the spatial domain, where all fractional positions could have been observed due to the random sampling of the positions, in the time domain only a discrete subset of timesteps is seen. Therefore, during generalization, the Fourier encoding produces value ranges for the network that were never seen before.
Let us emphasize that neurcomp by Lu et al. [LJLB21] also supports super-resolution in the time domain, by sending the time domain directly as input to the network, see Fig. 14 purple. Neurcomp allows an accurate prediction of the timesteps from the training datasets, using the same compression ratio as fV-SRN, but fails to generalize to in-between timesteps. This can also be clearly seen in the qualitative comparison in Fig. 14c. We hypothesize that the time-interpolated latent grid acts as a regularizer in that regard. The importance of a time-varying latent grid is also supported by the following test, Fig. 14 red: using the time encoding “direct”, but with only a single keyframe for the grid, leads to inferior results.
In total, fV-SRN allows for an efficient and plausible interpolation in time. The training time when including the time domain, however, increases drastically. Training a network on every 5th timestep requires around 3:45h. Using every 2nd timestep instead of every 5th improves the quality of the interpolation (Fig. 14b versus a), but the training time increases accordingly to almost hours.
We have analyzed SRNs for compression domain volume rendering, and introduced fV-SRN as a novel extension to achieve significantly accelerated reconstruction performance. Accelerated training as well as the adaptation of fV-SRN to facilitate temporal super-resolution have been proposed. As key findings we see that
by using custom evaluation kernels and a latent grid, SRNs have the potential to be used in interactive volume rendering applications,
fV-SRN is an alternative to existing volume compression schemes at comparable quality and significantly improved decoding speed, or similar performance but significantly higher compression ratios,
SRNs using latent space interpolation can preserve features that are lost using traditional interpolation and enable temporal super-resolution at arbitrary temporal resolution.
In the context of volume rendering, it will be important to investigate the capabilities of SRNs to learn mappings that consider a view-dependent level-of-detail (LoD). In particular, the network might be able to infer more than just values of a low-pass filtered signal, but infer values as they would be perceived when looking at the data through a pixel and perform area-weighted super-sampling. Such a view-dependent learning of LoDs can avoid missing details which are smoothed out using the classical low-pass filtering approach.
We see further potential in SRNs for scientific data visualization due to their ability to randomly access samples from the compressed feature representation. Due to this property, we see a promising application in the context of flow visualization. By using SRNs to encode position-velocity relationships, particle tracing or streamline tracing, with its sparse and highly irregular data access patterns, can work on a compactly encoded vector field representation.
As another interesting use case for fV-SRN we see ensemble visualization. In particular, we intend to investigate whether the idea of multiple latent grids introduced for time-dependent fields can be used to represent similar and dissimilar parts in each ensemble member. An interesting experiment will be to generate Mean-SRNs, which are trained using position-density encodings corresponding to different datasets. This may also give rise to alternative ensemble compression schemes, where differences to a reference are encoded. Furthermore, we plan to investigate whether time and ensemble information can be decoupled in the latent grid. This could eventually make it possible to retrain ensemble features for a novel ensemble member, and to vary the temporal features to predict the temporal evolution. Finally, we note that including the time domain vastly increases the training time. Thus, similar to the adaptive spatial sampling presented in this work, we plan to investigate adaptive (re-)sampling strategies in time, to focus only on those timesteps that exhibit the largest prediction errors.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 5939–5948.
ScalarFlow: A large-scale volumetric data set of real-world scalar transport flows for computer animation and machine learning. ACM Trans. Graph. 38, 6 (Nov. 2019).
The unreasonable effectiveness of deep features as a perceptual metric. In CVPR (2018).