## 1 Related Work

#### Differentiable Rendering

A number of differentiable renderers have been introduced for estimating scene parameters or even geometry from reference images, for example, under the assumption of local illumination and smooth shading variations

[loper2014opendr, petersen2019pix2vex, rhodin2015versatile, kato2018neural], or via edge sampling to cope with discontinuities at visibility boundaries [li2018differentiable]. Scattering parameters of homogeneous volumes have been estimated from observed scattering pattern [gkioulekas2013inverse]. Recently, Nimier-David et al. proposed Mitsuba 2 [Mitsuba2], a fully-differentiable physically-based Monte-Carlo renderer. Mitsuba 2 also handles volumetric objects, yet it requires storing intermediate results during the ray sampling process at each sampling point. This quickly exceeds the available memory and makes the approach unfeasible for direct volume rendering applications. Later, the authors have shown how to avoid storing the intermediate results [NimierDavid2020Radiative], by restricting the parameters that can be derived to, e.g., only shading and emission. However, these methods are tailored for path tracing with multiple scattering and rely on Monte-Carlo integration with delta tracking. This makes them prone to noise and leads to long computation times compared to classical DVR methods without scattering. Our method, in contrast, does not require storing intermediate results and can, thus, use large volumes with arbitrary many sampling steps without resorting to a restricted parameter set. Furthermore, it does not impose restrictions on the parameters of the volume rendering process that can be differentiated.#### Parameter Optimization for Volume Visualization

An interesting problem in volume visualization is the automatic optimization of visualization parameters like the viewpoint, the TF, or the sampling stepsize that is required to convey the relevant information in the most efficient way. This requires at first hand an image-based loss function that can be used to steer the optimizer toward an optimal parameter setting. To measure a viewpoint’s quality from a rendered image, loss functions based on image entropy [ji2006dynamic, vazquez2008representative, tao2009structure, chen2010information]

[tao2016similarity, yang2019deep] have been used. For volume visualization, the relationships between image entropy and voxel significance [bordoloi2005view] as well as importance measures of specific features like isosurfaces [takahashi2005feature] have been considered. None of these methods, however, considers the rendering process in the optimization process. Instead, views are first generated from many viewpoints, e.g., by sampling via the Fibonacci sphere algorithm [marques2013fibonacci], and then the best view regarding the used loss function is determined. We envision that by considering the volume rendering process in the optimization, more accurate and faster reconstructions can be achieved.Another challenging problem is the automatic selection of a “meaningful” TF for a given dataset, as the features to be displayed depend highly on the user expectation. Early works attempted to find a good TF using clusters in histograms of statistical image properties [haidacher2010volume] or fitting visibility histograms [correa2010visibility]. Others have focused on guiding an explorative user interaction [zhou2013transfer, maciejewski2012abstracting], also by using neural networks [berger2018generative]. For optimizing a TF based on information measures, Ruiz et al. [ruiz2011automatic] proposed to bin voxels of similar density and match their visibility distribution from multiple viewpoints with a target distribution defined by local volume features. For optimization, the authors employ a gradient-based method where the visibility derivatives for each density bin are approximated via local linearization.

Concerning the performance of direct volume rendering, it is crucial to determine the minimum number of data samples that are required to accurately represent the volume. In prior works, strategies for optimal sampling in screen-space have been proposed, for instance, based on perceptual models [Bolin98-perceptuallybasedsampling], image saliency maps [painter1989antialiased], entropy-based measures [xu2005adaptive], temporal history [martschinke2019AdaptivePathTracing], or using neural networks [weiss2019isosuperres, weiss2020adaptivesampling]. Other approaches adaptively change the sampling stepsize along the view rays to reduce the number of samples in regions that do not contribute much to the image [Danskin92-fastvolren, kratz2011adaptive, Campagnolo2015-rayadapt, morrical2019spaceskipping]

. DiffDVR’s capability to compute gradients with respect to the stepsize gives rise to a gradient-based adaptation using image-based loss functions instead of gradient-free optimizations or heuristics.

#### Neural Rendering

As an alternative to classical rendering techniques that are adapted to make them differentiable, several works have proposed to replace the whole rendering process with a neural network. For a general overview of neural rendering approaches let us refer to the recent summary article by Tewari et al. [tewari2020state]. For example, RenderNet proposed by Nguyen-Phuoc et al. [nguyen2018rendernet] replaces the mesh rasterizer with a combination of convolutional and fully connected networks. In visualization, the works by Berger et al. [berger2018generative] and He et al. [he2019insitunet] fall into the same line of research. The former trained a network on rendered images and parameters of the rendering process, and use the network to predict new renderings by using only the camera and TF parameters. The latter let a network learn the relationships between the input parameters of a simulation and the rendered output, and then used this network to skip the rendering process and create images just from given input parameters.

## 2 Background

In the following, we review the fundamentals underlying DVR using an optical emission-absorption model [max1995optical]. Then we briefly summarise the foundation of Automatic Differentiation (AD), a method to systematically compute derivatives for arbitrary code [BARTHOLOMEWBIGGS2000171].

### 2.1 Direct Volume Rendering Integral

Let be the scalar volume of densities and let be an arc-length parameterized ray through the volume. Let be the absorption and the self-emission due to a given density. Then, the light intensity reaching the eye is given by

(1) |

were the exponential term is the transparency of the line segment from , the eye, to , the far plane, and is the emission. The transparency is one if the medium between and does not absorb any light and approaches zero for complete absorption.

We assume that the density volume is given at the vertices of a rectangular grid,
and the density values are obtained via
*trilinear*interpolation.
The functions and define the mapping from density to absorption and emission.
We assume that both functions are discretized into regularly spaced control points with linear interpolation in between. This is realized on the GPU as a 1D texture map with hardware-supported linear interpolation.

For arbitrary mappings of the density to absorption and emission, the volume rendering integral in Equation 1 cannot be solved analytically. Instead, it is approximated by discretizing the ray into segments over which the absorption and emission are assumed constant. We make use of the Beer-Lambert model , where is the sampled volume density, to approximate a segment’s transparency. This leads to a Riemann sum which can be computed in front-to-back order using iterative application of alpha-blending, i.e., , and .

### 2.2 Automatic Differentiation

The evaluation of any program for a fixed input can be expressed as a computation graph, a directed acyclic graph where the nodes are the operations and the edges the intermediate values. Such a computation graph can be reformulated as a linear sequence of operations, also called a Wengert list [wengert1964simple, BARTHOLOMEWBIGGS2000171],

(2) | ||||

where the ’s are the external parameters of the operations of size and the ’s refer to the state of intermediate results after the i-th operation of size . The output has size . Note here that in DiffDVR, and are usually large, i.e., is in the order of the number of pixels and in the order of the number of sampling points along the view rays. The output is a scalar (), computed, for example, as the average per-pixel loss over the image. The goal is then to compute the derivatives

. The basic idea is to split these derivatives into simpler terms for each operation using the chain rule. For example, assuming univariate functions and

the only parameter of interest, the chain rule yields(3) |

There are two fundamentally different approaches to automatically evaluate the chain rule, which depend on the order of evaluations. If the product in the above example is evaluated left-to-right, the derivatives are propagated from bottom to top in Equation 2. This gives rise to the adjoint- or backward-mode differentiation (see subsection 3.3). If the product is evaluated right-to-left, the derivatives “ripple downward” from top to bottom in Equation 2. This corresponds to the so-called forward-mode differentiation (see subsection 3.2).

## 3 AD for Direct Volume Rendering

Now we introduce the principal procedure when using AD for DiffDVR and hint at the task-dependent differences when applied for viewpoint optimization (subsection 4.1), TF reconstruction (subsection 4.2) and volume reconstruction (subsection 4.3 and subsection 4.4). We further discuss computational aspects and memory requirements of AD in volume rendering applications and introduce the specific modifications to make DiffDVR feasible.

### 3.1 The Direct Volume Rendering Algorithm

In direct volume rendering, the pixel color represents the accumulated attenuated emissions at the sampling points along the view rays. In the model of the Wengert list (see Equation 2), a function is computed for each sample. Hence, the number of operations is proportional to the overall number of samples along the rays. The intermediate results are rgb images of the rendered object up to the -th sample, i.e., is of size , where and , respectively, are the width and height of the screen. The last operation in the optimization process is the evaluation of a scalar-valued loss function. Thus, the size of the output variable is . The parameters depend on the use case. For instance, in viewpoint optimization, the optimization is for the longitude and latitude of the camera position, i.e., . When reconstructing a TF, the optimization is for the rgb entries of the TF, i.e., .

The DVR algorithm with interpolation, TF mapping, and front-to-back blending is shown in Algorithm 1. For clarity, the variables in the algorithm are named by their function, instead of using and as in the Wengert list (Equation 2). In the Wengert list model, the step size , the camera intrinsics , the TF , and the volume density are the parameters . The other intermediate variables are represented by the states . Each function operates on a single ray but is executed in parallel over all pixels.

When Algorithm 1 is executed, the operations form the computational graph. AD considers this graph to compute the derivatives of with respect to the parameters , so that the changes that should be applied to the parameters to optimize the loss function can be computed automatically. Our implementation allows for computing derivatives with respect to all parameters, yet due to space limitations, we restrict the discussion to the computation of derivatives of with respect to the camera , the TF and the volume densities . In the following, we discuss the concrete implementations of forward and adjoint differentiation to compute these derivatives.

### 3.2 Forward Differentiation

On the elementary level, the functions in Algorithm 1 can be expressed as a sequence of scalar arithmetic operations like .
In forward-mode differentiation [neidinger2010introduction, BARTHOLOMEWBIGGS2000171],
every variable is replaced by the associated *forward variable*

(4) |

i.e., tuples of the original value and the derivative with respect to the parameter that is optimized. Each function is replaced by the respective forward function

(5) |

Constant variables are initialized with zero, , and parameters for which to trace the derivatives are initialized with one, . If derivatives for multiple parameters should be computed, the tuple of forward variables is extended.

Forward differentiation uses a custom templated datatype for the forward variable and operator overloading. Each variable is wrapped in an instance of this datatype, called c++—fvar—, which stores the derivatives with respect to up to parameters along with their current values. [backgroundcolor=gray!5!white,roundcorner=5pt]c++ template¡typename T, int p¿ struct fvar T value; T derivatives[p]; ; Next, operator overloads are provided for all common arithmetic operations and their gradients. For example, multiplication is implemented similar to: [backgroundcolor=gray!5!white,roundcorner=5pt]c++ template¡typename T, int p¿ fvar¡T, p¿ operator*(fvar¡T, p¿ a, fvar¡T, p¿ b) fvar¡T, P¿ c; //to store c = a*b and derivatives c.value = a.value * b.value; for (int i=0; i¡p; ++i) //partial derivatives c.derivative[i] = a.value*b.derivative[i] + b.value*a.derivative[i]; return c; The user has to write the functions in such a way that arbitrary input types are possible, i.e., regular floats or instances of c++—fvar—, via C++ templates. All intermediate variables are declared with type c++—auto—. This allows the compiler to use normal arithmetic if no gradients are propagated, but when forward variables with gradients are passed as input, the corresponding operator overloads are chosen.

As an example (see Figure 1 for a schematics), let us assume that derivatives should be computed with respect to a single entry in a 1D texture-based TF, e.g., the red channel of the first texel . When loading the TF from memory, is replaced by , i.e., it is wrapped in an instance of c++—fvar— with the derivative for that parameter set to . Algorithm 1 executes in the normal way until is encountered in the code for the TF lookup. Now, the operator overloading mechanism selects the forward function instead of the normal non-differentiated function. The result is not a regular color , but the forward variable of the color . All following functions (i.e., the blend and loss function) continue to propagate the derivatives. In contrast, if derivatives should be computed with respect to the camera, already the first operation requires tracing the derivatives with c++—fvar—.

It is worth noting that in the above example only the derivative of one single texel in the TF is computed. This process needs to be repeated for each texel, respectively each color component of each texel, by extending the array c++—fvar::derivatives— to store the required number of parameters. Notably, for input data that is high dimensional, like TFs or a 3D volumetric field, forward differentiation becomes unfeasible. For viewpoint selection, on the other hand, where only two parameters are optimized, forward differentiation can be performed efficiently.

The computational complexity of the forward method scales linearly with the number of parameters , as they have to be propagated through every operation. However, as every forward variable directly stores the derivative of that variable w.r.t. the parameters, gradients for an arbitrary number of outputs can be directly realized. Furthermore, the memory requirement is proportional to , as only the current state needs to be stored.

### 3.3 Adjoint Differentiation

Adjoint differentiation [McNamara.2004], also called the adjoint method, backward or reverse mode differentiation, or backpropagation, evaluates the chain rule in the inverse order than forward differentiation.
For each variable , the associated *adjoint variable*

(6) |

stores the derivative of the final output with respect to the current variable.
Tracing the derivatives starts by setting . Then, the adjoint variables are tracked backward through the algorithm, called the *backward pass*. This is equivalent to evaluating the chain rule Equation 3 from left to right, instead of right to left as in the forward method. Let be again our model function, then the adjoint variables are computed from as

(7) |

This process is repeated from the last operation to the first operation, giving rise to the *adjoint code*.
At the end, one arrives again at the derivatives with respect to the parameters .
If a parameter is used multiple times, either along the ray or over multiple rays, the adjoint variables are summed up. The reverted evaluation of the DVR algorithm with the gradient propagation from Equation 7 is sketched in Algorithm 2.
A schematic visualization is shown in Figure 2.

Because the adjoint method requires reversing the order of operation, simple operator overloading as in the forward method is no longer applicable. Common implementations of the adjoint method like TensorFlow

[abadi2016tensorflow] or PyTorch [NEURIPS2019_9015] record the operations in a computation graph, which is then traversed backward in the backward pass. As it is too costly to record every single arithmetic operation, high-level functions like the evaluation of a single layer in neural networks are treated as atomic, and only these are recorded. Within such a high-level function, the order of operations is known and the adjoint code using Equation 7 is manually derived and implemented. We follow the same idea and treat the rendering algorithm as one unit and manually derive the adjoint code.### 3.4 The Inversion Trick

One of the major limitations of the adjoint method is its memory consumption because the input values for the gradient computations need to be stored. For example, the blending operation (line 9 in Algorithm 1) is defined as follows: Let be the opacity and rgb-emission at the current sample, i.e., the components of , and let be the accumulated opacity and emission up to the current sample, i.e., the components of in Algorithm 1. Then, the next opacity and emission is given by front-to-back blending

(8) | ||||

In the following adjoint code with as input it can be seen that the derivatives again require the input values.

(9) | ||||

Therefore, the algorithm is first executed in its non-adjoint form, and the intermediate colors are stored with the computation graph. This is called the *forward pass*.
During the backward pass, when the order of operations is reversed and the derivatives are propagated (the adjoint code), the intermediate values are reused.
In DVR, intermediate values need to be stored at every step through the volume. Thus, the memory requirement scales linearly with the number of steps and quickly exceeds the available memory.
To overcome this limitation, we propose a method that avoids storing the intermediate colors after each step and, thus, has a constant memory requirement.

We exploit that the blending step is invertible (see Figure 5): If are given and the current sample is recomputed to obtain and , can be reconstructed as

(10) | ||||

With Equation 10 and , the adjoint pass can be computed with constant memory by re-evaluating the current sample and reconstructing instead of storing the intermediate results. Thus, only the output color used in the loss function needs to be stored, while all intermediate values are recomputed on-the-fly. Note that is not possible in practice, since it requires the absorption stored in the TF to be at infinity.

In the implementation, and indicated by the circled in Figure 2, the adjoint variables for the parameters are first accumulated per ray into local registers (camera, stepsize, volume densities) or shared memory (TF). Then, the variables are accumulated over all rays using global atomic functions. This happens once all rays have been traversed (camera, stepsize, transfer function) or on exit of the current cell (volume densities).

Because the adjoint variables carry only the derivatives of the output, but not of the parameters, the computational complexity is largely constant in the number of parameters. For example, in TF optimization (subsection 4.2) only the derivative of the currently accessed texel is computed when accessed in the adjoint code of TF sampling. This is significantly different from the forward method, where the derivatives of all TF entries need to be propagated in every step. On the other hand, the adjoint method considers only a single scalar output in each backward pass, requiring multiple passes to support multi-component outputs. This analysis and the following example applications show that the forward method is preferable when optimizing for a low number of parameters like the camera position, while for applications such as TF optimization, which require the optimization of many parameters, the adjoint method has clear performance advantages.

DiffDVR is implemented as a custom CUDA operation in PyTorch [NEURIPS2019_9015]. The various components of the DVR algorithm, like the parameter to differentiate or the type of TF, are selected via C++ template parameters. This eliminates runtime conditionals in the computation kernel. To avoid pre-compiling all possible combinations, the requested configuration is compiled on demand via CUDA’s JIT-compiler NVRTC [cuda-nvrtc] and cached between runs. This differs from, e.g., the Enoki library [Enoki] used by the Mitsuba renderer [Mitsuba2], which directly generates Parallel Thread Code (PTX) for translation into GPU binary code.

## 4 Applications

In the following, we apply both AD modes for best viewpoint selection, TF reconstruction, and volume reconstruction. The results are analyzed both qualitatively and quantitatively. Timings are performed on a system running Windows 10 and CUDA 11.1 with an Intel Xeon 8x@3.60Ghz CPU, 64GB RAM, and an NVIDIA RTX 2070.

### 4.1 Best Viewpoint Selection

We assume that the camera is placed on a sphere enclosing the volume and faces toward the object center. The camera is parameterized by longitude and latitude. AD is used to optimize the camera parameters to determine the viewpoint that maximized the selected cost function. As cost function, we adopt the differentiable opacity entropy proposed by Ji et al. [ji2006dynamic].

Let be the output image. We employ array notation, i.e., indicates color channel (red, green, blue, alpha) at pixel

. The entropy of a vector

is defined as(11) |

Then the opacity entropy is defined as where indicates the linearization of the alpha channel, and the color information is unused.

In a first experiment, the best viewpoint is computed for a CT scan of a human tooth of resolution . Eight optimizations are started in parallel with initial views from viewpoints at a longitude of and a latitude of . In all cases, 20 iterations using gradient descent are performed. The viewpoints selected by the optimizer are shown as paths over the sphere in Figure 9a. The values of the cost function over the course of optimization are given in Figure 9b. It can be seen that the eight optimization runs converge to three distinct local minima. The best run converges to approximately the same entropy as obtained when the best view from 256 uniformly sampled views over the enclosing sphere is taken. Differentiable Direct Volume Renderinga shows intermediate views and the view from the optimized viewpoint. Further results on other datasets, i.e, a jetstream simulation (), the potential field of a C60 molecule (), and a smoke plume (), confirm the dependency of the optimization process on the initial view (see Figure 13).

Both the adjoint and the forward method compute exactly the same gradients, except for rounding errors. As seen in Figure 9c, a single forward/backward pass in the adjoint method requires about ms/ms, respectively, giving a total of ms.
For the forward method, we compare two alternatives. First, *forward-immediate* directly evaluates the forward variables during the forward pass in PyTorch and stores these variables for reuse in the backward pass. In *forward-delayed*, the evaluation of gradients is delayed until the backward pass, requiring to re-trace the volume.
With ms, *forward-immediate* is slightly faster than the adjoint method, while *forward-delayed* is around slower due to the re-trace operation.

### 4.2 Transfer Function Reconstruction

Our second use case is TF reconstruction. Reference images of a volume are first rendered from multiple views using a target TF. Given the same volume and an arbitrary initial TF, AD is then used to optimize the TF so that the rendered images match the references. The target TF comprises of 256 rgb entries, the target TF with entries is initialized with random Gaussian noise. The density volume is rendered to a

viewport from eight camera positions that are uniformly distributed around the volume (the view direction always pointing toward the volume’s center).

Let be the TF with entries containing the rgb color and absorption, and let be the rendered image of resolution , are the reference images. In our case, and varies. We employ an loss on the images and a smoothing prior on the TF, i.e.,

(12) | ||||

The Adam optimizer [kingma2014adam] is used with a learning rate of for epochs. The use of to control the strength of the smoothing prior is demonstrated in Figure 14 for a human head CT scan as test dataset using . If is too small, the reconstructed TF contains high frequencies and introduces subtle color shifts over the whole image. If the smoothing prior is too large, small details are lost. We found that a value of around leads to the best results, visually and using the Learned Perceptual Image Patch Similarity metric (LPIPS) [zhang2018perceptual]

, and is thus used in our experiments. We chose the LPIPS metric as we found that it can accurately distinguish the perceptually best results when the peak-signal-to-noise ratio (PSNR) and the structural similarity index (SSIM)

[wang2004image] result in similar scores. The initialization of the reconstruction and the final result for a human head dataset are shown in Differentiable Direct Volume Renderingb.Next, we analyze the impact of the TF resolution on reconstruction quality and performance (see Figure 15).
For TF reconstruction, the backward AD mode significantly outperforms the forward mode.
Because of the large number of parameters, especially when increasing the resolution of the TF, the derivatives of many parameters have to be traced in every operation when using the forward AD mode. Furthermore, the forward variables may no longer fit into registers and overflow into global memory. This introduces a large memory overhead that leads to a performance decrease that is even worse than the expected linear decrease.
A naïve implementation of the adjoint method that directly accumulates the gradients for the TF in global memory using atomics is over slower than the non-adjoint forward pass (*adjoint-immediate*). This is because of the large number of memory accesses and write conflicts.
Therefore, we employ delayed accumulation (*adjoint-delayed*). The gradients for the TF are first accumulated in shared memory. Then, after all threads have finished their assigned rays, the gradients are reduced in parallel and then accumulated into global memory using atomics. As seen in Figure 15, this is the fastest of the presented methods. The whole optimization for epochs requires around 5 minutes including I/O. However, as only 48kB of shared memory are freely available per multiprocessor, the maximal resolution of the TF is 96 texels. If a higher resolution is required, *adjoint-immediate* must be employed.
At smaller values of the reconstruction quality is decreased (see Figure 15). We found that a resolution of leads to the best compromise between reconstruction performance and computation time.

initial | final | reference | difference (10x) | |

Thorax |
[inner sep=0pt] at (0,0) ; [inner sep=0pt,fill=white] at (0,-0.35) ; | [inner sep=0pt] at (0,0) ; [inner sep=0pt,fill=white] at (0,-0.35) ; | [inner sep=0pt] at (0,0) ; [inner sep=0pt,fill=white] at (0,-0.35) ; | |

Plume |
[inner sep=0pt] at (0,0) ; [inner sep=0pt,fill=white] at (0,-0.4) ; | [inner sep=0pt] at (0,0) ; [inner sep=0pt,fill=white] at (0,-0.4) ; | [inner sep=0pt] at (0,0) ; [inner sep=0pt,fill=white] at (0,-0.4) ; |

). The used hyperparameters are the same as for the skull dataset. From left to right: Start configurations for the optimizer, optimized results, ground truths, pixel differences (scaled by a factor of

for better perception) between ground truths and optimized results.To evaluate the capabilities of TF reconstruction to generalize to new datasets with the same hyperparameters as described above, we run the optimization on two new datasets, a CT scan of a human thorax and a smoke plume, both of resolution . As one can see in Figure 16, the renderings with the reconstructed TF closely match the reference, demonstrating stability of the optimization for other datasets.

We envision that TF optimization with respect to losses in screen space can be used to generate “good” TFs for a dataset for which no TF is available. While a lot of research has been conducted on measuring the image quality for viewpoint selection, quality metrics specialized for TFs are still an open question to the best of our knowledge. In future work, a first approach would be to take renderings of other datasets with a TF designed by experts and transform the “style” of that rendering to a new dataset via the style loss by Gatys et al. [gatys2016image].

### 4.3 Density Reconstruction

In the following, we shed light on the use of DiffDVR for reconstructing a 3D density field from images of this field. For pure absorption models, the problem reduces to a linear optimization problem. This allows for comparisons with specialized methods, such as filtered backpropagation or algebraic reconstruction [GORDON1970471, dudgeon1984multidimensional, herman2009fundamentals]. We compare DiffDVR to the CUDA implementation of the SIRT algebraic reconstruction algorithm [gregor2008sirt] provided by the ASTRA-toolbox [vanAarle2015astra, vanAarle2016astra]. Furthermore, we compare the results to those computed by Mitsuba 2 [NimierDavid2020Radiative, Mitsuba2], a general differentiable path tracer. Density reconstruction uses 64 uniformly sampled views on a surrounding sphere. Each image is rendered at a resolution of . The reconstructed volume has a resolution of . ASTRA and Mitsuba are used with their default optimization settings. DiffDVR performs a stepsize of voxels during reconstruction. The Adam optimizer with a batch size of 8 images and a learning rate of is used. To speed up convergence, we start with a volume of resolution and double the resolution in each dimension after iterations. At the highest resolution, the optimization is performed for iterations. The same loss function as for TF reconstruction (see Equation 12) is used, except that the smoothing prior is computed on the reconstructed volume densities in 3D, with .

Three experiments with datasets exhibiting different characteristics are carried out. The results are shown in Figure 17. As one can see, DiffDVR consistently outperforms algebraic reconstruction via ASTRA and density reconstruction via the Mitsuba framework. In particular, Mitsuba suffers from noise in the volume due to the use of stochastic path tracing. Only for the rendering of the thorax dataset, Mitsuba shows a slightly better SSIM score than DiffDVR. For the plume dataset, intermediate results of the optimization process until convergence are shown in Differentiable Direct Volume Renderingc.

Note that all compared algorithms serve different purposes. Algebraic reconstruction methods (ASTRA) are specialized for absorption-only optical models and support only such models. Mitsuba is tailored to Monte Carlo path tracing with volumetric scattering, an inherently computational expensive task. DifffDVR is specialized for direct volume rendering with an emission-absorption model and a TF, yet emissions and a TF were disabled in the current experiments. These differences clearly reflect in the reconstruction times. For instance, for reconstructing the human skull dataset, ASTRA requires only 53 seconds, DiffDVR requires around 12 minutes, and Mitsuba runs for multiple hours.

### 4.4 Color Reconstruction

Next, we consider an optical emission-absorption model with a TF that maps densities to colors and opacities, as it is commonly used in DVR. To the best of our knowledge, we are the first to support such a model in tomographic reconstruction.

For TFs that are not a monotonic ramp, as in the absorption-only case, density optimization becomes a non-convex problem. Therefore, the optimization can be guided into local minima by a poor initialization. We illustrate this problem in a simple 1D example. A single unknown density value of a 1D “voxel” – a line segment with two values and

at the end points and linear interpolation in between – should be optimized. A single Gaussian function with zero mean and variance

is used as TF, and the ground truth value for is . For varying , Figure 19 shows the -loss between the color obtained from and the ground truth, and the corresponding gradients. As can be seen, for initial values of the gradient points away from the true solution. Thus, the optimization “gets stuck” at the other side of the Gaussian, never reaching the target density of . This issue worsens in 2D and 3D, as the optimizer needs to reconstruct a globally consistent density field considering many local constraints. This failure case is also shown in Figure 18b, where the tooth dataset cannot be reconstructed faithfully due to the initialization with a poorly matching initial field.To overcome this shortcoming, it is crucial to start the optimization with an initial guess that is not “too far” from the ground truth in the high-dimensional parameter space. We account for this by proposing the following optimization pipeline: First, a pre-shaded color volume of resolution (Figure 18c) is reconstructed from images using the same multi-resolution optimization as in the case of an absorption-only model. The color volume stores the rgb-emission and scalar absorption per voxel, instead of a scalar density value that is mapped to color via a TF. By using this color volume, trapping into local minima with non-monotonic TFs can be avoid. Intermediate results of the optimization process until convergence are shown in Differentiable Direct Volume Renderingd for the tooth dataset. Then, density values that match the reconstructed colors after applying the TF are estimated. For each voxel, random values are sampled, converted to color via the TF, and the best match is chosen. To avoid inconsistencies between neighboring voxels, an additional loss term penalises differences to neighbors. Let be the target color from the color volume and the sampled density with mapped color , then the cost function is

(13) |

Here, and are weights, and loops over the 6-neighborhood of the current voxel. The logarithm accounts for the vastly different scales of the absorption, similar to an inverse of the transparency integral Equation 1. In the example, we set to normalize for the maximal absorption in the color volume, and . This process is repeated until the changes between subsequent iterations fall below a certain threshold, or a prescribed number of iterations have been performed.

Finally, the estimated density volume is used as initialization for the optimization of the density volume from the rendered images (Figure 18d). We employ the same loss as before with a smoothing prior of . The total runtime for a volume is roughly 50 minutes. Even though the proposed initialization overcomes to a certain extent the problem of non-convexity and yields reasonable results, Figure 18 indicates that some fine details are lost and spurious noise remains. We attribute this to remaining ambiguities in the sampling of densities from colors that still lead to suboptimal minima in the reconstruction. This also shows in the slice view of Figure 18d, especially for the thorax dataset. Here, some areas that are fully transparent due to the TF are arbitrarily mapped to a density value of zero, while the reference has a density around 0.5 – between the peaks of the TF – in these areas.

## 5 Conclusion

In this work, we have introduced a framework for differentiable direct volume rendering (DiffDVR), and we have demonstrated its use in a number of different tasks related to data visualization. We have shown that differentiability of the direct volume rendering process with respect to the viewpoint position, the TF, and the volume densities is feasible, and can be performed at reasonable memory requirements and surprisingly good performance.

Our results indicate the potential of the proposed framework to automatically determine optimal parameter combinations regarding different loss functions. This makes DiffDVR in particular interesting in combination with neural networks. Such networks might be used as loss functions – providing blackboxes, which steer DiffDVR to an optimal output for training purposes, e.g., to synthesize volume-rendered imagery for transfer learning tasks. Furthermore, derivatives with respect to the volume from rendered images promise the application to scene representation networks trained in screen space instead of from points in object space. We see this as one of the most interesting future works, spawning future research towards the development of techniques that can convert large data to a compact representation -– a code -– that can be permanently stored and accessed by a network-based visualization. Besides neural networks, we imagine possible applications in the development of lossy compression algorithms, e.g. via wavelets, where the compression rate is not determined by losses in world space, but by the quality of rendered images. The question we will address in the future is how to generate such (visualization-)task-dependent codes that can be intertwined with differentiable renderers.

## Acknowledgments

The authors wish to thank Jakob Wenzel and Merlin Nimier-David for their help and valuable suggestions on the Mitsuba 2 framework.