Fast Differentiable Raycasting for Neural Rendering using Sphere-based Representations

04/16/2020, by Christoph Lassner et al., Facebook

Differentiable rendering in combination with deep learning promises great advantages: deep learning models can produce realistic scenes rapidly, while differentiable rendering offers consistent scene representations and the respective gradients. However, gradient-based optimization of classical mesh representations is cumbersome because of the explicit topology encoding. Moreover, complex scenes may need a detailed geometric representation, requiring many geometric primitives and a fast rendering operation. We propose to break up the rendering process into multiple parts: (1) the scene representation, (2) a differentiable geometry projection and (3) neural shading. While mature, off-the-shelf models for scene representation and neural shading are widely available, we propose pulsar as a general-purpose differentiable geometry engine tightly integrated with PyTorch. By replacing mesh representations with sphere clouds for the scene representation, the operation is fast compared to existing differentiable renderers and avoids problems with surface topology. It provides gradients for the full scene parameterization, i.e., sphere positions, colors, radii, opacities and the camera parameters. pulsar executes many times, up to orders of magnitude, faster than existing renderers and allows real-time rendering and optimization of scenes with millions of spheres. It can be used for 3D reconstruction, rendering and volumetric scene optimization.


1. Introduction

In this paper, we present pulsar, a fast, sphere-based, differentiable geometry engine for neural rendering. Standard neural rendering breaks up the rendering process roughly into two major parts: (1) a projection from 3D data to a consistent 2D representation (the projection step) and (2) processing the 2D data using a statistical model, usually a neural network, to produce the rendered image (the neural shading step). This strategy combines the strengths of classical rendering and neural networks: through the projection step, a consistent geometric representation of the scene is generated. The neural shading step can produce realistic images through the use of the latest generative neural networks that can approximate natural image formation phenomena without the need to explicitly model them.

Ideally, such a rendering system can be used and trained end-to-end: a 3D representation of the scene is generated and sent through the first and second steps. The resulting image can then be compared to ground truth data and be used for an optimization process, not only to improve the generative model in the second part of the system but also the representation of the scene and the parameters of the image formation process. This process must be scalable to process the geometry of detailed scenes and fast enough for optimization on a high number of images.

We aim to fulfill these requirements with pulsar through a variety of measures, from design choices regarding the scene representation down to implementation details and optimized CUDA code. First, we choose an entirely sphere-based representation for the 3D data. This makes it easy to handle point cloud data from 3D sensors directly, allows for the optimization of the scene representation without the problems of changing topology (as would exist for meshes) and is efficient for rendering. We deliberately keep lighting separate from the geometry projection step, as it can easily be handled in a separate step. The sphere-based representation even eliminates the need for acceleration structures such as k-d trees. Additionally, it leads to a well-defined, simple render and blending function that can be differentiated without approximation. We implement the rendering process in CUDA and make use of acceleration techniques to leverage modern GPU architectures. Lastly, we integrate this code with the PyTorch (Paszke et al., 2019) optimization framework to make use of auto-differentiation and ease the integration with deep learning models.

The strategy described above allows the pulsar engine to render scenes with millions of spheres on consumer GPUs, fast enough to optimize their representations on large datasets. One million spheres can be rendered and optimized in real-time settings (we measure less than 33 ms for the forward and less than 11 ms for the backward pass, of which less than 19 ms and 4.7 ms, respectively, are spent in CUDA code; see Tab. 1). The framework supports a pinhole and an orthogonal camera model and computes gradients for camera parameters as well as for scene representations. We have successfully used it for 3D reconstruction, volumetric optimization and viewpoint synthesis.

Our main contributions are the following:

  • We introduce a fast, general-purpose, sphere-based, differentiable geometry engine that is tightly integrated in PyTorch. It is trivial to use with existing models as an additional torch.nn.Module in the autodiff framework, enabling end-to-end training of deep models with geometry and projection components (see the sketch after this list).

  • In benchmarks, we observe up to multiple orders of magnitude speedups compared with existing renderers and geometry engines. In this paper, we describe in detail how we are able to achieve these speedups.

  • The increased performance enables the exploration of high-density sphere representations in real-world scenarios: image sizes that correspond to current image sensor resolutions and scenes with millions of spheres. In this way, scenes with a significant amount of detail can be modelled and realistically reproduced. We present results comparing classical 3D reconstruction examples with existing renderers and geometry engines as well as results for a high-resolution, realistic human head model.

  • With neural rendering in mind, we generalize the ‘number of channels’ in our projection implementation, allowing the projection of spheres with a latent representation of the shading parameters.
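The following sketch illustrates this integration. The class and argument names (NeuralRenderingPipeline, projection, shader and the tensor arguments) are hypothetical placeholders rather than the actual pulsar API; the point is that the projection step participates in the autodiff graph like any other module.

```python
import torch

# Hypothetical sketch (names are placeholders, not the actual pulsar API):
# the differentiable projection step behaves like any other torch.nn.Module,
# so scene parameters, camera parameters and the shading network can be
# optimized end-to-end through a single autodiff graph.
class NeuralRenderingPipeline(torch.nn.Module):
    def __init__(self, projection, shader):
        super().__init__()
        self.projection = projection   # differentiable sphere projection
        self.shader = shader           # e.g., a CNN mapping feature maps to RGB

    def forward(self, positions, features, radii, opacities, cam_params):
        # Project spheres carrying (possibly latent) per-sphere features to a
        # 2D feature map; assumed output layout: (H, W, C).
        feature_map = self.projection(positions, features, radii, opacities, cam_params)
        # Neural shading turns the projected features into the final image.
        return self.shader(feature_map.permute(2, 0, 1).unsqueeze(0))
```

An optimizer can then update sphere positions, radii, features and opacities, the camera parameters and the shader weights jointly from, e.g., a photometric loss.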

2. Related Work

method | number of points | number of faces | avg. forward time in ms | avg. backward time in ms
Soft Rasterizer | | | |
DSS | | n.a. | |
PyTorch3D (mesh) | | | |
PyTorch3D (points) / SynSin | | n.a. | |
pulsar | | n.a. | |
pulsar (CUDA only) | | n.a. | |
Soft Rasterizer | | | |
DSS | | n.a. | |
PyTorch3D (mesh) | | | |
PyTorch3D (points) / SynSin | | n.a. | |
pulsar | | n.a. | |
pulsar (CUDA only) | | n.a. | |
Table 1. Runtime performance comparison for generic, open-source, differentiable renderers with PyTorch integration. For pulsar, we measure the performance using the full Python interface (as for the other renderers) as well as the runtime of the CUDA kernel alone. The concurrent work PyTorch3D (points) / SynSin uses a fixed point size for all points, and its runtime does not scale well for larger point sizes. The pulsar runtime is largely agnostic to sphere size and scales favorably with the number of spheres: for 1 million spheres, we measure execution times of less than 33 ms (19 ms in CUDA) forward and 11 ms (4.7 ms in CUDA) backward. All times are measured on an NVIDIA RTX 2080 GPU.
method objective position update depth update normal update occlusion silhouette change topology change
OpenDR mesh via position change
NMR mesh via position change
Paparazzi mesh limited limited via position change
Soft Rasterizer mesh via position change
Pix2Vex mesh via position change
Tensorflow Graphics mesh via position change
PyTorch3D mesh / points via position change
DSS points
pulsar spheres via extra channels
Table 2. Feature comparison of generic differentiable renderers (compare to (Yifan et al., 2019), Tab. 1). DSS and PyTorch3D are the only other renderers that do not require a mesh-based geometry representation, facilitating topology changes. In contrast to DSS, pulsar uses 3D spheres but without normals. Normals can be optimized by using the generic channel model in pulsar: extra channels can be used to capture the normal information.

We make the rendering process differentiable to optimize the scene representation and the rendering process itself. In this way, this approach can be understood as a subfield of inverse graphics, which has been a part of computer vision research since its early days (Baumgart, 1974). Inverse graphics has many applications and we refer to the related work in (Loper and Black, 2014) for a good overview.

Differentiable rendering and neural rendering are more recent ideas that gained popularity with the advent of deep neural networks. Deep neural networks stack many differentiable ‘modules’ to represent a function and optimize its parameters. Additional ‘modules’ that model a projection and are differentiable can naturally be combined with existing models. Since our proposed module is meant to be used with neural rendering, we focus our analysis of related work on other differentiable rendering methods and neural rendering approaches that work together with a differentiable renderer (for an excellent overview of neural rendering in general, we refer to (Tewari et al., 2020)).

Differentiable Rendering

(Loper and Black, 2014) propose a differentiable renderer to render meshes, including lighting and textures. It is built on top of OpenGL and uses local Taylor expansions and filter operations to find gradients, excluding depth. This has the advantage that it can leverage existing OpenGL infrastructure for the rendering process, but has the disadvantage that it introduces approximations and has a large overhead in running filtering and approximation operations. It is the first paper to our knowledge that proposes to differentiate the renderer instead of differentiating the image formation process for a specific object. The Neural Mesh Renderer (NMR) (Kato et al., 2018) renders meshes using a custom function to address object boundaries. Paparazzi (Liu et al., 2017) is another mesh renderer that is implemented using image filters. (Petersen et al., 2019) introduce Pix2Vex, which is a mesh renderer that uses soft blending of triangles. In the same spirit, (Liu et al., 2019b) introduce a probabilistic map of mesh triangles. They use a soft buffering to obtain a differentiable representation. Their rendering function strongly inspired our own formulation. Tensorflow Graphics (Valentin et al., 2019) is a differentiable rendering package for Tensorflow (Abadi et al., 2015) with support for mesh geometry. Similarly, PyTorch3D (Ravi et al., 2020) is a differentiable rendering package for PyTorch and initially focused on mesh rendering. A recent extension makes point-based rendering available and has been used for creating SynSin (Wiles et al., 2019) (discussed in the following section).

Physics-based Rendering

Several renderers aim to be close to the underlying physical processes. (Li et al., 2018b; Azinovic et al., 2019) implement differentiable ray tracers to be able to get gradients for physics-based rendering effects. These implementations explicitly model the image formation process in much greater detail and are at the same time significantly slower in execution. Similarly, the Mitsuba 2 renderer (Nimier-David et al., 2019) and Difftaichi (Hu et al., 2019) are physics-based differentiable renderers with slower execution times but a lot more physical details in the rendering process.

Scene Representations

In contrast to explicitly differentiable graphics engines, neural rendering can also be implemented solely through deep learning models. This is, for example, attempted in (Kulkarni et al., 2015). Similarly, RenderNet (Nguyen-Phuoc et al., 2018) is a CNN architecture with a projection unit. (Tulsiani et al., 2018) use a layered depth image representation and develop a differentiable renderer for optimizing this representation. Instead of prescribing a fixed input size or discretizing the underlying 3D scene space, (Sitzmann et al., 2019) and (Mildenhall et al., 2020) represent the scene implicitly in the network structure and use variations of ray-casting to reconstruct images from arbitrary cameras.

Signed Distance Functions

SDFs are another popular scene representation. In particular, they can be well parameterized and expressed through neural networks (Park et al., 2019). (Liu et al., 2019a) optimize a signed distance function through sphere samples. Similarly, (Saito et al., 2019) model humans through an implicit function. (Zeng et al., 2020) optimize a similar function through sampled sphere positions using a differentiable rendering function. (Jiang et al., 2019) implement differentiable rendering directly for SDFs, including lighting.

Sphere-based representations

(Insafutdinov and Dosovitskiy, 2018) propose to work with a differentiable point cloud representation. They train a CNN to predict the shape and pose of an object in 3D and use an orthogonal projection together with ray termination probabilities to obtain a differentiable representation. In contrast to our method, theirs is strongly limited in resolution (they use a low image resolution and a limited number of points in their paper); this is too low for our intended scenarios. (Yifan et al., 2019) use a point-based representation with a position and normal parameterization. Each point is rendered as a small 'surface' with position, radius and normal. In the paper, they do not show applications in combination with neural rendering. In the examples shown in their paper, they use representations with a few thousand spheres and report runtimes that are orders of magnitude slower than our approach (cf. Tab. 1). (Lin et al., 2018) define a renderer for point cloud generation, but only provide gradients for depth values. (Roveri et al., 2018) define a point-based differentiable projection module for neural networks that produces 'soft' depth images. (Aliev et al., 2019) propose to model room point clouds with a deferred neural rendering step. The concurrent work on SynSin (Wiles et al., 2019) and the PyTorch3D point renderer follow a very similar approach to ours. They use a slightly different blending function and a fixed point size defined in screen space, whereas in our model every sphere can have a different, optimizable radius in world space. Furthermore, they use only the first few points per pixel to determine the pixel colors. We have found this to lead to high-frequency artifacts in complex scenes; we instead allow an unlimited number of spheres to contribute to the pixel color (or set a bound based on a minimum contribution) and use only the first few spheres for gradient propagation.

3. Method

Modern GPU architectures offer a tremendous amount of processing power through a large number of streaming multiprocessors and threads, as well as enough memory to store scene representations and rendered images in GPU memory. For example, even an NVIDIA RTX 2080 Ti consumer GPU has 4352 CUDA cores on 68 streaming multiprocessors with access to up to 11 GB of memory. The CUDA cores/threads are grouped in warps of 32 threads. Multiple warps again can work together in groups. Warps have particularly fast local shared memory and operations; however, all threads in a warp must execute exactly the same command on potentially different data, or a part of them must sleep. For example, if half of the threads follow a different execution path than the rest due to an 'if' statement, the half not following the branch sleeps while the rest executes the relevant commands.

All these architectural peculiarities must be taken into account to use GPU hardware well. This requires making smart use of parallel code and finding good memory access patterns so as not to block execution through excessive IO load. Because these two aspects are tightly connected, non-intuitive solutions often turn out to be the most efficient, and experimentation is required to identify them.

We found a way to keep the computation throughput high by elegantly switching between parallelization schemes and by using finely tuned memory structures as ‘glue’ between the computations. In the following sections, we aim to discuss these steps, the underlying memory layout and the parallelization choices made.

3.1. The forward pass

For the forward pass, the renderer receives a set of spheres with a position, channel information, a radius and an opacity for each sphere (we use upper-case letters for user-provided values and lower-case letters for inferred ones). Additionally, the camera configuration must be provided. To simplify the forward and backward equations by dropping extrinsic transformations, we normalize the positions of the spheres w.r.t. the camera coordinate system using PyTorch autodiff functions before passing them to the low-level renderer. The remaining camera parameters are minimal, mainly defining the projection plane depending on the camera model. Using this information, the channel values for each pixel of the image need to be calculated in a differentiable way. Assuming we have a per-pixel rendering equation (described in Sec. 3.3), the first fundamental choice to make is whether to parallelize the rendering process over the pixels or the spheres.
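As a minimal sketch of these inputs and the camera-space normalization (tensor names and shapes are assumptions for illustration, not the renderer's actual interface):

```python
import torch

# Sketch of the forward-pass inputs: N spheres with positions (N, 3), channel
# information (N, C), radii (N,) and opacities (N,), plus camera extrinsics.
N, C = 100_000, 3
positions = torch.randn(N, 3, requires_grad=True)
channels  = torch.rand(N, C, requires_grad=True)
radii     = torch.full((N,), 0.01, requires_grad=True)
opacities = torch.ones(N, requires_grad=True)

R = torch.eye(3)          # camera rotation (world -> camera)
t = torch.zeros(3)        # camera translation

# Normalize sphere positions w.r.t. the camera coordinate system using regular
# autodiff ops, so the low-level renderer only needs the remaining (intrinsic)
# camera parameters such as the projection plane.
positions_cam = positions @ R.T + t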

Parallelizing over the spheres can be beneficial through the re-use of information to evaluate the rendering equation for pixels close to each other. However, this approach leads to memory access collisions for the results (write access to all pixel values must be protected by a mutex), which obliterates runtime performance. The second alternative is to parallelize rendering over the pixels. To make this strategy efficient, it is critical to find a good way to exploit spatial closeness between pixels during the search for relevant spheres. It is important to reduce the number of candidate spheres (spheres that could influence the color of a pixel) for each pixel as quickly as possible. This can be achieved by mapping spatial closeness in the image to 'closeness' on GPU hardware: thread groups can analyze spheres together and share information. Overall, a two-step process becomes apparent: (1) find out which spheres are relevant for a pixel (group), (2) draw the relevant spheres for each pixel. Both steps must be tightly interconnected so that memory accesses are reduced to a minimum.

By design of our scene parameterization the enclosing rectangle of the projection of each sphere is simple to find. But even in this simple case we would do the intersection part of the calculation repeatedly: every pixel (group) would calculate the enclosing rectangle for each sphere. This is why we separate the enclosing rectangle computation as step (0) into its own GPU kernel. Importantly, through this separation, we can parallelize step (0) over the spheres and use the full device resources.

3.1.1. Step 0: enclosing rectangle calculation

This step is parallelized over the spheres. It aims at determining the relevant region in image space and encoding the region and draw information in an efficient way for the following steps. The standard choice for such an encoding is a k-d tree, a bounding volume hierarchy (BVH) or a similar acceleration structure. We experimented with (extended) Morton codes (Morton, 1966; Vinkler et al., 2017) and the fast parallel BVH implementations (Karras, 2012; Karras and Aila, 2013) and found their performance inferior compared to the following strategy using bounding box projections. (We used our own implementation that closely follows Karras et al.'s papers but is likely slower than theirs. We also evaluated the patented tr-BVH implementation in the NVIDIA OptiX package (https://developer.nvidia.com/optix); however, OptiX does not provide access to the acceleration structure and only allows querying it. This is insufficient for our use case because we need to find an arbitrary number of closest spheres to the camera.)

Instead of using acceleration structures, the sphere geometry allows us to directly find the projection bounds of each sphere on the sensor plane. This requires only a few computations for the orthogonal as well as the pinhole projection model. In the pinhole model, the distortion effects make slightly more complex computations necessary; trigonometric functions can be avoided for higher numerical accuracy through the use of several trigonometric identities.
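The following sketch shows one simple, conservative way to obtain such bounds for the pinhole model by projecting the extremes of the sphere's axis-aligned bounding box; the function name and the exact bound are illustrative assumptions, not the tighter computation used in the renderer.

```python
def sphere_bounds_pinhole(x, y, z, r, f):
    """Conservative screen-space bounds of a sphere under a pinhole model.

    The sphere at camera-space center (x, y, z) with radius r lies inside the
    axis-aligned box [x-r, x+r] x [y-r, y+r] x [z-r, z+r]; projecting the box
    extremes therefore yields a bound that always encloses the silhouette.
    This is a simplified sketch; the renderer uses tighter bounds derived via
    trigonometric identities.
    """
    assert z - r > 0.0, "sphere must be fully in front of the camera"

    def extent(c):
        lo = f * min((c - r) / (z - r), (c - r) / (z + r))
        hi = f * max((c + r) / (z - r), (c + r) / (z + r))
        return lo, hi

    (x_min, x_max), (y_min, y_max) = extent(x), extent(y)
    return x_min, x_max, y_min, y_max
```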

Additional steps must be taken to make the calculated boundaries robust to numerical inaccuracies. We make the design choice to have every sphere rendered with at least one pixel: in this way, every sphere always receives gradients and no spheres are 'lost' between pixel rays. We store the results of these calculations in two data structures:

Intersection information:

This is a struct with four unsigned short values that contains the calculated x and y limits for each sphere. This data structure needs 8 bytes of memory, so a full warp of 32 threads loads a contiguous 256-byte block with a single coalesced command on NVIDIA Turing GPUs. This makes coalesced iteration fast, which helps to process large numbers of intersection data structures in parallel.

Draw information:

This is a struct with all the information needed to draw a sphere once an intersection has been detected. We store the position vector, up to three channel value floats or a float pointer (in the case of more than three channels), the distance to the sphere center and the sphere radius. This requires 32 bytes of storage per sphere. The importance of this step is to localize all required information and to convert a 'struct of arrays' (separate arrays with positions, radii and colors) into an 'array of structs' (one array of draw information structures) with the required information.

The computation and storage run in O(n) for n spheres. We additionally store the earliest possible intersection depth for each sphere in a separate array. For spheres that are outside of the sensor area, this value is set to infinity. Then, we use the CUB library (http://nvlabs.github.io/cub/) to sort the intersection information and draw information arrays by earliest possible intersection depth. This step adds the cost of sorting the n spheres. The sorting is important for the following steps: a sphere intersection search may be stopped early once it has been determined that no sphere with a greater distance can still have a notable impact.
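A PyTorch stand-in for this sort (the GPU implementation uses CUB; the definition of the earliest possible intersection depth as the center depth minus the radius is an assumption for illustration) could look as follows:

```python
import torch

def sort_by_earliest_depth(positions_cam, radii, on_screen):
    """Sort spheres by their earliest possible intersection depth.

    positions_cam: (N, 3) sphere centers in camera space (+z towards the scene)
    radii:         (N,) sphere radii
    on_screen:     (N,) boolean mask from the bounds computation
    """
    earliest = positions_cam[:, 2] - radii
    # Off-screen spheres are pushed to infinity so they sort last.
    earliest = torch.where(on_screen, earliest, torch.full_like(earliest, float("inf")))
    order = torch.argsort(earliest)          # ascending: closest spheres first
    return order, earliest[order]
```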

3.1.2. Step 1: intersection search

The aim of this step is to narrow down the number of spheres relevant for the pixels at hand as much and as quickly as possible, leveraging as much shared infrastructure as possible. That is why, in a first processing step, we divide the entire image into nine parts (an empirically derived value; during sorting, we also find the enclosing rectangle of all visible spheres and use this information for tighter bounds of the region to draw). The size of each of the nine parts is a multiple of the thread block launch size, which we determined empirically for current GPU architectures. All nine parts are processed sequentially. For each part, we first use the full GPU to iterate over the intersection information array to find spheres that are relevant for pixels in the region (we can again parallelize over the spheres). Using the CUB select_flags routine, we then quickly create arrays with the sorted, selected subset of intersection information and draw information data structures for all relevant spheres. From this point on, we parallelize over the pixels and use blocks and warps for coalesced processing of spheres.

The next level is the block-wise intersection search. We use a block size of 256 threads, i.e., eight warps per block. We observed that larger block sizes for this operation always improved performance, but we reached a limit of current hardware at a size of 256 due to the memory requirements. This indicates that the speed of the proposed algorithm will scale favorably with future GPUs.

We implement the intersection search through coalesced loading of the intersection information structures and testing them against the limits of the current pixel block. The sphere draw information for spheres with intersections is stored in a block-wide shared memory buffer with a fixed size. This size is a multiple of the block size so that it can always accommodate all sphere hits. Write access to this buffer needs to be properly synchronized. If the buffer becomes too full or the spheres are exhausted, Step 2 execution is invoked to clear it. In Step 2, each pixel thread works autonomously, and care must be taken to introduce appropriate synchronization boundaries to coordinate block and single-thread execution. Additionally, each pixel thread can vote whether it is 'done' with processing spheres because future spheres would not have enough impact; if all pixels in a block vote 'done', execution is terminated. The vote is implemented through a thread-warp-block stage-wise reduction operation.
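The control flow of this step, stripped of all GPU-specific details (coalesced loads, shared memory, votes and synchronization), can be sketched as follows; `draw_fn` stands in for the Step 2 draw operation and all names are illustrative:

```python
def block_intersection_search(block_bounds, isect_infos, draw_infos, buffer_size, draw_fn):
    """Sequential sketch of the block-wise intersection search.

    block_bounds: (x_min, x_max, y_min, y_max) of the pixel block
    isect_infos:  per-sphere (x_min, x_max, y_min, y_max) bounds, sorted by
                  earliest possible intersection depth
    draw_infos:   per-sphere draw structs in the same order
    draw_fn:      callback consuming a buffer of draw infos (Step 2); returns
                  True once every pixel in the block has voted 'done'
    """
    bx0, bx1, by0, by1 = block_bounds
    buffer = []
    for (sx0, sx1, sy0, sy1), draw in zip(isect_infos, draw_infos):
        if sx1 < bx0 or sx0 > bx1 or sy1 < by0 or sy0 > by1:
            continue                      # sphere cannot touch this pixel block
        buffer.append(draw)
        if len(buffer) == buffer_size:    # buffer full: run the draw operation
            if draw_fn(buffer):
                return                    # all pixels voted 'done'
            buffer.clear()
    if buffer:
        draw_fn(buffer)                   # flush the remaining spheres
```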

3.1.3. Step 2: the draw operation

The draw operation is executed for each pixel separately and for each sphere draw information structure that has been loaded into the shared memory buffer. Because every pixel is processed by its own thread, write conflicts for the channel information are avoided and each pixel thread can work through the list of loaded draw information at full speed. The intersection depths for each sphere are tracked: we use a small (in terms of the number of spheres to track; this number is fixed at compile time), optimized priority queue to track the IDs and intersection depths of the closest five spheres. Additionally, updating the denominator of the rendering equation allows us to continuously maintain a bound on the depth at which a sphere can still make at least a given percentage contribution to the color channels. If this threshold is set (default value: 1%), this allows for early termination of the raycasting process for each pixel.
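A scalar Python sketch of this per-pixel loop is shown below. The weight formula follows the structure described in Sec. 3.3 but is not claimed to be the renderer's exact formulation; the buffer entry layout and the normalized 'closeness' value z are assumptions for illustration.

```python
import heapq
import math

def draw_pixel(buffer, gamma, eps, min_contrib=0.01, k=5):
    """Illustrative per-pixel draw loop (single channel, scalar Python).

    Each buffer entry is assumed to be (sphere_id, z, closeness, opacity, color),
    where z is a normalized, non-negative 'closeness' that is larger for spheres
    nearer to the camera, and the buffer arrives roughly ordered from near to far.
    """
    numerator = 0.0
    denominator = math.exp(eps / gamma)      # background weight
    closest = []                             # min-heap of (z, id), keeps the k nearest
    for sphere_id, z, closeness, opacity, color in buffer:
        w = opacity * closeness * math.exp(opacity * z / gamma)
        numerator += w * color
        denominator += w
        heapq.heappush(closest, (z, sphere_id))
        if len(closest) > k:
            heapq.heappop(closest)           # drop the farthest of the k+1 tracked
        # Early termination: later spheres have z' <= z and opacity/closeness <= 1,
        # so no single further sphere can contribute more than exp(z / gamma)
        # divided by the (only growing) denominator.
        if math.exp(z / gamma) / denominator < min_contrib:
            break
    return numerator / denominator, sorted(closest, reverse=True)
```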

3.1.4. Preparing for the backward pass

If a backward pass is intended (this can be optionally deactivated), some information of the forward pass is written into a buffer. This buffer contains for each pixel the normalization factor as well as the intersection depths and IDs of the closest five spheres hit.

We experimented with various ways to speed up the backward calculation, and storing this information during the forward operation is vastly superior to all others. It allows us to skip the intersection search altogether at the price of having to write and load the backward information buffer. Since writing and loading can be performed for each thread without additional synchronization, it still turned out to be the most efficient approach.

3.2. The backward pass

Even with the intersection information available, there remain multiple options for how to implement the backward pass. It is possible to parallelize over the spheres (this requires each thread to iterate over all pixels a sphere impacts, but it avoids synchronization to accumulate gradient information) or over the pixels (this way each thread only processes the spheres that have been detected at the pixel position, but synchronization is required for the gradient accumulation for each sphere). We found that parallelizing over the pixels is superior, especially since this implementation is robust to large sphere sizes.

Again, minimizing memory access is critical to reach high execution speeds. To achieve this, we reserve the memory to store all sphere-specific gradients. Additionally, we also allocate a buffer for the camera gradient information per sphere. We found that accumulating the sphere gradients through synchronized access from each pixel thread is viable, but synchronizing the accumulation of the camera gradients, to which every sphere at every pixel contributes, causes too much memory pressure. Instead, we accumulate the camera gradients sphere-wise and run a device-wide reduction as a post-processing step. This reduces the runtime cost of the camera gradient accumulation to a single device-wide reduction over the spheres.

Overall, this implementation proved robust and fast in a variety of settings. Apart from being nearly independent of sphere sizes, it scales well with image resolution and the number of spheres. We found additional normalization helpful to make the gradients better suited for gradient descent optimization:

  • Sphere gradients are averaged over the number of pixels from which they are computed. This avoids parameters of small spheres converging more slowly than those of large spheres. In principle, large spheres have a larger impact on the image and hence receive larger gradients. From an optimization point of view, we found the normalized gradients much better suited for stable loss reduction with gradient descent techniques.

  • Camera gradients need to take the sphere size into account to lead to a stable optimization. We use the area that each sphere covers in the image as a normalization factor (together with a constant factor, which we found approximately suitable to avoid having to scale sphere and camera gradients differently in gradient descent optimization). The area normalization makes this calculation very similar to Monte Carlo integration (see the sketch after this list).
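A sketch of this normalization (tensor names, shapes and the constant value are assumptions for illustration):

```python
import torch

def normalize_gradients(sphere_grads, pixel_counts, cam_grads_per_sphere,
                        covered_areas, c=1.0):
    """Sketch of the gradient normalization described above.

    sphere_grads:         (N, ...) accumulated per-sphere parameter gradients
    pixel_counts:         (N,) number of pixels that contributed to each sphere
    cam_grads_per_sphere: (N, P) per-sphere contributions to the camera gradient
    covered_areas:        (N,) image area (in pixels) covered by each sphere
    c:                    constant scale factor (the value used in the paper is
                          not reproduced here)
    """
    # Average sphere gradients over the pixels they were computed from, so
    # small spheres do not converge more slowly than large ones.
    norm = pixel_counts.clamp(min=1).to(sphere_grads.dtype)
    sphere_grads = sphere_grads / norm.view(-1, *([1] * (sphere_grads.dim() - 1)))

    # Normalize camera gradient contributions by the covered area (similar in
    # spirit to Monte Carlo integration), then reduce over all spheres.
    cam_grad = (c * cam_grads_per_sphere / covered_areas.clamp(min=1.0).unsqueeze(1)).sum(dim=0)
    return sphere_grads, cam_grad
```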

The gradient computation for each of the gradients is only performed if the gradients are required by the PyTorch autodiff framework.

3.3. The rendering equation

We build our rendering equation inspired by the SoftRasterizer (Liu et al., 2019b). On a high level, this means that we define a function that combines position, channel information, radius and opacity into one weight per sphere; this weight is used to merge the channel information of all spheres using a depth-weighted softmax function.

Overall, for each ray we aim to find the blending weight $w_i$ for every sphere $i$. If $w_i$ can be differentiated w.r.t. all relevant parameters, we have a differentiable rendering function. Indeed, we use simple linear blending of the weights and the sphere channel information $F_i$ (usually multiple channels): the blended result is $\sum_i w_i F_i$. Assuming $F_i$ is user-provided, we need to find the $w_i$ values.

The depth must have the major impact on the blending factor. Similar to (Liu et al., 2019b), we choose a weighted softmax function over the sphere intersection depths as the basis for our definition:

(1)

The intersection depth is calculated according to the current projection model, either projective or orthogonal. A factor $\gamma$ is used as a scaling factor to push the representation to be more rigorous with respect to depth: for very small values of $\gamma$, a 'hard' blending is performed that is very close to a regular, non-blended setting; for larger values, a 'soft' blending occurs. Depending on the quantities optimized it makes sense to use different values for $\gamma$, within the limits we allow. The additional offset in the denominator acts as the 'weight' for the background color of the scene and is derived from a fixed small constant $\epsilon$.

This equation allows us to find gradients for the depth w.r.t. the camera, but no gradients would allow repositioning in the other directions. That is why we incorporate the orthogonal distance of the ray to the sphere center, $d_i$, as a linear factor for each weight as follows:

(2)

This distance, since it is always orthogonal to the ray direction, automatically provides gradients for the remaining two directions. (Strictly speaking, for a single ray this direction gradient could be non-existent if the ray hits the sphere exactly at its center, or it could provide gradients in only one of the two remaining directions if it hits the sphere exactly above or to the side of its center. We provide position gradients only for spheres with a projected radius of more than three pixels, because we observed that the gradients are numerically unstable otherwise. This means that if position gradients are provided, they can move spheres in all directions in space.) We calculate $d_i$ from the vector pointing orthogonally from the ray to the sphere center, using the Euclidean norm. Like this, $d_i$ becomes a linear scaling factor for the weight.

For performing volumetric optimization, we need to incorporate opacity. It is non-trivial to integrate opacity into this equation in a differentiable way because it must remain 'soft'. Assuming there is a per-sphere opacity value $O_i$, it could be integrated as a factor inside the exponential function or as another linear scaling factor. We observed that integrating it only as a depth scaling factor leaves spheres visible in front of the background. Using it only as a 'distance' scaling factor has depth 'override' opacity in most cases and does not lead to appropriate results. Using it in both places is a feasible and numerically stable solution (see (Zeng et al., 2020)). This leads to the full equation:

(3)
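For illustration, a blending weight with the structure described above can be written as follows; this is a sketch consistent with the description (with $z_i$ denoting a normalized closeness that is larger for nearer intersections), not necessarily the exact formulation used in the implementation:

```latex
% Illustrative blending weight; assumed notation:
%   z_i     normalized closeness of the intersection (larger = nearer)
%   d_i     orthogonal ray-to-center distance factor (assumed in [0, 1])
%   O_i     per-sphere opacity, \gamma softness factor, \epsilon small constant
% Eq. (1) corresponds to dropping d_i and O_i, Eq. (2) to dropping O_i only.
w_i = \frac{O_i \, d_i \, \exp\!\left( O_i \, z_i / \gamma \right)}
           {\exp\!\left( \epsilon / \gamma \right)
            + \sum_k O_k \, d_k \, \exp\!\left( O_k \, z_k / \gamma \right)}
```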

3.4. Implementation details

The renderer is implemented in C++ and CUDA C as a PyTorch extension. The core can be used without any PyTorch facilities. We define a minimal set of macros to capture all device-specific commands. This allows us to compile the CUDA C code unchanged for serial execution on the CPU. Even though compiling the CUDA code directly for the CPU is not optimal due to the different use of parallelism, it allows for easy debugging, testing and numerical comparison of computation results. For smaller images and numbers of spheres it is usable for optimization; for millions of spheres the computation on the CPU is far too slow.

All CUDA operations are pushed to CUDA streams, and we respect the active PyTorch streams, if available. Instead of naively implementing the weighted softmax operation, we extend the numerically stable softmax (Milakov and Gimelshein, 2018) to incorporate weights.
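The idea of the weighted extension can be sketched in a few lines (illustrative scalar Python, not the CUDA implementation):

```python
import math

def online_weighted_softmax_blend(exponents, weights, payloads):
    """Single-pass, numerically stable weighted-softmax blending.

    Computes sum_i (a_i * exp(x_i) * c_i) / sum_i (a_i * exp(x_i)) without
    overflow by tracking a running maximum of the exponents and rescaling the
    accumulators, in the spirit of the online softmax of Milakov and
    Gimelshein (2018), extended with per-item linear weights a_i.
    """
    m = float("-inf")   # running maximum of the exponents
    num = 0.0           # weighted numerator, scaled by exp(-m)
    denom = 0.0         # denominator, scaled by exp(-m)
    for x, a, c in zip(exponents, weights, payloads):
        if x > m:
            scale = 0.0 if m == float("-inf") else math.exp(m - x)
            num, denom, m = num * scale, denom * scale, x
        e = a * math.exp(x - m)
        num += e * c
        denom += e
    return num / denom
```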

For verification of our gradients, we rely on symbolic differentiation of the render function using the sympy package (Meurer et al., 2017) and compare the gradients in automated tests. Since we use 32-bit float values throughout the renderer (we experimented with partially using 16-bit float values, which resulted in poor quality), finite difference comparisons are brittle. On the other hand, we found the fastermath approximations (http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html) for the exp and log functions useful, with little loss in quality. All time measurements in this paper have been performed on an NVIDIA RTX 2080 GPU.
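The verification strategy can be illustrated as follows; the weight expression below is a stand-in with the same structure as Sec. 3.3, not the renderer's exact equation:

```python
import sympy as sp

# Symbolically differentiate a small blending-style function with sympy and
# compare against the implementation's analytic gradient at sample points.
z, d, o, gamma, eps = sp.symbols("z d o gamma epsilon", positive=True)
w = o * d * sp.exp(o * z / gamma) / (sp.exp(eps / gamma) + o * d * sp.exp(o * z / gamma))

dw_dz = sp.diff(w, z)          # symbolic reference gradient w.r.t. depth/closeness
grad_fn = sp.lambdify((z, d, o, gamma, eps), dw_dz, "math")

# In an automated test, grad_fn(...) would be compared (within a tolerance)
# against the value produced by the renderer's gradient code.
print(grad_fn(0.5, 0.8, 0.9, 0.1, 1e-3))
```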

Figure 2. Coarse reconstruction data and optimization stages. (a) Two example renderings of the 3D head model. 80 images with random azimuth and elevation are used for the 3D reconstruction process. (b) From left to right, five snapshots over the course of the reconstruction process. The rightmost three images are scaled for visualization purposes. 523 gradient descent steps are performed in 73 s (400,000 spheres, 800×1280 image resolution).

4. Experiments

In several experiments, we first show how it is possible to reproduce 3D reconstruction examples from two other state-of-the-art differentiable renderers and then move beyond their limitations to demonstrate how pulsar can be used to create a highly realistic head model. We also briefly describe application possibilities that are beyond the scope of this paper. In all of our experiments we use solely a photometric loss and no additional regularizers, stabilizers or gradient clipping for the scene parameter optimization. Utmost care has been taken to calculate the gradients with as little numerical noise as possible; this allows us to avoid such additional strategies and makes the renderer easy to use in many scenarios. We show how well-behaved the gradients are in Fig. 4, where we display the progress through an optimization over time.

4.1. Reconstruction from Silhouettes

Figure 3. Reproducing the plane reconstruction example from silhouette views, as provided by the SoftRas package. We show results after only 150 steps of gradient descent with a simple photometric loss: example silhouette projections and the full reconstruction result (1352 spheres, 64×64 image resolution).

In this first experiment, we reproduce an example from the SoftRas (Liu et al., 2019b) package for 3D reconstruction using silhouettes. A set of images with object silhouettes is available to reconstruct the shape of a 3D object. The starting point for the optimization is a sphere that is warped stepwise into an airplane. SoftRas uses images of size 64×64, which pushes the sizes of the spheres to the lower limits in pixel size. To work with a sphere-based instead of a mesh representation, we place spheres at all mesh vertices. Instead of the more intricate and computationally complex IoU, Laplacian and flattening losses that are used in the SoftRas demo, we can use solely a photometric loss on the generated and target images. The additional losses are required to keep the mesh surface consistent; in contrast, we can shift sphere locations without taking surface topology into account. We present the reconstruction results in Fig. 3. For this low number of spheres and resolution, SoftRas is faster than pulsar, but it scales badly (cf. Tab. 1).

4.2. 3D Reconstruction with Lighting

Figure 4. A visualization of the optimization of a sphere morphing to a teapot through gradient updates. The top row shows the results using pulsar with diffuse shading and three parallel light sources. The bottom row shows the progression of DSS (Yifan et al., 2019) over the same amount of time and its result (8003 spheres, 256×256 image resolution).

In a second experiment, we reproduce two experiments from the DSS (Yifan et al., 2019) project. This renderer includes a lighting model. We demonstrate how a simple lighting model can be added to our renderer by implementing diffuse shading with parallel light sources. We use three light sources and the dot product between the light direction vector and the normalized position vector to calculate the light intensity at each step (this obviously does not take occlusion into account). This highlights the versatility of a dedicated geometry projection step and demonstrates how easily it can be combined with additional refinement models. Similar to DSS, we use camera positions selected at random azimuth and elevation angles but with a constant distance to the object center. In this experiment, a moderate number of gradient descent steps suffices to bring the optimization to convergence. Using pulsar, we complete the optimization in a fraction of the time that DSS needs: DSS requires more complex loss functions and converges only after 477 steps (see Fig. 4).

4.3. Detailed Reconstruction and Neural Shading

Figure 5. Refined sphere representations and neural shading results. We provide visualizations of the first three color channels of the spheres and the results of a shallow (left) and a deep (right) neural shading model (139,794 spheres, 800×1280 image resolution).
Figure 6. Neural shading results using the Pix2PixHD (Wang et al., 2018) architecture for a perspective from the validation set for varying training set sizes. The ground truth image is shown on the right.

All of these examples were reproduced to show the flexibility of pulsar and how it can be integrated into existing pipelines. The latter example used 8003 spheres and a 256×256 image resolution, which is far below the resolution of current cameras. The number of spheres is sufficient to represent a single object, however each sphere has to be quite large to create a closed surface. Given the scaling performance of pulsar, can we represent much more detailed objects at image resolutions that match contemporary cameras?

We set out to investigate this question with an example of a synthetic head model (see Fig. 2). This head model originates from a light stage scan and has been artist-refined and extended to have realistic hair and eyes. In a first step, we aim to find a coarse reconstruction of the head geometry. For this purpose, we render 100 images of the head at 800×1280 resolution with randomized camera locations, assigning a random azimuth and a limited elevation range. We then store the images and the extrinsic and intrinsic camera parameters for every image.

All the preceding experiments deform an initial geometry to match a target geometry. We propose to use another strategy that exploits the fact that the proposed renderer handles large quantities of spheres: we initialize a volume filled with 400,000 randomly distributed spheres with a fixed radius and quickly eliminate them if their color converges towards the background color. This allows us to find a coarse head model in only 73 s (523 gradient descent steps; see Fig. 2). It is easy to eliminate spheres inside of the head and obtain a proper surface model by casting rays from all camera positions onto the head and removing all spheres that do not receive any gradient updates. This results in a hull representation of the head with a virtual thickness of several centimeters: the hull is still formed of several spheres at every point, because every sphere is regarded as at least partially transparent. After this cleanup, a model consisting of considerably fewer spheres remains.

This coarse model might be sufficient for a powerful neural shading model applied in a successive step. To be able to use a neural shading model that works mainly pixel-wise and locally, with the advantages of better generalization and faster execution times, a more detailed model is required. To achieve this, we increase the number of spheres three times through subdivision. It is important not to create 'holes' in the model in each refinement. To avoid this, we replace each sphere with 12 smaller spheres and place them in a face-centered cubic packing scheme. After this replacement, we run an optimization of all sphere parameters and obtain an increasingly refined model. After the optimization is done, we again remove all spheres that are not reached by any ray from any of the viewpoints. After three stages, the resulting model has 139,794 spheres (see Fig. 5). The refinement steps finish quickly and temporarily produce models with up to twelve times as many spheres between 'cleanup' steps. The cleanup steps are currently implemented in Python and are well suited for a more efficient implementation. We integrate this sphere representation into a deep neural network. A discussion of potential generative neural network architectures is out of scope for this paper, and any generative neural network can be used for this purpose. Furthermore, a generative parameterized model can be used to provide the sphere position and appearance parameters. To showcase the potential of the proposed pipeline, we experiment with two architectures: a shallow, three-layer stacked convolution/ReLU architecture to demonstrate the effect of local neural shading, and an off-the-shelf Pix2PixHD (Wang et al., 2018) architecture to show how detailed results can be generated.

For both architectures, we generalize the number of channels to 8/15 (shallow/deep) with arbitrary, latent information that is optimized jointly with the neural shading network. All hyperparameters, including the number of channels, were optimized through a hyperparameter sweep using Ray Tune (Liaw et al., 2018) with the HyperOpt (Bergstra et al., 2013) and ASHA (Li et al., 2018a) algorithms. This includes weights for multiple loss functions: 1) a photometric loss, 2) a 'perceptual' loss matching the outputs of a VGG model (Simonyan and Zisserman, 2014) and 3) an adversarial loss. To obtain a more detailed representation of the face, we experiment with different numbers of sampled frontal camera positions: with 80 images in the training set we can already obtain a reconstruction that interpolates well between perspectives, but still with a visible loss of detail. With larger training sets, there is hardly any perceptual difference between training and validation results visible any more (see Fig. 6).

Results of the optimized network can be found in Fig. 1 and Fig. 5. We visualize the first three color channels and neural shading results in Fig. 5. For the shallow model, more information is stored in the sphere channels. This model is expected to generalize better to unseen surfaces; due to the lack of an encoder-decoder structure, it cannot perform any global reasoning. The deeper model nearly perfectly reproduces the training data, but acts more as an interpolator between seen views and cannot be expected to generalize to completely unseen surfaces. Both produce compelling results that can be rendered in near real time on consumer hardware (we achieve 30+ FPS for the geometry projection step; the shading steps of the shallow and deep models add further runtime, without any optimizations), whereas the original model has been created using comparatively slow offline raytracing. While we only demonstrate two off-the-shelf neural shading architectures as (1) the simplest and (2) a highly parameterized option, finding dedicated neural shading architectures and potentially introducing further intermediate processing steps will be a promising direction for future research.

4.4. Other applications

Figure 7. Detailed volumetric person reconstruction result (cf. (Zeng et al., 2020), Fig. 7). A neural network is trained to predict opacity, color and normal values for a sphere-based representation. Gradients are back-propagated through the geometry engine to refine the predictions.

The proposed geometry engine is versatile and can easily be integrated into models beyond the presented, classical use cases. For example, we used the geometry engine to implement (Aliev et al., 2019), which focuses on viewpoint synthesis through point-based scene representations. Another promising application direction is volumetric reconstruction. This assumes that certain areas in a volume are occupied by spheres, but these spheres may be transparent or only partially transparent. pulsar can represent this through the opacity parameter, for which we calculate gradients. This application area has shown promising results for implicit geometric representations, for example in (Saito et al., 2019). We use pulsar to refine the volumetric human reconstruction performed by a deep neural network. The neural network predicts an implicit surface representation. We sample points around the expected outline of the body using a normal distribution. Each of these points has a predicted opacity, color and normal. We refine the normal and color predictions using ground truth information and a reconstruction loss. An example reconstruction with normal and color predictions can be found in Fig. 7. For details, please see (Zeng et al., 2020).

5. Conclusion

In this paper, we have presented the pulsar differentiable geometry engine. Its architecture builds on recent insights in the fields of differentiable rendering and neural networks and makes deliberate choices to limit complexity in the projection process for high speed and scalability. This is complemented with easy integration into deep learning models in the PyTorch autodiff framework.

The speedup—up to orders of magnitude in comparison with existing differentiable renderers—and the improved scaling behavior paired with high rendering quality make new applications and settings feasible that were out of reach before. Higher resolution images can be rendered, and larger numbers of spheres can be used to represent complex and detailed scene geometry. Using entirely sphere-based representations facilitates optimizing not only for color, but also position and sphere size, making adjustments to the scene geometry possible. The implementation of a differentiable opacity per sphere enables volumetric optimization.

In our experiments, we show that the proposed module is highly flexible and can easily be integrated into complex pipelines, including lighting, in a straightforward and performant way. We have shown promising results in 3D reconstruction applications and hope that the presented framework will make the exploration of previously untenable ideas possible.

Acknowledgements.
We thank Carsten Stoll, Christophe Hery and Olivier Maury for many inspiring and insightful discussions and Christophe and Olivier for sharing the head model that we used to perform the 3D reconstruction experiments. Rania Briq and Zhaoyang Lv provided great feedback and suggestions for improvement. Zeng Huang, Yuanlu Xu and Tony Tung were involved in the human reconstruction project and provided inspiration and feedback for the opacity implementation. Thanks to David Altman for his many comments and suggestions for corrections for the paper.

References

  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: software available from tensorflow.org. Cited by: §2.
  • K. Aliev, D. Ulyanov, and V. Lempitsky (2019) Neural point-based graphics. arXiv preprint arXiv:1906.08240. Cited by: §2, §4.4.
  • D. Azinovic, T. Li, A. Kaplanyan, and M. Nießner (2019) Inverse path tracing for joint material and lighting estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2447–2456. Cited by: §2.
  • B. G. Baumgart (1974) Geometric modeling for computer vision. Technical report Technical Report AI Lab Memo AIM-249, Stanford University. Cited by: §2.
  • J. Bergstra, D. Yamins, and D. D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. Cited by: §4.3.
  • Y. Hu, L. Anderson, T. Li, Q. Sun, N. Carr, J. Ragan-Kelley, and F. Durand (2019) DiffTaichi: differentiable programming for physical simulation. arXiv preprint arXiv:1910.00935. Cited by: §2.
  • E. Insafutdinov and A. Dosovitskiy (2018) Unsupervised learning of shape and pose with differentiable point clouds. In Advances in Neural Information Processing Systems 31 (NIPS 2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Montréal, Canada, pp. 2804–2814 (eng). Cited by: §2.
  • Y. Jiang, D. Ji, Z. Han, and M. Zwicker (2019) SDFDiff: differentiable rendering of signed distance fields for 3d shape optimization. arXiv preprint arXiv:1912.07109. Cited by: §2.
  • T. Karras and T. Aila (2013) Fast parallel construction of high-quality bounding volume hierarchies. In Proceedings of the 5th High-Performance Graphics Conference, pp. 89–99. Cited by: §3.1.1.
  • T. Karras (2012) Maximizing parallelism in the construction of bvhs, octrees, and k-d trees. In Proceedings of the Fourth ACM SIGGRAPH/Eurographics conference on High-Performance Graphics, pp. 33–37. Cited by: §3.1.1.
  • H. Kato, Y. Ushiku, and T. Harada (2018) Neural 3d mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3907–3916. Cited by: §2.
  • T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum (2015) Deep convolutional inverse graphics network. In Advances in neural information processing systems, pp. 2539–2547. Cited by: §2.
  • L. Li, K. Jamieson, A. Rostamizadeh, E. Gonina, M. Hardt, B. Recht, and A. Talwalkar (2018a) Massively parallel hyperparameter tuning. arXiv preprint arXiv:1810.05934. Cited by: §4.3.
  • T. Li, M. Aittala, F. Durand, and J. Lehtinen (2018b) Differentiable monte carlo ray tracing through edge sampling. ACM Trans. Graph. (Proc. SIGGRAPH Asia) 37 (6), pp. 222:1–222:11. Cited by: §2.
  • R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica (2018) Tune: a research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118. Cited by: §4.3.
  • C. Lin, C. Kong, and S. Lucey (2018) Learning efficient point cloud generation for dense 3d object reconstruction. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • G. Liu, D. Ceylan, E. Yumer, J. Yang, and J. Lien (2017) Material editing using a physically based rendering network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2261–2269. Cited by: §2.
  • S. Liu, Y. Zhang, S. Peng, B. Shi, M. Pollefeys, and Z. Cui (2019a) DIST: rendering deep implicit signed distance function with differentiable sphere tracing. arXiv preprint arXiv:1911.13225. Cited by: §2.
  • S. Liu, T. Li, W. Chen, and H. Li (2019b) Soft rasterizer: a differentiable renderer for image-based 3d reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7708–7717. Cited by: §2, §3.3, §3.3, §4.1.
  • M. M. Loper and M. J. Black (2014) OpenDR: an approximate differentiable renderer. In Computer Vision – ECCV 2014, Lecture Notes in Computer Science, Vol. 8695, pp. 154–169. Cited by: §2, §2.
  • A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rocklin, A. Kumar, S. Ivanov, J. K. Moore, S. Singh, T. Rathnayake, S. Vig, B. E. Granger, R. P. Muller, F. Bonazzi, H. Gupta, S. Vats, F. Johansson, F. Pedregosa, M. J. Curry, A. R. Terrel, Š. Roučka, A. Saboo, I. Fernando, S. Kulal, R. Cimrman, and A. Scopatz (2017) SymPy: symbolic computing in Python. PeerJ Computer Science 3, pp. e103. Cited by: §3.4.
  • M. Milakov and N. Gimelshein (2018) Online normalizer calculation for softmax. arXiv preprint arXiv:1805.02867. Cited by: §3.4.
  • B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) NeRF: representing scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934. Cited by: §2.
  • G. M. Morton (1966) A computer oriented geodetic data base and a new technique in file sequencing. Cited by: §3.1.1.
  • T. H. Nguyen-Phuoc, C. Li, S. Balaban, and Y. Yang (2018) Rendernet: a deep convolutional network for differentiable rendering from 3d shapes. In Advances in Neural Information Processing Systems, pp. 7891–7901. Cited by: §2.
  • M. Nimier-David, D. Vicini, T. Zeltner, and W. Jakob (2019) Mitsuba 2: a retargetable forward and inverse renderer. Transactions on Graphics (Proceedings of SIGGRAPH Asia) 38 (6). Cited by: §2.
  • J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) Deepsdf: learning continuous signed distance functions for shape representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 165–174. Cited by: §2.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: §1.
  • F. Petersen, A. H. Bermano, O. Deussen, and D. Cohen-Or (2019) Pix2vex: image-to-geometry reconstruction using a smooth differentiable renderer. arXiv preprint arXiv:1903.11149. Cited by: §2.
  • N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W. Lo, J. Johnson, and G. Gkioxari (2020) PyTorch3D. Note: https://github.com/facebookresearch/pytorch3d Cited by: §2.
  • R. Roveri, L. Rahmann, C. Oztireli, and M. Gross (2018) A network architecture for point cloud classification via automatic depth images generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4176–4184. Cited by: §2.
  • S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li (2019) Pifu: pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2304–2314. Cited by: §2, §4.4.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.3.
  • V. Sitzmann, M. Zollhöfer, and G. Wetzstein (2019) Scene representation networks: continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems, Cited by: §2.
  • A. Tewari, O. Fried, J. Thies, V. Sitzmann, S. Lombardi, K. Sunkavalli, R. Martin-Brualla, T. Simon, J. Saragih, M. Nießner, R. Pandey, S. Fanello, G. Wetzstein, J.-Y. Zhu, C. Theobalt, M. Agrawala, E. Shechtman, D. B. Goldman, and M. Zollhöfer (2020) State of the art on neural rendering. EUROGRAPHICS 2020 39 (2). Cited by: §2.
  • S. Tulsiani, R. Tucker, and N. Snavely (2018) Layer-structured 3d scene inference via view synthesis. In Proc. ECCV, Cited by: §2.
  • J. Valentin, C. Keskin, P. Pidlypenskyi, A. Makadia, A. Sud, and S. Bouaziz (2019) TensorFlow graphics: computer graphics meets deep learning. Cited by: §2.
  • M. Vinkler, J. Bittner, and V. Havran (2017) Extended morton codes for high performance bounding volume hierarchy construction. In Proceedings of High Performance Graphics, pp. 1–8. Cited by: §3.1.1.
  • T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Figure 6, §4.3.
  • O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson (2019) SynSin: end-to-end view synthesis from a single image. arXiv preprint arXiv:1912.08804. Cited by: §2, §2.
  • W. Yifan, F. Serena, S. Wu, C. Öztireli, and O. Sorkine-Hornung (2019) Differentiable surface splatting for point-based geometry processing. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–14. Cited by: §2, Table 2, Figure 4, §4.2.
  • H. Zeng, Y. Xu, C. Lassner, H. Li, and T. Tung (2020) ARCH: animatable reconstruction of clothed humans. arXiv preprint arXiv:2004.04572. Cited by: §2, §3.3, Figure 7, §4.4.