DIST: Rendering Deep Implicit Signed Distance Function with Differentiable Sphere Tracing

11/29/2019 ∙ by Shaohui Liu, et al.

We propose a differentiable sphere tracing algorithm to bridge the gap between inverse graphics methods and the recently proposed deep learning based implicit signed distance functions. Due to the nature of the implicit function, the rendering process requires a tremendous number of function queries, which is particularly problematic when the function is represented as a neural network. We optimize both the forward and backward passes of our rendering layer to make it run efficiently with affordable memory consumption on a commodity graphics card. Our rendering method is fully differentiable, such that losses can be directly computed on the rendered 2D observations and the gradients can be propagated backward to optimize the 3D geometry. We show that our rendering method can effectively reconstruct accurate 3D shapes from various inputs, such as sparse depth and multi-view images, through inverse optimization. With the geometry based reasoning, our 3D shape prediction methods show excellent generalization capability and robustness against various types of noise.




1 Introduction

Solving vision problems as an inverse graphics process is one of the most fundamental approaches, where the solution is the visual structure that best explains the given observations. In the realm of 3D geometry understanding, this approach has been used since very early days [1, 33, 52]. As a critical component of the inverse graphics based 3D geometric reasoning process, an efficient renderer is required to accurately simulate the observations, e.g., a depth map, from an optimizable 3D structure, and it must also be differentiable to back-propagate the error from the partial observation.

As a natural fit to the deep learning framework, differentiable rendering techniques have drawn great interest recently. Various solutions for different 3D representations, e.g., volume, point cloud, and mesh, have been proposed. However, these 3D representations are all discretized up to a certain resolution, leading to a loss of geometric detail and breaking the differentiable properties [22]. Recently, continuous implicit functions have been used to represent the signed distance field [32], which have a superior capacity to encode accurate geometry when combined with deep learning techniques. Given a latent code as the shape representation, the function can produce a signed distance value for any arbitrary point, thus enabling unlimited resolution and better preserved geometric details for rendering purposes. However, a differentiable rendering solution for learning-based continuous signed distance functions does not exist yet.

Figure 1: Illustration of our proposed differentiable renderer for continuous signed distance function. Our method enables geometric reasoning with strong generalization capability. With a random shape code initialized in the learned shape space, we can acquire high-quality 3D shape prediction by performing iterative optimization with various 2D supervisions.

In this paper, we propose a differentiable renderer for continuous implicit signed distance functions (SDF) to facilitate 3D shape understanding via geometric reasoning in a deep learning framework (Fig. 1). Our method can render an implicit SDF represented by a neural network from a latent code into various 2D observations, e.g., depth image, surface normal, silhouette, plus other encoded properties, from arbitrary camera viewpoints. The rendering process is fully differentiable, such that loss functions can be conveniently defined on the rendered images and the observations, and the gradients can be propagated back to the shape generator. As major applications, our differentiable renderer can be applied to infer the 3D shape from various inputs, e.g., multi-view images and a single depth image, through an inverse graphics process. Specifically, given a pre-trained generative model, e.g., DeepSDF [32], we search within the latent code space for the 3D shape whose rendered images are most consistent with the observations. Extensive experiments show that our geometric reasoning based approaches exhibit significantly better generalization capability than traditional purely learning based approaches, and consistently produce accurate 3D shapes across datasets without finetuning.

Nevertheless, it is challenging to make differentiable rendering work on a learning-based implicit SDF with computationally affordable resources. The main obstacle is that an implicit function provides neither the exact location nor any bound of the surface geometry as in other representations like mesh, volume, and point cloud.

Inspired by traditional ray-tracing based approaches, we adopt the sphere tracing algorithm [13], which marches along each pixel’s ray direction with a step size given by the queried signed distance until the ray hits the surface, i.e., the signed distance equals zero (Fig. 2). However, this is not directly feasible in the neural network based scenario, where each query along the ray requires a forward pass and a recursive computational graph for back-propagation, which is prohibitive in terms of computation and memory.

Figure 2: Illustration of the sphere tracing algorithm [13]. A ray is initiated at each pixel and marches along the viewing direction. The front end moves with a step size equal to the signed distance value at the current location. The algorithm converges when the current absolute SDF value is smaller than a threshold, which indicates that the surface has been found.

To make it work efficiently on a commodity level GPU, we optimize the full lifetime of the rendering process for both forward and backward propagation. In the forward rendering pass, we adopt a coarse-to-fine approach to save computation in the initial steps, an aggressive strategy to speed up the marching, and a safe convergence criterion to prevent unnecessary queries while maintaining resolution. In the backward propagation, we propose a gradient approximation which empirically has negligible impact on system performance but dramatically reduces computation and memory consumption. Having made the rendering tractable, we show how to produce 2D observations with sphere tracing and interact with camera extrinsics in differentiable ways.

To sum up, our major contribution is to enable efficient differentiable rendering of an implicit signed distance function represented as a neural network. It enables accurate 3D shape prediction via geometric reasoning in deep learning frameworks and exhibits outstanding generalization capability. The differentiable renderer could also potentially benefit various vision problems thanks to the marriage of implicit SDF and inverse graphics techniques. The rest of the paper is organized as follows. Section 2 introduces related work on differentiable rendering and implicit continuous functions. In Section 3, we explain the proposed renderer in detail. Section 4 shows the experimental results, followed by a conclusion in Section 5.

2 Related Work

3D Representation for Shape Learning The 3D representation for shape learning is one of the main focuses in the 3D deep learning community. Early work quantizes shapes into 3D volumes, where each voxel contains either a binary occupancy status (occupied / not occupied) [50, 6, 44, 37, 12] or a signed distance value [53, 9, 43]. While voxels are the most straightforward extension from the 2D image domain into the 3D geometry domain for neural network operations, they normally require huge memory overhead, which leads to relatively low resolutions. Meshes have been proposed as a more memory efficient representation for 3D shape learning [45, 11, 21, 19], but the topology of the meshes is normally fixed and simple. Many deep learning methods also utilize point clouds as the 3D representation [35, 36]; however, point-based representations lack topology information, which makes it non-trivial to generate 3D meshes from them. Very recently, implicit functions, e.g., continuous SDF and occupancy functions, have been exploited as 3D representations and show promising performance in terms of high-frequency detail modeling and high resolution [32, 27, 28, 4]. Similar ideas have also been used to encode other information such as texture [31, 38] and 4D dynamics [30]. Our work aims to design an efficient and differentiable renderer for the implicit SDF-based representation.

Differentiable Rendering With the success of deep learning, differentiable rendering has started to draw more attention, as it is essential for end-to-end training. Depending on the 3D representation, different rendering techniques have been proposed. Early works focus on 3D triangulated meshes as input and leverage standard rasterization [26]. Various approaches try to solve the discontinuity issue near triangle boundaries by smoothing the loss function or approximating the gradient [20, 34, 24, 3]. Solutions for point clouds and 3D volumes have also been introduced [46, 17] to work jointly with PointNet [35] and 3D convolutional architectures. However, differentiable rendering for the implicit continuous function representation does not exist yet. Some ray tracing based approaches are related, but they are mostly proposed for explicit representations, such as 3D volumes [25, 29, 41] or meshes [22], not implicit functions. Most related to our work, Sitzmann et al. [42] propose an LSTM-based renderer for an implicit scene representation to generate color images, but their model focuses on simulating the rendering process with an LSTM without clear geometric meaning. Their method can only generate low-resolution images due to its high memory consumption. Alternatively, our method can directly render 3D geometry represented by an implicit SDF to produce high-resolution images. It can also be applied to existing deep learning models without training.

3D Shape Prediction 3D shape prediction from 2D observations is one of the fundamental vision problems. Early works mainly focus on multi-view reconstruction using multi-view stereo methods [39, 14, 40]. These purely geometry-based methods suffer from degraded performance on texture-less regions without prior knowledge [7]. With the progress of deep learning, 3D shapes can be recovered under various settings. The simplest setting is to recover the 3D shape from a single image [6, 10, 49, 18]. These systems rely heavily on priors and are prone to weak generalization. Deep learning based multi-view shape prediction methods [51, 15, 16, 47, 48] further involve geometric constraints across views in the deep learning framework, which improves generalization. Another thread of work [9, 8] takes a single depth image as input, and the problem is usually referred to as shape completion. Given the shape prior encoded in the neural network [32], our rendering method can effectively predict an accurate 3D object shape from a random initial shape code with various inputs, such as depth and multi-view images, through geometric optimization.

3 Differentiable Sphere Tracing

(a) Coarse-to-fine Strategy (b) Aggressive Marching (c) Convergence Criteria
Figure 3: Strategies for our efficient forward propagation. (a) 1D illustration of our coarse-to-fine strategy; in the 2D case, one ray is split into 4 rays; (b) Comparison of standard marching and our aggressive marching; (c) We stop the marching once the absolute SDF value is smaller than ε, where ε is the estimated minimal distance between the corresponding 3D points of two neighboring pixels.

In this section, we introduce our differentiable rendering method for implicit signed distance function represented as a neural network, such as DeepSDF [32]. In DeepSDF, a network takes a latent code and a 3D location as input, and produces the corresponding signed distance value. Even though such a network can deliver high quality geometry, the explicit surface cannot be directly obtained and requires dense sampling in the 3D space.

Our method is inspired by sphere tracing [13], designed for rendering SDF volumes, where rays are shot from the camera pinhole along the direction of each pixel to search for the surface level set according to the signed distance value. However, it is prohibitive to apply this method directly to an implicit signed distance function represented as a neural network, since each tracing step needs a feedforward pass of the network, and the whole algorithm requires unaffordable computational and memory resources. To make this idea work in a deep learning framework for inverse graphics, we optimize both the forward and backward propagation for efficient training and test-time optimization. The sphere traced result, i.e., the distance along the ray, can be converted into many desired outputs, e.g., depth, surface normal, silhouette, and hence losses can be conveniently applied in an end-to-end manner.

3.1 Preliminaries - Sphere Tracing

For a self-contained exposition, we first briefly introduce the traditional sphere tracing algorithm [13]. Sphere tracing is a conventional method specifically designed to render depth from volumetric signed distance fields. For each pixel on the image plane, as shown in Figure 2, a ray r(d) = c + d·v is shot from the camera center c and marches along the viewing direction v with a step size equal to the queried signed distance value f(p). The ray marches iteratively until it hits or gets sufficiently close to the surface (i.e., |f(p)| < threshold). The detailed procedure is given in Algorithm 1.

1: initialize d ← 0, p ← c.
2: while not converged do:
3:     query the SDF value at the current location: s ← f(p).
4:     d ← d + s,  p ← c + d·v.
5:     check convergence: |s| < threshold.
6: end while
Algorithm 1 Naive sphere tracing of a camera ray r(d) = c + d·v over a signed distance field f.
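As a reference, Algorithm 1 can be sketched as follows against an analytic SDF (a toy sphere stands in for the learned network; all names are ours, not the paper's):

```python
import numpy as np

def sdf_sphere(p, r=1.0):
    # Signed distance to a sphere of radius r centered at the origin.
    return np.linalg.norm(p) - r

def sphere_trace(c, v, f, eps=1e-5, max_steps=50):
    """Naive sphere tracing: march from camera center c along unit
    direction v with a step size equal to the queried SDF value."""
    d = 0.0
    for _ in range(max_steps):
        p = c + d * v
        s = f(p)          # one network query per step in the learned setting
        if abs(s) < eps:  # converged: on the surface up to eps
            return d
        d += s
    return None           # ray missed the surface (background pixel)

# Camera at z = -3 looking toward a unit sphere at the origin:
dist = sphere_trace(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]), sdf_sphere)
```

On this exact SDF the ray lands on the surface in one step, returning a distance of 2; in the learned setting each `f` call is a network forward pass, which motivates the accelerations below.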

3.2 Efficient Forward Propagation

Directly applying sphere tracing to an implicit SDF represented by a neural network is prohibitively expensive, because each query of f requires a forward pass of a network with considerable capacity. Naive parallelization is not sufficient, since essentially millions of network queries are required for a single rendering at VGA resolution (640×480). Therefore, we need to cut off unnecessary marching steps and safely speed up the marching progress.

Initialization Because all the 3D shapes represented by DeepSDF are bounded within the unit sphere, we initialize p to be the first intersection between the camera ray and the unit sphere for each pixel. Pixels whose camera rays do not intersect the unit sphere are set as background (i.e., infinite depth).
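This initialization reduces to a ray-sphere intersection test; a minimal sketch (names are ours):

```python
import numpy as np

def init_on_unit_sphere(c, v):
    """First intersection of the ray c + d*v (v unit-norm) with the unit
    sphere. Returns the ray distance d, or None if the ray misses the
    sphere (background pixel with infinite depth)."""
    # Solve ||c + d*v||^2 = 1, a quadratic d^2 + b*d + cc = 0 (since ||v|| = 1).
    b = 2.0 * np.dot(c, v)
    cc = np.dot(c, c) - 1.0
    disc = b * b - 4.0 * cc
    if disc < 0.0:
        return None                      # no intersection
    d = (-b - np.sqrt(disc)) / 2.0       # nearer root = entry point
    return d if d > 0.0 else None

d0 = init_on_unit_sphere(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]))
```

For a camera at distance 3 from the origin looking at the center, the marching starts at ray distance 2 instead of 0, saving the early queries.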

Coarse-to-fine Strategy At the beginning of sphere tracing, the rays of different pixels are fairly close to each other, which indicates that they will likely march in a similar way. To leverage this property, we propose a coarse-to-fine sphere tracing strategy, shown in Fig. 3 (a). We start the sphere tracing on a downsampled version of the image and split each ray into four after every three marching steps, which is equivalent to doubling the resolution. After six steps, each pixel in the full resolution has a corresponding ray, which keeps marching until convergence.

Aggressive Marching After the ray marching begins, we apply an aggressive strategy (Fig. 3 (b)) to speed up the marching progress by updating the ray front with α times the queried signed distance value, where α > 1 in our implementation. This aggressive sampling has several benefits. First, it makes the ray march faster towards the surface, especially when it is far from the surface. Second, it accelerates convergence in the ill-posed condition where the angle between the surface normal and the ray direction is small. Third, the ray can pass through the surface, such that the space behind it (i.e., SDF < 0) can be sampled. This is crucially important for applying supervision on both sides of the surface during optimization.

Dynamic Synchronized Inference A naive parallelization for speeding up sphere tracing is to batch the rays together and synchronously update the front end positions. However, depending on the 3D shape, some rays may converge earlier than others, leading to wasted computation. We therefore maintain a dynamic unfinished mask indicating which rays still require further marching, preventing unnecessary queries.
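The batched marching with a dynamic unfinished mask and the aggressive step factor can be sketched as follows (a toy analytic sphere SDF stands in for the network; the value of α is illustrative):

```python
import numpy as np

def batched_sphere_trace(origins, dirs, f, alpha=1.5, eps=1e-5, max_steps=50):
    """Batched marching with a dynamic 'unfinished' mask: converged rays
    are excluded from further (expensive) SDF queries. alpha > 1 is the
    aggressive step factor. f maps an (M, 3) array of points to (M,) SDFs."""
    n = origins.shape[0]
    d = np.zeros(n)
    active = np.ones(n, dtype=bool)
    for _ in range(max_steps):
        idx = np.flatnonzero(active)
        if idx.size == 0:
            break
        p = origins[idx] + d[idx, None] * dirs[idx]
        s = f(p)                       # query only unfinished rays
        done = np.abs(s) < eps
        active[idx[done]] = False      # these rays have converged
        step = idx[~done]
        d[step] += alpha * s[~done]    # aggressive update for the rest
    return d, ~active                  # ray distances and convergence mask

origins = np.array([[0.0, 0.0, -3.0], [0.0, 0.0, -3.0]])
dirs = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])     # one hit, one miss
f = lambda P: np.linalg.norm(P, axis=1) - 1.0           # batch of sphere SDFs
d, converged = batched_sphere_trace(origins, dirs, f)
```

With α > 1 the iterate overshoots and oscillates around the surface with geometrically shrinking error, so the hit ray converges to distance 2 while the miss ray simply exhausts its steps.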

Convergence Criteria Even with aggressive marching, the ray movement can become extremely slow close to the surface, since the queried SDF value approaches zero. We define a convergence criterion to stop the marching when the accuracy is sufficiently good and further gain is marginal (Fig. 3(c)). To fully maintain the detailed geometry supported by the 2D rendering resolution, it is safe to stop when the sampled signed distance value cannot confuse one pixel with its neighbors. For an object with a smallest depth of 10 captured by a camera with a focal length of 60, a sensor width of 32, and the image resolution used in our experiments, the minimal distance between the 3D points corresponding to two neighboring pixels can be estimated as depth × pixel pitch / focal length. In practice, we set the convergence threshold ε to this estimate for most of our experiments.
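The pixel-footprint estimate is a one-line similar-triangles computation; a sketch with the camera parameters quoted in the text (the 512-pixel image width is our assumption for illustration, not the paper's stated configuration):

```python
def pixel_footprint(depth, focal, sensor_width, image_width):
    """Approximate 3D distance between surface points seen by two adjacent
    pixels: depth * (pixel pitch / focal length). focal and sensor_width
    share one unit convention (e.g., millimeters)."""
    pixel_pitch = sensor_width / image_width
    return depth * pixel_pitch / focal

# Object at depth 10, focal length 60, sensor width 32, assumed 512-px width:
eps = pixel_footprint(depth=10.0, focal=60.0, sensor_width=32.0, image_width=512)
```

Any SDF value smaller than this footprint cannot move the hit point across a pixel boundary, so stopping there loses no detail at the rendering resolution.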

3.3 Rendering 2D Observations

After all rays converge, we can compute the distance along each ray as:

d = Σ_{i=0}^{N−1} α·f(p_i) + η,    (1)

where η is the residual term from the last query. In the following, we show how this computed ray distance is converted into 2D observations.

Depth and Surface Normal Suppose we find the 3D surface point for a pixel p in the image. The depth of the pixel can then be directly computed as:

z = d / ||K⁻¹·ṗ||,    (2)

where ṗ = (u, v, 1)ᵀ is the normalized homogeneous coordinate of p and K is the camera intrinsic matrix.

The surface normal of the point can be directly computed as the normalized gradient of the function f. Since f is an implicit function, we approximate the gradient by sampling neighboring locations with central differences:

n = normalize([ f(p + ε·eₓ) − f(p − ε·eₓ),  f(p + ε·e_y) − f(p − ε·e_y),  f(p + ε·e_z) − f(p − ε·e_z) ]),    (3)

where eₓ, e_y, e_z are the coordinate axis directions.
Silhouette Silhouettes are a commonly used supervision for 3D shape prediction. To make the rendering of the silhouette differentiable, we take the minimum absolute signed distance value along each pixel's ray and subtract the convergence threshold ε from it. This produces a tight approximation of the silhouette, where pixels with positive values belong to the background, and vice versa. Note that directly checking whether the ray marching stops at infinity can also generate the silhouette, but it is not differentiable.
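A minimal sketch of this soft silhouette (the min-|SDF| values and threshold are illustrative):

```python
import numpy as np

def soft_silhouette(min_abs_sdf, eps):
    """Differentiable silhouette: the minimum |SDF| seen along each ray,
    minus the convergence threshold. Positive -> background; <= 0 -> the
    pixel lies inside the silhouette. The value varies smoothly with the
    geometry, so gradients flow through it."""
    return min_abs_sdf - eps

vals = soft_silhouette(np.array([0.0, 0.5, 0.005]), eps=0.01)
inside = vals <= 0.0   # per-pixel foreground mask
```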

Color and Semantics Recently, it has been shown that texture can also be represented as an implicit function parameterized by a neural network [31]. Beyond color, other spatially varying properties, such as semantics and material, can all potentially be learned by implicit functions. This information can be rendered jointly with the implicit SDF to produce the corresponding 2D observations; some examples are depicted in Fig. 8.

3.4 Approximated Gradient Back-Propagation

DeepSDF [32] uses a conditional implicit function f(p, z) to represent a 3D shape, where z is the latent code representing a certain shape and the network parameters are shared across shapes. As a result, each queried point p in the sphere tracing process is determined by the camera ray and the shape code z, so a loss L on the final sample would require unrolling the network over all tracing iterations and cost huge memory for back-propagation with respect to z:

∂L/∂z = ∂L/∂f · ( ∂f(p, z)/∂z + ∇ₚf(p, z)ᵀ·∂p/∂z ),    (4)

where evaluating ∂p/∂z requires the full recursive computational graph of the marching steps. Practically, we ignore the gradients from the residual term η in Equation (1). To make back-propagation feasible, we define the loss on the samples with the minimum absolute SDF value along each ray, to encourage more signal near the surface. For each sample, we calculate the gradient with only the first term in Equation (4), as the higher-order terms empirically have little impact on the optimization process. In this way, our differentiable renderer is particularly useful for bridging the gap between a strong shape prior and partial observations. Given a certain observation, we can search for the code that minimizes the difference between the rendering from our network and the observation. This enables a number of applications, which will be introduced in the next section.
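A minimal numeric sketch of this approximation on a toy conditioned SDF, where a sphere radius r stands in for the latent code (so ∂f/∂r = −1 everywhere); the traced sample is treated as a constant and only the direct term of the gradient is kept, as described above:

```python
import numpy as np

def f(p, r):
    # Toy "conditioned" SDF: a sphere whose radius r plays the latent code.
    return np.linalg.norm(p) - r

def approx_grad_step(p_surf, r, target_sdf, lr=0.5):
    """One step of the approximated back-propagation: the traced sample
    p_surf is treated as a constant (dropping the second term of the chain
    rule through the tracing iterations), and only df/dr = -1 is kept."""
    residual = f(p_surf, r) - target_sdf
    dL_dr = 2.0 * residual * (-1.0)   # d/dr of (f - target)^2, first term only
    return r - lr * dL_dr

# The observation says this point should lie on the surface (SDF = 0):
p_obs = np.array([0.0, 0.0, 1.2])
r = 1.0
for _ in range(30):
    r = approx_grad_step(p_obs, r, target_sdf=0.0)
```

Despite the dropped terms, the descent drives the radius to 1.2, putting the observed point exactly on the surface; this mirrors why the approximation suffices in practice.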

4 Experiments and Results

In this section, we first verify the efficacy of our differentiable sphere tracing algorithm, and then show that 3D shape understanding can be achieved through geometry based reasoning by our method.

4.1 Rendering Efficiency and Quality

Method                 | max #step | #query | time
Naive sphere tracing   | 50        | N/A    | N/A
+ practical grad.      | 50        | 6.06M  | 1.6h
+ parallel             | 50        | 6.06M  | 3.39s
+ dynamic              | 50        | 1.99M  | 1.23s
+ aggressive           | 50        | 1.43M  | 1.08s
+ coarse-to-fine       | 50        | 887K   | 0.99s
+ coarse-to-fine       | 100       | 898K   | 1.24s
Table 1: Ablation studies on the cost-efficient feedforward design of our method. The feedforward time was tested on a single NVIDIA GTX-1080Ti with the DeepSDF [32] architecture. Note that the number of initialized rays is quadratic in the image size; the numbers are reported at the rendering resolution used in our experiments.
Figure 4: The surface normals rendered with different speed-up strategies turned on (left to right: parallel, + dynamic, + aggressive, + coarse-to-fine). Note that adding these components does not deteriorate the rendering quality.

Run-time Efficiency We evaluate the run-time efficiency gained by each design in our differentiable sphere tracing algorithm. The number of queries and the runtime for both the forward and backward pass on a single NVIDIA GTX-1080Ti are reported in Tab. 1, and the correspondingly rendered surface normals are shown in Fig. 4. The proposed back-propagation prunes the computational graph and reduces memory usage significantly, making the rendering tractable on a standard graphics card. The dynamic synchronized inference, aggressive marching, and coarse-to-fine strategy all speed up rendering. With all these designs, we can render an image with only 887K queries within 0.99s when the maximum tracing step is set to 50. The number of queries increases only slightly when the maximum step is set to 100, indicating that most pixels converge safely within 50 steps. Note that related works usually render at a much lower resolution [42].

Figure 5: Loss curves for 3D prediction from partial depth. Our accelerated rendering does not impair the back-propagation. The loss on the depth image is tightly correlated with the Chamfer distance on 3D shapes, which indicates effective back-propagation.
Figure 6: Illustration of the optimization process over the camera extrinsic parameters (left: initial; right: optimized). Our differentiable renderer is able to propagate the error from the image plane to the camera. Top row: rendered surface normal. Bottom row: error map on the silhouette.

Back-Propagation Effectiveness We conduct sanity checks to verify the effectiveness of back-propagation with our approximated gradient. We take a pre-trained DeepSDF [32] model and run geometry based optimization to recover the 3D shape and the camera extrinsics separately using our differentiable renderer. We first assume the camera pose is known and optimize the latent code for the 3D shape w.r.t. the given ground truth depth, surface normal, and silhouette. As can be seen in Fig. 5 (left), the loss drops quickly, and using the acceleration strategies does not hurt the optimization. Fig. 5 (right) shows that the total loss on the 2D image plane is highly correlated with the Chamfer distance of the predicted 3D shape, indicating that the gradients originating from the 2D observation are successfully back-propagated to the shape. We then assume a known shape (fixed latent code) and optimize the camera pose using depth and a binary silhouette. Fig. 6 shows that a random initial camera pose can be effectively optimized toward the ground truth pose by minimizing the error on the 2D observations.

Convergence Criteria The convergence criterion, i.e., the threshold on the signed distance used to stop the ray tracing, has a direct impact on the rendering quality. Fig. 7 shows rendering results under different thresholds. Rendering with a large threshold dilates the shape and loses boundary details, while a small threshold may produce incomplete geometry. This parameter can be tuned according to the application, but in practice we found that a threshold matched to the image resolution (Sec. 3.2) is effective in producing complete shapes with details up to the image resolution.

Figure 7: Effects of different convergence thresholds. Under the same marching step, an overly large threshold incurs dilation around boundaries, while an overly small threshold may lead to erosion. We pick a threshold matched to the image resolution for all of our experiments.

Rendering Other Properties An implicit function can encode not only the signed distance to a 3D shape but also other spatially varying information. As an example, we train a network to predict both the signed distance and the color of each 3D location, which grants us the capability of rendering color images. In Fig. 8, we show that with a 512-dim latent code learned from textured meshes as ground truth, color images can be rendered at arbitrary resolutions, camera viewpoints, and illumination. Note that the latent code is significantly smaller than the mesh (vertices + triangles + texture map), and thus can potentially be used for model compression. Other per-vertex properties, such as semantic segmentation and material, can also be rendered in the same differentiable way.

Figure 8: Our method can render information encoded in the implicit function other than depth (panels: LR texture; 32× HR texture; HR relighting; HR 2nd view). With a pre-trained network encoding textured meshes, we can render high-resolution color images under various resolutions, camera viewpoints, and illumination.

4.2 3D Shape Prediction

Our differentiable implicit SDF renderer builds a connection between 3D shapes and 2D observations and enables geometry based reasoning. In this section, we show results of 3D shape prediction from a single depth image or multi-view color images, using DeepSDF as the shape generator. At a high level, we take a pre-trained DeepSDF and fix the decoder parameters. Given 2D observations, we define proper loss functions and propagate the gradients back to the latent code, as introduced in Section 3.4, to generate the 3D shape. This method requires no additional training and only runs optimization at test time, which is intuitively less vulnerable to overfitting or domain gap issues than purely learning based approaches. We specifically focus on evaluating the generalization capability while maintaining high shape quality.

4.2.1 3D Shape Prediction from Single Depth Image

With the development of commodity range sensors, dense or sparse depth images can be easily acquired, and several methods have been proposed for 3D shape prediction from a single depth image. DeepSDF [32] has shown state-of-the-art performance for this task, but it requires an offline pre-processing step to lift the input 2D depth map into 3D space in order to sample SDF values with the assistance of surface normals. Our differentiable renderer makes 3D shape prediction from a depth image more convenient by directly rendering a depth image from the latent code and comparing it with the observed depth. Moreover, when a silhouette is available, e.g., computed from the depth or provided with the rendering, our renderer can also leverage it as additional supervision. Formally, we obtain the complete 3D shape by solving the following optimization:

ẑ = argmin_z  L_d(R_d(G(z)), D) + λ·L_s(R_s(G(z)), S),    (5)

where G is the pre-trained neural network encoding shape priors, R_d and R_s represent the rendering functions for depth and silhouette respectively, L_d is the loss against the depth observation D, and L_s is the loss defined on the differentiably rendered silhouette against the observed silhouette S. In our experiments, the initial latent code is chosen as the mean shape.
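A sketch of this test-time optimization on a toy one-parameter shape space (a sphere radius stands in for the latent code, a single ray stands in for the depth image, and a finite-difference gradient stands in for back-propagation through the renderer; all names are illustrative):

```python
import numpy as np

def render_distance(r, c, v, steps=50):
    # Sphere-trace a toy SDF (sphere of radius r) along one ray.
    d = 0.0
    for _ in range(steps):
        p = c + d * v
        s = np.linalg.norm(p) - r
        if abs(s) < 1e-7:
            break
        d += s
    return d

def fit_shape_to_depth(d_obs, c, v, r0=0.5, lr=0.4, iters=100, h=1e-4):
    """Minimize the squared depth error over the 'latent code' r by
    gradient descent, mimicking the depth term of the optimization above."""
    r = r0
    for _ in range(iters):
        e = render_distance(r, c, v) - d_obs
        g = (render_distance(r + h, c, v) - render_distance(r - h, c, v)) / (2 * h)
        r -= lr * 2.0 * e * g
    return r

c, v = np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0])
r_fit = fit_shape_to_depth(d_obs=2.0, c=c, v=v)  # observed distance 2 -> radius 1
```

The loop recovers the radius that explains the observed depth, which is the single-parameter analogue of searching the DeepSDF latent space.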

We test our method and DeepSDF [32] on 200 models from each of the plane, sofa, and table categories of ShapeNet Core [2]. Specifically, for each model, we use the first camera in the dataset of Choy et al. [6] to generate dense depth images for testing. The comparison between DeepSDF and our method is listed in Tab. 2. Our method with only depth supervision performs even better than DeepSDF [32] when a dense depth image is given. This is probably because DeepSDF samples the 3D space with a pre-defined rule (at fixed distances along the normal direction), which may not sample the correct locations, especially near object boundaries or thin structures. In contrast, our differentiable sphere tracing algorithm samples the space adaptively according to the current shape estimate.

            | dense | 50%   | 10%   | 100pts | 50pts | 20pts
DeepSDF     | 5.37  | 5.56  | 5.50  | 5.93   | 6.03  | 7.63
Ours        | 4.12  | 5.75  | 5.49  | 5.72   | 5.57  | 6.95
Ours (mask) | 4.12  | 3.98  | 4.31  | 3.98   | 4.30  | 4.94
DeepSDF     | 3.71  | 3.73  | 4.29  | 4.44   | 4.40  | 5.39
Ours        | 2.18  | 4.08  | 4.81  | 4.44   | 4.51  | 5.30
Ours (mask) | 2.18  | 2.08  | 2.62  | 2.26   | 2.55  | 3.60
DeepSDF     | 12.93 | 12.78 | 11.67 | 12.87  | 13.76 | 15.77
Ours        | 5.37  | 12.05 | 11.42 | 11.70  | 13.76 | 15.83
Ours (mask) | 5.37  | 5.15  | 5.16  | 5.26   | 6.33  | 7.62
Table 2: Quantitative comparison between our geometric optimization and DeepSDF [32] for shape completion from partial dense and sparse depth observations on the ShapeNet dataset [2]; the three blocks correspond to the plane, sofa, and table categories, respectively. We report the median Chamfer Distance on the first 200 instances of the dataset of [6]. We provide DeepSDF [32] with ground-truth normals, without which it could not be applied to sparse depth.

Robustness against sparsity The depth from laser scanners can be very sparse, so we also study the robustness of our method and DeepSDF against sparse depth. Specifically, we randomly sample different percentages or fixed numbers of points from the original dense depth for testing; the results are shown in Tab. 2. To make a competitive baseline, we provide DeepSDF with ground-truth normals for sampling SDF values, since normals cannot be reliably estimated from sparse depth. Even with very sparse depth observations, our method still recovers accurate shapes and consistently outperforms DeepSDF with its additional normal information. When a silhouette is available, our method achieves significantly better performance and robustness against sparsity, indicating that our rendering method can effectively back-propagate gradients from the silhouette loss.

4.2.2 3D Shape Prediction from Multiple Images

Figure 9: Illustration of the optimization process under the multi-view setup (left: video sequence; right: optimization process). Our differentiable renderer is able to successfully recover the 3D geometry from a random code with only the photometric loss.

Our differentiable renderer can also enable geometry based reasoning for shape prediction from multi-view color images. The idea is to leverage cross-view photometric consistency.

Specifically, we first initialize the latent code with a random vector and render a depth map for each input view. We then warp each color image to the other input views using the rendered depth and the known camera poses. The difference between the warped and the input images is then defined as the photometric loss, and the shape can be predicted by minimizing this loss. To sum up, the optimization problem is formulated as follows:

ẑ = argmin_z  Σ_i Σ_{j∈N(i)} || I_i − I_{j→i}(R_d(G(z), i)) ||,    (6)

where R_d(G(z), i) represents the rendered depth image at view i, N(i) denotes the set of neighboring views of view i, and I_{j→i} is the image warped from view j to view i using the rendered depth. Note that no mask is required under the multi-view setup. Fig. 9 shows an example of the optimization process of our method. As can be seen, the shape is gradually improved while the loss is being optimized.
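The warping step is a standard reprojection through the rendered depth; a sketch for a single pixel (the intrinsic values are illustrative, and bilinear sampling of view j's color at the returned location is omitted):

```python
import numpy as np

def warp_pixel(u, v, depth, K, T_i2j):
    """Reproject pixel (u, v) with rendered depth from view i into view j.
    T_i2j is the 4x4 rigid transform from camera i to camera j; sampling
    the color of view j at the returned location yields the warped image."""
    X_i = depth * (np.linalg.inv(K) @ np.array([u, v, 1.0]))  # back-project
    X_j = T_i2j[:3, :3] @ X_i + T_i2j[:3, 3]                  # change of view
    uvw = K @ X_j
    return uvw[:2] / uvw[2]                                   # perspective divide

K = np.array([[60.0,  0.0, 32.0],
              [ 0.0, 60.0, 32.0],
              [ 0.0,  0.0,  1.0]])
# Identity transform: the pixel must map back to itself.
uv = warp_pixel(40.0, 25.0, 2.0, K, np.eye(4))
```

Because the reprojected location depends on the rendered depth, the photometric difference between the input and warped images carries gradients back to the latent code.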

Method car plane
PMO (original) 0.661 1.129
PMO (rand init) 1.187 6.124
Ours (rand init) 0.919 1.595
Table 3: Quantitative results on 3D shape prediction from multi-view images under the Chamfer Distance metric. We randomly picked 50 instances from the PMO test set for evaluation; 10,000 points are sampled from each mesh.

We take PMO [23] as a competitive baseline, since it also performs deep learning based geometric reasoning, but with a triangular mesh representation. Their model first predicts an initial mesh directly from a selected input view and then applies cross-view photo-consistency to improve its quality. Both the synthetic and real datasets provided in [23] are used for evaluation.

In Tab. 3, we show a quantitative comparison with PMO on their synthetic test set. Our method achieves results comparable to PMO [23] from only a random initialization. Note that while PMO uses both an encoder and a decoder trained on the PMO training set, our DeepSDF decoder was neither trained nor finetuned on it. Moreover, if the shape code for PMO is also initialized randomly, instead of being predicted by their trained image encoder, their performance degrades dramatically, which indicates that our rendering method makes geometric reasoning more effective.

Generalization Capability To further evaluate generalization capability, we compare with PMO on unseen data and initializations. We first evaluate both methods on a test set generated with different camera focal lengths; the quantitative comparison is shown in Fig. 10 (a). It clearly shows that our method generalizes well to the new images, while PMO suffers from overfitting or the domain gap. To further test the effectiveness of the geometric reasoning, we also add random noise directly to the initial latent code. The performance of PMO again drops significantly, while our method is unaffected since its initialization is already random (Fig. 10 (b)). Qualitative results are shown in Fig. 11. Our method produces accurate shapes with detailed surfaces. In contrast, PMO suffers from two main issues: 1) its low-resolution mesh cannot maintain geometric details; 2) its geometric reasoning depends heavily on the initialization from the image encoder.

We further show a comparison on real data in Fig. 12. Following PMO, we use the provided rough initial similarity transformation to align the camera poses to the canonical frame. Both methods perform worse on this challenging dataset; in comparison, our method produces shapes with higher quality and correct structure, while PMO only recovers a very rough shape. Overall, our method shows better generalization capability and robustness against domain change.

Figure 10: Robustness of geometric reasoning via multi-view photometric optimization. (a) Performance w.r.t. changes in camera focal length. (b) Performance w.r.t. noise in the initialization code. Our model is robust against focal-length changes and unaffected by noise in the latent code, since we start from a random initialization. In contrast, PMO is very sensitive to both factors, and its performance drops significantly when the testing images differ from the training set.
Figure 11: Comparison on 3D shape prediction from multi-view images on the PMO test set (columns: input video sequence; PMO with random initialization; PMO; ours). Our method maintains fine surface details, while PMO is limited by its mesh representation and may fail to optimize the shape effectively.
Figure 12: Comparison on 3D shape prediction from multi-view images on a real-world dataset [5] (columns: input video sequence; PMO; ours). Shape prediction from real images is challenging in general; comparatively, our method produces more reasonable results with correct structure.

5 Conclusion

We propose a differentiable sphere tracing algorithm to render 2D observations, such as depth maps, normals, and silhouettes, from an implicit signed distance function parameterized as a neural network. In conjunction with this high-capacity 3D neural representation, it enables geometric reasoning for 3D shape prediction from both single and multiple views. Extensive experiments show that our geometry-based optimization produces 3D shapes more accurate than the state of the art, generalizes well to new datasets, and is robust to imperfect or partial observations. Promising directions for our renderer include self-supervised learning, jointly recovering other properties with geometry, and neural image rendering.


  • [1] B. G. Baumgart (1974) Geometric modeling for computer vision. Technical report, Stanford University, Department of Computer Science. Cited by: §1.
  • [2] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: §4.2.1, Table 2.
  • [3] W. Chen, J. Gao, H. Ling, E. J. Smith, J. Lehtinen, A. Jacobson, and S. Fidler (2019) Learning to predict 3d objects with an interpolation-based differentiable renderer. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: §2.
  • [4] Z. Chen and H. Zhang (2019) Learning implicit fields for generative shape modeling. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 5939–5948. Cited by: §2.
  • [5] S. Choi, Q. Zhou, S. Miller, and V. Koltun (2016) A large dataset of object scans. arXiv preprint arXiv:1602.02481. Cited by: Figure 12.
  • [6] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese (2016) 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In Proc. of European Conference on Computer Vision (ECCV), pp. 628–644. Cited by: §2, §2, §4.2.1, Table 2.
  • [7] Z. Cui, J. Gu, B. Shi, P. Tan, and J. Kautz (2017) Polarimetric multi-view stereo. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 1558–1567. Cited by: §2.
  • [8] A. Dai and M. Nießner (2019) Scan2Mesh: from unstructured range scans to 3d meshes. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 5574–5583. Cited by: §2.
  • [9] A. Dai, C. Ruizhongtai Qi, and M. Nießner (2017) Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 5868–5877. Cited by: §2, §2.
  • [10] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta (2016) Learning a predictable and generative vector representation for objects. In Proc. of European Conference on Computer Vision (ECCV), pp. 484–499. Cited by: §2.
  • [11] T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry (2018) AtlasNet: a papier-mâché approach to learning 3d surface generation. In Proc. of Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
  • [12] C. Häne, S. Tulsiani, and J. Malik (2017) Hierarchical surface prediction for 3d object reconstruction. In Proc. of International Conference on 3D Vision (3DV), pp. 412–420. Cited by: §2.
  • [13] J. C. Hart (1996) Sphere tracing: a geometric method for the antialiased ray tracing of implicit surfaces. The Visual Computer 12 (10). Cited by: Figure 2, §1, §3.1, §3.
  • [14] C. Hernandez, G. Vogiatzis, and R. Cipolla (2008) Multiview photometric stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (3), pp. 548–554. Cited by: §2.
  • [15] P. Huang, K. Matzen, J. Kopf, N. Ahuja, and J. Huang (2018) Deepmvs: learning multi-view stereopsis. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 2821–2830. Cited by: §2.
  • [16] S. Im, H. Jeon, S. Lin, and I. S. Kweon (2019) DPSNet: end-to-end deep plane sweep stereo. arXiv preprint arXiv:1905.00538. Cited by: §2.
  • [17] E. Insafutdinov and A. Dosovitskiy (2018) Unsupervised learning of shape and pose with differentiable point clouds. In Advances in Neural Information Processing Systems (NeurIPS), pp. 2802–2812. Cited by: §2.
  • [18] A. Johnston, R. Garg, G. Carneiro, I. Reid, and A. van den Hengel (2017) Scaling cnns for high resolution volumetric reconstruction from a single image. In Proc. of International Conference on Computer Vision (ICCV), pp. 939–948. Cited by: §2.
  • [19] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2018) End-to-end recovery of human shape and pose. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 7122–7131. Cited by: §2.
  • [20] H. Kato, Y. Ushiku, and T. Harada (2018) Neural 3d mesh renderer. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 3907–3916. Cited by: §2.
  • [21] C. Kong, C. Lin, and S. Lucey (2017) Using locally corresponding cad models for dense 3d reconstructions from a single image. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 4857–4865. Cited by: §2.
  • [22] T. Li, M. Aittala, F. Durand, and J. Lehtinen (2018) Differentiable monte carlo ray tracing through edge sampling. In Proc. of ACM SIGGRAPH, pp. 222. Cited by: §1, §2.
  • [23] C. Lin, O. Wang, B. C. Russell, E. Shechtman, V. G. Kim, M. Fisher, and S. Lucey (2019) Photometric mesh optimization for video-aligned 3d object reconstruction. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 969–978. Cited by: §4.2.2, §4.2.2.
  • [24] S. Liu, W. Chen, T. Li, and H. Li (2019) Soft rasterizer: differentiable rendering for unsupervised single-view mesh reconstruction. arXiv preprint arXiv:1901.05567. Cited by: §2.
  • [25] S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann, and Y. Sheikh (2019) Neural volumes: learning dynamic renderable volumes from images. Proc. of ACM SIGGRAPH. Cited by: §2.
  • [26] M. M. Loper and M. J. Black (2014) OpenDR: an approximate differentiable renderer. In Proc. of European Conference on Computer Vision (ECCV), pp. 154–169. Cited by: §2.
  • [27] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 4460–4470. Cited by: §2.
  • [28] M. Michalkiewicz, J. K. Pontes, D. Jack, M. Baktashmotlagh, and A. Eriksson (2019) Deep level sets: implicit surface representations for 3d shape inference. arXiv preprint arXiv:1901.06802. Cited by: §2.
  • [29] T. H. Nguyen-Phuoc, C. Li, S. Balaban, and Y. Yang (2018) Rendernet: a deep convolutional network for differentiable rendering from 3d shapes. In Advances in Neural Information Processing Systems (NeurIPS), pp. 7891–7901. Cited by: §2.
  • [30] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger (2019) Occupancy flow: 4d reconstruction by learning particle dynamics. In Proc. of International Conference on Computer Vision (ICCV), pp. 5379–5389. Cited by: §2.
  • [31] M. Oechsle, L. Mescheder, M. Niemeyer, T. Strauss, and A. Geiger (2019) Texture fields: learning texture representations in function space. In Proc. of International Conference on Computer Vision (ICCV). Cited by: §2, §3.3.
  • [32] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) DeepSDF: learning continuous signed distance functions for shape representation. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 165–174. Cited by: §1, §1, §2, §2, §3.4, §3, §4.1, §4.2.1, §4.2.1, Table 1, Table 2.
  • [33] G. Patow and X. Pueyo (2003) A survey of inverse rendering problems. In Computer graphics forum, Vol. 22, pp. 663–687. Cited by: §1.
  • [34] F. Petersen, A. H. Bermano, O. Deussen, and D. Cohen-Or (2019) Pix2Vex: image-to-geometry reconstruction using a smooth differentiable renderer. arXiv preprint arXiv:1903.11149. Cited by: §2.
  • [35] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 652–660. Cited by: §2, §2.
  • [36] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5099–5108. Cited by: §2.
  • [37] G. Riegler, A. Osman Ulusoy, and A. Geiger (2017) Octnet: learning deep 3d representations at high resolutions. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 3577–3586. Cited by: §2.
  • [38] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li (2019) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In Proc. of International Conference on Computer Vision (ICCV). Cited by: §2.
  • [39] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski (2006) A comparison and evaluation of multi-view stereo reconstruction algorithms. In Proc. of Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 519–528. Cited by: §2.
  • [40] B. Semerjian (2014) A new variational framework for multiview surface reconstruction. In Proc. of European Conference on Computer Vision (ECCV), pp. 719–734. Cited by: §2.
  • [41] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhofer (2019) Deepvoxels: learning persistent 3d feature embeddings. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 2437–2446. Cited by: §2.
  • [42] V. Sitzmann, M. Zollhöfer, and G. Wetzstein (2019) Scene representation networks: continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2, §4.1.
  • [43] D. Stutz and A. Geiger (2018) Learning 3d shape completion under weak supervision. International Journal of Computer Vision (IJCV), pp. 1–20. Cited by: §2.
  • [44] M. Tatarchenko, A. Dosovitskiy, and T. Brox (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3d outputs. In Proc. of International Conference on Computer Vision (ICCV), pp. 2088–2096. Cited by: §2.
  • [45] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang (2018) Pixel2mesh: generating 3d mesh models from single rgb images. In Proc. of European Conference on Computer Vision (ECCV), pp. 52–67. Cited by: §2.
  • [46] Y. Wang, F. Serena, S. Wu, C. Öztireli, and O. Sorkine-Hornung (2019) Differentiable surface splatting for point-based geometry processing. Proc. of ACM SIGGRAPH Asia. Cited by: §2.
  • [47] Y. Wei, S. Liu, W. Zhao, and J. Lu (2019-06) Conditional single-view shape generation for multi-view stereo reconstruction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [48] C. Wen, Y. Zhang, Z. Li, and Y. Fu (2019) Pixel2Mesh++: multi-view 3d mesh generation via deformation. In Proc. of International Conference on Computer Vision (ICCV). Cited by: §2.
  • [49] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum (2016) Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems (NeurIPS), pp. 82–90. Cited by: §2.
  • [50] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3d shapenets: a deep representation for volumetric shapes. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 1912–1920. Cited by: §2.
  • [51] Y. Yao, Z. Luo, S. Li, T. Fang, and L. Quan (2018) Mvsnet: depth inference for unstructured multi-view stereo. In Proc. of European Conference on Computer Vision (ECCV), pp. 767–783. Cited by: §2.
  • [52] Y. Yu, P. Debevec, J. Malik, and T. Hawkins (1999) Inverse global illumination: recovering reflectance models of real scenes from photographs. In Proc. of ACM SIGGRAPH, Vol. 99, pp. 215–224. Cited by: §1.
  • [53] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser (2017) 3dmatch: learning local geometric descriptors from rgb-d reconstructions. In Proc. of Computer Vision and Pattern Recognition (CVPR), pp. 1802–1811. Cited by: §2.