Computer graphics, 3D computer vision, and robotics communities have produced multiple approaches to representing 3D geometry for rendering and reconstruction. These provide trade-offs across fidelity, efficiency, and compression capabilities. In this work, we introduce DeepSDF, a learned continuous Signed Distance Function (SDF) representation of a class of shapes that enables high-quality shape representation, interpolation, and completion from partial and noisy 3D input data. DeepSDF, like its classical counterpart, represents a shape's surface by a continuous volumetric field: the magnitude of a point in the field represents the distance to the surface boundary, and the sign indicates whether the region is inside (−) or outside (+) the shape. Hence, our representation implicitly encodes a shape's boundary as the zero-level-set of the learned function while explicitly representing the classification of space as being part of the shape's interior or not. While classical SDFs, in both analytical and discretized voxel form, typically represent the surface of a single shape, DeepSDF can represent an entire class of shapes. Furthermore, we show state-of-the-art performance for learned 3D shape representation and completion while reducing the model size by an order of magnitude compared with previous work.
Deep convolutional networks, which are a mainstay of image-based approaches, grow quickly in space and time complexity when directly generalized to the third spatial dimension, while more classical and compact surface representations such as triangle or quad meshes pose problems in training, since we may need to deal with an unknown number of vertices and arbitrary topology. These challenges have limited the quality, flexibility, and fidelity of deep learning approaches when attempting to either input 3D data for processing or produce 3D inferences for object segmentation and reconstruction.
In this work, we present a novel representation and approach for generative 3D modeling that is efficient, expressive, and fully continuous. Our approach uses the concept of an SDF, but unlike common surface reconstruction techniques, which discretize this SDF into a regular grid for evaluation and measurement denoising, we instead learn a generative model to produce such a continuous field.
The proposed continuous representation may be intuitively understood as a learned shape-conditioned classifier for which the decision boundary is the surface of the shape itself, as shown in Fig. 2. Our approach shares the generative aspect of other works seeking to map a latent space to a distribution of complex shapes in 3D, but critically differs in the central representation. While the notion of an implicit surface defined as an SDF is widely known in the computer vision and graphics communities, to our knowledge no prior works have attempted to directly learn continuous, generalizable 3D generative models of SDFs.
Our contributions include: (i) the formulation of generative shape-conditioned 3D modeling with a continuous implicit surface, (ii) a learning method for 3D shapes based on a probabilistic auto-decoder, and (iii) the demonstration and application of this formulation to shape modeling and completion. Our models produce high-quality continuous surfaces with complex topologies and obtain state-of-the-art results in quantitative comparisons for shape reconstruction and completion. As an example of the effectiveness of our method, our models use only 7.4 MB of memory to represent entire classes of shapes (for example, thousands of 3D chair models) — less than half the memory footprint (16.8 MB) of a single uncompressed 512³ 3D bitmap.
Representations for data-driven 3D learning approaches can be largely classified into three categories: point-based, mesh-based, and voxel-based methods. While some applications such as 3D-point-cloud-based object classification are well suited to these representations, we address their limitations in expressing continuous surfaces with complex topologies.
Point-based. A point cloud is a lightweight 3D representation that closely matches the raw data that many sensors (e.g., LiDARs, depth cameras) provide, and hence is a natural fit for applying 3D learning. PointNet [38, 39], for example, uses max-pool operations to extract global shape features, and the technique is widely used as an encoder for point generation networks [57, 1]. There is a sizable list of related works following the PointNet style of learning on point clouds. A primary limitation of learning with point clouds, however, is that they do not describe topology and are not suitable for producing watertight surfaces.
Mesh-based. Various approaches represent classes of similarly shaped objects, such as morphable human body parts, with predefined template meshes, and some of these models demonstrate high-fidelity shape generation results [2, 34]. Other recent works use poly-cube mapping for shape optimization. While the use of template meshes is convenient and naturally provides 3D correspondences, it can only model shapes with a fixed mesh topology.
Other mesh-based methods use existing [48, 36] or learned [22, 23] parameterization techniques to describe 3D surfaces by morphing 2D planes. The quality of such representations depends on parameterization algorithms that are often sensitive to input mesh quality and cutting strategies. To address this, recent data-driven approaches [57, 22] learn the parameterization task with deep networks. They report, however, that (a) multiple planes are required to describe complex topologies, but (b) the generated surface patches are not stitched, i.e., the produced shape is not closed. To generate a closed mesh, sphere parameterization may be used [22, 23], but the resulting shape is limited to the topological sphere. Other works related to learning on meshes propose new convolution and pooling operations for meshes [17, 53] or general graphs.
Voxel-based. Voxels, which non-parametrically describe volumes with 3D grids of values, are perhaps the most natural extension into the 3D domain of the well-known learning paradigms (e.g., convolution) that have excelled in the 2D image domain. The most straightforward variant of voxel-based learning is to use a dense occupancy grid (occupied / not occupied). Due to the cubically growing compute and memory requirements, however, current methods are only able to handle low resolutions. As such, voxel-based approaches do not preserve fine shape details [56, 14]; additionally, voxels visually appear significantly different from high-fidelity shapes, since their normals are not smooth when rendered. Octree-based methods [52, 43, 26] alleviate the compute and memory limitations of dense voxel methods, extending the ability to learn at higher resolutions, but even these resolutions are far from producing visually compelling shapes.
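The cubic growth mentioned above is easy to make concrete. A minimal sketch (the resolutions and the one-byte-per-voxel cost are illustrative assumptions of ours, not figures from the paper):

```python
def dense_grid_bytes(resolution, bytes_per_voxel=1):
    """Memory footprint of a dense cubic occupancy grid: the cost grows
    as the cube of the per-axis resolution."""
    return resolution ** 3 * bytes_per_voxel

# Doubling the per-axis resolution multiplies the memory requirement by eight.
for r in (32, 64, 128, 256):
    print(f"{r}^3 grid: {dense_grid_bytes(r) / 2**20:.2f} MiB")
```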
Aside from occupancy grids, and more closely related to our approach, it is also possible to use a 3D grid of voxels to represent a signed distance function. This inherits from the success of fusion approaches that utilize a truncated SDF (TSDF), pioneered in [15, 37], to combine noisy depth maps into a single 3D model. Voxel-based SDF representations have been extensively used for 3D shape learning [59, 16, 49], but their use of discrete voxels is expensive in memory. As a result, learned discrete SDF approaches generally produce low-resolution shapes. Some works report wavelet transform-based approaches for distance field compression, while others apply dimensionality reduction techniques to discrete TSDF volumes. These methods encode the SDF volume of each individual scene rather than a dataset of shapes.
Modern representation learning techniques aim at automatically discovering a set of features that compactly but expressively describe data. For a more extensive review of the field, we refer to Bengio et al.
Generative Adversarial Networks. GANs and their variants [13, 41] learn deep embeddings of target data by training discriminators adversarially against generators. Applications of this class of networks [29, 31] generate realistic images of humans, objects, or scenes. On the downside, adversarial training for GANs is known to be unstable. In the 3D domain, Wu et al. train a GAN to generate objects in a voxel representation, while the recent work of Hamu et al. uses multiple parameterization planes to generate shapes of topological spheres.
Auto-encoders. Auto-encoder outputs are expected to replicate the original input given the constraint of an information bottleneck between the encoder and decoder. The capability of auto-encoders as a feature learning tool is evidenced by the wide variety of 3D shape learning works in the literature [16, 49, 2, 22, 55] that adopt auto-encoders for representation learning. Recent 3D vision works [6, 2, 34] often adopt a variational auto-encoder (VAE) learning scheme, in which bottleneck features are perturbed with Gaussian noise to encourage smooth and complete latent spaces. The regularization on the latent vectors enables exploring the embedding space with gradient descent or random sampling.
Optimizing Latent Vectors. Instead of using the full auto-encoder for representation learning, an alternative is to learn compact data representations by training decoder-only networks. This idea goes back at least to the work of Tan et al., which simultaneously optimizes the latent vectors assigned to each data point and the decoder weights through back-propagation. For inference, with decoder parameters fixed, an optimal latent vector is searched for to match the new observation. Similar approaches have been extensively studied in [42, 8, 40] for applications including noise reduction, missing measurement completion, and fault detection. Recent approaches [7, 20] extend the technique with deep architectures. Throughout the paper we refer to this class of networks as auto-decoders, for they are trained with a self-reconstruction loss on decoder-only architectures.
Related works on 3D shape completion aim to infer unseen parts of the original shape given sparse or partial input observations. This task is analogous to image inpainting in 2D computer vision.
Classical surface reconstruction methods complete a point cloud into a dense surface by fitting radial basis functions (RBFs) to approximate an implicit surface function, or by casting reconstruction from oriented point clouds as a Poisson problem. These methods model only a single shape rather than a dataset.
Various recent methods use data-driven approaches for the 3D completion task. Most of these methods adopt encoder-decoder architectures to reduce partial inputs of occupancy voxels, discrete SDF voxels, depth maps, RGB images [14, 55], or point clouds into a latent vector and subsequently generate a prediction of the full volumetric shape based on learned priors.
In this section we present DeepSDF, our continuous shape learning approach. We describe modeling shapes as the zero iso-surface decision boundaries of feed-forward networks trained to represent SDFs. A signed distance function is a continuous function that, for a given spatial point, outputs the point's distance to the closest surface, whose sign encodes whether the point is inside (negative) or outside (positive) of the watertight surface:

SDF(x) = s,  x ∈ ℝ³, s ∈ ℝ.
The underlying surface is implicitly represented by the zero iso-surface SDF(·) = 0. A view of this implicit surface can be rendered through raycasting or rasterization of a mesh obtained with, for example, Marching Cubes.
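For intuition, the analytic SDF of a sphere makes the sign convention and the zero level set concrete (a toy example of ours, not part of the paper's pipeline):

```python
import math

def sphere_sdf(x, y, z, radius=1.0):
    """Analytic signed distance to a sphere centered at the origin:
    negative inside, positive outside, zero exactly on the surface."""
    return math.sqrt(x * x + y * y + z * z) - radius

# The surface is the zero level set of the field.
print(sphere_sdf(0.0, 0.0, 0.0))  # inside  -> -1.0
print(sphere_sdf(2.0, 0.0, 0.0))  # outside -> 1.0
print(sphere_sdf(1.0, 0.0, 0.0))  # surface -> 0.0
```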
Our key idea is to directly regress the continuous SDF from point samples using deep neural networks. The resulting trained network is able to predict the SDF value of a given query position, from which we can extract the zero level-set surface by evaluating spatial samples. Such a surface representation can be intuitively understood as a learned binary classifier for which the decision boundary is the surface of the shape itself, as depicted in Fig. 2. As universal function approximators, deep feed-forward networks can in theory learn fully continuous shape functions with arbitrary precision. In practice, however, the precision of the approximation is limited by the finite number of point samples that guide the decision boundaries and by the finite capacity of the network due to restricted compute power.
The most direct application of this approach is to train a single deep network for a given target shape, as depicted in Fig. 2(a). Given a target shape, we prepare a set X of pairs composed of 3D point samples and their SDF values:

X := {(x, s) : SDF(x) = s}.
We train the parameters θ of a multi-layer fully-connected neural network f_θ on the training set X to make it a good approximator of the given SDF in the target domain Ω:

f_θ(x) ≈ SDF(x)  for all x ∈ Ω.
The training is done by minimizing the sum over losses between the predicted and real SDF values of points in X under the following loss function:

L(f_θ(x), s) = | clamp(f_θ(x), δ) − clamp(s, δ) |,

where clamp(x, δ) := min(δ, max(−δ, x)) introduces the parameter δ to control the distance from the surface over which we expect to maintain a metric SDF. Larger values of δ allow for fast ray-tracing, since each sample gives information about safe step sizes; smaller values of δ can be used to concentrate network capacity on details near the surface.
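A minimal Python sketch of this clamped loss (function names and the default clamp value are ours):

```python
def clamp(value, delta):
    """Restrict value to the interval [-delta, delta]."""
    return min(delta, max(-delta, value))

def clamped_l1_loss(pred_sdf, true_sdf, delta=0.1):
    """L1 difference between clamped predicted and ground-truth SDF values.
    Points farther than delta from the surface stop contributing once the
    prediction lies beyond delta on the correct side."""
    return abs(clamp(pred_sdf, delta) - clamp(true_sdf, delta))

# Near the surface the loss behaves like plain L1 ...
near = clamped_l1_loss(0.05, 0.02)
# ... while far from the surface both values clamp and the loss vanishes.
far = clamped_l1_loss(0.5, 0.3)
print(near, far)
```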
To generate the 3D model shown in Fig. 2(a), we use a feed-forward network composed of eight fully connected layers, each applied with dropout. All internal layers are 512-dimensional and have ReLU non-linearities. The output non-linearity regressing the SDF value is tanh. We found training with batch normalization to be unstable and applied the weight normalization technique instead. For training, we use the Adam optimizer. Once trained, the surface is implicitly represented as the zero iso-surface of f_θ, which can be visualized through raycasting or Marching Cubes. Another nice property of this approach is that accurate normals can be computed analytically by calculating the spatial derivative of f_θ via back-propagation through the network.
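The normal computation can be illustrated numerically: the surface normal is the normalized spatial gradient of the SDF. A toy sketch of ours using finite differences on an analytic sphere SDF (in a trained model, the same gradient is obtained analytically by back-propagation through the network):

```python
import math

def sphere_sdf(p, radius=1.0):
    """Analytic SDF of a sphere at the origin, a stand-in for a trained network."""
    return math.sqrt(sum(c * c for c in p)) - radius

def sdf_normal(sdf, p, eps=1e-5):
    """Approximate the surface normal at p as the normalized spatial
    gradient of the SDF, via central finite differences."""
    grad = []
    for i in range(3):
        hi, lo = list(p), list(p)
        hi[i] += eps
        lo[i] -= eps
        grad.append((sdf(hi) - sdf(lo)) / (2.0 * eps))
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / norm for g in grad]

# On the sphere's surface the normal points radially outward.
print(sdf_normal(sphere_sdf, [1.0, 0.0, 0.0]))
```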
Training a specific neural network for each shape is neither feasible nor very useful. Instead, we want a model that can represent a wide variety of shapes, discover their common properties, and embed them in a low-dimensional latent space. To this end, we introduce a latent vector z, which can be thought of as encoding the desired shape, as a second input to the neural network, as depicted in Fig. 2(b). Conceptually, we map this latent vector to a 3D shape represented by a continuous SDF. Formally, for some shape indexed by i, f_θ is now a function of a latent code z_i and a query 3D location x, and outputs the shape's approximate SDF:

f_θ(z_i, x) ≈ SDF^i(x).
By conditioning the network output on a latent vector, this formulation allows modeling multiple SDFs with a single neural network. Given the decoding model f_θ, the continuous surface associated with a latent vector z is similarly represented by the decision boundary f_θ(z, x) = 0, and the shape can again be discretized for visualization by, for example, raycasting or Marching Cubes.
Next, we motivate the use of encoder-less training before introducing the ‘auto-decoder’ formulation of the shape-coded DeepSDF.
Different from an auto-encoder, whose latent code is produced by the encoder, an auto-decoder directly accepts a latent vector as an input. A randomly initialized latent vector is assigned to each data point at the beginning of training, and the latent vectors are optimized along with the decoder weights through standard backpropagation. During inference, decoder weights are fixed and an optimal latent vector is estimated.
Auto-encoders and encoder-decoder networks are widely used for representation learning as their bottleneck features tend to form natural latent variable representations.
Recently, in applications such as modeling depth maps, faces, and body shapes, a full auto-encoder is trained but only the decoder is retained for inference, where an optimal latent vector is searched for given some input observation. However, since the trained encoder is unused at test time, it is unclear whether training the encoder is the most effective use of computational resources. This motivates us to use an auto-decoder for learning a shape embedding without an encoder, as depicted in Fig. 4.
We show that applying an auto-decoder to learn continuous SDFs leads to high quality 3D generative models. Further, we develop a probabilistic formulation for training and testing the auto-decoder that naturally introduces latent space regularization for improved generalization. To the best of our knowledge, this work is the first to introduce the auto-decoder learning method to the 3D learning community.
To derive the auto-decoder-based shape-coded DeepSDF formulation we adopt a probabilistic perspective. Given a dataset of shapes represented with signed distance functions, we prepare a set of point samples and their signed distance values for each shape i:

X_i := {(x_j, s_j) : s_j = SDF^i(x_j)}.
For an auto-decoder, as there is no encoder, each latent code z_i is paired with training shape X_i. The posterior over shape code z_i given the shape SDF samples X_i can be decomposed as:

p_θ(z_i | X_i) = p(z_i) ∏_{(x_j, s_j) ∈ X_i} p_θ(s_j | z_i; x_j),
where θ parameterizes the SDF likelihood. In the latent shape-code space, we assume the prior distribution over codes p(z_i) to be a zero-mean multivariate Gaussian with a spherical covariance σ²I. This prior encapsulates the notion that the shape codes should be concentrated; we empirically found it was needed to infer a compact shape manifold and to help converge to good solutions.
In the auto-decoder-based DeepSDF formulation we express the SDF likelihood via a deep feed-forward network f_θ(z_i, x_j) and, without loss of generality, assume that the likelihood takes the form:

p_θ(s_j | z_i; x_j) = exp(−L(f_θ(z_i, x_j), s_j)).
The SDF prediction f_θ(z_i, x_j) is represented using a fully-connected network. L(f_θ(z_i, x_j), s_j) is a loss function penalizing the deviation of the network prediction from the actual SDF value s_j. One example of the cost function is the standard L2 loss, which amounts to assuming Gaussian noise on the SDF values. In practice we use the clamped cost from Eq. 4 for the reasons outlined previously.
At training time we maximize the joint log posterior over all training shapes with respect to the individual shape codes {z_i} and the network parameters θ:

argmin_{θ, {z_i}} Σ_i ( Σ_j L(f_θ(z_i, x_j), s_j) + (1/σ²) ‖z_i‖₂² ).
At inference time, after training and fixing θ, a shape code z for shape X can be estimated via maximum a posteriori (MAP) estimation:

ẑ = argmin_z Σ_{(x_j, s_j) ∈ X} L(f_θ(z, x_j), s_j) + (1/σ²) ‖z‖₂².
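This MAP inference can be sketched with a toy stand-in (entirely our illustration: the "decoder" is a fixed linear map rather than the paper's trained network, the loss is squared error rather than the clamped cost, and plain gradient descent replaces Adam):

```python
def decoder(z, x):
    """Frozen stand-in decoder: a fixed linear map from (code, location) to a value."""
    return z[0] * x[0] + z[1] * x[1]

def infer_code(samples, sigma=1.0, lr=0.01, steps=2000):
    """MAP estimate of the latent code: minimize the data loss plus the
    Gaussian-prior regularizer (1/sigma^2)||z||^2, decoder weights fixed."""
    z = [0.0, 0.0]
    for _ in range(steps):
        # Gradient of the prior term (1/sigma^2) * ||z||^2.
        g = [2.0 * z[0] / sigma ** 2, 2.0 * z[1] / sigma ** 2]
        # Gradient of a squared-error data term over all observed samples.
        for x, s in samples:
            err = decoder(z, x) - s
            g[0] += 2.0 * err * x[0]
            g[1] += 2.0 * err * x[1]
        z = [z[0] - lr * g[0], z[1] - lr * g[1]]
    return z

# Observations generated by a "true" code [1.0, -0.5]; the prior shrinks
# the estimate toward the origin, so the optimum is [0.5, -0.25] for sigma=1.
obs = [([1.0, 0.0], 1.0), ([0.0, 1.0], -0.5)]
print(infer_code(obs))
```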
Crucially, this formulation is valid for SDF samples X of arbitrary size and distribution, because the gradient of the loss with respect to z can be computed separately for each SDF sample. This implies that DeepSDF can handle any form of partial observation, such as depth maps. This is a major advantage over the auto-encoder framework, whose encoder expects a test input similar to the training data; e.g., the shape completion networks of [16, 58] require preparing training data of partial shapes.
To incorporate the latent shape code, we stack the code vector z and the sample location x, as depicted in Fig. 2(b), and feed the result into the same fully-connected network described previously, at the input layer and additionally at the fourth layer. We again use the Adam optimizer. The latent vector is initialized randomly from a zero-mean Gaussian.
Note that while both the VAE and the proposed auto-decoder formulation share the zero-mean Gaussian prior on the latent codes, we found that the stochastic nature of the VAE optimization did not lead to good training results.
| Method | Type | Discretization | Arb. topologies | Closed surfaces | Surface normals | Model size (GB) | Inference time (s) | Tasks |
|---|---|---|---|---|---|---|---|---|
| 3D-EPN | Voxel SDF | voxels | ✓ | ✓ | ✓ | 0.42 | - | C |
| AtlasNet (1 patch) | Parametric | 1 patch | | ✓ | | 0.015 | 0.01 | K, U |
| AtlasNet (25 patches) | Parametric | 25 patches | ✓ | | | 0.172 | 0.32 | K, U |
| DeepSDF | Continuous | none | ✓ | ✓ | ✓ | 0.0074 | 9.72 | K, U, C |
To train our continuous SDF model, we prepare the SDF samples (Eq. 2) for each mesh, which consist of 3D points and their SDF values. While an SDF can be computed through a distance transform for any watertight shape, from real or synthetic data, we train with synthetic objects (e.g., ShapeNet), for which complete 3D shape meshes are provided. To prepare data, we start by normalizing each mesh to a unit sphere and sampling 500,000 spatial points x: we sample more aggressively near the surface of the object, as we want to capture a more detailed SDF there. For an ideal oriented watertight mesh, computing the signed distance value of x would only involve finding the closest triangle, but we find that human-designed meshes are commonly not watertight and contain undesired internal structures. To obtain the shell of a mesh with proper orientation, we set up equally spaced virtual cameras around the object and densely sample surface points, with surface normals oriented toward the camera. Double-sided triangles visible from both orientations (indicating that the shape is not closed) cause problems in this case, so we discard mesh objects containing too many such faces. Then, for each x, we find the closest point in this oriented surface sample set, from which the SDF value can be computed. We refer readers to the supplementary material for further details.
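The final step can be sketched as follows (a simplification of ours: a brute-force nearest-point search, with the sign taken from the side of the nearest sample's oriented tangent plane):

```python
import math

def approx_sdf(p, oriented_samples):
    """Approximate SDF of query point p from oriented surface samples (q, n):
    magnitude is the distance to the nearest sample; the sign comes from
    which side of that sample's oriented tangent plane p falls on."""
    best_dist, best_side = None, 0.0
    for q, n in oriented_samples:
        d = [p[i] - q[i] for i in range(3)]
        dist = math.sqrt(sum(c * c for c in d))
        if best_dist is None or dist < best_dist:
            best_dist = dist
            best_side = sum(d[i] * n[i] for i in range(3))
    return best_dist if best_side >= 0.0 else -best_dist

# Two oriented samples on a unit sphere (outward-facing normals).
surf = [((1.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
        ((-1.0, 0.0, 0.0), (-1.0, 0.0, 0.0))]
print(approx_sdf((1.5, 0.0, 0.0), surf))  # outside: positive
print(approx_sdf((0.5, 0.0, 0.0), surf))  # inside: negative
```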
We conduct a number of experiments to show the representational power of DeepSDF, both in terms of its ability to describe geometric details and its generalization capability to learn a desirable shape embedding space. Largely, we propose four main experiments designed to test its ability to 1) represent training data, 2) use the learned feature representation to reconstruct unseen shapes, 3) apply shape priors to complete partial shapes, and 4) learn a smooth and complete shape embedding space from which we can sample new shapes. For all experiments we use the popular ShapeNet dataset.
We select a representative set of 3D learning approaches to comparatively evaluate the aforementioned criteria: a recent octree-based method (OGN), a mesh-based method (AtlasNet), and a volumetric SDF-based shape completion method (3D-EPN) (Table 1). These works show state-of-the-art performance in their respective representations and tasks, so we omit comparisons with works that have already been compared against them: e.g., OGN's octree model outperforms regular voxel approaches, while AtlasNet compares itself with various point-, mesh-, or voxel-based methods and 3D-EPN with various completion methods.
First, we evaluate the capacity of the model to represent known shapes, i.e. shapes that were in the training set, from only a restricted-size latent code — testing the limit of expressive capability of the representations.
Quantitative comparison in Table 2 shows that the proposed DeepSDF significantly beats OGN and AtlasNet in Chamfer distance against the true shape computed with a large number of points (30,000). The difference in earth mover's distance (EMD) is smaller because 500 points do not capture the additional precision well. Figure 5 shows a qualitative comparison of DeepSDF to OGN.
For encoding unknown shapes, i.e. shapes in the held-out test set, DeepSDF again significantly outperforms AtlasNet on a wide variety of shape classes and metrics as shown in Table 3. Note that AtlasNet performs reasonably well at classes of shapes that have mostly consistent topology without holes (like planes) but struggles more on classes that commonly have holes, like chairs. This is shown in Fig. 6 where AtlasNet fails to represent the fine detail of the back of the chair. Figure 7 shows more examples of detailed reconstructions on test data from DeepSDF as well as two example failure cases.
A major advantage of the proposed DeepSDF approach for representation learning is that inference can be performed from an arbitrary number of SDF samples. In the DeepSDF framework, shape completion amounts to solving for the shape code that best explains a partial shape observation via Eq. 10. Given the shape code, a complete shape can be rendered using the priors encoded in the decoder.
We test our completion scheme using single view depth observations which is a common use-case and maps well to our architecture without modification. Note that we currently require the depth observations in the canonical shape frame of reference.
To generate SDF point samples from the depth image observation, we sample two points for each depth observation, each located a small distance η away from the measured surface point (along the surface normal estimate). With small η, we approximate the signed distance values of those points to be η and −η, respectively. We solve Eq. 10 with the loss function of Eq. 4, with the clamp value used in our main experiments. Additionally, we incorporate free-space observations (i.e., the empty space between surface and camera) by sampling points along the free-space direction and enforcing larger-than-zero constraints: the free-space loss is |f_θ(z, x)| if f_θ(z, x) < 0 and 0 otherwise.
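The per-pixel sample generation and the free-space penalty can be sketched as follows (function names and the offset value eta are ours):

```python
def depth_to_sdf_samples(surface_point, normal, eta=0.025):
    """For one back-projected depth pixel, emit two SDF samples offset by
    +-eta along the estimated surface normal, labeled with approximate
    signed distances +eta (outside) and -eta (inside)."""
    outside = tuple(p + eta * n for p, n in zip(surface_point, normal))
    inside = tuple(p - eta * n for p, n in zip(surface_point, normal))
    return [(outside, eta), (inside, -eta)]

def free_space_loss(pred_sdf):
    """Penalty for points known to lie between the camera and the surface:
    the prediction there must be non-negative."""
    return max(0.0, -pred_sdf)

# One depth observation at (0, 0, 1) with normal facing the camera (+z).
samples = depth_to_sdf_samples((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
print(samples)
print(free_space_loss(-0.3))  # constraint violated: positive penalty
print(free_space_loss(0.2))   # constraint satisfied: 0.0
```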
Given the SDF point samples and empty-space points, we similarly optimize the latent vector using MAP estimation. Tab. 4 and Figs. 22 and 9 show quantitative and qualitative shape completion results, respectively. Compared to one of the most recent completion approaches using a volumetric shape representation, our continuous-SDF approach produces more visually pleasing and accurate shape reconstructions. While a few recent shape completion methods have been presented [24, 55], we could not find code to run the comparisons, and their underlying 3D representation is a voxel grid, which we extensively compare against.
To show that our learned shape embedding is complete and continuous, we render the results of the decoder when a pair of shapes is interpolated in the latent vector space (Fig. 1). The results suggest that the embedded continuous SDFs represent meaningful shapes and that our representation extracts common interpretable shape features, such as the arms of a chair, that interpolate linearly in the latent space.
DeepSDF significantly outperforms the applicable benchmarked methods across shape representation and completion tasks, and simultaneously addresses the goals of representing complex topologies and closed surfaces while providing high-quality surface normals of the shape. However, while point-wise forward sampling of a shape's SDF is efficient, shape completion (auto-decoding) takes considerably more time during inference due to the need for explicit optimization over the latent vector. We look to increase performance by replacing Adam optimization with more efficient Gauss-Newton or similar methods that make use of the analytic derivatives of the model.
DeepSDF models enable representation of more complex shapes without discretization errors with significantly less memory than previous state-of-the-art results as shown in Table 1, demonstrating an exciting route ahead for 3D shape learning. The clear ability to produce quality latent shape space interpolation opens the door to reconstruction algorithms operating over scenes built up of such efficient encodings. However, DeepSDF currently assumes models are in a canonical pose and as such completion in-the-wild requires explicit optimization over a transformation space increasing inference time. Finally, to represent the true space-of-possible-scenes including dynamics and textures in a single embedding remains a major challenge, one which we continue to explore.
This supplementary material provides quantitative and qualitative experimental results along with extended technical details that are supplementary to the main paper. We first describe the shape completion experiment with noisy depth maps using DeepSDF (Sec. B). We then discuss architecture details (Sec. C) along with experiments exploring characteristics and tradeoffs of the DeepSDF design decisions (Sec. D). In Sec. E we compare auto-decoders with variational and standard auto-encoders. Further, additional details on data preparation (Sec. F), training (Sec. G), the auto-decoder learning scheme (Sec. H), and quantitative evaluations (Sec. I) are presented, and finally in Sec. J we provide additional quantitative and qualitative results.
We test the robustness of our shape completion method by using noisy depth maps as input. Specifically, we demonstrate the ability to complete shapes given partial noisy point clouds obtained from consumer depth cameras. Following prior work, we simulate the noise distribution of typical structured-light depth sensors, including the Kinect V1, by adding zero-mean Gaussian noise to the inverse depth representation of a ground-truth input depth image:

d̃ = ( d^{−1} + 𝒩(0, σ²) )^{−1}.
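A minimal sketch of this noise model (the standard deviation used here is an arbitrary illustration, not the values used in the experiment):

```python
import random

def perturb_depth(depth, sigma, rng):
    """Simulate structured-light sensor noise: add zero-mean Gaussian noise
    to the inverse depth 1/d, then invert back to a depth value."""
    noisy_inverse = 1.0 / depth + rng.gauss(0.0, sigma)
    return 1.0 / noisy_inverse

rng = random.Random(0)  # seeded for reproducibility
noisy = [perturb_depth(2.0, 0.01, rng) for _ in range(5)]
print(noisy)  # values scattered around the true depth of 2.0
```

Because the noise is applied in inverse depth, the resulting depth error grows with distance, matching the behavior of real structured-light sensors.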
For the experiment, we synthetically generate noisy depth maps from the ShapeNet plane models using the same benchmark test set of Dai et al. used in the main paper. We perturb the depth values using four increasing noise standard deviations. Given that the target shapes are normalized to a unit sphere, one can observe that the inserted noise level is significant (Fig. 10).
The shape completion results with respect to added Gaussian noise on the input synthetic depth maps are shown in Fig. 11. The Chamfer distance of the inferred shape versus the ground-truth shape deteriorates approximately linearly with increasing standard deviation of the noise. Compared to the Chamfer distance between the raw perturbed point cloud and the ground-truth depth map, which increases super-linearly with increasing noise level (Fig. 11), the shape completion quality using DeepSDF degrades much more slowly, implying that the shape priors encoded in the network play an important role in regularizing the shape reconstruction.
Fig. 13 depicts the overall architecture of DeepSDF. For all experiments in the main paper we used a network composed of 8 fully connected layers, each applied with weight normalization; each intermediate vector is processed with a ReLU activation and 0.2 dropout, except for the final layer. A skip connection is included at the fourth layer.
In this section, we study system parameter decisions that affect the accuracy of SDF regression, thereby providing insight on the tradeoffs and scalability of the proposed algorithm.
In this experiment we test how the expressive capability of DeepSDF varies as a function of the number of layers. Theoretically, an infinitely deep feed-forward network should be able to memorize the training data with arbitrary precision, but in practice this is not true due to finite compute power and the vanishing gradient problem, which limits the depth of the network.
We conduct an experiment where we let DeepSDF memorize SDFs of 500 chairs and inspect the training loss with a varying number of layers. As described in Fig. 13, we find that applying the input vector (latent vector + xyz query) both to the first and a middle layer improves training. Inspired by this, we split the experiment into two cases: 1) train a regular network without skip connections; 2) train a network by concatenating the input vector every 4 layers (e.g., for a 12-layer network the input vector is concatenated to the 4th and 8th intermediate feature vectors).
The experimental results in Fig. 14 show that the DeepSDF architecture without skip connections saturates quickly at 4 layers, while the error keeps decreasing when trained with latent-vector skip connections. Compared to the architecture we used for the main experiments (8 FC layers), a network with 16 layers produces significantly smaller training error, suggesting the possibility of using a deeper network for higher precision in some application scenarios. Further, we observe that the test error quickly decreases from the four-layer architecture (9.7) to the eight-layer one (5.7) and subsequently plateaus for deeper architectures. However, this does not offer conclusive results on generalization, as we used the same small amount of training data for all architectures, even though a network with more parameters tends to require a higher volume of data to avoid overfitting.
We study the effect of the truncation distance ($\delta$ from Eq. 4 of the manuscript) on the regression accuracy of the model. The truncation distance controls the extent from the surface over which we expect the network to learn a metric SDF. Fig. 15 plots the Chamfer distance as a function of truncation distance. We observe a moderate decrease in the accuracy of the surface representation as the truncation distance is increased. One hypothesis is that it is more difficult to approximate a larger truncation region (a strictly larger domain of the function) to the same absolute accuracy as a smaller one. The benefit of larger truncation regions, however, is that the metric SDF is maintained over a larger region: in our application this reduces raycasting time, and a larger metric SDF is also valuable for other applications such as physics simulation and robot motion planning. We chose a value of 0.01 for all experiments presented in the manuscript, which provides a good tradeoff between raycasting speed and surface accuracy.
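A minimal numpy sketch of the truncation described here; the function names are ours, and `delta` plays the role of the truncation distance from Eq. 4:

```python
import numpy as np

def clamp(x, delta):
    """Truncate SDF values to the band [-delta, delta] around the surface."""
    return np.clip(x, -delta, delta)

def clamped_l1(pred, target, delta=0.01):
    """Clamped L1 loss: the network is only held to metric accuracy within
    `delta` of the surface; farther away only the truncated value matters."""
    return np.mean(np.abs(clamp(pred, delta) - clamp(target, delta)))
```

Far from the surface both prediction and target clamp to the same bound, so the loss there is zero, which is how the truncation distance trades surface accuracy against the extent of the metric SDF.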
To compare different approaches to learning a latent code-space for a given datum, we use the MNIST dataset and compare the variational autoencoder (VAE), the standard bottleneck autoencoder (AE), and the proposed auto-decoder (AD). As the reconstruction error we use the standard binary cross-entropy, and we match the model architectures such that the decoders of the different approaches have exactly the same structure and hence the same theoretical capacity. We show all evaluations for latent code-space dimensions of 2, 5, and 15.
For 2D codes the latent spaces learned by the different methods are visualized in Fig. 16. All code spaces can reasonably represent the different digits. The AD latent space seems more condensed than the ones from VAE and AE. For the optimization-based encoding approach we initialize codes randomly. We show visualizations of such random samples in Fig. 17. Note that samples from the AD- and VAE-learned latent code spaces mostly look like real digits, showing their ability to generate realistic digit images.
We also compare the train and test reconstruction errors for the different methods in Fig. 18. For VAE and AE we show both the reconstruction error obtained using the learned encoder and that obtained via code optimization using the learned decoder only (denoted “(V)AE decode”). The test error for VAE and AE with their learned encoders is consistently the lowest across latent code dimensions. “AE decode” diverges in all cases, hinting at a learned latent space that is poorly suited for optimization-based decoding. Optimization-based encoding with the VAE decoder seems to work better for higher-dimensional codes. The proposed AD approach works well in all tested code-space dimensions. Although “VAE decode” has slightly lower test error than AD in 15 dimensions, qualitatively AD’s reconstructions are better, as we discuss next.
In Fig. 19 we show example reconstructions from the test dataset. When using the learned encoders, VAE and AE produce qualitatively good reconstructions. With optimization-based encoding, “AE decode” performs poorly, indicating that the latent space has many bad local minima. While the reconstructions from “VAE decode” are, for the most part, qualitatively close to the original, AD’s reconstructions more closely resemble the actual digit of the test data. Qualitatively, AD is on par with reconstructions from the end-to-end-trained VAE and AE.
For data preparation, we are given a mesh of a shape from which to sample spatial points and their SDF values. We begin by normalizing each shape so that it fits into a unit sphere with some margin (in practice, a sphere of radius 1/1.03). We then virtually render the mesh from 100 virtual cameras regularly sampled on the surface of the unit sphere, gather surface points by back-projecting the depth pixels of these renderings, and assign each point the normal of the triangle to which it belongs, oriented towards the camera. If a triangle is visible from both sides, however, the mesh is not watertight, making true SDF values hard to calculate, so we discard any mesh with more than 2% of its triangles double-sided. For a valid mesh, we construct a KD-tree over the oriented surface points.
As stated in the main paper, it is important to sample more aggressively near the surface of the mesh, since we want to accurately model the zero-crossings. Specifically, we sample around 250,000 points randomly on the surface of the mesh, weighted by triangle area. We then perturb each surface point along the xyz axes with zero-mean Gaussian noise of variance 0.0025 and 0.00025, generating two spatial samples per surface point. We additionally sample around 25,000 points uniformly within the unit sphere. For each spatial sample, we find the nearest surface point via the KD-tree, measure the distance, and determine the sign from the dot product between that point’s normal and the vector difference.
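The sign-determination step above can be sketched with a scipy KD-tree. The helper name is ours, and it assumes oriented surface samples as produced by the procedure just described:

```python
import numpy as np
from scipy.spatial import cKDTree

def signed_distances(query_pts, surf_pts, surf_normals):
    """Approximate SDF at query points from oriented surface samples:
    magnitude from the distance to the nearest surface point, sign from
    the dot product between that point's normal and the offset vector."""
    tree = cKDTree(surf_pts)
    dist, idx = tree.query(query_pts)       # nearest oriented surface sample
    offset = query_pts - surf_pts[idx]
    sign = np.sign(np.sum(offset * surf_normals[idx], axis=1))
    sign[sign == 0] = 1.0                   # points exactly on the surface
    return sign * dist
```

A query inside the shape yields a negative dot product with the outward normal and hence a negative SDF value, matching the sign convention of the paper.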
For training, we find it important to initialize the latent vectors to small values, drawn from a zero-mean Gaussian with small standard deviation, so that similar shapes do not start far apart in the latent vector space. Another crucial point is balancing the positive and negative samples for both training and testing: in each batch used for gradient descent, half of the SDF point samples are positive and the other half negative.
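The half-positive/half-negative batch construction can be sketched as follows (the helper name is ours; it samples with replacement for simplicity):

```python
import numpy as np

def balanced_batch(points, sdf, batch_size, rng=None):
    """Draw a batch in which half the SDF samples are positive (outside)
    and half are negative (inside), as described above."""
    rng = np.random.default_rng(rng)
    pos = np.flatnonzero(sdf > 0)
    neg = np.flatnonzero(sdf <= 0)
    half = batch_size // 2
    idx = np.concatenate([rng.choice(pos, half), rng.choice(neg, half)])
    return points[idx], sdf[idx]
```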
The learning rate for the decoder parameters was set to $10^{-5} \cdot B$, where $B$ is the number of shapes in one batch. For each shape in a batch we subsampled 16,384 SDF samples. The learning rate for the latent vectors was set to $10^{-3}$, and the regularization parameter was set as described in the main manuscript. We trained our models on 8 Nvidia GPUs for approximately 8 hours (1000 epochs). For the reconstruction experiments the latent vector size was 256; for the shape completion task we used models with 128-dimensional latent vectors.
To derive the auto-decoder-based shape-coded DeepSDF formulation we adopt a probabilistic perspective. Given a dataset of $N$ shapes, each represented by a signed distance function $SDF^i$, we prepare a set of point samples and their signed distance values:
$$X_i = \{(x_j, s_j) : s_j = SDF^i(x_j)\}. \tag{12}$$
The SDF values can be computed from mesh inputs as detailed in the main paper.
For an auto-decoder there is no encoder; instead, each latent code $z_i$ is paired with training shape data $X_i$ and initialized randomly from a zero-mean Gaussian with small variance. The latent vectors $\{z_i\}$ are then jointly optimized during training along with the decoder parameters $\theta$.
We assume that each shape $X_i$ in the given dataset, together with its code $z_i$, follows the joint distribution of shapes
$$p_\theta(X_i, z_i) = p_\theta(X_i \mid z_i)\, p(z_i), \tag{13}$$
where $\theta$ parameterizes the data likelihood. For a given $X_i$, a shape code $z_i$ can be estimated via Maximum-a-Posteriori (MAP) estimation:
$$\hat{z}_i = \arg\max_{z_i} p_\theta(z_i \mid X_i). \tag{14}$$
We estimate $\theta$ as the parameters that maximize the posterior across all shapes:
$$\hat{\theta} = \arg\max_{\theta} \prod_i \max_{z_i} p_\theta(z_i \mid X_i) = \arg\max_{\theta} \prod_i \max_{z_i} \frac{p_\theta(X_i \mid z_i)\, p(z_i)}{p(X_i)},$$
where the second equality follows from Bayes' rule; the evidence $p(X_i)$ does not depend on $\theta$ or $z_i$ and can be dropped from the optimization.
For each shape $X_i$, defined via point and SDF samples as in Eq. 12, we make a conditional independence assumption given the code $z_i$ to arrive at the decomposition of the posterior:
$$p_\theta(z_i \mid X_i) \propto p(z_i) \prod_{(x_j, s_j) \in X_i} p_\theta(s_j \mid z_i; x_j). \tag{15}$$
Note that the individual SDF likelihoods are parameterized by the sampling location $x_j$.
To derive the proposed auto-decoder-based DeepSDF approach we express the SDF likelihood via a deep feed-forward network $f_\theta(z_i, x_j)$ and, without loss of generality, assume that the likelihood takes the form
$$p_\theta(s_j \mid z_i; x_j) = \exp\!\big(-\mathcal{L}(f_\theta(z_i, x_j), s_j)\big).$$
The SDF prediction $\tilde{s}_j = f_\theta(z_i, x_j)$ is represented using a fully-connected network, and $\mathcal{L}(\tilde{s}_j, s_j)$ is a loss function penalizing the deviation of the network prediction from the actual SDF value $s_j$. One example is the standard $L_2$ loss, which amounts to assuming Gaussian noise on the SDF values. In practice we use the clamped $L_1$ cost introduced in the main manuscript.
In the latent shape-code space, we assume the prior distribution over codes $p(z_i)$ to be a zero-mean multivariate Gaussian with spherical covariance $\sigma^2 I$. Note that other, more complex priors could be assumed. This leads, via Eq. 15, to the final cost function, which we jointly minimize with respect to the network parameters $\theta$ and the shape codes $\{z_i\}_{i=1}^N$:
$$\arg\min_{\theta, \{z_i\}} \sum_{i=1}^{N} \Big( \sum_{j} \mathcal{L}\big(f_\theta(z_i, x_j), s_j\big) + \frac{1}{\sigma^2} \lVert z_i \rVert_2^2 \Big).$$
At inference time, we are given SDF point samples $X = \{(x_j, s_j)\}$ of one underlying shape and estimate the latent code describing that shape. Using the MAP formulation from Eq. 14 with fixed network parameters $\theta$ we arrive at
$$\hat{z} = \arg\min_{z} \sum_{j} \mathcal{L}\big(f_\theta(z, x_j), s_j\big) + \frac{1}{\sigma^2} \lVert z \rVert_2^2,$$
where $\frac{1}{\sigma^2}$ can be used to balance the reconstruction and regularization terms. For additional comments and insights, as well as the practical implementation of the network and its training, refer to the main manuscript.
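The MAP inference above amounts to optimizing the code alone with the decoder frozen. A PyTorch sketch follows; the function name and decoder interface are assumptions, `reg` is a placeholder standing in for $1/\sigma^2$, and a plain L1 data term is used in place of the clamped cost:

```python
import torch

def infer_latent(decoder, pts, sdf_vals, latent_dim=8,
                 reg=1e-4, steps=500, lr=5e-3):
    """MAP inference of a shape code: with decoder weights fixed, optimize
    only z against the data term plus the Gaussian-prior penalty reg*||z||^2."""
    z = torch.zeros(latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = decoder(z.unsqueeze(0).expand(pts.shape[0], -1), pts)
        loss = torch.mean(torch.abs(pred.squeeze(-1) - sdf_vals))
        loss = loss + reg * z.pow(2).sum()
        loss.backward()
        opt.step()
    return z.detach()
```

Note that only `z` is registered with the optimizer, so the decoder parameters stay fixed exactly as the formulation above requires.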
For quantitative evaluations we converted the DeepSDF model for a given shape into a mesh by running Marching Cubes on a dense grid of SDF samples. Note that while this was done for quantitative evaluation as a mesh, many of the qualitative renderings are instead produced by raycasting directly against the continuous SDF model, which avoids some of the artifacts produced by Marching Cubes at finite resolution. For all experiments on representing known or unknown shapes, DeepSDF was trained on ShapeNet v2, while all shape completion experiments were trained on ShapeNet v1 to match 3D-EPN. Additional DeepSDF training details are provided in Sec. G.
For OGN we trained the provided decoder model (“shape_from_id”) for 300,000 steps on the same train set of cars used for DeepSDF. To compute the point-based metrics, we took both the ground-truth 256-voxel training data provided by the authors and the generated 256-voxel output, and converted each into a point cloud containing only the surface voxels, with one point at each such voxel’s center. Specifically, surface voxels were defined as occupied voxels with at least one of their 6 direct (non-diagonal) voxel neighbors unoccupied. A typical number of points in the resulting point clouds is approximately 80,000, and the points used for evaluation are randomly sampled from these sets. Additionally, OGN was trained on ShapeNet v1, while AtlasNet was trained on ShapeNet v2; to adjust for the scale difference, we converted the OGN point clouds into ShapeNet v2 scale for each model.
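The surface-voxel extraction described above can be sketched in numpy (the helper name is ours; the grid boundary is treated as unoccupied):

```python
import numpy as np

def surface_voxel_centers(occ):
    """Centers of surface voxels: occupied voxels with at least one of
    their 6 face-adjacent neighbors unoccupied (boundary counts as free)."""
    padded = np.pad(occ, 1, constant_values=False)
    has_free_neighbor = np.zeros_like(occ, dtype=bool)
    for axis in range(3):
        for shift in (1, -1):
            # neighbor occupancy along +/- direction of this axis
            neighbor = np.roll(padded, shift, axis=axis)[1:-1, 1:-1, 1:-1]
            has_free_neighbor |= ~neighbor
    surface = occ & has_free_neighbor
    return np.argwhere(surface) + 0.5  # voxel centers
```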
Since the provided pretrained AtlasNet models were trained multi-class, we instead trained separate AtlasNet models for each evaluation. Each model was trained with the code made available by the authors with all default parameters, except for specifying the class for each model and matching the train/test splits with those used for DeepSDF. The quality of the models produced by these trainings appears comparable to those in the original paper.
Of note, we found that AtlasNet’s own computation of its training and evaluation metric, Chamfer distance, has the limitation that only the vertices of the generated mesh are used for evaluation. This leaves the triangles of the mesh unconstrained: they can connect across what should be holes in the shape without the metric reflecting it. Our evaluation of meshes produced by AtlasNet instead samples evenly from the mesh surface, i.e., each triangle in the mesh is weighted by its surface area, and points are sampled from the triangle faces.
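Area-weighted surface sampling can be sketched as follows (the helper name is ours; it uses the standard barycentric fold-back trick for uniform sampling within a triangle):

```python
import numpy as np

def sample_mesh_surface(verts, faces, n, rng=None):
    """Sample n points uniformly over a triangle mesh: pick faces with
    probability proportional to area, then sample barycentric coordinates."""
    rng = np.random.default_rng(rng)
    tri = verts[faces]                                  # (F, 3, 3)
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    area = 0.5 * np.linalg.norm(cross, axis=1)
    face_idx = rng.choice(len(faces), size=n, p=area / area.sum())
    u, v = rng.random(n), rng.random(n)
    flip = u + v > 1                                    # fold back into triangle
    u[flip], v[flip] = 1 - u[flip], 1 - v[flip]
    t = tri[face_idx]
    return t[:, 0] + u[:, None] * (t[:, 1] - t[:, 0]) + v[:, None] * (t[:, 2] - t[:, 0])
```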
We used the provided shape completion inference results for 3D-EPN, which are in a voxelized distance-function format. We then extracted the isosurface using MATLAB, as described in their paper, to obtain the final mesh.
The first two metrics, Chamfer and Earth Mover’s, are easily applicable to points, meshes (by sampling points from the surface) and voxels (by sampling surface voxels and treating their centers as points). When meshes are available, we also can compute metrics suited particularly for meshes: mesh accuracy, mesh completion, and mesh cosine similarity.
Chamfer distance is a popular metric for evaluating shapes, perhaps due to its simplicity. Given two point sets $S_1$ and $S_2$, the metric is the sum of squared nearest-neighbor distances from each point to the other point set:
$$d_{CD}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2.$$
Note that while the metric is sometimes defined one-way (i.e., just $\sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2$), which is not symmetric, the sum of both directions as defined above is symmetric: $d_{CD}(S_1, S_2) = d_{CD}(S_2, S_1)$. Note also that the metric is not technically a valid distance function, since it does not satisfy the triangle inequality, but it is commonly used as a pseudo-distance. In all of our experiments we report the Chamfer distance for 30,000 points in both $S_1$ and $S_2$, which can be computed efficiently with a KD-tree, and, akin to prior work, we normalize each directional sum by the number of points in the corresponding set.
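A minimal KD-tree implementation of the normalized, squared Chamfer distance might look like this (the helper name is ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer(s1, s2):
    """Symmetric Chamfer pseudo-distance between two point sets, with each
    directional sum of squared nearest-neighbor distances normalized by
    the size of its point set."""
    d12, _ = cKDTree(s2).query(s1)  # nearest neighbor in s2 for each point of s1
    d21, _ = cKDTree(s1).query(s2)  # and vice versa
    return np.mean(d12**2) + np.mean(d21**2)
```

Both queries are O(n log n) with the KD-tree, which is what makes the 30,000-point evaluation above practical.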
Earth Mover’s distance, also known as the Wasserstein distance, is another popular metric for measuring the difference between two discrete distributions. Unlike the Chamfer distance, which places no constraints on the correspondences between evaluated points, the Earth Mover’s distance is defined through a bijection $\phi$, i.e., a one-to-one correspondence. Formally, for two point sets $S_1$ and $S_2$ of equal size, the metric is defined via the optimal bijection $\phi: S_1 \to S_2$:
$$d_{EMD}(S_1, S_2) = \min_{\phi: S_1 \to S_2} \sum_{x \in S_1} \lVert x - \phi(x) \rVert_2.$$
Although the metric is commonly approximated in the deep learning literature by distributed approximation schemes for training speed, we compute it exactly for evaluation using a more modest number of point samples (500).
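An exact small-sample computation can be sketched via the Hungarian algorithm in scipy (the helper name is ours; the cubic cost of the assignment solver is why the modest sample count above is used):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def earth_movers(s1, s2):
    """Exact Earth Mover's distance between equally sized point sets:
    the optimal one-to-one assignment minimizing total matched distance."""
    assert len(s1) == len(s2), "EMD as defined requires equal-size sets"
    cost = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return cost[rows, cols].sum()
```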
In practice, the important intuitive difference between the Chamfer and Earth Mover’s metrics is that the Earth Mover’s metric favors point distributions that are as evenly distributed as the ground truth. A low Chamfer distance may be achieved by assigning just one point in one set to a whole cluster of points in the other, but to achieve a low Earth Mover’s distance, each cluster of points in one set requires a comparably sized cluster in the other.
Mesh accuracy, as defined in prior work, is the minimum distance $d$ such that 90% of generated points are within $d$ of the ground truth mesh. We used 1,000 points sampled evenly from the generated mesh surface and computed the minimum distances to the full ground truth mesh; to clarify, the distance is computed to the closest point on any face of the mesh, not just the vertices. Note that unlike the Chamfer and Earth Mover’s metrics, which require sampling points from both meshes, this metric uses the entire ground truth mesh, and accordingly it has lower variance than, for example, a Chamfer distance computed with only 1,000 points from each mesh. Note also that mesh accuracy does not measure how complete the generated mesh is: a low (good) mesh accuracy can be achieved by generating only a small portion of the ground truth mesh and ignoring the rest. Accordingly, it is best paired with the following metric, mesh completion.
Mesh completion, also as defined in prior work, is the fraction of points sampled from the ground truth mesh that are within some distance $d$ (a parameter of the metric) of the generated mesh. We used a fixed threshold that clearly exposed the differences in mesh completion between the methods. With this metric the full generated mesh is used, and points (we used 1,000) are sampled from the ground truth mesh (mesh accuracy is vice versa). Ideal mesh completion is 1.0; the minimum is 0.0.
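Both paired metrics reduce to simple statistics once the point-to-mesh distances are available. A sketch with hypothetical helper names, assuming the distance arrays have already been computed as described above:

```python
import numpy as np

def mesh_accuracy(gen_to_gt_dists, fraction=0.9):
    """Smallest d such that `fraction` of points sampled from the generated
    mesh lie within d of the ground-truth mesh (90th-percentile distance)."""
    return np.quantile(gen_to_gt_dists, fraction)

def mesh_completion(gt_to_gen_dists, threshold):
    """Fraction of points sampled from the ground-truth mesh that lie
    within `threshold` of the generated mesh."""
    return float(np.mean(gt_to_gen_dists < threshold))
```

Note the asymmetry: accuracy measures distances from the generated mesh to the ground truth, completion the reverse, which is why the two are reported together.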
Mesh cosine similarity is a metric we introduce to measure the accuracy of mesh normals. We define it as the mean cosine similarity between the normals of points sampled from the ground truth mesh and the normals of the nearest faces of the generated mesh. More precisely, given the generated mesh $M$ and a set $S$ of points $p_i$ with normals $n_i$ sampled from the ground truth mesh, for each point $p_i$ we look up the closest face $t_i$ in $M$ with normal $\tilde{n}_i$ and compute the average cosine similarity
$$\frac{1}{|S|} \sum_{i} \langle n_i, \tilde{n}_i \rangle,$$
where each normal is a unit vector. To accommodate methods that do not provide consistently oriented normals, we take the maximum over the generated mesh normal and its flipped counterpart: $\max(\langle n_i, \tilde{n}_i \rangle, \langle n_i, -\tilde{n}_i \rangle)$. Ideal cosine similarity is 1.0; the minimum (given the allowed flip) is 0.0.
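The flip-tolerant similarity can be sketched as follows (the helper name is ours; taking the maximum over a unit normal and its flip equals the absolute value of the dot product):

```python
import numpy as np

def mesh_cosine_similarity(gt_normals, gen_normals, allow_flip=True):
    """Mean cosine similarity between matched unit normals; with
    `allow_flip`, each generated normal may be flipped, for methods
    that do not produce consistently oriented normals."""
    dots = np.sum(gt_normals * gen_normals, axis=1)
    if allow_flip:
        dots = np.abs(dots)  # max(<n, m>, <n, -m>) = |<n, m>|
    return float(np.mean(dots))
```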
We provide additional results on representing test objects with trained DeepSDF (Fig. 20, 21), and report the additional metrics, mesh completion and mesh cosine similarity, for the comparison of methods contained in the manuscript (Tab. 5). The success of this task for DeepSDF implies that 1) high-quality shapes similar to the test shapes exist in the embedding space, and 2) the codes for those shapes can be found through simple gradient descent.
(Tab. 5: mean and median mesh completion and mean cosine similarity, per class: chair, plane, table, lamp, sofa.)
Finally, we present additional shape completion results on unperturbed depth images of the synthetic ShapeNet dataset (Fig. 22), demonstrating the quality of the auto-decoder learning scheme and the new shape representation.