Representing 3D shape is a fundamental problem with many applications, including surface reconstruction, analysis, compression, matching, interpolation, manipulation, and visualization. For most vision applications, a 3D representation should support: (a) reconstruction with accurate surface details, (b) scalability to complex shapes, (c) arbitrary topologies, (d) generalizability to unseen shape classes, (e) independence from any particular application domain, (f) encoding of shape priors, (g) compact storage, and (h) computational efficiency.
No current representation has all of these desirable properties. Traditional explicit 3D representations (voxels, meshes, point clouds, etc.) provide properties (a-e) above. They can represent arbitrary shapes and any desired detail, but they require storage and computation proportional to the shape complexity and reconstruction accuracy, and they don’t encode shape priors helpful for 3D completion and reconstruction tasks. In contrast, learned representations (latent vectors and deep network decoders) excel at representing shapes compactly with low-dimensional latent vectors and encoding shape priors in network weights, but they struggle to reconstruct details for complex shapes or generalize to shape classes outside the training distribution.
Most recently, deep implicit functions (DIF) have been shown to be highly effective for reconstruction of individual objects [mescheder2019occ, chen2018learning, park2019deepsdf, xu2019disn]. They represent an input observation as a latent vector $\mathbf{z}$ and train a neural network to estimate the inside/outside function given a query location $\mathbf{x}$ in 3D space. This approach achieves state-of-the-art results for several 3D shape reconstruction tasks. However, these methods use a single, fixed-length latent feature vector to represent the entirety of all shapes, and they evaluate a complex deep network to compute the implicit function for every position $\mathbf{x}$. As a result, they support limited shape complexity, generality, and computational efficiency.
Meanwhile, new methods are emerging for learning to infer structured decomposition of shapes [tulsiani2017learning, genova2019learning]. For example, [genova2019learning] recently proposed a network to encode shapes into Structured Implicit Functions (SIF), which represents an implicit function as a mixture of local Gaussian functions. They showed that simple networks can be trained to decompose diverse collections of shapes consistently into SIFs, where the local shape elements inferred for one shape (e.g., the leg of a chair) correspond to semantically similar elements for others (e.g., the leg of a table). However, they did not use these structured decompositions for accurate shape reconstruction due to the limited shape expressivity of their local implicit functions (Gaussians).
The key idea of this paper is to develop a pipeline that can learn to infer Deep Structured Implicit Functions (DSIF), a representation of 3D shape as a structured set of local deep implicit functions (Figure 1). This DSIF representation is similar to SIF in that it decomposes a shape into a set of overlapping local regions represented by Gaussians; however, it also associates a latent vector with each local region that can be decoded with a DIF to produce finer geometric detail. Alternatively, DSIF is similar to DIF in that it encodes a shape as a latent vector that can be evaluated with a neural network to estimate the inside/outside function for any location $\mathbf{x}$; however, the DSIF latent vector is decomposed into parts associated with local regions of space (SIF Gaussians), which makes it more scalable, generalizable, and computationally efficient.
In this paper, we not only propose the DSIF representation, but we also provide a common system design that works effectively for 3D autoencoding, depth image completion, and partial surface completion. First, we propose to use DIF to predict local functions that are residuals with respect to the Gaussian functions predicted by SIF. This choice simplifies the task of the DIF, as it must predict only fine details rather than the overall shape within each shape element. Second, we propose to use the SIF decomposition of space to focus the DIF encoder on local regions by gathering input 3D points within each predicted shape element and encoding them with PointNet [qi2016pointnet]. Finally, we investigate several significant improvements to SIF (rotational degrees of freedom, symmetry constraints, etc.) and simplifications to DIF (fewer layers, smaller latent codes, etc.) to improve the DSIF representation. Results of ablation studies show that each of these design choices provides significant performance improvements over alternatives. In all, DSIF achieves 10-15 points better F-Score performance on shape reconstruction benchmarks than the state-of-the-art [mescheder2019occ], with fewer than 1% of the network parameters.
2 Related Work
Traditional Shape Representations: There are many existing approaches for representing shape. In computer graphics, some of the foundational representations are meshes [baumgart1975polyhedron], point clouds [foley1996computer], voxel grids [foley1996computer], and implicit surfaces [ricci1973constructive, blinn1982generalization, bloomenthal1991convolution, muraki1991volumetric, ohtake2003multi, wyvill1986soft]. These representations are popular for their simplicity and ability to operate efficiently with specialized hardware. However, they lack two important properties: they do not leverage a shape prior, and they can be inefficient in their expressiveness. Thus, traditional surface reconstruction pipelines based on them, such as Poisson Surface Reconstruction [kazhdan2006poisson], require a substantial amount of memory and computation and are not good for completing unobserved regions.
Learned Shape Representations: To leverage shape priors, shape reconstruction methods began representing shape as a learned feature vector, with a trained decoder to a mesh [smith2019geometrics, gkioxari2019mesh, wang2018pixel2mesh, groueix2018papier, kanazawa2018learning], point cloud [fan2017point, lin2018learning, yang2019pointflow], voxel grid [choy20163d, wu20153d, brock2016generative, wu2016learning], or octree [tatarchenko2017octree, riegler2017octnetfusion, riegler2017octnet]. Most recently, representing shape as a vector with an implicit surface function decoder has become popular, with methods such as OccNet [mescheder2019occ], ImNet [chen2018learning], DeepSDF [park2019deepsdf], and DISN [xu2019disn]. These methods have substantially improved the state of the art in shape reconstruction and completion. However, they do not scale or generalize very well because the fundamental representation is a single fixed-length feature vector representing a shape globally.
Structured Shape Representations: To improve scalability and efficiency, researchers have introduced structured representations that encode the repeated and hierarchical nature of shapes. Traditional structured representations include scene graphs [foley1996computer], CSG trees [foley1996computer], and partition of unity implicits [ohtake2003multi], all of which represent complex shapes as the composition of simpler ones. Learned structured representations include CSGNet [sharma2018csgnet], GRASS [li2017grass], Volumetric Primitives [tulsiani2017learning], Superquadrics [Paschalidou2019CVPR], and SIF [genova2019learning]. These methods can decompose shapes into simpler ones, usually with high consistency across shapes in a collection. However, they have been used primarily for shape analysis (e.g. part decomposition, part-aware correspondence), not for accurate surface reconstruction or completion.
3 Deep Structured Implicit Functions
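In this paper, we propose a new 3D shape representation based on Deep Structured Implicit Functions (DSIF). The DSIF is a function $\mathrm{DSIF}(\mathbf{x}, \Theta, \mathbf{Z})$ that can be used to classify whether a query point $\mathbf{x}$ is inside or outside a shape. It is represented by a set of $N$ shape elements, each parameterized by 10 analytic shape variables $\theta_i$ and $M$ latent shape variables $\mathbf{z}_i$:
$$\mathrm{DSIF}(\mathbf{x}, \Theta, \mathbf{Z}) = \sum_{i \in [N]} g(\mathbf{x}, \theta_i)\,\bigl(1 + f_i(\mathbf{x}, \mathbf{z}_i)\bigr)$$
where $g(\mathbf{x}, \theta_i)$ is a local analytic implicit function and $f_i(\mathbf{x}, \mathbf{z}_i)$ is a deep implicit function. Intuitively, $g$ provides a density function that defines a coarse shape and region of influence for each shape element, and $f_i$ provides the shape details that cannot be represented by $g$.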
Like a typical deep implicit function (DIF), our DSIF represents a 3D shape as an isocontour of an implicit function decoded with a deep network conditioned on predicted latent variables. However, our DSIF replaces the (possibly long) single latent code of a typical DIF with the concatenation of $N$ pairs of analytic parameters $\theta_i$ and short latent codes $\mathbf{z}_i$; i.e., the global implicit function is decomposed into the sum of $N$ local implicit functions. This key difference helps it to be more accurate, efficient, consistent, scalable, and generalizable (see Section 6).
Analytic shape function. The analytic shape function $g(\mathbf{x}, \theta_i)$ defines a coarse density function and region of influence for each shape element. Any simple analytic implicit function with local support would do. We use an oriented, anisotropic, 3D Gaussian:
$$g(\mathbf{x}, \theta_i) = c_i \exp\!\left(-\tfrac{1}{2}\,\lVert T_i \mathbf{x} \rVert^2\right)$$
where the parameter vector $\theta_i$ consists of ten variables: one for a scale constant $c_i$, three for a center point $\mathbf{p}_i$, three radii $\mathbf{r}_i$, and three Euler angles $\mathbf{e}_i$ (this is the same parameterization as [genova2019learning], except with 3 additional DoFs for rotation). The last 9 variables imply an affine transformation matrix $T_i$ that takes a point $\mathbf{x}$ from object space coordinates to the local isotropic, oriented, centered coordinate frame of the shape element.
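To make the ten-parameter element concrete, the following is a minimal sketch of evaluating one oriented, anisotropic Gaussian. The Euler-angle convention (rotations about x, then y, then z), the function names, and the choice to apply the world-to-local transform as translate, rotate, then scale by the inverse radii are all assumptions made for illustration; the text above does not fix them.

```python
import math

def euler_to_rotation(ex, ey, ez):
    """Rotation matrix R = Rz(ez) * Ry(ey) * Rx(ex) (an assumed convention)."""
    cx, sx = math.cos(ex), math.sin(ex)
    cy, sy = math.cos(ey), math.sin(ey)
    cz, sz = math.cos(ez), math.sin(ez)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]

def to_local(x, center, radii, euler):
    """Map an object-space point into the element's centered, oriented, isotropic frame."""
    rot = euler_to_rotation(*euler)
    d = [x[k] - center[k] for k in range(3)]
    # rot maps local axes to world axes, so its transpose maps world to local.
    rotated = [sum(rot[r][k] * d[r] for r in range(3)) for k in range(3)]
    return [rotated[k] / radii[k] for k in range(3)]

def gaussian(x, c, center, radii, euler):
    """Evaluate c * exp(-0.5 * ||local(x)||^2) for one shape element."""
    u = to_local(x, center, radii, euler)
    return c * math.exp(-0.5 * sum(v * v for v in u))
```

At the element center the function returns the scale constant itself, and one radius away along a local axis it falls to `c * exp(-0.5)`, which is a quick sanity check on any implementation of the transform.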
Deep shape function. The deep implicit function $f_i(\mathbf{x}, \mathbf{z}_i)$ defines local shape details within a shape element by modulating $g(\mathbf{x}, \theta_i)$ (one function $f$ is shared by all shape elements). To compute $f_i$, we use a network architecture based on Occupancy Networks [mescheder2019occ] (we call ours TinyOccNet). As in the original OccNet, ours is organized as a fully-connected network conditioned on the latent code $\mathbf{z}_i$ and trained using conditional batch normalization. However, one critical difference is that we transform the point $\mathbf{x}$ by $T_i$ before feeding it to the network:
$$f_i(\mathbf{x}, \mathbf{z}_i) = \mathrm{TinyOccNet}(T_i \mathbf{x}, \mathbf{z}_i)$$
Another critical difference is that $f_i$ only modulates the local implicit function rather than predicting an entire, global function. As a result, TinyOccNet has fewer network layers (9 vs. 33), shorter latent codes (32 vs. 256), and many fewer network parameters (8.6K vs. 2M) than the original OccNet, and still achieves higher overall accuracy (see Section 6).
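The way the analytic and deep functions combine can be sketched as follows. This is an illustration only: it assumes the modulation takes the form "Gaussian times one plus residual", uses an axis-aligned Gaussian with rotation omitted for brevity, and takes the residual as an arbitrary callable `residual(u, z)` standing in for the trained TinyOccNet. The dictionary keys are hypothetical names.

```python
import math

def local_gaussian(x, element):
    """Axis-aligned stand-in for the analytic function; returns (value, local point)."""
    c, center, radii = element["c"], element["center"], element["radii"]
    u = [(x[k] - center[k]) / radii[k] for k in range(3)]
    return c * math.exp(-0.5 * sum(v * v for v in u)), u

def dsif(x, elements, residual):
    """Sum over elements of g * (1 + f), with f evaluated in each element's local frame."""
    total = 0.0
    for element in elements:
        g, u = local_gaussian(x, element)
        # The residual sees the locally transformed point and the element's latent code.
        total += g * (1.0 + residual(u, element["z"]))
    return total
```

With a residual that always returns zero, the function reduces to a plain mixture of Gaussians, which matches the intuition that the deep function only adds detail on top of the coarse analytic shape.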
Symmetry constraints. For shape collections with man-made objects, we constrain a subset of the shape elements (half) to be symmetric with respect to a selected set of transformations (reflection across a right/left bisecting plane). These “symmetric” shape elements are evaluated twice for every point query, once for $\mathbf{x}$ and once for $S\mathbf{x}$, where $S$ is the symmetry transformation. In doing so, we effectively increase the number of shape elements without having to compute or store extra parameters for them. Adding partial symmetry encourages the shape decomposition to match global shape properties common in many shape collections and gives a boost to accuracy (Table 4).
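The double evaluation of symmetric elements can be sketched as below. The reflection plane (x = 0), the `symmetric` flag, and the simple `bump` function used as the per-element local function are assumptions for illustration, not details taken from the text.

```python
import math

def reflect_x(x):
    """Reflect a point across the plane x = 0 (assumed right/left bisecting plane)."""
    return (-x[0], x[1], x[2])

def bump(x, element):
    """Hypothetical local function: an isotropic Gaussian bump around the element center."""
    d2 = sum((x[k] - element["center"][k]) ** 2 for k in range(3))
    return math.exp(-0.5 * d2)

def evaluate_with_symmetry(x, elements, local_fn=bump):
    """Sum local functions; symmetric elements are evaluated at x and at its reflection."""
    total = 0.0
    for element in elements:
        total += local_fn(x, element)
        if element.get("symmetric", False):
            # Same parameters, mirrored query point: no extra storage is needed.
            total += local_fn(reflect_x(x), element)
    return total
```

A single symmetric element therefore contributes on both sides of the plane, so the field produced by the symmetric elements alone is mirror-symmetric by construction.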
4 Processing Pipeline
The processing pipeline for computing DSIF is shown in Figure 1. All steps of the pipeline are differentiable and trained end-to-end. At inference time, the input to the system is a 3D surface or depth image, and the output is a set of shape element parameters and latent codes for each of $N$ overlapping local regions, which can be decoded to predict inside/outside for any query location $\mathbf{x}$. Complete surfaces can be reconstructed for visualization by evaluating DSIF at points on a regular grid and running Marching Cubes [lorensen1987marchingcubes].
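The grid evaluation that precedes isosurface extraction can be sketched as follows. The bounds, the resolution, and the `field` callable (any function mapping a 3D point to a scalar) are placeholders; the resulting values would then be reshaped into a volume and handed to a Marching Cubes implementation such as `skimage.measure.marching_cubes`, with an isolevel appropriate to the trained model.

```python
def regular_grid(lo, hi, n):
    """Return n*n*n points on a regular grid spanning [lo, hi] along each axis."""
    step = (hi - lo) / (n - 1)
    coords = [lo + i * step for i in range(n)]
    return [(x, y, z) for x in coords for y in coords for z in coords]

def sample_field(field, lo=-0.5, hi=0.5, n=8):
    """Evaluate a scalar field at every grid point, in x-major order."""
    return [field(p) for p in regular_grid(lo, hi, n)]
```

Because every grid point is evaluated independently, this step is easy to batch, and the cost per point depends only on the number of shape elements and the size of the local decoder.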
The exact configuration of the encoder architecture varies with input data type. We encode a 3D mesh by first rendering a stack of 20 depth images at 224 x 224 resolution from a fixed set of equally spaced views surrounding the object. We then give the depth images to an early-fusion ResNet50 [he2016resnet] to regress the shape element parameters $\Theta$. Meanwhile, we generate a set of 10K points with normals covering the whole shape by estimating normals from the depth image(s) and unprojecting randomly selected pixels to points in object space using the known camera parameters. Then, for each shape element, we select a sampling of 1K points with normals within the region of influence defined by the predicted analytic shape function, and pass them to a PointNet [qi2016pointnet] to generate the latent code $\mathbf{z}_i$. Alternatively, we could have encoded 3D input surfaces with CNNs based on mesh, point, or voxel convolutions, but found this processing pipeline to provide a good balance between detail, attention, efficiency, and memory. In particular, since the local geometry of every shape element is encoded independently with a PointNet, it is difficult for the network to “memorize” global shapes and it therefore generalizes better.
We encode a depth image with known camera parameters by first converting it into a 3 channel stack of 224 x 224 images representing the XYZ position of every pixel in object coordinates. We then feed those channels into a ResNet50 to regress the shape element parameters $\Theta$, and we regress the latent codes $\mathbf{z}_i$ for each shape element using the same process as for 3D meshes.
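The depth-to-XYZ conversion can be illustrated with a standard pinhole unprojection. The camera model is an assumption: the intrinsics `fx, fy, cx, cy`, the interpretation of depth as distance along the optical axis, and the 3x4 camera-to-object matrix `cam_to_obj` are hypothetical names chosen for this sketch, since the text only states that the camera parameters are known.

```python
def depth_to_xyz(depth, fx, fy, cx, cy, cam_to_obj):
    """Convert a depth image (list of rows) to per-pixel XYZ in object coordinates.

    cam_to_obj is a 3x4 row-major [R | t] matrix mapping camera space to object space.
    """
    xyz = []
    for v, row in enumerate(depth):
        out_row = []
        for u, d in enumerate(row):
            # Pinhole unprojection into camera space.
            pc = ((u - cx) * d / fx, (v - cy) * d / fy, d)
            # Rigid transform into object space.
            out_row.append(tuple(
                sum(cam_to_obj[r][k] * pc[k] for k in range(3)) + cam_to_obj[r][3]
                for r in range(3)
            ))
        xyz.append(out_row)
    return xyz
```

The output has the same height and width as the input with three values per pixel, which is the layout needed to feed it to an image encoder as a three-channel input.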
4.1 Training Losses
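The pipeline is trained with the following loss $\mathcal{L}$:
$$\mathcal{L}(\Theta, \mathbf{Z}) = w_P\,\mathcal{L}_P(\Theta, \mathbf{Z}) + w_C\,\mathcal{L}_C(\Theta)$$
Point Sample Loss $\mathcal{L}_P$.
The first loss measures how accurately $\mathrm{DSIF}(\mathbf{x})$ predicts inside/outside of the ground-truth shape. To compute it, we sample 1024 points near the ground truth surface (set $\mathcal{S}$) and 1024 points uniformly at random in the bounding box of the shape (set $\mathcal{U}$). We combine them with weights $w_i \in \{w_{\mathcal{S}}, w_{\mathcal{U}}\}$ to form set $\mathcal{P} = \mathcal{U} \cup \mathcal{S}$. The near-surface points are computed using the sampling algorithm of [genova2019learning]. We scale by a hyperparameter $\alpha$, apply a sigmoid to the decoded value $\mathrm{DSIF}(\mathbf{x})$, and then compute an $L_2$ loss to the ground truth indicator function $I(\mathbf{x})$ (see [genova2019learning] for details):
$$\mathcal{L}_P(\Theta, \mathbf{Z}) = \frac{1}{|\mathcal{P}|} \sum_{\mathbf{x}_i \in \mathcal{P}} w_i \,\bigl\lVert \mathrm{sig}\bigl(\alpha\,\mathrm{DSIF}(\mathbf{x}_i, \Theta, \mathbf{Z})\bigr) - I(\mathbf{x}_i) \bigr\rVert_2^2$$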
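As a rough illustration of the point sample loss, the sketch below scales each decoded value, applies a sigmoid, and averages a weighted squared error against the ground-truth indicator. The squared-error form, the averaging over samples, and the argument names are assumptions; the sign and magnitude of the scale hyperparameter depend on the model's inside/outside convention and are not specified here.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def point_sample_loss(values, labels, weights, alpha):
    """Weighted mean squared error between sigmoid(alpha * value) and the indicator labels.

    values:  decoded implicit function values at the sample points
    labels:  ground-truth indicator values at the same points
    weights: per-point weights (e.g. one weight for near-surface, one for uniform samples)
    """
    total = 0.0
    for v, y, w in zip(values, labels, weights):
        total += w * (sigmoid(alpha * v) - y) ** 2
    return total / len(values)
```

Weighting the two sample sets separately lets training emphasize near-surface points, where small errors in the implicit function move the reconstructed surface the most.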
Shape Element Center Loss $\mathcal{L}_C$.
The second loss encourages the center of every shape element to reside within the target shape. To compute it, we estimate a signed distance function on a low-res 32x32x32 grid $G$ for each training shape. The following loss is applied based on the grid value $G(\mathbf{p}_i)$ at the center $\mathbf{p}_i$ of each shape element:
$$\mathcal{L}_C(\Theta) = \frac{1}{N} \sum_{\theta_i \in \Theta} \begin{cases} G(\mathbf{p}_i)^2 & G(\mathbf{p}_i) > \beta \\ 0 & G(\mathbf{p}_i) \le \beta \end{cases}$$
Here, $\beta$ is a threshold chosen to account for the fact that $G$ is coarse. It is set to half the width of a voxel cell in $G$. This setting makes it a conservative loss: it says that when $\mathbf{p}_i$ is definitely outside the ground truth shape, $\mathbf{p}_i$ should be moved inside. $\mathcal{L}_C$ never penalizes a center that is within the ground truth shape.
It is also possible for the predicted center to lie outside the bounding box of $G$. In this case, there is no gradient for $\mathcal{L}_C$, so we instead apply the inside-bounding-box loss from [genova2019learning] using the object-space bounds of $G$.
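The conservative behavior of the center loss can be sketched with a nearest-cell lookup into a coarse signed distance grid. Several choices here are assumptions made for illustration: the grid is cubic with bounds `[lo, hi)` on every axis, positive values mean outside, the penalty is the squared grid value above the threshold, and centers outside the grid receive a squared distance to the box as a simple stand-in for the inside-bounding-box loss of the cited work. A trainable version would also need an interpolated, differentiable lookup.

```python
def center_loss(centers, sdf_grid, lo, hi, beta):
    """Penalize element centers that are clearly outside the shape.

    sdf_grid is an n x n x n nested list of signed distances (positive = outside).
    """
    n = len(sdf_grid)
    cell = (hi - lo) / n
    total = 0.0
    for p in centers:
        if all(lo <= p[a] < hi for a in range(3)):
            # Nearest-cell lookup of the coarse signed distance.
            i, j, k = (min(int((p[a] - lo) / cell), n - 1) for a in range(3))
            g = sdf_grid[i][j][k]
            if g > beta:
                total += g * g  # only penalize centers that are definitely outside
        else:
            # Fallback outside the grid: squared distance to the bounding box.
            total += sum(max(lo - p[a], 0.0, p[a] - hi) ** 2 for a in range(3))
    return total / len(centers)
```

Setting `beta` to half a cell width, as described above, keeps the loss at zero for centers whose coarse distance estimate could still correspond to a point inside the true shape.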
5 Experimental Setup
We execute a series of experiments to evaluate the proposed DSIF shape representation, compare it to alternatives, study the effects of its novel components, and test it in applications. Unless otherwise noted, all experiments use the same default number of shape elements $N$ and $M = 32$ dimensional latent vectors (the latent code length given in Section 3).
Datasets. When not otherwise specified, experiments are run on the ShapeNet dataset [chang2015shapenet]. We use the train and test splits from 3D-R2N2 [choy20163d]. We additionally subdivide the train split to create an 85%, 5%, 10% train, validation, and test distribution. We pre-process the shapes to make them watertight using the depth fusion pipeline from Occupancy Networks [mescheder2019occ]. We train models multi-class (all 13 classes together) and show examples only from the test split.
Metrics. We evaluate shape reconstruction results with mean intersection-over-union (IoU) [mescheder2019occ], mean Chamfer distance [mescheder2019occ], and mean F-Score [tatarchenko2019single] at a fixed distance threshold $\tau$. As suggested in [tatarchenko2019single], we find that IoU is difficult to interpret for low values and that Chamfer distance is sensitive to outliers, so we focus our discussions mainly on F-Scores.
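For readers unfamiliar with the metric, the F-Score between two point sets is the harmonic mean of precision (the fraction of predicted points within a distance threshold of the ground truth) and recall (the fraction of ground-truth points within the threshold of the prediction). The brute-force sketch below assumes both surfaces have already been sampled into point lists and uses a "less than or equal" comparison; the threshold value and sampling density are left to the caller.

```python
import math

def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    def fraction_within(src, dst):
        hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) <= tau)
        return hits / len(src)

    precision = fraction_within(pred, gt)  # predicted points close to ground truth
    recall = fraction_within(gt, pred)     # ground-truth points close to prediction
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because each point contributes at most one hit, a few distant outliers lower the score only slightly, in contrast to Chamfer distance, where they can dominate the mean. Scores are commonly reported multiplied by 100.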
Baselines. We compare most of our results to the two most related prior works: Occupancy Networks [mescheder2019occ] (OccNet), the state-of-the-art in deep implicit functions, and Structured Implicit Functions [genova2019learning] (SIF), the state-of-the-art in structural decomposition.
6 Experimental Evaluation
In this section, we report results of experiments that compare DSIF and baselines with respect to how well they satisfy desirable properties of a 3D shape representation.
Accuracy. Our first experiment compares 3D shape representations in terms of how accurately they can encode/decode shapes. For each representation, we train a 3D→3D autoencoder on the multiclass training data, use it to reconstruct shapes in the test set, and then evaluate how well the reconstructions match the originals (Table 1). DSIF’s mean F-Score is 92.2, 10.3 points higher than OccNet and 33.2 points higher than SIF. A more detailed breakdown of the results appears in Figure 2, which shows the F-Scores for all models in the test set: DSIF improves on OccNet’s score for 93% of test shapes. The increase in accuracy translates into a large qualitative improvement in results (shown above in Figure 2). For example, DSIF often reproduces better geometric details (e.g., back of the bench) and handles unusual part placements more robustly (e.g., handles on the rifle).
Efficiency. Our second experiment compares the efficiency of 3D shape representations in terms of accuracy vs. storage/computation costs. Since DSIF can be trained with different numbers of shape elements ($N$) and latent feature sizes ($M$), a family of DSIF representations is possible, each with a different trade-off between storage/computation and accuracy. Figure 3 investigates these trade-offs for several combinations of $N$ and $M$ and compares the accuracy of their autoencoders to baselines. Looking at the plot on the top, we see that DSIF performs comparably to OccNet and outperforms SIF at the same number of bytes, and is capable of scaling to larger embeddings. Similarly, the plot on the bottom shows that DSIF provides more accurate reconstructions than baselines at every decoder size: our decoder with the default $N$ and $M$ is roughly 0.4% the size of OccNet (8.6K vs. 2M parameters, Section 3) and provides a better F-Score.
Consistency. Our third experiment investigates the ability of DSIF to decompose shapes consistently into shape elements. This property was explored at length in [genova2019learning] and shown to be useful for structure-aware correspondences, interpolations, and segmentations. While not the focus of this paper, we find qualitatively that the consistency of the DSIF representation is slightly superior to SIF, because the shape element symmetries and rotations introduced in this paper provide the DoFs needed to decompose shapes with fewer elements. On the other hand, the local DIFs are able to compensate for imperfect decompositions during reconstruction, which puts less pressure on consistency. Figure 4 shows qualitative results of the decompositions computed for DSIF. Please note the consistency of the colors (indicating the index of the shape element) across a broad range of shapes.
Generalizability. Our fourth experiment studies how well trained autoencoders generalize to handle unseen shape classes. To test this, we used the autoencoders trained on the 3D-R2N2 classes and tested them without fine-tuning on a random sampling of meshes from 10 ShapeNet classes that were not seen during training. Table 2 shows that the mean F-Score for DSIF on these novel classes is 84.4, which is 17.8 points higher than OccNet and 41.4 points higher than SIF. Looking at the F-Score for every example in the bottom of Figure 5, we see that DSIF is better on 91% of examples. We conjecture this is because DSIF learns to produce consistent decompositions for a broad range of input shapes when trained multiclass, and because the local TinyOccNets learn to predict shape details only for local regions with their limited capacity. This two-level factoring of structure and detail seems to help DSIF generalize.
Domain-independence. Our fifth experiment investigates whether DSIF can be used in application domains beyond the man-made shapes found in ShapeNet. As one example, we trained DSIF without any changes to autoencode meshes of human bodies in a wide variety of poses sampled from [varol17_surreal]. Specifically, we generated 5M meshes by randomly sampling SMPL parameters (CAESAR fits for shape, mocap sequence fits for pose). We used an 80/5/15 train/val/test split similar to [varol17_surreal] and measured the error of the learned autoencoder on the held-out test set. The challenge for this dataset is quite different than for ShapeNet: the autoencoder must be able to represent large-scale, non-rigid deformations in addition to shape variations. We report the mIOU of our reconstructions in comparison to SIF. The results of DSIF reconstructions and the underlying SIF templates are shown in Figure 6. Despite a lack of supervision on pose or subject alignment, our approach reconstructs a surface close to the original and establishes coarse correspondences via the structure decomposition.
7 Applications
In this section, we investigate how the proposed DSIF representation can be used in applications. Although SIF (and similarly DSIF) has previously been shown useful for 3D shape analysis applications like structure-aware shape interpolation, surface correspondence, and image segmentation [genova2019learning], we focus our study here on 3D surface reconstruction from partial observations.
7.1 3D Completion from a Single Depth Image
Task. Reconstructing a complete 3D surface from a single depth image is an important vision task with applications in AR, robotics, etc. To investigate how DSIF performs on this task, we modified our network to take a single depth image as input (rather than a stack of 20) and trained it from scratch on depth images generated synthetically from random views of the 3D-R2N2 split of shapes. The depth images were 512x512 to approximate the resolution of real depth sensors (though all CNN inputs remain 224x224 due to memory restrictions). The depth images were rendered from viewpoints sampled from all view directions and at variable distances to mimic the variety of scan poses. Each depth image was then converted to a three-channel XYZ image using the known camera parameters.
Baseline. For comparison, we trained an OccNet network from scratch on the same data. Because the OccNet takes a point cloud rather than depth images, we train an XYZ image encoder network to regress the 256-D OccNet embedding. This OccNet* model provides an apples-to-apples baseline that isolates differences due only to the representation decoding part of the pipeline.
Results. Table 3 shows results of this 3D depth completion experiment. We find that the F-Score of DSIF is 15.8 points higher than OccNet* (78.2 vs. 62.4). Figure 7 highlights the difference in the methods qualitatively. As in the 3D case, we observe that DSIF’s local part encoders result in substantially better performance on hard examples.
Ablation study. To further understand the behavior of DSIF during depth completion, we ablate three components of our pipeline (Table 4). First, we verify that having local PointNets to encode the local feature vectors is useful, rather than simply predicting them directly from the input image. Second, we show that providing an XYZ image as input to the network is much more robust than providing a depth image. Finally, we show that taking advantage of the explicit structure via partial symmetry improves results qualitatively and achieves the same quality with fewer degrees of freedom. The biggest of these differences is due to the PointNet encoding of local shape elements: disabling it reduces the F-Score by 11.4 points.
7.2 Reconstruction of Partial Human Body Scans
Task. Acquisition of complete 3D surface scans for a diverse collection of human body shapes has numerous applications [allen2003space]. Unfortunately, many real world body scans have holes (Figure 8(a)), due to noise and occlusions in the scanning process. We address the task of learning to complete and beautify the partial 3D surfaces without any supervision or even a domain-specific template.
Dataset and baselines. The dataset for this experiment is CAESAR [robinette2002civilian]. We use our proposed 3D autoencoder to learn to reconstruct a DSIF for every scan in the CAESAR dataset (using the splits from [pishchulin17pr]), and then we extract a watertight surface from each DSIF. For comparison, we do the same for SIF (another unsupervised method) and a non-rigid deformation fit of the S-SCAPE template [pishchulin17pr].
Results. Figure 8 shows representative results. Note that DSIF captures high-frequency details missing in SIF reconstructions. Although the approach based on S-SCAPE provides better results, it requires a template designed specifically for human bodies as well as manual supervision (landmarks and bootstrapping), whereas DSIF is domain-independent and unsupervised. These results suggest that DSIF could be used for 3D reconstruction of other scan datasets where templates are not available.
8 Conclusion
Summary of research contributions: In this paper, we propose Deep Structured Implicit Functions (DSIF), a new 3D representation that describes a shape implicitly as the sum of local 3D functions, each evaluated as the product of a Gaussian and a residual function predicted with a deep network. We describe a method for inferring a DSIF from a 3D surface or posed depth image by first predicting a structured decomposition into shape elements, encoding 3D points within each shape element using PointNet [qi2016pointnet], and decoding them with a residual TinyOccNet [mescheder2019occ]. This approach provides an end-to-end framework for encoding shapes in local regions arranged in a global structure.
We show that this DSIF representation improves both reconstruction accuracy and generalization behavior over previous work – its F-Score results are better than the state-of-the-art [mescheder2019occ] by 10.3 points for 3D autoencoding of test models from trained classes and by 17.8 points for unseen classes. We show that it dramatically reduces network parameter count – its local decoder requires approximately 0.4% of the parameters used by [mescheder2019occ]. We show that it can be used to complete posed depth images – its depth completion results are 15.8 percentage points higher than [mescheder2019occ]. Finally, we show that it can be used without change to reconstruct complete 3D surfaces of human bodies from partial scans.
Limitations and future work: Though the results are encouraging, there are limitations that require further investigation. First, we decompose space into a flat set of local regions – it would be better to consider a multiresolution hierarchy. Second, we leverage known camera poses when reconstructing shapes from depth images – it would be better to estimate them. Third, we estimate a constant number of local regions – it would be better to derive a variable number dynamically during inference (e.g., with an LSTM). Finally, we just scratch the surface of how structured and implicit representations can be combined – this is an interesting topic for future research.
Acknowledgements: We thank Daniel Vlasic for invaluable discussions and help generating CAESAR data. We also thank Boyang Deng for sharing OccNet* training code, Max Jiang for creating single-view depth renders, and Fangyin Wei and JP Lewis for feedback on the manuscript.
Appendix A Hyperparameters
Table 5 contains all hyperparameter values used for training the model. Architecture details for individual networks are below.
ResNet50. We use a ResNet50 [he2016resnet] V2 that is trained from scratch. The 20 depth images are concatenated channel-wise prior to encoding.
PointNet. We modify the original PointNet [qi2016pointnet] architecture by removing the 64x64 orthogonal transformation to improve speed and reduce memory requirements.
OccNet. Our TinyOccNet variant follows the same structure as the original OccNet [mescheder2019occ]. However, we reduce the number of residual blocks from 5 to 1. The latent layer feature widths are also decreased proportionally to the vector dimensionality.
Local Point Cloud Extraction. We sample a subset of points for encoding by the local PointNet as follows. We first transform all 10,000 points to the local frame. Then we choose a distance threshold $d$, measured in local units. Since the local frame is scaled proportionally to the radius, this threshold is approximately four radii in the world frame. We randomly sample 1,000 points without replacement from those within distance $d$ and return them as the set of points to be encoded. If fewer than 1,000 such points exist, we expand $d$ until 1,000 total points are found.
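One possible reading of this procedure is sketched below: points are assumed to be already expressed in the element's local frame, the default threshold of 4.0 local units is inferred from the "approximately four radii" remark, and "expanding until 1,000 points are found" is implemented as falling back to the nearest points. All three are assumptions, as are the function and parameter names.

```python
import math
import random

def select_local_points(local_points, threshold=4.0, k=1000, rng=None):
    """Pick k points near the element origin from points given in its local frame."""
    rng = rng or random.Random()
    if len(local_points) < k:
        raise ValueError("need at least k points to sample without replacement")

    def norm(p):
        return math.sqrt(sum(c * c for c in p))

    by_distance = sorted(local_points, key=norm)
    inside = [p for p in by_distance if norm(p) <= threshold]
    if len(inside) >= k:
        # Enough points in the region of influence: sample without replacement.
        return rng.sample(inside, k)
    # Too few points: growing the threshold until k points are covered
    # is equivalent to taking the k nearest points.
    return by_distance[:k]
```

Always returning exactly `k` points keeps the input to the point encoder a fixed size, even for elements that land in sparsely sampled regions.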
Global Point Cloud Creation. In order to create 10,000 points from one or more input depth images, we randomly sample valid points without replacement from the depth images. If 10,000 valid pixels do not exist, we repeat random points as necessary before moving to the local extraction phase.
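The sample-or-repeat rule can be sketched in a few lines. Here `valid_points` is assumed to be the list of points already unprojected from valid depth pixels, and the function name and the decision to keep every original point before adding repeats are illustrative choices.

```python
import random

def sample_global_points(valid_points, k=10000, rng=None):
    """Return exactly k points: a sample without replacement, or all points plus repeats."""
    rng = rng or random.Random()
    if not valid_points:
        raise ValueError("no valid points to sample from")
    if len(valid_points) >= k:
        return rng.sample(valid_points, k)
    # Too few valid points: keep all of them and pad with random repeats.
    repeats = [rng.choice(valid_points) for _ in range(k - len(valid_points))]
    return list(valid_points) + repeats
```

Padding with repeats rather than synthetic points means the cloud never contains geometry that was not observed in the depth images.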
Activation Functions. Since the range of the raw network activations differs from the valid ranges of the analytic parameters, we apply activation functions to the latents to interpret them as the analytic parameters $\theta_i$. A separate function is used for each parameter group: the constants $c_i$, the ellipsoid radii $\mathbf{r}_i$, the ellipsoid Euler angles $\mathbf{e}_i$, and the ellipsoid positions $\mathbf{p}_i$.