1 Introduction
In recent years, deep implicit functions (DIF) have gained much popularity as a 3D shape representation in applications such as compression [deepimplicitcompression], shape completion [dai2017shape], neural rendering [mildenhall2020nerf, tewari2020state], and super-resolution [chibane2020implicit]. In contrast to explicit representations such as point clouds, voxels, or meshes, a 3D shape is encoded into a compact latent vector, which, when combined with a sampled 3D location as input to a decoder, can be used to evaluate an implicit function for surface reconstruction.
In this paper, our objective is to design a DIF for shape representation that has three main properties: (1) represent shapes with arbitrarily fine details (adding more bits to the representation provides more details), (2) support both encoder-decoder inference and decoder-only latent optimization, and can be applied to different tasks, and (3) enable detail-preserving shape completion from inputs with large unobserved regions. These properties are all important for a shape representation. Yet, to the best of our knowledge, no prior method has achieved all three properties.
Existing DIF methods can be classified into global and local approaches. Early methods mostly belong to the global category [park2019deepsdf, chen2019learning, mescheder2019occupancy, xu2019disn, michalkiewicz2019deep], where a single latent vector is used to represent the whole shape. These approaches learn to encode a global shape prior in a compact latent space, which can then be leveraged to fulfill various reconstruction tasks. However, due to the limited capacity of the latent space and the global nature of these approaches, global methods usually lack fine-grained detail.
More recently, local approaches [jiang2020local, chabra2020deep] have been proposed. These methods divide the space into local regions and encode each one with a latent vector. Such local representations provide better accuracy and generalization when representing shapes, especially under decoder-only latent optimization. However, they do not model a global prior. As a result, they cannot be used for shape completion with large unobserved regions, since in such regions there is no data to optimize the latent vectors. To overcome this issue, [Genova_2020_CVPR, chibane2020implicit] use an encoder to regress local latent vectors from incomplete inputs. However, their methods are limited to encoder-decoder inference when doing shape completion. Compared to decoder-only latent optimization, encoder-decoder inference has less flexibility on the inputs and is less accurate for preserving detail in observed regions.
In this paper, we propose a novel 3D representation: Multiresolution Deep Implicit Function (MDIF). The core idea is to represent a shape as a multiresolution hierarchy of latent vectors, where each level encodes different frequencies of an implicit function. The higher levels of our representation provide the global shape and the lower levels provide fine detail. Different from local methods [Genova_2020_CVPR, chibane2020implicit], MDIF has a one-decoder-per-level architecture, where each decoder produces a residual with respect to its parent level, like a Haar wavelet [chui2016introduction]. This simplifies learning of fine detail and enables progressive decoding to achieve arbitrary levels of detail (see Figure 1).
To enable detailed shape completion with decoder-only latent optimization, we further propose to use a global connection across levels and to apply dropout on the latent codes. The global connection serves to integrate global priors into lower levels to compensate for missing observations. Meanwhile, applying dropout on the latent codes simulates partial observation in the latent space during training, and therefore forces the decoders to learn to complete shapes in an encoder-less scenario.
Overall, our model has the following merits:

Can represent complex shapes with high accuracy, and allows progressive decoding for different levels of detail.

Supports both encoder-decoder inference and decoder-only latent optimization, and is effective for different applications as illustrated in the experimental results.

Enables detailed decoder-only shape completion that accurately preserves detail in observed regions while producing plausible results in unobserved regions.
2 Related Work
There are largely two types of 3D geometry representations in computer graphics and vision. Explicit representations, such as meshes, splines, and point clouds, are widely adopted in the fields of CAD and animation, since they are compact and highly optimized for editing and rendering. Implicit representations, such as the zero-level set of a signed distance field, have gained increasing popularity in volumetric capture [motion2fusion, kinfu, dynfu, dou2016fusion4d, tsdf], since they can represent arbitrary surface topology and define watertight surfaces.
Convolutional neural networks (CNNs) have been proposed for predicting an implicit representation of objects. Early techniques were only able to predict low-resolution grids [girdhar2016learning, choy20163d, wu20153d]. More recently, methods relying on an octree structure have been proposed [tatarchenko2017octree, hane2019hierarchical, riegler2017octnetfusion, Wang:2017:OOC:3072959.3073608, wang2020deep] to avoid the cubic growth inherent to high-resolution grids. However, the implicit representation learnt by these networks is still discrete, potentially creating discretization artefacts when reconstructing 3D shapes. To overcome this limitation and allow for learning the implicit representation over the continuous domain, the problem can be reformulated as a multilayer perceptron (MLP) which takes the location at which the implicit representation is to be evaluated as input [park2019deepsdf, chen2019learning, mescheder2019occupancy, xu2019disn, michalkiewicz2019deep]. This allows for querying the implicit representation at continuous locations at test time. Termed Deep Implicit Functions (DIF), this technique can be categorized into global, local, and hierarchical methods.
Global methods
Global methods represent a 3D shape with a single holistic latent code. The projection to the latent space can be done via an encoder [chen2019learning] or latent optimization [park2019deepsdf]. A decoder is then used to recover the shape from the latent vector. To obtain a smooth manifold on the latent space for shape generation, researchers have developed optimization strategies based on auto-decoding [park2019deepsdf], curriculum learning [duan2020curriculum], and adversarial training [kleineberg2020adversarial]. Global methods are robust to local noise, and hence have good shape completion capability. However, these approaches have difficulty recovering fine detail. Recent methods [sitzmann2020implicit, tancik2020fourfeat, mildenhall2020nerf] propose to use periodic activation functions to lift the input positional vector to a high-dimensional space, which better preserves high-frequency detail. However, these methods focus on per-instance fitting instead of generalization to new scenes and objects.
Local methods
In contrast, local methods uniformly divide the 3D space into local grids [jiang2020local, chabra2020deep, chibane2020implicit] or use an encoder to decompose space into local parts [genova2019learning, Genova_2020_CVPR]. Then they either assign each local grid/part a latent code [jiang2020local, chabra2020deep, Genova_2020_CVPR] or trilinearly interpolate feature grids to obtain the latent code at each query location [chibane2020implicit]. Since each latent code only needs to represent the shape in a local region, it is much easier to encode detail and generalize to unseen objects. However, these methods do not include global context, hence it is not feasible to perform decoder-only shape completion when there are large unobserved regions. While the feature grids used by [chibane2020implicit] span multiple resolutions, they still do not contain global context and are only used to represent a single level of detail.
Hierarchical methods
Some methods perform shape reconstruction in stages, where a low-resolution shape prediction precedes a high-resolution prediction [dai2017shape, jiang2020sdfdiff, hanocka2019meshcnn, wang2018global]. For example, Global-to-Local Generative Models [wang2018global] decode a global voxel grid and then add detail with a part-wise refiner. NSVF [liu2020neural] uses an octree hierarchy of implicit functions to represent the radiance field for neural rendering. Though their motivation for using an octree is similar to ours, they do not use it to represent a globally consistent shape, but rather a view-dependent radiance function suitable for view synthesis. They would not be able, for example, to perform shape completion.
3 Methodology
Our overarching goal is to design a flexible representation that can generate shapes from coarse to fine resolutions for reconstruction or completion tasks. Depending on the application, our model can perform inference in the encoder-decoder mode for efficiency or in the decoder-only latent optimization mode for better accuracy. To achieve this, our pipeline, shown in Figure 2, encodes the input SDF into multiple levels of latent codes. Each level has a decoder that reconstructs at a different level of detail. To detail our design, we first formulate a multiresolution representation in the form of a traditional implicit function in Section 3.1. Then in Section 3.2, we explain how to design a deep neural network version of this representation. The training process is described in Section 3.3. Finally, in Section 3.4, we explain the different inference modes with respect to different applications.
3.1 Multires Implicit Function
We choose to learn the signed distance function (SDF), which is a level set defined as:
(1) 
where is the volume containing the shape, is a 3D point inside , and is the SDF function that represents the signed distance to the closest surface (positive on the outside and negative on the inside). We then use to represent the SDF value of a particular point , and to represent the surface or zerocrossing.
Now we can define a multi-level version of this SDF, where each level represents a different frequency of detail, from low to high. To construct this, we subdivide the volume into an octree with one layer per level. Unlike conventional octrees, which only subdivide non-empty cells, our tree is balanced because completing partial observations is one of our target scenarios.
For level 0 (the coarsest level), geometry is represented as an SDF; for each finer level, we use a residual to capture finer details, as shown in Figure 3. The final SDF reconstruction is therefore defined as the sum of the level-0 SDF and the residuals of all finer levels. In Section 4.2, we empirically show that inferring residuals yields better performance compared to directly regressing the SDF.
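The coarse-level-plus-residuals construction can be illustrated with a 1D analogy. The sketch below is not the octree code of the method: it approximates a sampled 1D signal by block averages over progressively halved cells, stores each finer level as a residual with respect to the levels above it, and reconstructs by summation up to a chosen level. The function names (`block_average`, `residual_pyramid`, `reconstruct`) and the toy signal are hypothetical.

```python
def block_average(values, cell_size):
    """Piecewise-constant approximation: replace each cell by its mean."""
    out = []
    for start in range(0, len(values), cell_size):
        cell = values[start:start + cell_size]
        mean = sum(cell) / len(cell)
        out.extend([mean] * len(cell))
    return out


def residual_pyramid(signal, num_levels):
    """Return [coarse, residual_1, ...]; level l uses cells of size len(signal) / 2**l."""
    levels = []
    approx = [0.0] * len(signal)
    for level in range(num_levels):
        cell_size = len(signal) // (2 ** level)
        # What the coarser levels have not explained yet.
        remaining = [s - a for s, a in zip(signal, approx)]
        component = block_average(remaining, cell_size)
        levels.append(component)
        approx = [a + c for a, c in zip(approx, component)]
    return levels


def reconstruct(levels, up_to_level):
    """Progressive decoding: sum the coarse level and residuals up to `up_to_level`."""
    total = [0.0] * len(levels[0])
    for component in levels[:up_to_level + 1]:
        total = [t + c for t, c in zip(total, component)]
    return total


signal = [0.9, 0.7, 0.2, -0.1, -0.3, -0.2, 0.4, 0.8]  # toy 1D "SDF" samples
levels = residual_pyramid(signal, num_levels=4)
for k in range(4):
    approx = reconstruct(levels, k)
    err = sum((s - a) ** 2 for s, a in zip(signal, approx))
    print(f"levels 0..{k}: squared error = {err:.4f}")
```

Stopping the summation early yields a coarser but complete approximation, which is the property that progressive decoding relies on.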
3.2 Multires Deep Implicit Function
The idea of a deep version of the multires implicit function is to encode the shape in each cell of the octree into a latent code with a DNN. For the cell at level 0, the latent code represents an SDF, while for a cell at a finer level, the latent code encodes residuals. Eventually we end up with a tree of latent codes, where the latent codes in the cells of each level form a latent grid at that level. The spatial resolution and total capacity of the latent grids increase with the level, and consequently the level of detail gets higher.
In Figure 2, we describe the design of our network architecture to encode the shape into this tree of latent codes. On the encoder side, the input is an SDF sampled on a regular grid. The encoder first extracts a global feature from the input grid. This feature is then encoded into the latent grids of the different levels through 3D convolution layers. Note that at level 0 there is only one latent code representing the global shape, which is critical for completion tasks.
On the decoder side, unlike [chibane2020implicit], our model has one decoder per level to support different levels of detail. For the decoder module at each level, we choose IM-Net [chen2019learning], which consists of several fully-connected layers. The remaining question is what to input to the decoders. At the global level (level 0), since there is only one latent code, the decoder simply takes this latent code and a 3D position as input, and decodes the SDF value at that point. For higher levels, the input of the decoder consists of two parts. For the first part, similar to [chibane2020implicit], we use trilinear interpolation to sample a latent code from the latent grid of this level at the query 3D location. For the second part, we first apply deconvolution to upsample the level-0 latent code to a latent grid that has the same spatial resolution as the latent grid of this level. Trilinear interpolation is then also applied to sample a latent code from this upsampled grid. This gives the decoder access to the global context, which helps it decode local details and compensate for missing data during shape completion. We call this global connection. Formally,
(2) 
Note that for levels above 0, the decoders do not need to take 3D positions as input, because both sampled latent codes are already functions of the query position via trilinear interpolation. Finally, since each of these decoders predicts a residual, the outputs of all levels are aggregated to obtain the final SDF. For the detailed network architecture, please refer to our supplementary material.
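The decoder input for a level above 0 can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the network code: latent grids are nested lists indexed as `grid[i][j][k]`, query points lie in the unit cube with 0 and 1 mapped to the first and last grid nodes, and the two sampled codes are combined by concatenation (the text only says the input "consists of two parts"). The names `trilinear_sample` and `decoder_input` are hypothetical.

```python
def trilinear_sample(grid, point):
    """Trilinearly interpolate the latent codes stored at grid nodes.

    grid[i][j][k] is a list of floats; point = (x, y, z) in [0, 1]^3.
    """
    res = len(grid)
    dim = len(grid[0][0][0])
    if res == 1:  # a single code: nothing to interpolate
        return list(grid[0][0][0])
    coords = [min(max(p, 0.0), 1.0) * (res - 1) for p in point]
    lo = [min(int(c), res - 2) for c in coords]  # lower corner of the enclosing cell
    frac = [c - l for c, l in zip(coords, lo)]
    out = [0.0] * dim
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                weight = ((frac[0] if di else 1.0 - frac[0]) *
                          (frac[1] if dj else 1.0 - frac[1]) *
                          (frac[2] if dk else 1.0 - frac[2]))
                code = grid[lo[0] + di][lo[1] + dj][lo[2] + dk]
                for c in range(dim):
                    out[c] += weight * code[c]
    return out


def decoder_input(local_grid, upsampled_global_grid, point):
    """Local code followed by the global-connection code at the same query point."""
    return (trilinear_sample(local_grid, point) +
            trilinear_sample(upsampled_global_grid, point))


res = 2
local_grid = [[[[float(i), float(j), float(k)] for k in range(res)]
               for j in range(res)] for i in range(res)]
global_grid = [[[[1.0, -1.0] for _ in range(res)]
                for _ in range(res)] for _ in range(res)]
features = decoder_input(local_grid, global_grid, (0.25, 0.5, 1.0))
print(len(features), features)
```

Because both codes are sampled at the query position, the resulting feature vector already varies with that position, which is why the position itself does not have to be appended for these levels.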
3.3 Training
MDIF is trained end-to-end in an encoder-decoder fashion because: (1) it allows both encoder-decoder inference and decoder-only latent optimization to be available at test time; (2) training with an encoder is generally more efficient compared to training in decoder-only mode, since latent codes are not initialized randomly.
Points Sampling
We generate regular SDF grids as the input of the encoder. In addition, the decoders require a 3D point set as training data. Similar to [Genova_2020_CVPR], we sample a uniform point set inside the object bounding box, as well as a near-surface point set, for each training object. Each point set has 100K samples. Mixing the two gives us the final training set, which implicitly applies more weight to the near-surface points. At each training iteration, 4096 samples are randomly drawn from each set.
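The sampling scheme can be sketched on an analytic shape. The example below is only an illustration: it uses a sphere as a stand-in for a training object, Gaussian jitter for the near-surface set, and sampling without replacement for each batch, all of which are assumptions. The function names and the `noise_std` value are hypothetical, and the sets are smaller than the 100K samples mentioned above to keep the example fast.

```python
import math
import random


def sphere_sdf(point, radius=0.5):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.sqrt(sum(c * c for c in point)) - radius


def sample_uniform(count, rng, half_extent=1.0):
    """Uniform samples inside an axis-aligned bounding box."""
    return [tuple(rng.uniform(-half_extent, half_extent) for _ in range(3))
            for _ in range(count)]


def sample_near_surface(count, rng, radius=0.5, noise_std=0.02):
    """Points on the sphere surface, jittered by small Gaussian noise."""
    points = []
    for _ in range(count):
        direction = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in direction))
        points.append(tuple(radius * c / norm + rng.gauss(0.0, noise_std)
                            for c in direction))
    return points


def draw_batch(uniform_set, near_surface_set, per_set, rng):
    """Draw `per_set` points from each set and attach ground-truth SDF values."""
    picked = rng.sample(uniform_set, per_set) + rng.sample(near_surface_set, per_set)
    return [(p, sphere_sdf(p)) for p in picked]


rng = random.Random(0)
uniform_set = sample_uniform(10000, rng)
near_surface_set = sample_near_surface(10000, rng)
batch = draw_batch(uniform_set, near_surface_set, per_set=4096, rng=rng)
print(len(batch))
```

Since half of every batch comes from the near-surface set, which covers a far smaller volume than the bounding box, points close to the surface are sampled much more densely.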
Loss
During training, our final loss is the summation of the losses at all levels. For each level, we first aggregate the predicted SDF and residuals up to this level, and then measure the L1 difference between the aggregate and the ground-truth SDF. Formally,
(3) 
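The multi-level loss described above can be sketched in a few lines. This is an illustration under assumptions rather than the training code: per-point errors are averaged within each level (the text only specifies an L1 difference), all levels are weighted equally, and the name `multilevel_l1_loss` is hypothetical.

```python
def multilevel_l1_loss(level_outputs, ground_truth):
    """Sum over levels of the mean L1 error of the aggregated prediction.

    level_outputs[0][n] is the level-0 SDF at point n; level_outputs[l][n]
    for l > 0 is the residual predicted by level l at point n.
    """
    num_points = len(ground_truth)
    aggregated = [0.0] * num_points
    per_level = []
    for outputs in level_outputs:
        # Prediction up to this level = previous aggregate + this level's output.
        aggregated = [a + o for a, o in zip(aggregated, outputs)]
        level_loss = sum(abs(a - g) for a, g in zip(aggregated, ground_truth)) / num_points
        per_level.append(level_loss)
    return sum(per_level), per_level


total, per_level = multilevel_l1_loss(
    level_outputs=[[0.5, -0.5], [0.1, 0.2]],  # level-0 SDF, then one residual level
    ground_truth=[0.6, -0.2],
)
print(total, per_level)
```

Supervising the aggregate at every level, rather than only the final sum, encourages each level to be a usable reconstruction on its own.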
Latent grid dropout
There are two standard ways to make a deep implicit function model work for completion tasks. The more conventional way, as illustrated in Figure 4 (a), is to train the model to take partial data as input and complete them. In this manner, the completion functionality is distributed between the encoder and the decoder, so different encoders need to be trained for different completion tasks. The other way is decoder-only latent optimization, where the encoder is not needed at test time and the latent code is optimized based on partial data [park2019deepsdf]. This approach provides higher accuracy on observed regions and directly generalizes to different completion modalities (depth image, partial scan, etc.) without retraining. However, it only works for global methods and cannot be applied to local methods. The reason is that, for unobserved regions with no data points, the corresponding local latent codes cannot be optimized and stay at their initial values. Such latent codes would then be decoded into wrong shapes by the decoder.
To address this, we propose to train with complete shapes but apply random dropout to the latent grids, as shown in Figure 4 (b). The motivation is to simulate partial data in the latent space rather than the input space, hence forcing the decoders to learn to complete shapes without an encoder. Specifically, for each level above 0, we apply spatial dropout to that level's latent grid but keep the full content of the latent grid upsampled from level 0, so that the decoder can utilize the global context from level 0. Note that our proposed multi-level architecture and global connection make this dropout strategy possible during training: it cannot be applied to other global or local approaches without substantial changes to their architectures.
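One reading of latent grid dropout is sketched below. It is an assumption-laden illustration, not the training code: a cell's entire latent code is zeroed with a given probability, surviving codes are not rescaled, and zero is used as the dropped value to mimic latent codes that were never optimized. The name `spatial_latent_dropout` and the rate used in the example are hypothetical.

```python
import random


def spatial_latent_dropout(latent_grid, rate, rng):
    """Return a copy in which each cell's whole latent code is zeroed with probability `rate`.

    latent_grid[i][j][k] is a list of floats. The input grid is not modified.
    """
    result = []
    for plane in latent_grid:
        new_plane = []
        for row in plane:
            new_row = []
            for code in row:
                if rng.random() < rate:
                    new_row.append([0.0] * len(code))  # simulate an unobserved cell
                else:
                    new_row.append(list(code))
            new_plane.append(new_row)
        result.append(new_plane)
    return result


rng = random.Random(0)
grid = [[[[1.0, 2.0] for _ in range(4)] for _ in range(4)] for _ in range(4)]
dropped = spatial_latent_dropout(grid, rate=0.5, rng=rng)
kept = sum(1 for plane in dropped for row in plane for code in row if any(code))
print(f"{kept} of 64 cells kept")
```

The grid upsampled from level 0 would be passed to the decoder unchanged, so the decoder always sees the global context even where local codes were dropped.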
3.4 Inference
We discuss our inference process with respect to autoencoding (complete observation) and shape completion (partial observation).
Autoencoding
MDIF supports both encoder-decoder inference and decoder-only latent optimization. For applications that emphasize efficiency, encoder-decoder inference is a better choice, as it only requires one feed-forward pass. For applications that require accuracy, decoder-only latent optimization is preferred.
Shape completion
Here we focus on shape completion from a single depth image via decoder-only latent optimization, due to its benefits in accuracy and generalizability. We initialize all latent codes as zeros. Similar to global methods, level 0 can be optimized to obtain a coarse but complete reconstruction. For higher levels, the decoder is trained to add detail onto the observed parts, while producing sparse residuals for the unobserved parts. For this optimization process to work, we need to properly sample points and modify the loss function to accommodate incomplete observation.
When sampling the point set from a depth image, since part of the shape is occluded, we cannot simply sample points in the full volume as in training. Instead, we apply ray-casting to sample camera-observable points and occluded points. For level 0, the loss function is the same as Equation 3 except that it is only applied to the visible points. For higher levels, the loss function is modified to contain two terms as follows:
(4) 
The first term minimizes the difference between the aggregated SDF prediction and the ground truth for visible points. The second term regularizes the residual of occluded points, such that the global shape from level 0 is preserved for the unobserved part. In particular, the second term uses the closest distance from an occluded point to the visible point set, normalized by a Gaussian. The corresponding hyperparameters are set empirically. We call the second term global consistency.
4 Experiments
In this section, we first validate the benefits of our proposed components by ablating important aspects (Section 4.2). Then, to evaluate the effectiveness of our approach, we compare with state-of-the-art methods on autoencoding 3D shapes (Section 4.3) and on applications including point cloud completion (Section 4.4), voxel super-resolution (Section 4.5) and shape completion from a depth image (Section 4.6). These experiments demonstrate the capability of our method under different tasks and inference modes. We use 5 levels for MDIF in the experiments. Note, however, that MDIF can use any number of levels. During decoder-only latent optimization, we fix all other network parameters and only optimize over the latent codes. Please refer to the supplementary material for more implementation details.
4.1 Dataset & Metrics
Following prior works [jiang2020local, Genova_2020_CVPR], we run the experiments on the ShapeNet dataset [shapenet2015] with the train/test splits from 3D-R2N2 [choy20163d], which contain a subset of 13 categories in ShapeNet. We use all 13 categories in our experiments except for the ablation studies, where we only use the chair category. In all experiments, we only take the train split for training and leave out the test split for evaluation. For metrics, we use the Chamfer L2 distance and F-Score with the exact settings as in [Genova_2020_CVPR]. Since the Chamfer distance measures the average error over all points, while the F-Score measures the ratio of good predictions, these two metrics do not always agree with each other: a better F-Score with a higher Chamfer distance usually indicates a few outliers resulting in significant error.
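The disagreement between the two metrics can be reproduced on a toy example. The definitions below are common textbook forms and only a sketch: the exact normalization and F-Score threshold of [Genova_2020_CVPR] are not restated here, so the symmetric squared-distance Chamfer and the threshold `tau = 0.01` are assumptions, as are the function names.

```python
import math


def nearest_distances(source, target):
    """Distance from each source point to its nearest target point (brute force)."""
    return [min(math.dist(p, q) for q in target) for p in source]


def chamfer_l2(pred, gt):
    """Symmetric Chamfer distance using squared nearest-neighbor distances."""
    forward = nearest_distances(pred, gt)
    backward = nearest_distances(gt, pred)
    return (sum(d * d for d in forward) / len(pred) +
            sum(d * d for d in backward) / len(gt))


def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(d <= tau for d in nearest_distances(pred, gt)) / len(pred)
    recall = sum(d <= tau for d in nearest_distances(gt, pred)) / len(gt)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


gt = [(i * 0.1, 0.0, 0.0) for i in range(10)]
slightly_off = [(x, 0.02, 0.0) for x, _, _ in gt]   # every point is a little wrong
one_outlier = gt[:9] + [(gt[9][0], 1.0, 0.0)]       # nine exact points, one far away

for name, pred in (("slightly off", slightly_off), ("one outlier", one_outlier)):
    print(name, chamfer_l2(pred, gt), f_score(pred, gt, tau=0.01))
```

The prediction with a single outlier has the higher F-Score but also the higher Chamfer distance, because the squared error of the outlier dominates the average.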
4.2 Ablation Study
We conduct our ablation studies on the chair category of ShapeNet, since it contains a large number of instances as well as significant intra-class shape variance. The models are all trained under the encoder-decoder scheme and use decoder-only latent optimization during inference.
Global/local/hierarchical
We compare MDIF with a global baseline and a local baseline to emphasize the impact of MDIF's hierarchical model. The global baseline only has level 0, whilst the local baseline only has level 4 (a latent grid). In Table 1, we compare with the baselines in terms of autoencoding and shape completion from a depth image. For autoencoding, the local baseline clearly outperforms the global one, since it has larger capacity and the capability to capture details. On the flip side, for shape completion, the global baseline has better accuracy because the local baseline behaves randomly on the unobserved part, as visualized in column 3 of Figure 5. MDIF, however, incorporates the benefits of both global and local levels and produces superior results in both tasks.
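| Method | Autoencoding: Chamfer (↓) | Autoencoding: F-Score (↑) | Shape Completion: Chamfer (↓) | Shape Completion: F-Score (↑) |
| --- | --- | --- | --- | --- |
| Ours | 0.009 | 99.5 | 1.34 | 66.5 |
| Global baseline | 0.228 | 88.7 | 1.56 | 63.7 |
| Local baseline | 0.012 | 99.2 | 5.47 | 48.3 |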
Network components
In Table 2, we incrementally compare the impact of four network components during decoderonly latent optimization.
Global consistency loss (Equation 4), which is designed for shape completion, yields marginal improvements in the overall completion numbers. However, column 3 of Figure 6 shows that it is still important for clean reconstruction in unobserved regions.
We also compare decoding into an SDF against decoding into a residual in Equation 2. Since predicting a residual forces lower levels to focus on the addition of fine detail, it is a stronger constraint and improves both autoencoding and shape completion.
Latent grid dropout is another component that is tailored to shape completion. Without it, the Chamfer error increases drastically (see Table 2). It also slightly improves decoder-only autoencoding. We hypothesize that this is because dropout improves the generalization of the decoders at levels 1 to 4 to test data and reduces the ambiguity between levels.
Finally, global connection passes the global shape prior to the other levels. Without it, the completion results are almost unconstrained on the unobserved part. It also helps autoencoding, since without it the network is asked to add more detail without knowing what has been predicted by the previous levels.
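| Method | Autoencoding: Chamfer (↓) | Autoencoding: F-Score (↑) | Shape Completion: Chamfer (↓) | Shape Completion: F-Score (↑) |
| --- | --- | --- | --- | --- |
| Full pipeline | 0.009 | 99.5 | 1.34 | 66.5 |
| No consistency loss | - | - | 1.43 | 64.7 |
| No residual | 0.025 | 98.2 | 3.00 | 53.0 |
| No dropout | 0.026 | 97.9 | 8.38 | 43.0 |
| No global connection | 0.086 | 93.9 | 19.9 | 39.6 |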
4.3 Auto-Encoding 3D Shapes
Accuracy on test split
We first evaluate the autoencoding accuracy under encoder-decoder inference for the test shapes in 3D-R2N2. We compare our approach with state-of-the-art DIF methods including OccNet (“Occ.”) [mescheder2019occupancy], SIF [genova2019learning], LDIF [Genova_2020_CVPR] and IF-Net (“IF.”) [chibane2020implicit]. The results for OccNet, SIF and LDIF were kindly provided by the authors of [Genova_2020_CVPR]. IF-Net originally uses high-resolution latent grids which altogether are over 20 times larger than the input grid in the number of parameters. This would make the encoded latent grids meaningless for the autoencoding task. Therefore, in this experiment, we constrain IF-Net to only use latent grids up to the same resolution as our approach, with the same total number of parameters in the latent grids as our approach. Table 3 (middle columns) shows the average metrics across the 13 categories. Our method achieves a slightly higher F-Score and a much lower Chamfer error, which means it works better overall and on hard examples too. As visualized in Figure 7, our method preserves details well and represents thin structures much better than the competing methods (see the last row).
Next, we evaluate the performance under decoder-only latent optimization. We compare with OccNet (“Occ.”) [mescheder2019occupancy], IM-Net (“IM.”) [chen2019learning] and a local baseline (resembling [jiang2020local, chabra2020deep]), as shown in Table 3 (right columns). Our method also performs best under this inference mode and improves over encoder-decoder inference by a large margin. The last column of Figure 7 shows qualitative results.
Generalizability
In this experiment, we study the generalizability to shapes vastly different from the training data. We test the trained models from the last experiment, without fine-tuning, on 10 ShapeNet categories that are unseen during training. In Table 4, we compare the performance under both inference modes, and our method outperforms the other methods in each mode. While global methods generalize poorly to unseen categories, our method performs as well as on seen categories. Qualitative results are shown in Figure 8.
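| Metric | Occ. | SIF | LDIF | IF. | Ours | Occ.* | IM.* | Local* | Ours* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chamfer | 0.49 | 1.18 | 0.4 | 0.39 | 0.19 | 0.43 | 0.46 | 0.14 | 0.10 |
| F-Score | 81.9 | 59 | 92.2 | 92.9 | 93.0 | 81.4 | 86.7 | 96.9 | 97.0 |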
Progressive refinement
One unique property of MDIF is the capability to decode shapes at different levels of detail. This enables progressive refinement applications in graphics, where 3D data are encoded into different levels of detail and progressively rendered. Since MDIF has a multi-level architecture, this can be easily achieved by only decoding the shape up to a certain level. Figure 9 shows the distortion against the accumulated latent code size in bytes of each level, i.e., the latent space capacity. MDIF consistently improves with each level added. At a similar number of bytes, MDIF still outperforms SIF, LDIF and IF-Net.
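| Metric | Occ. | SIF | LDIF | IF. | Ours | Occ.* | IM.* | Local* | Ours* |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chamfer | 0.85 | 1.48 | 0.53 | 0.40 | 0.17 | 0.62 | 0.47 | 0.063 | 0.054 |
| F-Score | 66.6 | 43.0 | 84.4 | 92.4 | 92.8 | 71.1 | 80.5 | 97.5 | 97.5 |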
4.4 Point Cloud Completion
In this application, we take a voxelized point cloud instead of an SDF grid as input. We follow the same steps as IF-Net [chibane2020implicit] to produce such input: first sample 300 points from the object surface and then voxelize these points into a grid. We compare our method with IF-Net, where both methods use encoder-decoder inference. As indicated in Table 5 (middle 2 columns), our method has a higher F-Score and a much lower Chamfer error. This reveals that our method is more accurate and stable in prediction. Figure 10 (top row) shows results for one example. Our method preserves the cavity in the legs while IF-Net incorrectly fills part of the cavity.
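The input preparation for this task (sample surface points, then voxelize them) can be sketched as follows. This is an illustration only: a sphere stands in for the object surface, and the grid resolution, the bounds of the volume and the function names are hypothetical choices rather than the settings used in the experiments.

```python
import math
import random


def sample_sphere_surface(count, rng, radius=0.4):
    """Uniformly sample points on a sphere, used here as a stand-in object surface."""
    points = []
    for _ in range(count):
        direction = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in direction))
        points.append(tuple(radius * c / norm for c in direction))
    return points


def voxelize(points, resolution, lower=-0.5, upper=0.5):
    """Binary occupancy grid: a voxel is 1 if at least one point falls inside it."""
    grid = [[[0] * resolution for _ in range(resolution)] for _ in range(resolution)]
    for point in points:
        if any(c < lower or c > upper for c in point):
            continue  # ignore points outside the volume
        # Map each coordinate to a voxel index; the upper face goes to the last voxel.
        i, j, k = (min(int((c - lower) / (upper - lower) * resolution), resolution - 1)
                   for c in point)
        grid[i][j][k] = 1
    return grid


rng = random.Random(0)
surface_points = sample_sphere_surface(300, rng)
occupancy = voxelize(surface_points, resolution=32)
occupied = sum(v for plane in occupancy for row in plane for v in row)
print(f"{occupied} occupied voxels out of {32 ** 3}")
```

With only 300 points the resulting grid is very sparse, which is what makes the task a completion problem rather than a plain reconstruction.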
4.5 Voxel Super-Resolution
In this task, we input an occupancy grid and ask the network to predict the underlying continuous implicit field. We compare our method with IF-Net, with both under encoder-decoder inference. Table 5 (right 2 columns) shows the quantitative results. Similar to the case of point cloud completion, our method outperforms IF-Net by a large margin in Chamfer error. In Figure 10 (bottom row), we show qualitative results on one example. Our method is reasonably accurate in both global shape and local detail, while IF-Net produces artifacts near the object boundary.
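| Method | Point Cloud Completion: Chamfer (↓) | Point Cloud Completion: F-Score (↑) | Voxel Super-Resolution: Chamfer (↓) | Voxel Super-Resolution: F-Score (↑) |
| --- | --- | --- | --- | --- |
| IF-Net | 1.61 | 85.0 | 1.82 | 65.4 |
| Ours | 0.39 | 86.1 | 0.96 | 66.9 |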
4.6 Shape Completion from Depth Image
Our final experiment investigates shape completion from a depth image. We compare MDIF with IM-Net, OccNet and LDIF. OccNet and LDIF use encoder-decoder inference, while IM-Net and MDIF use decoder-only latent optimization. Note that for IM-Net and MDIF, we directly use the models trained in the autoencoding task (Section 4.3) without retraining or fine-tuning. This is a benefit of decoder-only latent optimization. Figure 11 reports the percentages of surface points whose distance to the ground truth is smaller than different thresholds. MDIF has a good proportion of points with low error and consistently outperforms IM-Net at all thresholds, reflecting its advantage in preserving details in observed regions. However, MDIF has higher error in unobserved regions than the methods under encoder-decoder inference (OccNet, LDIF). This is illustrated in Figure 12, where the errors of our results are mostly on the occluded side. For example, in row 4, where the table top is completely unobserved, our estimate is thicker than the ground truth, hence resulting in higher error. Despite this, the predicted shape still looks plausible. This and other examples suggest that the Chamfer distance and F-Score are suited for assessing the observed parts, but not the unobserved parts, where many plausible solutions exist. Therefore, to evaluate plausibility, we further conduct a user study that votes between MDIF and LDIF results on 32 pairs of examples (please refer to the supplementary material for details). The results show that 54.2% of the participants chose MDIF results as more plausible, whilst 31.9% thought LDIF results were better. In addition, 13.9% could not decide between MDIF and LDIF. Moreover, when compared with the quantitative metrics, 68.1% disagree with the Chamfer distance, and 51.4% disagree with the F-Score.
5 Conclusion
In this paper, we present MDIF, a multiresolution deep implicit function to progressively represent and reconstruct geometries. MDIF is trained end-to-end in an encoder-decoder fashion and supports both encoder-decoder inference and decoder-only latent optimization. We demonstrate that MDIF outperforms state-of-the-art methods on tasks including autoencoding 3D shapes, point cloud completion and voxel super-resolution. We further show that MDIF enables detailed decoder-only shape completion from a depth image: the details in observed regions are accurately preserved while the unobserved regions are completed with plausible shapes. In the future, we would like to explore transferring details from observable parts to occluded parts in completion tasks. We also plan to apply MDIF to more applications such as shape manipulation.
Multiresolution Deep Implicit Functions for 3D Shape Representation
(Supplementary Material)
6 Supplementary Material
6.1 Implementation Details
Detailed network architecture
Figure 13 shows the detailed architecture of our network. On the left, Figure 13 (a) is the encoder network that is used in training and encoder-decoder inference. It takes a 3D grid as input and outputs the latent grid of each level. For the voxel super-resolution experiment (Section 4.5), since the input has a lower resolution, we accordingly remove the first 4 convolution layers along with their activation and normalization layers.
On the right, Figure 13 (b) is the pre-decoder network. With latent grids as input, it includes global connection and trilinear interpolation. The global connection consists of 3D transposed convolution layers that propagate global context from level 0 to the other levels. Trilinear interpolation is utilized to obtain the latent codes at each query point, which are then fed into the decoders at each level. For level 0, the 3D position of the query point is also fed into the decoder. For the decoder modules, we use the same IM-Net [chen2019learning] architecture for each level, with the only difference being the input dimension.
Hyperparameters
We implement our method in TensorFlow. We train our network end-to-end with a fixed batch size and use Adam as the optimizer. The latent grid dropout rate is set differently for the models that need to carry out decoder-only latent optimization and for the models that only run encoder-decoder inference (the models for point cloud completion and voxel super-resolution).
During decoder-only latent optimization, we optimize over the latent codes and keep the other parameters fixed. We use Adam with the same configuration as in training, but at a higher learning rate to accelerate convergence. In all our experiments, we run latent optimization for a fixed number of steps. At each step during autoencoding, we randomly draw a set of points. At each step during shape completion, we randomly draw camera-observable points, along with occluded points for the global consistency loss.
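Decoder-only latent optimization can be sketched with a toy decoder whose weights are frozen, so that the latent code is the only quantity being updated. This is an illustration, not the TensorFlow implementation: the two-parameter decoder, the observations, the learning rate and the step count are hypothetical, and the Adam constants are the commonly used defaults rather than values taken from the experiments.

```python
import math


def frozen_decoder(latent, x):
    """Toy decoder with fixed weights: only the latent code can change its output."""
    return latent[0] + latent[1] * x


def l1_loss_and_grad(latent, observations):
    """Mean L1 error on the observed points and a subgradient w.r.t. the latent code."""
    loss = 0.0
    grad = [0.0, 0.0]
    for x, target in observations:
        diff = frozen_decoder(latent, x) - target
        loss += abs(diff)
        sign = (diff > 0) - (diff < 0)
        grad[0] += sign
        grad[1] += sign * x
    count = len(observations)
    return loss / count, [g / count for g in grad]


def optimize_latent(observations, steps, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam updates applied to the latent code only, starting from zeros."""
    latent = [0.0, 0.0]
    m = [0.0, 0.0]
    v = [0.0, 0.0]
    for t in range(1, steps + 1):
        _, grad = l1_loss_and_grad(latent, observations)
        for i in range(len(latent)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * grad[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * grad[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1.0 - beta2 ** t)  # bias-corrected second moment
            latent[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return latent


observations = [(x, 0.3 - 0.7 * x) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
initial_loss, _ = l1_loss_and_grad([0.0, 0.0], observations)
latent = optimize_latent(observations, steps=500, lr=0.01)
final_loss, _ = l1_loss_and_grad(latent, observations)
print(initial_loss, final_loss, latent)
```

In the full model the same loop would run over all latent grids at once, with the decoders and the global connection layers held fixed.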
Experiment details
For the training data, we use the watertight ShapeNet meshes from OccNet [mescheder2019occupancy] and normalize them into a bounding box with side length . We also truncate SDF values at .
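A minimal sketch of these two preprocessing steps follows. The side length and truncation threshold are left as parameters because their values are not reproduced here; centering the box at the origin and scaling uniformly by the longest side are assumptions of this sketch, as are the function names.

```python
import numpy as np

def normalize_to_cube(vertices, side_length):
    """Center vertices at the origin and scale them uniformly so that the
    longest bounding-box side equals side_length (assumed convention)."""
    v = np.asarray(vertices, dtype=np.float64)
    lo, hi = v.min(axis=0), v.max(axis=0)
    extent = (hi - lo).max()
    if extent == 0.0:
        raise ValueError("degenerate mesh: all vertices coincide")
    return (v - (lo + hi) / 2.0) * (side_length / extent)

def truncate_sdf(sdf, threshold):
    """Clamp signed distance values to [-threshold, threshold]."""
    return np.clip(np.asarray(sdf, dtype=np.float64), -threshold, threshold)
```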
For the autoencoding experiment (Section 4.3), as mentioned in the paper, IF-Net [chibane2020implicit] originally uses high-resolution latent grids, which contain more parameters than the input grid. We therefore constrain IF-Net to only use latent grids with dimensions: . The resulting total number of parameters in the latent grids is the same as MDIF .
For the point cloud completion (Section 4.4) and voxel super-resolution (Section 4.5) experiments, unlike autoencoding, the goal is to infer missing data rather than learn a compact latent space. Therefore, in these experiments, we use the original implementation of IF-Net, which exploits high-resolution latent grids. Similarly, for MDIF in these experiments, we additionally interpolate features at query points from high-resolution feature grids and feed them into the decoders.
6.2 Encoder-Decoder vs. Decoder-Only Inference
In Figure 14, we show qualitative autoencoding results of MDIF using encoder-decoder inference and decoder-only latent optimization. Compared with encoder-decoder inference, decoder-only latent optimization already produces more accurate reconstructions with only optimization steps. More steps further lower the error.
6.3 Illustration of Ablation Baselines
6.4 Comparison of Dropout and Consistency Loss
To further analyze the respective contributions of latent grid dropout and the global consistency loss to shape completion, we carry out a leave-one-out ablation on dropout, where the only difference from the full pipeline is the removal of latent grid dropout. As with the baselines in Table 2, this ablation is conducted on the chair category of ShapeNet. Table 6 shows that removing dropout leads to a slightly larger drop in quantitative performance than removing the consistency loss: the Chamfer distance is the same (1.43), but the F-Score is lower (63.9 vs. 64.7). Meanwhile, dropout affects the qualitative results differently from the consistency loss. As shown in Figure 16, when dropout is applied (the third and fourth columns from the left), the model is able to synthesize plausible details in the unobserved regions that are close to the observed part (see insets at the bottom). In contrast, without dropout (the rightmost column), the model tends to produce noisy residuals (red inset) or to add no detail due to the consistency loss (blue inset).
Table 6: Ablation for shape completion on the ShapeNet chair category. Lower Chamfer and higher F-Score are better.
Method  Chamfer  F-Score
Full pipeline  1.34  66.5
No consistency loss  1.43  64.7
No dropout (leave-one-out)  1.43  63.9
6.5 Failure Cases
Figure 17 shows failure cases of our method under decoder-only latent optimization, for autoencoding and for shape completion from a depth image. Objects with very complex geometry or thin structures remain challenging for our approach. For autoencoding, such problems could be alleviated by using more levels and higher-resolution latent grids. For shape completion, when an unobserved part (e.g., the lamp body in row 3, column 3) is completely missing in the coarse prediction from level 0, our approach is unable to synthesize such delicate structures.
6.6 Additional Ablation for Number of Levels
In the paper, we use 5 levels, as this offers a good balance between accuracy and efficiency. As previously indicated, however, MDIF can flexibly use other numbers of levels. In Figure 9, we showed progressive refinement rate-distortion for levels 1-5. Here, in Table 7, we further show the autoencoding accuracy under encoder-decoder inference with up to 8 levels.
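Table 7: Autoencoding accuracy under encoder-decoder inference with more levels. "Ours" is the 5-level model used in the paper. The first four result columns are consistent with the Table 8 means and the last four with the Table 9 means.
Metric  Ours  Ours-6  Ours-7  Ours-8  Ours  Ours-6  Ours-7  Ours-8
Chamfer  0.19  0.13  0.13  0.12  0.17  0.14  0.13  0.13
F-Score  93.0  96.5  96.7  97.5  92.8  96.3  97.1  97.3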
6.7 Interpolation and Retrieval in Latent Space
6.8 Additional Qualitative Results
6.9 Detailed Quantitative Results
Table 8 and Table 9 show per-category quantitative results (Chamfer L2 distance and F-Score) on autoencoding. For encoder-decoder inference, we compare MDIF with OccNet (“Occ.”) [mescheder2019occupancy], SIF [genova2019learning], LDIF [Genova_2020_CVPR] and IF-Net (“IF.”) [chibane2020implicit]. For decoder-only latent optimization, we compare MDIF with OccNet (“Occ.”) [mescheder2019occupancy], IM-Net (“IM.”) [chen2019learning] and a local baseline (resembling [jiang2020local, chabra2020deep]). Table 10 shows per-category quantitative results (Chamfer L2 distance / F-Score) on point cloud completion and voxel super-resolution, where we compare MDIF with IF-Net [chibane2020implicit] under encoder-decoder inference.
In these experiments, MDIF has lower Chamfer errors for most categories and a higher overall F-Score.
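For reference, the two metrics can be sketched on point sets sampled from the predicted and ground-truth surfaces. Several conventions exist for both; this sketch assumes Chamfer L2 is the sum of the two mean squared nearest-neighbor distances and F-Score is the harmonic mean of precision and recall at a distance threshold `tau`. The function names, the threshold, and any scaling or sampling density are illustrative assumptions, not the exact evaluation protocol behind the tables.

```python
import numpy as np

def nearest_sq_dists(src, dst):
    """Squared distance from each point in src to its nearest point in dst."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    diff = src[:, None, :] - dst[None, :, :]
    return (diff ** 2).sum(axis=-1).min(axis=1)

def chamfer_l2(pred, gt):
    """Symmetric Chamfer L2: mean squared nearest distance, both directions."""
    return float(nearest_sq_dists(pred, gt).mean()
                 + nearest_sq_dists(gt, pred).mean())

def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = float((np.sqrt(nearest_sq_dists(pred, gt)) < tau).mean())
    recall = float((np.sqrt(nearest_sq_dists(gt, pred)) < tau).mean())
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

The brute-force pairwise distance matrix is adequate for a few thousand points; larger point sets would call for a spatial index such as a k-d tree.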
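Table 8: Per-category autoencoding results. The first nine method columns report Chamfer (lower is better) and the last nine report F-Score in % (higher is better). Methods marked * use decoder-only latent optimization; unmarked methods use encoder-decoder inference.
Category  Occ.  SIF  LDIF  IF.  Ours  Occ.*  IM.*  Local*  Ours*  Occ.  SIF  LDIF  IF.  Ours  Occ.*  IM.*  Local*  Ours*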
airplane  0.16  0.44  0.10  0.52  0.05  0.25  0.13  0.044  0.028  87.8  71.4  96.9  94.4  97.2  89.8  91.7  98.5  98.6 
bench  0.24  0.82  0.17  0.31  0.08  0.34  0.22  0.121  0.052  87.5  58.4  94.8  92.6  92.4  85.2  88.6  96.0  96.0 
cabinet  0.41  1.10  0.33  0.11  0.29  0.32  0.23  0.063  0.051  86.0  59.3  92.0  93.0  91.5  83.2  89.2  96.6  96.6 
car  0.61  1.08  0.28  0.30  0.29  0.58  0.26  0.090  0.088  77.5  56.6  87.2  87.4  86.6  69.3  82.7  93.1  93.0 
chair  0.44  1.54  0.34  0.10  0.10  0.38  0.43  0.042  0.035  77.2  42.4  90.9  94.5  93.8  80.2  82.5  97.7  97.6 
display  0.34  0.97  0.28  0.07  0.08  0.35  0.20  0.043  0.019  82.1  56.3  94.8  96.1  95.1  82.3  89.4  98.6  98.7 
lamp  1.67  3.42  1.80  1.17  0.90  1.47  2.76  0.795  0.795  62.7  35.0  84.0  89.1  87.1  62.9  73.8  93.5  93.5 
rifle  0.19  0.42  0.09  1.07  0.05  0.39  0.55  0.060  0.057  86.2  70.0  97.3  93.5  96.2  86.1  81.1  96.9  96.9 
sofa  0.30  0.80  0.35  0.13  0.11  0.31  0.16  0.208  0.037  85.9  55.2  92.8  92.5  93.5  85.2  89.3  98.3  98.4 
speaker  1.01  1.99  0.68  0.14  0.27  0.38  0.17  0.065  0.044  74.7  47.4  84.3  90.2  90.1  78.1  89.4  97.3  97.3 
table  0.44  1.57  0.56  0.17  0.13  0.31  0.30  0.107  0.046  84.9  55.7  92.4  93.4  93.7  87.2  88.6  96.5  97.6 
telephone  0.13  0.39  0.08  0.08  0.06  0.19  0.11  0.043  0.010  94.8  81.8  98.1  98.8  98.3  88.9  96.5  99.6  99.6 
watercraft  0.41  0.78  0.20  0.90  0.10  0.35  0.39  0.075  0.067  77.3  54.2  93.2  92.7  93.7  80.3  84.7  97.4  97.2 
mean  0.49  1.18  0.40  0.39  0.19  0.43  0.46  0.135  0.102  81.9  59.0  92.2  92.9  93.0  81.4  86.7  96.9  97.0 
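Table 9: Per-category autoencoding results. The first nine method columns report Chamfer (lower is better) and the last nine report F-Score in % (higher is better). Methods marked * use decoder-only latent optimization; unmarked methods use encoder-decoder inference.
Category  Occ.  SIF  LDIF  IF.  Ours  Occ.*  IM.*  Local*  Ours*  Occ.  SIF  LDIF  IF.  Ours  Occ.*  IM.*  Local*  Ours*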
bed  1.30  2.24  0.68  0.10  0.16  0.87  0.43  0.052  0.045  59.3  32.0  81.4  94.7  90.9  67.1  77.8  96.8  97.0 
birdhouse  1.25  1.92  0.75  0.31  0.11  0.72  0.49  0.036  0.036  54.2  33.8  76.2  90.4  92.1  61.3  74.3  97.6  97.7 
bookshelf  0.83  1.21  0.36  0.30  0.20  0.99  0.60  0.103  0.091  66.5  43.5  86.1  93.5  88.3  59.0  73.0  95.1  94.2 
camera  1.17  1.91  0.83  0.27  0.16  0.45  0.58  0.047  0.050  57.3  37.4  77.7  95.0  94.0  70.2  75.9  98.6  98.6 
file  0.41  0.71  0.29  0.35  0.30  0.38  0.25  0.054  0.041  86.0  65.8  93.0  95.7  94.4  84.3  90.0  97.6  97.7 
mailbox  0.60  1.46  0.40  1.18  0.20  0.51  0.74  0.102  0.102  67.8  38.1  87.6  81.4  93.5  80.0  85.2  98.5  98.5 
piano  1.07  1.81  0.78  0.34  0.08  0.91  0.71  0.034  0.030  61.4  39.8  82.2  96.7  94.8  62.2  77.3  98.3  98.3 
printer  0.85  1.44  0.43  0.15  0.15  0.48  0.31  0.035  0.035  66.2  40.1  84.6  94.9  94.3  74.9  82.3  98.2  98.3 
stove  0.49  1.04  0.30  0.55  0.22  0.37  0.25  0.107  0.040  77.3  52.9  89.2  91.3  93.5  78.6  87.4  97.7  97.7 
tower  0.50  1.05  0.47  0.44  0.14  0.53  0.30  0.060  0.070  70.2  45.9  85.7  90.3  91.8  73.9  81.7  96.9  96.6 
mean  0.85  1.48  0.53  0.40  0.17  0.62  0.47  0.063  0.054  66.6  43.0  84.4  92.4  92.8  71.1  80.5  97.5  97.5 
6.10 Shape Completion User Study
First, in Table 11, we compare quantitative results of MDIF and competing methods on shape completion from a depth image. In this comparison, we also include an MDIF model (“Ours”) that uses encoder-decoder inference. This model has the same architecture as the MDIF model in the point cloud completion experiment, and is retrained from scratch to take voxelized depth points (depth points voxelized into a grid) as input. In terms of metrics, we additionally use Asymmetric Chamfer to measure the reconstruction accuracy in observed regions. It is computed as the one-directional Chamfer L2 distance from the depth points to the reconstruction.
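The one-directional nature of Asymmetric Chamfer can be made concrete with a short NumPy sketch. It assumes the reconstruction is available as a set of points sampled from the reconstructed surface and that the distance is averaged over the depth points; the function name and the chunked evaluation are illustrative choices.

```python
import numpy as np

def asymmetric_chamfer_l2(depth_points, recon_points, chunk=1024):
    """Mean squared distance from each depth point to its nearest
    reconstruction point. Reconstruction points that are far from every
    depth point (e.g. completed, unobserved geometry) are not penalized."""
    src = np.asarray(depth_points, dtype=np.float64)
    dst = np.asarray(recon_points, dtype=np.float64)
    total = 0.0
    # Process the depth points in chunks to bound the size of the
    # pairwise distance matrix.
    for start in range(0, len(src), chunk):
        block = src[start:start + chunk]
        sq = ((block[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
        total += sq.min(axis=1).sum()
    return total / len(src)
```

Because only the direction from observed points to the reconstruction is measured, a method can score well here while still differing from the ground truth in unobserved regions.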
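Table 10: Per-category results (Chamfer / F-Score) on point cloud completion and voxel super-resolution under encoder-decoder inference.
Category  IF-Net (point cloud completion)  Ours (point cloud completion)  IF-Net (voxel super-resolution)  Ours (voxel super-resolution)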
airplane  2.37 / 89.7  0.08 / 93.3  1.51 / 78.3  1.02 / 80.7 
bench  1.22 / 84.5  0.18 / 86.0  1.88 / 59.1  1.09 / 59.5 
cabinet  1.65 / 87.1  0.84 / 83.8  0.65 / 60.6  0.60 / 60.8 
car  1.96 / 79.4  0.19 / 80.9  0.40 / 75.8  0.30 / 75.8 
chair  2.02 / 81.3  0.33 / 80.5  1.02 / 62.6  0.82 / 63.4 
display  1.09 / 88.5  0.30 / 88.6  1.04 / 62.0  0.74 / 62.1 
lamp  2.03 / 76.3  1.76 / 78.0  8.14 / 58.3  3.97 / 60.9 
rifle  2.19 / 85.3  0.05 / 95.9  2.09 / 78.0  0.34 / 81.3 
sofa  0.71 / 88.2  0.18 / 86.8  0.68 / 56.2  0.48 / 57.5 
speaker  1.52 / 78.4  0.65 / 75.9  0.73 / 56.1  0.65 / 58.0 
table  1.70 / 84.7  0.25 / 85.1  2.72 / 53.5  1.87 / 55.7 
telephone  0.98 / 95.7  0.06 / 96.5  0.77 / 77.9  0.67 / 78.2 
watercraft  1.51 / 87.2  0.14 / 88.4  2.05 / 71.7  0.69 / 73.6 
mean  1.61 / 85.0  0.39 / 86.1  1.82 / 65.4  1.02 / 66.7 
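Table 11: Per-category results on shape completion from a depth image. The three groups of four method columns report, in order, Chamfer (lower is better), F-Score in % (higher is better), and Asymmetric Chamfer (lower is better). Ours* uses decoder-only latent optimization; the other methods use encoder-decoder inference.
Category  OccNet  LDIF  Ours  Ours*  OccNet  LDIF  Ours  Ours*  OccNet  LDIF  Ours  Ours*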
airplane  0.47  0.17  0.26  0.46  70.1  89.2  90.1  73.2  0.246  0.054  0.022  0.007 
bench  0.70  0.39  0.45  0.96  64.9  81.9  82.5  56.9  0.281  0.108  0.049  0.012 
cabinet  1.13  0.77  0.73  1.35  70.1  77.9  73.8  60.4  0.109  0.052  0.070  0.009 
car  0.99  0.51  0.41  1.04  61.6  72.4  74.3  64.2  0.138  0.054  0.043  0.011 
chair  2.34  1.02  0.91  1.42  50.2  69.6  72.5  67.0  0.785  0.270  0.053  0.012 
display  0.95  0.62  0.56  1.69  62.8  80.0  76.7  55.4  0.312  0.217  0.056  0.007 
lamp  9.91  2.15  1.26  3.26  44.1  66.4  70.5  54.6  10.80  1.429  0.160  0.110 
rifle  0.49  0.14  0.31  0.62  66.4  92.3  91.5  75.9  0.246  0.048  0.022  0.005 
sofa  1.08  0.83  0.70  1.19  61.2  71.7  71.4  62.1  0.155  0.074  0.059  0.007 
speaker  3.50  1.48  1.45  3.73  52.4  67.3  64.6  49.8  0.280  0.115  0.077  0.020 
table  2.49  1.14  0.94  1.11  66.7  78.0  77.8  61.5  0.784  0.339  0.065  0.015 
telephone  0.35  0.19  0.21  1.05  86.1  92.0  89.4  55.9  0.089  0.046  0.046  0.002 
watercraft  1.15  0.50  0.45  0.69  54.5  77.5  78.3  67.2  0.684  0.148  0.033  0.020 
mean  1.97  0.76  0.67  1.43  62.4  78.2  78.0  61.9  1.147  0.227  0.058  0.018 
Under encoder-decoder inference (“OccNet”, “LDIF”, “Ours”), MDIF is only slightly worse than LDIF in F-Score and performs best in the other two metrics. This shows that, with encoder-decoder inference, MDIF produces completion results that are about as close to the ground truth as those of LDIF. Meanwhile, the large margin in Asymmetric Chamfer over OccNet and LDIF demonstrates that MDIF is better at preserving details in observed regions, even under encoder-decoder inference. The MDIF model that uses decoder-only latent optimization (“Ours*”) performs worse in Chamfer distance and F-Score, but reduces the Asymmetric Chamfer error much further. This indicates that it performs much better on the observable parts and that its error mostly comes from the unobserved parts. As illustrated in the paper (Figure 12), the unobserved parts of its results, although different from the ground truth, still look plausible.
To support this point, we conducted a user study comparing subjective human judgments with F-Score. We recruited 88 participants who were at least 18 years old. None of the participants had prior knowledge of this project. Each participant was given 32 pairs of examples, one from MDIF (with decoder-only latent optimization) and one from LDIF [Genova_2020_CVPR]. The order of the examples was fully counterbalanced and randomized. Each example was shown in two different views: one observed (input view) and one unobserved. Participants were then asked to choose which example was the more plausible reconstruction given the input. If both examples looked similarly plausible, they were allowed to choose “cannot decide”.
The examples were chosen as follows. The results with the worst F-Score were filtered out, since human judgment and F-Score tend to agree on those cases. Examples with unmatched input views were then removed. Finally, we randomly picked 32 examples from the rest.
The results of the user study are summarized in Figure 22. In contrast to F-Score, of the participants chose in favor of MDIF results, whilst thought LDIF results were better. In addition, could not decide between MDIF and LDIF. Moreover, when compared with the quantitative metrics, disagree with Chamfer L2 distance, and disagree with F-Score. All the 32 examples and itemized results are shown in Figure 23, Figure 24, Figure 25 and Figure 26.
The conclusion of this user study aligns with previous work [tatarchenko2019single], which argues that Chamfer distance is not suitable for evaluating completion tasks due to its sensitivity to outliers. Moreover, this study also shows that F-Score, although more robust, only tells us how different the reconstruction of the unobserved part is from the ground truth, not how plausible it is, which is what humans ultimately care about.