1 Introduction
3D scene reconstruction has been a longstanding problem in computer vision and graphics, and has recently seen a renewed flurry of developments, driven by successes in generative neural networks
[deepsdf, mescheder2019occupancy, dai2020sg, peng2020convolutional]. In particular, developing an effective geometric reconstruction is challenging due to the dimensionality of the problem and the need to simultaneously express local details and coherent, complex global structures. In recent years, various approaches have been developed for geometric reconstruction that encode the full generative process into a neural network. This can make it difficult to represent large-scale, complex scenes, as all levels of detail must be fully encoded as part of the generative network.
We thus propose to augment geometric reconstruction with a database which our method learns to leverage at inference time, and introduce a generative model that does not need to encode the entire training data as part of the network parameters. Instead, our model learns how to best transfer structures and details from retrieved scene database geometry.
We construct this database from cropped geometric chunks of 3D scenes taken from the train scene data. Each chunk represents clean, consistent, high-resolution geometry. We leverage these chunks as a basis for scene reconstruction.
To this end, we develop a neural 3D scene reconstruction approach to generate 3D scenes as volumetric distance fields. This approach consists of two main steps: a top-k nearest neighbor retrieval and combination for initial estimation, and a refinement stage to produce the higher-quality, final reconstruction. Specifically, to generate a 3D scene from an input condition (e.g., a noisy or sparse observation of a scene), we first learn to construct an initial estimate of the scene as a combination of cropped volumetric chunks from the database. By providing an initial estimate based on chunks of existing scene geometry, we can more easily encourage consistent, sharp structures already seen in the existing scene geometry. Since these initial scene crop estimates may not be entirely locally consistent with each other, we then refine this estimate to produce a final scene reconstruction. The scene refinement is based on patch-based attention, which encourages the selection of given scene chunk estimates where they suffice (maintaining their clean details) and synthesizes refined geometry otherwise.
By leveraging database retrieval in combination with a generative model, our approach does not need to encode the full train set for effective reconstruction, and facilitates generation of globally coherent, high-quality 3D scenes. We demonstrate our approach on the tasks of 3D super resolution and 3D surface reconstruction from sparse point samples on both synthetic and real-world 3D scene data, showing significant qualitative and quantitative improvement in comparison to state-of-the-art reconstruction approaches. Additionally, we show that our approach can also be applied to other generative representations, in particular, to improve implicit-based reconstruction.
In summary, our main contributions are:

A neural 3D reconstruction technique that leverages details present in a database of cropped scene chunks for improving reconstructed geometry.

A patch-wise attention-based refinement that robustly fuses together details from the retrieved scene chunks.
2 Related Works
Learned 3D Shape Reconstruction.
3D shape reconstruction is a longstanding problem in computer vision. We refer readers to Szeliski [szeliski2010computer]
for a more comprehensive review of the classic techniques. Recently, inspired by the progress of deep learning for images, many developments have been made in deep generative models for reconstructing 3D shapes, largely focusing on leveraging different geometric representations.
Early generative neural networks focused on voxel grids as a natural extension of pixels, with a regular structure well-suited for convolutions, but can struggle with cubic growth in dimension [maturana2015voxnet, wu20153d, choy20163d, dai2017shape]. Multi-resolution representations were proposed [hane2017hierarchical, tatarchenko2017octree] to address the cubic complexity with hierarchical data structures. Rather than operating on a regular grid, point-cloud-based approaches propose to generate points only on the geometric surface [fan2017point, yang2019pointflow], but do not encode structural connectivity. Mesh-based approaches have also been proposed to efficiently capture surface geometry while encoding connectivity, but tend to rely on strong topological assumptions such as a template mesh that is then deformed [wang2018pixel2mesh], or a small number of vertices for free-form generation [dai2019scan2mesh]. Implicit representations encoded directly by the neural network enable modeling of a continuous surface, typically as binary occupancies or signed distance fields [mescheder2019occupancy, deepsdf, chibane2020implicit]; such representations have seen notable success in modeling single objects but can struggle to directly scale to scenes.
Learned 3D Scene Reconstruction.
Compared to shape reconstruction, scene-level reconstruction is significantly more challenging due to the scale, variance, and complexity of geometry. Several approaches have been proposed to combine local implicit functions with a coarse volumetric basis [jiang2020local, deep_local_shapes, peng2020convolutional] to capture complex, large-scale scene reconstructions. SGNN [dai2020sg] leverages a single, sparse volumetric network for large-scale scene completion in a self-supervised fashion. These approaches rely on encoding the full generative process into network parameters, whereas we leverage a basis of existing scene geometry that does not need to be fully encoded but rather refined to transfer desired geometric characteristics from the valid scene geometry (e.g., clean structures, local details).
2D/3D Retrieval.
Our approach is related to 2D image retrieval and completion applications [datta2008image], where recent work [radenovic2018fine, xu2020texture] focuses on developing a CNN to automatically retrieve relevant patches from a large collection of unordered images. Note that memorization is also an active area of research in language models [khandelwal2019generalization].
For 3D retrieval, the pioneering work of Chen [chen2003visual] proposed a 3D shape retrieval system based on visual similarity. More recently, several works have been proposed to leverage 3D CAD model retrieval to represent objects in input images or 3D scans [li2015database, izadinia2017im2cad, avetisyan2019scan2cad, kuo2020mask2cad, izadinia2020licp], but are limited to the objects in the CAD dataset, while we use our retrieval as a basis for enabling more accurate reconstruction from learned selection and blending of retrieved scene geometry.
3 Method
We formulate the problem as a general 3D reconstruction conditioned on inputs which can be spatially correlated with the output scene. This can be instantiated into applications like 3D super-resolution from low-resolution observations and surface reconstruction from a sparse set of 3D point measurements. The proposed approach augments a generative model for synthesizing 3D scenes with external knowledge in the form of a database of existing 3D scene data. Specifically, a typical learning-based 3D reconstruction function is trained on pairs of input and ground-truth 3D data, with a loss that measures the distance between each prediction and its ground truth. At inference time, the learned function is applied to an unseen input without using any additional information.
In contrast, our approach maintains the training data to form a basis of an initial reconstruction estimate during inference. An overview of our approach is visualized in Fig. 1. We first learn to retrieve train data similar to the input condition in order to construct multiple initial reconstruction estimates. We then refine these estimates to produce the final reconstruction. This facilitates transfer of scene geometry characteristics such as detail and global structures from train data to produce more coherent and detailed output reconstructions.
We demonstrate our approach on the 3D reconstruction tasks of super resolution and surface reconstruction from points, learning to reconstruct a distance field representation of an output 3D scene from a low-resolution distance field and point cloud, respectively. For a set of train scenes, we consider cropped scene chunks as our additional knowledge database. We first learn to map spatially corresponding input chunks to these database chunks by constructing a shared embedding space between the two and retrieving the top-k nearest neighbors for a new input chunk. These nearest neighbors then form candidates for a scene reconstruction.
Based on the input condition and these candidates, which comprise distance fields for each chunk in the output scene, we then learn to refine the initial estimates to a final distance field scene reconstruction. The initial basis constructed by existing 3D scene data is composed of chunks of valid local and global structures (e.g., flat walls or floors, full structures as well as sharp details of objects), enabling our refinement to more easily maintain these characteristics in the output reconstruction. To encourage the transfer of desired global structures and fine details to the final reconstruction, we employ attention to help select the most meaningful parts of the initial estimate. This facilitates coherent reconstructions while maintaining local detail.
3.1 Estimating Reconstruction as a Composition of Existing Scene Data
We first aim to approximate a scene reconstruction as a composition of cropped chunks of existing scene data from the database. By recomposing cropped chunks of scene data, we can express diverse scene content while leveraging a basis of existing scene data. To construct this approximation, we learn to retrieve candidate chunks from the database, providing a variety of candidate reconstruction estimates that can be used to inform the final reconstruction refinement. These multiple candidates provide alternatives to the following refinement stage, as we cannot expect to have exactly corresponding chunk geometry at test time. We show an overview of our retrieval-based reconstruction estimation in Fig. 2.
To find the best candidate chunks from the database for the corresponding part of the input, we learn to embed chunks of input observations and target scene chunks into a shared latent space, where top-k nearest neighbor retrieval is then performed. We thus embed each input chunk into a 64-dimensional latent space by passing it through a stack of convolutional layers followed by a fully connected layer, resulting in an input feature. We similarly embed the corresponding target chunk into a 64-dimensional latent space by also passing it through a stack of convolutional layers with a fully connected layer at the end, resulting in a target feature. Inspired by contrastive learning [hadsell2006dimensionality], we construct the shared space using a normalized, temperature-scaled cross entropy loss (NT-Xent) [chen2020simple]:
(1) 
where the loss depends on the number of samples in the minibatch, an indicator function over pairs of sample indices, and a temperature parameter. This encourages similarly structured target scene chunks to be retrieved for an input observation.
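To illustrate the contrastive objective, the following minimal sketch computes a one-directional NT-Xent loss between input and target chunk embeddings in plain Python. It is a simplified stand-in, not necessarily the exact loss used here: the function names, the use of cosine similarity, the default temperature, and the choice to draw negatives only from the other target chunks in the minibatch are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(input_feats, target_feats, temperature=0.5):
    """Mean NT-Xent loss where input i should match target i (assumed pairing).

    All other targets in the minibatch act as negatives. The temperature
    default is a placeholder, not a value taken from the paper.
    """
    n = len(input_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(input_feats[i], target_feats[j]) / temperature
                  for j in range(n)]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits))
        total += log_denominator - logits[i]
    return total / n


# Correctly paired embeddings give a lower loss than mismatched ones.
inputs = [[1.0, 0.0], [0.0, 1.0]]
print(nt_xent(inputs, [[1.0, 0.0], [0.0, 1.0]]))
print(nt_xent(inputs, [[0.0, 1.0], [1.0, 0.0]]))
```

A lower temperature sharpens the distribution over candidates, so mismatched pairs are penalized more strongly.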
A minibatch may contain a target chunk that is similar in geometry to the target chunk of a different sample. To avoid penalizing such pairs heavily, we make the temperature scaling a function of the IoU between the target chunks,
(2) 
which is parameterized by a constant shift, a constant bias, and a sigmoid.
Retrieval Database.
Once the networks have been trained, the target chunks are all embedded into the latent space to support chunk retrieval. A new input observation is then split into spatial chunks, and for each of them the k nearest neighbors are found among the embedded database chunks under a distance metric. This provides k candidate reconstruction estimates.
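The lookup described above can be sketched as a brute-force top-k search over stored embeddings. This is only an illustration under stated assumptions: the Euclidean distance, the dictionary-based database, the chunk ids, and the function names are hypothetical, and a practical system would use an approximate nearest neighbor index rather than a linear scan.

```python
import heapq
import math


def l2_distance(u, v):
    """Euclidean distance between two embeddings (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retrieve_top_k(query_embedding, database, k):
    """Return the ids of the k database chunks closest to the query.

    `database` maps a chunk id to its embedding vector.
    """
    nearest = heapq.nsmallest(
        k, database.items(),
        key=lambda item: l2_distance(query_embedding, item[1]))
    return [chunk_id for chunk_id, _ in nearest]


# Toy 2D embeddings; the ids are made up for the example.
database = {"wall": [0.0, 1.0], "floor": [1.0, 0.0], "chair": [0.7, 0.7]}
print(retrieve_top_k([0.9, 0.1], database, k=2))

# New entries can be added later without touching the embedding networks.
database["sofa"] = [0.9, 0.1]
print(retrieve_top_k([0.9, 0.1], database, k=1))
```

Because the database is separate from the network weights, adding an entry only changes what can be retrieved, not how embeddings are computed.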
3.2 Reconstruction Refinement
Our initial retrieval-based reconstruction estimate provides a strong prior for global structures and fine-scale details in the scene, but the retrieved chunks may not be fully locally consistent with each other. Therefore, we leverage this estimated reconstruction to refine a globally coherent reconstruction while maintaining local detail. We visualize this refinement in Fig. 3. The input observation and the estimated retrieval-based reconstructions inform the final refinement. The input is passed through a U-Net-based [ronneberger2015u] feature extractor to produce a grid of features, which is split into a set of patch features. The retrieval approximations are first split into volumetric patches and are passed through a second feature extractor, structured analogously to the input feature extractor but with fewer parameters since it operates on retrieved patches. Next, we blend together the features from the input with those from the corresponding retrieval approximations. We leverage a patch-based attention layer which learns to select and blend the retrieved patches, based on feature similarities to the spatially corresponding features from the input data. The resulting feature grid is finally decoded with convolutions to output the reconstructed geometry.
Patch-based Attention.
We leverage a patch-based attention to encourage selection of robust patches from the retrieved chunks to inform the final reconstruction, i.e., only features that would most help the reconstruction are used.
The features from the input and from the retrieval-based reconstructions can be spatially aligned with each other. To select the most relevant features, we consider patches of these aligned features (with the patch size smaller than the chunk size of the retrieval, as we want to be able to select features within retrieved chunks); each patch of the input then corresponds to k patches of retrieved chunks, one per retrieval. Similarity between input and retrieval patch features is then computed in a lower-dimensional projection space, using cosine similarity to compute attention scores. The process is visualized in Fig. 4.
More formally, given the input patch features and the retrieved patch features for each nearest neighbor retrieval, the patch-based attention layer first computes the attention score as
(3) 
where the input and retrieved patch features are projected to a 32-dimensional normalized space by two networks implemented as MLPs. The attention weights are computed as a softmax of the attention scores:
(4) 
where a hyperparameter controls the sharpness of the softmax. A sharper softmax encourages selection over blending from the retrievals, in order to maintain the local detail present in the retrieved patches. The total contribution due to the retrievals is then given by the attention-weighted sum of retrieval features. Next, a learned blending function blends the input patch feature with this weighted sum based on the maximum attention score. That is, once we have the attention weights, the output from the attention layer is given as
(5)
with the blending coefficient given as
(6) 
where the shift and bias are learnable parameters. Intuitively, the attention weights determine which of the retrievals should contribute, while the blending coefficient determines how much the input features should contribute compared to the retrieval features. Finally, the blended patches are reinterpreted as a full grid and decoded to an output distance field.
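The attention and blending steps can be sketched for a single input patch as follows. This is a hedged illustration rather than the layer itself: the projection MLPs are omitted (similarity is computed directly on the given features), `sharpness`, `scale` and `offset` are hypothetical stand-ins for the softmax hyperparameter and the learnable blending parameters, and the blending direction (a higher maximum score giving more weight to the retrievals) is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, sharpness):
    """Softmax over scores; a larger sharpness approaches a hard selection."""
    scaled = [sharpness * s for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attend_and_blend(input_feat, retrieved_feats,
                     sharpness=10.0, scale=1.0, offset=0.0):
    """Blend one input patch feature with its k retrieved patch features."""
    scores = [cosine(input_feat, r) for r in retrieved_feats]
    weights = softmax(scores, sharpness)
    # Attention-weighted sum of the retrieved features.
    mixed = [sum(w * r[d] for w, r in zip(weights, retrieved_feats))
             for d in range(len(input_feat))]
    # Blending coefficient driven by the best attention score (assumed form).
    beta = sigmoid(scale * max(scores) + offset)
    blended = [beta * m + (1.0 - beta) * x for m, x in zip(mixed, input_feat)]
    return blended, weights, beta


blended, weights, beta = attend_and_blend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, beta)
```

With a large `sharpness`, nearly all weight falls on the best-matching retrieval, which corresponds to selecting one retrieved patch rather than averaging several.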
Refinement Loss.
To train the refinement, we employ a reconstruction loss on the final prediction as well as a retrieval-reconstruction loss and an attention loss:
(7) 
The reconstruction loss compares the final predicted distance field with the ground-truth distance field:
(8) 
The retrieval-reconstruction loss ensures that the refinement decoder continues to decode to the original distance field of the retrieved chunk features:
(9) 
which is evaluated per chunk. Finally, the attention embedding space is supervised with
(10) 
where NT-Xent is the normalized, temperature-scaled cross entropy loss, applied to the target patch features and the corresponding input patch features.
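How the three terms combine can be sketched as a weighted sum. The sketch below is illustrative only: it assumes an L1 distance for both distance field terms, takes the attention loss as a precomputed scalar, and uses hypothetical weight names and defaults.

```python
def l1_loss(pred, target):
    """Mean absolute difference between two flattened distance fields."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def refinement_loss(pred_field, gt_field,
                    decoded_retrieval, retrieved_field,
                    attention_loss,
                    w_retrieval=1.0, w_attention=1.0):
    """Weighted sum of the three refinement terms (weights are placeholders)."""
    reconstruction = l1_loss(pred_field, gt_field)
    # Keeps the decoder able to reproduce the retrieved chunk geometry.
    retrieval_reconstruction = l1_loss(decoded_retrieval, retrieved_field)
    return (reconstruction
            + w_retrieval * retrieval_reconstruction
            + w_attention * attention_loss)


print(refinement_loss([0.0, 1.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0],
                      attention_loss=0.5, w_retrieval=2.0, w_attention=4.0))
```

The weights only rescale the terms relative to each other, which matters when their raw magnitudes differ.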
3.3 Implementation Details
For both tasks of 3D super-resolution and point cloud to surface reconstruction, we use a truncated distance field (TDF) representation for the target geometry (larger scenes are processed in a sliding window fashion in windows), which is converted to meshes by Marching Cubes [lorensen1987marching]. We use a chunk size in the target domain for retrievals, resulting in chunks per sample. The spatial attention uses a smaller patch size of .
We use a temperature of for the retrieval NT-Xent, and for the attention phase NT-Xent. We found that a lower temperature works better for the smaller-sized patches. The refinement phase uses retrieval approximations for an input. The refinement loss coefficients are chosen to bring the losses to similar magnitudes.
All networks are trained using Adam [kingma2014adam] with a learning rate of . We use a batch size of 196 for retrieval training, and 8 for refinement. We train on a single NVIDIA 2080Ti for 150k iterations for retrieval ( hours) and 350k iterations for refinement ( hours).
4 Results
We demonstrate our approach on the tasks of 3D super-resolution and surface reconstruction from sparse point clouds, on both objects and scenes as well as synthetic and real-world data. We evaluate on three datasets with increasing complexity: ShapeNet [chang2015shapenet] (synthetic shapes), 3DFront [fu20203dfront] (synthetic scenes) and Matterport3D [chang2017matterport3d] (real-world 3D scans). For ShapeNet, we use the 13 class subset and train/test split from [choy20163d]. For 3DFront, we use the scenes which have furniture in a train/test split of rooms. We use the official train/test split for Matterport3D of 72/18 buildings and 1799/394 rooms.
Metrics. We evaluate the reconstructed geometry with complementary metrics: Volumetric IoU (IoU), Chamfer Distance (CD) in meters, Normal Consistency (NC) as cosine distances and F-score (F1) at window size threshold. Additional evaluation details can be found in the appendix.
4.1 3D Super Resolution
For the task of 3D super resolution, we consider low-resolution geometry as input, and aim to reconstruct high-resolution target geometry. We use input / target voxel sizes of m / m for 3DFront, m / m for Matterport3D, and resolutions / for ShapeNet. For synthetic and real-world 3D scenes of varying sizes, we operate in a sliding window fashion with a stride of 64 voxels; we similarly run all baselines in the same sliding fashion.
Comparison to state of the art. We compare to state-of-the-art 3D generative approaches: SGNN [dai2020sg] which operates on sparse volumetric data, IFNet [chibane2020implicit] which learns implicit reconstruction, and Convolutional Occupancy Networks [peng2020convolutional] which uses a hybrid of volumetric convolutions coupled with implicit decoders. While these methods encode the entire generative process in the network, we additionally use an explicit database that can assist the reconstruction during inference. In Tab. 1, we show a quantitative comparison; our ability to leverage strong priors from retrieved scene data enables more effective reconstructions of 3D scenes. We additionally show a qualitative comparison in Fig. 5; our approach maintains sharper detail which is more easily propagated by retrieving priors from the train set.
Method  ShapeNet IoU  ShapeNet CD  ShapeNet F1  ShapeNet NC  3DFront IoU  3DFront CD  3DFront F1  3DFront NC  Matterport3D IoU  Matterport3D CD  Matterport3D F1  Matterport3D NC
SGNN  0.624  0.668  0.813  0.889  0.639  0.032  0.733  0.900  0.731  0.021  0.697  0.916 
ConvOcc  0.648  0.726  0.838  0.906  0.631  0.033  0.711  0.901  0.584  0.027  0.542  0.879 
IFNet  0.650  0.623  0.838  0.892  0.639  0.041  0.736  0.878  0.593  0.028  0.624  0.893 
Ours  0.655  0.590  0.844  0.905  0.751  0.027  0.801  0.922  0.739  0.020  0.708  0.923 
4.2 Surface Reconstruction
We additionally demonstrate our approach on the task of surface reconstruction from point cloud data. Input point clouds are obtained by randomly sampling points for each shape in ShapeNet, points per for 3DFront, and points per for Matterport3D.
Comparison to state of the art. We compare to the state-of-the-art 3D generative approaches as in the 3D super-resolution task (SGNN [dai2020sg], IFNet [chibane2020implicit], Convolutional Occupancy Networks [peng2020convolutional]), in addition to Screened Poisson Surface Reconstruction (SPSR) [kazhdan2006poisson, kazhdan2013screened] and Local Implicit Grids (LIG) [jiang2020local]. All data-driven methods are trained on our data. For our method, SGNN, and IFNet, which take volumetric input, we consider the point cloud as a volumetric occupancy grid (occupancy for voxels containing any points). Tab. 2 shows a quantitative comparison, where our learned use of train scene data through an attention-based refinement provides more accurate geometric reconstruction. Fig. 6 additionally shows that our reconstructions more effectively capture both global structures and local details in the scenes.
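The conversion of a point cloud into an occupancy grid, where a voxel is occupied if it contains any point, can be sketched as below. The sparse set-of-indices representation, the function name, and the voxel size used in the example are assumptions for illustration and are not taken from the experiments.

```python
import math


def voxelize(points, voxel_size, grid_min=(0.0, 0.0, 0.0)):
    """Return the set of integer voxel indices containing at least one point.

    `points` is an iterable of (x, y, z) tuples in the same units as
    `voxel_size`; `grid_min` is the corner of the grid with the lowest
    coordinates.
    """
    occupied = set()
    for x, y, z in points:
        index = (math.floor((x - grid_min[0]) / voxel_size),
                 math.floor((y - grid_min[1]) / voxel_size),
                 math.floor((z - grid_min[2]) / voxel_size))
        occupied.add(index)
    return occupied


# Two of the three points fall into the same voxel.
cloud = [(0.01, 0.01, 0.01), (0.04, 0.02, 0.03), (0.12, 0.0, 0.0)]
print(sorted(voxelize(cloud, voxel_size=0.05)))
```

A dense grid can be filled from this set when a volumetric network expects a fixed-size input.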
Method  ShapeNet IoU  ShapeNet CD  ShapeNet F1  ShapeNet NC  3DFront IoU  3DFront CD  3DFront F1  3DFront NC  Matterport3D IoU  Matterport3D CD  Matterport3D F1  Matterport3D NC
SPSR  0.333  3.225  0.523  0.852  0.204  0.438  0.267  0.755  0.234  0.105  0.245  0.841 
LIG  0.589  0.751  0.767  0.872  0.566  0.041  0.673  0.886  0.546  0.034  0.576  0.868 
SGNN  0.494  0.876  0.673  0.857  0.738  0.025  0.804  0.919  0.441  0.029  0.471  0.867 
ConvOcc  0.600  0.779  0.765  0.913  0.565  0.037  0.667  0.905  0.419  0.034  0.420  0.859 
IFNet  0.777  0.420  0.937  0.923  0.779  0.028  0.832  0.918  0.575  0.029  0.607  0.866 
Ours  0.783  0.377  0.947  0.938  0.863  0.021  0.875  0.955  0.710  0.021  0.702  0.917 
4.3 Ablations
Effect of retrieval and attention-based refinement.
We evaluate the effect of our retrieval-based priors and attention-based refinement in Tab. 3. We consider Retrieval as the initial nearest neighbor estimate provided by the retrieved scene data, Backbone as a U-Net backbone styled similar to our refinement (and similar number of parameters to our refinement) but without using retrievals or attention as there are no retrievals to attend to, and Naive to be our retrieval and refinement using concatenation of features instead of attention. A visualization is shown in Fig. 7, with Retrieval appearing disjoint between different retrieved chunks, Backbone producing oversmoothed results, Naive providing more details but still suffering from oversmoothing, and our method (with retrieval priors combined with attention-based refinement) producing the most consistent structure with local details defined.
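Network  Super Resolution IoU  Super Resolution CD  Super Resolution F1  Super Resolution NC  Surface Reconstruction IoU  Surface Reconstruction CD  Surface Reconstruction F1  Surface Reconstruction NC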
Retrieval  0.67  0.032  0.71  0.87  0.70  0.028  0.75  0.88 
Backbone  0.68  0.029  0.77  0.91  0.83  0.024  0.85  0.94 
Naive  0.71  0.028  0.77  0.91  0.84  0.023  0.86  0.96 
Ours  0.75  0.026  0.80  0.92  0.86  0.021  0.88  0.96 
Effect of number of nearest neighbor retrievals. Tab. 4 shows the effect of increasing the number of nearest neighbor retrievals used as a prior for the scene reconstruction. With more nearest neighbors, the attention-based refinement has more candidates to select geometry from, improving performance but with decreasing marginal gain.
k  IoU  CD  F1  NC  GPU memory 
0  0.684  0.029  0.773  0.909  2.3G 
1  0.733  0.028  0.794  0.920  6.4G 
2  0.741  0.027  0.797  0.923  7.9G 
3  0.745  0.027  0.797  0.922  9.0G 
4  0.751  0.027  0.801  0.922  9.9G 
Extending the database during test time. Our approach can take advantage of new entries in the database without retraining. Specifically, we conducted an experiment where we train on a subset of ShapeNet classes and evaluate on other classes. In Tab. 5, we show that if we augment the database with chunks from the train set of these other classes, our method improves without retraining by leveraging better retrievals. Note that the other baselines would need to be retrained or refined to take advantage of new data.
Variant  IoU  CD  F1  NC 
Ours  0.478  0.034  0.601  0.811 
Ours (extended DB)  0.579  0.029  0.743  0.825 
4.4 Implicit Reconstruction with a Database
We can also apply our approach to implicit networks for 3D reconstruction by leveraging our retrieval estimates as an initial reconstruction for implicit-based refinement. We thus incorporate our retrieval-based reconstruction with IFNet [chibane2020implicit], maintaining our distance field database and incorporating the IFNet encoder (which also takes volumetric input) as well as the IFNet decoder. For additional architecture details, we refer to the appendix. Tab. 6 shows that our retrieval-based reconstruction on the 3DFront super-resolution task also helps to improve upon a learned implicit 3D reconstruction. Qualitative results are shown in Fig. 8.
Method  IoU  CD  F1  NC 
IFNet  0.639  0.041  0.736  0.878 
Ours (Implicit)  0.687  0.038  0.766  0.897 
Limitations.
Our approach to leverage high-quality train scene data for 3D reconstruction shows notable improvements over the state of the art; however, several limitations remain. Our retrieval estimates and refinement operate in two separate stages without gradient propagation from the final reconstruction to the retrieved scene data, resulting in possibly suboptimal retrieval where the refinement must compensate more; developing a differentiable NN retrieval [plotz2018neural, tseng2020retrievegan] for refinement could bridge these disconnects. Additionally, by constructing our retrieval dictionary from chunks of train scenes, our dictionary size can grow quite large; we believe a learned dictionary could help to construct the most informative characteristics to be leveraged for reconstruction. We refer to the appendix for visualisation and discussion of limitations and failure cases.
5 Conclusion
We introduce a new approach to geometric scene reconstruction, leveraging learned retrieval of train scene data as a basis for attention-based refinement to produce more globally consistent reconstructions while maintaining local detail. By first constructing a reconstruction estimate composed of train scene chunks, we can learn to propagate desired geometric properties inherent in existing scene data, such as clean structures and local detail, through our patch-based attention in the reconstruction refinement. This produces more accurate scene reconstructions from low-resolution or point cloud input, and opens up exciting avenues for exploiting constructed or even learned dictionaries for challenging generative 3D tasks.
Acknowledgements
This work was supported by the ZD.B (Zentrum Digitalisierung.Bayern), a TUMIAS Rudolf Mößbauer Fellowship, an NVidia Professorship Award, the ERC Starting Grant Scan2CAD (804724), and the German Research Foundation (DFG) Grant Making Machine Learning on Static and Dynamic 3D Data Practical. Apple was not involved in the evaluations and implementation of the code.
References
Appendix A Implementation Details
Levels of Operation for Scene Reconstruction.
Fig. 9 shows the different levels at which our method operates to reconstruct a 3D scene. Larger scenes are split into fixed-size windows, chunk retrievals are made on smaller chunks for more expressiveness, and attention-based blending works on yet smaller patches to allow the method to choose among different retrievals at a finer level of detail.
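The three levels of operation can be illustrated by tiling a window into chunks and a chunk into patches. The sizes below are hypothetical placeholders chosen only so that the arithmetic is easy to follow, and the helper name is likewise made up for the example.

```python
from itertools import product


def split_origins(extent, block):
    """Origins of the non-overlapping cubes of side `block` tiling a cube of side `extent`."""
    if extent % block != 0:
        raise ValueError("extent must be a multiple of block")
    steps = range(0, extent, block)
    return list(product(steps, steps, steps))


# Hypothetical sizes in voxels: window > chunk > patch.
WINDOW, CHUNK, PATCH = 64, 16, 4

chunk_origins = split_origins(WINDOW, CHUNK)   # retrieval happens per chunk
patch_origins = split_origins(CHUNK, PATCH)    # attention happens per patch
print(len(chunk_origins), len(patch_origins))
```

Each chunk origin indexes one retrieval query inside the window, and each patch origin indexes one attention decision inside a chunk.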
Network Architecture.
Inference Time and Number of Parameters.
We report the number of trainable parameters and the inference time for our method (both retrieval and refinement stage) along with that of the baselines in Tab. 7 for the 3D super-resolution task. All runtimes are reported on a machine with an Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz and an NVIDIA 2080Ti GPU. We use FLANN [muja2009fast] to speed up nearest neighbor lookups from the database. Our retrieval inference time is significantly higher than refinement due to multiple disk reads to retrieve chunks (number of chunks × number of retrievals). To avoid this overhead during training, once the retrieval networks have been trained, we preprocess the entire training set to extract retrievals before starting refinement stage training.
Method  Inference Time (s)  # Parameters () 
SGNN [dai2020sg]  2.297  0.64 
ConvOcc [peng2020convolutional]  1.707  1.04 
IFNet [chibane2020implicit]  0.708  2.95 
Ours (Retrieval)  0.784  0.77 
Ours (Refinement)  0.012  1.49 
Chunk side (m)  Retrieval IoU  Retrieval CD  Retrieval NC  Entries  Refinement IoU  Refinement CD  Refinement NC
3.467  0.53  0.074  0.72  43092  0.71  0.029  0.91 
1.733  0.60  0.041  0.85  344249  0.72  0.028  0.91 
0.867  0.67  0.033  0.87  2093592  0.75  0.026  0.92 
Variant  IoU  CD  F1  NC
Retrieval  0.364  0.781  0.525  0.708 
Backbone  0.463  0.647  0.602  0.813 
Naive  0.432  0.684  0.576  0.798 
Ours  0.478  0.635  0.601  0.811 
A.1 IFNet-based RetrievalFuse
To demonstrate the wide applicability of our method, we also integrate our approach into the implicit-based reconstruction of IFNet [chibane2020implicit] to leverage our retrieved approximations. We keep the IFNet encoder and decoder unmodified, and add an additional retrieval encoder for processing the retrieved reconstruction approximations. This retrieval encoder is based on the original IFNet encoder, and works with chunks from the retrievals. For a given point in space, features sampled at the point from feature volumes at different levels of the input encoder make up the input features. Features sampled from the retrieval feature volumes at this point for each of the retrievals make up the retrieval features. Next, based on the feature volume at the last layer of the input and retrieval encoders, a blending coefficient grid and an attention weight grid are obtained. To obtain these, the input feature volume and the retrieval feature volume are interpreted as 512 patch volumes of shape and respectively. These input and corresponding retrieval patch volumes are mapped to a shared embedding space, from which we can get the blending coefficient (Eq. 6, main paper) and attention weights (Eq. 4, main paper). Once we have the blending coefficient grid and attention weight grid, we can sample their values at the queried point. Finally, we blend the sampled input features and the sampled retrieved features (Eq. 5, main paper) to give the blended feature that is decoded by the IFNet decoder.
A.2 Baselines
We use the official implementations provided by the authors of IFNet [chibane2020implicit], Convolutional Occupancy Networks [peng2020convolutional], SGNN [dai2020sg], Local Implicit Grids [jiang2020local] and Screened Poisson Reconstruction [kazhdan2013screened] in our experiments. For 3D super-resolution experiments, the methods are provided with low-resolution distance field grids as inputs instead of voxel grid inputs. In particular, for IFNet we use the ShapeNet32Vox model for 3D super-resolution. For surface reconstruction from point clouds for IFNet, the discretized point cloud is used with the ShapeNetPoints model. For Convolutional Occupancy Networks we use the voxel simple encoder for 3D super-resolution, and a point net local pool encoder for point cloud surface reconstruction. For SGNN, we use a resolution with nearest-neighbor upsampling to a grid for the input. For Local Implicit Grids we found that the part sizes shape size for ShapeNet and window size for 3DFront and Matterport3D worked best at the sparsity of the input point cloud.
Appendix B Data Generation and Evaluation Metrics
Data generation.
As specified in the main paper, the targets for both 3D super-resolution and surface reconstruction from point cloud tasks are distance field grids. Training and inference on larger scenes is done in a sliding window manner with a window stride of 64. We use SDFGen (https://github.com/christopherbatty/SDFGen) to generate these distance field targets. Low-resolution distance field inputs are generated in a similar manner at a coarser resolution. Point cloud samples for the surface reconstruction task are generated as random samples on the surface of meshes generated from target distance fields.
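The sliding window traversal can be sketched along a single axis as follows. Only the stride of 64 comes from the text; the window size of 64, the function name, and the handling of the tail (adding one extra window flush with the end of the scene) are assumptions for illustration, and padding the scene would be an equally plausible choice.

```python
def window_starts(length, window=64, stride=64):
    """Start indices of sliding windows covering [0, length) along one axis."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] + window < length:
        # Add a final window aligned with the end so the tail is covered.
        starts.append(length - window)
    return starts


print(window_starts(200))
print(window_starts(128))
```

Applying the same function to each of the three axes and taking the Cartesian product of the results gives the 3D window origins.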
For IFNet [chibane2020implicit], Convolutional Occupancy Networks [peng2020convolutional], and our implicit variant, all of which need supervision in the form of points along with their occupancies, we first extract meshes from the target distance fields using the marching cubes algorithm [lorensen1987marching]. These meshes are then made watertight using implicit waterproofing [chibane2020implicit], from which points and their occupancies are finally sampled. SGNN is provided the same inputs and targets as ours for training, with the respective inputs upsampled to match the target resolution grid. Local Implicit Grids [jiang2020local] is trained on ShapeNet, and Screened Poisson Reconstruction [kazhdan2013screened] does not require training; however, both methods are provided high-resolution normals to obtain oriented point clouds as inputs.
Evaluation Metrics.
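We follow the definitions and implementations of Chamfer Distance, Normal Consistency, and F-Score from [peng2020convolutional]. Specifically, Chamfer Distance (CD) is defined as:

$$\mathrm{CD}(\mathcal{M}_{\text{pred}}, \mathcal{M}_{\text{gt}}) = \frac{1}{2}\left(\mathrm{Acc}(\mathcal{M}_{\text{pred}} \mid \mathcal{M}_{\text{gt}}) + \mathrm{Comp}(\mathcal{M}_{\text{pred}} \mid \mathcal{M}_{\text{gt}})\right),$$

where $\mathcal{M}_{\text{pred}}$ and $\mathcal{M}_{\text{gt}}$ are the predicted and target meshes (obtained by running marching cubes on the predicted and target distance fields). $\mathrm{Acc}$ and $\mathrm{Comp}$ are accuracy and completeness, given as:

$$\mathrm{Acc}(\mathcal{M}_{\text{pred}} \mid \mathcal{M}_{\text{gt}}) = \frac{1}{|\partial\mathcal{M}_{\text{pred}}|} \int_{\partial\mathcal{M}_{\text{pred}}} \min_{q \in \partial\mathcal{M}_{\text{gt}}} \lVert p - q \rVert \, dp$$

and

$$\mathrm{Comp}(\mathcal{M}_{\text{pred}} \mid \mathcal{M}_{\text{gt}}) = \frac{1}{|\partial\mathcal{M}_{\text{gt}}|} \int_{\partial\mathcal{M}_{\text{gt}}} \min_{p \in \partial\mathcal{M}_{\text{pred}}} \lVert p - q \rVert \, dq,$$

with $\partial\mathcal{M}_{\text{pred}}$ and $\partial\mathcal{M}_{\text{gt}}$ denoting the surfaces of the meshes. Normal Consistency (NC) is defined as:

$$\mathrm{NC}(\mathcal{M}_{\text{pred}}, \mathcal{M}_{\text{gt}}) = \frac{1}{2|\partial\mathcal{M}_{\text{pred}}|} \int_{\partial\mathcal{M}_{\text{pred}}} \left|\langle n(p), n(\mathrm{proj}_{\text{gt}}(p)) \rangle\right| dp + \frac{1}{2|\partial\mathcal{M}_{\text{gt}}|} \int_{\partial\mathcal{M}_{\text{gt}}} \left|\langle n(\mathrm{proj}_{\text{pred}}(q)), n(q) \rangle\right| dq,$$

where $\langle \cdot, \cdot \rangle$ indicates the inner product, $n(p)$ and $n(q)$ are the unit normal vectors on the mesh surface, and $\mathrm{proj}_{\text{gt}}(p)$ and $\mathrm{proj}_{\text{pred}}(q)$ are the projections of $p$ and $q$ onto the mesh surfaces $\partial\mathcal{M}_{\text{gt}}$ and $\partial\mathcal{M}_{\text{pred}}$, respectively. F-Score [tatarchenko2019single] is defined as the harmonic mean of precision and recall, where recall is the fraction of points on $\partial\mathcal{M}_{\text{gt}}$ that lie within a certain distance to $\partial\mathcal{M}_{\text{pred}}$, and precision is the fraction of points on $\partial\mathcal{M}_{\text{pred}}$ that lie within a certain distance to $\partial\mathcal{M}_{\text{gt}}$. For calculating the volumetric IoU, we first voxelize the meshes $\mathcal{M}_{\text{pred}}$ and $\mathcal{M}_{\text{gt}}$, using metric voxel sizes for 3DFront and Matterport3D and a grid resolution for ShapeNet. The IoU is then given as:

$$\mathrm{IoU}(\mathcal{M}_{\text{pred}}, \mathcal{M}_{\text{gt}}) = \frac{|\mathcal{M}_{\text{pred}} \cap \mathcal{M}_{\text{gt}}|}{|\mathcal{M}_{\text{pred}} \cup \mathcal{M}_{\text{gt}}|}.$$

Appendix C Additional Evaluation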
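For reference, the Chamfer Distance, F-Score, and volumetric IoU described in Appendix B can be approximated from points sampled on the two mesh surfaces and from sets of occupied voxel indices. The sketch below is an assumption-laden illustration and not the evaluation code of [peng2020convolutional]: it uses a brute-force nearest-neighbor search, replaces the surface integrals by averages over the samples, and treats the threshold `tau` as a free parameter.

```python
import math


def _nearest(p, points):
    """Distance from p to its nearest neighbor in points (brute force)."""
    return min(math.dist(p, q) for q in points)


def chamfer_and_fscore(pred, gt, tau):
    """Chamfer Distance and F-Score between two sampled point sets.

    pred, gt: lists of (x, y, z) points sampled on each mesh surface.
    tau: hypothetical distance threshold for precision and recall.
    """
    d_pred = [_nearest(p, gt) for p in pred]   # predicted -> target
    d_gt = [_nearest(q, pred) for q in gt]     # target -> predicted
    accuracy = sum(d_pred) / len(d_pred)
    completeness = sum(d_gt) / len(d_gt)
    chamfer = 0.5 * (accuracy + completeness)
    precision = sum(d <= tau for d in d_pred) / len(d_pred)
    recall = sum(d <= tau for d in d_gt) / len(d_gt)
    total = precision + recall
    fscore = 2 * precision * recall / total if total > 0 else 0.0
    return chamfer, fscore


def voxel_iou(pred_voxels, gt_voxels):
    """IoU of two collections of occupied voxel indices."""
    pred_voxels, gt_voxels = set(pred_voxels), set(gt_voxels)
    union = pred_voxels | gt_voxels
    return len(pred_voxels & gt_voxels) / len(union) if union else 1.0


pred_pts = [(0, 0, 0), (1, 0, 0)]
gt_pts = [(0, 0, 0), (1, 0, 1)]
cd, f1 = chamfer_and_fscore(pred_pts, gt_pts, tau=0.5)
iou = voxel_iou(pred_pts, gt_pts)
```

For point sets of realistic size, the brute-force search would typically be replaced by a spatial index such as a k-d tree.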
C.1 Ablation Studies
Chunk Embedding Space Visualization.
Fig. 13 visualizes the embedding space used for retrieving chunks from our database. Chunks with similar geometry end up lying closer in this space.
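Retrieval in such an embedding space amounts to a top-k nearest-neighbor lookup. The sketch below assumes Euclidean distance, a dictionary-based database, and toy two-dimensional embeddings; all names and values are hypothetical and chosen only to illustrate that chunks with similar geometry are returned first.

```python
import heapq
import math


def retrieve_top_k(query, database, k=4):
    """Return the ids of the k chunks closest to the query embedding.

    database: dict mapping a chunk id to its embedding vector.
    Euclidean distance and the default k are assumptions for illustration.
    """
    return heapq.nsmallest(
        k, database, key=lambda chunk_id: math.dist(query, database[chunk_id])
    )


# Toy 2D embeddings; real embeddings would be learned and higher-dimensional.
database = {
    "flat_wall": (0.0, 1.0),
    "wall_corner": (1.0, 1.0),
    "chair_leg": (5.0, 5.0),
}
neighbors = retrieve_top_k((0.1, 0.9), database, k=2)
```

The result is ordered from the closest to the farthest chunk, so the first entry is the best retrieval candidate.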
Effect of retrieved chunk size on the performance of our method.
Tab. 8 evaluates our method with retrieval approximations of different chunk sizes. A chunk size that is too large cannot effectively capture the diversity of scene arrangements, while smaller sizes can represent a wider variety of geometry at the cost of an increased database size.
C.2 Additional Qualitative Results
We provide additional qualitative evaluation of our method on 3DFront and Matterport3D super-resolution and point cloud to surface reconstruction tasks in Fig. 16 and Fig. 17, respectively. Qualitative evaluation on ShapeNet for both tasks is provided in Fig. 18. Further, additional qualitative visualization for the effect of retrieval and attention-based refinement (main paper, Section 4.3) is provided in Fig. 12.
Appendix D Additional Discussion
The results in the main paper, as well as the additional experiments in this document, show the broad applicability of our method, which achieves state-of-the-art reconstruction and super-resolution outputs. Nevertheless, our approach still has limitations, as discussed in the main paper. In particular, if the retrieval approximations are suboptimal, they will not help in the refinement process. Fig. 14 visualizes some samples where the retrieval approximations do not help the reconstruction. Even in these cases, however, the retrievals do not worsen the reconstruction, since the blending network effectively ignores them. The dependence on good retrievals can be observed more clearly in the following experiment. We train our retrieval and refinement networks on a subset of ShapeNet classes, and the dictionary is created using chunks from the same classes. The trained networks are then evaluated on a subset of 5 new classes. As shown in Tab. 9 and Fig. 15, our method does not improve significantly over the backbone network due to low-quality retrievals. However, compared to a naive fusion of features from retrievals, which learns to rely on retrievals during training, our method is more robust.
A further limitation of our method is the cubic growth in the number of chunks in the database as the chunk size decreases. As observed in Tab. 8, smaller chunks help both retrieval and refinement. This, however, comes at the cost of more chunks in the database, making database indexing and retrieval slower.
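The cubic growth can be made concrete with a small calculation. The sketch below counts non-overlapping cubic chunks tiling a single cubic scene grid; the scene extent of 128 voxels and the chunk sizes are hypothetical values chosen for illustration, not the settings of Tab. 8.

```python
def chunks_per_scene(scene_extent, chunk_size):
    """Number of non-overlapping cubic chunks tiling a cubic scene grid.

    scene_extent and chunk_size are edge lengths in voxels (assumed values).
    """
    per_axis = -(-scene_extent // chunk_size)  # ceiling division
    return per_axis ** 3


for size in (32, 16, 8):
    print(size, chunks_per_scene(128, size))
# prints:
# 32 64
# 16 512
# 8 4096
```

Halving the chunk edge length multiplies the number of chunks by eight, which is the source of the slower indexing and retrieval noted above.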