Neural Fields as Learnable Kernels for 3D Reconstruction

11/26/2021
by Francis Williams, et al.

We present Neural Kernel Fields: a novel method for reconstructing implicit 3D shapes based on a learned kernel ridge regression. Our technique achieves state-of-the-art results when reconstructing 3D objects and large scenes from sparse oriented points, and can reconstruct shape categories outside the training set with almost no drop in accuracy. The core insight of our approach is that kernel methods are extremely effective for reconstructing shapes when the chosen kernel has an appropriate inductive bias. We thus factor the problem of shape reconstruction into two parts: (1) a backbone neural network which learns kernel parameters from data, and (2) a kernel ridge regression that fits the input points on-the-fly by solving a simple positive definite linear system using the learned kernel. As a result of this factorization, our reconstruction gains the benefits of data-driven methods under sparse point density while maintaining interpolatory behavior, which converges to the ground truth shape as input sampling density increases. Our experiments demonstrate a strong generalization capability to objects outside the train-set category and scanned scenes. Source code and pretrained models are available at https://nv-tlabs.github.io/nkf.


1 Introduction

The goal of 3D reconstruction is to recover geometry from partial measurements of a shape. In this work, we aim to map a sparse set of oriented points sampled from the surface of a shape to a 3D implicit surface for that shape. Surface reconstruction from point clouds is a well studied topic in computer vision and graphics, with applications in robotics, entertainment, and manufacturing. Techniques for surface reconstruction broadly fall into two types: implicit methods which aim to recover a volumetric function whose zero level-set encodes the surface, and explicit methods which directly recover a triangle mesh from the input points. While implicit approaches can adapt to arbitrary topologies, the requirement to store a dense volumetric field led many past works to favor explicit approaches [44, 19]. More recently, implicit approaches have regained popularity due to a number of works demonstrating that neural networks are compact and effective at encoding signed-distance [45, 57] and occupancy fields [41, 48]. These works pair neural field¹ representations with modern advances in point cloud processing architectures to produce powerful reconstruction techniques. Current state-of-the-art shape reconstruction methods can be categorized along three axes (Fig. 2):

¹A neural field refers to the parameterization of a continuous function of spatial coordinates using a neural network. In this work we focus on scalar functions mapping coordinates to real numbers.

(1) Feed-forward vs. test-time optimization: Feed-forward methods leverage shape priors to directly predict a surface from input points. While these methods are fast, they are not strictly constrained by their input and thus may perform a task more akin to retrieval than reconstruction (see [59] and Fig. 1, top). This results in decreased generalization performance on out-of-distribution shapes and input point densities. In contrast, test-time optimization via latent space traversal allows adaptation to the input, but is slow and can converge to poor local minima (see e.g. [16] and Fig. 1, bottom).

(2) Whether or not to leverage data priors: Data-free methods recover the surface by minimizing the residuals between the reconstructed surface and the input points, leveraging a pre-determined prior to control the behavior away from the input points (e.g. a smooth space of functions [35, 64] or emergent regularization arising from neural architectures [63, 25]). Such fixed priors are, however, difficult to tailor to specific tasks, like completion of partial shapes (Fig. 1, middle). Data-driven approaches, on the other hand, can learn task-specific priors to predict shapes that resemble a given dataset.

(3) At which scale to process and represent data: Local-scale methods [33, 4] use the idea that complex structures can be reduced to a collection of simpler geometric primitives. These methods learn local models which are used to reconstruct a surface in patches. While this approach can generalize better, patch size plays a critical role and must be carefully tuned per object (Fig. 1, bottom). Furthermore, without any notion of global context, these methods are unable to complete larger missing regions, leaving a fundamental gap in their generalization performance.

Based on these axes and the motivating examples in Fig. 1, we identify the need for a method that can learn good priors from a simple collection of shapes to drive 3D reconstruction of both in-distribution and out-of-distribution shapes and scenes. In particular, the priors learned by this method should respect the input points, performing reconstruction rather than retrieval.

[Figure 1 panels — top row: SPSR [35], NS [64], Ours; middle row: C-OccNet [48], OccNet [41], Ours; bottom row: LIG [33] (0.3), LIG [33] (0.1), Ours]
Figure 1: Comparison of our approach with methods along the three axes in Sec. 1. Top row: Data-free methods [35, 64] respect the input points, but their simple fixed priors cannot complete the partial shape. Middle row: Feed-forward methods [41, 48] learn from data, but miss the slats on the slightly out-of-distribution canoe. Bottom row: LIG [33], a local method which performs test-time optimization, is very sensitive to the choice of patch size (0.3 left vs. 0.1 middle) and gets stuck in bad local minima (bumpy artefacts).

We thus propose a method using a novel representation of neural fields based on learned kernels, which we call Neural Kernel Fields (NKFs). In brief, NKFs work by learning a positive definite kernel conditioned on an input point cloud, and then using that kernel to predict an implicit shape by solving a simple linear system (Fig. 3). Our approach provides several key benefits. First, since predicted kernels are conditioned on the input and learned from data, they enjoy the versatility of learning-based methods. Second, since NKFs leverage a kernel for shape prediction, any reconstructed surface respects the input points by construction. Third, unlike gradient-descent-based latent space optimization, at test time NKF kernel weights are solved in closed form via a simple convex least-squares problem, guaranteeing good minima. Finally, our kernel acts as a global aggregator of spatially local features, allowing our method to work at a wide variety of sampling densities without tuning any scale parameters. The result is a generalizable method that can be trained only on synthetic shapes and seamlessly reconstruct out-of-distribution shapes and large-scale scenes, while being robust to changes in input point density. Compared with the baselines, our method achieves a marked improvement in reconstruction detail on both in- and out-of-distribution shapes. We summarize our contributions as follows:

  • We introduce Neural Kernel Fields, a novel representation of neural fields for 3D reconstruction, which outputs highly detailed surfaces that respect the input points.

  • Our NKF representation achieves state-of-the-art performance on ShapeNet reconstruction (Section 4.1).

  • We show state-of-the-art generalization performance on out-of-distribution shapes (Section 4.3), scenes (Section 4.4), and point densities (Section 4.5).

2 Related Work

[Figure 2 diagram — axes: Local vs. Global and Feedforward Pred. vs. Test-time Opt.; methods shown: OccNet [41], IM-Net [7], C-OccNet [48], IF-Net [9], SIF [22], LDIF [20], NGLOD [57], MPU [42], MLS [8], DeepLS [4], LIG [33], SPSR [35], NS [64], SALD [2], DeepSDF [45], SIREN [55], FFN [58], MetaSDF [54], SAL [1], IGR [25], Ours]
Figure 2: Taxonomy of design choices for methods which reconstruct implicit shapes from point clouds. The two axes of the diagram correspond to axes 1 and 3 discussed in Section 1, respectively. Color corresponds to axis 2: blue methods use learned priors, and orange methods do not.

Figure 2 visualizes existing implicit 3D shape reconstruction methods along the three axes defined in Section 1. Our Neural Kernel Field approach lies at the center of the diagram since it (1) uses a simple convex test time optimization, (2) leverages priors learned from data, and (3) learns local features on a spatial grid, but aggregates these globally during fitting.

We now highlight several works that are particularly relevant to our approach. Learned kernels were investigated in [66, 32, 46] and used for tasks such as few-shot transfer learning and image classification. Neural Splines [64] used a kernel method derived from infinitely wide ReLU networks to reconstruct 3D surfaces from points. Convolutional Occupancy Networks [48] proposed a convolutional architecture that maps 3D points to features; we use a similar feature network for our Neural Kernel Field architecture. LIG [33] addresses the need for reconstruction methods that can generalize. MetaSDF [54] meta-learns a network which can be rapidly trained to predict SDFs; Neural Kernel Fields can also be viewed as a form of meta-learning, since they predict a kernel machine from data. Shape as Points [47] is concurrent work relevant to our method: it solves a linear system to reconstruct a surface after a learned upsampling phase. Unlike our method, however, Shape as Points relies on the fixed inductive bias of Poisson reconstruction to output a surface rather than learning an inductive bias from data.

Beyond methods based on implicit surfaces, other shape reconstruction techniques exist which leverage different output representations. These representations include dense point clouds [51, 40, 73, 49, 50, 72, 56, 69, 70, 17, 36], polygonal meshes [30, 6, 19, 29, 24, 62, 12, 38, 27, 53], manifold atlases [63, 15, 26, 18, 3], and voxel grids [10, 60, 28, 67, 61, 23]. While our method focuses on shape reconstruction from points, past work has used neural fields to perform a variety of 3D tasks such as shape compression [57, 64], shape prediction from images [41, 37], voxel grid upsampling [48, 41], reconstruction from rotated inputs [14] and articulated poses [13, 71], and video to 3D [68, 39].

3 Method

Figure 3: Our method works in two stages: (1) prediction (top row), where we predict an implicit function from an input point cloud, and (2) evaluation (bottom row), where we evaluate the implicit function. Our predicted implicit function consists of a feature function which lifts points in the volume to feature vectors, and a set of coefficients $\alpha$ which encode the function as a linear combination of basis functions centered at the input points.

Our approach predicts an implicit surface from an oriented point cloud using a learned kernel. Neural Splines [64] also solves a 3D reconstruction problem using a fixed kernel (not learned from data), and is thus related to our approach. To introduce the reader to kernel methods for 3D reconstruction, we begin by giving an overview of Neural Splines. We then show how these kernel methods can be extended into Neural Kernel Fields capable of leveraging priors from data.

3.1 Review of Neural Splines

Given a point set $P = \{x_i\}_{i=1}^n$ with corresponding normals $\{n_i\}_{i=1}^n$, [64] seeks an implicit field $f : \mathbb{R}^3 \to \mathbb{R}$ which represents the underlying surface from which $P$ and $\{n_i\}$ were sampled. Namely, it should zero out on the set of input points and its gradient should equal the normal direction. More formally, the implicit field should minimize

$$\sum_{i=1}^{n} |f(x_i)|^2 + \|\nabla f(x_i) - n_i\|^2. \tag{1}$$

The gradient part of (1) can be approximated with a finite difference method, by augmenting the points with $x_i + \epsilon n_i$ and $x_i - \epsilon n_i$ and minimizing the simpler loss

$$\sum_{i=1}^{n} |f(x_i)|^2 + |f(x_i + \epsilon n_i) - \epsilon|^2 + |f(x_i - \epsilon n_i) + \epsilon|^2. \tag{2}$$

Let $X = \{x_j\}_{j=1}^{3n}$ denote the union of the augmented points. To minimize (2), we represent $f$ as a weighted sum of kernel basis functions centered at the points $x_j$:

$$f(x) = \sum_{j=1}^{3n} \alpha_j K(x, x_j), \tag{3}$$

which is linear in the coefficients $\alpha$. These coefficients can thus be recovered by solving the linear system

$$(G + \lambda I)\,\alpha = y, \tag{4}$$

where $G$ is the augmented Gram matrix over the points (i.e. $G_{ij} = K(x_i, x_j)$), $\lambda I$ is an optional regularizer which can be used to filter noise, and $y$ is a vector such that

$$y_j = \begin{cases} 0 & \text{if } x_j \in P, \\ +\epsilon & \text{if } x_j = x_i + \epsilon n_i, \\ -\epsilon & \text{if } x_j = x_i - \epsilon n_i. \end{cases} \tag{5}$$

The kernel function $K$ is the closed-form expression for an infinitely wide shallow ReLU network. It depends on the inner product between the inputs expressed in homogeneous coordinates, i.e. $\hat{x} = (x, 1)$. See the appendix for the exact equation and more details.
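To make the review above concrete, the following is a minimal PyTorch sketch of this pipeline. It assumes the arc-cosine form of the ReLU-network kernel from Appendix A (the normalizing constant is immaterial, since it is absorbed by the coefficients) and outward-pointing normals; the function names (`ns_kernel`, `fit_neural_spline`, `eval_field`) are ours for illustration, not the authors' code:

```python
import math
import torch

def ns_kernel(a, b):
    # Arc-cosine (degree-1) kernel on homogeneous coordinates (x, 1): the
    # closed form for an infinitely wide shallow ReLU network (Appendix A).
    ah = torch.cat([a, torch.ones(len(a), 1)], dim=1)
    bh = torch.cat([b, torch.ones(len(b), 1)], dim=1)
    na = ah.norm(dim=1, keepdim=True)            # (n, 1)
    nb = bh.norm(dim=1, keepdim=True).T          # (1, m)
    theta = torch.acos(((ah @ bh.T) / (na * nb)).clamp(-1.0, 1.0))
    return na * nb * (torch.sin(theta) + (math.pi - theta) * torch.cos(theta)) / math.pi

def fit_neural_spline(x, n, eps=0.01, lam=1e-6):
    # Augment each surface point along its normal (Eq. 2), then solve the
    # ridge system (G + lam*I) alpha = y of Eq. (4).
    pts = torch.cat([x, x + eps * n, x - eps * n], dim=0)
    y = torch.cat([torch.zeros(len(x)),           # f = 0 on the surface
                   torch.full((len(x),), eps),    # f = +eps outside (Eq. 5)
                   torch.full((len(x),), -eps)])  # f = -eps inside  (Eq. 5)
    G = ns_kernel(pts, pts)                       # Gram matrix
    alpha = torch.linalg.solve(G + lam * torch.eye(len(pts)), y)
    return pts, alpha

def eval_field(q, pts, alpha):
    # Evaluate f(q) = sum_j alpha_j K(q, x_j) at query points q (Eq. 3).
    return ns_kernel(q, pts) @ alpha
```

The zero level-set of `eval_field` can then be extracted with marching cubes to obtain a mesh.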

3.2 Inductive Bias of Neural Splines

The kernel formulation in Neural Splines makes explicit the notion of inductive bias, i.e. the behavior of solutions away from the input points. To see this, we observe that solutions to the linear system (4) are solutions to the following constrained optimization problem:

$$\min_{f \in \mathcal{H}_K} \|f\|_{\mathcal{H}_K}^2 \tag{6}$$

$$\text{subject to } |f(x_j) - y_j| \le \delta_\lambda \text{ for all } x_j \in X. \tag{7}$$

Here the norm being minimized defines the inductive bias of the kernel method, i.e. it governs the behavior of the function away from the constraints. The constraints guarantee that any solution to the above optimization problem interpolates the input data up to a bound $\delta_\lambda$ defined by the regularizer $\lambda$ (with exact interpolation when $\lambda = 0$).

For Neural Splines, the kernel norm favors smooth functions: it is proportional to curvature for 1D curves [65] and to the Radon transform of the Laplacian for 3D implicit surfaces [43, 64]. While an inductive bias favoring smoothness is good for reconstructing shapes from dense samples, it is too weak a prior in more challenging cases, such as when the input points are very sparse or cover only part of a shape. For example, Fig. 1 (top) shows that Neural Splines is incapable of completing a partial point cloud of a truck. To this end, NKFs use a data-dependent kernel, which learns an appropriate inductive bias conditioned on the input. By solving a linear system such as (4) using this kernel, we guarantee that output shapes respect their input points. We now describe NKFs in detail.

3.3 Neural Kernel Fields

Our model accepts the same inputs as Neural Splines described above in Section 3.1: i.e. we are given a set of points $\{x_i\}_{i=1}^n$ and normals $\{n_i\}_{i=1}^n$ sampled from the surface of an unknown shape, which we subsequently expand into an augmented point cloud with points $x_i \pm \epsilon n_i$ and corresponding labels $\pm\epsilon$. We remark that our method only uses the inside and outside augmented points, i.e. $X = \{x_i + \epsilon n_i\}_{i=1}^n \cup \{x_i - \epsilon n_i\}_{i=1}^n$. For brevity, we denote the inputs to our model as $(X, y)$. We now describe our architecture in four steps: (1) how to define our data-dependent kernel, (2) how to use that kernel to predict an implicit function, (3) how to train our model, and (4) how to add filtering for noisy inputs. Figure 3 shows our NKF architecture pictorially.

Data Dependent Kernel

To learn a kernel from data, we first augment each input point $x$ with a feature $\phi(x)$, where $\phi$ is a neural network with parameters conditioned on the inputs $X$. Using these learned per-point features, we then define the data-dependent kernel as

$$K_\phi(x, x') = K_{NS}\big([x, \phi(x)],\, [x', \phi(x')]\big), \tag{8}$$

where $[x, \phi(x)]$ is the concatenation of the vectors $x$ and $\phi(x)$, and $K_{NS}$ is the Neural Spline kernel function. The architecture of the network $\phi$ follows an approach similar to Convolutional Occupancy Networks [48]: we discretize the volume around the input point cloud into a grid, and use a PointNet within each grid cell containing input points to extract a feature in that cell (empty cells have a zero feature). We then feed these features into a fully convolutional 3D U-Net, which produces a grid of output features. To extract features per point, we trilinearly interpolate the output grid at the sampled points.
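A sketch of how this learned kernel could be assembled, continuing the `ns_kernel` helper defined in the sketch above. The grid layout and axis conventions of `grid_sample` are assumptions of this illustration (the axis ordering of the feature grid must match the normalized coordinates), not a specification of the authors' implementation:

```python
import torch
import torch.nn.functional as F

def lift_points(x, feat_grid, grid_min, grid_max):
    # Trilinearly interpolate the U-Net's output feature grid at the points x.
    # feat_grid: (1, C, D, H, W); x: (n, 3) -> per-point features (n, C).
    g = 2.0 * (x - grid_min) / (grid_max - grid_min) - 1.0  # map to [-1, 1]
    g = g.view(1, 1, 1, -1, 3)
    # mode='bilinear' on a 5D input performs trilinear interpolation; note
    # that grid_sample's last coordinate axis order must match the grid axes.
    f = F.grid_sample(feat_grid, g, mode='bilinear', align_corners=True)
    return f.view(feat_grid.shape[1], -1).T

def learned_kernel(xa, xb, fa, fb):
    # Data-dependent kernel of Eq. (8): the Neural Spline kernel applied to
    # points concatenated with their learned features.
    return ns_kernel(torch.cat([xa, fa], dim=1), torch.cat([xb, fb], dim=1))
```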

Predicting an Implicit Function

To predict an implicit function, we find coefficients $\alpha_j$ for each input point by solving the positive definite linear system

$$(G + \lambda I)\,\alpha = y, \tag{9}$$

where $G$ is the Gram matrix $G_{ij} = K_\phi(x_i, x_j)$ and $\lambda$ is a user-supplied regularization parameter. To evaluate the predicted function at a new point $x$, we compute the following equation using the coefficients $\alpha$:

$$f(x) = \sum_j \alpha_j K_\phi(x, x_j). \tag{10}$$
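Continuing the sketch above, prediction and evaluation are then a single positive definite solve followed by kernel evaluations. Chunking the queries is our own memory-saving device for illustration, not something prescribed by the method:

```python
def predict_implicit(x_in, y_in, f_in, lam=1e-6):
    # Solve the positive definite system (G + lam*I) alpha = y of Eq. (9).
    G = learned_kernel(x_in, x_in, f_in, f_in)
    return torch.linalg.solve(G + lam * torch.eye(len(x_in)), y_in)

def eval_implicit(q, f_q, x_in, f_in, alpha, chunk=65536):
    # Evaluate Eq. (10) at query points q, in chunks so that only a
    # (chunk x n) slice of the kernel matrix is ever held in memory.
    out = []
    for i in range(0, len(q), chunk):
        Kq = learned_kernel(q[i:i + chunk], x_in, f_q[i:i + chunk], f_in)
        out.append(Kq @ alpha)
    return torch.cat(out)
```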

Training the Model

To supervise our model during training, we use a dataset of shapes. Each shape consists of the augmented input points and labels $(X, y)$, a dense set of points $v_k$ with occupancy labels $o_k$ in the volume surrounding the shape, and a dense set of points $s_l$ on the surface of the shape. We remark that the dense points on the surface and in the volume are only needed as supervision during training. The occupancy labels denote whether a volume point lies inside or outside a shape and are defined as

$$o_k = \begin{cases} +1 & \text{if } v_k \text{ lies outside the shape}, \\ -1 & \text{if } v_k \text{ lies inside the shape}. \end{cases} \tag{11}$$

We then train the network $\phi$ used to define the kernel (8) by first predicting an implicit function using the inputs and then evaluating it at the dense volume and surface points to compute the loss

$$\mathcal{L} = \sum_k \big|f(v_k) - o_k\big| + \lambda_{\text{surf}} \sum_l \big|f(s_l)\big|. \tag{12}$$

The first term in (12) encourages the predicted function to have the correct occupancy, while the second term encourages the surface to agree with the ground truth shape. We backpropagate gradients through this loss to update the weights of the network $\phi$, and thus learn the data-dependent kernel.
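Because `torch.linalg.solve` is differentiable, gradients of the loss reach the feature network through the kernel ridge fit itself. Below is a minimal sketch of one training step, continuing the helpers above; the label scaling in {-1, +1}, the L1 form of both terms, and the weighting `w_surf` are assumptions of this illustration, and `feats_fn` stands in for the convolutional feature backbone:

```python
def nkf_loss(x_in, y_in, feats_fn, vol_pts, occ, surf_pts, lam=1e-6, w_surf=1.0):
    # Fit the implicit function with the current kernel (Eqs. 8-9), then
    # penalize wrong occupancy on volume points and nonzero values on
    # surface points (Eq. 12). Gradients flow back into feats_fn.
    f_in = feats_fn(x_in)
    alpha = predict_implicit(x_in, y_in, f_in, lam)
    f_vol = eval_implicit(vol_pts, feats_fn(vol_pts), x_in, f_in, alpha)
    f_srf = eval_implicit(surf_pts, feats_fn(surf_pts), x_in, f_in, alpha)
    return (f_vol - occ).abs().mean() + w_surf * f_srf.abs().mean()
```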

Learning to Denoise

We can optionally predict per-input-point weights to make our solutions more robust to noise. We predict these via a fully connected network mapping per-point input features to weights. Instead of Eq. (9), we then solve the weighted ridge regression problem

$$\min_\alpha \big\| W^{1/2} (G\alpha - y) \big\|_2^2 + \lambda\, \alpha^\top G\, \alpha, \tag{13}$$

where $W$ is a diagonal matrix of per-input-point weights. Figure 4 shows the effect of weighted versus unweighted ridge regression in the presence of noise on a toy example.

Figure 4: Unweighted (left) versus weighted (right) kernel ridge regression. Both reconstructions use the same noisy input points and regularization value. The right reconstruction, which uses per-point weights (visualized as the size of the points) can filter out the contribution of noisy points and produce a more accurate reconstruction.
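One way to realize the weighted fit, sketched under the assumption that Eq. (13) is the standard weighted kernel ridge objective with $W = \mathrm{diag}(w)$; setting a point's weight near zero lets the fit ignore that point:

```python
def predict_implicit_weighted(x_in, y_in, f_in, w, lam=1e-6):
    # Normal equations of the weighted ridge objective of Eq. (13):
    #   (G W G + lam*G) alpha = G W y,  with W = diag(w),
    # using the symmetry of the Gram matrix G (positive definite G assumed).
    G = learned_kernel(x_in, x_in, f_in, f_in)
    A = G @ torch.diag(w) @ G + lam * G
    return torch.linalg.solve(A, G @ (w * y_in))
```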

4 Experiments

We first evaluate the effectiveness of Neural Kernel Fields on the tasks of single object reconstruction (Section 4.1) and partial object completion (Section 4.2) using the ShapeNet [5] dataset. Next, we highlight NKF's ability to generalize by evaluating out-of-category shape generalization (Section 4.3), generalization to full scenes (Section 4.4), and generalization to different sampling densities (Section 4.5). Finally, in Section 4.6, we ablate the design choices for our backbone architecture.

[Figure 5 panels, left to right: SPSR [35], NS [64], OccNet [41], C-OccNet [48], Ours, Ground truth]
Figure 5: Single object reconstruction on ShapeNet [5]. NKF recovers fine details like the lamp’s cord and car’s side mirror.
Noise free Noise std. = 0.0025 Noise std. = 0.005
IoU↑  Chamfer↓  Normal C.↑  IoU↑  Chamfer↓  Normal C.↑  IoU↑  Chamfer↓  Normal C.↑
mean std. mean std. mean std. mean std. mean std. mean std. mean std. mean std. mean std.
SPSR [35] 0.772 0.162 0.122 0.069 0.847 0.061 0.759 0.163 0.125 0.066 0.847 0.060 0.735 0.169 0.133 0.067 0.843 0.060
OccNet [41] 0.773 0.162 0.068 0.048 0.902 0.073 0.771 0.164 0.069 0.051 0.903 0.072 0.699 0.172 0.192 0.137 0.888 0.074
C-OccNet [48] 0.810 0.116 0.051 0.018 0.922 0.052 0.820 0.112 0.049 0.019 0.924 0.051 0.866 0.089 0.080 0.040 0.937 0.044
C-OccNet* [48] 0.823 0.105 0.048 0.016 0.928 0.048 0.847 0.094 0.043 0.015 0.932 0.046 0.863 0.088 0.078 0.031 0.937 0.045
NS [64] 0.864 0.151 0.051 0.071 0.926 0.059 0.831 0.147 0.054 0.064 0.919 0.057 0.791 0.155 0.121 0.167 0.900 0.055
Ours 0.949 0.053 0.024 0.010 0.954 0.042 0.914 0.061 0.028 0.010 0.947 0.043 0.883 0.074 0.066 0.018 0.939 0.041
Table 1: Single object reconstruction on ShapeNet [5]. NKF consistently outperforms strong baselines on standard metrics: IoU, Chamfer distance, and Normal Consistency, across all 13 categories.

Baselines:  For ShapeNet reconstruction, we compare our method to OccNet [41], Conv-OccNet [48], SPSR [35], and Neural Splines [64]. On the task of completion, we compare against Conv-OccNet [48]. For out-of-distribution shape reconstruction, we compare with OccNet [41], Conv-OccNet [48], LIG [33], and Neural Splines [64], while on the task of full scene reconstruction we use Conv-OccNet [48], SPSR [35], and NS [64] as baselines. Combined, these methods cover a broad spectrum of 3D shape reconstruction approaches and represent the state of the art in their respective categories depicted in Fig. 2.

Metrics:  We use 3 metrics for quantitative evaluation: Intersection over Union (IoU) is computed by sampling a set of 100k points in the volume around a watertight shape and computing the IoU of the set of inside points for the predicted and ground truth shapes. IoU indicates how well the predicted shape agrees with the ground truth both near and away from the surface. L2 Chamfer Distance is evaluated by sampling 100k points on the predicted and ground truth surfaces (extracted as meshes using marching cubes), then computing the average shortest distance between all pairs of points. Chamfer distance measures how accurately each method reconstructs the surface of the input shape. Normal Correlation is computed as the average dot product between the normals at pairs of nearest points on the ground truth and predicted shapes and evaluates how well each method does at preserving the surface direction. We use the same 100k samples as for Chamfer distance to compute this metric.
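For reference, here is a compact sketch of the Chamfer and normal-correlation computations described above. The exact averaging and sign conventions vary across papers, so treat the absolute value on the normal dot products as one common choice rather than the paper's definition; also note that for 100k points the pairwise distance matrix should be computed in chunks or with a spatial data structure rather than all at once as here:

```python
import torch

def chamfer_and_normal_corr(p_a, n_a, p_b, n_b):
    # p_*: (n, 3) surface samples; n_*: (n, 3) unit normals at those samples.
    d = torch.cdist(p_a, p_b)            # all pairwise distances (memory-heavy)
    d_ab, i_ab = d.min(dim=1)            # nearest neighbor in b for each a
    d_ba, i_ba = d.min(dim=0)            # nearest neighbor in a for each b
    chamfer = 0.5 * (d_ab.mean() + d_ba.mean())
    nc = 0.5 * ((n_a * n_b[i_ab]).sum(dim=1).abs().mean()
                + (n_b * n_a[i_ba]).sum(dim=1).abs().mean())
    return chamfer, nc
```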

4.1 Single Object Reconstruction on ShapeNet

We evaluate NKF's performance against strong baselines in reconstructing objects from 13 categories of the ShapeNet dataset. As input to all methods, we use 1000 randomly sampled surface points, to which we add Gaussian noise of different magnitudes. For the learning-based methods (Conv-OccNet, OccNet, ours), we train a single model across all 13 categories per noise level. Since both NKF and Neural Splines utilize pairs of points spread along the normals, we train a version of Conv-OccNet with (C-OccNet*) and without (C-OccNet) these points. Table 1 shows that NKF achieves large improvements across all metrics, reaching nearly 95% IoU on noise-free reconstruction. Figure 5, which shows reconstructions at the middle noise level, clearly demonstrates how NKF recovers fine details like the car's side-view mirror, the cord on the lamp, and the bulges on the chair legs. In the supplemental material, we provide per-category results, additional figures, and ablations on different numbers of input points.

4.2 Shape Completion on ShapeNet

Thanks to the global support of the kernel, NKF can learn to recover an entire shape from partial input, even though it uses the input points as anchors. To demonstrate this, we sample a point cloud from up to 50% of a shape's surface along one of the principal axes, and supervise NKF to predict the full shape. We train a separate model for each of the 13 ShapeNet categories. Table 2 presents quantitative results across all categories for this task. NKF achieves Chamfer distance and normal correlation on par with C-OccNet, with substantially better IoU. The top row of Fig. 1 shows an example of completing a truck shape from very partial input. Note how NKF learned to leverage shape symmetry to faithfully recover unobserved regions like the wheels. The appendix shows per-category quantitative and qualitative results.

IoU↑  Chamfer↓  Normal C.↑
mean std. mean std. mean std.
C-OccNet [48] 0.770 0.152 0.075 0.068 0.909 0.059
Ours 0.819 0.171 0.077 0.091 0.907 0.067
Table 2: Object completion from partial point clouds.

4.3 Out of Category Generalization

Generalization to categories beyond the train set is key to making learnable methods useful in the wild. To evaluate NKF on this task, we train all methods on 6 of the ShapeNet categories (airplane, lamp, display, rifle, chair, cabinet) and evaluate on the other 7 (bench, car, loudspeaker, sofa, table, telephone, watercraft). Table 3 presents quantitative statistics for this task using the standard metrics. NKF greatly outperforms both learned and non-learned baselines. Furthermore, we note in brackets the decrease in performance compared to the model trained on all categories: NKF suffers only a minimal drop in IoU, matching data-free methods thanks to its test-time adaptation ability. We point out that LIG only provides models pretrained on all categories, which sets an upper bound on its generalization performance. The distinct differences between NKF and the baselines are readily apparent in Figure 6.

[Figure 6 panels, left to right: OccNet [41], C-OccNet [48], Ours, Ground truth]
Figure 6: Out-of-category generalization. Reconstructed objects from categories unseen during training.
IoU↑  Chamfer↓  Normal C.↑
OccNet [41] 0.603 (-20.4%) 0.134 (0.070) 0.829 (-8.3%)
C-OccNet [48] 0.734 (-9.5%) 0.074 (0.023) 0.895 (-2.9%)
C-OccNet* [48] 0.785 (-4.9%) 0.064 (0.013) 0.911 (-1.7%)
LIG [33] 0.518 (N.A.) 0.112 (N.A.) 0.536 (N.A.)
NS [64] 0.869 (0.0%) 0.049 (0.000) 0.924 (0.0%)
Ours 0.938 (-1.1%) 0.028 (0.003) 0.939 (-1.0%)
Table 3: Generalization capacity of object-level 3D reconstruction from sparse point clouds. We train all models using 6 ShapeNet categories (airplane, lamp, display, rifle, chair, cabinet) and evaluate them on the remaining 7 (bench, car, loudspeaker, sofa, table, telephone, watercraft). The numbers in brackets denote the difference in performance with the model trained on all categories.

[Figure 7 panels, left to right: SPSR [35], NS [64], C-OccNet [48], Ours]
Figure 7: ScanNet reconstruction. Trained on ShapeNet objects, NKF gracefully scales to real world scanned scenes.
Chamfer↓  Normal C.↑
C-OccNet (w. walls) [48] 0.133 0.779
C-OccNet (w.o. walls) [48] 0.074 0.843
SPSR [35] 0.060 0.871
NS [64] 0.060 0.876
Ours 0.032 0.873
Table 4: Scene-level 3D reconstruction from sparse point clouds on ScanNet [11]. All methods use 10,000 input points for each scene.

4.4 Scene Reconstruction on ScanNet

Next, we extend beyond single objects and evaluate NKF on ScanNet scenes. For this experiment, we followed the setup in [48] and trained our model on synthetic scenes consisting of random ShapeNet object placements. We found that the synthetic floors and walls added to the training set by [48] harmed performance, and hence trained our method without them. We report C-OccNet's results with and without walls for completeness. According to Table 4, for 10k input points, NKF achieves an average Chamfer distance of about half that of the next best method. Figure 7 compares against the baselines on two reconstructed rooms. Note how our method better captures small details such as the stepladder and shelf.

4.5 Point Density Generalization

In real-world applications, point density may differ between train and test time. A good data-driven prior should compensate for a lack of data (i.e. sparse inputs) without hindering data-rich settings (i.e. dense inputs). We therefore evaluate the response of NKF and various baseline methods to changes in input sampling density. We trained each method on 1000 input points and evaluated it on varying numbers of input samples (between 250 and 3000). To report the upper-bound performance of each method, we additionally train a model for each density value. Figure 8 shows the mean IoU of each method versus the number of input points. Curves with labels ending in "-1k" were trained on 1000 points; all other curves were trained and tested on the same number of points. OccNet shows no response to increased sampling density (even at train time). Although C-OccNet marginally improves when trained on denser data, it does not improve when evaluated with more points than it was trained on. The performance of Neural Splines improves with denser inputs, but is poor on sparse inputs, as expected of a data-free method. Finally, our method works well in sparse settings and improves with increasing density. Moreover, it does not degrade when trained and tested on different sampling densities (the gray and green curves are nearly identical).

Figure 8: ShapeNet IoU vs. number of input points. Curves ending in "-1k" correspond to methods trained on 1000 points; all other curves were trained and evaluated on the same number of points. Our method performs well in the sparse and dense regimes and does not decay when trained and tested on different point densities.

4.6 Ablations

We conduct an ablation study of our design choices on the task of ShapeNet reconstruction. We experiment with different per-point feature dimensions and with whether to include the surface loss $\mathcal{L}_{\text{surf}}$ (the second term in (12)). Table 5 summarizes the results.

feature dimension                     8      16     32     64
without $\mathcal{L}_{\text{surf}}$   0.939  0.941  0.942  0.942
with $\mathcal{L}_{\text{surf}}$      0.945  0.947  0.949  0.949
Table 5: Ablation study (Section 4.1). NKF benefits from the L1 surface loss and works well even with small feature dimensions. Values in the table are mean IoU on the test set.

5 Conclusion and Limitations

We presented a novel method for reconstructing and completing 3D shapes from sparse point clouds. Our method outperforms the state of the art on object reconstruction, completion, and scene reconstruction, while demonstrating strong generalization capability (both with respect to shape categories and to input sampling density). While our method pushes the boundary on many fronts, it still has several limitations which we plan to address in future work. First, our current kernel implementation requires a dense linear solve, which limits the number of evaluation points to around 12k on a V100 GPU. State-of-the-art kernel solvers in the literature (e.g. [52]) have scaled up to millions of points by leveraging techniques such as Nyström sampling; we plan to investigate how to leverage these approaches to handle larger inputs. We would also like to investigate kernels with spatial decay, which would sparsify our linear system and scale our method to very large inputs. A second limitation is the requirement of oriented points. While these are usually available from sensors, they can be noisy. In the future, we would therefore like to incorporate normal prediction into our method so that it can operate on unoriented point clouds.

References

Appendix A Neural Spline Kernel Equation

The Neural Spline [64] kernel is defined as the limiting kernel of an infinitely wide shallow ReLU network with either Gaussian or uniform (Kaiming-He [31]) initialization. In our implementation we use the Gaussian-initialized version, which has the following closed form:

$$K_{NS}(x, x') = \frac{\|\hat{x}\|\,\|\hat{x}'\|}{\pi}\,\big(\sin\theta + (\pi - \theta)\cos\theta\big), \tag{14}$$

where $\hat{x} = (x, 1)$ and $\hat{x}' = (x', 1)$ are the vectors $x$ and $x'$ expressed in homogeneous coordinates and $\theta$ is the angle between them. In practice we compute the angle using the formula from Kahan [34]:

$$\theta = 2\arctan\!\left(\frac{\big\|\,\|\hat{x}'\|\,\hat{x} - \|\hat{x}\|\,\hat{x}'\,\big\|}{\big\|\,\|\hat{x}'\|\,\hat{x} + \|\hat{x}\|\,\hat{x}'\,\big\|}\right), \tag{15}$$

which is numerically stable, especially for small angles.
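A small PyTorch sketch of the stable angle computation of Eq. (15); `kahan_angle` is an illustrative name of ours, and the kernel's normalizing constant is treated as an assumption (it rescales the Gram matrix but not the reconstructed surface):

```python
import torch

def kahan_angle(a, b):
    # Kahan's numerically stable angle between corresponding rows of a and b
    # (Eq. 15): accurate even for very small angles, unlike acos of a
    # clamped cosine.
    na = a.norm(dim=-1, keepdim=True)
    nb = b.norm(dim=-1, keepdim=True)
    num = (nb * a - na * b).norm(dim=-1)
    den = (nb * a + na * b).norm(dim=-1)
    return 2.0 * torch.atan2(num, den)
```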

Appendix B The Effect of Noise Filtering in 3D

Figure 9 shows the effect of the learned weighting (Section 3.3) on filtering noise in the input points. The left column shows our reconstruction without these learned weights, the middle column shows the effect of adding weighting, and the right column shows the ground truth surface. Notice how the weighted model is smoother and does not interpolate the input noise.

[Figure 9 panels, left to right: Ours (w/o weights), Ours (weighted), Ground truth]
Figure 9: The effect of noise filtering versus regularization. The left column shows reconstructions using our method with regularization in the kernel ridge regression but without noise filtering. The middle column shows the same models reconstructed with additional noise filtering (Section 3.3). Note how the regularized model still has bumps caused by the noisy input points, while these are smoothed out by the filtering module.

Appendix C More Extreme Generalization

Table 6 and Figure 10 compare a model trained only on chairs against a model trained on all categories, using the chairs-only model to reconstruct the other 12 ShapeNet categories (airplane, bench, cabinet, car, display, lamp, loudspeaker, rifle, sofa, table, telephone, watercraft). The experimental setup is identical to Section 4.3 (1000 input points), except that the model is trained only on chairs. Note how the performance of the model trained only on chairs drops only slightly compared to the model trained on all categories.

Pretrain on Chairs Pretrain on All
IoU↑  Chamfer↓  Normal C.↑  IoU↑  Chamfer↓  Normal C.↑
airplane 0.922 0.021 0.945 0.951 0.016 0.962
bench 0.898 0.024 0.936 0.908 0.022 0.940
cabinet 0.938 0.043 0.947 0.968 0.028 0.962
car 0.913 0.037 0.882 0.937 0.030 0.913
chair 0.946 0.026 0.962 0.943 0.027 0.960
display 0.967 0.028 0.971 0.976 0.023 0.978
lamp 0.895 0.040 0.928 0.920 0.024 0.940
loudspeaker 0.931 0.059 0.935 0.965 0.033 0.952
rifle 0.889 0.115 0.937 0.957 0.012 0.970
sofa 0.971 0.025 0.967 0.974 0.024 0.969
table 0.939 0.028 0.964 0.951 0.025 0.969
telephone 0.985 0.018 0.986 0.988 0.017 0.988
watercraft 0.936 0.039 0.934 0.955 0.019 0.950
mean 0.929 0.036 0.939 0.949 0.024 0.954
Table 6: Comparison between a model trained only on chairs (left columns) and a model trained on all categories (right columns).

[Figure 10 panels, left to right: Ours (chairs), Ours (all), Ground truth]
Figure 10: Out-of-category generalization on ShapeNet [5]. Our model trained only on chairs (left) seamlessly generalizes to the other 12 ShapeNet categories, achieving only slightly worse performance than the model trained on all categories (middle).

Appendix D Inference Timings

Our method uses a convex test-time optimization to perform inference of 3D shapes. We report the timing of each part of our method for the ShapeNet reconstruction (Section 4.1) and ScanNet reconstruction (Section 4.4) experiments in Table 7. For ShapeNet we used 1000 input points and evaluated on a dense grid of 2.1 million points; for ScanNet we used 10,000 input points and evaluated on a grid of 16.9 million points. We implemented the kernel evaluation as a single monolithic CUDA kernel and report timings on a Quadro GV100 GPU.

ShapeNet ScanNet
Encoder 12.9ms 229.8ms
Decoder 0.3ms 0.42ms
Solve 30.3ms 3142ms
Eval 193.5ms 13254ms
Table 7: Timings on ShapeNet (1k input points and 2.1 million evaluation points) and ScanNet (10k input points and 16.9 million eval points).

Appendix E Additional ShapeNet Reconstruction Figures

Figure 11 shows additional reconstruction comparisons (with 0.0025 noise) as described in Section 4.1.

[Figure 11 panels, left to right: SPSR [35], NS [64], OccNet [41], C-OccNet [48], Ours, Ground truth]
Figure 11: ShapeNet [5] Reconstruction. Reconstructions of models from the ShapeNet test set given 1000 input points and normals with Gaussian noise.

Appendix F Additional ShapeNet Generalization Figures

Figure 12 shows additional reconstructions for the out-of-category reconstruction experiment described in Section 4.3.

Appendix G Additional Completion Figures

Figure 12 shows additional completion comparisons for the experiment described in Section 4.2.

[Figure 12 panels: C-OccNet [48] vs. Ours, shown for two example shapes]
Figure 12: Shape completion on ShapeNet [5].

Appendix H Per-Category ShapeNet Results

Tables 8 and 9 report the per-category reconstruction and completion results for the experiments described in Sections 4.1 and 4.2, respectively.

Noise-Free
IoU↑  Chamfer↓  Normal C.↑
OccNet C-OccNet* NS Ours OccNet C-OccNet* NS Ours OccNet C-OccNet* NS Ours
airplane 0.752 0.811 0.775 0.951 0.054 0.036 0.103 0.016 0.900 0.927 0.898 0.962
bench 0.713 0.723 0.768 0.908 0.052 0.045 0.065 0.022 0.889 0.900 0.901 0.940
cabinet 0.869 0.898 0.921 0.968 0.060 0.049 0.041 0.028 0.931 0.950 0.939 0.962
car 0.841 0.873 0.911 0.937 0.069 0.051 0.037 0.030 0.896 0.898 0.903 0.913
chair 0.740 0.811 0.858 0.943 0.076 0.051 0.045 0.027 0.896 0.933 0.933 0.960
display 0.825 0.854 0.938 0.976 0.062 0.048 0.030 0.023 0.932 0.960 0.964 0.978
lamp 0.550 0.751 0.834 0.920 0.144 0.058 0.047 0.024 0.819 0.902 0.915 0.940
loudspeaker 0.833 0.892 0.938 0.965 0.090 0.059 0.041 0.033 0.910 0.938 0.945 0.952
rifle 0.678 0.757 0.936 0.957 0.057 0.038 0.021 0.012 0.860 0.915 0.960 0.970
sofa 0.876 0.893 0.927 0.974 0.055 0.047 0.041 0.024 0.939 0.952 0.949 0.969
table 0.768 0.785 0.801 0.951 0.059 0.048 0.065 0.025 0.923 0.948 0.926 0.969
telephone 0.915 0.904 0.969 0.988 0.035 0.035 0.021 0.017 0.973 0.979 0.983 0.988
watercraft 0.737 0.825 0.894 0.955 0.083 0.046 0.044 0.019 0.870 0.909 0.930 0.950
mean 0.773 0.823 0.864 0.949 0.068 0.048 0.051 0.024 0.902 0.928 0.926 0.954
0.0025 Noise
IoU↑  Chamfer↓  Normal C.↑
OccNet C-OccNet* NS Ours OccNet C-OccNet* NS Ours OccNet C-OccNet* NS Ours
airplane 0.739 0.825 0.729 0.905 0.057 0.034 0.103 0.020 0.904 0.928 0.888 0.953
bench 0.713 0.758 0.723 0.867 0.053 0.040 0.068 0.025 0.889 0.906 0.892 0.935
cabinet 0.871 0.916 0.905 0.952 0.061 0.044 0.045 0.031 0.933 0.953 0.934 0.959
car 0.839 0.877 0.892 0.921 0.068 0.052 0.041 0.033 0.895 0.902 0.896 0.911
chair 0.740 0.837 0.825 0.912 0.077 0.045 0.050 0.030 0.896 0.937 0.926 0.956
display 0.818 0.890 0.902 0.953 0.063 0.039 0.036 0.026 0.932 0.963 0.958 0.975
lamp 0.547 0.774 0.784 0.880 0.153 0.050 0.053 0.026 0.824 0.907 0.906 0.936
loudspeaker 0.829 0.910 0.922 0.952 0.091 0.052 0.046 0.035 0.912 0.943 0.940 0.952
rifle 0.678 0.783 0.860 0.904 0.058 0.033 0.023 0.016 0.865 0.919 0.947 0.960
sofa 0.879 0.913 0.905 0.956 0.055 0.041 0.047 0.028 0.937 0.956 0.942 0.966
table 0.768 0.832 0.772 0.917 0.059 0.040 0.065 0.028 0.924 0.953 0.922 0.966
telephone 0.909 0.931 0.932 0.969 0.036 0.029 0.027 0.020 0.973 0.980 0.975 0.986
watercraft 0.732 0.843 0.857 0.926 0.086 0.041 0.050 0.022 0.874 0.913 0.918 0.945
mean 0.771 0.847 0.831 0.919 0.069 0.043 0.054 0.027 0.903 0.932 0.919 0.945
0.005 Noise
IoU↑  Chamfer↓  Normal C.↑
OccNet C-OccNet* NS Ours OccNet C-OccNet* NS Ours OccNet C-OccNet* NS Ours
airplane 0.675 0.839 0.758 0.852 0.155 0.062 0.098 0.053 0.890 0.933 0.886 0.937
bench 0.589 0.779 0.673 0.813 0.160 0.073 0.161 0.062 0.860 0.911 0.876 0.922
cabinet 0.802 0.928 0.881 0.936 0.181 0.078 0.105 0.070 0.914 0.958 0.920 0.952
car 0.804 0.888 0.869 0.899 0.182 0.095 0.095 0.077 0.891 0.905 0.879 0.902
chair 0.652 0.859 0.779 0.876 0.217 0.081 0.119 0.071 0.884 0.944 0.910 0.946
display 0.742 0.914 0.858 0.924 0.170 0.067 0.091 0.061 0.922 0.968 0.940 0.967
lamp 0.478 0.796 0.701 0.827 0.421 0.099 0.171 0.065 0.802 0.914 0.868 0.921
loudspeaker 0.785 0.924 0.900 0.937 0.236 0.091 0.108 0.080 0.899 0.947 0.925 0.946
rifle 0.600 0.807 0.774 0.850 0.151 0.060 0.068 0.045 0.832 0.925 0.906 0.943
sofa 0.818 0.929 0.889 0.936 0.159 0.072 0.095 0.065 0.925 0.961 0.931 0.957
table 0.663 0.859 0.704 0.873 0.168 0.072 0.167 0.066 0.906 0.957 0.898 0.956
telephone 0.847 0.944 0.892 0.945 0.107 0.050 0.072 0.049 0.966 0.982 0.958 0.980
watercraft 0.695 0.863 0.808 0.890 0.216 0.074 0.147 0.056 0.861 0.921 0.890 0.931
mean 0.699 0.863 0.791 0.883 0.192 0.078 0.121 0.066 0.888 0.937 0.900 0.939
Table 8: Per-category ShapeNet reconstruction results corresponding to the experiment described in Section 4.1.
IoU↑  Chamfer↓  Normal C.↑  F-Score↑
C-OccNet* Ours C-OccNet* Ours C-OccNet Ours C-OccNet* Ours
airplane 0.800 0.844 0.048 0.054 0.926 0.919 0.921 0.916
bench 0.615 0.705 0.082 0.086 0.868 0.872 0.808 0.853
cabinet 0.834 0.881 0.079 0.067 0.924 0.918 0.784 0.872
car 0.862 0.891 0.059 0.047 0.899 0.899 0.859 0.912
chair 0.731 0.790 0.092 0.091 0.906 0.910 0.805 0.854
display 0.768 0.850 0.088 0.079 0.921 0.925 0.774 0.876
lamp 0.620 0.685 0.138 0.159 0.864 0.866 0.751 0.797
loudspeaker 0.808 0.851 0.101 0.105 0.904 0.902 0.701 0.814
rifle 0.746 0.809 0.045 0.051 0.899 0.904 0.915 0.907
sofa 0.837 0.864 0.073 0.075 0.923 0.916 0.823 0.866
table 0.730 0.777 0.075 0.089 0.925 0.911 0.851 0.863
telephone 0.886 0.906 0.046 0.048 0.964 0.958 0.920 0.922
watercraft 0.761 0.830 0.067 0.061 0.887 0.912 0.829 0.884
mean 0.770 0.819 0.075 0.077 0.909 0.907 0.837 0.875
Table 9: Per category completion results corresponding to the experiment described in Section 4.2.