Ladybird: Quasi-Monte Carlo Sampling for Deep Implicit Field Based 3D Reconstruction with Symmetry

07/27/2020 ∙ by Yifan Xu, et al. ∙ Max Planck Society ∙ NetEase, Inc.

Deep implicit field regression methods are effective for 3D reconstruction from single-view images. However, the impact of different sampling patterns on the reconstruction quality is not well understood. In this work, we first study the effect of point set discrepancy on network training. Based on the Farthest Point Sampling (FPS) algorithm, we propose a sampling scheme that theoretically encourages better generalization performance and results in fast convergence for SGD-based optimization algorithms. Second, based on the reflective symmetry of an object, we propose a feature fusion method that alleviates issues due to self-occlusions, which make it difficult to utilize local image features. Our proposed system Ladybird is able to create high quality 3D object reconstructions from a single input image. We evaluate Ladybird on a large-scale 3D dataset (ShapeNet), demonstrating highly competitive results in terms of Chamfer distance, Earth Mover's distance and Intersection over Union (IoU).




1 Introduction

These two authors contributed equally. Corresponding author.
ORCIDs: Yi Yuan 0000-0003-2507-8181, Gurprit Singh 0000-0003-0970-5835

Due to the under-constrained nature of the problem, 3D object reconstruction from a single-view image is a challenging task. Large shape and structure variations among objects make it difficult to define one dedicated parameterized model. Methods based on template deformation are often restricted by the initial topology of the template, and cannot, for instance, recover holes. Recently, deep learning based implicit field regression methods have shown great potential in monocular 3D reconstruction. Mescheder et al. [20] and DISN [33] create visually pleasing, smooth shape reconstructions with consistent normals and complex topology using implicit fields.

Figure 1:

Top: Demonstration of our sampling strategy for implicit field regression network training. A neighborhood of the mesh (b) is sampled from a set of dense grid points (a). A sparse set of points is sampled from (b) uniformly at random (c) or through FPS (d). Bottom: comparison of the training accuracy of Grid+FPS and Grid+Random sampling for the same network architecture. (e) plots the training accuracy for the first 25 epochs; (f) plots the training accuracy within the first epoch. Sampling with lower discrepancy results in faster convergence and better accuracy during training.

An implicit field is a real-valued function defined on $\mathbb{R}^3$ whose iso-surface recovers the mesh of interest. Common choices of implicit field are the signed distance field, the truncated signed distance field, and the occupancy probability field. A network $f(p, I; \theta)$ is trained to predict the implicit field at a point $p$, conditioned on the input image $I$, where $\theta$ are the parameters optimized with stochastic gradient descent (SGD) type algorithms. This is followed by post-processing methods such as marching cubes or sphere tracing to reconstruct the mesh.

The loss function for the implicit field regression problem is the $L_2$ distance between the ground truth implicit field and the network-predicted output. During training, a sparse set of 3D points needs to be sampled in a compact region containing the mesh to approximate the optimization objective. We formulate this empirical loss as a Monte Carlo estimator.

While most prior discussion on sampling [20] focuses on designing a probability measure for the integral that assigns different weights to regions at different distances from the mesh surface, we look at the problem from the point of view of the discrepancy of the sample sets. When approximating an integral, different samplers have different error convergence rates with respect to the sample size [24, 22]. Low-discrepancy sequences/points or blue-noise (in 2D) samples give better estimates than, for instance, random samples (white noise).

Given a set of locally uniform samples whose distance to the target mesh is bounded by a threshold, we show that the farthest point sampling (FPS) algorithm can be used to select a sparse subset with low discrepancy for training. An overview of our method is shown in Figure 1. Our proposed sampling scheme encourages better generalization performance as it provides a better approximation to the expected loss, thanks to the Koksma-Hlawka inequality [15]. Empirically, our sampling scheme also results in faster convergence for SGD-based optimization algorithms, which speeds up training significantly, as shown in Figure 1(e,f).

Figure 2: Ladybird is able to produce high quality 3D reconstructions from a single input image. The consideration of symmetry allows the recovery of occluded geometry and the completion of texture.

Many deep 3D implicit field reconstruction works [20, 5] explore the use of global shape encoding. While good at capturing the overall shape and producing interesting interpolations in the latent space, global features alone sometimes fail to recover fine geometric details. Local features, found by aligning the image to the mesh through a camera model, are used to address this issue. However, for occluded points it is ambiguous which local features should be used. Usually all the sampled points are projected to the image [33], and hence points in the back use the features of the points that occlude them.

As most man-made objects are symmetric about a plane, we observe that this problem can be alleviated by considering reflective symmetry. For a symmetric pair of points $p$ and $q$, the implicit fields at $p$ and $q$ are the same, and often at least one of them is visible in the image. Hence we can use the local features of $q$ to improve the implicit field prediction at $p$, which can also be understood as utilizing two-view information. Our feature fusion method imposes a symmetry prior on the network $f$, which significantly improves the reconstruction quality, as shown in Figure 2. Unlike previous works [30, 34] that focus on the design of loss functions, or the detection or encoding of symmetry, our method integrates naturally into the pixel-to-mesh alignment framework.

The advantage of spatially aligning the image to the mesh and utilizing the corresponding local features is that fine shape details and textures can be better recovered. However, when $p$ is occluded, the feature obtained by such alignment no longer has an intuitive meaning. Recently, Front2Back [34] addresses this issue by detecting reflective symmetries from the data and synthesizing the opposite orthographic view. Our approach is simpler and does not depend on symmetry detection.

1.0.1 Mesh

AtlasNet [11] represents a mesh as a locally parameterized surface and predicts the local patches from a latent shape representation learned via reconstruction objectives. Mitchell et al. [21] propose to represent 3D shapes using higher order functions.

Pixel2Mesh [29] uses a graph CNN to progressively deform an ellipsoid template mesh to fit the target. Features from different layers of the CNN are used to generate details at different resolutions. 3DN [30] infers vertex offsets from a template mesh chosen according to the image object's category, and proposes a differentiable mesh sampling operator to compute the loss function. SDM-NET [9] uses a VAE to generate a spatial arrangement of deformable parts of an object. Pan et al. [23] propose a progressive method that alternates between deforming the mesh and modifying its topology. Mesh R-CNN [10] unifies object detection and shape reconstruction, with a mesh prediction branch that first produces coarse cubified meshes which are then refined with a graph convolution network.

DIB-R [5] and Soft Rasterizer [19] design differentiable rasterization layers that enable unsupervised training for reconstruction tasks. DIST [18] proposes an optimized differentiable sphere tracing layer for differentiable SDF rendering.

1.0.2 Point Cloud and Voxel

Fan et al. [8] propose a conditional shape sampler to predict multiple plausible point clouds from an input image. Lin et al. [17] use an auto-encoder to synthesize partial point clouds from multiple views, which are combined into a dense point cloud; the loss is then computed by rendering depth images from multiple views. Li et al. [16] use a CNN to predict multiple depth maps and corresponding deformation fields, which are fused to form the full 3D shape.

3D-R2N2 [6] uses recurrent neural networks to generate voxelized 3D reconstructions. Pix2Vox [31] uses an encoder-decoder structure to generate a coarse 3D voxel grid.

1.1 Sampling Methods in Monte Carlo Integration

Realistic image synthesis involves evaluating very high-dimensional light transport integrals. (Quasi-)Monte Carlo (MC) methods are traditionally employed to approximate these integrals, which introduces estimation error. This error depends directly on the sampling pattern used to estimate the underlying integral [26]. These sampling patterns can be highly correlated, and Fourier power spectra are commonly employed to characterize the correlations among samples (Figure 3).

Figure 3: Top row shows the point patterns produced by different samplers. Bottom row shows the corresponding expected power spectra. Random samples are completely decorrelated, which results in a flat power spectrum. 2D stratification (Jittered) results in a power spectrum with a small dark region around the center (DC frequency). For the blue-noise (Poisson disk) sampler, this dark (low-energy, low-frequency) region is larger. For the Halton and Sobol samplers, the power spectrum shows some spikes, but preserves the underlying stratification along each dimension, which appears as a dark cross in the middle of the spectrum. Finally, a simple regular (Grid) pattern has a grid-like power spectrum (zoom in to the right-most bottom image to see the grid structure).

Blue-noise samplers [26] are well known to improve low-dimensional integration problems, whereas low-discrepancy samplers [22] like Halton [12] and Sobol [13] are more effective for higher dimensional problems. In this work, we use a farthest point selection strategy [7] on a given point set to select our samples.

2 Our Approach

We start with a theoretical motivation for our sampling method. This is followed by the proposed symmetric feature fusion module and our 3D reconstruction pipeline (illustrated in Figure 4).

Figure 4: Overview of Ladybird. a) $p$ and $q$ are symmetric about a plane; their projections onto the image are found via a camera model. b) The local feature consists of the point feature of $p$ and local image features from the pixels corresponding to $p$ and $q$. c) The global feature consists of the point feature of $p$ and the global image feature. d) Local and global features are encoded through two MLPs whose parameters are shared among all points $p$. Finally, marching cubes is used to extract the iso-surface.

2.1 Preliminary

In the Quasi-Monte Carlo integration literature, the equidistribution of a point set is tested by calculating the discrepancy of the set. This approach assigns a single quality number, the discrepancy, to every point set: the lower the discrepancy, the more uniform the underlying point set. We focus on the star discrepancy, which measures discrepancy with respect to rectangular axis-aligned sub-regions with one corner fixed at the origin. Mathematically, the star discrepancy can be defined as follows:

Definition 1

Let $P = \{x_1, \dots, x_N\}$ be a set of points in $[0,1]^d$. The star discrepancy of $P$ is

$$D^*(P) = \sup_{B \in J^*} \left| \frac{A(B; P)}{N} - \lambda(B) \right|,$$

where $\lambda$ is the Lebesgue measure on $[0,1]^d$, $A(B; P)$ is the number of points of $P$ that lie in $B$, and $J^*$ is the family of axis-aligned boxes of the form $\prod_{i=1}^{d} [0, u_i)$.
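For intuition, the star discrepancy of a small 2D point set can be approximated numerically by enumerating candidate boxes whose upper corners come from the sample coordinates; a minimal sketch (our own illustrative helper, not part of the paper):

```python
import numpy as np

def star_discrepancy_2d(points):
    """Approximate the star discrepancy of 2D points in [0,1]^2.

    The supremum over boxes [0,u1) x [0,u2) is attained (up to boundary
    effects) at boxes whose upper corners come from the sample coordinates,
    so we enumerate that finite grid of candidate corners.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Candidate corner coordinates: the sample coordinates plus 1.0.
    xs = np.unique(np.append(pts[:, 0], 1.0))
    ys = np.unique(np.append(pts[:, 1], 1.0))
    worst = 0.0
    for u1 in xs:
        inside_x = pts[:, 0] < u1
        for u2 in ys:
            count = np.count_nonzero(inside_x & (pts[:, 1] < u2))
            volume = u1 * u2  # Lebesgue measure of the box
            worst = max(worst, abs(count / n - volume))
    return worst
```

This O(N^3) enumeration is only practical for small sets, but it suffices to compare the samplers discussed in Section 3.3.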

For a given point set or sequence $P$ (stochastic or deterministic), the error due to sampling is directly related to the star discrepancy of $P$. This relation is given by the Koksma-Hlawka inequality [15], as described below:

Theorem 1

Let $P = \{x_1, \dots, x_N\} \subset [0,1]^d$ and let $f$ be a function on $[0,1]^d$ with bounded variation $V(f)$. Then

$$\left| \frac{1}{N} \sum_{i=1}^{N} f(x_i) - \int_{[0,1]^d} f(x)\,dx \right| \le V(f)\, D^*(P).$$

The above inequality states that for $f$ with bounded variation, a point set with lower discrepancy gives less error when numerically integrating $f$.

The distance between two implicit fields is an integral, and a set of points in $\mathbb{R}^3$ needs to be sampled to approximate this integral, which appears in the expected loss for deep implicit field regression. By the triangle inequality, using a lower-discrepancy sampler yields a better bound on the generalisation error.

2.2 Sampling

Given an input image $I$, we denote by $f(p, I; \theta)$ a neural network that predicts the implicit field at a point $p$. Let $g$ be the ground truth implicit field of the mesh $M$ from which $I$ is rendered, and let $\mathcal{D}$ be the training set. To estimate the expected loss, we need to estimate the following:

$$\mathbb{E}_{I \sim \mathcal{D}} \int_{\mathbb{R}^3} \big( f(p, I; \theta) - g(p) \big)^2 \, \mu(p)\, dp,$$

where $\mu$ is a probability density function on $\mathbb{R}^3$ supported in a compact region near the mesh $M$.

Instead of studying different choices for $\mu$ and their effects on training, we study the impact of different sampling patterns on the integral estimation.

The error convergence rate of an estimator is greatly influenced by the sampling pattern [22, 24]. Sparse sampling can result in aliasing, following the Nyquist-Shannon theorem. A better sampling strategy allows faster convergence to the true integral, and hence better generalisation performance. Following the Koksma-Hlawka inequality, to better approximate the distance between $f$ and $g$, which indicates better generalisation of the network to different input points $p$, sample sets of lower discrepancy should be preferred.

For time efficiency, we usually pre-compute the implicit field on a dense set of points around the mesh surface, from which a sparse subset is chosen during training. Hence we consider the following problem: given a set of points $S$, how do we select a subset $T \subseteq S$ with low discrepancy? It is natural to consider the farthest point sampling (FPS) algorithm: initially, a point $t_1 \in S$ is selected uniformly at random. Then, iteratively,

$$t_{k+1} = \operatorname*{arg\,max}_{s \in S} \; \min_{1 \le j \le k} \| s - t_j \|$$

is added to $T$. In Section 3.3, we show that compared to randomly selecting a sample subset from $S$, sampling using the FPS approach results in lower discrepancy.
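The FPS selection rule described above can be sketched in a few lines (an illustrative O(Nk) implementation, not the paper's code):

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Select k points from `points` (N x d) via farthest point sampling.

    Starts from a uniformly random point, then repeatedly adds the point
    that maximizes the minimum distance to the already-selected set.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    rng = np.random.default_rng(seed)
    selected = [rng.integers(n)]
    # min_dist[i] = distance from pts[i] to the nearest selected point
    min_dist = np.linalg.norm(pts - pts[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(pts - pts[nxt], axis=1))
    return pts[selected]
```

Each iteration updates the nearest-selected distance incrementally, so the total cost is O(Nk) rather than recomputing all pairwise distances.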

2.3 Feature Fusion Based on Symmetry

For a fixed camera model, let $\pi$ be the corresponding projection that maps 3D points to the image plane. Assume that the target mesh is symmetric about a coordinate plane¹, and let $T$ be the rigid transformation such that the input image is formed via the composition $\pi \circ T$. In practice, $T$ is either known or predicted by a camera network from the input image.

¹The ShapeNet dataset is aligned, and most objects are symmetric about the same coordinate plane.

For a point $p$ not too far from $M$, let $x_p = \pi(T(p))$ be the pixel in the image that corresponds to $p$. A convolutional neural network (CNN) is used to extract features from the input image $I$. Let $F(x_p)$ be the concatenation of the feature vectors at $x_p$ in different layers of the CNN.

We can use $F(x_p)$ to guide the regression of the implicit field at $p$. However, when $p$ is occluded, the pixel value at $x_p$ is determined not by $p$ but by the point $p'$ with the smallest z-buffer value whose projection also lies in the pixel $x_p$. There is no clear relation between the implicit field at $p$ and that at $p'$.

For a point $p$, let $q$ denote its reflection about the symmetry plane. The implicit field at $p$ should equal that at $q$. Hence it is reasonable to include $F(x_q)$ as part of the local feature of $p$, which we call feature fusion. One straightforward and effective way to implement feature fusion is to concatenate $F(x_p)$ and $F(x_q)$.
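As an illustration, assuming symmetry about the z = 0 plane and a given projection function, a Symm(Concat)-style fused local feature might look like the following (the symmetry-plane choice, helper names, and nearest-pixel lookup are our simplifications, not the paper's exact implementation):

```python
import numpy as np

def reflect_z(p):
    """Reflection about the z = 0 plane (illustrative choice of symmetry plane)."""
    q = np.array(p, dtype=float)
    q[2] = -q[2]
    return q

def fused_local_feature(p, project, feature_map):
    """Concatenate CNN features at the projections of p and its mirror q.

    project:     maps a 3D point to (row, col) pixel coordinates.
    feature_map: H x W x C array of CNN features for the input image.
    """
    q = reflect_z(p)
    feats = []
    h, w, _ = feature_map.shape
    for point in (p, q):
        r, c = project(point)
        # Nearest-pixel lookup; a real implementation would sample bilinearly.
        r = int(np.clip(round(r), 0, h - 1))
        c = int(np.clip(round(c), 0, w - 1))
        feats.append(feature_map[r, c])
    return np.concatenate(feats)  # Symm(Concat): dimension 2C
```

When p lies on the symmetry plane, p and q project to the same pixel and the two halves of the fused feature coincide.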

3 Experiments

To show the effectiveness of our proposed system Ladybird, we provide quantitative as well as qualitative comparisons to other methods. Our backbone network architecture is based on DISN [33]. Our implementation of Ladybird is in TensorFlow 1.9 [2], and the system is tested on an Nvidia GTX 1080Ti with CUDA 9.0. In all our experiments, the Adam optimizer [14] is used with an initial learning rate of 1e-4.

3.1 Data Processing

For data, we use ShapeNet Core v1 [3] with the official train/test split. There are 13 categories of objects. For each object, 24 views are rendered as in 3D-R2N2 [6]. We randomly select 6000 images from the training set as the validation set, and our training set contains 726,600 images. The data is aligned, and most objects (about 80 percent) are symmetric about the same coordinate plane. We normalize each object mesh such that its center of mass is at the origin and the mesh lies in the unit sphere.

To compute SDF values efficiently and accurately, we use the polygon soup algorithm [32] to compute the SDF at grid points. SDF values at non-grid points are then obtained through trilinear interpolation.
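Trilinear interpolation from a precomputed cubic SDF grid can be sketched as follows (unit-cube grid assumed; the helper name is ours):

```python
import numpy as np

def trilinear_sdf(grid, p):
    """Trilinearly interpolate a value at point p from a cubic SDF grid.

    grid: R x R x R array of SDF values sampled on the unit cube [0,1]^3.
    p:    query point with coordinates in [0, 1].
    """
    r = grid.shape[0]
    # Continuous grid coordinates, clamped so the +1 corner stays in bounds.
    x = np.clip(np.asarray(p, dtype=float) * (r - 1), 0, r - 1 - 1e-9)
    i0 = x.astype(int)
    t = x - i0  # fractional position inside the cell
    val = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Weight of each of the 8 cell corners.
                w = ((1 - t[0] if dx == 0 else t[0])
                     * (1 - t[1] if dy == 0 else t[1])
                     * (1 - t[2] if dz == 0 else t[2]))
                val += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return val
```

The eight corner weights sum to one, so the interpolant reproduces constant fields exactly and is linear along each axis.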

For each mesh object, we first sample points using the Grid, Jitter, or Sobol sampler [1] and compute the corresponding SDF values. In Jitter, each grid point is perturbed with Gaussian noise of mean 0 and standard deviation 0.02. We then sample a subset $S$ of 32,768 points in the following way: from each of four SDF ranges, one fourth of the points are sampled uniformly at random. During training, a subset of 2048 points is sampled from $S$, uniformly at random or through FPS, at each epoch. Depending on the sampling pattern used to obtain $S$ (say A) and the training subset (say B), the resulting sampling pattern is denoted A+B.
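The stratified selection of the subset by SDF range might be sketched as below; the band boundaries in the usage example are illustrative, since the paper's exact SDF ranges are not reproduced here:

```python
import numpy as np

def stratified_by_sdf(points, sdf, n_total, bands, seed=0):
    """Sample n_total points, an equal share uniformly at random per SDF band.

    points: N x d array of candidate points.
    sdf:    length-N array of their SDF values.
    bands:  list of (lo, hi) SDF intervals, e.g. 4 bands -> n_total/4 each.
    """
    rng = np.random.default_rng(seed)
    per_band = n_total // len(bands)
    chosen = []
    for lo, hi in bands:
        idx = np.flatnonzero((sdf >= lo) & (sdf < hi))
        # If a band is underpopulated, take everything it has.
        take = min(per_band, len(idx))
        chosen.append(rng.choice(idx, size=take, replace=False))
    return points[np.concatenate(chosen)]
```

Stratifying by SDF value ensures that points both near and away from the surface are represented in every training epoch.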

At test time, the SDF values at grid points are predicted, and marching cubes is used to extract the iso-surface.

3.2 Network Details

We use a pre-trained camera pose estimation network from DISN [33] to predict the rigid transformation $T$ described in Section 2.3. VGG-16 is used as the CNN module to extract features from the input image. For a given point $p$, $F(x_p)$ is the concatenation of features at different layers of VGG-16 (Section 2.3) at the pixel $x_p$ to which $p$ projects under the known or predicted camera. With $p$ and $q$ symmetric about the plane, the pixel feature of $p$ is one of the following:

  1. Base: $F(x_p)$, which is of dimension 1472.

  2. Symm(Near): $F(x_p)$ or $F(x_q)$, depending on which of $p$ and $q$ has the smaller z-buffer value.

  3. Symm(Avg): the average of $F(x_p)$ and $F(x_q)$.

  4. Symm(Concat): the concatenation of $F(x_p)$ and $F(x_q)$.

As shown in Figure 4, the image feature is the output of VGG-16 (of dimension 1024). Two streams of point features are processed with two MLPs, each with layer sizes (64, 256, 512). One stream is concatenated with the pixel feature and the other with the image feature, to form a local and a global feature respectively. These global and local features are decoded through two MLPs, each with layer sizes (512, 256, 1), and the two decoded values are added to give the predicted SDF at $p$.
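The two-stream decoder described above can be sketched with plain NumPy MLPs; the layer widths follow the text, while the random weights and helper names are purely illustrative:

```python
import numpy as np

def mlp(x, widths, rng):
    """Tiny fully connected stack with ReLU between layers (random weights)."""
    for i, w in enumerate(widths):
        W = rng.standard_normal((x.shape[-1], w)) * 0.1
        x = x @ W
        if i < len(widths) - 1:
            x = np.maximum(x, 0.0)  # ReLU
    return x

def predict_sdf(p, pixel_feat, image_feat, rng):
    """Two-stream SDF head: local (pixel) and global (image) branches are
    decoded separately and their scalar outputs are summed."""
    pt_local = mlp(p, (64, 256, 512), rng)   # point feature, local stream
    pt_global = mlp(p, (64, 256, 512), rng)  # point feature, global stream
    local = np.concatenate([pt_local, pixel_feat])  # e.g. Symm(Concat): 2 x 1472
    glob = np.concatenate([pt_global, image_feat])  # global image feature: 1024
    sdf_local = mlp(local, (512, 256, 1), rng)
    sdf_global = mlp(glob, (512, 256, 1), rng)
    return float((sdf_local + sdf_global)[0])
```

Summing the two scalar heads lets the global branch capture the coarse shape while the local branch refines details near the projected pixels.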

3.3 Samplers Impact on Training

To assess the effect of different samplers on training, we set the pixel feature to Base (see Section 3.2), use ground truth camera parameters, and keep the batch size at 20.

Sample size Metric Grid+Random Grid+FPS Jitter+FPS Sobol+FPS
1024 Mean 4.51 3.86 6.49 4.35
Std 0.66 0.19 0.49 1.43
2048 Mean 2.98 2.48 6.07 2.51
Std 0.26 0.16 0.77 0.47
4096 Mean 2.34 1.96 6.10 1.65
Std 0.33 0.08 0.30 0.37
Table 1: Mean and standard deviation (Std) of the star discrepancy of different samplers. A+B means we first sample points using method A and then select a subset of the stated size with method B.
Metric Grid+Random Grid+FPS Jitter+FPS Sobol+FPS
CD 10.17 8.43 19.88 11.33
EMD 2.71 2.57 2.92 2.84
Table 2: Effect of different samplers on reconstruction results on the ShapeNet test set. Method A+B means the dense set is sampled with A and the training subset is sampled with B. Metrics are the class mean of CD and the class mean of EMD. Grid+FPS outperforms the other methods.

In Table 1, we report the star discrepancy of different samplers in 2D. We first sample points using the Grid, Jitter, and Sobol samplers in $[0,1]^2$, then select 1024, 2048, or 4096 points uniformly at random or through FPS. In Jitter, each grid point is perturbed with Gaussian noise as described in Section 3.1. We experimentally verify in 2D that Grid+FPS sampling has lower discrepancy and lower variance than Grid+Random.

Figure 5: Impact of feature fusion based on reflective symmetry. (a) indicates the input images. (b) and (d) are the reconstruction results using Base in two different views. (c) and (e) are the reconstruction results using Symm(Concat). We see that Symm(Concat) helps to improve the reconstruction quality.
Grid vs. Sobol:

The SDF validation accuracy of Sobol+FPS (0.914) is similar to that of Grid+FPS (0.917), and both are higher than Grid+Random (0.825). However, SDF prediction is only an intermediate step for the reconstruction task. Marching cubes is used to recover the mesh from the SDF, which requires SDF values at grid points. Due to this grid restriction imposed by marching cubes, Grid sampling ensures better training/test data consistency. In addition, Grid+FPS and Grid+Random lead to more stable training than Sobol+FPS, as indicated by their lower standard deviations. Our work therefore advocates Grid+FPS for 3D reconstruction based on deep implicit fields and marching cubes.

In Table 2, we report the comparison of reconstruction using different samplers in terms of Chamfer distance (CD)² and Earth Mover's distance (EMD) [28]. We see that Grid+FPS outperforms Grid+Random, Jitter+FPS, as well as Sobol+FPS. Jitter+FPS performs the worst, and its 2D analogue also has the highest star discrepancy. We observe that Grid+FPS reduces noisy phantom blocks around the mesh, and hence reduces the need for post-processing and cleaning. This property is highly desirable, because the cleaning algorithm sometimes cannot distinguish between small components and noise. In addition, Grid+FPS encourages faster training convergence, as shown in Figure 1.

²For two point sets $S_1$ and $S_2$, CD is defined as $\sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2$.
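The Chamfer distance used in Table 2 can be computed directly from two point sets; a minimal sketch using one common convention (sums of squared nearest-neighbor distances):

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Symmetric Chamfer distance between two point sets (N x d and M x d),
    using squared Euclidean nearest-neighbor distances."""
    a = np.asarray(s1, dtype=float)
    b = np.asarray(s2, dtype=float)
    # Pairwise squared distances, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    # Nearest neighbor in each direction, summed.
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()
```

Note that conventions vary across papers (sum vs. mean, squared vs. unsquared distances), so reported CD values are only comparable under the same convention.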

3.4 Effect of Feature Fusion Based on Symmetry

To analyze the effect of symmetry-based feature fusion, we use the Grid+FPS sampling method. The batch size for this experiment is 16.

Metric Local image feature plane bench box car chair display lamp speaker rifle sofa table phone boat Mean
CD Base 5.33 5.37 9.33 4.42 7.73 7.07 24.36 13.65 3.32 5.78 9.37 8.13 5.79 8.43
Symm(Avg) 7.27 17.00 12.29 4.97 14.83 15.83 58.77 23.76 6.72 11.15 12.06 61.73 5.96 19.41
Symm(Near) 4.73 5.50 9.13 4.12 6.70 7.05 18.43 12.26 3.62 6.70 11.49 4.49 5.37 7.66
Symm(Concat) 3.86 4.30 8.04 4.11 5.43 6.09 14.10 10.53 3.51 5.05 8.13 4.16 4.92 6.33
EMD Base 2.35 2.30 2.91 2.47 2.66 2.44 4.21 3.19 1.69 2.29 2.78 1.95 2.14 2.57
Symm(Avg) 2.14 2.36 2.98 2.42 2.56 2.54 4.69 3.41 1.71 2.45 2.77 3.25 2.12 2.72
Symm(Near) 2.24 2.22 2.95 2.40 2.53 2.42 4.11 3.14 1.65 2.38 2.85 1.89 2.11 2.53
Symm(Concat) 2.07 2.06 2.80 2.38 2.32 2.28 3.59 2.98 1.73 2.18 2.57 1.85 2.07 2.38
IoU Base 63.4 56.3 52.0 77.8 58.1 60.2 41.7 58.4 70.4 71.3 53.8 75.7 66.0 61.9
Symm(Near) 64.7 56.3 54.3 79.0 60.2 61.1 43.4 58.5 71.6 69.8 52.8 76.4 67.5 62.7
Symm(Concat) 66.6 60.3 56.4 80.2 64.7 63.7 48.5 61.5 71.9 73.5 58.1 78.1 68.8 65.6
Table 3: Comparison between different feature fusion operations evaluated on the ShapeNet test set. Metrics are CD, EMD and IoU (the larger the better). Ground truth camera parameters are used.
Figure 6: Symm(Concat) can produce good reconstruction results for non-symmetric objects, without ground truth camera parameters. (a) and (d) are input images. (b) and (c) are reconstruction results for (a) rendered from two different views. (e) and (f) are reconstruction results for (d) from two different views.

In Table 3 and Figure 5, we compare the effects of the different feature fusion operations defined in Section 3.2 on reconstruction results from ShapeNet. The ablation study shows that Symm(Near) and Symm(Concat) improve the reconstruction results, and that concatenating the features of a symmetric pair performs best. The reason is that Symm(Concat) makes better use of the additional information than Symm(Near) and Symm(Avg): when both $p$ and its symmetric point $q$ are visible in the image, the pixel features of $p$ and $q$ are both helpful for recovering the local shape at $p$. We also observe that Symm(Concat) produces good reconstructions for non-symmetric objects, as shown in Figure 6. It can be interpreted as adding the most promising additional local feature based on a symmetry prior.

Figure 7: Qualitative comparison with other methods. The first row contains the input image. The released model of P2M (Pixel2Mesh)[29], OccNet [20], Mesh-R-CNN [10], BSP-NET [4], DISN [33] are used to generate the results. The last row GT contains the ground truth meshes.

3.5 Comparison with Other Methods

In this subsection, the sampling method is Grid+FPS, the pixel feature is Symm(Concat), and camera parameters are estimated using the network described in Section 3.2.

We report comparisons with other state-of-the-art methods in terms of CD, EMD, and IoU. From Table 4, we see that Ladybird outperforms the other methods. Figure 7 shows qualitative comparisons of Ladybird with other methods; Ladybird is able to reconstruct high quality meshes with fine geometric details from a single input image. Note that due to the difference between the train/test split of OccNet [20] and ours, we evaluate OccNet [20] on the intersection of the two test sets.

Metric Method plane bench box car chair display lamp speaker rifle sofa table phone boat Mean
CD AtlasNet [11] 5.98 6.98 13.76 17.04 13.21 7.18 38.21 15.96 4.59 8.29 18.08 6.35 15.85 13.19
Pixel2Mesh [29] 6.10 6.20 12.11 13.45 11.13 6.39 31.41 14.52 4.51 6.54 15.61 6.04 12.66 11.28
3DN [30] 6.75 7.96 8.34 7.09 17.53 8.35 12.79 17.28 3.26 8.27 14.05 5.18 10.20 9.77
IMNET [5] 12.65 15.10 11.39 8.86 11.27 13.77 63.84 21.83 8.73 10.30 17.82 7.06 13.25 16.61
3DCNN [33] 10.47 10.94 10.40 5.26 11.15 11.78 35.97 17.97 6.80 9.76 13.35 6.30 9.80 12.30
OccNet [20] 7.70 6.43 9.36 5.26 7.67 7.54 26.46 17.30 4.86 6.72 10.57 7.17 9.09 9.70
DISN [33] 9.96 8.98 10.19 5.39 7.71 10.23 25.76 17.90 5.58 9.16 13.59 6.40 11.91 10.98
Ours (est. cam) 5.85 6.12 9.10 5.13 7.08 8.23 21.46 14.75 5.53 6.78 9.97 5.06 6.71 8.60
Ours (GT cam) 3.86 4.30 8.04 4.11 5.43 6.09 14.10 10.53 3.51 5.05 8.13 4.16 4.92 6.33
EMD AtlasNet [11] 3.39 3.22 3.36 3.72 3.86 3.12 5.29 3.75 3.35 3.14 3.98 3.19 4.39 3.67
Pixel2Mesh [29] 2.98 2.58 3.44 3.43 3.52 2.92 5.15 3.56 3.04 2.70 3.52 2.66 3.94 3.34
3DN [30] 3.30 2.98 3.21 3.28 4.45 3.91 3.99 4.47 2.78 3.31 3.94 2.70 3.92 3.56
IMNET[5] 2.90 2.80 3.14 2.73 3.01 2.81 5.85 3.80 2.65 2.71 3.39 2.14 2.75 3.13
3DCNN [33] 3.36 2.90 3.06 2.52 3.01 2.85 4.73 3.35 2.71 2.60 3.09 2.10 2.67 3.00
OccNet [20] 2.75 2.43 3.05 2.56 2.70 2.58 3.96 3.46 2.27 2.35 2.83 2.27 2.57 2.75
DISN [33] 2.67 2.48 3.04 2.67 2.67 2.73 4.38 3.47 2.30 2.62 3.11 2.06 2.77 2.84
Ours (est. cam) 2.48 2.29 3.03 2.65 2.60 2.61 4.20 3.32 2.22 2.42 2.82 2.06 2.46 2.71
Ours (GT cam) 2.07 2.06 2.80 2.38 2.32 2.28 3.59 2.98 1.73 2.18 2.57 1.85 2.07 2.38
IoU AtlasNet [11] 39.2 34.2 20.7 22.0 25.7 36.4 21.3 23.2 45.3 27.9 23.3 42.5 28.1 30.0
Pixel2Mesh [29] 51.5 40.7 43.4 50.1 40.2 55.9 29.1 52.3 50.9 60.0 31.2 69.4 40.1 47.3
3DN [30] 54.3 39.8 49.4 59.4 34.4 47.2 35.4 45.3 57.6 60.7 31.3 71.4 46.4 48.7
IMNET [5] 55.4 49.5 51.5 74.5 52.2 56.2 29.6 52.6 52.3 64.1 45.0 70.9 56.6 54.6
3DCNN [33] 50.6 44.3 52.3 76.9 52.6 51.5 36.2 58.0 50.5 67.2 50.3 70.9 57.4 55.3
OccNet [20] 54.7 45.2 73.2 73.1 50.2 47.9 37.0 65.3 45.8 67.1 50.6 70.9 52.1 56.4
DISN [33] 57.5 52.9 52.3 74.3 54.3 56.4 34.7 54.9 59.2 65.9 47.9 72.9 55.9 56.9
Ours (est. cam) 60.0 53.4 50.8 74.5 55.3 57.8 36.2 55.6 61.0 68.5 48.6 73.6 61.3 58.2
Ours (GT cam) 66.6 60.3 56.4 80.2 64.7 63.7 48.5 61.5 71.9 73.5 58.1 78.1 68.8 65.6
Table 4: Evaluations on the ShapeNet Core test set for various methods. Metrics are CD (the smaller the better), EMD (the smaller the better) and IoU (the larger the better). The two "Ours" rows are Ladybird with estimated camera parameters (first) and with ground truth camera parameters (second).

Since ShapeNet is a synthetic dataset, we further provide a quantitative evaluation on Pix3D [27] (Table 5) and qualitative examples on in-the-wild images randomly selected from the Internet (Figure 8). These results show that Ladybird generalizes well to natural images. For the Pix3D experiment, we fine-tune Ladybird and DISN [33] (both pre-trained on ShapeNet) on the Pix3D train set, and use the ground truth camera poses and segmentation masks.

Metric Method bed bookcase chair desk misc sofa table tool wardrobe Mean
CD DISN [33] 12.74 35.29 23.82 18.70 31.18 3.85 18.46 46.00 4.23 18.51
Ours 5.73 15.89 13.03 10.38 30.34 3.28 8.38 28.39 5.58 10.02
EMD DISN [33] 2.84 4.65 3.97 4.04 4.53 1.99 3.85 5.66 2.11 3.53
Ours 2.35 3.07 3.23 2.77 4.96 1.84 2.42 3.68 1.99 2.75
IoU DISN [33] 71.2 43.0 59.0 53.7 48.8 89.4 57.8 37.3 85.6 64.4
Ours 78.2 67.8 66.5 67.5 49.5 91.8 74.2 58.4 86.8 73.3
Table 5: Evaluations on the Pix3D [27] test set. Metrics are CD, EMD and IoU (the larger the better). Ground truth camera parameters are used.
Figure 8: Reconstruction results for online images. (a) shows the input images. (b) and (c) are our reconstruction results in mesh and voxel representations, respectively. (d) shows the reconstruction results of Pix2Vox [31]. Ladybird naturally produces an accurate uv-map for texturing.

4 Conclusion

We study the impact of sample set discrepancy on the training efficiency of implicit field regression networks, and propose using FPS instead of random sampling to select training points. We also propose local feature fusion based on reflective symmetry to improve reconstruction quality. We verify the effectiveness of our methods, both qualitatively and quantitatively, through extensive experiments on the large-scale ShapeNet dataset.

5 Acknowledgements

We would like to thank the anonymous reviewers for their helpful feedback and suggestions. We would like to thank Zilei Huang for his help in accelerating the data processing and debugging.

6 Appendix

6.1 Validation accuracy

In Table 6, we report the SDF validation accuracy. The experimental setup is the same as that in Section 3.3, and our validation set consists of 6000 images. We see that Grid+FPS results in faster convergence and higher SDF validation accuracy.

Epoch 1 2 3 5 10 30
Grid+Random 0.743 0.777 0.788 0.803 0.817 0.825
Grid+FPS 0.803 0.859 0.872 0.888 0.905 0.917
Table 6: Validation accuracy for different sampling methods.

6.2 Spectrum, more on discrepancy

FPS induces blue-noise behavior by construction: Gaussian Jitter+FPS gives a power spectrum with blue-noise characteristics (Figure 9). However, Jitter+FPS has higher discrepancy than Grid+FPS and gives worse 3D reconstruction results. Generating good 3D blue-noise samples at the required resolution is computationally very expensive; hence we excluded dedicated blue-noise samplers from this work.

Figure 9: Power spectra of (a) Grid+FPS, (b)-(d) Jitter+FPS with increasing jitter standard deviation, and (e) blue noise.

The discrepancy depends on the initial sample size, the final sample size, and their ratio. In Table 7, we report the star discrepancy (×0.01) of different samplers with varying initial sample size. In the original FPS paper [25], the authors gave a deterministic bound on the distance between sample points (Theorem 4.2), which is used to prove that FPS is a uniform sampler. This analysis sheds some light on why FPS results in low discrepancy, as it could lead to a deterministic bound on the discrepancy.

Initial sample size Metric Grid+Random Grid+FPS Jitter+FPS Sobol+FPS
Mean 3.06 2.84 5.41 1.75
Std 0.34 0.18 0.16 0.08
Mean 2.98 2.48 6.07 2.51
Std 0.26 0.16 0.77 0.47
Mean 3.07 2.66 6.48 2.62
Std 0.5 0.1 0.31 0.23
Table 7: Mean and standard deviation (Std) of the star discrepancy (×0.01) of different samplers. A+B means we first sample points using method A and then select a subset of size 2048 with method B.

6.3 Marching Cube at higher resolution

Using Ladybird configured as in Section 3.5, we run marching cubes at two different resolutions. Due to the high memory and computation requirements at increased resolution, we report CD only for 100 objects randomly sampled from the ShapeNet test set. The results are summarized in Table 8.

Resolution Grid+Random Grid+FPS Sobol+FPS
10.79 9.04 10.20
10.60 8.81 9.76
Table 8: Effect of marching cubes resolution on reconstruction results for 100 objects randomly sampled from the ShapeNet test set. Metric is the class mean of CD.

6.4 Limitations

The reconstruction quality of Ladybird is limited by the input image resolution (currently 137×137). However, issues such as memory, speed, and compatibility with pre-trained image networks need to be considered when increasing the input resolution. We would like to address 3D reconstruction from high-resolution images in future work.

Since we spatially align the image to the mesh and utilize the corresponding local features, an accurate camera pose is crucial to our method (Figure 10). A better camera pose estimation network would lead to significant improvement of our system.

Figure 10: Inaccurate camera pose estimation leads to failures in reconstruction. (a) shows the input images. (b) and (d) are reconstruction results using estimated camera poses, in two different views. (c) and (e) are reconstruction results using ground truth camera poses, in two different views.