These two authors contributed equally.
ORCIDs: Yi Yuan (0000-0003-2507-8181), Gurprit Singh (0000-0003-0970-5835)
Due to the under-constrained nature of the problem, 3D object reconstruction from a single-view image has been a challenging task. Large shape and structure variations among objects make it difficult to define one dedicated parameterized model. Methods based on template deformation are often restricted by the initial topology of the template, and cannot, for instance, recover holes. Recently, deep learning based implicit field regression methods have shown great potential in monocular 3D reconstruction. Mescheder et al.  and DISN  create visually pleasing, smooth shape reconstructions with consistent normals and complex topology using implicit fields.
An implicit field is a real-valued function defined on $\mathbb{R}^3$ whose iso-surface recovers the mesh of interest. Common choices of implicit field are the signed distance field, the truncated signed distance field, and the occupancy probability field. A network $f_\theta$ is trained to predict the implicit field value $f_\theta(p, I)$ at a point $p \in \mathbb{R}^3$, based on the input image $I$, where $\theta$ are the parameters, optimized with stochastic gradient descent (SGD) type algorithms. This is followed by post-processing methods like marching cubes or sphere tracing to reconstruct the mesh.
The loss function for the implicit field regression problem is the distance between the ground truth implicit field and the network-predicted output. During training, a sparse set of 3D points needs to be sampled in a compact region containing the mesh to approximate the optimization objective. We formulate this empirical loss as a Monte Carlo estimator.
While most prior discussion on sampling  focuses on designing a probability measure for the integral that puts different weights on regions at different distances to the mesh surface, we look at the problem from the point of view of the discrepancy of the sample sets. When approximating an integral, different samplers have different error convergence rates with respect to the sample size . Low-discrepancy sequences/points or blue-noise (in 2D) samples give better estimates than random samples (white noise), for instance.
Given a set of locally uniform samples whose distance to the target mesh is bounded by a threshold, we show that the farthest point sampling (FPS) algorithm can be used to select a sparse subset with low discrepancy for training. An overview of our method is shown in Figure 1. Our proposed sampling scheme results in better generalization performance as it provides a better approximation of the expected loss, thanks to the Koksma-Hlawka inequality . Empirically, our sampling scheme also results in faster convergence for SGD-based optimization algorithms, which speeds up the training process significantly, as shown in Figure 1(e,f).
Several prior works  explore the use of global shape encoding. While good at capturing the general shape and yielding interesting interpolations in the latent space, global features alone sometimes make it difficult to recover fine geometric details. Local features, found by aligning the image to the mesh through modeling of the camera, are used to address this issue. However, for occluded points, it is ambiguous which local features should be used: usually all sampled points are projected to the image, so points in the back use the features of the points that occlude them.
As most man-made objects are symmetric about a plane, we observe that this problem can be alleviated by considering reflective symmetry. For a symmetric pair of points $p$ and $\bar{p}$, the implicit fields at $p$ and $\bar{p}$ are the same, and often at least one of them is visible in the image. Hence we can use the local features of $\bar{p}$ to improve the implicit field prediction at $p$, which can also be understood as utilizing two-view information. Our feature fusion method imposes a symmetry prior on the network, which gives a significant improvement of the reconstruction quality, as shown in Figure 2. Unlike previous works  that focus on the design of the loss function, or the detection or encoding of symmetry, our method naturally integrates into the pixel-to-mesh alignment framework.
The advantage of spatially aligning the image to the mesh and utilizing the corresponding local features is that fine shape details and textures can be better recovered. However, when a point is occluded, the feature obtained by such alignment no longer has an intuitive meaning. Recently, Front2Back  addresses this issue by detecting reflective symmetries from the data and synthesizing the opposite orthographic view. Our approach is simpler and does not depend on symmetry detection.
AtlasNet  represents a mesh as a locally parameterized surface and predicts the local patches from a latent shape representation learned via reconstruction objectives. Mitchell et al.  propose to represent 3D shapes using higher-order functions.
Pixel2Mesh  uses a graph CNN to progressively deform an ellipsoid template mesh to fit the target. Features from different layers of the CNN are used to generate different resolutions of detail. 3DN  infers vertex offsets from a template mesh according to the image object's category, and proposes a differentiable mesh sampling operator to compute the loss function. SDM-NET  uses a VAE to generate a spatial arrangement of deformable parts of an object. Pan et al.  propose a progressive method that alternates between deforming the mesh and modifying its topology. Mesh R-CNN  unifies object detection and shape reconstruction, with a mesh prediction branch that first produces coarse cubified meshes which are then refined with a graph convolution network.
1.0.2 Point Cloud and Voxel
Fan et al.  propose a conditional shape sampler to predict multiple plausible point clouds from an input image. Lin et al.  use an auto-encoder to synthesize partial point clouds from multiple views, which are combined into a dense point cloud; the loss is then computed by rendering depth images from multiple views. Li et al.  use a CNN to predict multiple depth maps and corresponding deformation fields, which are fused to form the full 3D shape.
1.1 Sampling Methods in Monte Carlo Integration
Realistic image synthesis involves evaluating very high-dimensional light transport integrals. (Quasi-)Monte Carlo (MC) numerical methods are traditionally employed to approximate these integrals, which inevitably introduces estimation error. This error directly depends on the sampling pattern used to estimate the underlying integral . These sampling patterns can be highly correlated, and Fourier power spectra are commonly employed to characterize the correlations among samples (Figure 3).
Blue-noise samplers  are well known to give good improvements for low-dimensional integration problems, whereas low-discrepancy samplers  like Halton  and Sobol  are more effective for higher-dimensional problems. In this work, we use a farthest point selection strategy  to select our samples from any given point set.
2 Our Approach
We first start with a theoretical motivation for our sampling methods. This is followed by the proposed symmetric feature fusion module and our 3D reconstruction pipeline (illustrated in Figure 4).
In the Quasi-Monte Carlo integration literature, the equidistribution of a point set is tested by calculating the discrepancy of the set. This approach assigns a single quality number, the discrepancy, to every point set: the lower the discrepancy, the more uniform the underlying point set. We focus on the star discrepancy of a point set, which computes the discrepancy with respect to rectangular axis-aligned sub-regions with one corner fixed at the origin. Mathematically, the star discrepancy can be defined as follows:
Let $P = \{x_1, \dots, x_N\}$ be a set of points in $[0,1]^d$; then the star discrepancy of $P$ is
$$D^*(P) = \sup_{B \in J^*} \left| \frac{A(B; P)}{N} - \lambda_d(B) \right|,$$
where $\lambda_d$ is the Lebesgue measure on $[0,1]^d$, $A(B; P)$ is the number of points of $P$ that are in $B$, and $J^*$ is the family of axis-aligned boxes of the form $\prod_{i=1}^{d} [0, u_i)$.
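As a concrete illustration (a minimal sketch, not code from the paper), the star discrepancy of a 2D point set can be approximated by scanning anchored boxes $[0,u)\times[0,v)$ whose corners lie on a regular grid; this lower-bounds the true supremum but already separates uniform from clustered sets:

```python
def star_discrepancy_2d(points, resolution=64):
    """Approximate D*(P) for 2D points in [0,1]^2 by scanning
    axis-aligned boxes [0,u) x [0,v) anchored at the origin,
    with corners (u, v) taken from a regular grid."""
    n = len(points)
    worst = 0.0
    for i in range(1, resolution + 1):
        for j in range(1, resolution + 1):
            u, v = i / resolution, j / resolution
            inside = sum(1 for (x, y) in points if x < u and y < v)
            # deviation between empirical fraction and box area
            worst = max(worst, abs(inside / n - u * v))
    return worst
```

A centered regular grid of 16 points scores a far lower discrepancy than 16 coincident points, matching the intuition that lower discrepancy means more uniform coverage.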
For a given point set or sequence (stochastic or deterministic), the error due to sampling is directly related to the star discrepancy of the point set. This relation is given by the Koksma-Hlawka inequality , described below:
Let $P = \{x_1, \dots, x_N\} \subset [0,1]^d$ and let $f$ be a function on $[0,1]^d$ with bounded variation $V(f)$. Then for any such $P$,
$$\left| \frac{1}{N} \sum_{i=1}^{N} f(x_i) - \int_{[0,1]^d} f(x)\, \mathrm{d}x \right| \le V(f)\, D^*(P).$$
The above inequality states that for $f$ with bounded variation, a point set with lower discrepancy gives a smaller error when numerically integrating $f$.
The distance between two implicit fields is an integral, and a set of points in $\mathbb{R}^3$ needs to be sampled to approximate this integral, which appears in the expected loss for deep implicit field regression. By the triangle inequality, using a lower-discrepancy sampler yields a better bound on the generalisation error.
Given an input image $I$, we denote by $f_\theta(p, I)$ a neural network that predicts the implicit field at a point $p$. Let $g_M$ be the ground truth implicit field of the mesh $M$ from which $I$ is rendered, and let $\mathcal{D}$ be the training set. To estimate the expected loss, we need to estimate the following:
$$\mathbb{E}_{(I, M) \sim \mathcal{D}} \int_{\mathbb{R}^3} \left| f_\theta(p, I) - g_M(p) \right| \mu(p)\, \mathrm{d}p,$$
where $\mu$ is a probability density function in $\mathbb{R}^3$ supported in a compact region near the mesh $M$.
Instead of studying different choices for $\mu$ and their effects on training, we study the impact of different sampling patterns on the integral estimation.
The error convergence rate of an estimator is greatly influenced by the sampling pattern [22, 24]. Sparse sampling can result in aliasing, following the Nyquist-Shannon theorem. A better sampling strategy allows faster convergence to the true integral, resulting in better generalisation performance. Following the Koksma-Hlawka inequality, in order to better approximate the distance between the predicted and the ground truth implicit fields (which indicates better generalisation of the network over input points), sample sets of lower discrepancy should be preferred.
In consideration of time efficiency, we usually pre-compute the implicit field on a dense set of points around the mesh surface, from which a sparse subset is chosen during training. Hence we consider the following problem: given a set $Q$ of points, how do we select a subset $P \subset Q$ with low discrepancy? It is natural to consider the farthest point sampling (FPS) algorithm: initially, a point is selected from $Q$ uniformly at random; then, iteratively, the point of $Q$ farthest from the current subset $P$ is added to $P$. In Section 3.3, we show that, compared to randomly selecting a sample subset from $Q$, sampling with FPS results in lower discrepancy.
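The FPS selection described above can be sketched as follows (a simple O(nm) reference implementation, not the paper's code):

```python
import math
import random

def farthest_point_sampling(points, m, seed=0):
    """Select m points from `points` (a list of coordinate tuples):
    start from a random point, then repeatedly add the point whose
    distance to the already-selected subset is largest."""
    rng = random.Random(seed)
    selected = [rng.randrange(len(points))]
    # dist[i] = distance from points[i] to the nearest selected point
    dist = [math.dist(p, points[selected[0]]) for p in points]
    while len(selected) < m:
        nxt = max(range(len(points)), key=dist.__getitem__)
        selected.append(nxt)
        # newly added point may now be the nearest selected point
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[nxt]))
    return [points[i] for i in selected]
```

Because each new point maximizes its distance to the current subset, the selected points end up well separated, which is the blue-noise-like behavior exploited in this work.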
2.3 Feature Fusion Based on Symmetry
For a fixed camera model, let $\pi$ be the corresponding projection that maps 3D points to the image plane. Assume that the target mesh is symmetric about a canonical plane (the ShapeNet data set is aligned, and most objects are symmetric about this plane), and let $T$ be the rigid transformation such that the input image is formed via the composition $\pi \circ T$. In practice, $T$ is either known or predicted via a camera network from the input image.
For a point $p$ not too far from the mesh, let $u(p)$ be the pixel in the image that corresponds to $p$. A convolutional neural network (CNN) is used to extract features from the input image. Let $F(u)$ be the concatenation of feature vectors at pixel $u$ taken from different layers of the CNN. We can use $F(u(p))$ to guide the regression of the implicit field at $p$. However, when $p$ is occluded, the pixel value of $u(p)$ is not determined by $p$ but by the point with the smallest z-buffer value whose projection also lies in the pixel $u(p)$; there is no clear relation between the implicit field at $p$ and that at this occluding point.
For a point $p$ near a mesh that is symmetric about the canonical plane, the symmetric point of $p$ is its reflection $\bar{p}$ about that plane. The implicit field at $p$ should equal that at $\bar{p}$. Hence it is reasonable to include the local feature of $\bar{p}$ as part of the local feature of $p$, which we call feature fusion. One straightforward and effective way to implement feature fusion is to concatenate $F(u(p))$ and $F(u(\bar{p}))$.
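As a sketch (function names are hypothetical, not the paper's code), concatenation-based feature fusion reduces to reflecting the query point across the symmetry plane and concatenating the two pixel-aligned feature lookups:

```python
def reflect(p, normal, offset=0.0):
    """Mirror point p about the plane {x : <x, normal> = offset};
    `normal` is assumed to be unit length."""
    d = sum(pi * ni for pi, ni in zip(p, normal)) - offset
    return tuple(pi - 2.0 * d * ni for pi, ni in zip(p, normal))

def symm_concat(p, normal, local_feat):
    """Symm(Concat) fusion: concatenate the local features at p and
    at its mirror point. `local_feat` is a hypothetical map from a
    3D point to its pixel-aligned feature list."""
    return local_feat(p) + local_feat(reflect(p, normal))
```

The fused vector is twice as long as a single lookup, matching the dimensionality increase of the Symm(Concat) variant described later.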
To show the effectiveness of our proposed system, Ladybird, we provide quantitative as well as qualitative comparisons to other methods. Our backbone network architecture is based on DISN . Our implementation of Ladybird is in TensorFlow 1.9, and the system is tested on an Nvidia GTX 1080Ti with CUDA 9.0. In all our experiments, the Adam optimizer  is used with an initial learning rate of 1e-4.
3.1 Data Processing
As our dataset, we use ShapeNet Core v1  with the official train/test split. There are 13 categories of objects. For each object, 24 views are rendered as in 3D-R2N2 . We randomly select 6000 images from the training set as the validation set, and our training set contains 726,600 images. The data is aligned, and most objects (about 80 percent) are symmetric about the canonical plane. We normalize each object mesh such that its center of mass is at the origin and the mesh lies in the unit sphere.
To efficiently and accurately compute SDF values, we use the polygon soup algorithm  to compute the SDF at grid points. SDF values at non-grid points are then obtained through tri-linear interpolation.
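The tri-linear interpolation step can be sketched in plain Python (assuming a nested-list grid where grid[i][j][k] holds the SDF at position (i, j, k)/(res-1) in the unit cube):

```python
def trilinear_sdf(grid, p):
    """Interpolate a res^3 SDF grid (nested lists) at p in [0,1]^3
    by blending the 8 surrounding grid corners."""
    res = len(grid)
    # clamp so the upper cell corner stays inside the grid
    coords = [min(c * (res - 1), res - 1 - 1e-9) for c in p]
    base = [int(c) for c in coords]
    frac = [c - b for c, b in zip(coords, base)]
    out = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # weight is the product of per-axis blend factors
                w = ((frac[0] if dx else 1 - frac[0])
                     * (frac[1] if dy else 1 - frac[1])
                     * (frac[2] if dz else 1 - frac[2]))
                out += w * grid[base[0] + dx][base[1] + dy][base[2] + dz]
    return out
```

For a field that is linear in one coordinate, tri-linear interpolation reproduces it exactly, which is why it is a reasonable fit for locally smooth SDFs.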
For each mesh object, we first sample a dense set of points using the Grid, Jitter, or Sobol sampler  and compute the corresponding SDF values. In Jitter, each grid point is jittered with Gaussian noise of mean 0 and standard deviation 0.02. We then sample a subset consisting of 32,768 points from the dense set in the following way: from each of four SDF ranges, a fixed fraction of the points is sampled uniformly at random. During training, a subset consisting of 2048 points is sampled from these 32,768 points, either uniformly at random or through FPS, at each epoch. Depending on the sampling pattern used for the dense set (say A) and for the subset (say B), the resulting sampling scheme is denoted by A+B.
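The stratified subset selection can be sketched as follows; the actual SDF range boundaries and per-band fractions used in the paper are not reproduced here, so the bands in the example are placeholders:

```python
import random

def stratified_subset(points_with_sdf, bands, n_total, seed=0):
    """Draw n_total points, an equal share from each |SDF| band.
    `points_with_sdf` is a list of (point, sdf) pairs; `bands` is a
    list of (lo, hi) absolute-SDF intervals (placeholder values)."""
    rng = random.Random(seed)
    per_band = n_total // len(bands)
    out = []
    for lo, hi in bands:
        # pool of candidates whose |SDF| falls in this band
        pool = [p for p, s in points_with_sdf if lo <= abs(s) < hi]
        out.extend(rng.sample(pool, min(per_band, len(pool))))
    return out
```

Stratifying by distance to the surface keeps near-surface points, where the iso-surface lives, well represented without discarding far-field supervision entirely.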
At test time, the SDFs of grid points are predicted, and marching cubes is used to extract the iso-surface.
3.2 Network Details
We use a pre-trained camera pose estimation network from DISN to predict the rigid transformation matrix described in Section 2.3. VGG-16 is used as the CNN module to extract features from the input image. For a given point $p$, $F(u(p))$ is the concatenation of features (Section 2.3) from different layers of VGG-16 at the pixel to which $p$ projects under the known or predicted camera. With $p$ and $\bar{p}$ being symmetric about the canonical plane, the pixel feature of $p$ is one of the following:
Base: $F(u(p))$, which is of dimension 1472.
Symm(Near): $F(u(p))$ or $F(u(\bar{p}))$, whichever of $p$ and $\bar{p}$ has the smaller z-buffer value.
Symm(Avg): the average of $F(u(p))$ and $F(u(\bar{p}))$.
Symm(Concat): the concatenation of $F(u(p))$ and $F(u(\bar{p}))$.
As shown in Figure 4, the image feature is the output of VGG-16 (of dimension 1024). Two streams of point features are processed with two MLPs, each with layer sizes (64, 256, 512). Each stream is concatenated with the pixel feature and the image feature, respectively, to form a local and a global feature. The global and local features are each encoded through an MLP with layer sizes (512, 256, 1), and the two encoded values are added to give the predicted SDF at $p$.
3.3 Samplers Impact on Training
To assess the effect of different samplers on training, we set our pixel features to Base (see Section 3.2), use ground truth camera parameters and keep the batch size to 20.
In Table 1, we report the star discrepancy of different samplers in 2D. We first sample points using the Grid, Jitter, and Sobol samplers in $[0,1]^2$, then select 1024, 2048, or 4096 points uniformly at random or through FPS. In Jitter, each grid point is jittered with zero-mean Gaussian noise. We experimentally verify in 2D that Grid+FPS sampling has lower discrepancy and lower variance than Grid+Random.
Grid vs. Sobol:
The SDF validation accuracy of Sobol+FPS (0.914) is similar to that of Grid+FPS (0.917), and both are higher than Grid+Random (0.825). However, SDF prediction is an intermediate step for the reconstruction task. Marching cubes is used to recover the mesh from the SDF, which requires SDF values at grid points. Due to this grid restriction imposed by marching cubes, Grid sampling ensures better training/test data consistency. In addition, Grid+FPS and Grid+Random lead to more stable training results than Sobol+FPS due to a lower standard deviation. Our work advocates Grid+FPS as suitable for 3D reconstruction based on deep implicit fields and marching cubes.
In Table 2, we report a comparison of reconstructions using different samplers in terms of Chamfer distance (CD; for two point sets $S_1$ and $S_2$, CD is defined as $\sum_{x \in S_1} \min_{y \in S_2} \|x-y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x-y\|_2^2$) and Earth Mover's distance (EMD) . We see that Grid+FPS outperforms Grid+Random, Jitter+FPS, and Sobol+FPS. Jitter+FPS performs the worst, and its 2D analogue also has the highest star discrepancy. We observe that Grid+FPS reduces noisy phantom blocks around the mesh, and hence reduces the need for post-processing and cleaning. This property is highly desirable, because sometimes a cleaning algorithm cannot distinguish between small components and noise. In addition, Grid+FPS encourages faster training convergence, as shown in Figure 1.
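The CD metric can be sketched with a brute-force implementation of the common symmetric squared-distance formulation (real evaluation pipelines typically use a KD-tree for the nearest-neighbor queries):

```python
def chamfer_distance(s1, s2):
    """Symmetric Chamfer distance between two 3D point sets:
    sum of squared nearest-neighbor distances in both directions."""
    def sq(a, b):
        # squared Euclidean distance between two points
        return sum((x - y) ** 2 for x, y in zip(a, b))
    d12 = sum(min(sq(a, b) for b in s2) for a in s1)
    d21 = sum(min(sq(a, b) for a in s1) for b in s2)
    return d12 + d21
```

Identical point sets score zero, and the metric grows with both missing and spurious geometry, which is why it is a standard reconstruction measure.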
3.4 Effect of Feature Fusion Based on Symmetry
To analyze the effect of symmetry-based feature fusion, we choose the Grid+FPS sampling method. The batch size for this experiment is kept at 16.
In Table 3 and Figure 5, we compare the effects on ShapeNet reconstruction of the different feature fusion operations defined in Section 3.2. The ablation study shows that Symm(Near) and Symm(Concat) improve the reconstruction results, and that concatenating the features of a symmetric pair performs best. The reason is that Symm(Concat) makes better use of the additional information than Symm(Near) and Symm(Avg): when both $p$ and its symmetric point are visible in the image, the pixel features of both are helpful for recovering the local shape at $p$. We observe that Symm(Concat) is also able to produce reconstruction results for non-symmetric objects, as shown in Figure 6. It can be interpreted as adding the most promising additional local feature based on a symmetry prior.
3.5 Comparison with Other Methods
In this subsection, the sampling method is Grid+FPS and the pixel feature is Symm(Concat). Camera parameters are estimated using the network mentioned in Section 3.2.
We report comparisons with other state-of-the-art methods in terms of CD, EMD, and IoU. From Table 4, we see that Ladybird outperforms the other methods. Figure 7 shows a qualitative comparison of Ladybird with other methods; Ladybird is able to reconstruct high-quality meshes with fine geometric details from a single input image. Note that, due to the difference between the train/test split of OccNet  and ours, we evaluate OccNet  on the intersection of the two test sets.
Since ShapeNet is a synthetic dataset, we further provide a quantitative evaluation on Pix3D  (Table 5) and qualitative examples on in-the-wild images randomly selected from the internet (Figure 8). These results show that Ladybird generalizes well to natural images. For the experiment on Pix3D, we fine-tune Ladybird and DISN  (both pre-trained on ShapeNet) on the Pix3D train set, and use the ground truth camera poses and segmentation masks.
We studied the impact of sample set discrepancy on the training efficiency of implicit field regression networks, and proposed to use FPS instead of random sampling to select training points. We also proposed local feature fusion based on reflective symmetry to improve reconstruction quality. Qualitatively and quantitatively, we verified the effectiveness of our methods through extensive experiments on the large-scale ShapeNet dataset.
We would like to thank the anonymous reviewers for their helpful feedback and suggestions. We would like to thank Zilei Huang for his help in accelerating the data processing and debugging.
6.1 Validation accuracy
In Table 6, we report the SDF validation accuracy. The experimental setup is the same as that in Section 3.3, and our validation set consists of 6000 images. We see that Grid+FPS results in faster convergence and higher SDF validation accuracy.
6.2 Spectrum, more on discrepancy
FPS induces blue-noise behavior by construction: Gaussian Jitter+FPS gives a power spectrum with blue-noise characteristics (Figure 9). However, Jitter+FPS gives higher discrepancy than Grid+FPS and worse 3D reconstruction results. Generating good 3D blue-noise samples at the required resolution is computationally very expensive; hence we excluded blue-noise samplers in this work.
The discrepancy depends on the initial sample size, the final sample size, and their ratio. In Table 7, we report the star discrepancy (×0.01) of different samplers with varying initial sample size. In the original FPS paper , the authors give a deterministic bound on the distance between sample points (Theorem 4.2), which is used to prove that FPS is a uniform sampler. This analysis sheds some light on why FPS results in low discrepancy, as it could lead to a deterministic bound on the discrepancy.
6.3 Marching Cube at higher resolution
Using Ladybird configured as in Section 3.5, we run marching cubes at different resolutions. Due to the high memory and computation requirements at increased resolution, we only report CD for 100 objects randomly sampled from the ShapeNet test dataset. The results are summarized in Table 8.
The reconstruction quality of Ladybird is restricted by the input image resolution (currently 137x137). However, issues such as memory, speed, and compatibility with pre-trained image networks need to be considered when increasing the input resolution. We would like to address 3D reconstruction from high-resolution images in future work.
Since we need to spatially align the image to the mesh and utilize the corresponding local features, accurate camera pose is crucial to our method (Figure 10). A better camera pose estimation network will lead to significant improvement of our system.
-  Uni(corn|form) tool kit, https://utk-team.github.io/utk/
-  Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al.: Tensorflow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). pp. 265–283 (2016)
-  Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)
-  Chen, Z., Tagliasacchi, A., Zhang, H.: Bsp-net: Generating compact meshes via binary space partitioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 45–54 (2020)
-  Chen, Z., Zhang, H.: Learning implicit fields for generative shape modeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5939–5948 (2019)
-  Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In: European conference on computer vision. pp. 628–644. Springer (2016)
-  Eldar, Y., Lindenbaum, M., Porat, M., Zeevi, Y.Y.: The farthest point strategy for progressive image sampling. IEEE Transactions on Image Processing 6(9), 1305–1315 (1997)
-  Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3d object reconstruction from a single image. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 605–613 (2017)
-  Gao, L., Yang, J., Wu, T., Yuan, Y.J., Fu, H., Lai, Y.K., Zhang, H.: Sdm-net: Deep generative network for structured deformable mesh. ACM Transactions on Graphics (TOG) 38(6), 1–15 (2019)
-  Gkioxari, G., Malik, J., Johnson, J.: Mesh r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9785–9795 (2019)
-  Groueix, T., Fisher, M., Kim, V.G., Russell, B.C., Aubry, M.: Atlasnet: A papier-mâché approach to learning 3d surface generation. arXiv preprint arXiv:1802.05384 (2018)
-  Halton, J.H.: Algorithm 247: Radical-inverse quasi-random point sequence. Communications of the ACM 7(12), 701–702 (1964)
-  Joe, S., Kuo, F.Y.: Constructing sobol sequences with better two-dimensional projections. SIAM Journal on Scientific Computing 30(5), 2635–2654 (2008)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arxiv:1412.6980 (2014)
-  Kuipers, L., Niederreiter, H.: Uniform distribution of sequences. Courier Corporation (2012)
-  Li, K., Pham, T., Zhan, H., Reid, I.: Efficient dense point cloud object reconstruction using deformation vector fields. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 497–513 (2018)
-  Lin, C.H., Kong, C., Lucey, S.: Learning efficient point cloud generation for dense 3d object reconstruction. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
-  Liu, S., Zhang, Y., Peng, S., Shi, B., Pollefeys, M., Cui, Z.: Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. arXiv preprint arXiv:1911.13225 (2019)
-  Liu, S., Li, T., Chen, W., Li, H.: Soft rasterizer: A differentiable renderer for image-based 3d reasoning. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 7708–7717 (2019)
-  Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: Learning 3d reconstruction in function space. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4460–4470 (2019)
-  Mitchell, E., Engin, S., Isler, V., Lee, D.D.: Higher-order function networks for learning composable 3d object representations. arXiv preprint arXiv:1907.10388 (2019)
-  Niederreiter, H.: Low-discrepancy and low-dispersion sequences. Journal of number theory 30(1), 51–70 (1988)
-  Pan, J., Han, X., Chen, W., Tang, J., Jia, K.: Deep mesh reconstruction from single rgb images via topology modification networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9964–9973 (2019)
-  Pilleboue, A., Singh, G., Coeurjolly, D., Kazhdan, M., Ostromoukhov, V.: Variance analysis for monte carlo integration. ACM Trans. Graph. (Proc. SIGGRAPH) 34(4), 124:1–124:14 (2015)
-  Schlömer, T., Heck, D., Deussen, O.: Farthest-point optimized point sets with maximized minimum distance. In: Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics. pp. 135–142 (2011)
-  Singh, G., Öztireli, C., Ahmed, A.G., Coeurjolly, D., Subr, K., Deussen, O., Ostromoukhov, V., Ramamoorthi, R., Jarosz, W.: Analysis of sample correlations for Monte Carlo rendering. Computer Graphics Forum 38(2), 473–491 (2019)
-  Sun, X., Wu, J., Zhang, X., Zhang, Z., Zhang, C., Xue, T., Tenenbaum, J.B., Freeman, W.T.: Pix3d: Dataset and methods for single-image 3d shape modeling. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2974–2983 (2018)
-  Villani, C.: Optimal transport: old and new, vol. 338. Springer Science & Business Media (2008)
-  Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.G.: Pixel2mesh: Generating 3d mesh models from single rgb images. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 52–67 (2018)
-  Wang, W., Ceylan, D., Mech, R., Neumann, U.: 3dn: 3d deformation network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1038–1046 (2019)
-  Xie, H., Yao, H., Sun, X., Zhou, S., Zhang, S.: Pix2vox: Context-aware 3d reconstruction from single and multi-view images. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2690–2698 (2019)
-  Xu, H., Barbič, J.: Signed distance fields for polygon soup meshes. In: Proceedings of Graphics Interface 2014, pp. 35–41 (2014)
-  Xu, Q., Wang, W., Ceylan, D., Mech, R., Neumann, U.: Disn: Deep implicit surface network for high-quality single-view 3d reconstruction. arXiv preprint arXiv:1905.10711 (2019)
-  Yao, Y., Schertler, N., Rosales, E., Rhodin, H., Sigal, L., Sheffer, A.: Front2back: Single view 3d shape reconstruction via front to back prediction. arXiv preprint arXiv:1912.10589 (2019)