Single-image 3D reconstruction has a wide variety of applications including AR, robotics, and autonomous driving. For example, image-based 3D estimation can provide an initial guess to grasp targets for robots and to avoid obstacles for autonomous vehicles. Similarly, it can provide cues for inter-frame egomotion estimation as well as for superimposing synthetic objects into the real world.
Image-based 3D reconstruction is, in general, an ill-posed problem. Because a 2D image is a projection of the 3D scene, the recovery of the original 3D information from this degenerate observation is generally not unique. To resolve this ambiguity, past methods have leveraged constraints arising from projective geometry [23, 10, 1, 24], radiometric surface properties [3, 2, 12], and optical imaging properties [8, 18].
Recently, convolutional neural networks have been widely used to learn prior knowledge of the target 3D shape for these 3D shape estimation problems. By using a sufficient amount of paired training data of 3D shapes and their 2D appearances, recent works [7, 13, 29, 6, 16] have shown that 3D geometry recovery from a single image can be learned end-to-end.
In this paper, we introduce a novel single-image 3D shape recovery network that outputs a compact, analytical model with a number of advantages over conventional representations, namely point clouds, voxels, and mesh models. The key idea is to model the 3D shape with a Gaussian mixture (Figure 1). For this, 3D shape recovery is formulated as kernel density estimation, i.e., 3D shape reconstruction as estimation of a 3D probability distribution that generates the target point cloud. The resulting 3D Gaussian mixture shape representation significantly reduces memory footprint compared to voxel-based occupancy estimation approaches [6, 31] and also provides a straightforward means for defining the surface in contrast to unstructured point cloud representation [7, 13]. Also in sharp contrast to mesh-based shape representations, the Gaussian mixture model enables the network to adaptively refine the shape topology.
To recover the 3D shape from a single image as a 3D Gaussian mixture model, we introduce a novel deep neural network which we refer to as 3D-GMNet. Given a single image as an input and its corresponding 3D shape as the supervision, the proposed network learns to output a set of parameters that define the Gaussian mixture shape model. We propose two novel loss functions to train 3D-GMNet end-to-end. The first is the 3D Gaussian mixture loss, which evaluates the accuracy of the estimated Gaussian mixture model with respect to the target 3D shape. This is achieved by maximizing the likelihood of the Gaussian mixture, which in turn is evaluated by considering the target 3D points as samples from the true distribution. The second is the multi-view 2D loss, which evaluates the accuracy of the 2D projections of the Gaussian mixture to random viewpoints against the true silhouettes, i.e., the projections of the ground truth 3D shape to the same viewpoints.
The recovered 3D Gaussian mixture shape model has a number of advantages over conventional 3D shape representations. Most important, it is compact as it requires only the 3D mean and covariance for each mixture component. In addition, it admits a number of direct favorable applications, including automatic level-of-detail computation via Gaussian mixture reduction, pose estimation, and distance measurement. We demonstrate these properties, in particular in the context of pose alignment and level-of-detail computation, together with the effectiveness of 3D-GMNet with a number of synthetic and real images. The results show that 3D-GMNet recovers accurate 3D shape from a single image with a representation that opens a new avenue of use in fundamental applications.
2 Related Work
Learning-based 3D shape estimation studies can be categorized by their shape representations: voxel grids [6, 31], point clouds [7, 13], patches, mesh models [29, 14], primitive sets [19, 27, 21], or learned functions [17, 20].
Wu et al.  discretize the target 3D shape into a voxel grid and their neural network estimates the occupancy of each voxel. This is a memory-intensive approach, although it can handle 3D shapes of different topology in a unified manner.
Point cloud approaches represent the target 3D shape as a collection of 3D points on the object surface. Lin et al. propose a network that estimates multi-view depth-maps from a single image. They fuse the multi-view depth-maps and project them to 2D image planes to jointly optimize a depth and a silhouette loss function. Groueix et al. represent the target 3D shape by a collection of 3D patches. Although these representations are memory efficient, fusing multiple depth-maps or multiple patches into a single watertight 3D shape remains challenging.
Mesh-based approaches[29, 14] can make use of local connectivity of the 3D shape. Handling different topologies, however, becomes an inherently challenging task with meshes. Primitive-based approaches[19, 27, 21] represent the target 3D shape as a collection of simple objects such as cuboids or superquadrics. They can realize a compact representation of the target volume, but cannot represent smooth and fine structures by definition.
Mescheder et al. train the network as a nonlinear function representing the occupancy probability of the 3D object shape. For a given 3D position, the network infers the probability of being part of the object volume. This approach is highly scalable in resolution. Compared with voxel-grid approaches, whose networks are trained for specific voxel resolutions, it can sample 3D points as densely as required. On the other hand, it is a computation-intensive shape representation since the network must infer the probability for each and every sample.
Our Gaussian mixture-based representation has the advantages of these representations. It models not only the surface points but also the interior of the volume, with an efficient parameterization, i.e., a set of Gaussian parameters, and can generate a watertight 3D surface of arbitrary resolution as its isosurface. This can be seen as a probability density approach with Gaussian distributions, and also as a primitive-based approach with Gaussians as primitives. Additionally, in contrast to cuboid-based approaches, our shape representation can realize 3D registration with a simple algorithm as described in Section 4.2.
Neural Networks for Mixture Density Estimation
A mixture density network predicts a target multimodal distribution as a mixture density. Bishop introduced this network architecture with isotropic Gaussian basis functions. Williams extended it to utilize a general multivariate Gaussian distribution as the basis function. As described in Section 3.2, our density estimation network is inspired by these works.
When training deep neural networks for 3D shape estimation, most existing studies use combinations of a 3D shape loss, a multi-view 2D loss and an adversarial loss. The 3D shape loss evaluates the difference between the predicted and the target shapes. Choy et al.  used the cross entropy for a voxel-grid representation. Chamfer distance can measure the distance to the target shape from point clouds[7, 13] or mesh models. When using mesh representations, consistencies of surface normals and edge lengths in 3D can also be used.
Multi-view 2D consistency evaluates the difference between the 2D projection of the predicted shape and its ground truth such as silhouettes [33, 14, 13] or depth-maps. Adversarial training is also used to make the predicted shapes plausible. Jiang et al. proposed a discriminator that classifies whether the 3D shape is a predicted one or a true one based on semantic features extracted from the 3D shape and the 2D input image. In this paper, we introduce 3D shape and multi-view 2D losses that take full advantage of statistical and geometric properties of 3D Gaussian distributions.
3 3D Gaussian Mixture Network
Figure 2 shows an overview of our 3D-GMNet. Given a 2D image of an object, our 3D-GMNet estimates a set of parameters that defines a Gaussian mixture that best represents the 3D shape of the object in the input image.
In this section, we first introduce the Gaussian-mixture 3D shape representation (Section 3.1), and then the proposed network for estimating its parameters from an image (Section 3.2) using a 3D shape loss (Section 3.3) and a multi-view 2D loss (Section 3.4).
3.1 3D Shape as A Probability Density Function
Our key idea is to consider the target 3D volume as a collection of observations of a random variable with a Gaussian mixture distribution. Suppose a ground-truth 3D shape is given as a 3D point cloud. We assume that this 3D point cloud samples the object volume, which can easily be computed from the 3D models in the training data. As such, we may also regard the points as voxels. Each voxel is a sample of the random variable, and our goal when training the network is to estimate the Gaussian mixture distribution that best describes these samples.
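We assume that the density is uniform in the object and normalized. The density function is described as
$$p(\mathbf{x}) = \begin{cases} 1/V & \text{if } \mathbf{x} \text{ is inside the object}, \\ 0 & \text{otherwise}, \end{cases}$$
where $\mathbf{x}$ is a 3D position and $V$ is the volume of the object.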
3D Gaussian Mixture Distribution
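We use a 3D Gaussian mixture to approximate the true distribution $p(\mathbf{x})$. A 3D Gaussian distribution is defined as
$$\phi(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{3/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}\Delta^2\right),$$
where $\boldsymbol{\mu}$ is the mean, $\Sigma$ is the covariance matrix and
$$\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}).$$
A 3D Gaussian mixture distribution is defined as
$$f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \phi(\mathbf{x} \mid \boldsymbol{\mu}_k, \Sigma_k), \tag{4}$$
where $K$ is the number of mixture components and $\pi_k$ are the mixing coefficients that satisfy
$$\sum_{k=1}^{K} \pi_k = 1, \qquad 0 \le \pi_k \le 1. \tag{5}$$
Gaussian mixtures can approximate various kinds of distributions with an appropriate $K$.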
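As a minimal illustration of the Gaussian mixture density of Eq. (4), the following NumPy sketch evaluates a small two-component mixture at a few query points. The function names (`gaussian_pdf`, `gmm_pdf`) and the example parameters are hypothetical and chosen only for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a single 3D Gaussian at points x of shape (n, 3)."""
    d = x - mean
    prec = np.linalg.inv(cov)
    maha = np.einsum("ni,ij,nj->n", d, prec, d)  # squared Mahalanobis distance
    norm = np.sqrt((2.0 * np.pi) ** 3 * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def gmm_pdf(x, weights, means, covs):
    """Mixture density: weighted sum of the component densities."""
    return sum(w * gaussian_pdf(x, m, c) for w, m, c in zip(weights, means, covs))

# Hypothetical two-component mixture (weights sum to one).
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
covs = np.array([np.eye(3) * 0.05, np.diag([0.1, 0.02, 0.02])])

query = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
density = gmm_pdf(query, weights, means, covs)
print(density)  # high near the component means, practically zero far away
```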
Surface Generation from 3D Gaussian Mixture distribution
Once the density function is obtained, we can generate the 3D surface as follows.
Assume that we know the volume $V$ of the object, though it is not available in reality. The object surface is given as the isosurface of the estimated density $f(\mathbf{x})$ at
$$f(\mathbf{x}) = \frac{t}{V},$$
where the parameter $t$ decides the level of thresholding. We approximate the unknown $1/V$ by the expectation of the density
$$E[f(\mathbf{x})] = \int f(\mathbf{x})^2 \, d\mathbf{x},$$
since $E[f(\mathbf{x})] = 1/V$ holds if $f(\mathbf{x})$ is identical to the true distribution $p(\mathbf{x})$. The parameter $t$ is determined experimentally in the evaluations in Section 4.
By thresholding the target space with this value, we obtain the volumetric representation of the 3D shape, which can then be converted to a surface model using the marching cubes algorithm .
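The thresholding step can be sketched as follows, under stated assumptions: the density is evaluated on a regular grid, the expectation of the density is approximated by a Riemann sum of the squared density over that grid, and the occupancy is obtained by thresholding at a fraction `t` of that expectation. The grid extent, resolution, the value `t=0.5`, and all function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def gmm_pdf(x, weights, means, covs):
    """Gaussian mixture density at points x of shape (n, 3)."""
    total = np.zeros(len(x))
    for w, m, c in zip(weights, means, covs):
        d = x - m
        maha = np.einsum("ni,ij,nj->n", d, np.linalg.inv(c), d)
        total += w * np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** 3 * np.linalg.det(c))
    return total

def occupancy_grid(weights, means, covs, t=0.5, res=32, half_extent=1.0):
    """Boolean voxel grid: True where the density exceeds t times its expectation."""
    axis = np.linspace(-half_extent, half_extent, res)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
    f = gmm_pdf(grid, weights, means, covs)
    cell_volume = (axis[1] - axis[0]) ** 3
    expected_density = np.sum(f * f) * cell_volume  # approximates the integral of f(x)^2
    return (f >= t * expected_density).reshape(res, res, res)

# Hypothetical single-component "blob" centered at the origin.
occ = occupancy_grid(np.array([1.0]), np.zeros((1, 3)), np.array([np.eye(3) * 0.04]))
print(occ.sum(), "of", occ.size, "voxels are inside")
```

The resulting boolean grid could then be handed to a marching cubes implementation (for example the one in scikit-image) to extract a surface mesh; that step is omitted here to keep the sketch dependency-free.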
3.2 Network Architecture
Our 3D-GMNet outputs the set of parameters of the Gaussian mixture model, $\{\pi_k, \boldsymbol{\mu}_k, \Sigma_k\}_{k=1}^{K}$. As Figure 2 depicts, it has an encoder module to predict these parameters and a projection module to render multi-view 2D silhouettes.
The numbers in the colored blocks in Figure 2 show the details of the encoder. The encoder is composed of 5 convolutional layers, 5 max pooling layers of kernel size 2, and 3 fully-connected layers. Each of the convolutional layers is followed by a batch normalization layer and a leaky ReLU activation layer. Each fully-connected layer except the last one is also followed by a leaky ReLU layer.
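The mean $\boldsymbol{\mu}_k$ should be a 3D position in Euclidean space $\mathbb{R}^3$, and we use an identity mapping for $\boldsymbol{\mu}_k$ as
$$\boldsymbol{\mu}_k = \mathbf{a}^{\mu}_k,$$
where $\mathbf{a}^{\mu}_k$ is the corresponding output of the last layer.
In order to ensure that the coefficient $\pi_k$ satisfies Eq. (5), the output layer applies the softmax activation as
$$\pi_k = \frac{\exp(a^{\pi}_k)}{\sum_{j=1}^{K} \exp(a^{\pi}_j)}.$$
The precision matrix $\Sigma_k^{-1}$ of a Gaussian component should be a symmetric positive definite matrix, and can be decomposed as $\Sigma_k^{-1} = L_k L_k^\top$ using Cholesky decomposition where $L_k$ is a lower triangular matrix. Thus our network predicts $L_k$ instead of $\Sigma_k^{-1}$ where
$$(L_k)_{ij} = \begin{cases} \exp\left(a^{L}_{k,ij}\right) & \text{if } i = j, \\ a^{L}_{k,ij} & \text{if } i > j, \\ 0 & \text{otherwise}, \end{cases}$$
where $a^{L}_{k,ij}$ is the corresponding output of the last layer. Notice that this enforces the diagonal entries $(L_k)_{ii}$ to be positive so that the mapping from $L_k$ to $\Sigma_k^{-1}$ is bijective.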
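The mapping from raw network outputs to valid mixture parameters can be sketched as below. This is only an illustration under assumptions: each component is assumed to receive 10 raw numbers (1 mixing logit, 3 mean coordinates, 6 lower-triangular entries), and the diagonal of the Cholesky factor is made positive with an exponential, which is one common choice. The layout, the exponential, and the name `decode_gmm_params` are hypothetical.

```python
import numpy as np

def decode_gmm_params(raw):
    """Map raw outputs of shape (K, 10) to weights (K,), means (K, 3), precisions (K, 3, 3)."""
    logits, means, tri = raw[:, 0], raw[:, 1:4], raw[:, 4:10]

    # Softmax over components so that the mixing coefficients sum to one.
    z = np.exp(logits - logits.max())
    weights = z / z.sum()

    # Fill the lower triangle of L, then exponentiate the diagonal (assumed choice).
    K = raw.shape[0]
    L = np.zeros((K, 3, 3))
    rows, cols = np.tril_indices(3)
    L[:, rows, cols] = tri
    diag = np.arange(3)
    L[:, diag, diag] = np.exp(L[:, diag, diag])

    # Precision matrix of each component: L L^T (symmetric positive definite).
    precisions = L @ np.transpose(L, (0, 2, 1))
    return weights, means, precisions

rng = np.random.default_rng(0)
weights, means, precisions = decode_gmm_params(rng.normal(size=(4, 10)))
print(weights.sum())                            # sums to one up to rounding
print(np.linalg.eigvalsh(precisions).min() > 0)
```

Predicting the Cholesky factor instead of the precision matrix itself means that any real-valued network output corresponds to a valid Gaussian, so no constraint has to be enforced during optimization.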
3.3 3D Gaussian Mixture Loss
As introduced in the last section, we assume that the target 3D shape is given as voxels. To quantitatively evaluate the fit of the estimated Gaussian mixture to these voxels, we use the Kullback-Leibler (KL) divergence.
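In general, KL divergence between the target density $p(\mathbf{x})$ and the predicted density $f(\mathbf{x})$ is defined as
$$D_{\mathrm{KL}}(p \,\|\, f) = \int p(\mathbf{x}) \log \frac{p(\mathbf{x})}{f(\mathbf{x})} \, d\mathbf{x} = \int p(\mathbf{x}) \log p(\mathbf{x}) \, d\mathbf{x} - \int p(\mathbf{x}) \log f(\mathbf{x}) \, d\mathbf{x}.$$
Since $\int p(\mathbf{x}) \log p(\mathbf{x}) \, d\mathbf{x}$ is constant, minimization of $D_{\mathrm{KL}}(p \,\|\, f)$ is equivalent to minimization of
$$-\int p(\mathbf{x}) \log f(\mathbf{x}) \, d\mathbf{x}.$$
By considering the target voxels as observations from the true distribution, we can compute this efficiently with Monte Carlo sampling
$$\mathcal{L}_{\mathrm{3D}}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log f(\mathbf{x}_i \mid \theta),$$
where $\theta$ is the output of our network, $\mathbf{x}_i$ is a sample from the target density and $N$ is the number of sampled points. In our training, we randomly sampled a fixed number of 3D voxels from the original target voxels for each mini batch.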
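The Monte Carlo estimate above amounts to an average negative log-likelihood of the sampled voxels under the predicted mixture. The sketch below is a plain NumPy illustration (not the authors' training code): it assumes components parameterized by precision matrices and uses a log-sum-exp for numerical stability. The function names and the toy target (points sampled uniformly in a unit cube) are hypothetical.

```python
import numpy as np

def gmm_log_pdf(x, weights, means, precisions):
    """log f(x) for points x (N, 3) under a mixture given by precision matrices."""
    d = x[:, None, :] - means[None, :, :]                      # (N, K, 3)
    maha = np.einsum("nki,kij,nkj->nk", d, precisions, d)      # (N, K)
    _, logdet = np.linalg.slogdet(precisions)                  # (K,)
    log_comp = np.log(weights) + 0.5 * logdet - 1.5 * np.log(2 * np.pi) - 0.5 * maha
    m = log_comp.max(axis=1, keepdims=True)                    # log-sum-exp trick
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True)))[:, 0]

def gmm_nll_loss(samples, weights, means, precisions):
    """Monte Carlo estimate of the cross-entropy term: mean negative log-likelihood."""
    return -gmm_log_pdf(samples, weights, means, precisions).mean()

rng = np.random.default_rng(0)
samples = rng.uniform(-0.5, 0.5, size=(2000, 3))               # toy "voxels" of a unit cube

w = np.array([1.0])
prec = np.array([np.eye(3) * 12.0])                            # covariance I/12 matches the cube
loss_good = gmm_nll_loss(samples, w, np.array([[0.0, 0.0, 0.0]]), prec)
loss_bad = gmm_nll_loss(samples, w, np.array([[2.0, 0.0, 0.0]]), prec)
print(loss_good, loss_bad)  # the well-placed Gaussian yields the lower loss
```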
We also use
so that Gaussian components are distributed within a distance from the object center. In this paper we use to cover the entire object space.
3.4 Multi-view 2D Loss
We also introduce a novel multi-view 2D loss for a 3D Gaussian mixture that evaluates the consistency of its 2D projection with a given silhouette.
3.4.1 Para-perspective Projection of a Gaussian Mixture
To generate a silhouette of a 3D Gaussian mixture of Eq. (4), we use para-perspective projection for each mixture component since it projects a 3D Gaussian as a 2D Gaussian. As a result, we obtain a 2D Gaussian mixture as a projection of our 3D Gaussian mixture shape representation. Note that perspective projection does not result in a Gaussian due to its nonlinearity.
Since we are only interested in object shape recovery, we can safely assume that the camera pose with respect to the object is defined by rotation about its center. Suppose we project a 3D Gaussian mixture defined in the world coordinate system to a camera of pose $R$. A 3D Gaussian rotated by $R$ is given by
$$\phi(\mathbf{x} \mid R\boldsymbol{\mu}_k, R\Sigma_k R^\top), \tag{16}$$
where $R$ is the rotation matrix transforming from the world coordinate system to the camera local coordinate system.
Para-perspective projection first defines a 3D plane located at the centroid of the object and parallel to the image plane. Since we project each Gaussian component independently, the centroid is identical to the mean of each Gaussian .
Suppose an oblique coordinate system centered at and whose and axes are identical to the original Euclidean system but its third axis is the direction from the camera center (i.e., the origin of the camera coordinate system) to the centroid. Transforming a 3D point in the camera coordinate system to a 3D point in this oblique system can then be described by
where and . Therefore, the Gaussian of Eq. (16) is transformed to a Gaussian of parameters
and the parallel projection to the plane is given by marginalizing this Gaussian in the direction
where denotes the element of . By applying a 2D affine transform from the plane to the image plane, we obtain the 2D Gaussian on the image plane as
where is the depth of the centroid, i.e., the element of .
As a result, the para-perspective projection of a 3D Gaussian of Eq. (16) is given by
where denotes a 2D Gaussian of the form
This para-perspective projection is differentiable and denoted as the projection module in Figure 2.
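The steps above (shear along the camera-to-centroid direction onto the plane through the centroid, marginalization along that direction, and scaling by the centroid depth) can be folded into a single 2x3 linear map, which coincides with the first-order expansion of perspective projection around the Gaussian mean. The sketch below assumes a pinhole camera at the origin looking along +z with the principal point at the image origin; the focal length parameter `focal` and the function name are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def paraperspective_project(mean, cov, focal=1.0):
    """Project a camera-frame 3D Gaussian (positive depth) to a 2D image-plane Gaussian."""
    mx, my, mz = mean
    # Shear along the viewing direction of the centroid onto the plane z = mz,
    # drop the third coordinate, then scale uniformly by focal / mz.
    A = (focal / mz) * np.array([[1.0, 0.0, -mx / mz],
                                 [0.0, 1.0, -my / mz]])
    mean2d = focal * np.array([mx, my]) / mz   # the centroid projects perspectively
    cov2d = A @ cov @ A.T                      # linear map of a Gaussian stays Gaussian
    return mean2d, cov2d

# On-axis Gaussian: the projection reduces to scaling the x-y block by (focal / depth)^2.
m_on, c_on = paraperspective_project(np.array([0.0, 0.0, 2.0]), np.diag([0.04, 0.09, 0.5]))
# Off-axis Gaussian: depth uncertainty now leaks into the image-plane covariance.
m_off, c_off = paraperspective_project(np.array([1.0, 0.0, 2.0]), np.diag([0.04, 0.09, 0.5]))
print(m_on, c_on)
print(m_off, c_off)
```

Because the map is linear in the 3D point, the projected covariance is a smooth function of the mixture parameters, which is what makes such a projection usable inside a network trained by backpropagation.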
3.4.2 Silhouette Loss
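Given a random sampling of $M$ points from the 2D probability density function $f_{\mathrm{2D}}(\mathbf{u})$ of Eq. (22), the probability of observing at least one of the $M$ points at a pixel position $\mathbf{u}$ is given by
$$S(\mathbf{u}) = 1 - \left(1 - f_{\mathrm{2D}}(\mathbf{u})\right)^{M}.$$
By approximating the silhouette generated from the probability density function by this $S(\mathbf{u})$, we define an L2 loss
$$\mathcal{L}_{\mathrm{sil}} = \sum_{\mathbf{u}} \left( S(\mathbf{u}) - \hat{S}(\mathbf{u}) \right)^2,$$
where $\hat{S}(\mathbf{u})$ is the ground truth silhouette, as our silhouette loss. In the experiments, we determined $M$ using validation data ().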
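A rough sketch of this pseudo-silhouette and its L2 loss is given below. It rests on several assumptions that are not stated in the text: the per-pixel probability is taken as the 2D density times the pixel area (clipped to [0, 1]), the image plane is a square of half-width `half_extent`, and the loss is averaged over pixels. The number of points (2000), the resolution, and the function names are placeholders.

```python
import numpy as np

def gmm2d_pdf(uv, weights, means, covs):
    """2D Gaussian mixture density at pixel centers uv of shape (n, 2)."""
    total = np.zeros(len(uv))
    for w, m, c in zip(weights, means, covs):
        d = uv - m
        maha = np.einsum("ni,ij,nj->n", d, np.linalg.inv(c), d)
        total += w * np.exp(-0.5 * maha) / (2 * np.pi * np.sqrt(np.linalg.det(c)))
    return total

def pseudo_silhouette(weights, means, covs, n_points, res=64, half_extent=1.0):
    """Probability that at least one of n_points samples lands in each pixel."""
    axis = np.linspace(-half_extent, half_extent, res)
    uv = np.stack(np.meshgrid(axis, axis, indexing="xy"), axis=-1).reshape(-1, 2)
    pixel_area = (axis[1] - axis[0]) ** 2
    p = np.clip(gmm2d_pdf(uv, weights, means, covs) * pixel_area, 0.0, 1.0)
    return (1.0 - (1.0 - p) ** n_points).reshape(res, res)

def silhouette_loss(pred, target):
    """Mean squared difference between predicted and target silhouettes."""
    return np.mean((pred - target) ** 2)

sil = pseudo_silhouette(np.array([1.0]), np.array([[0.0, 0.0]]),
                        np.array([np.eye(2) * 0.05]), n_points=2000)
target = (sil > 0.5).astype(float)   # stand-in for a ground truth binary mask
print(silhouette_loss(sil, target))
```

Larger point counts push the soft silhouette toward a binary mask, so this count effectively controls how sharp the rendered boundary is.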
As Figure 2 illustrates, given a 2D image of a 3D object as the input, the 3D-GMNet estimates the parameters of a 3D Gaussian mixture that represents the 3D shape of the target in the camera coordinate system. For this, the target voxels used in the 3D Gaussian mixture loss (Section 3.3) are transformed according to the ground truth camera pose of the input image beforehand. Once the GMM parameters are estimated, the network renders 2D silhouettes from random viewpoints and evaluates the multi-view 2D loss (Section 3.4). This process also requires ground truth camera poses in addition to their ground truth silhouettes. In the experiments, we use .
4 Experimental Results
We evaluate our method for the task of class-specific single-image 3D reconstruction. We first describe the data and metrics used for our experiments and then detail quantitative evaluation on synthetic and real images. In addition, we show two applications of our method, 3D pose alignment of reconstructed 3D shapes and automatic level-of-detail computation, both of which demonstrate advantages of a 3D Gaussian mixture shape representation that conventional methods do not have.
To conduct thorough quantitative evaluation, we synthesized multi-view images and 3D point clouds of 3D models in ShapeNet. Each 3D model in ShapeNet has a polygon CAD model and its volume data. For each polygon model, multi-view RGB images are rendered from 100 random viewpoints at a unit distance from the model using a tessellated icosahedron. To normalize the apparent size of the model in the rendered images, the distance is adjusted on a per-model basis as the diagonal size of the object bounding box. The virtual camera is configured as resolution and field-of-view. We train and evaluate our network for 4 object categories, namely Chair, Car, Airplane and Table.
For real images, we evaluate our method using real chair images in Pix3D Dataset. We remove images in which the object is partly in the image or occluded by other objects for simplicity. Occlusion handling will be addressed in future work. We resize and crop images using manually annotated 2D masks.
|3D + MV||0.0842||0.0889||0.482|
|3D + MV||0.0468||0.558||0.317|
Following , we use three metrics for evaluation: intersection over union (IoU), earth mover’s distance (EMD), and chamfer distance (CD). IoU evaluates the coverage of the estimated volume w.r.t. the ground truth volume, using the Gaussian mixture voxelization in Section 3.1. Higher IoU means better reconstruction results. EMD and CD evaluate geodesic and shortest distances between two surfaces via point clouds sampled on them. As described in , we uniformly sampled points on the estimated and the ground truth surface to generate a dense point cloud, and then randomly sampled points from the point cloud. They are scaled to fit a unit cube for normalization for EMD and CD calculation. We used the implementation by Sun et al. .
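For orientation, two of these metrics can be sketched in a few lines of NumPy. This is not the implementation by Sun et al. used in the experiments: chamfer distance conventions differ between codebases (squared versus unsquared distances, sum versus mean), and the variant below (sum of the two mean nearest-neighbor distances, brute force) is only one possible choice. EMD is omitted because it requires solving an assignment problem.

```python
import numpy as np

def voxel_iou(pred, target):
    """Intersection over union of two boolean occupancy grids."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    return np.logical_and(pred, target).sum() / union if union > 0 else 1.0

def chamfer_distance(a, b):
    """Symmetric chamfer distance between point sets a (Na, 3) and b (Nb, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

# Two 4x4x4 grids whose occupied slabs overlap in one of three slices.
grid_a = np.zeros((4, 4, 4), dtype=bool)
grid_b = np.zeros((4, 4, 4), dtype=bool)
grid_a[:2] = True
grid_b[1:3] = True
iou = voxel_iou(grid_a, grid_b)

cd = chamfer_distance(np.array([[0.0, 0.0, 0.0]]), np.array([[1.0, 0.0, 0.0]]))
print(iou, cd)
```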
To further analyze the effect of silhouette loss, we generate silhouette images of the reconstructed 3D shape. For quantitative evaluation, we use , , and . is the pixel-wise squared error of the silhouette images. is the average distance to the ground truth silhouette for each pixel in the generated silhouette. is the one from the ground truth to the generated silhouette.
We use the Adam optimizer with learning rate of . The mini batch size is set to . Training loss is averaged in each mini batch. We use of the 3D models in ShapeNet for training and for validation. The rest of is kept for evaluation.
4.1 Single Image 3D Reconstruction
Figure 3 shows predicted 3D models. Given an input image shown in the first and fourth columns from left, our 3D-GMNet estimates its 3D shape as a 3D Gaussian mixture. The rest of the columns show renderings of the estimated 3D Gaussian mixture shape as a surface model (Section 3.1) from different viewpoints. The renderings from the novel views demonstrate qualitatively that the proposed 3D-GMNet can estimate the full 3D shape from a single image successfully.
Contribution of Silhouette Loss
Table 1 shows quantitative evaluation of multi-view silhouette loss. The results show that the silhouette loss reduces 3D shape reconstruction errors.
Figure 4 shows silhouettes of reconstructed 3D shapes with and without the silhouette loss. Table 2 shows quantitative evaluation of the silhouette. Though the silhouette loss is computed by using pseudo silhouettes, the quality of the actual silhouette is improved. In particular, the completeness () improves drastically.
Number of Gaussian Components
3D Reconstruction from Real Images
4.2 3D Pose Alignment
As described in Section 3.5, our 3D-GMNet generates the shape in the local coordinate system of the input camera. This suggests that, given two images of a single object from different viewpoints, our 3D-GMNet generates 3D shapes described in different coordinate systems, and hence we can estimate the relative pose of the cameras by aligning the estimated 3D shapes.
A challenge here is that the assignment of Gaussian components is view-dependent even when representing the same 3D object shape from different views, so we cannot directly consider component-wise correspondences. This is because our loss functions only require the Gaussian mixture to represent the object shape as a whole, and the network can freely change the assignment of each Gaussian component for each image.
We solve this by aligning the covariance matrices of Gaussian mixtures from different viewpoints. The covariance matrix of a Gaussian mixture (Eq. (4)) is given by
$$\Sigma = \sum_{k=1}^{K} \pi_k \left( \Sigma_k + (\boldsymbol{\mu}_k - \bar{\boldsymbol{\mu}})(\boldsymbol{\mu}_k - \bar{\boldsymbol{\mu}})^\top \right), \qquad \bar{\boldsymbol{\mu}} = \sum_{k=1}^{K} \pi_k \boldsymbol{\mu}_k.$$
We then estimate the rotation via diagonalization, with the eigenvectors sorted in descending order according to the magnitude of the corresponding eigenvalues. The correspondence based on the order of eigenvalues has a sign ambiguity on each of the eigenvectors. To find the correct signs, we evaluate the rotation that minimizes the distance between two Gaussian mixtures:
Figure 8 shows the alignment results. The results show that our method can provide reasonable pose alignments without explicit point cloud generation.
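The alignment procedure can be illustrated with the following sketch. It makes several assumptions beyond the text: the two mixtures differ only by a rotation about the origin, the eigenvalues of the mixture covariance are distinct, and the sign ambiguity is resolved with the closed-form L2 distance between Gaussian mixtures (the specific distance used in the paper is not recoverable here, so this is a stand-in). All function names and the toy mixture are hypothetical.

```python
import itertools
import numpy as np

def mixture_covariance(weights, means, covs):
    mu = weights @ means
    d = means - mu
    return np.einsum("k,kij->ij", weights, covs) + np.einsum("k,ki,kj->ij", weights, d, d)

def sorted_eigvecs(S):
    vals, vecs = np.linalg.eigh(S)
    return vecs[:, np.argsort(vals)[::-1]]      # columns in descending eigenvalue order

def gauss(x, mean, cov):
    d = x - mean
    return np.exp(-0.5 * d @ np.linalg.solve(cov, d)) / np.sqrt((2 * np.pi) ** 3 * np.linalg.det(cov))

def cross_term(g1, g2):
    """Closed-form integral of the product of two Gaussian mixtures."""
    (w1, m1, c1), (w2, m2, c2) = g1, g2
    return sum(w1[i] * w2[j] * gauss(m1[i], m2[j], c1[i] + c2[j])
               for i in range(len(w1)) for j in range(len(w2)))

def l2_distance(g1, g2):
    return cross_term(g1, g1) - 2 * cross_term(g1, g2) + cross_term(g2, g2)

def rotate(g, R):
    w, m, c = g
    return w, m @ R.T, R @ c @ R.T

def align(g_src, g_dst):
    """Rotation R such that rotate(g_src, R) best matches g_dst."""
    U = sorted_eigvecs(mixture_covariance(*g_src))
    V = sorted_eigvecs(mixture_covariance(*g_dst))
    best = None
    for signs in itertools.product([1.0, -1.0], repeat=3):
        R = V @ np.diag(signs) @ U.T
        if np.linalg.det(R) < 0:                # keep proper rotations only
            continue
        cost = l2_distance(rotate(g_src, R), g_dst)
        if best is None or cost < best[0]:
            best = (cost, R)
    return best[1]

# Toy asymmetric mixture and a known rotation to recover.
g = (np.array([0.5, 0.3, 0.2]),
     np.array([[0.8, 0.1, 0.0], [-0.3, 0.4, 0.1], [0.0, -0.2, 0.5]]),
     np.array([np.diag([0.05, 0.02, 0.01]), np.eye(3) * 0.03, np.diag([0.01, 0.04, 0.02])]))
cz, sz, cx, sx = np.cos(0.7), np.sin(0.7), np.cos(-0.4), np.sin(-0.4)
R_true = (np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
          @ np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]]))
R_est = align(g, rotate(g, R_true))
print(np.round(R_est - R_true, 6))
```

Only four of the eight sign combinations yield a proper rotation, so the search over signs is cheap and no point cloud needs to be sampled from either mixture.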
4.3 Level-of-Detail Computation
Figure 9 demonstrates a level-of-detail visualization of Gaussian mixture representation. Given an input image (the leftmost column), our network with infers a Gaussian mixture representation of its 3D shape (the second column). By applying Gaussian mixture reduction, we obtain level-of-detail visualizations as shown in the second and the fourth rows. Compared with the 3D shapes estimated by 3D-GMNet originally trained with the corresponding number of components (the first and the third rows), this automatic computation yields reasonable approximations.
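To make the reduction step concrete, the following sketch greedily merges the pair of components with the smallest merge cost until the desired number of components remains. It follows the general idea of KL-based Gaussian mixture reduction (moment-preserving merges, with a cost that bounds the KL divergence incurred by a merge), but it is a simplified illustration written for this text rather than the procedure used in the experiments; the function names and the toy mixture are hypothetical.

```python
import numpy as np

def merge(w1, m1, c1, w2, m2, c2):
    """Moment-preserving merge of two weighted Gaussian components."""
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    d = m1 - m2
    c = (w1 * c1 + w2 * c2) / w + (w1 * w2 / w ** 2) * np.outer(d, d)
    return w, m, c

def merge_cost(w1, m1, c1, w2, m2, c2):
    """Upper bound on the KL discrepancy caused by merging the two components."""
    w, _, c = merge(w1, m1, c1, w2, m2, c2)
    logdet = lambda a: np.linalg.slogdet(a)[1]
    return 0.5 * (w * logdet(c) - w1 * logdet(c1) - w2 * logdet(c2))

def reduce_gmm(weights, means, covs, target_k):
    """Greedily merge the cheapest pair until target_k components remain."""
    comps = list(zip(weights, means, covs))
    while len(comps) > target_k:
        pairs = [(merge_cost(*comps[i], *comps[j]), i, j)
                 for i in range(len(comps)) for j in range(i + 1, len(comps))]
        _, i, j = min(pairs, key=lambda p: p[0])
        merged = merge(*comps[i], *comps[j])
        comps = [c for k, c in enumerate(comps) if k not in (i, j)] + [merged]
    w, m, c = zip(*comps)
    return np.array(w), np.array(m), np.array(c)

# Four components forming two tight pairs; reducing to two merges each pair.
weights = np.full(4, 0.25)
means = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 0.0, 0.0], [5.1, 0.0, 0.0]])
covs = np.array([np.eye(3) * 0.01] * 4)
w2, m2, c2 = reduce_gmm(weights, means, covs, target_k=2)
print(w2, m2[:, 0])
```

Each merge preserves the overall mean and covariance of the mixture, so the coarse levels keep the global extent of the shape while discarding fine detail.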
We proposed 3D-GMNet for single image 3D object shape reconstruction with a Gaussian mixture representation. The proposed network utilizes the 3D loss and the multi-view 2D loss to evaluate the fitting of the estimated Gaussian mixture to the 3D ground truth shape and its 2D silhouettes. Experimental results show that our 3D-GMNet successfully estimates the object 3D shape as a compact Gaussian mixture representation, even with a lower number of components, while maintaining reconstruction accuracy comparable to state-of-the-art results.
This work was in part supported by JSPS KAKENHI 17K20143 and JST PRESTO JPMJPR1858.
- [1] (2011) Building Rome in a day. Communications of the ACM 54 (10), pp. 105–112.
- [2] (2006) What is the range of surface reconstructions from a gradient field?. In Proc. ECCV, pp. 578–591.
- [3] (2008) Photometric stereo with non-parametric and spatially-varying reflectance. In Proc. CVPR, pp. 1–8.
- [4] (1994) Mixture density networks. Technical report, Aston University.
- [5] (2015) Shapenet: an information-rich 3d model repository. arXiv:1512.03012.
- [6] (2016) 3D-r2n2: a unified approach for single and multi-view 3d object reconstruction. In Proc. ECCV.
- [7] (2017-07) A point set generation network for 3d object reconstruction from a single image. In Proc. CVPR.
- [8] (2008) Shape from defocus via diffusion. TPAMI 30 (3), pp. 518–531.
- [9] (2018) AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In Proc. CVPR.
- [10] (2010) The structure-from-motion reconstruction pipeline - a survey with focus on short image sequences. Kybernetika 46 (5), pp. 926–937.
- [11] Multiple view geometry in computer vision. 2nd edition, Cambridge University Press, New York, NY, USA.
- [12] (2007) Non-rigid photometric stereo with colored lights. In Proc. ICCV, pp. 1–8.
- [13] (2018) GAL: geometric adversarial loss for single-view 3d-object reconstruction. In Proc. ECCV.
- [14] (2018) Neural 3d mesh renderer. In Proc. CVPR.
- [15] (2003) Efficient implementation of marching cubes’ cases with topological guarantees. Journal of graphics tools 8 (2), pp. 1–15.
- [16] (2018) Learning efficient point cloud generation for dense 3d object reconstruction. In Proc. AAAI.
- [17] (2019) Occupancy networks: learning 3d reconstruction in function space. In Proc. CVPR.
- [18] (1994) Shape from focus. TPAMI (8), pp. 824–831.
- [19] (2018) Im2Struct: recovering 3d shape structure from a single rgb image. In Proc. CVPR.
- [20] (2019) DeepSDF: learning continuous signed distance functions for shape representation. In Proc. CVPR.
- [21] (2019) Superquadrics revisited: learning 3d shape parsing beyond cuboids. In Proc. CVPR.
- [22] (2007-07) Kullback-leibler approach to gaussian mixture reduction. IEEE Transactions on Aerospace and Electronic Systems 43 (3), pp. 989–999.
- [23] (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV 47 (1-3), pp. 7–42.
- [24] (2016) Structure-from-motion revisited. In Proc. CVPR, pp. 4104–4113.
- [25] (2018) Pix3D: dataset and methods for single-image 3d shape modeling. In Proc. CVPR.
- [26] (2017) Cnn-slam: real-time dense monocular slam with learned depth prediction. In Proc. CVPR, pp. 6243–6252.
- [27] (2017) Learning shape abstractions by assembling volumetric primitives. In Proc. CVPR.
- [28] (2017-07) Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proc. CVPR.
- [29] (2018) Pixel2Mesh: generating 3d mesh models from single rgb images. In Proc. ECCV.
- [30] (1996-06) Using neural networks to model conditional multivariate densities. Neural computation 8, pp. 843–54.
- [31] (2017) MarrNet: 3D Shape Reconstruction via 2.5D Sketches. In Proc. NIPS.
- [32] (2016) Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Proc. NIPS, pp. 82–90.
- [33] (2016) Perspective transformer nets: learning single-view 3d object reconstruction without 3d supervision. In Proc. NIPS, pp. 1696–1704.