1 Introduction
Single-image 3D reconstruction has a wide variety of applications including AR, robotics, and autonomous driving. For example, image-based 3D estimation can provide an initial guess for grasp targets for robots and for obstacle avoidance for autonomous vehicles. Similarly, it can provide cues for inter-frame ego-motion estimation [26] as well as for superimposing synthetic objects into the real world.
Image-based 3D reconstruction is, in general, an ill-posed problem. As a 2D image is a projection of the 3D scene, the recovery of the original 3D information from this degenerate observation is generally not unique. To resolve this ambiguity, past methods have leveraged constraints arising from projective geometry [23, 10, 1, 24], radiometric surface properties [3, 2, 12], and optical imaging properties [8, 18].
Recently, convolutional neural networks have been widely used to devise prior knowledge of the target 3D shape for these 3D shape estimation problems. By using a sufficient number of paired training examples of 3D shapes and their 2D appearances [5], recent works [7, 13, 29, 6, 16] have shown that 3D geometry recovery from a single image can be learned end-to-end.

In this paper, we introduce a novel single-image 3D shape recovery network that outputs a compact, analytical model with a number of advantages over conventional representations, namely point clouds, voxels, and mesh models. The key idea is to model the 3D shape with a Gaussian mixture (Figure 1). For this, 3D shape recovery is formulated as kernel density estimation, i.e., 3D shape reconstruction as estimation of a 3D probability distribution that generates the target point cloud. The resulting 3D Gaussian mixture shape representation significantly reduces the memory footprint compared to voxel-based occupancy estimation approaches [6, 31], and also provides a straightforward means for defining the surface, in contrast to unstructured point cloud representations [7, 13]. Also, in sharp contrast to mesh-based shape representations, the Gaussian mixture model enables the network to adaptively refine the shape topology.

To recover the 3D shape from a single image as a 3D Gaussian mixture model, we introduce a novel deep neural network which we refer to as 3D-GMNet. Given a single image as input and its corresponding 3D shape as supervision, the proposed network learns to output a set of parameters that define the Gaussian mixture shape model. We propose two novel loss functions to train 3D-GMNet end-to-end. The first is the 3D Gaussian mixture loss, which evaluates the accuracy of the estimated Gaussian mixture model with regard to the target 3D shape. This is achieved by maximizing the likelihood of the Gaussian mixture, which in turn is evaluated by considering the target 3D points as samples from the true distribution. The second is the multi-view 2D loss, which evaluates the accuracy of the 2D projections of the Gaussian mixture onto random viewpoints against the true silhouettes, i.e., the projections of the ground-truth 3D shape onto the same viewpoints.
The recovered 3D Gaussian mixture shape model has a number of advantages over conventional 3D shape representations. Most importantly, it is compact, as it requires only the 3D mean and covariance of each mixture component. In addition, it admits a number of favorable direct applications, including automatic level-of-detail computation via Gaussian mixture reduction, pose estimation, and distance measurement. We demonstrate these properties, in particular in the context of pose alignment and level-of-detail computation, together with the effectiveness of 3D-GMNet on a number of synthetic and real images. The results show that 3D-GMNet recovers accurate 3D shape from a single image with a representation that opens a new avenue of use in fundamental applications.
2 Related Work
Shape Representation
Learning-based 3D shape estimation studies can be categorized by their shape representations: voxel grids [6, 31], point clouds [7, 13], patches [9], mesh models [29, 14], primitive sets [19, 27, 21], or learned functions [17, 20].
Wu et al. [31] discretize the target 3D shape into a voxel grid, and their neural network estimates the occupancy of each voxel. This is a memory-intensive approach, although it can handle 3D shapes of different topology in a unified manner.
Point cloud approaches represent the target 3D shape as a collection of 3D points on the object surface. Lin et al. [16] propose a network that estimates multi-view depth maps from a single image. They fuse the multi-view depth maps and project them to 2D image planes for jointly optimizing a depth and a silhouette loss function. Groueix et al. [9] represent the target 3D shape by a collection of 3D patches. Although memory efficient, fusing multiple depth maps or multiple patches into a single watertight 3D shape remains challenging.
Mesh-based approaches [29, 14] can make use of the local connectivity of the 3D shape. Handling different topologies, however, is an inherently challenging task with meshes. Primitive-based approaches [19, 27, 21] represent the target 3D shape as a collection of simple objects such as cuboids or superquadrics. They can realize a compact representation of the target volume, but cannot represent smooth and fine structures by definition.
Mescheder et al. [17] train a network as a nonlinear function representing the occupancy probability of the 3D object shape. For a given 3D position, the network infers the probability of that position being part of the object volume. This approach is highly scalable in resolution: compared with voxel-grid approaches, whose networks are trained for specific voxel resolutions, it can sample 3D points as densely as required. On the other hand, it is a computation-intensive shape representation, since the network must infer the probability for each and every sample.
Our Gaussian mixture-based representation combines the advantages of these representations. It models not only the surface points but also the interior of the volume with an efficient parameterization, i.e., a set of Gaussian parameters, and can generate a watertight 3D surface of arbitrary resolution as its isosurface. It can be seen both as a probability density approach with Gaussian distributions and as a primitive-based approach with Gaussians as primitives. Additionally, in contrast to cuboid-based approaches, our shape representation enables 3D registration with a simple algorithm, as described in Section 4.2.

Neural Networks for Mixture Density Estimation
The mixture density network [4] is a method to predict a target multimodal distribution as a mixture density distribution. Bishop [4] introduced this network architecture with isotropic Gaussian basis functions. Williams extended it to utilize a general multivariate Gaussian distribution as the basis function [30]. As described in Section 3.2, our density estimation network is inspired by these works.
Loss Functions
When training deep neural networks for 3D shape estimation, most existing studies use combinations of a 3D shape loss, a multi-view 2D loss, and an adversarial loss. The 3D shape loss evaluates the difference between the predicted and the target shapes. Choy et al. [6] used cross entropy for a voxel-grid representation. The Chamfer distance can measure the distance to the target shape from point clouds [7, 13] or mesh models [29]. When using mesh representations, the consistency of surface normals and edge lengths in 3D can also be used.
Multi-view 2D consistency evaluates the difference between the 2D projection of the predicted shape and its ground truth, such as silhouettes [33, 14, 13] or depth maps [16]. Adversarial training is also used to make the predicted shapes plausible [13]. Jiang et al. [13] proposed a discriminator that classifies whether a 3D shape is a predicted one or a true one based on semantic features extracted from the 3D shape and the 2D input image. In this paper, we introduce 3D shape and multi-view 2D losses that take full advantage of the statistical and geometric properties of 3D Gaussian distributions.
3 3D Gaussian Mixture Network
Figure 2 shows an overview of our 3D-GMNet. Given a 2D image of an object, 3D-GMNet estimates a set of parameters that defines a Gaussian mixture that best represents the 3D shape of the object in the input image.
In this section, we first introduce the Gaussian-mixture 3D shape representation (Section 3.1), and then the proposed network for estimating its parameters from an image (Section 3.2) using a 3D shape loss (Section 3.3) and a multi-view 2D loss (Section 3.4).
3.1 3D Shape as a Probability Density Function
Our key idea is to consider the target 3D volume as a collection of observations of a random variable with a Gaussian mixture distribution. Suppose the ground-truth 3D shape is given as a 3D point cloud that samples the object volume; such point clouds can easily be computed from the 3D models in the training data, and we may equivalently regard them as voxels. Each voxel is a sample of the random variable, and our goal in training the network is to estimate the Gaussian mixture distribution that best describes these samples.
We assume that the density is uniform inside the object and normalized. The density function is described as

$$p_{\rm gt}(x) = \begin{cases} 1/V & x \in \mathcal{O}, \\ 0 & \text{otherwise}, \end{cases} \quad (1)$$

where $x$ is a 3D position, $\mathcal{O}$ is the object region, and $V$ is the volume of the object.
3D Gaussian Mixture Distribution
We use a 3D Gaussian mixture $p(x)$ to approximate the true distribution $p_{\rm gt}(x)$. A 3D Gaussian distribution is defined as

$$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{3/2} |\Sigma|^{1/2}} \exp\left( -\frac{D^2(x)}{2} \right), \quad (2)$$

where $\mu$ is the mean, $\Sigma$ is the covariance matrix, and

$$D^2(x) = (x - \mu)^\top \Sigma^{-1} (x - \mu). \quad (3)$$
A 3D Gaussian mixture distribution is defined as

$$p(x) = \sum_{k=1}^{K} \pi_k N(x \mid \mu_k, \Sigma_k), \quad (4)$$

where $K$ is the number of mixture components and $\pi_k$ are the mixing coefficients that satisfy

$$\sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0. \quad (5)$$
Gaussian mixtures can approximate a wide variety of distributions given an appropriate number of components $K$.
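For readers who want to experiment with this representation, the mixture density of Eqs. (2)-(4) can be evaluated directly. The following is a minimal NumPy sketch; the helper names are our own, not part of the paper's implementation.

```python
import numpy as np

def gaussian3d(x, mu, sigma):
    """Evaluate a 3D Gaussian density N(x | mu, sigma) (Eqs. 2-3)."""
    d = x - mu
    inv = np.linalg.inv(sigma)
    norm = (2 * np.pi) ** 1.5 * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * d @ inv @ d) / norm

def gmm_density(x, pis, mus, sigmas):
    """Evaluate the mixture p(x) = sum_k pi_k N(x | mu_k, Sigma_k) (Eq. 4)."""
    return sum(p * gaussian3d(x, m, s) for p, m, s in zip(pis, mus, sigmas))
```

A whole shape is then just the parameter lists `pis`, `mus`, `sigmas`, which is what makes the representation so compact.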
Surface Generation from the 3D Gaussian Mixture Distribution
Once the density function is obtained, we can generate the 3D surface as follows.
Assume for the moment that we know the volume $V$ of the object, although it is not available in reality. The object surface is given as the isosurface of the density at

$$p(x) = \frac{\tau}{V}, \quad (6)$$

where the parameter $\tau$ decides the level of thresholding. We approximate the unknown $1/V$ by the expectation of the density,

$$\frac{1}{\hat{V}} = E[p(x)] = \int p(x)^2 dx, \quad (7)$$

since $E[p_{\rm gt}(x)] = 1/V$ holds if $p(x)$ is identical to the true distribution $p_{\rm gt}(x)$. The parameter $\tau$ is determined experimentally in the evaluations in Section 4.
By substituting Eqs. (2), (3), and (4) into Eq. (7), we obtain the following closed-form solution of the expectation:

$$E[p(x)] = \sum_{i=1}^{K} \sum_{j=1}^{K} \pi_i \pi_j N(\mu_i \mid \mu_j, \Sigma_i + \Sigma_j). \quad (8)$$

By thresholding the target space with this value, we obtain a volumetric representation of the 3D shape, which can then be converted to a surface model using the marching cubes algorithm [15].
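The thresholding step above can be sketched as follows; this is a minimal NumPy illustration of the closed-form expectation of Eq. (8) followed by volumetric thresholding. The grid resolution and extent are arbitrary choices, and the final marching cubes conversion is omitted.

```python
import numpy as np

def gauss(x, mu, cov):
    """3D Gaussian density N(x | mu, cov)."""
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / (
        (2 * np.pi) ** 1.5 * np.sqrt(np.linalg.det(cov)))

def density_expectation(pis, mus, covs):
    """Closed form E[p] = sum_ij pi_i pi_j N(mu_i | mu_j, Sigma_i + Sigma_j)."""
    return sum(pi * pj * gauss(mi, mj, ci + cj)
               for pi, mi, ci in zip(pis, mus, covs)
               for pj, mj, cj in zip(pis, mus, covs))

def occupancy_grid(pis, mus, covs, res=32, extent=1.5, tau=1.0):
    """Boolean occupancy obtained by thresholding the density at tau * E[p].
    The resulting grid can be handed to a marching cubes implementation."""
    thresh = tau * density_expectation(pis, mus, covs)
    ax = np.linspace(-extent, extent, res)
    grid = np.stack(np.meshgrid(ax, ax, ax, indexing="ij"), axis=-1)
    dens = np.apply_along_axis(
        lambda x: sum(p * gauss(x, m, c) for p, m, c in zip(pis, mus, covs)),
        -1, grid)
    return dens >= thresh
```

The double loop in `density_expectation` is quadratic in the number of components, which stays cheap for the component counts used here.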
3.2 Network Architecture
Our 3D-GMNet outputs the set of Gaussian mixture parameters $\{(\pi_k, \mu_k, \Sigma_k)\}_{k=1}^{K}$. As Figure 2 depicts, it has an encoder module to predict these parameters and a projection module to render multi-view 2D silhouettes.
The numbers in the colored blocks in Figure 2 show the details of the encoder. The encoder is composed of 5 convolutional layers, 5 max-pooling layers of kernel size 2, and 3 fully-connected layers. Each convolutional layer is followed by a batch normalization layer and a leaky ReLU activation layer. Each fully-connected layer except the last is also followed by a leaky ReLU layer.
After these layers, 3D-GMNet introduces an output layer tailored for Gaussian mixture parameters to enforce constraints such as Eq. (5) [4, 30].
The mean $\mu_k$ should be a 3D position in Euclidean space $\mathbb{R}^3$, and we use an identity mapping for it:

$$\mu_k = z_k^{\mu}, \quad (9)$$

where $z_k^{\mu}$ is the corresponding output of the last layer.
In order to ensure that the coefficients $\pi_k$ satisfy Eq. (5), the output layer applies the softmax activation:

$$\pi_k = \frac{\exp(z_k^{\pi})}{\sum_{k'=1}^{K} \exp(z_{k'}^{\pi})}. \quad (10)$$
The precision matrix $\Lambda_k = \Sigma_k^{-1}$ of a Gaussian component should be a symmetric positive definite matrix, and can be decomposed as $\Lambda_k = L_k L_k^\top$ using the Cholesky decomposition, where $L_k$ is a lower triangular matrix. Our network therefore predicts $L_k$ instead of $\Lambda_k$, with its diagonal entries produced through an exponential:

$$(L_k)_{ii} = \exp\left( (z_k^{L})_{ii} \right), \qquad (L_k)_{ij} = (z_k^{L})_{ij} \ (i > j), \quad (11)$$

where $z_k^{L}$ denotes the corresponding outputs of the last layer. Notice that this enforces the diagonal entries of $L_k$ to be positive, so that the mapping from $L_k$ to $\Lambda_k$ is bijective.
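A minimal sketch of this output layer, written with NumPy for clarity; the 6-vector packing of the lower-triangular factor (three diagonal entries first, then the three sub-diagonal entries) is our own convention, not the paper's.

```python
import numpy as np

def raw_to_gmm_params(raw_pi, raw_mu, raw_l):
    """Map unconstrained network outputs to valid GMM parameters.

    raw_pi: (K,)    -> mixing coefficients via softmax (Eq. 10)
    raw_mu: (K, 3)  -> means, identity mapping (Eq. 9)
    raw_l:  (K, 6)  -> lower-triangular factors; exp() on the diagonal
                       makes Lambda_k = L_k L_k^T positive definite (Eq. 11)
    """
    e = np.exp(raw_pi - raw_pi.max())     # stable softmax
    pis = e / e.sum()
    mus = raw_mu
    precs = []
    for v in raw_l:
        L = np.zeros((3, 3))
        L[0, 0], L[1, 1], L[2, 2] = np.exp(v[0]), np.exp(v[1]), np.exp(v[2])
        L[1, 0], L[2, 0], L[2, 1] = v[3], v[4], v[5]
        precs.append(L @ L.T)             # symmetric positive definite
    return pis, mus, np.stack(precs)
```

In the actual network the same mapping is applied inside the output layer so that gradients flow through the unconstrained values.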
3.3 3D Gaussian Mixture Loss
As introduced in the last section, we assume that the target 3D shape is given as voxels. To quantitatively evaluate the fit of the estimated Gaussian mixture to these voxels, we use the Kullback-Leibler (KL) divergence.
In general, the KL divergence between the target density $p_{\rm gt}(x)$ and the predicted density $p(x)$ is defined as

$$D_{\rm KL}(p_{\rm gt} \,\|\, p) = \int p_{\rm gt}(x) \log \frac{p_{\rm gt}(x)}{p(x)} dx. \quad (12)$$

Since the entropy term $\int p_{\rm gt}(x) \log p_{\rm gt}(x) dx$ is constant, minimization of $D_{\rm KL}$ is equivalent to minimization of the cross entropy

$$-\int p_{\rm gt}(x) \log p(x) dx. \quad (13)$$

By considering the target voxels as observations from the true distribution, we can compute this efficiently with Monte Carlo sampling:

$$L_{\rm 3D} = -\frac{1}{N} \sum_{n=1}^{N} \log p(x_n), \quad (14)$$

where $p(x)$ is the output of our network, $x_n$ is a sample from the target density, and $N$ is the number of sampled points. In our training, we randomly sample a fixed number of 3D voxels from the original target voxels for each mini-batch.
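The Monte Carlo estimate of Eq. (14) is then a negative mean log-likelihood over the sampled voxel centers; a minimal NumPy sketch (the small epsilon guarding the logarithm is our own numerical safeguard):

```python
import numpy as np

def gmm_logpdf(x, pis, mus, covs):
    """Log of the mixture density sum_k pi_k N(x | mu_k, Sigma_k)."""
    dens = 0.0
    for p, m, c in zip(pis, mus, covs):
        d = x - m
        dens += p * np.exp(-0.5 * d @ np.linalg.inv(c) @ d) / (
            (2 * np.pi) ** 1.5 * np.sqrt(np.linalg.det(c)))
    return np.log(dens + 1e-12)   # epsilon avoids log(0) far from all components

def loss_3d(samples, pis, mus, covs):
    """Eq. (14): L = -(1/N) sum_n log p(x_n), x_n drawn from the target voxels."""
    return -np.mean([gmm_logpdf(x, pis, mus, covs) for x in samples])
```

Minimizing this loss over the mixture parameters is exactly maximum-likelihood fitting of the mixture to the target shape samples.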
We also use a regularization term

(15)

so that the Gaussian components are distributed within a distance $r$ from the object center. We choose $r$ to cover the entire object space.
3.4 Multi-View 2D Loss
We also introduce a novel multi-view 2D loss for a 3D Gaussian mixture that evaluates the consistency of its 2D projection with a given silhouette.
3.4.1 Paraperspective Projection of a Gaussian Mixture
To generate a silhouette of the 3D Gaussian mixture of Eq. (4), we use paraperspective projection [11] for each mixture component, since it projects a 3D Gaussian to a 2D Gaussian. As a result, we obtain a 2D Gaussian mixture as the projection of our 3D Gaussian mixture shape representation. Note that perspective projection of a Gaussian does not result in a Gaussian due to its nonlinearity.
Since we are only interested in object shape recovery, we can safely assume that the camera pose with respect to the object is defined by a rotation about the object center. Suppose we project a 3D Gaussian mixture defined in the world coordinate system to a camera of pose $R$. A 3D Gaussian rotated by $R$ is given by

$$N(x \mid R\mu, R\Sigma R^\top), \quad (16)$$

where $R$ is the rotation matrix transforming from the world coordinate system to the camera coordinate system.

Paraperspective projection first defines a 3D plane located at the centroid of the object and parallel to the image plane. Since we project each Gaussian component independently, the centroid is identical to the mean $R\mu_k$ of each Gaussian.
Suppose an oblique coordinate system centered at the centroid, whose first and second axes are identical to those of the camera coordinate system but whose third axis is the direction from the camera center (i.e., the origin of the camera coordinate system) to the centroid. Transforming a 3D point in the camera coordinate system to a 3D point in this oblique system can then be described by

(17)

(18)

Therefore, the Gaussian of Eq. (16) is transformed to a Gaussian with parameters

(19)

and the parallel projection onto the plane is given by marginalizing this Gaussian along the third (depth) axis

(20)

By applying a 2D affine transform from this plane to the image plane, which scales by the depth of the centroid, we obtain the 2D Gaussian on the image plane as

(21)
As a result, the paraperspective projection of the 3D Gaussian of Eq. (16) is given by

$$N_2(x' \mid \mu'_k, \Sigma'_k), \quad (22)$$

where $N_2$ denotes a 2D Gaussian of the form

$$N_2(x' \mid \mu', \Sigma') = \frac{1}{2\pi |\Sigma'|^{1/2}} \exp\left( -\frac{1}{2} (x' - \mu')^\top \Sigma'^{-1} (x' - \mu') \right). \quad (23)$$
This paraperspective projection is differentiable and denoted as the projection module in Figure 2.
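To illustrate the core mechanism, rotating a Gaussian into camera coordinates (Eq. (16)) and marginalizing the depth axis, here is a deliberately simplified parallel-projection sketch in NumPy. The full paraperspective model additionally applies the oblique transform and depth scaling of Eqs. (17)-(21), which are omitted here.

```python
import numpy as np

def project_gaussian_parallel(mu, cov, R):
    """Rotate a 3D Gaussian into camera coordinates (Eq. 16) and marginalize
    the depth axis; returns the 2D mean and covariance on the image plane.
    Marginalizing a Gaussian keeps the corresponding sub-blocks of mu and cov,
    so the projection of a Gaussian stays a Gaussian."""
    mu_c = R @ mu
    cov_c = R @ cov @ R.T
    return mu_c[:2], cov_c[:2, :2]
```

This closure property is exactly why the silhouette of the mixture is again a 2D Gaussian mixture, component by component.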
3.4.2 Silhouette Loss
In order to evaluate the consistency between the projected 2D Gaussian mixture of Eq. (22) and the ground-truth silhouette $S$, we generate a pseudo soft silhouette $\tilde{S}$ from Eq. (22).

Given a random sampling of $N_p$ points from the 2D probability density function of Eq. (22), the probability of observing at least one of the $N_p$ points at a pixel position $u$ is given by

$$\tilde{S}(u) = 1 - \left( 1 - p'(u) \right)^{N_p}. \quad (24)$$

By approximating the silhouette generated from the probability density function by this $\tilde{S}$, we define an L2 loss

$$L_{\rm 2D} = \sum_u \left( \tilde{S}(u) - S(u) \right)^2 \quad (25)$$

as our silhouette loss. In the experiments, we determined $N_p$ using validation data.
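Assuming the at-least-one-sample form for the pseudo silhouette, i.e., $\tilde{S} = 1 - (1 - p)^{N_p}$ per pixel, the silhouette loss can be sketched as follows (here `p_img` stands for a per-pixel probability mass, the projected density times the pixel area; the mean instead of the sum in the loss is our own normalization choice):

```python
import numpy as np

def pseudo_silhouette(p_img, n_points):
    """Probability that at least one of n_points samples drawn from the
    projected density lands on each pixel (Eq. 24)."""
    return 1.0 - (1.0 - p_img) ** n_points

def silhouette_loss(p_img, gt_sil, n_points=100):
    """L2 loss between the pseudo and ground-truth silhouettes (Eq. 25)."""
    return np.mean((pseudo_silhouette(p_img, n_points) - gt_sil) ** 2)
```

Because `pseudo_silhouette` is a smooth function of the density, gradients flow back through the projection module to the mixture parameters.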
3.5 Training
As Figure 2 illustrates, given a 2D image of a 3D object as input, 3D-GMNet estimates the parameters of a 3D Gaussian mixture that represents the 3D shape of the target in the camera coordinate system. For this, the target voxels used in the 3D Gaussian mixture loss (Section 3.3) are transformed according to the ground-truth camera pose of the input image beforehand. Once the GMM parameters are estimated, the network renders 2D silhouettes from random viewpoints and evaluates the multi-view 2D loss (Section 3.4). This process requires the ground-truth camera poses of these viewpoints in addition to their ground-truth silhouettes. In the experiments, we use a fixed number of random viewpoints.
4 Experimental Results
We evaluate our method on the task of class-specific single-image 3D reconstruction. We first describe the data and metrics used for our experiments and then detail quantitative evaluations on synthetic and real images. In addition, we show two applications of our method, 3D pose alignment of reconstructed 3D shapes and automatic level-of-detail computation, both of which demonstrate advantages of the 3D Gaussian mixture shape representation that conventional methods do not have.
Data
To conduct a thorough quantitative evaluation, we synthesized multi-view images and 3D point clouds from 3D models in ShapeNet [5]. Each 3D model in ShapeNet has a polygon CAD model and its volume data. For each polygon model, multi-view RGB images are rendered from 100 random viewpoints placed on a tessellated icosahedron at a unit distance from the model. To normalize the apparent size of the model in the rendered images, the distance is adjusted on a per-model basis according to the diagonal size of the object bounding box. The virtual camera is configured with a fixed resolution and field-of-view. We train and evaluate our network on 4 object categories, namely Chair, Car, Airplane, and Table.
For real images, we evaluate our method using real chair images in the Pix3D dataset [25]. For simplicity, we remove images in which the object is only partially visible or occluded by other objects; occlusion handling will be addressed in future work. We resize and crop the images using manually annotated 2D masks.
Table 1: 3D reconstruction accuracy with (3D + MV) and without (3D) the multi-view silhouette loss.

            CD       EMD      IoU
  3D        0.0866   0.0923   0.466
  3D + MV   0.0842   0.0889   0.482

Table 2: Silhouette accuracy with and without the multi-view silhouette loss.

            MSE      dist(pred to GT)   dist(GT to pred)
  3D        0.0487   0.574              0.419
  3D + MV   0.0468   0.558              0.317
Evaluation Metrics
Following [25], we use three metrics for evaluation: intersection over union (IoU), earth mover's distance (EMD), and Chamfer distance (CD). IoU evaluates the coverage of the estimated volume w.r.t. the ground-truth volume, using the Gaussian mixture voxelization of Section 3.1; higher IoU means a better reconstruction. EMD and CD evaluate distances between two surfaces via point clouds sampled on them. As described in [25], we uniformly sample points on the estimated and the ground-truth surfaces to generate dense point clouds, and then randomly subsample points from them. The point clouds are scaled to fit a unit cube for normalization before computing EMD and CD. We used the implementation by Sun et al. [25].
To further analyze the effect of the silhouette loss, we generate silhouette images of the reconstructed 3D shape. For quantitative evaluation, we use the mean squared error (MSE) of the silhouette images together with two one-directional silhouette distances: the average distance to the ground-truth silhouette over the pixels of the generated silhouette, and the corresponding average distance from the ground-truth silhouette to the generated one.
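As a concrete reference for the distance metrics, a brute-force Chamfer distance between two sampled point clouds can be computed as below. Definitions of CD vary across papers (squared vs. unsquared distances, sum vs. mean aggregation); this sketch uses the unsquared, mean-aggregated form.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean nearest-neighbor distance in both directions."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

The pairwise matrix is O(NM) in memory, which is fine for the point counts typically used in evaluation; a KD-tree is preferable for larger clouds.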
Training Parameters
We use the Adam optimizer, with the training loss averaged over each mini-batch. The 3D models in ShapeNet are split into training, validation, and evaluation sets.
4.1 Single-Image 3D Reconstruction
Figure 3 shows predicted 3D models. Given an input image shown in the first and fourth columns from the left, our 3D-GMNet estimates its 3D shape as a 3D Gaussian mixture. The rest of the columns show renderings of the estimated 3D Gaussian mixture shape as a surface model (Section 3.1) from different viewpoints. The renderings from the novel views demonstrate qualitatively that the proposed 3D-GMNet can successfully estimate the full 3D shape from a single image.
Contribution of Silhouette Loss
Table 1 shows a quantitative evaluation of the multi-view silhouette loss. The results show that the silhouette loss reduces 3D shape reconstruction errors.
Figure 4 shows silhouettes of reconstructed 3D shapes with and without the silhouette loss, and Table 2 shows the quantitative evaluation of the silhouettes. Although the silhouette loss is computed using pseudo silhouettes, the quality of the actual silhouettes improves. In particular, the completeness, i.e., the distance from the ground-truth silhouette to the generated one, improves drastically.
Number of Gaussian Components
3D Reconstruction from Real Images
4.2 3D Pose Alignment
As described in Section 3.5, our 3D-GMNet generates the shape in the input camera's local coordinate system. This suggests that given two images of a single object from different viewpoints, 3D-GMNet generates 3D shapes described in different coordinate systems, and hence we can estimate the relative pose of the cameras by aligning the estimated 3D shapes.
A challenge here is the view-dependent assignment of Gaussian components when representing the same 3D object shape from different views; we cannot directly establish component-wise correspondences. This is because our loss functions only require that the Gaussian mixture represent the object shape as a whole, so the network can change the assignment of each Gaussian component freely for each image.
We solve this by aligning the covariance matrices of the Gaussian mixtures estimated from different viewpoints. The covariance matrix of a Gaussian mixture (Eq. (4)) is given by

$$\Sigma = \sum_{k} \pi_k \left( \Sigma_k + \mu_k \mu_k^\top \right) - \mu \mu^\top,$$

where $\mu = \sum_k \pi_k \mu_k$. We then estimate the rotation via diagonalization, with the eigenvectors sorted in descending order of the magnitude of the corresponding eigenvalues. The correspondence based on the order of eigenvalues has a sign ambiguity for each eigenvector. To find the correct signs, we choose the rotation that minimizes the L2 distance between the two Gaussian mixtures:

(26)

where

(27)
Figure 8 shows the alignment results. The results show that our method can provide reasonable pose alignments without explicit point cloud generation.
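The alignment procedure above can be sketched as follows: compute each mixture's overall covariance, diagonalize it, and form a candidate rotation from the sorted eigenvectors. The function names are ours, and the sign disambiguation via the L2 distance of Eq. (26) is omitted from this sketch.

```python
import numpy as np

def mixture_covariance(pis, mus, covs):
    """Covariance of the full mixture:
    Sigma = sum_k pi_k (Sigma_k + mu_k mu_k^T) - mu mu^T."""
    mu = sum(p * m for p, m in zip(pis, mus))
    sec = sum(p * (c + np.outer(m, m)) for p, m, c in zip(pis, mus, covs))
    return sec - np.outer(mu, mu)

def principal_axes(sigma):
    """Eigenvectors sorted by descending eigenvalue (each sign-ambiguous)."""
    w, v = np.linalg.eigh(sigma)
    order = np.argsort(w)[::-1]
    return v[:, order]

def align_rotation(sigma_a, sigma_b):
    """Candidate rotation taking shape A's principal axes onto shape B's.
    The remaining per-axis sign ambiguity would be resolved by minimizing
    the L2 distance between the two mixtures (Eq. 26)."""
    return principal_axes(sigma_b) @ principal_axes(sigma_a).T
```

Note that the alignment never needs component-wise correspondences: only the aggregate second-order statistics of the two mixtures are compared.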
Table 3: Quantitative comparison on real chair images (IoU: higher is better; EMD, CD: lower is better).

  Method            IoU     EMD     CD
  3D-R2N2 [6]       0.136   0.211   0.239
  PSGN [7]          N/A     0.216   0.200
  3D-VAE-GAN [32]   0.171   0.176   0.182
  DRC [28]          0.265   0.144   0.160
  MarrNet [31]      0.231   0.136   0.144
  AtlasNet [9]      N/A     0.128   0.125
  Pix3D [25]        0.287   0.120   0.119
  Ours              0.250   0.131   0.134
  Ours*             0.259   0.129   0.130
4.3 Level-of-Detail Computation
Figure 9 demonstrates a level-of-detail visualization of the Gaussian mixture representation. Given an input image (the leftmost column), our network infers a Gaussian mixture representation of its 3D shape (the second column). By applying Gaussian mixture reduction [22], we obtain level-of-detail visualizations as shown in the second and fourth rows. Compared with the 3D shapes estimated by 3D-GMNet trained directly with the corresponding number of components (the first and third rows), this automatic computation yields reasonable approximations.
5 Conclusion
We proposed 3D-GMNet for single-image 3D object shape reconstruction with a Gaussian mixture representation. The proposed network utilizes a 3D loss and a multi-view 2D loss to evaluate the fit of the estimated Gaussian mixture to the ground-truth 3D shape and its 2D silhouettes. Experimental results show that 3D-GMNet successfully estimates object 3D shape as a compact Gaussian mixture representation, even with a small number of components, while maintaining reconstruction accuracy comparable to state-of-the-art results.
Acknowledgement
This work was in part supported by JSPS KAKENHI 17K20143 and JST PRESTO JPMJPR1858.
References
 [1] (2011) Building Rome in a day. Communications of the ACM 54(10), pp. 105–112.
 [2] (2006) What is the range of surface reconstructions from a gradient field? In Proc. ECCV, pp. 578–591.
 [3] (2008) Photometric stereo with non-parametric and spatially-varying reflectance. In Proc. CVPR, pp. 1–8.
 [4] (1994) Mixture density networks. Technical report, Aston University.
 [5] (2015) ShapeNet: an information-rich 3D model repository. arXiv:1512.03012.
 [6] (2016) 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In Proc. ECCV.
 [7] (2017) A point set generation network for 3D object reconstruction from a single image. In Proc. CVPR.
 [8] (2008) Shape from defocus via diffusion. TPAMI 30(3), pp. 518–531.
 [9] (2018) AtlasNet: a papier-mâché approach to learning 3D surface generation. In Proc. CVPR.
 [10] (2010) The structure-from-motion reconstruction pipeline: a survey with focus on short image sequences. Kybernetika 46(5), pp. 926–937.
 [11] (2003) Multiple View Geometry in Computer Vision. 2nd edition, Cambridge University Press, New York, NY, USA.
 [12] (2007) Non-rigid photometric stereo with colored lights. In Proc. ICCV, pp. 1–8.
 [13] (2018) GAL: geometric adversarial loss for single-view 3D object reconstruction. In Proc. ECCV.
 [14] (2018) Neural 3D mesh renderer. In Proc. CVPR.
 [15] (2003) Efficient implementation of marching cubes' cases with topological guarantees. Journal of Graphics Tools 8(2), pp. 1–15.
 [16] (2018) Learning efficient point cloud generation for dense 3D object reconstruction. In Proc. AAAI.
 [17] (2019) Occupancy networks: learning 3D reconstruction in function space. In Proc. CVPR.
 [18] (1994) Shape from focus. TPAMI (8), pp. 824–831.
 [19] (2018) Im2Struct: recovering 3D shape structure from a single RGB image. In Proc. CVPR.
 [20] (2019) DeepSDF: learning continuous signed distance functions for shape representation. In Proc. CVPR.
 [21] (2019) Superquadrics revisited: learning 3D shape parsing beyond cuboids. In Proc. CVPR.
 [22] (2007) Kullback-Leibler approach to Gaussian mixture reduction. IEEE Transactions on Aerospace and Electronic Systems 43(3), pp. 989–999.
 [23] (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV 47(1–3), pp. 7–42.
 [24] (2016) Structure-from-motion revisited. In Proc. CVPR, pp. 4104–4113.
 [25] (2018) Pix3D: dataset and methods for single-image 3D shape modeling. In Proc. CVPR.
 [26] (2017) CNN-SLAM: real-time dense monocular SLAM with learned depth prediction. In Proc. CVPR, pp. 6243–6252.
 [27] (2017) Learning shape abstractions by assembling volumetric primitives. In Proc. CVPR.
 [28] (2017) Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proc. CVPR.
 [29] (2018) Pixel2Mesh: generating 3D mesh models from single RGB images. In Proc. ECCV.
 [30] (1996) Using neural networks to model conditional multivariate densities. Neural Computation 8, pp. 843–854.
 [31] (2017) MarrNet: 3D shape reconstruction via 2.5D sketches. In Proc. NIPS.
 [32] (2016) Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Proc. NIPS, pp. 82–90.
 [33] (2016) Perspective transformer nets: learning single-view 3D object reconstruction without 3D supervision. In Proc. NIPS, pp. 1696–1704.