3D Shape Reconstruction from a Single 2D Image via 2D-3D Self-Consistency

11/29/2018 ∙ by Yi-Lun Liao, et al. ∙ National Taiwan University 2

Aiming at inferring 3D shapes from 2D images, 3D shape reconstruction has drawn huge attention from researchers in computer vision and deep learning communities. However, it is not practical to assume that 2D input images and their associated ground truth 3D shapes are always available during training. In this paper, we propose a framework for semi-supervised 3D reconstruction. This is realized by our introduced 2D-3D self-consistency, which aligns the predicted 3D models and the projected 2D foreground segmentation masks. Moreover, our model not only enables recovering 3D shapes with the corresponding 2D masks, camera pose information can be jointly disentangled and predicted, even such supervision is never available during training. In the experiments, we qualitatively and quantitatively demonstrate the effectiveness of our model, which performs favorably against state-of-the-art approaches in either supervised or semi-supervised settings.



There are no comments yet.


page 3

page 12

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

3D modeling and reconstruction can be applied to a variety of real-world applications, including visual rendering, modeling, and robotics. While it might not be difficult for human to infer 3D information from the observed 2D visual data, it is a very challenging task for machines to do so. Without sufficient 3D shapes, viewpoint information or 2D images from different viewpoints for as training data, it is difficult to reconstruct 3D models using 2D data.

Over the past few years, Convolution Neural Network (CNN) and generative adversarial networks (GAN) 

[9] have shown impressive progresses and results particularly in the areas of computer vision and image processing. For 3D shape reconstruction, several solutions have been proposed [8, 4, 5, 17, 1, 26, 13, 29, 7, 11, 21, 22, 28, 22, 14, 24, 19, 30, 6, 27]. As discussed later in Section 2, different settings and limitations such as the required number of 2D input images, availability of the viewpoint information, and supervision of ground truth labels would limit the use of existing models for 3D reconstruction applications.

In this paper, we address a challenging task of 3D model reconstruction from a single 2D image. That is, during training and testing, we only allow our model to observe a single 2D image as the input, without knowing its viewpoint information (e.g., azimuth, elevation, and roll). If full label supervision is available, we allow both 3D ground truth shape (in voxels) and the associated 2D foreground segmentation mask to guide the learning of our proposed model. As for unlabeled data during the semi-supervised learning process, we only observe 2D images for training our model.

To handle the aforementioned challenging setting, we propose a deep learning architecture which not only recovers 3D shape and 2D mask outputs with full supervision. We further exploit their 2D-3D self-consistency during the learning process, which is the reason why we are able to utilize unlabeled image inputs to realize semi-supervised learning. Finally, as a special characteristics of our model, we are able to jointly disentangle camera pose information during the above process, even no ground truth pose information is observed any time during training.

The contributions of this work are highlighted below:

  • In this paper, a unique semi-supervised deep learning framework is proposed for 3D shape prediction from a single 2D image.

  • The presented network is trained in an end-to-end fashion, while 2D-3D self-consistency is particularly introduced to handle unlabeled training image data.

  • In addition to 3D shape prediction, our model is able to disentangle camera pose information from the derived latent feature space, while supervision of such information is never available during training.

  • Experimental results quantitatively and qualitatively show that our method performs favorably against state-of-the-art fully supervised and weakly-supervised methods.

2 Related Work

2.1 Learning for 3D Model Prediction

Most existing methods for 3D shape reconstruction require supervised settings, i.e., both 2D input images and their corresponding 3D models need to be observed during the training stage. With the development of large-scale shape repository like ShapeNet [2], several methods of this category have been proposed [8, 4]. For example, the work of Girdhar et al. [8] is fully trained with pairwise 2D images and 3D models, which is realized by learning joint embedding for both 3D shapes and 2D images. On the other hand, Wang et al. [27] choose to predict 3D shape, 2D mask, and pose simultaneously, followed by construction of probabilistic visual hulls. However, they require additional ground truth pose information to train their model.

Since it might not be practical to assume that the ground truth 3D object information is always available, recent approaches like [29, 11, 28, 22, 24] manage to learn object representations in weakly-supervised settings. For example, Yan et al. [29] present Perspective Transformer Nets. Guided by the projected 2D masks and given camera viewpoints, their proposed network architecture learns the perspective transformations of the target 3D object. As a result, the 3D objects can be recovered using 2D images without the supervision of ground truth shape information. Alternatively, Gwak et al. [11] take 2D images and viewpoint information as input, and utilize 2D masks as weak supervision information, which is enabled by perspective projection of reconstructed 3D shapes to foreground masks. They constrain the reconstructed 3D shapes to a manifold observed by unlabeled shapes of real-world objects. Soltani et al. [22] learn a generative model over depth maps or their corresponding silhouettes, and then use a deterministic rendering function to construct 3D shapes from the generated depth maps and silhouettes. Although their network is able to generate images at different viewpoints from a single input when testing, it requires both silhouettes and depth maps from different views as ground truth for training purposes. To reconstruct 3D shapes, Tulsiani et al. [24] choose to utilize the consistency between shape and viewpoint information independently predicted from two camera views of the same instance. However, they also require multiple images of the same instance taken from different camera viewpoints for training purposes.

Different from the works requiring images taken by different cameras for 3D reconstruction, our method only needs a single image for 3D shape reconstruction, and can be trained in a semi-supervised setting. Moreover, as noted and discussed in the following subsection, our model is able to disentangle the camera pose without the associated ground truth information.

2.2 Learning Interpretable Representation

Learning interpretable representation, or representation disentanglement, has attracted the attention fro researchers in the fields of computer vision and machine learning. Although Kingma

et al. [15]

utilize variational autoencoders to handle the generality of objects with input noise for improved data distribution. However, there is no guarantee that particular attributes in the derived latent space would correspond to desirable features. When it comes to recover 3D information from input 2D images, it is often desirable to be able to separate external parameters like viewpoint during the learning of visual object representation, so that the output data can be properly recovered or manipulated.

For representation disentanglement in 3D shape reconstruction, some existing approaches such as Adversarial Autoencoders (AAE) [18] manage to derive disentangled representations by supervised learning, matching the derived representations with specific labels and adversarial losses calculated from the discriminators. Weakly-supervised learning methods have also been proposed, which alleviate the need of utilizing fully labeled training data in the above process. For example, Deep Convolutional Inverse Graphics Network (DC-IGN) [16]

clamps certain dimensions of representation vectors from a mini-batch of training instances for retrieving factors such as azimuth angle, elevation angle, azimuth of light source, or intrinsic properties. To perform representation disentanglement in a unsupervised setting, Information Maximizing GAN (InfoGAN) 

[3] chooses to maximize mutual information between latent variables and observation to learn disentangled representations.

Recently, Grant et al. [10] propose Deep Disentangled Representations for Volumetric Reconstruction, which takes 2D images as inputs and produces separate representations for 3D shapes of objects and parameters of viewpoint and lighting. Since their approach requires full data supervision for deriving the associated shape and transformation information (i.e., ground truth volumetric shapes always available during training), it would not be easy to extend their work to practical scenarios where only a portion of 2D image data are with ground truth 3D information. Sharing similar goals, we aim at identifying shape and viewpoint information in a semi-supervised setting, while no ground truth camera pose information is observed during training.

Figure 1: Overview of the proposed network architecture, which contains encoder , voxel and mask decoders and , respectively. An additional module enforcing 2D-3D self-consistency is introduced, allowing camera pose disentanglement without supervision of its ground truth information.

3 Pose-Aware 3D Shape Reconstruction via Representation Disentanglement

In this paper, we propose a unique deep learning framework which observes a single 2D image for 3D shape reconstruction. Our model not only can be extended from fully supervised to semi-supervised settings (i.e., only a portion of 2D images are with ground truth 3D information), it also exhibits abilities in disentangling camera pose information from the derived representation. This disentanglement process is performed in a unsupervised fashion, and is achieved by observing 2D-3D self-consistency as we detail later.

In the remaining of this section, we first describe the notation and architecture in Section 3.1. How to advance representation disentanglement with 2D-3D self-consistency for 3D reconstruction is detailed in Section 3.2. Finally, Section 3.3 summarizes learning process of our model.

3.1 Notations and Architecture

To reconstruct 3D shape information, we describe shapes in volumetric forms of probabilistic occupancy in this paper. For the sake of completeness, we now define the notations which will be used.

Let denote the training data, where indicates the input 2D image, and are the associated ground truth 3D voxel and 2D mask. Thus, our goal is to predict the 3D voxel (and the 2D mask ), while the camera pose information will be jointly predicted in the latent space . It is worth noting that, since we focus on a semi-supervised setting, the data used for semi-supervised learning are and , while the size of is considered to be larger than that of .

For the fully supervised version of our framework, every training input image has the corresponding ground truth 3D voxel and ground truth 2D mask . And, for the semi-supervised version, only a portion of training images are with their 3D ground truth and 2D masks. Nevertheless, supervised or semi-supervised learning, we never observe ground truth camera pose information during training, while jointly reconstructing 3D shape and camera pose information are our goals.

As shown in Figure 1, our proposed network architecture consists of four components:

Image Encoder . The encoder has a residual structure [12], which maps the input RGB image into an intrinsic shape representation (i.e., shape code) and an extrinsic viewpoint representation (i.e., pose code) . This disentangled pose code is then converted into camera pose through a FC layer. We note that, translation in the 3D space does not affect the quality of shape reconstruction, and we consider only elevation and azimuth for describing the camera pose.

Voxel Decoder . The goal of this decode is to recover the 3D voxels based on the input shape code . We follow [19] for apply 2D deconvolution layers as our voxel tube decoder. However, we choose to apply three separate 2D deconvolution operations, with the channel dimensions of the three deconvolution layers corresponding to height, width, and depth, respectively.

Mask Decoder . Different from the voxel decoder , the purpose of the mask decoder is to output the 2D mask by observing both the input pose code and shape code . We utilize a U-Net [20] based structure for this 2D foreground mask segmentation procedure.

Figure 2: Illustration of 2D-3D self-consistency. When the mask value (,) is 1, the ray passing through (,) would stop at the associated voxel and cannot reach the 2D projection. In contrast, if (,) is 0, the ray passing through (,) penetrates the corresponding voxels, and the projected output can observed.

Module for 2D-3D Self-Consistency. As a unique design in our network, this module observes the output 3D model and the 2D mask. More precisely, by taking the disentangled camera pose information from the pose code , this module enforces the consistency between the predicted 3D voxel and the corresponding 2D mask via the projection loss. As discussed in the following subsection, this is how we achieve disentanglement of extrinsic camera pose from intrinsic shape representation without observing any ground truth information. In other words, the introduction of this module realizes our pose-aware 3D shape reconstruction.

3.2 Feature Disentanglement via 2D-3D Self-Consistency

Our proposed model is capable of deep feature disentanglement and pose-aware 3D reconstruction. It is worth repeating that, while our model can be train in fully supervised or semi-supervised settings (i.e., observing ground truth voxels and 2D masks), ground truth camera pose information is never required.

As noted in Sect. 3.1, our image encoder maps input images into an intrinsic shape representation (or shape code) and an extrinsic viewpoint information (or pose code) . The former is fed into to the voxel decoder to recover pose-invariant 3D voxels, while both and are the inputs to the mask decoder for 2D foreground mask segmentation. In order to convert the extracted pose code into an exact camera pose value (such as rotation and elevation), we transform pose code through one FC layer into camera pose and impose a 2D-3D self-consistency loss, aiming to align the predicted 3D voxels and the 2D mask with the disentangled pose code. We note that, as detailed below, the enforcement of 2D-3D self-consistency is the critical component for camera pose disentanglement, pose-aware 3D reconstruction. It is also the key to enable semi-supervised learning for our proposed network.

2D-3D Self-Consistency Differentiable ray consistency loss has been utilized in [24, 25], which evaluates the inconsistency between mask and shape viewed from predicted camera pose. To achieve camera pose disentanglement, we advance the predicted camera pose information and the corresponding voxel outputs to generate a projection 2D mask, and then calculate the difference between the projection 2D mask and predicted ones from . This allows the alignment between the the 2D projection of 3D voxels using the disentangled camera pose.

To realize the above disentanglment process via camera pose alignment, we particularly consider a unique 2D-3D self-consistency loss which integrates the projection loss and ray consistency loss as follows L_sc = α_1L_ray + α_2L_proj.

For ray consistency loss, we consider a ray passing through a mask at location (,) to be projected, and traveling along the voxel (as illustrated in Fig. 2). We sample

values along this ray, and the sampled value represents the occupancy at this sample point. Next, for each sample point, the probability of the ray stops at that point is calculated. If the mask value at (

,) is 0, the probability that the ray penetrates across sample points is close to 1. On the other hand, if the mask value at (,) is 1, this probability would be close to 0.

Now we introduce the details of ray consistency loss. We consider the ray passing through a mask at location (, ). Given camera intrinsic parameters (, , , ), where (, ) is focal length of camera and (, ) is optical center of camera, we can determine the direction of this ray as (, , ). We sample points along this ray, and the location of sample point in the camera coordinate frame is ( (), ( ), ), where .

To calculate the probability of occupancy at this sample point, given camera rotation matrix (parameterized by camera pose ) and camera translation , we first map the location of the sample point into (). Then we determine the occupancy of the sample point by trilinear sampling , which is shown as below:


Next, the probability of the ray passing through the pixel (,) stops at sample point can be obtained, which is shown as below:


If the ray does not stop at any sample point, then it will penetrate the whole voxel. As a result, we can extend (2) to obtained the probability that this ray that escapes the voxel, and we denote this probability as ,


With the above observation, we have the ray consistency loss at pixel (, ) defined below: L_ray(u, v) = m(u, v)q_(u, v), escape^p
+ (1-m(u, v))∑_i=1^Nq_u, v^p(i), where is the value of the mask at location (,). If is 0, the probability that the ray penetrates the voxel is close to 1. On the other hand, when is 1, the probability that the ray terminates at any sample point is near 0. As a result, we have the differentiable ray consistency loss () calculated as the mean of () over all pixels in .

In addition to ray consistency, we also consider the projection loss for 2D-3D self-consistency. As shown in (3.2), () represents the pixel (, ) of 3D-2D projection (, ). As a result, we obtain the projection (, ) after calculation every ray passing through mask , as shown in Fig 2. That is, if the ray terminates at that voxels, we expect to be close to 0, and () would be near and represents an object pixel in projection (, ), as depicted in Fig. 2. To evaluate the 3D reconstruction, we consider the metric of intersection over union (IoU) and impose the IoU loss [19] between projection and mask as shown below:


As (3.2)-(4) are differentiable w.r.t. camera pose and voxel

, this loss would guide the training of both camera pose estimation and 3D shape reconstruction.

In our work, we consider and fix and in (3.2). Because some fine structures like the bases of chairs would diminish when mask is of size as used in [24], we use mask of size .

Finally, we note that the voxel and mask discussed here can be either ground truth or predicted ones. As shown in Fig. 1, we use ground truth voxel and mask for 2D-3D self-consistency loss if full-supervised learning is applicable. If semi-supervised learning is of interest, we then consider the predicted voxel and its mask to calculate the 2D-3D self-consistency loss for the unlabeled data.

3.3 Learning of Our Model

Figure 3: Performance comparisons in IoU using our models training with different degrees of supervision.

Supervised Learning. To train our model in a fully supervised setting, both ground truth voxel and mask

are observed during training, while the ground truth camera pose is not available. The overall loss function for fully supervised learning is shown as below:

L_sup = α_3L_3D + α_4L_2D + α_5L_sc + α_6L_KL.

Here, we calculate the 2D-3D consistency loss between the ground truth voxel and mask instead of the predicted ones, which allows the training efficiency and the effectiveness in camera pose disentanglement. We use , and to update image encoder , and and for updating voxel decoder and mask decoder , respectively. In our work, we fix , , , and .

The voxel reconstruction loss consists of positive weighted cross entropy and IoU losses as shown below:


where adopts a positive weight to better preserve the fine structures in 3D, and is defined below:


Without this technique, the model tends to predict zeros for the voxels corresponding to such structures which minimizes the overall cross entropy loss. As for the IoU loss  [19], it is calculated as:


For 2D mask segmentation, we calculate the cross entropy between the predicted and ground truth masks as our 2D reconstruction loss .

As for the KL divergence loss, it is enforced to regularize the distribution of shape code and model ambiguity of 3D reconstruction due to unseen parts of shapes [6]. We adopt conditional variational autoencoder [18, 15] for and as shown below:


where the shape code consists of mean

and variance

. Only the mean is utilized by the mask decoder as mask segmentation is under-determined.

Semi-supervised Learning. In practice, we cannot collect the ground truth 3D voxels and 2D masks for all images, and thus a semi-supervised setting would be of interest. In this setting, only a portion of input images are with ground truth and , while the remaining data for training are the 2D images only. With such training data, we first pre-train our network using fully supervised data, followed by fine-tuning the network using unlabeled input images only. To be more specific, for semi-supervised learning, the overall loss function is defined as below: L_semi = L_sc + α_6L_KL. Note that and are calculated to update image encoder , while for updating voxel decoder .

We note that, in the network refinement process under this semi-supervised setting, our introduced 2D-3D self-consistency is critical. This allows our model to align the predicted 3D voxel and the predicted 2D mask with disentangled camera pose , all in an unsupervised fashion.

airplane car chair Mean
3D-R2N2 [4] 51.3 79.8 46.6 59.2
OGN [23] 58.7 81.6 48.3 62.9
PSGN [6] 60.1 83.1 54.4 65.9
voxel tube [19] 67.1 82.1 55.0 68.1
Matryoshka [19] 64.7 85.0 54.7 68.1
Ours 69.2 85.8 56.7 70.6
Table 1: Comparisons of 3D shape reconstruction in fully supervised settings in terms of mean IoU (%).

4 Experiments

4.1 Dataset

We consider the ShapeNet dataset [2] which contains a rich collection of 3D CAD models, and is widely used in recent research works related to 2D/3D data. Three categories, airplane, car, and chair, are selected for our experiments. For fair comparisons, we consider two different data settings. For supervised learning of our model and to perform comparisons, we follow the works of 3D-R2N2 [4], Octree Generating Network (OGN) [23], Point Set Generation Network (PSGN) [6], voxel tube network and Matryoshka network [19], which scale the ground truth voxels to fit into grids. This makes ground truth voxels larger than those considered in MVC [24] and DRC [25]. We use the same rendered images, ground truth voxel, and data split as used in these work.

For semi-supervised learning, we generate rendered images of size pixels and the corresponding ground truth 2D masks, using the same camera pose information and data split as used in Perspective Transformer Nets (PTN) [29]. We utilize the same ground truth voxels in Multi-view Consistency (MVC) [24] and Differentiable Ray Consistency (DRC) [25] to fit our projection module, and the grid size of voxels is .

4.2 Fully Supervised Learning

To train our model using fully-supervised data, we consider state-of-the-art methods of [4, 23, 6, 19] for comparisons. Since both ground truth voxels and 2D masks are available during training, there is no need to utilize the 2D-3D self-consistency loss for camera pose disentanglement.

Quantitative results of different 3D reconstruction methods are listed in Table. 1. From this table, we see that our network achieved favorable performances, showing that our network architecture is preferable under such settings.

Training data Label percentage Single/multi views airplane car chair Mean
DRC [25] 2D mask & pose 100% M 53.5 65.7 49.3 56.2
MVC [24] 2D mask 100% M 49.4 63.9 39.6 51.0
Ours (Semi) 2D mask & voxel 15% S 59.6 75.3 52.3 62.4
Ours (Sup) 2D mask & voxel 100% S 68.8 80.1 57.2 68.7
Table 2: Comparisons in IoU with approaches not requiring ground truth 3D shape information in weakly or semi-supervised settings. Note that Ours (Sup) denotes the fully-supervised version of our method.

4.3 Semi-Supervised Learning

To demonstrate the effect of our semi-supervised learning, we use different portions of labeled/unlabeled images to train the network for 3D reconstruction. For comparison purposes, we use 100, 80, 60, 40, 20 of labeled images for training fully-supervised version of our model, and the rest of the images are unused. As for semi-supervised learning, we use 80, 60, 40, 20 of labeled images, plus the remaining unlabeled ones for training our model.

We compare the results of our models with fully supervised and semi-supervised learning, and compare the quantitative results in Fig. 3. From this table, we see that our semi-supervised version exhibited very promising capabilities in handling unlabeled data, which resulted in improved performances compared to the versions using only labeled data for training. Qualitative results are also shown in Fig. 4 for visualization purposes.

Figure 4: Visualization of shape reconstruction using our model. (a) Input image , (b) predicted voxel using our supervised version, (c) predicted voxel using our semi-supervised model, (d) ground truth voxel , (e) mesh of (b), (f) mesh of (c), and (g) mesh of (d). Note that (e)-(g) are produced using the disentangled camera pose information.

To perform more complete comparisons with recent approaches not requiring ground truth 3D shape information, we consider the works of MVC [24] and DRC [25]. Note that both DRC and MVC require multi-view images as inputs for training; morevoer, DRC requires ground truth camera pose information. Table. 2 lists and compares the performances, in which our method clearly achieved the best results among all. Fig. 5 additionally presents qualitative results, which visually present and compare the 3D reconstruction abilities of different models.

Figure 5: Visualization of shape reconstruction using different weakly or semi-supervised methods. (a) Input image , (b) predicted voxel of DRC [25], (c) predicted voxel of our method, and (d) ground truth voxel . Note that DRC requires supervision of pose information, while ours does not.

4.4 Camera Pose Prediction

To demonstrate the ability of our model in disentangling camera pose information without such supervision, we transform our voxels into pose-aware shapes with our predicted and manipulated camera poses, as depicted in Fig. 6. From the results shown in figure, we see that the use of our model for learning interpretable visual representations from 2D images, including shape and camera pose features, can be successfully verified.

We also present the quantitative results of camera pose prediction compared to MVC [24] in Table. 3. It is worth pointing out that, MVC [24] requires multi-view images as their inputs. From this table, we see that our model succesfully disentangle and predict the camera pose information, without supervision of such information.

Figure 6: Visualization of pose-aware 3D reconstruction. (a) Input image , (b) predicted 2D mask, (c) predicted 3D voxels using our disentangled pose, and (d) ground truth voxel , (e) mesh of (c), and (f) mesh of (d).

4.5 Ablation Studies

To assess the contributions of each component and the design of our proposed network architecture, we perform the following ablation studies.

Network Architecture. Recall that our proposed voxel decoder consists of three separate 2D deconvolution layers in three directions, with each of the channel dimensions corresponding to height, width, and depth, respectively. We compare our proposed voxel decoder with the one using 2D deconvolution layers in only one direction [19]. For fair comparisons, we fix other components in our network, and compare the performances of the networks utilizing these two different kinds of voxel decoders. From the results shown in Table 4, we see that our decoder design resulted in improved 3D shape reconstruction performances.

2D-3D Self-Consistency. We assess the contribution of our network module in observing 2D-3D self-consistency. For simplicity, we only consider ray consistency loss with and without the projection loss, and we use ground truth camera pose to calculate (3.2) and (4). Then, we compare the resulting 3D reconstruction results.

From the comparison results shown in Table 5, we confirm that the introduced 2D-3D consistency combining both ray consistency and projection losses would improve the performance of 3D reconstruction. Thus, exploiting this property for 3D shape reconstruction would be preferable.

Method airplane chair
MVC [24] 35.10° 9.41°
Ours (Sup) 11.24° 9.82°
Ours (Semi) 9.34° 6.24°
Table 3: Mean prediction error of camera poses. Note that MVC requires multiple input images, while supervised/semi-supervised versions of our model always observe a single input without utilizing ground truth pose information. Note that Ours (Sup) is trained using only 15% of the data with ground truth shape/mask information, while Ours (Semi) utilizes both 15% labeled and 85% unlabeled data for training.

5 Conclusion

In this paper, we proposed a deep learning framework for single-image 3D reconstruction. Our proposed model is able to learn deep representation from a single 2D input for recovering its 3D voxel. This is achieved by disentangling unknown camera pose information from the above features via exploiting 2D-3D self-consistency. More importantly, no camera pose information, classification labels, or discriminator is required. Our method can be trained on fully-supervised setting as most 3D shape reconstruction models do, which utilize 2D images and their ground truth 3D voxels for training. Also, it can be trained on semi-supervised settings, which use additional unlabeled 2D images to further enhance the reconstruction results. Both quantitative and qualitative results demonstrated that our method was able to produce satisfactory results when comparing to state-of-the-art approaches with fully or semi-supervised settings. Thus, the effectiveness and robustness of our model can be successively verified.

Decoder design airplane car chair Mean
1 deconv 68.6 83.7 56.2 69.5
3 deconvs 69.2 85.8 56.7 70.6
Table 4: Comparisons of voxel decoder designs. Note that 1 deconv indicates the voxel decoder using 2D deconvolution layers in only one direction, while 3 deconvs denotes our design using doconvolution layers in three directions.
Method airplane car chair Mean
Ray 53.5 65.7 49.3 56.2
Ray + Projection 55.7 66.0 49.6 57.1
Table 5: Evaluation of different losses for 2D-3D self-consistency. Ray means that only use ray consistency loss, and Ray + Projection means using both ray consistency loss and projection loss.


  • [1] T. J. Cashman and A. W. Fitzgibbon. What shape are dolphins? building 3d morphable models from 2d images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):232–244, Jan. 2013.
  • [2] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q.-X. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An information-rich 3d model repository. CoRR, abs/1512.03012, 2015.
  • [3] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Neural Information Processing Systems (NIPS), 2016.
  • [4] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European Conference Computer Vision (ECCV), 2016.
  • [5] J. Delanoy, M. Aubry, P. Isola, A. Efros, and A. Bousseau. 3d sketching using multi-view deep volumetric prediction. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 1(21), may 2018.
  • [6] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In

    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    , pages 2463–2471, 2017.
  • [7] M. Gadelha, S. Maji, and R. Wang. 3d shape induction from 2d views of multiple objects. In International Conference on 3D Vision (3DV), pages 402–411, 2017.
  • [8] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In European Conference Computer Vision (ECCV), 2016.
  • [9] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In Neural Information Processing Systems (NIPS), 2014.
  • [10] E. Grant, P. Kohli, and M. van Gerven. Deep disentangled representations for volumetric reconstruction. In ECCV Workshops, 2016.
  • [11] J. Gwak, C. B. Choy, M. Chandraker, A. Garg, and S. Savarese. Weakly supervised 3d reconstruction with adversarial constraint. In 3D Vision (3DV), International Conference on 3D Vision, 2017.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [13] D. Jimenez Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In Neural Information Processing Systems (NIPS), pages 4996–5004, 2016.
  • [14] A. Kar, C. Häne, and J. Malik. Learning a multi-view stereo machine. In Neural Information Processing Systems (NIPS), 2017.
  • [15] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014.
  • [16] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network. In Neural Information Processing Systems (NIPS), 2015.
  • [17] Z. Lun, M. Gadelha, E. Kalogerakis, S. Maji, and R. Wang. 3d shape reconstruction from sketches via multi-view convolutional networks. In International Conference on 3D Vision (3DV), 2017.
  • [18] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. In International Conference on Learning Representations (ICLR), 2016.
  • [19] S. R. Richter and S. Roth. Matryoshka networks: Predicting 3d geometry via nested shape layers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [20] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241. Springer International Publishing, 2015.
  • [21] N. Savinov, C. Häne, L. Ladický, and M. Pollefeys. Semantic 3d reconstruction with continuous regularization and ray potentials using a visibility consistency constraint. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5460–5469, June 2016.
  • [22] A. A. Soltani, H. Huang, J. Wu, T. D. Kulkarni, and J. B. Tenenbaum. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2511–2519, 2017.
  • [23] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In IEEE International Conference on Computer Vision (ICCV), pages 2107–2115, 2017.
  • [24] S. Tulsiani, A. A. Efros, and J. Malik. Multi-view consistency as supervisory signal for learning shape and pose prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [25] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 209–217, 2017.
  • [26] A. O. Ulusoy, A. Geiger, and M. J. Black. Towards probabilistic volumetric reconstruction using ray potentials. In International Conference on 3D Vision (3DV), pages 10–18, 2015.
  • [27] H. Wang, J. Yang, W. Liang, and X. Tong. Deep single-view 3d object reconstruction with visual hull embedding. In

    Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)

    , 2019.
  • [28] J. Wu, Y. Wang, T. Xue, X. Sun, W. T. Freeman, and J. B. Tenenbaum. MarrNet: 3D Shape Reconstruction via 2.5D Sketches. In Neural Information Processing Systems (NIPS), 2017.
  • [29] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Neural Information Processing Systems (NIPS), 2016.
  • [30] R. Zhu, H. K. Galoogahi, C. Wang, and S. Lucey. Rethinking reprojection: Closing the loop for pose-aware shape reconstruction from a single image. In IEEE International Conference on Computer Vision (ICCV), pages 57–65, Oct 2017.


Appendix A Implementation Details

We implement the proposed network by PyTorch. The image encoder

, voxel decoder , and mask decoder are not pre-trained and are all randomly initialized. We choose to use ADAM optimizer to train , and . With fully-supervised learning settings, the learning rate for , , and is set to . As for the semi-supervised learning, the learning rates for , , and are all set to . The batch size is set to 48. The positive weight for is set to 3. We train our model on a single NVIDIA GeForce GTX 1080 Ti GPU with 11 GB memory. More details about the network architecture is in Sec. B. Besides, we will release our code of this work so that more details can be shown.

Appendix B Network Architecture

We now describe the detailed network architectures, including image encoder , mask decoder , and voxel decoder .

Image Encoder . Our image encoder

has residual structures. More specifically, it is composed of five designed residual blocks. The activation function for each block output is a leaky rectified linear unit (Leaky ReLU).

Figure 7: Different voxel decoder designs. (a) 3D deconvolution, (b) one 2D deconvolution, and (c) our proposed module with three 2D deconvolutions. Note that while the uses of 2D deconvlutions require fewer number of network parameters than that of 3D deconvolutions, our design using three 3D deconvolution achieves the best performance than those using either one 2D or full 3D deconvolutions.

Mask Decoder . Our mask decoder has a U-Net based structure. It is composed of 5 upsampling blocks.

Voxel Decoder . The proposed voxel decoder contains three separate 2D deconvolution layers, each of the three channel dimensions corresponding to height, width, and depth, respectively. As illustrated in Fig. 7

, the two output tensors are first transposed so that the channels would be mapped to different spatial dimensions. We take the element-wise mean from the three tensors to obtain the predicted shape. The complete architecture design of each network component is shown in Fig. 


Figure 8: Detailed designs of each component in our network architecture. (a) Image encoder , (b) mask decoder , and (c) voxel decoder .

Appendix C Visualization and Comparisons

We now provide additional visualization of 3D shape reconstruction using different weakly or semi-supervised methods, including ours. Note that both DRC and MVC require multi-view images as inputs for training. In addition, DRC requires ground truth camera pose information when learning its model. Example results are shown in Fig. 9.

Figure 9: Visualization of shape reconstruction using different weakly or semi-supervised methods. Note that our semi-supervised method utilized both 15% labeled and 85% unlabeled data for training. (a) Input image, (b) predicted voxel of DRC, (c) predicted voxel of MVC, (d) predicted voxel of our method, and (e) ground truth voxel.