3D reconstruction is an important problem in robotics, CAD, virtual reality, and augmented reality. Traditional methods, such as Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM), match image features across views. However, establishing feature correspondences becomes extremely difficult when viewpoints are separated by a large margin, because of local appearance changes or self-occlusions. To overcome these limitations, several deep-learning-based approaches, including 3D-R2N2, LSM, and 3DensiNet, have been proposed to recover the 3D shape of an object and have obtained promising results.
Both 3D-R2N2 and LSM formulate multi-view 3D reconstruction as a sequence learning problem and use recurrent neural networks (RNNs) to fuse multiple feature maps extracted by a shared encoder from input images. The feature maps are incrementally refined as more views of an object become available. However, RNN-based methods suffer from three limitations. First, given the same set of images in different orders, RNNs cannot estimate the 3D shape of an object consistently, because they are permutation-variant. Second, because RNNs lose long-term memory, the input images cannot be fully exploited to refine the reconstruction. Third, RNN-based methods are time-consuming, since input images are processed sequentially without parallelization.
To address the issues mentioned above, we propose Pix2Vox, a novel framework for single-view and multi-view 3D reconstruction that contains four modules: encoder, decoder, context-aware fusion, and refiner. The encoder and decoder generate coarse 3D volumes from multiple input images in parallel, which eliminates the effect of the order of the input images and accelerates the computation. Then, the context-aware fusion module selects high-quality reconstructions from all coarse 3D volumes and generates a fused 3D volume, which fully exploits the information in all input images without long-term memory loss. Finally, the refiner corrects wrongly recovered parts of the fused 3D volume to obtain a refined reconstruction. To achieve a good balance between accuracy and model size, we implement two versions of the proposed framework: Pix2Vox-F and Pix2Vox-A (Figure 1).
The contributions can be summarized as follows:
- We present a unified framework for both single-view and multi-view 3D reconstruction, namely Pix2Vox. We equip Pix2Vox with a well-designed encoder, decoder, and refiner, which together handle 3D reconstruction effectively on both synthetic and real-world images.
- We propose a context-aware fusion module that adaptively selects, in parallel, high-quality reconstructions for each part from different coarse 3D volumes to produce a fused reconstruction of the whole object. To the best of our knowledge, this is the first work to exploit context across multiple views for 3D reconstruction.
2 Related Work
Single-view 3D Reconstruction. Theoretically, recovering 3D shape from single-view images is an ill-posed problem. To address this issue, many attempts have been made, such as ShapeFromX [1, 17], where X may represent silhouettes, shading, or texture. However, these methods are barely applicable in real-world scenarios, because all of them require strong presumptions and abundant expertise in natural images.
With the success of generative adversarial networks (GANs) and variational autoencoders (VAEs), 3D-VAE-GAN adopts a GAN and a VAE to generate 3D objects from a single-view image. However, 3D-VAE-GAN requires class labels for reconstruction. MarrNet reconstructs 3D objects by estimating depth, surface normals, and silhouettes of 2D images, which is challenging and usually leads to severe distortion. OGN and O-CNN use octrees to represent higher-resolution volumetric 3D objects with a limited memory budget. However, OGN representations are complex and consume more computational resources because of the complexity of octree representations. PSGN and 3D-LMNet generate point clouds from single-view images. However, the points have a large degree of freedom in the point cloud representation because of the limited connections between points. Consequently, these methods cannot recover 3D volumes accurately.
Multi-view 3D Reconstruction. SfM and SLAM methods are successful in handling many scenarios. These methods match features among images and estimate the camera pose for each image. However, the matching process becomes difficult when multiple viewpoints are separated by a large margin. Besides, scanning all surfaces of an object before reconstruction is sometimes impossible, which leads to incomplete 3D shapes with occluded or hollowed-out areas.
Powered by large-scale datasets of 3D CAD models (e.g., ShapeNet), deep-learning-based methods have been proposed for 3D reconstruction. Both 3D-R2N2 and LSM use RNNs to infer 3D shape from single or multiple input images and achieve impressive results. However, RNNs are time-consuming and permutation-variant, which leads to inconsistent reconstruction results. 3DensiNet uses max pooling to aggregate the features from multiple images. However, max pooling only extracts maximum values from features, which may ignore other valuable features that are useful for 3D reconstruction.
3 The Method
The proposed Pix2Vox aims to reconstruct the 3D shape of an object from either single or multiple RGB images. The 3D shape of an object is represented by a 3D voxel grid, where 0 is an empty cell and 1 denotes an occupied cell. The key components of Pix2Vox are shown in Figure 2. First, the encoder produces feature maps from input images. Second, the decoder takes each feature map as input and generates a corresponding coarse 3D volume. Third, single or multiple 3D volumes are forwarded to the context-aware fusion module, which adaptively selects high-quality reconstructions for each part from the coarse 3D volumes to obtain a fused 3D volume. Finally, the refiner with skip connections further refines the fused 3D volume to generate the final reconstruction result.
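The data flow described above can be sketched in a few lines of plain Python. The four callables (`encoder`, `decoder`, `fusion`, `refiner`) are hypothetical stand-ins, not the actual networks, and the sketch simplifies the fusion step by passing it only the coarse volumes. It is meant only to illustrate that every view is encoded and decoded independently before fusion, so the result does not depend on the order of the views as long as the fusion step itself is order-independent.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")
Feature = TypeVar("Feature")
Volume = TypeVar("Volume")


def reconstruct(
    images: Sequence[Image],
    encoder: Callable[[Image], Feature],
    decoder: Callable[[Feature], Volume],
    fusion: Callable[[List[Volume]], Volume],
    refiner: Callable[[Volume], Volume],
) -> Volume:
    """Illustrative pipeline: per-view encode/decode, then fuse, then refine."""
    if not images:
        raise ValueError("at least one input image is required")
    # Each view is processed on its own, so this loop could run in parallel.
    coarse_volumes = [decoder(encoder(image)) for image in images]
    # All coarse volumes are merged into a single fused volume.
    fused_volume = fusion(coarse_volumes)
    # The refiner corrects the fused volume to give the final result.
    return refiner(fused_volume)
```

Because no state is carried from one view to the next, there is no recurrent memory to lose, which is the property the text contrasts with RNN-based fusion.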
3.2 Network Architecture
Figure 3 shows the detailed architectures of Pix2Vox-F and Pix2Vox-A. The former has far fewer parameters and lower computational complexity, while the latter has more parameters and can reconstruct more accurate 3D shapes at a higher computational cost.
3.2.1 Encoder
The encoder computes a set of features from which the decoder recovers the 3D shape of the object. The first nine convolutional layers of a pre-trained VGG-16, along with the corresponding batch normalization layers and ReLU activations, are used to extract a feature tensor from an input image. This feature extraction is followed by three sets of 2D convolutional layers, batch normalization layers, and ELU layers to embed semantic information into feature vectors. In Pix2Vox-F, the kernel size of the first convolutional layer is , while the kernel sizes of the other two are . The number of output channels of the convolutional layer starts with and decreases by half for the subsequent layer and ends up with . In Pix2Vox-A, the kernel sizes of the three convolutional layers are , , and , respectively. The output channels of the three convolutional layers are , , and , respectively. After the second convolutional layer, there is a max pooling layer with kernel sizes of and in Pix2Vox-F and Pix2Vox-A, respectively. The feature vectors produced by Pix2Vox-F and Pix2Vox-A are of sizes and , respectively.
3.2.2 Decoder
The decoder is responsible for transforming the information in 2D feature maps into 3D volumes. There are five 3D transposed convolutional layers in both Pix2Vox-F and Pix2Vox-A. Specifically, the first four transposed convolutional layers have a kernel size of , with a stride of and padding of . There is an additional transposed convolutional layer with a bank of filter. Each transposed convolutional layer is followed by a batch normalization layer and a ReLU activation, except for the last layer, which is followed by a sigmoid function. In Pix2Vox-F, the numbers of output channels of the transposed convolutional layers are , , , , and , respectively. In Pix2Vox-A, the numbers of output channels of the five transposed convolutional layers are , , , , and , respectively. The decoder outputs a voxelized shape in the object's canonical view.
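To make the upsampling behaviour of a stack of transposed convolutions concrete, the sketch below computes output resolutions with the standard size formula for a transposed convolution with no dilation and no output padding: out = (in - 1) * stride - 2 * padding + kernel. The kernel size, stride, and padding used here (4, 2, 1 for the first four layers and 1, 1, 0 for the last), as well as the starting resolution of 2, are assumptions chosen for illustration, since the actual values are not legible in the text above.

```python
from typing import List, Sequence, Tuple


def transposed_conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length along one axis of a transposed convolution
    (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def decoder_resolutions(start: int, layers: Sequence[Tuple[int, int, int]]) -> List[int]:
    """Resolution after each layer, given (kernel, stride, padding) per layer."""
    sizes = [start]
    for kernel, stride, padding in layers:
        sizes.append(transposed_conv_output_size(sizes[-1], kernel, stride, padding))
    return sizes


# Hypothetical settings: four resolution-doubling layers, then a
# resolution-preserving layer with a kernel size of 1.
ASSUMED_LAYERS = [(4, 2, 1)] * 4 + [(1, 1, 0)]

print(decoder_resolutions(2, ASSUMED_LAYERS))  # [2, 4, 8, 16, 32, 32]
```

Under these assumed settings each of the first four layers doubles the resolution along every axis, and the final layer changes only the number of channels.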
3.2.3 Context-aware Fusion
From different viewpoints, we can see different visible parts of an object. The reconstruction quality of visible parts is much higher than that of invisible parts. Inspired by this observation, we propose a context-aware fusion module to adaptively select high-quality reconstructions for each part (e.g., table legs) from different coarse 3D volumes. The selected reconstructions are fused to generate a 3D volume of the whole object (Figure 4).
As shown in Figure 5, given coarse 3D volumes and the corresponding context, the context-aware fusion module generates a score map for each coarse volume and then fuses them into one volume by the weighted summation of all coarse volumes according to their score maps. The spatial information of voxels is preserved in the context-aware fusion module, and thus Pix2Vox can utilize multi-view information to recover the structure of an object better.
Specifically, the context-aware fusion module generates the context $c_r$ of the $r$-th coarse volume $v_r$ by concatenating the outputs of the last two layers in the decoder. Then, the context scoring network generates a score map $m_r$ for the context of the $r$-th coarse volume. The context scoring network is composed of five sets of 3D convolutional layers, each of which has a kernel size of and padding of , followed by a batch normalization and a leaky ReLU activation. The numbers of output channels of convolutional layers are , , , , and , respectively. The learned scores are normalized across all views, with softmax as the normalization function. Therefore, the score $s_r^{(i,j,k)}$ at position $(i, j, k)$ for the $r$-th coarse volume can be calculated as

$$ s_r^{(i,j,k)} = \frac{\exp\left(m_r^{(i,j,k)}\right)}{\sum_{p=1}^{n} \exp\left(m_p^{(i,j,k)}\right)} $$

where $n$ represents the number of views. Finally, the fused volume $v_f$ is produced by summing the products of the coarse volumes and their corresponding scores:

$$ v_f^{(i,j,k)} = \sum_{r=1}^{n} s_r^{(i,j,k)} \, v_r^{(i,j,k)} $$
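The score normalization and weighted summation described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: volumes and score maps are flattened to lists of floats, the raw scores are taken as given rather than produced by a scoring network, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax_over_views(scores: Sequence[float]) -> List[float]:
    """Softmax of the raw scores that all views assign to one voxel position."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def fuse_volumes(
    volumes: Sequence[Sequence[float]],
    score_maps: Sequence[Sequence[float]],
) -> List[float]:
    """Fuse per-view coarse volumes by a per-voxel softmax-weighted sum.

    volumes[r][p] is the occupancy of view r at flattened position p;
    score_maps[r][p] is the raw (unnormalized) score at the same position.
    """
    n_views = len(volumes)
    if n_views == 0 or n_views != len(score_maps):
        raise ValueError("need one score map per coarse volume")
    fused = []
    for p in range(len(volumes[0])):
        weights = softmax_over_views([score_maps[r][p] for r in range(n_views)])
        fused.append(sum(w * volumes[r][p] for r, w in enumerate(weights)))
    return fused
```

With equal raw scores the result reduces to a plain average of the views, and because the softmax and the sum both run over the set of views, reordering the inputs leaves the fused volume unchanged up to floating-point rounding.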
Table 1. Single-view reconstruction on the ShapeNet testing set, by category.
|Category||3D-R2N2||3DensiNet||OGN||DRC||PSGN||Pix2Vox-F||Pix2Vox-A|
3.2.4 Refiner
The refiner can be seen as a residual network, which aims to correct wrongly recovered parts of a 3D volume. It follows the idea of a 3D encoder-decoder with U-Net connections. With the help of the U-Net connections between the encoder and decoder, the local structure in the fused volume can be preserved. Specifically, the encoder has three 3D convolutional layers, each of which has a bank of filters with padding of , followed by a batch normalization layer, a leaky ReLU activation, and a max pooling layer with a kernel size of . The numbers of output channels of convolutional layers are , , and , respectively. The encoder is finally followed by two fully connected layers with dimensions of and . The decoder consists of three transposed convolutional layers, each of which has a bank of filters with padding of and stride of . Except for the last transposed convolutional layer, which is followed by a sigmoid function, the other layers are followed by a batch normalization layer and a ReLU activation.
3.2.5 Loss Function
The loss function of the network is defined as the mean of the voxel-wise binary cross entropies between the reconstructed object and the ground truth. More formally, it can be defined as

$$ \ell = -\frac{1}{N} \sum_{i=1}^{N} \left[ gt_i \log(p_i) + (1 - gt_i) \log(1 - p_i) \right] $$

where $N$ denotes the number of voxels in the ground truth, and $p_i$ and $gt_i$ represent the predicted occupancy and the corresponding ground truth. The smaller the value of $\ell$, the closer the prediction is to the ground truth.
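As a small worked example of this loss, the following plain-Python sketch computes the mean voxel-wise binary cross entropy over flattened lists. The function name and the clamping constant `eps` are assumptions added for illustration; the clamp only guards against taking the logarithm of zero and is not described in the text.

```python
import math
from typing import Sequence


def mean_binary_cross_entropy(
    predicted: Sequence[float],
    ground_truth: Sequence[float],
    eps: float = 1e-7,
) -> float:
    """Mean voxel-wise binary cross entropy between predicted occupancy
    probabilities and binary ground-truth occupancies."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, gt in zip(predicted, ground_truth):
        p = min(max(p, eps), 1.0 - eps)  # assumed clamp to avoid log(0)
        total += gt * math.log(p) + (1.0 - gt) * math.log(1.0 - p)
    return -total / len(predicted)
```

A prediction of 0.5 for every voxel gives a loss of log 2 regardless of the ground truth, and the loss falls toward zero as the predicted probabilities approach the true occupancies.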
4.1 Datasets and Metrics
Datasets We evaluate the proposed Pix2Vox-F and Pix2Vox-A on both synthetic images of objects from the ShapeNet  dataset and real images from the Pascal 3D+  dataset. More specifically, we use a subset of ShapeNet consisting of 13 major categories and 44k 3D models following the settings of . As for Pascal 3D+, there are 12 categories and 22k models.
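To evaluate the quality of the output from the proposed methods, we binarize the probabilities at a fixed threshold of 0.4 and use intersection over union (IoU) as the similarity measure. More formally,

$$ \mathrm{IoU} = \frac{\sum_{i,j,k} I\left(p_{(i,j,k)} > t\right) I\left(gt_{(i,j,k)}\right)}{\sum_{i,j,k} I\left[ I\left(p_{(i,j,k)} > t\right) + I\left(gt_{(i,j,k)}\right) \right]} $$

where $p_{(i,j,k)}$ and $gt_{(i,j,k)}$ represent the predicted occupancy probability and the ground truth at $(i, j, k)$, respectively. $I(\cdot)$ is an indicator function and $t$ denotes a voxelization threshold. Higher IoU values indicate better reconstruction results.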
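The metric can be illustrated with a short plain-Python sketch over flattened voxel lists. The function name is hypothetical, and returning 1.0 when both the binarized prediction and the ground truth are empty is a convention chosen for this sketch, not something stated in the text.

```python
from typing import Sequence


def voxel_iou(
    predicted: Sequence[float],
    ground_truth: Sequence[int],
    threshold: float = 0.4,
) -> float:
    """Intersection over union between a thresholded prediction and a
    binary ground-truth occupancy grid (both flattened)."""
    if len(predicted) != len(ground_truth):
        raise ValueError("inputs must have the same number of voxels")
    intersection = 0
    union = 0
    for p, gt in zip(predicted, ground_truth):
        pred_occupied = p > threshold  # binarize the probability
        gt_occupied = gt == 1
        if pred_occupied and gt_occupied:
            intersection += 1
        if pred_occupied or gt_occupied:
            union += 1
    # Assumed convention: two empty grids count as a perfect match.
    return intersection / union if union else 1.0
```

For instance, with predictions `[0.9, 0.3, 0.5, 0.1]` and ground truth `[1, 1, 0, 0]`, one voxel is occupied in both grids and three are occupied in at least one, so the IoU is 1/3.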
4.2 Implementation Details
We use RGB images as input to train the proposed methods with a shape batch size of . The output voxelized reconstruction is in size. We implement our network in PyTorch (source code will be publicly available) and train both Pix2Vox-F and Pix2Vox-A using an Adam optimizer with a $\beta_1$ of and a $\beta_2$ of . The initial learning rate is set to and decayed by a factor of 2 after 150 epochs. The optimization is set to stop after 250 epochs.
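The learning-rate schedule can be written as a small function. This sketch reads "decayed by 2 after 150 epochs" as a single halving that takes effect from epoch 150 onward (counting epochs from 0), which is an interpretation rather than something the text states explicitly. The initial learning rate is left as a required argument because its value is not legible above; the value `1e-3` in the usage lines is purely hypothetical.

```python
def learning_rate(
    epoch: int,
    initial_lr: float,
    decay_epoch: int = 150,
    decay_factor: float = 2.0,
    total_epochs: int = 250,
) -> float:
    """Learning rate for a zero-based epoch under a single step decay."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch is outside the training range")
    if epoch >= decay_epoch:
        return initial_lr / decay_factor
    return initial_lr


# Hypothetical initial learning rate, for illustration only.
for epoch in (0, 149, 150, 249):
    print(epoch, learning_rate(epoch, initial_lr=1e-3))
```

In a PyTorch training loop, a step schedule of this kind would typically be handled by a learning-rate scheduler attached to the optimizer instead of a hand-written function.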
4.3 Reconstruction of Synthetic Images
Table 2. Mean IoU of multi-view reconstruction on the ShapeNet testing set, by number of views.
|Methods||1 view||2 views||3 views||4 views||5 views||8 views||12 views||16 views||20 views|
To evaluate the performance of the proposed methods in handling synthetic images, we compare our methods against several state-of-the-art methods on the ShapeNet testing set. Table 1 shows the performance of single-view reconstruction, while Table 2 shows the mean IoU scores of multi-view reconstruction with different numbers of views.
In single-view reconstruction, Pix2Vox-F and Pix2Vox-A significantly outperform the other methods (Table 1). Pix2Vox-A increases IoU over 3D-R2N2 by 18%. In multi-view reconstruction, Pix2Vox-A consistently outperforms 3D-R2N2 for every number of views (Table 2). The IoU of Pix2Vox-A is 13% higher than that of 3D-R2N2.
Figure 6 shows several reconstruction examples from the ShapeNet testing set. Both Pix2Vox-F and Pix2Vox-A are able to recover the thin parts of objects, such as lamps and table legs. Comparing Pix2Vox-A with Pix2Vox-F, we also observe that the higher-dimensional feature maps in Pix2Vox-A contribute to 3D reconstruction. Moreover, in multi-view reconstruction, both Pix2Vox-A and Pix2Vox-F produce better results than 3D-R2N2.
4.4 Reconstruction of Real-world Images
Table 3. Mean IoU of single-view reconstruction on the Pascal 3D+ testing set, by category.
|Category||3D-R2N2||3DensiNet||OGN||DRC||Pix2Vox-F||Pix2Vox-A|
To evaluate the performance of the proposed methods on real-world images, we test our methods for single-view reconstruction on the Pascal 3D+ dataset. First, each image is cropped according to the bounding box of the largest object within the image. Then, the cropped images are rescaled to the input size of the reconstruction network.
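The crop-and-rescale preprocessing can be sketched with plain nested lists standing in for images. This is only an illustration under stated assumptions: bounding boxes are given as `(left, top, right, bottom)` with exclusive right and bottom edges, "largest" is taken to mean largest box area, and nearest-neighbour resampling is used because the text does not say which interpolation was applied. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
Image = List[List[int]]          # rows of pixel values


def largest_box(boxes: Sequence[Box]) -> Box:
    """Pick the bounding box with the largest area."""
    return max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by the box out of the image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def resize_nearest(image: Image, out_height: int, out_width: int) -> Image:
    """Rescale with nearest-neighbour sampling (assumed interpolation)."""
    in_height, in_width = len(image), len(image[0])
    return [
        [image[y * in_height // out_height][x * in_width // out_width]
         for x in range(out_width)]
        for y in range(out_height)
    ]


def preprocess(image: Image, boxes: Sequence[Box], size: int) -> Image:
    """Crop to the largest object, then rescale to a square network input."""
    return resize_nearest(crop(image, largest_box(boxes)), size, size)
```

In practice an image library would do both steps, but the order matters either way: cropping first keeps the object at the highest available resolution before it is resampled to the network's input size.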
The mean IoU of each category is reported in Table 3. Both Pix2Vox-F and Pix2Vox-A significantly outperform the competing approaches on the Pascal 3D+ testing set. Compared with other methods, our methods are able to better reconstruct the overall shape and capture finer details from the input images. The qualitative analysis is given in Figure 7, which indicates that the proposed methods are more effective in handling real-world scenarios.
4.5 Reconstruction of Unseen Objects
In order to test how well our methods can generalize to unseen objects, we conduct additional experiments on ShapeNet. More specifically, all models are trained on the 13 major categories of ShapeNet and tested on the remaining 44 categories of ShapeNet. All pretrained models have never “seen” either the objects in these categories or the labels of objects before. The reconstruction results of 3D-R2N2 are obtained with the released pretrained model.
Several reconstruction results are presented in Figure 8. The reconstruction IoU of 3D-R2N2 on unseen objects is , while Pix2Vox-F and Pix2Vox-A are and , respectively. Experimental results demonstrate that 3D-R2N2 can hardly recover the shape of unseen objects. In contrast, Pix2Vox-F and Pix2Vox-A show satisfactory generalization abilities to unseen objects.
4.6 Ablation Study
In this section, we validate the context-aware fusion and the refiner by ablation studies.
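Context-aware fusion. To quantitatively evaluate the context-aware fusion, we replace the context-aware fusion in Pix2Vox-A with average fusion, where the fused voxel $v_f^{(i,j,k)}$ can be calculated as

$$ v_f^{(i,j,k)} = \frac{1}{n} \sum_{r=1}^{n} v_r^{(i,j,k)} $$

where $v_r$ is the $r$-th coarse volume and $n$ is the number of views.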
Table 2 shows that the context-aware fusion performs better than the average fusion in selecting the high-quality reconstructions for each part from different coarse volumes.
Refiner. Pix2Vox-A uses a refiner to further refine the fused 3D volume. For single-view reconstruction on ShapeNet, the IoU of Pix2Vox-A is . In contrast, the IoU of Pix2Vox-A without the refiner decreases to . Removing the refiner considerably degrades the reconstruction accuracy. However, as the number of views increases, the effect of the refiner becomes weaker. The reconstruction results of the two networks (with and without the refiner) are almost the same when there are more than 3 input images.
The ablation studies indicate that both the context-aware fusion and the refiner contribute substantially to the performance improvements of our framework over previous state-of-the-art methods.
4.7 Space and Time Complexity
|Forward, 1-view (ms)||73.35||37.90||9.25||9.90|
|Forward, 2-views (ms)||108.11||N/A||12.05||13.69|
|Forward, 4-views (ms)||112.36||N/A||23.26||26.31|
|Forward, 8-views (ms)||117.64||N/A||52.63||55.56|
The running times are obtained on the same PC with an NVIDIA GTX 1080 Ti GPU. For more precise timing, we exclude the reading and writing time when evaluating the forward and backward inference time. Both Pix2Vox-F and Pix2Vox-A are about times faster in forward inference than 3D-R2N2 in single-view reconstruction. In backward inference, Pix2Vox-F and Pix2Vox-A are about and times faster than 3D-R2N2, respectively.
To give a detailed analysis of the context-aware fusion module, we visualize the score maps of three coarse volumes when reconstructing the 3D shape of a table from 3-view images, as shown in Figure 4. The table top in the coarse volume on the right is clearly of low quality, and the score of the corresponding part is lower than in the other two coarse volumes. The fused 3D volume is obtained by combining the selected high-quality reconstruction parts, so that bad reconstructions are effectively eliminated by our scoring scheme.
Although our methods outperform state-of-the-art methods, their reconstruction results still have a low resolution. We can further improve the reconstruction resolution in future work by introducing GANs.
5 Conclusion and Future Works
In this paper, we propose a unified framework for both single-view and multi-view 3D reconstruction, named Pix2Vox. Compared with existing methods that fuse deep features generated by a shared encoder, the proposed method fuses multiple coarse volumes produced by a decoder and better preserves multi-view spatial constraints. Quantitative and qualitative evaluations for both single-view and multi-view reconstruction on the ShapeNet and Pascal 3D+ benchmarks indicate that the proposed methods outperform state-of-the-art methods by a large margin. Pix2Vox is also computationally efficient: it is 24 times faster than 3D-R2N2 in terms of backward inference time. In future work, we will work on improving the resolution of the reconstructed 3D objects. In addition, we plan to extend Pix2Vox to reconstruct 3D objects from RGB-D images.
This work was supported by the National Natural Science Foundation of China under Project No. 61772158, 61702136 and U1711265. We gratefully acknowledge Prof. Junbao Li and Huanyu Liu for providing GPU hours for this research. We would also like to thank Prof. Wangmeng Zuo for his valuable feedback and help during this research.
- [1] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. TPAMI, 37(8):1670–1687, 2015.
- [2] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV 2016.
- [3] E. Dibra, H. Jain, A. C. Öztireli, R. Ziegler, and M. H. Gross. Human shape from silhouettes using generative HKS descriptors and cross-modal neural networks. In CVPR 2017.
- [4] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR 2017.
- [5] J. Fuentes-Pacheco, J. R. Ascencio, and J. M. Rendón-Mancha. Visual simultaneous localization and mapping: a survey. Artif. Intell. Rev., 43(1):55–81, 2015.
- [6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS 2014.
- [7] K. Hwang and W. Sung. Single stream parallelization of generalized LSTM-like RNNs on a GPU. In ICASSP 2015.
- [8] A. Kar, C. Häne, and J. Malik. Learning a multi-view stereo machine. In NIPS 2017.
- [9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR 2015.
- [10] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv, abs/1312.6114, 2013.
- [11] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
- [12] P. Mandikal, N. K. L., M. Agarwal, and V. B. Radhakrishnan. 3D-LMNet: Latent embedding matching for accurate and diverse 3D point cloud reconstruction from a single image. In BMVC 2018.
- [13] O. Özyeşil, V. Voroninski, R. Basri, and A. Singer. A survey of structure from motion. Acta Numerica, 26:305–364, 2017.
- [14] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML 2013.
- [15] S. R. Richter and S. Roth. Discriminative shape from shading in uncalibrated illumination. In CVPR 2015.
- [16] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI 2015.
- [17] S. Savarese, M. Andreetto, H. E. Rushmeier, F. Bernardini, and P. Perona. 3D reconstruction by shadow carving: Theory and practical evaluation. IJCV, 71(3):305–336, 2007.
- [18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR 2015.
- [19] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In ICCV 2017.
- [20] S. Tulsiani. Learning Single-view 3D Reconstruction of Objects and Scenes. PhD thesis, UC Berkeley, 2018.
- [21] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR 2017.
- [22] O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. In ICLR 2016.
- [23] M. Wang, L. Wang, and Y. Fang. 3DensiNet: A robust neural network architecture towards 3D volumetric object prediction from 2D image. In ACM MM 2017.
- [24] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In ECCV 2018.
- [25] P. Wang, Y. Liu, Y. Guo, C. Sun, and X. Tong. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Trans. Graph., 36(4):72:1–72:11, 2017.
- [26] A. P. Witkin. Recovering surface shape and orientation from texture. Artif. Intell., 17(1-3):17–45, 1981.
- [27] J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In NIPS 2017.
- [28] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS 2016.
- [29] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR 2015.
- [30] Y. Xia, Y. Zhang, D. Zhou, X. Huang, C. Wang, and R. Yang. RealPoint3D: Point cloud generation from a single image with complex background. arXiv, abs/1809.02743, 2018.
- [31] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In WACV 2014.
- [32] B. Yang, S. Rosa, A. Markham, N. Trigoni, and H. Wen. Dense 3D object reconstruction from a single depth view. TPAMI, DOI: 10.1109/TPAMI.2018.2868195, 2018.