1 Introduction
3D reconstruction is an important problem in robotics, CAD, virtual reality, and augmented reality. Traditional methods, such as Structure from Motion (SfM) [13] and Simultaneous Localization and Mapping (SLAM) [5], match image features across views. However, establishing feature correspondences becomes extremely difficult when multiple viewpoints are separated by a large margin, due to local appearance changes or self-occlusions [11]. To overcome these limitations, several deep-learning-based approaches, including 3D-R2N2 [2], LSM [8], and 3DensiNet [23], have been proposed to recover the 3D shape of an object and have obtained promising results. To generate 3D volumes, 3D-R2N2 [2] and LSM [8]
formulate multi-view 3D reconstruction as a sequence learning problem and use recurrent neural networks (RNNs) to fuse multiple feature maps extracted by a shared encoder from the input images. The feature maps are incrementally refined as more views of an object become available. However, RNN-based methods suffer from three limitations. First, given the same set of images in different orders, RNNs are unable to estimate the 3D shape of an object consistently, due to permutation variance [22]. Second, due to the long-term memory loss of RNNs, the input images cannot be fully exploited to refine the reconstruction results [14]. Last but not least, RNN-based methods are time-consuming, since the input images are processed sequentially without parallelization [7].

To address the issues mentioned above, we propose Pix2Vox, a novel framework for single-view and multi-view 3D reconstruction that contains four modules: encoder, decoder, context-aware fusion, and refiner. The encoder and decoder generate coarse 3D volumes from multiple input images in parallel, which eliminates the effect of the order of the input images and accelerates the computation. Then, the context-aware fusion module selects high-quality reconstructions from all coarse 3D volumes and generates a fused 3D volume, which fully exploits the information of all input images without long-term memory loss. Finally, the refiner further corrects wrongly recovered parts of the fused 3D volume to obtain a refined reconstruction. To achieve a good balance between accuracy and model size, we implement two versions of the proposed framework: Pix2Vox-F and Pix2Vox-A (Figure 1).
The contributions can be summarized as follows:

We present a unified framework for both single-view and multi-view 3D reconstruction, namely Pix2Vox. We equip Pix2Vox with a well-designed encoder, decoder, and refiner, which together show a powerful ability to handle 3D reconstruction from both synthetic and real-world images.

We propose a context-aware fusion module that adaptively selects high-quality reconstructions for each part from different coarse 3D volumes in parallel to produce a fused reconstruction of the whole object. To the best of our knowledge, this is the first work to exploit context across multiple views for 3D reconstruction.
2 Related Work
Single-view 3D Reconstruction Theoretically, recovering 3D shape from single-view images is an ill-posed problem. To address this issue, many attempts have been made, such as Shape-From-X [1, 17], where X may represent silhouettes [3], shading [15], or texture [26]. However, these methods are barely applicable in real-world scenarios, because all of them require strong assumptions and abundant expertise in natural images [30].
With the success of generative adversarial networks (GANs) [6] and variational autoencoders (VAEs) [10], 3D-VAE-GAN [28] adopts a GAN and a VAE to generate 3D objects from a single-view image. However, 3D-VAE-GAN requires class labels for reconstruction. MarrNet [27] reconstructs 3D objects by estimating the depth, surface normals, and silhouettes of 2D images, which is challenging and usually leads to severe distortion [20]. OGN [19] and O-CNN [25] use octrees to represent higher-resolution volumetric 3D objects with a limited memory budget. However, octree representations are complex and consume more computational resources. PSGN [4] and 3D-LMNet [12] generate point clouds from single-view images. However, the points have a large degree of freedom in the point cloud representation because of the limited connections between points. Consequently, these methods cannot recover 3D volumes accurately
[24].

Multi-view 3D Reconstruction SfM [13] and SLAM [5] methods are successful in many scenarios. These methods match features among images and estimate the camera pose for each image. However, the matching process becomes difficult when multiple viewpoints are separated by a large margin. Besides, scanning all surfaces of an object before reconstruction is sometimes impossible, which leads to incomplete 3D shapes with occluded or hollowed-out areas [32].
Powered by large-scale datasets of 3D CAD models (e.g., ShapeNet [29]), deep-learning-based methods have been proposed for 3D reconstruction. Both 3D-R2N2 [2] and LSM [8] use RNNs to infer 3D shapes from single or multiple input images and achieve impressive results. However, RNNs are time-consuming and permutation-variant, which produces inconsistent reconstruction results. 3DensiNet [23] uses max pooling to aggregate the features from multiple images. However, max pooling extracts only the maximum values from features, which may ignore other valuable features that are useful for 3D reconstruction.
3 The Method
3.1 Overview
The proposed Pix2Vox aims to reconstruct the 3D shape of an object from either single or multiple RGB images. The 3D shape of an object is represented by a binary 3D voxel grid, where 0 denotes an empty cell and 1 denotes an occupied cell. The key components of Pix2Vox are shown in Figure 2. First, the encoder produces feature maps from the input images. Second, the decoder takes each feature map as input and generates a corresponding coarse 3D volume. Third, the single or multiple coarse 3D volumes are forwarded to the context-aware fusion module, which adaptively selects high-quality reconstructions for each part from the coarse 3D volumes to obtain a fused 3D volume. Finally, the refiner with skip connections further refines the fused 3D volume to generate the final reconstruction result.
3.2 Network Architecture
Figure 3 shows the detailed architectures of Pix2Vox-F and Pix2Vox-A. The former has far fewer parameters and lower computational complexity, while the latter has more parameters and can reconstruct more accurate 3D shapes at a higher computational cost.
3.2.1 Encoder
The encoder computes a set of features for the decoder to recover the 3D shape of the object. The first nine convolutional layers, along with the corresponding batch normalization layers and ReLU activations, of a pretrained VGG16 [18] are used to extract a feature tensor from the input image. This feature extraction is followed by three sets of 2D convolutional layers, batch normalization layers, and ELU layers to embed semantic information into feature vectors. In Pix2Vox-F, the kernel size of the first convolutional layer is
while the kernel sizes of the other two are . The number of output channels of the convolutional layers starts at , decreases by half in the subsequent layer, and ends at . In Pix2Vox-A, the kernel sizes of the three convolutional layers are , , and , respectively. The numbers of output channels of the three convolutional layers are , , and , respectively. After the second convolutional layer, there is a max pooling layer with a kernel size of and in Pix2Vox-F and Pix2Vox-A, respectively. The feature vectors produced by Pix2Vox-F and Pix2Vox-A are of sizes and , respectively.

3.2.2 Decoder
The decoder is responsible for transforming information in the 2D feature maps into 3D volumes. There are five 3D transposed convolutional layers in both Pix2Vox-F and Pix2Vox-A. Specifically, the first four transposed convolutional layers have a kernel size of , with a stride of and a padding of . There is an additional transposed convolutional layer with a bank of filters. Each transposed convolutional layer is followed by a batch normalization layer and a ReLU activation, except for the last layer, which is followed by a sigmoid function. In Pix2Vox-F, the numbers of output channels of the transposed convolutional layers are , , , , and , respectively. In Pix2Vox-A, the numbers of output channels of the five transposed convolutional layers are , , , , and , respectively. The decoder outputs a voxelized shape in the object's canonical view.

3.2.3 Context-aware Fusion
From different viewpoints, we can see different visible parts of an object. The reconstruction quality of visible parts is much higher than that of invisible parts. Inspired by this observation, we propose a context-aware fusion module to adaptively select a high-quality reconstruction for each part (e.g., table legs) from the different coarse 3D volumes. The selected reconstructions are fused to generate a 3D volume of the whole object (Figure 4).
As shown in Figure 5, given coarse 3D volumes and the corresponding contexts, the context-aware fusion module generates a score map for each coarse volume and then fuses them into one volume by a weighted summation of all coarse volumes according to their score maps. The spatial information of the voxels is preserved in the context-aware fusion module, and thus Pix2Vox can utilize multi-view information to better recover the structure of an object.
Specifically, the context-aware fusion module generates the context of the $r$-th coarse volume by concatenating the outputs of the last two layers in the decoder. Then, the context scoring network generates a score $m^{r}$ for the context of the $r$-th coarse volume. The context scoring network is composed of five sets of 3D convolutional layers, each of which has a kernel size of and a padding of , followed by a batch normalization and a leaky ReLU activation. The numbers of output channels of the convolutional layers are , , , , and , respectively. The learned score for each context is normalized across all learned scores; we choose softmax as the normalization function. Therefore, the score $s^{r}_{(i,j,k)}$ at position $(i, j, k)$ for the $r$-th voxel can be calculated as

$$ s^{r}_{(i,j,k)} = \frac{\exp\big(m^{r}_{(i,j,k)}\big)}{\sum_{p=1}^{n} \exp\big(m^{p}_{(i,j,k)}\big)} \qquad (1) $$
where $n$ represents the number of views. Finally, the fused voxel $v^{f}_{(i,j,k)}$ is produced by summing up the products of the coarse voxels $v^{c,r}_{(i,j,k)}$ and the corresponding scores:

$$ v^{f}_{(i,j,k)} = \sum_{r=1}^{n} s^{r}_{(i,j,k)} \, v^{c,r}_{(i,j,k)} \qquad (2) $$
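As a concrete illustration of Equations (1) and (2), a minimal NumPy sketch of the scoring-and-fusion step is given below; the array shapes and random inputs are illustrative assumptions, not the network's actual dimensions:

```python
import numpy as np

def context_aware_fusion(scores, coarse_volumes):
    """Fuse n coarse volumes using softmax-normalized score maps.

    scores, coarse_volumes: arrays of shape (n, D, H, W), where n is the
    number of views; Equation (1) normalizes scores across the view axis.
    """
    # Softmax over the view axis (Equation 1), stabilized by subtracting the max.
    m = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(m) / np.exp(m).sum(axis=0, keepdims=True)
    # Per-voxel weighted sum of the coarse volumes (Equation 2).
    return (weights * coarse_volumes).sum(axis=0), weights

# Toy example: two 2x2x2 coarse volumes with assumed random scores.
rng = np.random.default_rng(0)
coarse = rng.random((2, 2, 2, 2))
scores = rng.random((2, 2, 2, 2))
fused, w = context_aware_fusion(scores, coarse)
```

Because the weights sum to 1 at every voxel, the fused volume is a per-voxel convex combination of the coarse volumes, which is why low-scoring parts are suppressed rather than hard-masked.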
3.2.4 Refiner
Table 1: Single-view 3D reconstruction on the ShapeNet testing set, compared using IoU.

Category  3D-R2N2 [2]  3DensiNet [23]  OGN [19]  DRC [21]  PSGN [4]  Pix2Vox-F  Pix2Vox-A 

airplane  0.513    0.587  0.570  0.601  0.624  0.689 
bench  0.421    0.481  (0.453)  0.550  0.537  0.613 
cabinet  0.716  0.743  0.729  (0.635)  0.771  0.720  0.757 
car  0.798  0.818  0.828  0.760  0.831  0.798  0.806 
chair  0.466  0.451  0.483  0.470  0.544  0.570  0.599 
display  0.468  0.452  0.502  (0.419)  0.552  0.543  0.560 
lamp  0.381    0.398  (0.415)  0.462  0.463  0.492 
speaker  0.662    0.637  (0.609)  0.737  0.637  0.641 
rifle  0.544    0.593  (0.608)  0.604  0.627  0.638 
sofa  0.628  0.690  0.646  (0.606)  0.708  0.669  0.688 
table  0.513  0.549  0.536  (0.424)  0.606  0.592  0.613 
telephone  0.661  0.726  0.702  (0.413)  0.749  0.745  0.761 
watercraft  0.513  0.557  0.632  (0.556)  0.611  0.569  0.586 
Overall  0.560    0.596  (0.546)  0.640  0.633  0.658 
The refiner can be seen as a residual network that aims to correct wrongly recovered parts of a 3D volume. It follows the idea of a 3D encoder-decoder with U-Net connections [16]. With the help of the U-Net connections between the encoder and the decoder, the local structure in the fused volume can be preserved. Specifically, the encoder has three 3D convolutional layers, each of which has a bank of filters with a padding of , followed by a batch normalization layer, a leaky ReLU activation, and a max pooling layer with a kernel size of . The numbers of output channels of the convolutional layers are , , and , respectively. The encoder is finally followed by two fully connected layers with dimensions of and . The decoder consists of three transposed convolutional layers, each of which has a bank of filters with a padding of and a stride of . Except for the last transposed convolutional layer, which is followed by a sigmoid function, the other layers are followed by a batch normalization layer and a ReLU activation.
3.2.5 Loss Function
The loss function of the network is defined as the mean value of the voxel-wise binary cross entropies between the reconstructed object and the ground truth. More formally, it can be defined as

$$ \ell = -\frac{1}{N} \sum_{i=1}^{N} \Big[ gt_i \log p_i + (1 - gt_i) \log(1 - p_i) \Big] \qquad (3) $$

where $N$ denotes the number of voxels in the ground truth, and $p_i$ and $gt_i$ represent the predicted occupancy and the corresponding ground truth at the $i$-th voxel, respectively. The smaller the value of $\ell$ is, the closer the prediction is to the ground truth.
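A minimal NumPy sketch of this voxel-wise binary cross entropy follows; the epsilon clipping is an implementation detail added here for numerical safety, not something stated in the paper:

```python
import numpy as np

def voxel_bce(pred, gt, eps=1e-7):
    """Mean voxel-wise binary cross entropy (Equation 3).

    pred: predicted occupancy probabilities in [0, 1]; gt: binary ground truth.
    """
    p = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(gt * np.log(p) + (1.0 - gt) * np.log(1.0 - p))

# Toy flattened voxel grids: a close and a poor prediction.
gt = np.array([1.0, 0.0, 1.0, 0.0])
good = np.array([0.9, 0.1, 0.8, 0.2])
bad = np.array([0.4, 0.6, 0.5, 0.5])
```

As the text notes, the loss shrinks as the prediction approaches the ground truth, so `voxel_bce(good, gt)` is smaller than `voxel_bce(bad, gt)`.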
4 Experiments
4.1 Datasets and Metrics
Datasets We evaluate the proposed Pix2VoxF and Pix2VoxA on both synthetic images of objects from the ShapeNet [29] dataset and real images from the Pascal 3D+ [31] dataset. More specifically, we use a subset of ShapeNet consisting of 13 major categories and 44k 3D models following the settings of [2]. As for Pascal 3D+, there are 12 categories and 22k models.
To evaluate the quality of the output from the proposed methods, we binarize the probabilities at a fixed threshold of 0.4 and use intersection over union (IoU) as the similarity measure. More formally,
$$ \mathrm{IoU} = \frac{\sum_{(i,j,k)} \mathrm{I}\big(p_{(i,j,k)} > t\big)\, \mathrm{I}\big(gt_{(i,j,k)}\big)}{\sum_{(i,j,k)} \mathrm{I}\Big[\mathrm{I}\big(p_{(i,j,k)} > t\big) + \mathrm{I}\big(gt_{(i,j,k)}\big)\Big]} \qquad (4) $$

where $p_{(i,j,k)}$ and $gt_{(i,j,k)}$ represent the predicted occupancy probability and the ground truth at $(i, j, k)$, respectively. $\mathrm{I}(\cdot)$ is an indicator function and $t$ denotes the voxelization threshold. Higher IoU values indicate better reconstruction results.
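The IoU computation can be sketched as follows, assuming binarization at the fixed threshold of 0.4 stated above; the toy volumes are illustrative:

```python
import numpy as np

def voxel_iou(pred, gt, t=0.4):
    """Intersection over union at voxelization threshold t (Equation 4)."""
    p = pred > t   # I(p > t): binarized prediction
    g = gt > 0.5   # binary ground-truth occupancy
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union

# Toy flattened voxel grids.
gt = np.array([1, 1, 0, 0], dtype=float)
pred = np.array([0.9, 0.1, 0.2, 0.3])
# Here only the first voxel is correctly recovered: intersection 1, union 2.
```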
4.2 Implementation Details
We use RGB images as input to train the proposed methods with a batch size of . The output voxelized reconstruction is in size. We implement our network in PyTorch¹ and train both Pix2Vox-F and Pix2Vox-A using an Adam optimizer [9] with a β₁ of and a β₂ of . The initial learning rate is set to and decayed by a factor of 2 after 150 epochs. The optimization is set to stop after 250 epochs.

¹Source code will be publicly available.
4.3 Reconstruction of Synthetic Images
Table 2: Mean IoU of multi-view 3D reconstruction on ShapeNet with different numbers of views.

Methods  1 view  2 views  3 views  4 views  5 views  8 views  12 views  16 views  20 views 

3D-R2N2 [2]  0.560  0.603  0.617  0.625  0.634  (0.635)  (0.636)  (0.636)  (0.636) 
Pix2Vox-F  0.633  0.658  0.671  0.675  0.679  0.684  0.687  0.689  0.690 
Pix2Vox-F  0.633  0.665  0.677  0.681  0.685  0.688  0.690  0.692  0.693 
Pix2Vox-A  0.658  0.675  0.685  0.688  0.691  0.696  0.698  0.699  0.701 
Pix2Vox-A  0.658  0.682  0.692  0.695  0.697  0.701  0.702  0.704  0.705 
To evaluate the performance of the proposed methods on synthetic images, we compare our methods against several state-of-the-art methods on the ShapeNet testing set. Table 1 shows the performance of single-view reconstruction, while Table 2 shows the mean IoU scores of multi-view reconstruction with different numbers of views.

The single-view reconstruction results of Pix2Vox-F and Pix2Vox-A significantly outperform those of the other methods (Table 1). Pix2Vox-A increases the IoU over 3D-R2N2 by 18%. In multi-view reconstruction, Pix2Vox-A consistently outperforms 3D-R2N2 for all numbers of views (Table 2). The IoU of Pix2Vox-A is 13% higher than that of 3D-R2N2.
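The reported single-view improvement can be checked directly against the "Overall" row of Table 1:

```python
# Relative single-view IoU gain of Pix2Vox-A over 3D-R2N2 (Table 1, "Overall").
pix2vox_a, r2n2 = 0.658, 0.560
gain = (pix2vox_a - r2n2) / r2n2
print(f"{gain:.1%}")  # roughly 18%, consistent with the figure quoted in the text
```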
Figure 6 shows several reconstruction examples from the ShapeNet testing set. Both Pix2Vox-F and Pix2Vox-A are able to recover the thin parts of objects, such as lamps and table legs. Compared with Pix2Vox-F, we also observe that the higher-dimensional feature maps in Pix2Vox-A do contribute to better 3D reconstruction. Moreover, in multi-view reconstruction, both Pix2Vox-A and Pix2Vox-F produce better results than 3D-R2N2.
4.4 Reconstruction of Realworld Images
Table 3: Mean IoU of single-view 3D reconstruction on the Pascal 3D+ testing set.

Category  3D-R2N2 [2]  3DensiNet [23]  OGN [19]  DRC [21]  Pix2Vox-F  Pix2Vox-A 

airplane  0.544    (0.515)  0.550  0.668  0.690 
bicycle  0.499    (0.523)  (0.504)  0.621  0.643 
boat  0.560  0.326  (0.598)  (0.537)  0.787  0.800 
bottle  (0.331)    (0.466)  (0.349)  0.579  0.616 
bus  0.816    (0.515)  (0.541)  0.729  0.760 
car  0.699  0.607  (0.520)  0.720  0.654  0.657 
chair  0.280  0.259  (0.376)  0.340  0.526  0.593 
motor  0.649    (0.534)  (0.514)  0.763  0.780 
sofa  0.332  0.574  (0.461)  (0.406)  0.628  0.634 
table  (0.261)    (0.277)  (0.327)  0.433  0.481 
train  0.672    (0.508)  (0.503)  0.747  0.785 
TV  0.574  0.606  (0.589)  (0.535)  0.655  0.694 
Overall  (0.537)    (0.504)  (0.536)  0.645  0.669 
To evaluate the performance of the proposed methods on real-world images, we test our methods for single-view reconstruction on the Pascal 3D+ dataset. First, each image is cropped according to the bounding box of the largest object within the image. Then, the cropped images are rescaled to the input size of the reconstruction network.
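A minimal NumPy sketch of this crop-and-rescale preprocessing is shown below; the 224 x 224 target resolution and the nearest-neighbor resampling are illustrative assumptions, since the network's exact input size is elided in this copy:

```python
import numpy as np

def crop_and_resize(image, bbox, out_hw=(224, 224)):
    """Crop an H x W x 3 image to a bounding box, then resize it.

    bbox: (x_min, y_min, x_max, y_max) in pixel coordinates; nearest-neighbor
    resampling via integer index maps keeps the sketch dependency-free.
    """
    x0, y0, x1, y1 = bbox
    crop = image[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    out_h, out_w = out_hw
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return crop[rows][:, cols]

# Toy image with an assumed bounding box of the largest detected object.
img = np.zeros((480, 640, 3), dtype=np.uint8)
patch = crop_and_resize(img, (100, 50, 400, 350))
```

In practice a library resampler (e.g., bilinear interpolation) would likely be used instead; the point here is only the crop-then-rescale order described above.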
The mean IoU of each category is reported in Table 3. Both Pix2Vox-F and Pix2Vox-A significantly outperform the competing approaches on the Pascal 3D+ testing set. Compared with the other methods, our methods better reconstruct the overall shape and capture finer details from the input images. The qualitative analysis is given in Figure 7, which indicates that the proposed methods are more effective in handling real-world scenarios.
4.5 Reconstruction of Unseen Objects
In order to test how well our methods generalize to unseen objects, we conduct additional experiments on ShapeNet. More specifically, all models are trained on the 13 major categories of ShapeNet and tested on the remaining 44 categories. None of the pretrained models has “seen” either the objects in these categories or their labels before. The reconstruction results of 3D-R2N2 are obtained with the released pretrained model.
Several reconstruction results are presented in Figure 8. The reconstruction IoU of 3D-R2N2 on unseen objects is , while those of Pix2Vox-F and Pix2Vox-A are and , respectively. The experimental results demonstrate that 3D-R2N2 can hardly recover the shape of unseen objects. In contrast, Pix2Vox-F and Pix2Vox-A show satisfactory generalization to unseen objects.
4.6 Ablation Study
In this section, we validate the context-aware fusion module and the refiner through ablation studies.
Context-aware fusion To quantitatively evaluate the context-aware fusion, we replace the context-aware fusion in Pix2Vox-A with average fusion, where the fused voxel can be calculated as

$$ v^{f}_{(i,j,k)} = \frac{1}{n} \sum_{r=1}^{n} v^{c,r}_{(i,j,k)} \qquad (5) $$
Table 2 shows that the context-aware fusion performs better than average fusion at selecting high-quality reconstructions for each part from the different coarse volumes.
Refiner Pix2Vox-A uses a refiner to further refine the fused 3D volume. For single-view reconstruction on ShapeNet, the IoU of Pix2Vox-A is . In contrast, the IoU of Pix2Vox-A without the refiner decreases to . Removing the refiner thus causes considerable degradation in reconstruction accuracy. However, as the number of views increases, the effect of the refiner becomes weaker. The reconstruction results of the two networks (with/without the refiner) are almost the same when the number of input images is more than 3.
The ablation studies indicate that both the context-aware fusion module and the refiner play important roles in the performance improvements of our framework over previous state-of-the-art methods.
4.7 Space and Time Complexity
Table 4: Numbers of parameters and running times of different methods.

Methods  3D-R2N2  OGN  Pix2Vox-F  Pix2Vox-A 

#Parameters (M)  35.97  12.46  7.41  114.24 
Training (hours)  169  192  12  25 
Backward (ms)  312.50  312.25  12.93  72.01 
Forward, 1 view (ms)  73.35  37.90  9.25  9.90 
Forward, 2 views (ms)  108.11  N/A  12.05  13.69 
Forward, 4 views (ms)  112.36  N/A  23.26  26.31 
Forward, 8 views (ms)  117.64  N/A  52.63  55.56 
Table 4 and Figure 1 show the numbers of parameters of the different methods. Pix2Vox-F uses about 80% fewer parameters than 3D-R2N2 (7.41M vs. 35.97M).
The running times are obtained on the same PC with an NVIDIA GTX 1080 Ti GPU. For more precise timing, we exclude reading and writing time when evaluating the forward and backward inference times. Both Pix2Vox-F and Pix2Vox-A are about 8 times faster in forward inference than 3D-R2N2 in single-view reconstruction. In backward inference, Pix2Vox-F and Pix2Vox-A are about 24 and 4 times faster than 3D-R2N2, respectively.
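These factors follow directly from the parameter counts and timings in Table 4:

```python
# Ratios derived from Table 4 (parameters in millions, times in ms).
params = {"3D-R2N2": 35.97, "Pix2Vox-F": 7.41, "Pix2Vox-A": 114.24}
backward = {"3D-R2N2": 312.50, "Pix2Vox-F": 12.93, "Pix2Vox-A": 72.01}
forward_1v = {"3D-R2N2": 73.35, "Pix2Vox-F": 9.25, "Pix2Vox-A": 9.90}

param_reduction = 1 - params["Pix2Vox-F"] / params["3D-R2N2"]     # about 0.79
backward_speedup_f = backward["3D-R2N2"] / backward["Pix2Vox-F"]  # about 24x
backward_speedup_a = backward["3D-R2N2"] / backward["Pix2Vox-A"]  # about 4.3x
forward_speedup_f = forward_1v["3D-R2N2"] / forward_1v["Pix2Vox-F"]  # about 7.9x
```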
4.8 Discussion
To give a detailed analysis of the context-aware fusion module, we visualize the score maps of three coarse volumes when reconstructing the 3D shape of a table from 3-view images, as shown in Figure 4. The reconstruction of the table top in the rightmost coarse volume is clearly of low quality, and the score of the corresponding part is lower than those in the other two coarse volumes. The fused 3D volume is obtained by combining the selected high-quality reconstruction parts, so bad reconstructions can be effectively eliminated by our scoring scheme.
Although our methods outperform state-of-the-art methods, their reconstruction results are still of low resolution. We may further improve the reconstruction resolution in future work by introducing GANs [6].
5 Conclusion and Future Work
In this paper, we propose a unified framework for both single-view and multi-view 3D reconstruction, named Pix2Vox. Compared with existing methods that fuse deep features generated by a shared encoder, the proposed method fuses multiple coarse volumes produced by a decoder and better preserves multi-view spatial constraints. Quantitative and qualitative evaluations of both single-view and multi-view reconstruction on the ShapeNet and Pascal 3D+ benchmarks indicate that the proposed methods outperform state-of-the-art methods by a large margin. Pix2Vox is also computationally efficient: it is 24 times faster than 3D-R2N2 in terms of backward inference time. In future work, we will work on improving the resolution of the reconstructed 3D objects. In addition, we also plan to extend Pix2Vox to reconstruct 3D objects from RGB-D images.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Project No. 61772158, 61702136 and U1711265. We gratefully acknowledge Prof. Junbao Li and Huanyu Liu for providing GPU hours for this research. We would also like to thank Prof. Wangmeng Zuo for his valuable feedback and help during this research.
References
 [1] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. TPAMI, 37(8):1670–1687, 2015.
 [2] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3DR2N2: A unified approach for single and multiview 3D object reconstruction. In ECCV 2016.
 [3] E. Dibra, H. Jain, A. C. Öztireli, R. Ziegler, and M. H. Gross. Human shape from silhouettes using generative HKS descriptors and crossmodal neural networks. In CVPR 2017.
 [4] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR 2017.
 [5] J. FuentesPacheco, J. R. Ascencio, and J. M. RendónMancha. Visual simultaneous localization and mapping: a survey. Artif. Intell. Rev., 43(1):55–81, 2015.
 [6] I. J. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS 2014.
 [7] K. Hwang and W. Sung. Single stream parallelization of generalized LSTMlike RNNs on a GPU. In ICASSP 2015.
 [8] A. Kar, C. Häne, and J. Malik. Learning a multiview stereo machine. In NIPS 2017.
 [9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR 2015.
 [10] D. P. Kingma and M. Welling. Autoencoding variational bayes. arXiv, abs/1312.6114, 2013.
 [11] D. G. Lowe. Distinctive image features from scaleinvariant keypoints. IJCV, 60(2):91–110, 2004.
 [12] P. Mandikal, N. K. L., M. Agarwal, and V. B. Radhakrishnan. 3DLMNet: Latent embedding matching for accurate and diverse 3D point cloud reconstruction from a single image. In BMVC 2018.
 [13] O. Özyeşil, V. Voroninski, R. Basri, and A. Singer. A survey of structure from motion. Acta Numerica, 26:305–364, 2017.
 [14] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML 2013.
 [15] S. R. Richter and S. Roth. Discriminative shape from shading in uncalibrated illumination. In CVPR 2015.
 [16] O. Ronneberger, P. Fischer, and T. Brox. Unet: Convolutional networks for biomedical image segmentation. In MICCAI 2015.
 [17] S. Savarese, M. Andreetto, H. E. Rushmeier, F. Bernardini, and P. Perona. 3D reconstruction by shadow carving: Theory and practical evaluation. IJCV, 71(3):305–336, 2007.
 [18] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In ICLR 2015.
 [19] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for highresolution 3D outputs. In ICCV 2017.
 [20] S. Tulsiani. Learning Singleview 3D Reconstruction of Objects and Scenes. PhD thesis, UC Berkeley, 2018.
 [21] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multiview supervision for singleview reconstruction via differentiable ray consistency. In CVPR 2017.
 [22] O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. In ICLR 2016.
 [23] M. Wang, L. Wang, and Y. Fang. 3DensiNet: A robust neural network architecture towards 3D volumetric object prediction from 2D image. In ACM MM 2017.
 [24] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In ECCV 2018.

 [25] P. Wang, Y. Liu, Y. Guo, C. Sun, and X. Tong. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Trans. Graph., 36(4):72:1–72:11, 2017.
 [26] A. P. Witkin. Recovering surface shape and orientation from texture. Artif. Intell., 17(1-3):17–45, 1981.
 [27] J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In NIPS 2017.
 [28] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generativeadversarial modeling. In NIPS 2016.
 [29] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR 2015.
 [30] Y. Xia, Y. Zhang, D. Zhou, X. Huang, C. Wang, and R. Yang. RealPoint3D: Point cloud generation from a single image with complex background. arXiv, abs/1809.02743, 2018.
 [31] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In WACV 2014.
 [32] B. Yang, S. Rosa, A. Markham, N. Trigoni, and H. Wen. Dense 3D object reconstruction from a single depth view. TPAMI, DOI: 10.1109/TPAMI.2018.2868195, 2018.