Pix2Vox: Context-aware 3D Reconstruction from Single and Multi-view Images

01/31/2019 ∙ by Haozhe Xie, et al. ∙ SenseTime Corporation ∙ Harbin Institute of Technology

Recovering the 3D representation of an object from single-view or multi-view RGB images by deep neural networks has attracted increasing attention in the past few years. Several mainstream works (e.g., 3D-R2N2) use recurrent neural networks (RNNs) to fuse multiple feature maps extracted from input images sequentially. However, when given the same set of input images in different orders, RNN-based approaches are unable to produce consistent reconstruction results. Moreover, due to long-term memory loss, RNNs cannot fully exploit input images to refine reconstruction results. To solve these problems, we propose a novel framework for single-view and multi-view 3D reconstruction, named Pix2Vox. By using a well-designed encoder-decoder, it generates a coarse 3D volume from each input image. Then, a context-aware fusion module is introduced to adaptively select high-quality reconstructions for each part (e.g., table legs) from different coarse 3D volumes to obtain a fused 3D volume. Finally, a refiner further refines the fused 3D volume to generate the final output. Experimental results on the ShapeNet and Pascal 3D+ benchmarks indicate that the proposed Pix2Vox outperforms state-of-the-art methods by a large margin. Furthermore, the proposed method is 24 times faster than 3D-R2N2 in terms of backward inference time. Experiments on unseen 3D categories of ShapeNet show the superior generalization abilities of our method.

1 Introduction

Figure 1: Forward inference time, model size, and IoU of state-of-the-art methods and our methods for voxel-based 3D reconstruction on the ShapeNet testing set. The radius of each circle represents the size of the corresponding model. Pix2Vox outperforms state-of-the-art methods in forward inference time and reaches the best balance between accuracy and model size.
Figure 2: An overview of the proposed Pix2Vox. The network recovers the shape of 3D objects from arbitrary (uncalibrated) single or multiple images. The reconstruction results can be refined when more input images are available. Note that the weights of the encoder and decoder are shared among all views.

3D reconstruction is an important problem in robotics, CAD, virtual reality, and augmented reality. Traditional methods, such as Structure from Motion (SfM) [13] and Simultaneous Localization and Mapping (SLAM) [5], match image features across views. However, establishing feature correspondences becomes extremely difficult when multiple viewpoints are separated by a large margin, due to local appearance changes or self-occlusions [11]. To overcome these limitations, several deep-learning-based approaches, including 3D-R2N2 [2], LSM [8], and 3DensiNet [23], have been proposed to recover the 3D shape of an object and have obtained promising results.

To generate 3D volumes, 3D-R2N2 [2] and LSM [8] formulate multi-view 3D reconstruction as a sequence learning problem and use recurrent neural networks (RNNs) to fuse multiple feature maps extracted by a shared encoder from input images. The feature maps are incrementally refined when more views of an object are available. However, RNN-based methods suffer from three limitations. First, when given the same set of images in different orders, RNNs are unable to estimate the 3D shape of an object consistently due to permutation variance [22]. Second, due to the long-term memory loss of RNNs, the input images cannot be fully exploited to refine reconstruction results [14]. Last but not least, RNN-based methods are time-consuming since input images are processed sequentially without parallelization [7].

To address the issues mentioned above, we propose Pix2Vox, a novel framework for single-view and multi-view 3D reconstruction that contains four modules: encoder, decoder, context-aware fusion, and refiner. The encoder and decoder generate coarse 3D volumes from multiple input images in parallel, which eliminates the effect of the order of input images and accelerates the computation. Then, the context-aware fusion module selects high-quality reconstructions from all coarse 3D volumes and generates a fused 3D volume, which fully exploits the information of all input images without long-term memory loss. Finally, the refiner further corrects wrongly recovered parts of the fused 3D volume to obtain a refined reconstruction. To achieve a good balance between accuracy and model size, we implement two versions of the proposed framework: Pix2Vox-F and Pix2Vox-A (Figure 1).

The contributions can be summarized as follows:

  • We present a unified framework for both single-view and multi-view 3D reconstruction, namely Pix2Vox. We equip Pix2Vox with well-designed encoder, decoder, and refiner, which shows a powerful ability to handle 3D reconstruction in both synthetic and real-world images.

  • We propose a context-aware fusion module to adaptively select high-quality reconstructions for each part from different coarse 3D volumes in parallel to produce a fused reconstruction of the whole object. To the best of our knowledge, this is the first work to exploit context across multiple views for 3D reconstruction.

  • Experimental results on the ShapeNet [29] and Pascal 3D+ [31] datasets demonstrate that the proposed approaches outperform state-of-the-art methods in terms of both accuracy and efficiency. Additional experiments also show its strong generalization abilities in reconstructing unseen 3D objects.

2 Related Work

Figure 3: The network architecture of (top) Pix2Vox-F and (bottom) Pix2Vox-A. The EDLoss and the RLoss are defined in Equation 3. To reduce the model size, the refiner is removed in Pix2Vox-F.

Single-view 3D Reconstruction Theoretically, recovering 3D shape from single-view images is an ill-posed problem. To address this issue, many attempts have been made, such as ShapeFromX [1, 17], where X may represent silhouettes [3], shading [15], or texture [26]. However, these methods are barely applicable in real-world scenarios, because all of them require strong presumptions and abundant expertise in natural images [30].

With the success of generative adversarial networks (GANs) [6] and variational autoencoders (VAEs) [10], 3D-VAE-GAN [28] adopts GAN and VAE to generate 3D objects from a single-view image. However, 3D-VAE-GAN requires class labels for reconstruction. MarrNet [27] reconstructs 3D objects by estimating depth, surface normals, and silhouettes of 2D images, which is challenging and usually leads to severe distortion [20]. OGN [19] and O-CNN [25] use octrees to represent higher-resolution volumetric 3D objects with a limited memory budget; however, octree representations are complex and consume more computational resources. PSGN [4] and 3D-LMNet [12] generate point clouds from single-view images. However, points have a large degree of freedom in the point cloud representation because of the limited connections between points. Consequently, these methods cannot recover 3D volumes accurately [24].

Multi-view 3D Reconstruction SfM [13] and SLAM [5] methods are successful in handling many scenarios. These methods match features among images and estimate the camera pose for each image. However, the matching process becomes difficult when multiple viewpoints are separated by a large margin. Besides, scanning all surfaces of an object before reconstruction is sometimes impossible, which leads to incomplete 3D shapes with occluded or hollowed-out areas [32].

Powered by large-scale datasets of 3D CAD models (e.g., ShapeNet [29]), deep-learning-based methods have been proposed for 3D reconstruction. Both 3D-R2N2 [2] and LSM [8] use RNNs to infer 3D shape from single or multiple input images and achieve impressive results. However, RNNs are time-consuming and permutation-variant, which leads to inconsistent reconstruction results. 3DensiNet [23] uses max pooling to aggregate features from multiple images. However, max pooling only extracts maximum values from features, which may ignore other valuable features that are useful for 3D reconstruction.

3 The Method

3.1 Overview

The proposed Pix2Vox aims to reconstruct the 3D shape of an object from either single or multiple RGB images. The 3D shape of an object is represented by a 3D voxel grid, where 0 denotes an empty cell and 1 denotes an occupied cell. The key components of Pix2Vox are shown in Figure 2. First, the encoder produces feature maps from the input images. Second, the decoder takes each feature map as input and generates a coarse 3D volume correspondingly. Third, single or multiple 3D volumes are forwarded to the context-aware fusion module, which adaptively selects high-quality reconstructions for each part from the coarse 3D volumes to obtain a fused 3D volume. Finally, the refiner with skip-connections further refines the fused 3D volume to generate the final reconstruction result.
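
The pipeline above can be summarized with a minimal PyTorch sketch (not the authors' released code). The module classes Encoder, Decoder, ContextAwareFusion, and Refiner are hypothetical names whose internals are sketched in the following subsections; because every view is processed independently and fused afterwards, the result does not depend on the order of the input images.

```python
# Minimal sketch of the Pix2Vox pipeline (not the authors' released code).
# Encoder, Decoder, ContextAwareFusion, and Refiner are hypothetical module
# names; their internals are sketched in the following subsections.
import torch.nn as nn

class Pix2VoxSketch(nn.Module):
    def __init__(self, encoder, decoder, fusion, refiner=None):
        super().__init__()
        self.encoder = encoder    # shared among all views
        self.decoder = decoder    # shared among all views
        self.fusion = fusion      # context-aware fusion module
        self.refiner = refiner    # optional (absent in Pix2Vox-F)

    def forward(self, images):
        # images: (batch, n_views, 3, H, W); each view is processed
        # independently, so the result is invariant to the view order.
        b, n, c, h, w = images.shape
        feats = self.encoder(images.view(b * n, c, h, w))
        coarse, context = self.decoder(feats)          # per-view coarse volumes
        coarse = coarse.view(b, n, *coarse.shape[1:])
        context = context.view(b, n, *context.shape[1:])
        fused = self.fusion(coarse, context)           # (batch, D, D, D)
        return self.refiner(fused) if self.refiner else fused
```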

3.2 Network Architecture

Figure 3 shows the detailed architectures of Pix2Vox-F and Pix2Vox-A. The former has far fewer parameters and lower computational complexity, while the latter has more parameters and can construct more accurate 3D shapes at a higher computational cost.

3.2.1 Encoder

The encoder computes a set of features for the decoder to recover the 3D shape of the object. The first nine convolutional layers of a pre-trained VGG-16 [18], along with the corresponding batch normalization layers and ReLU activations, are used to extract a feature tensor from the input image. This feature extraction is followed by three sets of 2D convolutional layers, batch normalization layers, and ELU layers to embed semantic information into feature vectors. In Pix2Vox-F, the kernel size of the first of these convolutional layers differs from that of the other two, and the number of output channels is halved at each subsequent layer. Pix2Vox-A uses different kernel sizes and numbers of output channels for the three convolutional layers (see Figure 3). After the second convolutional layer, there is a max pooling layer, whose kernel size also differs between Pix2Vox-F and Pix2Vox-A. As a result, Pix2Vox-F and Pix2Vox-A produce feature vectors of different sizes.
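
A rough sketch of such an encoder is given below, assuming torchvision's batch-normalized VGG-16 backbone; the kernel sizes and channel counts of the three additional convolutional blocks are placeholders rather than the paper's exact values (see Figure 3 for the actual architecture).

```python
# A rough sketch of the encoder (placeholder layer sizes, not the paper's
# exact values), assuming torchvision's batch-normalized VGG-16 backbone.
import torch.nn as nn
import torchvision

class EncoderSketch(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16_bn(weights="IMAGENET1K_V1")
        # First nine convolutional layers of VGG-16 (with their batch
        # normalization and ReLU layers); the weights are kept pre-trained.
        self.backbone = nn.Sequential(*list(vgg.features.children())[:30])
        # Three conv + batch-norm + ELU blocks that embed semantic
        # information (kernel sizes and channel counts are placeholders).
        self.head = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=3), nn.BatchNorm2d(512), nn.ELU(),
            nn.Conv2d(512, 256, kernel_size=3), nn.BatchNorm2d(256), nn.ELU(),
            nn.MaxPool2d(kernel_size=3),
            nn.Conv2d(256, 128, kernel_size=3), nn.BatchNorm2d(128), nn.ELU(),
        )

    def forward(self, x):            # x: (batch, 3, 224, 224) RGB images
        return self.head(self.backbone(x))
```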

3.2.2 Decoder

The decoder is responsible for transforming the information in the 2D feature maps into 3D volumes. There are five 3D transposed convolutional layers in both Pix2Vox-F and Pix2Vox-A. Specifically, the first four transposed convolutional layers share the same kernel size, stride, and padding, and they are followed by an additional transposed convolutional layer. Each transposed convolutional layer is followed by a batch normalization layer and a ReLU activation, except for the last layer, which is followed by a sigmoid function. The numbers of output channels of the transposed convolutional layers differ between Pix2Vox-F and Pix2Vox-A (see Figure 3). The decoder outputs a voxelized shape in the object's canonical view.
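
The decoder can be sketched as follows, under the assumption of a flat feature vector reshaped into a 2×2×2 seed grid, kernel size 4 with stride 2 and padding 1 for the upsampling layers, and a 32³ output; all of these sizes are illustrative, not quoted from the paper.

```python
# A hedged sketch of the decoder: five 3D transposed convolutions, the
# first four upsampling a small seed grid, the last mapping to a single
# occupancy channel with a sigmoid. All sizes below are assumptions.
import torch.nn as nn

class DecoderSketch(nn.Module):
    def __init__(self, seed_channels=256):
        super().__init__()
        def up(cin, cout):
            return nn.Sequential(
                nn.ConvTranspose3d(cin, cout, kernel_size=4, stride=2,
                                   padding=1, bias=False),
                nn.BatchNorm3d(cout), nn.ReLU(),
            )
        self.seed_channels = seed_channels
        self.up1 = up(seed_channels, 128)
        self.up2 = up(128, 64)
        self.up3 = up(64, 32)
        self.up4 = up(32, 8)
        self.last = nn.Sequential(
            nn.ConvTranspose3d(8, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, feat):
        # feat: (batch, seed_channels * 8); reshape to a 2x2x2 seed grid.
        x = feat.view(feat.size(0), self.seed_channels, 2, 2, 2)
        x = self.up3(self.up2(self.up1(x)))
        penult = self.up4(x)                      # reused as "context" later
        volume = self.last(penult).squeeze(1)     # (batch, 32, 32, 32)
        return volume, penult
```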

3.2.3 Context-aware Fusion

Figure 4: Visualization of the score maps in the context-aware fusion module. The context-aware fusion module generates higher scores for high-quality reconstructions, which can eliminate the effect of the missing or wrongly recovered parts.
Figure 5: An overview of the context-aware fusion module. It aims to select high-quality reconstructions for each part to construct the final results. The objects in the bounding box illustrate the score calculation procedure for a single coarse volume. The other scores are calculated in the same way. Note that the weights of the context scoring network are shared among different views.

From different viewpoints, we can see different visible parts of an object. The reconstruction quality of visible parts is much higher than that of invisible parts. Inspired by this observation, we propose a context-aware fusion module to adaptively select high-quality reconstructions for each part (e.g., table legs) from different coarse 3D volumes. The selected reconstructions are fused to generate a 3D volume of the whole object (Figure 4).

As shown in Figure 5, given coarse 3D volumes and the corresponding context, the context-aware fusion module generates a score map for each coarse volume and then fuses them into one volume by the weighted summation of all coarse volumes according to their score maps. The spatial information of voxels is preserved in the context-aware fusion module, and thus Pix2Vox can utilize multi-view information to recover the structure of an object better.

Specifically, the context-aware fusion module generates the context $c^r$ of the $r$-th coarse volume $v^r$ by concatenating the output of the last two layers in the decoder. Then, the context scoring network generates a score map $m^r$ for the context of the $r$-th coarse volume. The context scoring network is composed of five sets of 3D convolutional layers, each followed by a batch normalization and a leaky ReLU activation. The learned scores are then normalized across all views; we choose softmax as the normalization function. Therefore, the score at position $(i, j, k)$ for the $r$-th coarse volume can be calculated as

$$s^{r}_{(i,j,k)} = \frac{\exp\left(m^{r}_{(i,j,k)}\right)}{\sum_{p=1}^{n} \exp\left(m^{p}_{(i,j,k)}\right)} \tag{1}$$

where $n$ represents the number of views. Finally, the fused volume $u$ is produced by summing up the products of the coarse volumes and the corresponding scores:

$$u_{(i,j,k)} = \sum_{r=1}^{n} s^{r}_{(i,j,k)} \, v^{r}_{(i,j,k)} \tag{2}$$
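
The following sketch implements Equations (1) and (2) directly: a shared 3D CNN scores each coarse volume from its context, the scores are softmax-normalized across views, and the fused volume is the score-weighted sum of the coarse volumes. Layer widths are placeholders.

```python
# A sketch of the context-aware fusion, following Eq. (1) and Eq. (2):
# a shared 3D CNN scores each coarse volume from its context, the scores
# are softmax-normalized across views, and the fused volume is the
# score-weighted sum of the coarse volumes. Layer widths are placeholders.
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareFusionSketch(nn.Module):
    def __init__(self, context_channels=9):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=3, padding=1),
                nn.BatchNorm3d(cout), nn.LeakyReLU(0.2))
        # Context scoring network (weights shared among views).
        self.score_net = nn.Sequential(
            block(context_channels, 16), block(16, 8), block(8, 4),
            block(4, 2), nn.Conv3d(2, 1, kernel_size=3, padding=1))

    def forward(self, coarse, context):
        # coarse:  (batch, n_views, D, D, D) coarse volumes v^r
        # context: (batch, n_views, C, D, D, D), e.g. the concatenated
        #          outputs of the last two decoder layers
        b, n = coarse.shape[:2]
        m = self.score_net(context.flatten(0, 1))   # raw score maps m^r
        m = m.view(b, n, *coarse.shape[2:])
        s = F.softmax(m, dim=1)                     # Eq. (1): normalize over views
        return (s * coarse).sum(dim=1)              # Eq. (2): weighted sum
```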

3.2.4 Refiner

Category 3D-R2N2 [2] 3DensiNet [23] OGN [19] DRC [21] PSGN [4] Pix2Vox-F Pix2Vox-A
airplane 0.513 - 0.587 0.570 0.601 0.624 0.689
bench 0.421 - 0.481 (0.453) 0.550 0.537 0.613
cabinet 0.716 0.743 0.729 (0.635) 0.771 0.720 0.757
car 0.798 0.818 0.828 0.760 0.831 0.798 0.806
chair 0.466 0.451 0.483 0.470 0.544 0.570 0.599
display 0.468 0.452 0.502 (0.419) 0.552 0.543 0.560
lamp 0.381 - 0.398 (0.415) 0.462 0.463 0.492
speaker 0.662 - 0.637 (0.609) 0.737 0.637 0.641
rifle 0.544 - 0.593 (0.608) 0.604 0.627 0.638
sofa 0.628 0.690 0.646 (0.606) 0.708 0.669 0.688
table 0.513 0.549 0.536 (0.424) 0.606 0.592 0.613
telephone 0.661 0.726 0.702 (0.413) 0.749 0.745 0.761
watercraft 0.513 0.557 0.632 (0.556) 0.611 0.569 0.586
Overall 0.560 - 0.596 (0.546) 0.640 0.633 0.658
Table 1: Single-view reconstruction on ShapeNet compared using Intersection-over-Union (IoU). The best number for each category is highlighted in bold. The numbers in the parenthesis are results trained and tested with the released code. Note that DRC [21] is trained/tested per category and PSGN [4] takes object masks as an additional input.

The refiner can be seen as a residual network that aims to correct the wrongly recovered parts of a 3D volume. It follows the idea of a 3D encoder-decoder with U-Net connections [16]. With the help of the U-Net connections between the encoder and the decoder, the local structure of the fused volume can be preserved. Specifically, the encoder has three 3D convolutional layers, each followed by a batch normalization layer, a leaky ReLU activation, and a max pooling layer, and it ends with two fully connected layers. The decoder consists of three transposed convolutional layers. Except for the last transposed convolutional layer, which is followed by a sigmoid function, each layer is followed by a batch normalization layer and a ReLU activation.
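
A sketch of such a refiner is given below, with additive (rather than concatenation-based) skip connections for brevity; the channel counts and the fully connected bottleneck width are assumptions.

```python
# A sketch of the refiner: a 3D encoder-decoder with U-Net-style skip
# connections (additive here for brevity) that maps the fused volume to
# a refined volume of the same size. Channel counts and the fully
# connected bottleneck width are assumptions.
import torch.nn as nn

class RefinerSketch(nn.Module):
    def __init__(self):
        super().__init__()
        def down(cin, cout):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=4, padding=2),
                nn.BatchNorm3d(cout), nn.LeakyReLU(0.2),
                nn.MaxPool3d(kernel_size=2))
        def up(cin, cout, tail):
            return nn.Sequential(
                nn.ConvTranspose3d(cin, cout, kernel_size=4, stride=2,
                                   padding=1, bias=False), tail)
        self.down1, self.down2, self.down3 = down(1, 32), down(32, 64), down(64, 128)
        self.bottleneck = nn.Sequential(
            nn.Linear(128 * 4 * 4 * 4, 2048), nn.ReLU(),
            nn.Linear(2048, 128 * 4 * 4 * 4), nn.ReLU())
        self.up3 = up(128, 64, nn.Sequential(nn.BatchNorm3d(64), nn.ReLU()))
        self.up2 = up(64, 32, nn.Sequential(nn.BatchNorm3d(32), nn.ReLU()))
        self.up1 = up(32, 1, nn.Sigmoid())

    def forward(self, volume):                 # volume: (batch, 32, 32, 32)
        x0 = volume.unsqueeze(1)
        x1 = self.down1(x0)                    # (batch, 32, 16, 16, 16)
        x2 = self.down2(x1)                    # (batch, 64, 8, 8, 8)
        x3 = self.down3(x2)                    # (batch, 128, 4, 4, 4)
        b = self.bottleneck(x3.flatten(1)).view_as(x3)
        y3 = self.up3(b + x3)                  # skip connections preserve the
        y2 = self.up2(y3 + x2)                 # local structure of the fused
        y1 = self.up1(y2 + x1)                 # volume during refinement
        return y1.squeeze(1)                   # (batch, 32, 32, 32)
```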

3.2.5 Loss Function

The loss function of the network is defined as the mean of the voxel-wise binary cross-entropies between the reconstructed object and the ground truth. More formally,

$$\ell = \frac{1}{N} \sum_{i=1}^{N} \left[ -gt_i \log(p_i) - (1 - gt_i) \log(1 - p_i) \right] \tag{3}$$

where $N$ denotes the number of voxels in the ground truth, and $p_i$ and $gt_i$ represent the predicted occupancy and the corresponding ground truth at voxel $i$, respectively. The smaller the value of $\ell$ is, the closer the prediction is to the ground truth.
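
Equation (3) translates directly into a few lines of PyTorch:

```python
# Eq. (3), the mean voxel-wise binary cross-entropy, in PyTorch. The same
# value can be obtained with torch.nn.functional.binary_cross_entropy.
import torch

def reconstruction_loss(pred, gt, eps=1e-7):
    # pred: predicted occupancy probabilities, (batch, D, D, D), in (0, 1)
    # gt:   binary ground-truth volumes, same shape
    pred = pred.clamp(eps, 1.0 - eps)
    loss = -(gt * pred.log() + (1.0 - gt) * (1.0 - pred).log())
    return loss.mean()
```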

4 Experiments

4.1 Datasets and Metrics

Datasets We evaluate the proposed Pix2Vox-F and Pix2Vox-A on both synthetic images of objects from the ShapeNet [29] dataset and real images from the Pascal 3D+ [31] dataset. More specifically, we use a subset of ShapeNet consisting of 13 major categories and 44k 3D models following the settings of [2]. As for Pascal 3D+, there are 12 categories and 22k models.

Evaluation Metrics

To evaluate the quality of the output of the proposed methods, we binarize the predicted probabilities at a fixed threshold of 0.4 and use intersection over union (IoU) as the similarity measure. More formally,

$$\mathrm{IoU} = \frac{\sum_{i,j,k} \mathrm{I}\!\left(p_{(i,j,k)} > t\right) \mathrm{I}\!\left(gt_{(i,j,k)}\right)}{\sum_{i,j,k} \mathrm{I}\!\left[\mathrm{I}\!\left(p_{(i,j,k)} > t\right) + \mathrm{I}\!\left(gt_{(i,j,k)}\right)\right]} \tag{4}$$

where $p_{(i,j,k)}$ and $gt_{(i,j,k)}$ represent the predicted occupancy probability and the ground truth at $(i, j, k)$, respectively, $\mathrm{I}(\cdot)$ is an indicator function, and $t$ denotes the voxelization threshold. Higher IoU values indicate better reconstruction results.
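
Equation (4) can be computed as follows, using the threshold t = 0.4 mentioned above:

```python
# Eq. (4): binarize the predicted occupancy probabilities at threshold t
# (0.4 in the experiments above) and compute the IoU of the two voxel sets.
import torch

def voxel_iou(pred, gt, t=0.4):
    # pred: predicted occupancy probabilities, gt: binary ground truth
    occupied = pred > t
    target = gt > 0.5
    intersection = (occupied & target).float().sum()
    union = (occupied | target).float().sum()
    return (intersection / union.clamp(min=1.0)).item()
```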

4.2 Implementation Details

We train the proposed methods with RGB images as input and voxelized reconstructions as output. We implement our network in PyTorch (the source code will be publicly available) and train both Pix2Vox-F and Pix2Vox-A using the Adam optimizer [9]. The initial learning rate is decayed by a factor of 2 after 150 epochs, and the optimization stops after 250 epochs.
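
A hedged training-loop sketch following the schedule described above (learning rate halved after 150 epochs, 250 epochs in total) is shown below; the batch layout, initial learning rate, and Adam betas are placeholders, not values taken from the paper.

```python
# Training-loop sketch matching the schedule described above (learning
# rate halved after 150 epochs, 250 epochs in total). The batch layout,
# initial learning rate, and Adam betas are placeholders, not values
# taken from the paper.
import torch
import torch.nn.functional as F

def train(model, data_loader, epochs=250, lr=1e-3, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[150], gamma=0.5)      # decay by a factor of 2
    for epoch in range(epochs):
        for images, gt_volumes in data_loader:
            images, gt_volumes = images.to(device), gt_volumes.to(device)
            pred = model(images)
            loss = F.binary_cross_entropy(pred, gt_volumes)   # Eq. (3)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```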

4.3 Reconstruction of Synthetic Images

Methods 1 view 2 views 3 views 4 views 5 views 8 views 12 views 16 views 20 views
3D-R2N2 [2] 0.560 0.603 0.617 0.625 0.634 (0.635) (0.636) (0.636) (0.636)
Pix2Vox-F† 0.633 0.658 0.671 0.675 0.679 0.684 0.687 0.689 0.690
Pix2Vox-F 0.633 0.665 0.677 0.681 0.685 0.688 0.690 0.692 0.693
Pix2Vox-A† 0.658 0.675 0.685 0.688 0.691 0.696 0.698 0.699 0.701
Pix2Vox-A 0.658 0.682 0.692 0.695 0.697 0.701 0.702 0.704 0.705
Table 2: Multi-view reconstruction on ShapeNet compared using Intersection-over-Union (IoU). The best results for different numbers of views are highlighted in bold. The numbers in the parenthesis are results tested with the released model. The marker † indicates that the context-aware fusion is replaced with the average fusion.
Figure 6: Single-view (left) and multi-view (right) reconstructions on the ShapeNet testing set. GT represents the ground truth of the 3D object. Note that DRC [21] is trained/tested per category.

To evaluate the performance of the proposed methods in handling synthetic images, we compare our methods against several state-of-the-art methods on the ShapeNet testing set. Table 1 shows the performance of single-view reconstruction, while Table 2 shows the mean IoU scores of multi-view reconstruction with different numbers of views.

The single-view reconstruction results of Pix2Vox-F and Pix2Vox-A significantly outperform other methods (Table 1). Pix2Vox-A increases IoU over 3D-R2N2 by 18%. In multi-view reconstruction, Pix2Vox-A consistently outperforms 3D-R2N2 in all numbers of views (Table 2). The IoU of Pix2Vox-A is 13% higher than that of 3D-R2N2.

Figure 6 shows several reconstruction examples from the ShapeNet testing set. Both Pix2Vox-F and Pix2Vox-A are able to recover the thin parts of objects, such as lamps and table legs. Compared with Pix2Vox-F, we also observe that the higher-dimensional feature maps in Pix2Vox-A do contribute to 3D reconstruction. Moreover, in multi-view reconstruction, both Pix2Vox-A and Pix2Vox-F produce better results than 3D-R2N2.

4.4 Reconstruction of Real-world Images

Category 3D-R2N2 [2] 3DensiNet [23] OGN [19] DRC [21] Pix2Vox-F Pix2Vox-A
airplane 0.544 - (0.515) 0.550 0.668 0.690
bicycle 0.499 - (0.523) (0.504) 0.621 0.643
boat 0.560 0.326 (0.598) (0.537) 0.787 0.800
bottle (0.331) - (0.466) (0.349) 0.579 0.616
bus 0.816 - (0.515) (0.541) 0.729 0.760
car 0.699 0.607 (0.520) 0.720 0.654 0.657
chair 0.280 0.259 (0.376) 0.340 0.526 0.593
motor 0.649 - (0.534) (0.514) 0.763 0.780
sofa 0.332 0.574 (0.461) (0.406) 0.628 0.634
table (0.261) - (0.277) (0.327) 0.433 0.481
train 0.672 - (0.508) (0.503) 0.747 0.785
TV 0.574 0.606 (0.589) (0.535) 0.655 0.694
Overall (0.537) - (0.504) (0.536) 0.645 0.669
Table 3: Single-view reconstruction on Pascal 3D+ compared using Intersection-over-Union (IoU). The best number for each category is highlighted in bold. The numbers in the parenthesis are results trained and tested with the released code. Note that DRC is trained/tested per category.
Figure 7: Reconstructions on the Pascal 3D+ testing set from single-view images. GT represents the ground truth of the 3D object. Note that DRC [21] is trained/tested per category.

To evaluate the performance of the proposed methods on real-world images, we test our methods for single-view reconstruction on the Pascal 3D+ dataset. First, the images are cropped according to the bounding box of the largest object within each image. Then, these cropped images are rescaled to the input size of the reconstruction network.

The mean IoU of each category is reported in Table 3. Both Pix2Vox-F and Pix2Vox-A significantly outperform the competing approaches on the Pascal 3D+ testing set. Compared with other methods, our methods are able to better reconstruct the overall shape and capture finer details from the input images. The qualitative analysis is given in Figure 7, which indicates that the proposed methods are more effective in handling real-world scenarios.

4.5 Reconstruction of Unseen Objects

Figure 8: Reconstruction on unseen objects of ShapeNet from 5-view images. GT represents the ground truth of the 3D object.

In order to test how well our methods can generalize to unseen objects, we conduct additional experiments on ShapeNet. More specifically, all models are trained on the 13 major categories of ShapeNet and tested on the remaining 44 categories of ShapeNet. All pretrained models have never “seen” either the objects in these categories or the labels of objects before. The reconstruction results of 3D-R2N2 are obtained with the released pretrained model.

Several reconstruction results are presented in Figure 8. The reconstruction IoU of 3D-R2N2 on unseen objects is considerably lower than those of Pix2Vox-F and Pix2Vox-A. Experimental results demonstrate that 3D-R2N2 can hardly recover the shape of unseen objects. In contrast, Pix2Vox-F and Pix2Vox-A show satisfactory generalization abilities to unseen objects.

4.6 Ablation Study

In this section, we validate the context-aware fusion and the refiner by ablation studies.

Context-aware fusion To quantitatively evaluate the context-aware fusion, we replace the context-aware fusion in Pix2Vox-A with average fusion, where the fused volume is calculated as

$$u_{(i,j,k)} = \frac{1}{n} \sum_{r=1}^{n} v^{r}_{(i,j,k)} \tag{5}$$

Table 2 shows that the context-aware fusion performs better than the average fusion in selecting the high-quality reconstructions for each part from different coarse volumes.
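
For reference, the average-fusion baseline of Equation (5) is a one-line drop-in replacement for the context-aware fusion sketch of Section 3.2.3:

```python
# The average-fusion baseline of Eq. (5): the context is ignored and the
# coarse volumes are simply averaged over views. It can replace the
# context-aware fusion sketch from Section 3.2.3.
def average_fusion(coarse, context=None):
    # coarse: (batch, n_views, D, D, D)
    return coarse.mean(dim=1)
```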

Refiner Pix2Vox-A uses a refiner to further refine the fused 3D volume. For single-view reconstruction on ShapeNet, the IoU of Pix2Vox-A is 0.658 (Table 1), while the IoU of Pix2Vox-A without the refiner is noticeably lower. Removing the refiner therefore causes considerable degradation of the reconstruction accuracy. However, as the number of views increases, the effect of the refiner becomes weaker. The reconstruction results of the two networks (with/without the refiner) are almost the same when the number of input images is more than 3.

The ablation studies indicate that both the context-aware fusion and the refiner play important roles in the performance improvements of our framework over previous state-of-the-art methods.

4.7 Space and Time Complexity

Methods 3D-R2N2 OGN Pix2Vox-F Pix2Vox-A
#Parameters (M) 35.97 12.46 7.41 114.24
Training (hours) 169 192 12 25
Backward (ms) 312.50 312.25 12.93 72.01
Forward, 1-view (ms) 73.35 37.90 9.25 9.90
Forward, 2-views (ms) 108.11 N/A 12.05 13.69
Forward, 4-views (ms) 112.36 N/A 23.26 26.31
Forward, 8-views (ms) 117.64 N/A 52.63 55.56
Table 4: Memory usage and running time on ShapeNet dataset. Note that backward time is measured in single-view reconstruction with a batch size of 1.

Table 4 and Figure 1 show the numbers of parameters of different methods. Pix2Vox-F has about 80% fewer parameters than 3D-R2N2 (7.41M vs. 35.97M).

The running times are measured on the same PC with an NVIDIA GTX 1080 Ti GPU. For more precise timing, we exclude reading and writing time when evaluating the forward and backward inference time. According to Table 4, both Pix2Vox-F and Pix2Vox-A are roughly 8 times faster in forward inference than 3D-R2N2 in single-view reconstruction. In backward inference, Pix2Vox-F and Pix2Vox-A are about 24 and 4 times faster than 3D-R2N2, respectively.

4.8 Discussion

To give a detailed analysis of the context-aware fusion module, we visualize the score maps of three coarse volumes when reconstructing the 3D shape of a table from 3-view images, as shown in Figure 4. The reconstruction of the table top in the rightmost coarse volume is clearly of low quality, and the score of the corresponding part is lower than those in the other two coarse volumes. The fused 3D volume is obtained by combining the selected high-quality reconstruction parts, so that bad reconstructions can be eliminated effectively by our scoring scheme.

Although our methods outperform state-of-the-art methods, their reconstruction results are still of low resolution. We plan to further improve the reconstruction resolution in future work by introducing GANs [6].

5 Conclusion and Future Works

In this paper, we propose a unified framework for both single-view and multi-view 3D reconstruction, named Pix2Vox. Compared with existing methods that fuse deep features generated by a shared encoder, the proposed method fuses multiple coarse volumes produced by a decoder and better preserves multi-view spatial constraints. Quantitative and qualitative evaluations for both single-view and multi-view reconstruction on the ShapeNet and Pascal 3D+ benchmarks indicate that the proposed methods outperform state-of-the-art methods by a large margin. Pix2Vox is also computationally efficient: it is 24 times faster than 3D-R2N2 in terms of backward inference time. In future work, we will work on improving the resolution of the reconstructed 3D objects. In addition, we also plan to extend Pix2Vox to reconstruct 3D objects from RGB-D images.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Project No. 61772158, 61702136 and U1711265. We gratefully acknowledge Prof. Junbao Li and Huanyu Liu for providing GPU hours for this research. We would also like to thank Prof. Wangmeng Zuo for his valuable feedback and help during this research.

References

  • [1] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. TPAMI, 37(8):1670–1687, 2015.
  • [2] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV 2016.
  • [3] E. Dibra, H. Jain, A. C. Öztireli, R. Ziegler, and M. H. Gross. Human shape from silhouettes using generative HKS descriptors and cross-modal neural networks. In CVPR 2017.
  • [4] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR 2017.
  • [5] J. Fuentes-Pacheco, J. R. Ascencio, and J. M. Rendón-Mancha. Visual simultaneous localization and mapping: a survey. Artif. Intell. Rev., 43(1):55–81, 2015.
  • [6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS 2014.
  • [7] K. Hwang and W. Sung. Single stream parallelization of generalized LSTM-like RNNs on a GPU. In ICASSP 2015.
  • [8] A. Kar, C. Häne, and J. Malik. Learning a multi-view stereo machine. In NIPS 2017.
  • [9] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR 2015.
  • [10] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv, abs/1312.6114, 2013.
  • [11] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
  • [12] P. Mandikal, N. K. L., M. Agarwal, and V. B. Radhakrishnan. 3D-LMNet: Latent embedding matching for accurate and diverse 3D point cloud reconstruction from a single image. In BMVC 2018.
  • [13] O. Özyeşil, V. Voroninski, R. Basri, and A. Singer. A survey of structure from motion. Acta Numerica, 26:305–364.
  • [14] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML 2013.
  • [15] S. R. Richter and S. Roth. Discriminative shape from shading in uncalibrated illumination. In CVPR 2015.
  • [16] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI 2015.
  • [17] S. Savarese, M. Andreetto, H. E. Rushmeier, F. Bernardini, and P. Perona. 3D reconstruction by shadow carving: Theory and practical evaluation. IJCV, 71(3):305–336, 2007.
  • [18] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR 2015.
  • [19] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In ICCV 2017.
  • [20] S. Tulsiani. Learning Single-view 3D Reconstruction of Objects and Scenes. PhD thesis, UC Berkeley, 2018.
  • [21] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR 2017.
  • [22] O. Vinyals, S. Bengio, and M. Kudlur. Order matters: Sequence to sequence for sets. In ICLR 2016.
  • [23] M. Wang, L. Wang, and Y. Fang. 3DensiNet: A robust neural network architecture towards 3D volumetric object prediction from 2D image. In ACM MM 2017.
  • [24] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In ECCV 2018.
  • [25] P. Wang, Y. Liu, Y. Guo, C. Sun, and X. Tong. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Trans. Graph., 36(4):72:1–72:11, 2017.
  • [26] A. P. Witkin. Recovering surface shape and orientation from texture. Artif. Intell., 17(1-3):17–45, 1981.
  • [27] J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In NIPS 2017.
  • [28] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS 2016.
  • [29] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In CVPR 2015.
  • [30] Y. Xia, Y. Zhang, D. Zhou, X. Huang, C. Wang, and R. Yang. RealPoint3D: Point cloud generation from a single image with complex background. arXiv, abs/1809.02743, 2018.
  • [31] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In WACV 2014.
  • [32] B. Yang, S. Rosa, A. Markham, N. Trigoni, and H. Wen. Dense 3D object reconstruction from a single depth view. TPAMI, DOI: 10.1109/TPAMI.2018.2868195, 2018.