
3D Object Super-Resolution

by Edward Smith, et al.
McGill University

We consider the problem of scaling deep generative shape models to high-resolution. To this end, we introduce a novel method for the fast up-sampling of 3D objects in voxel space by super-resolution on the six orthographic depth projections. We demonstrate the training of object-specific super-resolution CNNs for depth maps and silhouettes. This allows us to efficiently generate high-resolution objects, without the cubic computational costs associated with voxel data. We evaluate our work on multiple experiments concerning high-resolution 3D objects, and show our system is capable of accurately increasing the resolution of voxelized objects by a factor of up to 16, to produce objects at resolutions as large as 512×512×512 from 32×32×32 resolution inputs. Additionally, we demonstrate our method can be easily applied in conjunction with the reconstruction of high-resolution objects from RGB images to achieve quantitative and qualitative state-of-the-art performance for this task.






1 Introduction

The 3D shape of an object is a combination of countless physical elements that range in scale from gross structure and topology to minute textures endowed by the material of each surface. Intelligent systems require representations capable of modeling this complex shape efficiently, in order to perceive and interact with the physical world in detail (e.g., object grasping, 3D perception, motion prediction and path planning). Deep generative models have recently achieved strong performance in hallucinating diverse 3D object shapes, capturing their overall structure and rough texture Choy et al. (2016); Sharma et al. (2016); Wu et al. (2016). The first generation of these models utilized voxel representations which scale cubically with resolution, limiting training to low-resolution shapes on typical hardware. Numerous recent papers have begun to propose high-resolution 3D shape representations with better scaling, such as those based on meshes, point clouds or octrees, but these often require more difficult training procedures and customized network architectures.

Our 3D shape model is motivated by a foundational concept in 3D perception: that of canonical views. The shape of a 3D object can be completely captured by a set of 2D images from multiple viewpoints (see Luong and Viéville (1996); Denton et al. (2004) for an analysis of selecting the location and number of viewpoints). Deep learning approaches for 2D image recognition and generation Simonyan and Zisserman (2014); He et al. (2016); Goodfellow et al. (2014); Karras et al. (2018) scale easily to high resolutions. This motivates the primary question in this paper: can a multi-view representation be used efficiently with modern deep learning methods?

We propose a novel approach for deep shape interpretation which captures the structure of an object via modeling of its canonical views in 2D as depth maps, in a framework we refer to as Multi-View Decomposition (MVD). By utilizing many 2D orthographic projections to capture shape, a model represented in this fashion can be up-scaled to high resolution by performing semantic super-resolution in 2D space, which leverages efficient, well-studied network structures and training procedures. The higher resolution depth maps are finally merged into a detailed 3D object using model carving.

Our method has several key components that allow effective and efficient training. We leverage two synergistic deep networks that decompose the task of representing an object’s depth: one that outputs the silhouette – capturing the gross structure; and a second that produces the local variations in depth – capturing the fine detail. This decomposition addresses the blurred images that often occur when minimizing reconstruction error by allowing the silhouette prediction to form sharp edges. Our method utilizes the low-resolution input shape as a rough template which simply needs carving and refinement to form the high resolution product. Learning the residual errors between this template and the desired high resolution shape simplifies the generation task and allows for constrained output scaling, which leads to significant performance improvements.

We evaluate our method’s ability to perform 3D object reconstruction on the ShapeNet dataset Chang et al. (2015). This standard evaluation task requires generating high resolution 3D objects from single 2D RGB images. Furthermore, due to the nature of our pipeline, we present the first results for 3D object super-resolution – generating high resolution 3D objects directly from low resolution 3D objects. Our method achieves state-of-the-art quantitative performance when compared to a variety of other 3D representations such as octrees, mesh models and point clouds. Furthermore, our system is the first to produce 3D objects at 512×512×512 resolution. We demonstrate these objects are visually impressive in isolation, and when compared to the ground truth objects. We additionally demonstrate that objects reconstructed from images can be placed in scenes to create realistic environments, as shown in figure 1. To ensure reproducible experimental comparison, code for our system has been made publicly available in a GitHub repository. Given the efficiency of our method, each experiment was run on a single NVIDIA Titan X GPU on the order of hours.

2 Related Work

Figure 1: Scene created from objects reconstructed by our method from RGB images at high resolution. See the supplementary video for better viewing.

Deep Learning with 3D Data  Recent advances with 3D data have leveraged deep learning, beginning with architectures such as 3D convolutions for object classification Maturana and Scherer (2015); Li et al. (2016). When adapted to 3D generation, these methods typically use an autoencoder network, with a decoder composed of 3D deconvolutional layers Choy et al. (2016); Wu et al. (2016). This decoder receives a latent representation of the 3D shape and produces a probability of occupancy at each discrete position in 3D voxel space. This approach has been combined with generative adversarial approaches Goodfellow et al. (2014) to generate novel 3D objects Wu et al. (2016); Smith and Meger (2017); Liu et al. (2017), but only at a limited resolution.

2D Super-Resolution  Super-resolution of 2D images is a well-studied problem Park et al. (2003). Traditionally, image super-resolution has used dictionary-style methods Freeman et al. (2002); Yang et al. (2010), matching patches of images to higher-resolution counterparts. This research also extends to depth map super-resolution Mac Aodha et al. (2012); Park et al. (2011); Hui et al. (2016). Modern approaches to super-resolution are built on deep convolutional networks Dong et al. (2016); Wang et al. (2015); Osendorfer et al. (2014) as well as generative adversarial networks Ledig et al. (2016); Karras et al. (2018) which use an adversarial loss to imagine high-resolution details in RGB images.

Multi-View Representation  Our work connects to multi-view representations which capture the characteristics of a 3D object from multiple viewpoints in 2D Koenderink and Van Doorn (1976); Murase and Nayar (1995); Su et al. (2015); Qi et al. (2016); Kar et al. (2017); Shin et al. (2018); Riegler et al. (2017), such as decomposing image silhouettes Macrini et al. (2002); Soltani et al. (2017), Light Field Descriptors Chen et al. (2003), and 2D panoramic mapping Shi et al. (2015). Other representations aim to use orientation Saxena et al. (2009), rotational invariance Kazhdan et al. (2003) or 3D-SURF features Knopp et al. (2010). While many of these representations are effective for 3D classification, they have not previously been utilized to recover 3D shape in high resolution.

Efficient 3D Representations  Given that naïve representations of 3D data incur cubic computational costs with respect to resolution, many alternate representations have been proposed. Octree methods Tatarchenko et al. (2017); Häne et al. (2017) use non-uniform discretization of the voxel space to efficiently capture 3D objects by adapting the discretization level locally based on shape. Hierarchical surface prediction (HSP) Häne et al. (2017) is an octree-style method which divides the voxel space into free, occupied and boundary space. The object is generated at different scales of resolution, where occupied space is generated at a very coarse resolution and the boundary space is generated at a very fine resolution. Octree generating networks (OGN) Tatarchenko et al. (2017) use a convolutional network that operates directly on octrees, rather than in voxel space. These methods have only shown novel generation results up to a limited resolution. Our method achieves higher accuracy at comparable resolutions and can efficiently produce novel objects as large as 512×512×512.

A recent trend is the use of unstructured representations such as mesh models Pontes et al. (2017); Kato et al. (2017); Wang et al. (2018) and point clouds Qi et al. (2017); Fan et al. (2017) which represent the data by an unordered set with a fixed number of points. MarrNet Wu et al. (2017), which resembles our work, models 3D objects through the use of 2.5D sketches, which capture depth maps from a single viewpoint. This approach requires working in voxel space when translating 2.5D sketches to high resolution, while our method can work directly in 2D space, leveraging 2D super-resolution technology within the 3D pipeline.

Figure 2: The complete pipeline for 3D object reconstruction and super-resolution outlined in this paper. Our method accepts either a single RGB image for low resolution reconstruction or a low resolution object for 3D super-resolution. ODM up-scaling is defined in section 3.1 and model carving in section 3.2.

3 Method

In this section we describe our methodology for representing high resolution 3D objects. Our algorithm is a novel approach which uses the six axis-aligned orthographic depth maps (ODMs) to efficiently scale 3D objects to high resolution without directly interacting with the voxels. To achieve this, a pair of networks is used for each view, decomposing the super-resolution task into predicting the silhouette and the relative depth from the low resolution ODM. This approach is able to recover fine object details and scales better to higher resolutions than previous methods, due to the simplified learning problem faced by each network and scalable computations that occur primarily in 2D image space.

3.1 Orthographic Depth Map Super-Resolution

Figure 3: Our Multi-View Decomposition framework (MVD). Each ODM prediction task can be decomposed into a silhouette and detail prediction. We further simplify the detail prediction task by encoding only the residual details (change from the low resolution input), masked by the ground truth silhouette.

Our method begins by obtaining the orthographic depth maps of the six primary views of the low-resolution 3D object. In an ODM, each pixel holds a value equal to the surface depth of the object along the viewing direction at the corresponding coordinate. This projection can be computed quickly and easily from an axis-aligned 3D array via z-clipping. Super-resolution is then performed directly on these ODMs, before being mapped onto the low resolution object to produce a high resolution object.
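As a concrete illustration, the z-clipping projection above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation; the view ordering and the sentinel depth n for empty rays are our own assumptions.

```python
import numpy as np

def extract_odms(voxels):
    """Extract the six axis-aligned orthographic depth maps (ODMs) of a
    binary voxel grid. Each pixel stores the depth (distance from the
    viewing face) of the first occupied voxel along the ray; rays that
    hit no voxel are assigned the sentinel depth n (an assumption)."""
    n = voxels.shape[0]
    odms = []
    for axis in range(3):              # x, y, z viewing axes
        for flip in (False, True):     # front and back face per axis
            v = np.flip(voxels, axis) if flip else voxels
            occupied = v.any(axis)     # silhouette seen from this face
            first = v.argmax(axis)     # index of first occupied voxel
            odms.append(np.where(occupied, first, n))
    return odms                        # six (n, n) integer depth maps
```

Because `argmax` on a boolean array returns the index of the first `True` entry, each ray is clipped at the nearest surface in a single vectorized pass.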

Representing an object by a set of depth maps, however, introduces a challenging learning problem, which requires both local and global consistency in depth. Furthermore, minimizing the mean squared error results in blurry images without sharp edges Mathieu et al. (2015); Pathak et al. (2016). This is particularly problematic as a depth map is required to be bimodal, with large variations in depth to create structure and small variations in depth to create texture and fine detail. To address this concern, we propose decomposing the learning problem into predicting the silhouette and depth map separately. Separating the challenge of predicting gross shape from fine detail regularizes and reduces the complexity of the learning problem, leading to improved results when compared with directly estimating new surface depths.

Our Multi-View Decomposition framework (MVD) uses a pair of deep convolutional models, $f_{SIL}$ and $f_{DEP}$, to separately predict the silhouette and the variations in depth of the higher resolution ODM. We depict our system in figure 3. The deep convolutional network for predicting the high-resolution silhouette, with parameters $\theta_{SIL}$, is passed the low resolution ODM $D_{LR}$, extracted from the input 3D object. The network outputs a probability $\hat{S} = f_{SIL}(D_{LR}; \theta_{SIL})$ that each pixel is occupied. It is trained by minimizing the mean squared error between the predicted and true silhouette of the high resolution ODM $D_{HR}$:

$$\mathcal{L}_{SIL} = \left\lVert \hat{S} - \mathbb{1}[D_{HR} > 0] \right\rVert_2^2,$$

where $\mathbb{1}[\cdot]$ is an indicator function applied at each coordinate in the image.

The same low-resolution ODM $D_{LR}$ is passed through the second deep convolutional neural network, denoted $f_{DEP}$ with parameters $\theta_{DEP}$, whose final output is passed through a sigmoid to produce an estimate for the variation of the ODM within a fixed range $\gamma$. This output is added to the low-resolution depth map to produce our prediction for a constrained high-resolution depth map $\hat{C}$:

$$\hat{C} = \gamma \, \sigma\!\left( f_{DEP}(D_{LR}; \theta_{DEP}) \right) + \mathrm{up}(D_{LR}),$$

where $\mathrm{up}(\cdot)$ denotes up-sampling using nearest neighbor interpolation.
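The constrained prediction can be sketched numerically as follows. This is a hedged sketch: it assumes the sigmoid output is scaled to the range [0, γ] and that depth values are rescaled by the up-sampling factor to match the high resolution grid; neither detail is specified in the text.

```python
import numpy as np

def nn_upsample(odm, factor):
    """Nearest neighbor up-sampling of a depth map; depth values are
    rescaled by the same factor to match the high resolution grid
    (an assumption of this sketch)."""
    return np.repeat(np.repeat(odm, factor, axis=0), factor, axis=1) * factor

def constrained_depth(residual_logits, low_res_odm, factor, gamma):
    """Bound the depth network's raw output to [0, gamma] with a sigmoid
    and add it to the up-sampled template, so the network can only refine
    the low resolution input rather than replace it."""
    variation = gamma / (1.0 + np.exp(-residual_logits))
    return variation + nn_upsample(low_res_odm, factor)
```

Constraining the output to a residual over the up-sampled template is what makes the learning problem a refinement rather than a full regeneration of the depth map.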

We train our network by minimizing the mean squared error between our prediction $\hat{C}$ and the ground truth high-resolution depth map $D_{HR}$. During training only, we mask the output with the ground truth silhouette $S$ to allow $f_{DEP}$ to focus effectively on fine detail. We further add a smoothing regularizer which penalizes the total variation $V(\hat{C})$ Rudin et al. (1992) within the predicted ODM. Our loss function is a simple combination of these terms:

$$\mathcal{L}_{DEP} = \left\lVert (\hat{C} - D_{HR}) \odot S \right\rVert_2^2 + \lambda \, V(\hat{C}),$$

where $\odot$ is the Hadamard product and $\lambda$ weights the regularizer. The total variation penalty acts as an edge-preserving denoiser which smooths out irregularities in the output.
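A minimal NumPy version of this masked loss with a total variation penalty might look as follows; the anisotropic TV form and the weight name `lam` are assumptions of this sketch.

```python
import numpy as np

def total_variation(x):
    """Anisotropic total variation of a 2D map (Rudin et al., 1992):
    the sum of absolute differences between neighboring pixels."""
    return np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()

def depth_loss(pred, target, silhouette, lam):
    """Mean squared error restricted to pixels inside the ground truth
    silhouette, plus a smoothness penalty on the predicted depth map."""
    masked_err = (pred - target) * silhouette
    return (masked_err ** 2).mean() + lam * total_variation(pred)
```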

The outputs of the constrained depth map and silhouette networks are then combined to produce a complete prediction for the high-resolution ODM. This is accomplished by masking the constrained high-resolution depth map with the predicted silhouette:

$$\hat{D}_{HR} = \hat{C} \odot \mathbb{1}[\hat{S} > 0.5],$$

where $\hat{D}_{HR}$ denotes our predicted high resolution ODM, which can then be mapped back onto the original low resolution object by model carving to produce a high resolution object. Each of the six high resolution ODMs is predicted using the same two network models, with side information identifying the view passed to the networks via a fourth channel in the corresponding low resolution ODM.

3.2 3D Model Carving

To complete our super-resolution procedure, the six ODMs are combined with the low-resolution object to form a high-resolution object. This begins by further smoothing the up-sampled ODMs with an adaptive averaging filter, which considers only neighboring pixels within a small radius. To preserve edges, only neighboring pixels whose values lie within a threshold of the value of the center pixel are included. This smoothing, along with the total variation regularization in our loss function, enforces smooth changes in local depth regions.
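This edge-preserving filter can be sketched as follows; the radius and depth threshold values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def adaptive_smooth(odm, radius=1, threshold=2):
    """Edge-preserving smoothing: each pixel is replaced by the mean of
    the window pixels within `radius` whose depth differs from the center
    by at most `threshold`, so sharp depth edges are never averaged."""
    out = odm.astype(float).copy()
    h, w = odm.shape
    for i in range(h):
        for j in range(w):
            i0, i1 = max(0, i - radius), min(h, i + radius + 1)
            j0, j1 = max(0, j - radius), min(w, j + radius + 1)
            window = odm[i0:i1, j0:j1].astype(float)
            keep = np.abs(window - odm[i, j]) <= threshold
            out[i, j] = window[keep].mean()
    return out
```

The center pixel always passes its own threshold, so the mean is always well-defined.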

Model carving begins by first up-sampling the low-resolution model to the desired resolution, using nearest neighbor interpolation. We then use the predicted ODMs to determine the surface of the new object. The carving procedure is separated into (1) structure carving, corresponding to the silhouette prediction $\hat{S}$, and (2) detail carving, corresponding to the constrained depth prediction $\hat{C}$.

For structure carving, for each predicted silhouette $\hat{S}$, if a coordinate is predicted unoccupied, then all voxels along the ray perpendicular to that coordinate are highlighted for removal. The removal only occurs if at least two ODMs agree on the removal of a voxel. As there is a large amount of overlap in the surface area that the six ODMs observe, this silhouette agreement is enforced to maintain the structure of the object.

This same process occurs for detail carving with the constrained depth predictions $\hat{C}$. However, we do not require agreement within the constrained depth map predictions. This is because, unlike the silhouettes, a depth map can cause or deepen concavities in the surface of the object which may not be visible from any other face. Requiring agreement among depth maps would eliminate their ability to influence these concavities. Thus, performing detail carving simply involves removing all voxels perpendicular to each coordinate of each ODM, up to the predicted depth.
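Putting the two carving passes together, a rough NumPy sketch could look as follows. The six views are assumed ordered +x, -x, +y, -y, +z, -z, with depths measured from each viewing face; these conventions are ours, not the paper's.

```python
import numpy as np

def carve(voxels, odms, sils):
    """Structure and detail carving. `voxels` is the nearest-neighbor
    up-sampled low resolution object (boolean grid). A ray is cleared
    entirely when at least two silhouettes vote it empty; otherwise the
    voxels lying in front of the predicted surface depth are removed."""
    n = voxels.shape[0]
    votes = np.zeros(voxels.shape, dtype=int)    # silhouette removal votes
    detail = np.zeros(voxels.shape, dtype=bool)  # detail-carving mask
    idx = np.arange(n)
    for k in range(6):
        axis, flip = divmod(k, 2)
        d = idx[::-1] if flip else idx           # depth from the viewing face
        shape = [1, 1, 1]
        shape[axis] = n
        d = d.reshape(shape)                     # broadcast along the ray
        empty_ray = np.expand_dims(sils[k] == 0, axis)
        votes += empty_ray                       # structure-carving vote
        in_front = d < np.expand_dims(odms[k], axis)
        detail |= in_front & ~empty_ray          # carve up to the surface
    return voxels & ~((votes >= 2) | detail)
```

Structure carving requires two agreeing views before clearing a ray, while detail carving acts per-view so each depth map can cut its own concavities.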

4 Experiments

In this section we present results for our method, Multi-View Decomposition Networks (MVD), on both 3D object super-resolution and 3D object reconstruction from single RGB images. Our results are evaluated across 13 classes of the ShapeNet Chang et al. (2015) dataset. 3D super-resolution is the task of generating a high resolution 3D object conditioned on a low resolution input, while 3D object reconstruction is the task of re-creating high resolution 3D objects from a single RGB image of the object.

4.1 3D Object Super-Resolution

Figure 4: Super-resolution rendering results. Each set shows, from left to right, the low resolution input and the result of MVD. Sets in (b) additionally show the ground-truth objects on the far right.
Figure 5: Super-resolution rendering results. Each pair shows the low resolution input (left) and the result of MVD (right).

Dataset  The dataset consists of low resolution voxelized objects and their high resolution counterparts. These objects were produced by converting CAD models from the ShapeNetCore dataset Chang et al. (2015) into voxel format in a canonical view. We work with the three commonly used object classes from this dataset: Car, Chair and Plane, with around 8000, 7000 and 4000 objects respectively. For training, we pre-process this dataset to extract the six ODMs from each object at high and low resolution. CAD models converted at high resolution do not remain watertight in many cases, making it difficult to fill the inner volume of the object. We describe an efficient method for obtaining high resolution voxelized objects in the supplementary material. Data is split into training, validation and test sets using a ratio of 70:10:20.

Evaluation  We evaluate our method quantitatively using the intersection over union (IoU) metric against a simple baseline and against the predictions of the individual networks on the test set. The baseline method up-scales the low resolution input to the high resolution using nearest neighbor up-sampling. While our full method, MVD, uses a combination of networks, we also present an ablation study to evaluate the contribution of each separate network.
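For reference, the IoU metric and the nearest neighbor baseline are straightforward to express; this is a sketch, not the authors' evaluation code.

```python
import numpy as np

def iou(a, b):
    """Intersection over union between two binary voxel grids."""
    a, b = a.astype(bool), b.astype(bool)
    return (a & b).sum() / (a | b).sum()

def nn_baseline(low_res, factor):
    """Nearest neighbor up-sampling of a voxel grid along all three axes."""
    for axis in range(3):
        low_res = np.repeat(low_res, factor, axis=axis)
    return low_res
```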

Implementation  The super-resolution task requires a pair of networks, $f_{SIL}$ and $f_{DEP}$, which share the same architecture. This architecture is derived from the generator of SRGAN Ledig et al. (2016), a state-of-the-art 2D super-resolution network. Exact network architectures and the training regime are provided in the supplementary material.

Results  The super-resolution IoU scores are presented in table 1. Our method greatly outperforms the naïve nearest neighbor up-sampling baseline in every class. While we find that the silhouette prediction contributes far more to the IoU score, the addition of the depth variation network further increases it. This is because the silhouette captures the gross structure of the object from multiple viewpoints, while the depth variation captures the fine-grained details, which contribute less to the total IoU score. To qualitatively demonstrate the results of our super-resolution system, we render objects from the test set at two resolutions, in figure 5 and figure 4. The predicted high-resolution objects are all of high quality and accurately mimic the shapes of the ground truth objects. Additional renderings, as well as multiple objects from each class at high resolution, can be found in our supplementary material.

4.2 3D Object Reconstruction from RGB Images

Category Baseline Depth Variation ($f_{DEP}$) Silhouette ($f_{SIL}$) MVD (Both)
Car 73.2 80.6 86.9 89.9
Chair 54.9 58.5 67.3 68.5
Plane 39.9 50.5 70.2 71.1
Table 1: Super-Resolution IoU results against the nearest neighbor baseline and an ablation over the individual networks.

Dataset  To match the datasets used by prior work, two datasets are used for 3D object reconstruction, both derived from the ShapeNet dataset. The first consists of only the Car, Chair and Plane classes from the ShapeNet dataset, and we re-use the low and high resolution voxel objects produced for these classes in the previous section. The CAD models for each of these objects were rendered into RGB images capturing random viewpoints of the objects, at random elevations and all possible azimuth rotations. The voxelized objects and corresponding images were split into training, validation and test sets with a ratio of 70:10:20.

The second dataset is that provided by Choy et al. (2016). It consists of images and objects produced from the 3 ShapeNet classes used in the previous section, as well as 10 additional classes, for a total of around 50000 objects. From each object, RGB images are rendered at random viewpoints, and we again compute the low and high resolution voxelized models and ODMs. The data is split into training, validation and test sets with a ratio of 70:10:20.

Evaluation  We evaluate our method quantitatively with two evaluation schemes. In the first, we use IoU scores for objects reconstructed at high resolution. We compare against HSP Häne et al. (2017) using the first dataset, and against OGN Tatarchenko et al. (2017) using the second dataset. To study the effectiveness of our super-resolution pipeline, we also compute the IoU scores using the low resolution objects predicted by our autoencoder (AE), with nearest neighbor up-sampling to produce predictions at the same high resolution.

Our second evaluation is performed only on the second dataset, by comparing the accuracy of the surfaces of predicted objects to those of the ground truth meshes. Following the evaluation procedure defined by Wang et al. (2018), we first convert the voxel models into meshes by defining squared polygons on all exposed faces on the surface of the voxel models. We then uniformly sample points from the two mesh surfaces and compute F1 scores. Precision and recall are calculated as the percentage of sampled points whose nearest neighbor in the other set lies within a squared distance threshold. We compare to state-of-the-art mesh model methods, N3MR Kato et al. (2017) and Pixel2Mesh Wang et al. (2018), a point cloud method, PSG Fan et al. (2017), and a voxel baseline, 3D-R2N2 Choy et al. (2016), using the values reported by Wang et al. (2018).
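The surface F1 score described above can be sketched directly with brute-force nearest neighbors; the threshold is left as a parameter since its value is not stated here.

```python
import numpy as np

def surface_f1(pred_pts, gt_pts, tau):
    """F1 score between two sampled point sets. Precision: fraction of
    predicted points with a ground truth point within squared distance
    tau; recall: the converse. Both sets are (N, 3) arrays."""
    d2 = ((pred_pts[:, None, :] - gt_pts[None, :, :]) ** 2).sum(-1)
    precision = (d2.min(axis=1) < tau).mean()
    recall = (d2.min(axis=0) < tau).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The pairwise distance matrix makes this O(N·M) in memory; a k-d tree would be used in practice for large samples.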

Implementation  For 3D object reconstruction, we first trained a standard autoencoder, similar to prior work Choy et al. (2016); Smith and Meger (2017), to produce objects at 32×32×32 resolution. These low resolution objects are then passed to our 3D super-resolution method to generate 3D object reconstructions at high resolution. This process is depicted in figure 2. The exact network architecture and training regime are provided in the supplementary material.

Figure 6: 3D object reconstruction rendering results from our method, MVD (bottom), of the 13 classes from ShapeNet, by interpreting 2D image input (top).
Category AE HSP Häne et al. (2017) MVD (Ours)
Car 55.2 70.1 72.7
Chair 36.4 37.8 40.1
Plane 28.9 56.1 56.4
Category AE OGN Tatarchenko et al. (2017) MVD (Ours)
Car 68.1 78.2 80.7
Chair 37.6 - 43.3
Plane 34.6 - 58.9
Table 2: 3D Object Reconstruction IoU. Cells with a dash (-) indicate that the corresponding result was not reported by the original authors.
Category 3D-R2N2 Choy et al. (2016) PSG Fan et al. (2017) N3MR Kato et al. (2017) Pixel2Mesh Wang et al. (2018) MVD (Ours)
Plane 41.46 68.20 62.10 71.12 87.34
Bench 34.09 49.29 35.84 57.57 69.92
Cabinet 49.88 39.93 21.04 60.39 65.87
Car 37.80 50.70 36.66 67.86 67.69
Chair 40.22 41.60 30.25 54.38 62.57
Monitor 34.38 40.53 28.77 51.39 57.48
Lamp 32.35 41.40 27.97 48.15 48.37
Speaker 45.30 32.61 19.46 48.84 53.88
Firearm 28.34 69.96 52.22 73.20 78.12
Couch 40.01 36.59 25.04 51.90 53.66
Table 43.79 53.44 28.40 66.30 68.06
Cellphone 42.31 55.95 27.96 70.24 86.00
Watercraft 37.10 51.28 43.71 55.12 64.07
Mean 39.01 48.58 33.80 59.72 66.39
Table 3: 3D object reconstruction surface sampling F1 scores.

Results  The results of our IoU evaluation compared to the octree methods Tatarchenko et al. (2017); Häne et al. (2017) can be seen in table 2. We achieve state-of-the-art performance on every object class in both datasets. Our surface accuracy results, compared to Wang et al. (2018); Fan et al. (2017); Kato et al. (2017); Choy et al. (2016), can be seen in table 3. Our method achieves state-of-the-art results on all 13 classes, with significant improvements for many object classes and a large improvement in the mean over all classes when compared against the methods presented. To qualitatively evaluate our performance, we render our reconstructions for each class in figure 6. Additional renderings can be found in the supplementary material.

5 Conclusion

In this paper we argue for the application of multi-view representations when predicting the structure of objects at high resolution. We outline our Multi-View Decomposition framework, a novel system for learning to represent 3D objects and demonstrate its affinity for capturing category-specific shape details at a high resolution by operating over the six orthographic projections of the object.

In the task of super-resolution, our method outperforms baseline methods by a large margin, and we show its ability to produce objects as large as 512×512×512, a 16-fold increase in size over the input objects. The results produced are visually impressive, even when compared against the ground truth. When applied to the reconstruction of high-resolution 3D objects from single RGB images, we outperform several state-of-the-art methods with a variety of representation types, across two evaluation metrics.

All of our visualizations demonstrate the effectiveness of our method at capturing fine-grained detail, which is not present in the low resolution input but must be captured in our network’s weights during learning. Furthermore, given that the deep aspect of our method works entirely in 2D space, our method scales naturally to high resolutions. This paper demonstrates that multi-view representations along with 2D super-resolution through decomposed networks is indeed capable of modeling complex shapes.


  • Chang et al. [2015] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  • Chen et al. [2003] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, and Ming Ouhyoung. On visual similarity based 3d model retrieval. In Computer graphics forum, volume 22, pages 223–232. Wiley Online Library, 2003.
  • Choy et al. [2016] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European Conference on Computer Vision, pages 628–644. Springer, 2016.
  • Denton et al. [2004] Trip Denton, M Fatih Demirci, Jeff Abrahamson, Ali Shokoufandeh, and Sven Dickinson. Selecting canonical views for view-based 3-d object recognition. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 2, pages 273–276. IEEE, 2004.
  • Dong et al. [2016] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence, 38(2):295–307, 2016.
  • Fan et al. [2017] Haoqiang Fan, Hao Su, and Leonidas Guibas. A point set generation network for 3d object reconstruction from a single image. In Conference on Computer Vision and Pattern Recognition (CVPR), volume 38, 2017.
  • Freeman et al. [2002] William T Freeman, Thouis R Jones, and Egon C Pasztor. Example-based super-resolution. IEEE Computer graphics and Applications, 22(2):56–65, 2002.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680. 2014.
  • Häne et al. [2017] Christian Häne, Shubham Tulsiani, and Jitendra Malik. Hierarchical surface prediction for 3d object reconstruction. arXiv preprint arXiv:1704.00710, 2017.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hui et al. [2016] Tak-Wai Hui, Chen Change Loy, and Xiaoou Tang. Depth map super-resolution by deep multi-scale guidance. In European Conference on Computer Vision, pages 353–369. Springer, 2016.
  • Kar et al. [2017] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In Advances in Neural Information Processing Systems, pages 364–375, 2017.
  • Karras et al. [2018] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. International Conference on Learning Representations, 2018.
  • Kato et al. [2017] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. arXiv preprint arXiv:1711.07566, 2017.
  • Kazhdan et al. [2003] Michael Kazhdan, Thomas Funkhouser, and Szymon Rusinkiewicz. Rotation invariant spherical harmonic representation of 3 d shape descriptors. In Symposium on geometry processing, volume 6, pages 156–164, 2003.
  • Knopp et al. [2010] Jan Knopp, Mukta Prasad, Geert Willems, Radu Timofte, and Luc Van Gool. Hough transform and 3d surf for robust three dimensional classification. In European Conference on Computer Vision, pages 589–602. Springer, 2010.
  • Koenderink and Van Doorn [1976] Jan J Koenderink and Andrea J Van Doorn. The singularities of the visual mapping. Biological cybernetics, 24(1):51–59, 1976.
  • Ledig et al. [2016] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
  • Li et al. [2016] Yangyan Li, Soeren Pirk, Hao Su, Charles R Qi, and Leonidas J Guibas. Fpnn: Field probing neural networks for 3d data. In Advances in Neural Information Processing Systems, pages 307–315, 2016.
  • Liu et al. [2017] Jerry Liu, Fisher Yu, and Thomas Funkhouser. Interactive 3d modeling with a generative adversarial network. arXiv preprint arXiv:1706.05170, 2017.
  • Luong and Viéville [1996] Q-T Luong and Thierry Viéville. Canonical representations for the geometries of multiple projective views. Computer vision and image understanding, 64(2):193–229, 1996.
  • Mac Aodha et al. [2012] Oisin Mac Aodha, Neill DF Campbell, Arun Nair, and Gabriel J Brostow. Patch based synthesis for single depth image super-resolution. In European Conference on Computer Vision, pages 71–84. Springer, 2012.
  • Macrini et al. [2002] Diego Macrini, Ali Shokoufandeh, Sven Dickinson, Kaleem Siddiqi, and Steven Zucker. View-based 3-d object recognition using shock graphs. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 3, pages 24–28. IEEE, 2002.
  • Mathieu et al. [2015] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
  • Maturana and Scherer [2015] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
  • Murase and Nayar [1995] Hiroshi Murase and Shree K Nayar. Visual learning and recognition of 3-d objects from appearance. International journal of computer vision, 14(1):5–24, 1995.
  • Osendorfer et al. [2014] Christian Osendorfer, Hubert Soyer, and Patrick Van Der Smagt. Image super-resolution with fast approximate convolutional sparse coding. In International Conference on Neural Information Processing, pages 250–257. Springer, 2014.
  • Park et al. [2011] Jaesik Park, Hyeongwoo Kim, Yu-Wing Tai, Michael S Brown, and Inso Kweon. High quality depth map upsampling for 3d-tof cameras. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1623–1630. IEEE, 2011.
  • Park et al. [2003] Sung Cheol Park, Min Kyu Park, and Moon Gi Kang. Super-resolution image reconstruction: a technical overview. IEEE signal processing magazine, 20(3):21–36, 2003.
  • Pathak et al. [2016] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
  • Pontes et al. [2017] Jhony K Pontes, Chen Kong, Sridha Sridharan, Simon Lucey, Anders Eriksson, and Clinton Fookes. Image2mesh: A learning framework for single image 3d reconstruction. arXiv preprint arXiv:1711.10669, 2017.
  • Qi et al. [2016] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2016.
  • Qi et al. [2017] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • Riegler et al. [2017] Gernot Riegler, Ali Osman Ulusoy, Horst Bischof, and Andreas Geiger. Octnetfusion: Learning depth fusion from data. In Proceedings of the International Conference on 3D Vision, 2017.
  • Rudin et al. [1992] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
  • Saxena et al. [2009] Ashutosh Saxena, Min Sun, and Andrew Y Ng. Make3d: Learning 3d scene structure from a single still image. IEEE transactions on pattern analysis and machine intelligence, 31(5):824–840, 2009.
  • Sharma et al. [2016] Abhishek Sharma, Oliver Grau, and Mario Fritz. Vconv-dae: Deep volumetric shape learning without object labels. In European Conference on Computer Vision, pages 236–250. Springer, 2016.
  • Shi et al. [2015] Baoguang Shi, Song Bai, Zhichao Zhou, and Xiang Bai. Deeppano: Deep panoramic representation for 3-d shape recognition. IEEE Signal Processing Letters, 22(12):2339–2343, 2015.
  • Shin et al. [2018] Daeyun Shin, Charless Fowlkes, and Derek Hoiem. Pixels, voxels, and views: A study of shape representations for single view 3d object shape prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Smith and Meger [2017] Edward J Smith and David Meger. Improved adversarial systems for 3d object generation and reconstruction. In Conference on Robot Learning, pages 87–96, 2017.
  • Soltani et al. [2017] Amir Arsalan Soltani, Haibin Huang, Jiajun Wu, Tejas D Kulkarni, and Joshua B Tenenbaum. Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • Su et al. [2015] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In Proceedings of the IEEE international conference on computer vision, pages 945–953, 2015.
  • Tatarchenko et al. [2017] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2088–2096, 2017.
  • Wang et al. [2018] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. arXiv preprint arXiv:1804.01654, 2018.
  • Wang et al. [2015] Zhaowen Wang, Ding Liu, Jianchao Yang, Wei Han, and Thomas Huang. Deep networks for image super-resolution with sparse prior. In Proceedings of the IEEE International Conference on Computer Vision, pages 370–378, 2015.
  • Wu et al. [2016] Jiajun Wu, Chengkai Zhang, Tianfan Xue, William T Freeman, and Joshua B Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
  • Wu et al. [2017] Jiajun Wu, Yifan Wang, Tianfan Xue, Xingyuan Sun, Bill Freeman, and Josh Tenenbaum. Marrnet: 3d shape reconstruction via 2.5d sketches. In Advances in Neural Information Processing Systems, pages 540–550, 2017.
  • Yang et al. [2010] Jianchao Yang, John Wright, Thomas S Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE transactions on image processing, 19(11):2861–2873, 2010.