1 Introduction
A key limitation of current generative models [33, 32, 10, 21, 29, 28] is the availability of suitable training data (e.g., 3D volumes, structured image sets) for supervision.
While methods exist to learn the 3D structure of classes of objects, they typically require 3D data as input. Regrettably, such 3D data is difficult to acquire, in particular for the “long tail” of exotic classes: ShapeNet might have chair, but it does not have chanterelle. Can we enable biologists to benefit from a generative model for such rarely observed classes?
Addressing this problem, we suggest a method to learn 3D structure from 2D images only (Fig. 1). Reasoning about 3D structure from 2D observations without assuming anything about their relation is challenging, as illustrated by Plato's Allegory of the Cave [30]: how can we hope to understand higher dimensions from seeing only projections? If multiple views (maybe only two [36, 11]) of the same object are available, multi-view analysis without 3D supervision has been successful. Regrettably, massive image collections do not come in this form; they are, and will remain, unstructured: they show random instances under random pose and uncalibrated lighting, in unknown relations.
Our first main contribution (Sec. 3) is adversarial training of a 3D generator against a discriminator that operates exclusively on widely available unstructured collections of 2D images, which we call a platonic discriminator. During training, the generator produces a 3D shape that is projected (rendered) to 2D and presented to the 2D discriminator. Making this connection between the 3D generator and the 2D discriminator, our second key contribution, is enabled by a family of rendering layers that can account for occlusion and color (Sec. 4). These layers have, and need, no learnable parameters and are efficient to backpropagate [24]. From these two key building blocks we construct a system that learns the 3D shapes of common classes such as chairs and cars, but also of exotic ones, from unstructured 2D image collections. We demonstrate 3D reconstruction from a single 2D image as a key application (Sec. 5).
2 Related Work
Several papers suggest (adversarial) learning using 3D voxel representations [33, 32, 10, 21, 29, 28, 31, 35, 27, 17] or point cloud input [1, 9]. A simplified design of such a network, adapted to our task, is shown in Fig. 3, c: an encoder produces a latent code that is fed into a generator to produce a 3D representation (i.e., a voxel grid). A 3D discriminator then analyzes samples both from the generator and from the ground-truth distribution. Note that this procedure requires 3D supervision, i.e., it is limited by the type and size of the available 3D data set, such as [4].
Girdhar et al. [10] work on a joint embedding of 3D voxels and 2D images, but still require 3D voxelizations as input. Fan et al. [8] produce points from 2D images, but similarly with 3D data as training input. Choy et al.'s recurrent design takes multiple images as input [5] while also being trained on 3D data. Kar et al. [14] propose a simple “unprojection” network component to establish a relation between 2D pixels and 3D voxels, but without resolving occlusion, and again with 3D supervision.
Cashman and Fitzgibbon [3] have extracted template-based shape spaces from collections of 2D images. Similarly, Carreira et al. [2] use correspondence to templates across segmentation-labeled 2D image data sets to reconstruct 3D shapes.
Closer to our approach is Rezende et al. [22], who also learn 3D representations from 2D images. However, they make use of a partially differentiable renderer [19] that is limited to surface orientation and shading, while our formulation can resolve both occlusion from the camera and appearance. Also, their representation of the 3D volume is a latent one, that is, it has no immediate physical interpretation, which is required in practice, e.g., for measurements, or to run simulations such as renderings or 3D printing. This choice of a deep representation of the 3D world is shared by Eslami et al. [7]. Tulsiani et al. [26] use a special case of our rendering layers in a setting where multiple 2D images, each with a known camera transform, are available at learning time. We take it a step further and use a GAN to work with unstructured single images at training time; in particular, we do not even know the camera pose relative to the object in the image. Finally, our image formation goes beyond visual hulls, accounting for color and occlusion.
While early suggestions for extending differentiable renderers to polygonal meshes exist, they are limited to deformations of a predefined template [16]. We work with voxels, which can express arbitrary topology; e.g., we can generate airplanes with drastically different layouts that are not mere deformations of a base shape. Combining our approach with sparse voxelizations [23] would allow reproducing even finer details.
Similarly, inter-view constraints can be used to learn depth maps [36, 11] via reprojection: if the depth labels are correct, reprojecting one image into the other view must reproduce that other image. Our method does not learn a single depth map but a full voxel grid, and allows principled handling of occlusions.
A generalization from visual hulls to full 3D scenes is discussed by Yan et al. [34], illustrated in Fig. 3, b: instead of a 3D loss, they employ a simple projection along the major axes, allowing the use of a 2D loss. However, multiple 2D images of the same object are required; in practice this is achieved by rendering the 3D shape into 2D images from multiple views. This makes two assumptions: multiple images in a known relation, and an available reference appearance (i.e., light, materials). Our approach relaxes both requirements: we use a discriminator that can work on arbitrary projections and arbitrary natural input images, without a known reference.
3 3D Shape From 2D Image Collections
Here, we discuss our PlatonicGAN in more detail, making use of our rendering layers to be introduced in Sec. 4.
Common GAN
Our method is a classic (generative) adversarial design [12] with two main differences: the discriminator operates in a different space (2D) than the generator (3D), and the two are linked by a hard-wired projection operator (rendering layer, Sec. 4).
Let us recall the classic adversarial learning of 3D shapes [32], which is a min-max game

\min_\theta \max_\phi \; c_\mathrm{data}(\phi) + c_\mathrm{gen}(\theta, \phi) \quad (1)

between the data cost c_\mathrm{data} and the generator cost c_\mathrm{gen}.
The data cost is

c_\mathrm{data}(\phi) = \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}\left[\log D_\phi(\mathbf{x})\right] \quad (2)

where D_\phi is the discriminator with learned parameters \phi, which is presented with samples \mathbf{x} from the distribution p(\mathbf{x}) of real 3D shapes. Here \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})} denotes the expected value over that distribution.
The generation cost is

c_\mathrm{gen}(\theta, \phi) = \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}\left[\log(1 - D_\phi(G_\theta(\mathbf{z})))\right] \quad (3)

where G_\theta is the generator with parameters \theta that maps a latent code \mathbf{z} to the data domain.
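As a concrete, purely illustrative sketch, both costs can be estimated over a minibatch with toy stand-ins for the discriminator and generator; the linear maps and shapes below are hypothetical placeholders, not our actual networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x, phi):
    # toy discriminator: sigmoid of a linear score (placeholder for a CNN)
    return 1.0 / (1.0 + np.exp(-(x * phi).sum(axis=-1)))

def G(z, theta):
    # toy generator: linear map from latent code to data domain
    return z @ theta

phi = rng.normal(size=4)          # discriminator parameters
theta = rng.normal(size=(2, 4))   # generator parameters
x_real = rng.normal(size=(8, 4))  # samples from the "real" distribution
z = rng.normal(size=(8, 2))       # latent codes

# Eq. 2: c_data = E[log D(x)], estimated over a minibatch
c_data = np.log(D(x_real, phi)).mean()
# Eq. 3: c_gen = E[log(1 - D(G(z)))]
c_gen = np.log(1.0 - D(G(z, theta), phi)).mean()
# both are log-probabilities, hence non-positive
```

In an actual training loop, the discriminator ascends these costs while the generator descends c_gen.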
Platonic GANs
extend the generator cost to

c_\mathrm{gen}(\theta, \phi) = \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z}),\, \mathbf{v} \sim p(\mathbf{v})}\left[\log(1 - D_\phi(P_{\mathbf{v}}(G_\theta(\mathbf{z}))))\right] \quad (4)

where the projection operator P_{\mathbf{v}} projects the generator result from 3D to 2D along a sampled view direction \mathbf{v}. Different implementations of P are discussed in Sec. 4. While many parametrizations for views are possible, we here choose an orthographic camera that looks at the origin, with upright orientation, from a Euclidean position \mathbf{v}. \mathbb{E}_{\mathbf{v} \sim p(\mathbf{v})} is the expected value across the distribution of views.
Full Platonic GAN
Additionally, we make use of three recent ideas orthogonal to our platonic concept, resulting in

\min_{\theta, \psi} \max_\phi \; c_\mathrm{data}(\phi) + c_\mathrm{gen}(\theta, \psi, \phi) + \lambda_1 c_\mathrm{KL}(\psi) + \lambda_2 c_\mathrm{rec}(\theta, \psi) \quad (5)

where the generator G'_{\theta,\psi} includes an encoding step with parameters \psi, c_\mathrm{KL} enforces a normal distribution of latent codes, and c_\mathrm{rec} encourages the encoded, generated, and projected result to be similar to the encoder input. We detail each of these three steps in the following paragraphs:

Generator
First, the final generator does not directly work on a latent code \mathbf{z}, but allows for an encoder E_\psi with parameters \psi that maps an input 2D image or 3D volume \mathbf{x} to a latent code:

G'_{\theta,\psi}(\mathbf{x}) = G_\theta(E_\psi(\mathbf{x})) \quad (6)

Using the identity instead of a complex mapping E addresses a (random) generation task, while using images or volumes as input addresses reconstruction or filtering problems, respectively. Note that the expected value for the generator is now taken across 2D images \mathbf{x} and no longer across latent codes.
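A minimal sketch of this composition (Eq. 6) with hypothetical toy stand-ins for encoder and generator; the linear maps and shapes are illustrative only:

```python
import numpy as np

def E(x, psi):
    # toy encoder: flatten the input image, map it to a latent code
    return x.reshape(-1) @ psi

def G(z, theta):
    # toy generator: map the latent code to a (flattened) 3D volume
    return z @ theta

rng = np.random.default_rng(0)
psi = rng.normal(size=(16, 2))    # encoder parameters
theta = rng.normal(size=(2, 64))  # generator parameters

# reconstruction task: a 2D image is encoded, then decoded to a volume
x = rng.random((4, 4))
volume = G(E(x, psi), theta).reshape(4, 4, 4)   # G'(x) = G(E(x))

# pure generation task: the "encoder" is the identity on a latent code
z = rng.normal(size=2)
volume_gen = G(z, theta).reshape(4, 4, 4)
```

The same generator serves both tasks; only the source of the latent code changes.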
Reconstruction
Third, we encourage the encoder and generator to reproduce the input in the \ell_1 sense when the view is the input view, by convention \mathbf{v}_0, so

c_\mathrm{rec}(\theta, \psi) = \mathbb{E}_{\mathbf{x}}\left[\lVert P_{\mathbf{v}_0}(G_\theta(E_\psi(\mathbf{x}))) - \mathbf{x} \rVert_1\right] \quad (7)
While this step is optional, since the original view, as well as views similar to it, is already contained in the generator term, we found it useful to give direct additional weight to this special view. It adds stability to the optimization, as it is easy to find an initial solution that matches this 2D cost before refining the 3D structure.
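This reconstruction term can be sketched as follows; the max-projection stand-in for the input-view projection and the mean-absolute-difference form of the norm are assumptions for illustration:

```python
import numpy as np

def project_v0(volume):
    # hypothetical stand-in for projecting along the input view:
    # an orthographic max over the depth axis of the volume
    return volume.max(axis=0)

rng = np.random.default_rng(1)
x = rng.random((32, 32))            # encoder input image
volume = rng.random((32, 32, 32))   # encoded-and-generated 3D volume

# reconstruction cost: mean absolute difference between the input
# image and the re-projected volume
c_rec = np.abs(project_v0(volume) - x).mean()
```

During training, this scalar would simply be added (weighted) to the generator objective.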
3.1 Optimization
Two properties enable optimizing a Platonic GAN: first, maximizing the expected value across the distribution of views, and second, backpropagation through the projection operator P.
We extend the classic GAN optimization procedure to become Alg. 1, using binary cross-entropy as the loss. We use the regularization from Mescheder et al. [20] for training and increase variation by minibatch standard deviation [15].

View sampling
Different options exist to sample the views in the for loop of Alg. 1, even if the distribution of views is uniform. A simple solution is to sample one view, entirely at random, in every classic GAN update step. While this works, it introduces a lot of variance into the estimate of the gradient, which ideally would be taken over all views.
We therefore sample multiple views for every update. This is also economical: the result of the generator, a 3D volume, is costly to produce, but once produced it is simple to render multiple times.
A more refined strategy keeps those views fixed across multiple updates. This allows the generator to get a certain set of views right before moving on to new ones.
Finally, we suggest an “umbrella” strategy. The intuition is that generating a 3D volume that matches one view is simple: just extrude the contour into 3D; this would not look objectionable to any critic with limited view variation. To resolve errors from other views, it might be simplest to start fixing errors in views similar to a view that is already right. Incrementally, views become more varied, like opening an umbrella. This is achieved by choosing the view angle

\alpha = \mathrm{rnd}(0, \min(360^\circ, c\, i)) \quad (8)

where \mathrm{rnd}(a, b) returns a uniform random number from [a, b], c is a constant to control the opening speed, and i is the learning iteration.
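A sketch of such an opening schedule; the degree parametrization, the 360° cap, and the constant c = 0.5 are illustrative assumptions, not tuned values:

```python
import numpy as np

def umbrella_view(i, c=0.5, rng=None):
    # sample a view angle uniformly from [0, min(360, c*i)) degrees;
    # the admissible range "opens" with training iteration i
    rng = rng or np.random.default_rng()
    return rng.uniform(0.0, min(360.0, c * i))

rng = np.random.default_rng(0)
early = [umbrella_view(10, rng=rng) for _ in range(100)]     # narrow range
late = [umbrella_view(10_000, rng=rng) for _ in range(100)]  # fully open
```

Early in training all sampled views cluster near the initial view; later they cover the whole circle.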
Projection
While many projection operators between different dimensions could be conceived, we focus in the next section (Sec. 4) on the case of a 3D generator on a regular voxel grid and a 2D discriminator on a regular image. Methods to scale dense regular grids to sparse ones exist [25] and should be applicable to our approach as well. Consequently, P is a mapping from a (multi-channel) three-dimensional voxel grid to a (multi-channel) 2D image that depends on the view direction \mathbf{v}.
We further decompose P into a linear transformation R_{\mathbf{v}} that depends on the view and an image formation function f that does not, i.e., is view-independent. The transformation R_{\mathbf{v}} is shared by all implementations of the rendering layer, so we will only discuss the key differences of f in the following and assume the rotation to have already been applied. Note that a rotation with linear resampling is backpropagatable and provided, e.g., as torch.nn.functional.grid_sample. While we work in orthographic space, R_{\mathbf{v}} could also be constructed to be perspective.

4 Rendering Layers
Rendering layers (Fig. 4) map 3D information to 2D images so they can be presented to a discriminator. As explained, we have already rotated the 3D volume (Fig. 4, a) into camera space for view direction \mathbf{v} (Fig. 4, b), such that each pixel value is computed from all voxel values along its ray, and only those (Fig. 4, c). Consequently, a rendering layer is described by how it maps the sequence of voxel values along a ray to a pixel value. Composing the full image just amounts to executing this mapping for every pixel, i.e., for all voxels along that pixel's ray.
Note that the rendering layer does not have any learnable parameters. It just serves as an additional constraint that allows the 3D generator to change with respect to a loss observed in 2D. We will now discuss several variants of the image formation function f, implementing different forms of volume rendering [6].
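The overall pipeline, a view-dependent rotation followed by a view-independent image formation, can be sketched as follows. The axis-aligned rotation and the max-projection image formation are simplified stand-ins for illustration; a real implementation would resample the grid with torch.nn.functional.grid_sample and use one of the layers discussed next:

```python
import numpy as np

def rotate(volume, k):
    # view-dependent linear transform: here only axis-aligned 90-degree
    # rotations; a full implementation would resample trilinearly
    return np.rot90(volume, k=k, axes=(0, 2))

def image_formation(volume):
    # view-independent placeholder: maximum density along the depth axis
    return volume.max(axis=0)

def project(volume, k):
    # the full projection: rotate into camera space, then form the image
    return image_formation(rotate(volume, k))

vol = np.zeros((4, 4, 4))
vol[1, 2, 3] = 1.0       # a single occupied voxel
img = project(vol, k=1)  # a 4x4 image where one pixel is covered
```

Because both steps are simple array operations, gradients from a 2D loss flow back into every voxel.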
Visual hull (VH)
The visual hull [18] is the simplest variant. It converts binary density voxels into binary opacity images, so both carry a single channel. A voxel value of 0 means empty space and a value of 1 means occupied, i.e., d_i \in \{0, 1\}. The output again is a binary value indicating whether any voxel blocked the ray. It simply states

f(\mathbf{d}) = 1 - \prod_i (1 - d_i)

Note that the product operator is both backpropagatable and efficiently computable on a GPU using a parallel scan. We can apply this to learning 3D structure from binary 2D data such as segmented 2D images.
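A numpy sketch of this visual-hull image formation, applied to a few toy rays (the product over the depth axis plays the role of the parallel scan):

```python
import numpy as np

def visual_hull(d):
    # 1 - prod_i (1 - d_i) along the ray: 1 iff any voxel blocks the ray
    return 1.0 - np.prod(1.0 - d, axis=0)

# depth axis first: 3 voxels along each of 3 rays (columns)
rays = np.array([[0, 1, 0],
                 [0, 1, 1],
                 [0, 0, 1]], dtype=float)
pixels = visual_hull(rays)  # one binary opacity value per ray
```

The first ray hits no occupied voxel and yields 0; the other two are blocked and yield 1.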
Absorption-only (AO)
The absorption-only model is the gradual variant of the visual hull, allowing for “softer” attenuation along rays. The same definition of f can be applied, but if the densities are fractional, i.e., d_i \in [0, 1], the result is similar to an X-ray image. This image formation allows learning from X-rays or other transparent 2D images. Typically, both are monochromatic, but a colored variant (e.g., X-rays at different wavelengths, or RGB images of colored transparent objects) would technically be possible.
Emission-absorption (EA)
Emission-absorption allows the voxels not only to absorb light traveling towards the observer but also to emit new light at any position. This interplay of emission and absorption can model occlusion, which, as we will see, is useful for making 3D sense of a 2D world. Fig. 4 uses emission-absorption with high absorption, effectively realizing an opaque surface with visibility.
A typical choice is to have the absorption monochromatic and the emission chromatic, so each voxel carries four channels (three for emission, one for absorption). Consequently, an output image has three color channels.
The complete emission-absorption equation, with voxels ordered front to back along the ray, is

f(\mathbf{d}, \mathbf{e}) = \sum_i e_i\, d_i \prod_{j<i} (1 - d_j)

While such equations are typically solved using ray-marching [6], they can be rewritten to become backpropagatable in practice. First, we note that the transmission up to voxel i is the product of one minus the density of all voxels before it. Such a cumulative product can, similar to a cumulative sum, be backpropagated and computed efficiently using parallel scans, e.g., using torch.cumprod. A numerical alternative, which performed similarly in our experiments, is to work in the log domain and use torch.cumsum.
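Both formulations can be sketched in numpy (torch.cumprod / torch.cumsum make the same computation differentiable on the GPU); the specific densities and the single emission channel are toy values:

```python
import numpy as np

def ea_cumprod(d, e):
    # transmission before voxel i: product of (1 - d_j) for all j < i
    T = np.cumprod(np.concatenate(([1.0], 1.0 - d[:-1])))
    # emission weighted by density and transmission, summed along the ray
    return (e * d * T).sum()

def ea_logdomain(d, e, eps=1e-8):
    # numerically equivalent variant: cumulative product as exp(cumsum(log))
    logT = np.concatenate(([0.0], np.cumsum(np.log(1.0 - d[:-1] + eps))))
    return (e * d * np.exp(logT)).sum()

d = np.array([0.2, 0.5, 0.9])   # densities, front to back
e = np.array([1.0, 0.5, 0.25])  # per-voxel emission (one channel)
```

For these values both variants yield the same pixel up to the log-domain epsilon.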
5 Evaluation
Our evaluation comprises a quantitative (Sec. 5.3) and a qualitative analysis (Sec. 5.4) that compares different techniques (Sec. 5.2).
5.1 Data sets
We train on several synthetic and real data sets.
Synthetic
Our synthetic data set comprises 2D images of different modalities (opacity, RGB) rendered using ShapeNet [4]. We chose the following classes: airplane, car, chair, rifle, and lamp. Each sample is rendered from random views (50 per object), with random natural illumination, using the three image formation methods we suggest (VH, AO, EA), producing three 2D images. No 3D information is used as training data in our approach. We use volumetric projective texturing to propagate the appearance information from the thin 3D surface crust defined by ShapeNet's textures into the 3D voxelization.
Furthermore, we also train on a synthetic X-ray data set consisting of mammalian skulls [13]. We used a subset (monkey skulls only) of that data set.
Real
We use two rare classes: chanterelle (60 images) and tree (37 images; not strictly rare, but difficult to 3D-model). These images are RGBA, masked, on a white background.
5.2 Techniques
We compare different alternative methods (Fig. 3) against ours. A key property of our approach is that it does not require access to the 3D volume. We investigate three variants, two of which require more supervision than our method.
Direct
First, we train a Direct method that avoids reconstructing a 3D volume altogether and directly produces a 2D image from the input image using a simple encoder-decoder network (Fig. 3, a). This does not produce a 3D representation and does not allow for applications such as use in a modeling package or 3D printing, but it indicates how our effort to construct a non-deep 3D representation can pay off, even if the goal is just to produce a novel-view 2D image.
Indirect
Second, we investigate an Indirect [34] method that does not have access to the 3D volume, but instead to multiple 2D images of that volume in a known spatial relation (Fig. 3, b). Note that this is a stronger requirement than in our approach, which does not require any structure in the adversarial examples: geometry, view, and light all change, while in this method only the view changes, in a prescribed way.
3DGAN
Third, we compare to a 3DGAN [32] that has full access to the 3D volumes, i.e., is trained with 3D supervision (Fig. 3, c).
Ours
We investigate variants of our approach, mainly in the way the novel views are generated. We explore two axes: the number of views, and the update protocol that either updates the views after a fixed number of iterations or applies the umbrella strategy.
Please also note that, for all but our rare classes (which only we can process), our approach can only ever hope to perform as well as those methods on our data. In other words, the chosen variants are upper bounds to our method, as they require access to more structured data.
5.3 Quantitative
Methods
We first report MSE on 3D voxels. Regrettably, this does not capture well the effect on 2D images. Thus, we render both the reconstructed and the reference volume from the same 5 random views and compare their images using SSIM. For this re-rendering, we further employ four different rendering methods: the original image formation (IF), volume rendering (VOL), iso-surface rendering (ISO), and voxel rendering (VOX), all under random natural illumination.
Results
Tbl. 1 summarizes our main results for the prototypical class airplane. We see that, overall, our 2D-supervised method produces 2D SSIM and 3D MSE competitive with the 3D-supervised methods.
In terms of 3D MSE, not surprisingly, the 3DGAN wins, as it has this measure in its loss. Our method performs slightly better than half as well. Note that 3D MSE is receptive to small alignment errors that our critic does not, and also should not, consider. The Direct rows do not state 3D MSE, as that method produces 2D images only.
We see that our approach is close to the SSIM of the 3D-supervised methods, for the case of voxel rendering even surpassing them. The Direct method can only be applied to the IF visualization, since it does not produce 3D volumes that could be re-rendered. We see that direct regression always performs worse than ours in SSIM; typically, its results are blurry, while ours are sharp. Overall, the 3DGAN performs better than the Indirect method, but their differences are often of similar magnitude to ours from both of them.
Concerning the image formation models, we see that the absolute SSIM values are best for AO, which is expected: VH asks for scalar density but provides only a binary image; AO provides internal structures but only needs to produce scalar density; EA is hardest, as it needs to resolve both density and color. Nonetheless, the differences between us and the competitors are similar across the image formation models.
Reconstruction Task (rows grouped by image formation)

                3D MSE  2D SSIM
                        IF     VOL    ISO    VOX
VH  Our         .158    .872   .932   .920   .926
    Direct      —       .560   —      —      —
    Indirect    .130    .880   .938   .928   .933
    3DGAN       .111    .833   .924   .921   .926
AO  Our         .113    .950   .935   .927   .932
    Direct      —       .641   —      —      —
    Indirect    .115    .945   .932   .921   .927
    3DGAN       .108    .934   .922   .919   .923
EA  Our         .220    .837   .837   .757   .766
    Direct      —       .701   .701   —      —
    Indirect    .109    .922   .922   .902   .907
    3DGAN       .107    .920   .920   .883   .893
In Tbl. 2 we look into the performance across different classes. We see that our method produces acceptable SSIM across the board (compare to the 3DGAN / Indirect competitor SSIMs for the airplane class in Tbl. 1). rifle seems to work best: the approach quickly learns from 2D that a gun has an outer 3D shape that is a revolute structure. chair remains a difficult class, likely due to its high intra-class variation.
        VH                  AO                  EA
        VOL   ISO   VOX    VOL   ISO   VOX    VOL   ISO   VOX
plane   .932  .920  .926   .935  .927  .932   .837  .757  .766
rifle   .946  .941  .945   .949  .944  .947   .899  .777  .803
chair   .860  .848  .854   .862  .850  .857   .802  .611  .632
car     .841  .846  .851   .844  .846  .850   .800  .731  .743
lamp    .920  .915  .920   .926  .914  .920   .883  .790  .803
5.4 Qualitative
Fig. 5 shows typical results for the reconstruction task. We see that our reconstruction can produce an airplane 3D model representative of the input 2D image. Most importantly, these 3D models look plausible from multiple views, not only from the input one, as seen in the second view column. We also see that the model captures the relevant variation, ranging from jet fighters over smaller sports aircraft to large transport types. chair is a particularly difficult class, but our results capture most variation in legs and thickness; small errors are present, mostly in the form of sporadic activations in free space. For rifle, the results turn out almost perfect, in agreement with the numbers reported before. In summary, we think the quality is comparable to that of supervised GANs, but without 3D supervision.
6 Discussion
Which and how many views are needed?
The key to PlatonicGAN is the 3D-to-2D projection, but multiple strategies are conceivable to perform it. Two parameters control its effect: the number of projections, and the protocol for their update, which either updates the views after a fixed number of iterations or follows the umbrella strategy. Tbl. 3 shows the error depending on the viewing protocol. We see that the optimal combination is a single view using the umbrella strategy.
Protocol   VOL   ISO   VOX    VOL   ISO   VOX    VOL   ISO   VOX
           .930  .915  .922   .929  .918  .924   .931  .916  .923
           .934  .927  .931   .932  .921  .927   .932  .921  .927
           .933  .925  .930   .930  .918  .924   .932  .922  .928
Umbrella   .935  .927  .932   .932  .921  .927   .932  .918  .926
Why not have a multi-view discriminator?
It is tempting to suggest a discriminator that does not only look at a single image, but at multiple views at the same time, to judge whether the generator result is holistically plausible. While we can generate “fake” images from multiple views, this is not possible for the “real” natural images, which do not come in such a form. As a key advantage, ours only expects unstructured data: online repositories hold images with unknown camera, 3D geometry, and illumination.
Rare classes
Supplemental
Our supplemental materials show novel-view videos and more analysis; both data and network definitions will be made publicly available upon publication.
7 Conclusion
In this paper, we have presented PlatonicGAN, a new approach to learn 3D shape from unstructured collections of 2D images. The key to our “escape plan” is to train a 3D generator outside the cave that fools a discriminator seeing only projections inside the cave.
We have shown a family of rendering operators that can be GPU-efficiently backpropagated and that account for occlusion and color. These support a range of input modalities, from binary masks over opacity maps to RGB images with transparency. Our 3D reconstruction application is built on top of this idea to capture varied and detailed 3D shapes, including color, from 2D images. Training is performed exclusively on 2D images, enabling massive 2D image collections to contribute to generating 3D shapes.
Future work could include shading, which relates to gradients of density [6] in classic volume rendering. Furthermore, any sort of backpropagatable rendering operator could be added; devising such operators is a key future challenge. Other adversarial applications such as pure generation or filtering of 3D shapes also seem worth exploring.
While we combine 2D observations with 3D interpretations, similar relations might exist in higher dimensions, between 3D observations and 4D (3D shapes in motion), but also in lower dimensions, such as for 1D line scanners in robotics or 2D slices of 3D data as in tomography.
References
 [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. 2018.
 [2] J. Carreira, S. Vicente, L. Agapito, and J. Batista. Lifting object detection datasets into 3d. IEEE PAMI, 38(7):1342–55, 2016.
 [3] T. J. Cashman and A. W. Fitzgibbon. What shape are dolphins? building 3D morphable models from 2D images. PAMI, 35(1):232–44, 2013.
 [4] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An informationrich 3D model repository. arXiv:1512.03012, 2015.
 [5] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, pages 628–44, 2016.
 [6] R. A. Drebin, L. Carpenter, and P. Hanrahan. Volume rendering. In Siggraph Computer Graphics, volume 22, pages 65–74, 1988.
 [7] S. A. Eslami, D. J. Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204–10, 2018.
 [8] H. Fan, H. Su, and L. Guibas. A point set generation network for 3D object reconstruction from a single image. arXiv:1612.00603, 2016.
 [9] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, volume 2, page 6, 2017.
 [10] R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In ECCV, pages 484–99, 2016.
 [11] C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, pages 6602–6611, 2017.
 [12] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–80, 2014.
 [13] P. Henzler, V. Rasche, T. Ropinski, and T. Ritschel. Single-image tomography: 3D volumes from 2D cranial X-rays. Computer Graphics Forum (Proc. Eurographics 2018), 2018.
 [14] A. Kar, C. Häne, and J. Malik. Learning a multiview stereo machine. In NIPS, pages 365–376, 2017.
 [15] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv:1710.10196, 2017.
 [16] H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. In CVPR, pages 3907–16, 2018.
 [17] M. Kazhdan and H. Hoppe. Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG), 32(3):29, 2013.
 [18] A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE PAMI, 16(2):150–62, 1994.
 [19] M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In ECCV, volume 8695, pages 154–69, 2014.
 [20] L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In ICML, pages 3478–87, 2018.
 [21] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view CNNs for object classification on 3D data. In CVPR, pages 5648–5656, 2016.
 [22] D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3D structure from images. In NIPS, pages 4996–5004, 2016.
 [23] G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In CVPR, 2017.
 [24] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al. Learning representations by backpropagating errors. Cognitive modeling, 5(3):1, 1988.
 [25] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. arXiv:1703.09438, 2017.
 [26] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
 [27] J. Varley, C. DeChant, A. Richardson, J. Ruales, and P. Allen. Shape completion enabled robotic grasping. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 2442–2447. IEEE, 2017.
 [28] H. Wang, J. Yang, W. Liang, and X. Tong. Deep single-view 3D object reconstruction with visual hull embedding. arXiv:1809.03451, 2018.
 [29] W. Wang, Q. Huang, S. You, C. Yang, and U. Neumann. Shape inpainting using 3d generative adversarial network and recurrent convolutional networks. arXiv:1711.06375, 2017.
 [30] E. H. Warmington, P. G. Rouse, and W. Rouse. Great dialogues of Plato. New American Library, 1956.
 [31] J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In NIPS, pages 540–550, 2017.
 [32] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS, pages 82–90, 2016.
 [33] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D Shapenets: A deep representation for volumetric shapes. In CVPR, pages 1912–20, 2015.
 [34] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In NIPS, pages 1696–1704, 2016.
 [35] B. Yang, H. Wen, S. Wang, R. Clark, A. Markham, and N. Trigoni. 3d object reconstruction from a single depth view with adversarial learning. arXiv preprint arXiv:1708.07969, 2017.
 [36] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and egomotion from video. In CVPR, 2017.