While methods exist to learn the 3D structure of classes of objects, they typically require 3D data as input. Regrettably, such 3D data is difficult to acquire, in particular for the “long tail” of exotic classes: ShapeNet might have chair, but it does not have chanterelle. Can we enable biologists to benefit from a generative model for such rarely observed classes?
Addressing this problem, we suggest a method to learn 3D structure from 2D images only (Fig. 1). Reasoning about 3D structure from 2D observations without assuming anything about their relation is challenging, as illustrated by Plato's Allegory of the Cave: how can we hope to understand higher dimensions from only seeing projections? If multiple views (maybe only two [36, 11]) of the same object are available, multi-view analysis without 3D supervision has been successful. Regrettably, massive image collections do not come in this form; they are, and will remain, unstructured: they show random instances under random pose and uncalibrated lighting, in unknown relations.
Our first main contribution (Sec. 3) is adversarial training of a 3D generator against a discriminator that operates exclusively on widely available unstructured collections of 2D images, which we call a platonic discriminator. During training, the generator produces a 3D shape that is projected (rendered) to 2D and presented to the 2D discriminator. Making the connection between the 3D generator and the 2D discriminator, our second key contribution, is enabled by a family of rendering layers that can account for occlusion and color (Sec. 4). These layers do not have, and do not need, any learnable parameters and are efficient to back-propagate. From these two key blocks we construct a system that learns the 3D shapes of common classes such as chairs and cars, but also of exotic ones, from unstructured 2D image collections. We demonstrate 3D reconstruction from a single 2D image as a key application (Sec. 5).
2 Related Work
Several papers suggest (adversarial) learning using 3D voxel representations [33, 32, 10, 21, 29, 28, 31, 35, 27, 17] or point cloud input [1, 9]. A simplified design for such a network is seen in Fig. 3, c, adapted to our task: an encoder generates a latent code that is fed into a generator to produce a 3D representation (i.e., a voxel grid). A 3D discriminator then analyzes samples both from the generator and from the ground-truth distribution. Note that this procedure requires 3D supervision, i.e., it is limited by the type and size of the available 3D data sets.
Girdhar et al. work on a joint embedding of 3D voxels and 2D images, but still require 3D voxelizations as input. Fan et al. produce points from 2D images, but similarly with 3D data as training input. Choy et al.'s recursive design takes multiple images as input while also being trained on 3D data. Kar et al. propose a simple “unprojection” network component to establish a relation between 2D pixels and 3D voxels, but without resolving occlusion and again with 3D supervision.
Cashman and Fitzgibbon  have extracted template-based shape spaces from collections of 2D images. Similarly, Carreira et al.  use correspondence to templates across segmentation-labeled 2D image data sets to reconstruct 3D shapes.
Closer to our approach is Rezende et al., who also learn 3D representations from 2D images. However, they make use of a partially differentiable renderer that is limited to surface orientation and shading, while our formulation can resolve both occlusion from the camera and appearance. Also, their representation of the 3D volume is a latent one, that is, it has no immediate physical interpretation, which is required in practice, e.g., for measurements, or to run simulations such as renderings or 3D printing. This choice of a deep representation of the 3D world is shared by Eslami et al. Tulsiani et al. use a special case of our rendering layers in a setting where multiple 2D images, with a known camera transform for each, are available at learning time. We take it a step further and use a GAN to work with unstructured single images at training time; in particular, we do not even know the camera pose relative to the object in the image. Finally, our image formation goes beyond visual hulls, accounting for color and occlusion.
While early suggestions exist for extending differentiable renderers to polygonal meshes, they are limited to deformations of a pre-defined template. We work with voxels, which can express arbitrary topology; e.g., we can generate airplanes with drastically different layouts that are not mere deformations of a base shape. Combining our approach with sparse voxelizations would allow reproducing even finer details.
Similarly, inter-view constraints can be used to learn depth maps [36, 11] using reprojection constraints: if the depth label is correct, reprojecting one image into the other view has to produce the other image. Our method does not learn a single depth map but a full voxel grid, and allows principled handling of occlusions.
A generalization from visual hulls to full 3D scenes is discussed by Yan et al., illustrated in Fig. 3, b: instead of a 3D loss, they employ a simple projection along a major axis, allowing the use of a 2D loss. However, multiple 2D images of the same object are required. In practice this is achieved by rendering the 3D shape into 2D images from multiple views. This makes two assumptions: we have multiple images in a known relation, and a reference appearance (i.e., light, materials) is available. Our approach relaxes both requirements: we use a discriminator that can work on arbitrary projections and arbitrary natural input images, without a known reference.
3 3D Shape From 2D Image Collections
Here, we discuss our PlatonicGAN in more detail, making use of our rendering layers to be introduced in Sec. 4.
Our method is a classic (generative) adversarial design with two main differences: the discriminator operates in a different space (2D) than the generator (3D), and the two are linked by a hard-wired projection operator (rendering layer, Sec. 4).
Let us recall the classic adversarial learning of 3D shapes, which is a min-max game

min_θ max_φ  f_data(φ) + f_gen(θ, φ)

between the data cost f_data and the generator cost f_gen.
The data cost is

f_data(φ) = 𝔼_{x ∼ p(x)} [ log D_φ(x) ],

where D_φ is the discriminator with learned parameters φ, which is presented with samples from the distribution p(x) of real 3D shapes. Here 𝔼_{x ∼ p(x)} denotes the expected value over the distribution p(x).
The generation cost is

f_gen(θ, φ) = 𝔼_{z ∼ p(z)} [ log(1 − D_φ(G_θ(z))) ],

where G_θ is the generator with parameters θ that maps the latent code z to the data domain.
We extend the generator cost to

f_gen(θ, φ) = 𝔼_{z ∼ p(z)} 𝔼_{v ∼ p(v)} [ log(1 − D_φ(P_v(G_θ(z)))) ],

which projects the generator result from 3D to 2D along a sampled view direction v using the projection operator P_v. Different implementations of P_v are discussed in Sec. 4.
While many parametrizations for views are possible, we here choose an orthographic camera that looks at the origin, with upright orientation, from a Euclidean position v. 𝔼_{v ∼ p(v)} is the expected value across the distribution of views p(v).
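The upright, origin-centered view parametrization can be sketched in a few lines. The azimuth/elevation factorization and the function names below are our own illustration, not the paper's exact implementation:

```python
import math
import torch

def rotation_matrix(azimuth: float, elevation: float) -> torch.Tensor:
    """3x3 rotation bringing the volume into camera space for an
    upright orthographic camera looking at the origin."""
    ca, sa = math.cos(azimuth), math.sin(azimuth)
    ce, se = math.cos(elevation), math.sin(elevation)
    # rotate about the up axis (azimuth), then tilt (elevation)
    R_az = torch.tensor([[ca, 0., sa], [0., 1., 0.], [-sa, 0., ca]])
    R_el = torch.tensor([[1., 0., 0.], [0., ce, -se], [0., se, ce]])
    return R_el @ R_az
```

A view sample then simply draws azimuth and elevation from the view distribution p(v), e.g., uniformly.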
Full Platonic GAN
Additionally, we make use of three recent ideas orthogonal to our platonic concept, resulting in

min_{θ,ψ} max_φ  f_data(φ) + f_gen(θ, φ) + f_lat(ψ) + f_rec(θ, ψ),

where the generator input now includes an encoding step, f_lat enforces a normal distribution of latent codes and f_rec encourages the encoded, generated and projected result to be similar to the encoder input. We detail each of these three steps in the following paragraphs:
First, the final generator does not directly work on a latent code z, but allows for an encoder E_ψ with parameters ψ that maps an input 2D image or 3D volume y to a latent code:

z = E_ψ(y).
Using the identity instead of a complex mapping addresses a (random) generation task, while using images or volumes as input is a solution to reconstruction or filtering problems, respectively. Note that the expected value for the generator cost is now taken across 2D images y and not across the latent codes anymore.
Second, a regularization term enforces the latent codes produced by the encoder to follow a normal distribution. Third, we encourage the encoder and generator to reproduce the input if the view is the input view, by convention v₀, so the reconstruction cost is

f_rec(θ, ψ) = 𝔼_y [ ‖ P_{v₀}(G_θ(E_ψ(y))) − y ‖² ].
While this step is optional, as the original view, as well as other views similar to it, is already covered by the generator term, we found it useful to give direct additional weighting to this special view. It adds stability to the optimization, as it is easy to find an initial solution that matches this 2D cost before refining the 3D structure.
Two properties enable optimizing a Platonic GAN: first, maximizing the expected value across the distribution of views p(v), and second, back-propagation through the projection operator P_v.
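These two properties can be made concrete in a single adversarial update: the generator lives in 3D, the discriminator in 2D, and gradients flow back through the fixed rendering layer. The module names, tensor shapes and the non-saturating loss form below are our assumptions, a sketch rather than the paper's exact training code:

```python
import torch

def platonic_update(E, G, D, project, real_images, views, opt_g, opt_d):
    """One adversarial update of a Platonic GAN (sketch).

    E encodes 2D images to latent codes, G generates a 3D volume,
    `project` is the fixed (parameter-free) rendering layer, and the
    2D discriminator D never sees the 3D volume itself.
    """
    volume = G(E(real_images))                               # 3D volume
    fakes = torch.cat([project(volume, v) for v in views])   # 2D renderings

    # discriminator step: real 2D images vs. projected fakes
    d_loss = -(torch.log(D(real_images)).mean()
               + torch.log(1 - D(fakes.detach())).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # generator step: gradients flow back through the rendering layer
    g_loss = -torch.log(D(fakes)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Rendering the same volume under several sampled views amortizes the cost of the generator forward pass, as discussed below.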
We increase variation during training by mini-batch standard deviation.
Different options exist to sample the views in the for loop in Alg. 1, even if the distribution p(v) is uniform. A simple solution is to sample one view, entirely at random, in every classic GAN update step. While this works, it introduces a lot of variance into the estimate of the gradient, which ideally would be taken over all views.
We therefore sample multiple views for every update. This is also economical, as the result of the generator, a 3D volume, is costly to produce, but simple to render multiple times.
A more refined strategy keeps those views fixed across multiple updates. This allows the generator to get a certain set of views right before moving on to new ones.
Finally, we suggest an “umbrella” strategy. The intuition is that generating a 3D volume from one view is simple: just extrude the contour into 3D; this would not look objectionable to any critic with limited view variation. To resolve errors from other views, it might be simplest to start fixing errors from views similar to a view that is already right. Incrementally, views become more varied, like opening an umbrella. This is achieved by choosing
v = u(−1, 1) · min(1, c·t),

where u(a, b) returns a uniform random number from [a, b], c is a constant to control the opening speed and t is the learning iteration.
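The umbrella schedule is simple to implement. The opening-speed constant and the scaling to an azimuth angle below are our assumptions; the essential behavior is that the admissible offset from the initial view widens linearly until the full range is reached:

```python
import math
import random

def umbrella_view(t, c=1e-3):
    """Umbrella view sampling (sketch): sample a view offset whose
    range grows with training iteration t until fully open."""
    opening = min(1.0, c * t)                  # fraction of the full range
    return random.uniform(-1.0, 1.0) * opening * math.pi
```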
While many projection operators between different dimensions could be conceived, we focus on the case of a 3D generator on a regular voxel grid and a 2D discriminator on a regular image in Sec. 4. Methods to scale dense regular grids to sparse ones exist and should be applicable to our approach as well. Consequently, P_v is a mapping from a three-dimensional voxel grid with a given number of channels to a 2D image whose channels depend on the view direction v.
We further factor the projection

P_v = R ∘ T_v

into a linear transformation T_v that depends on the view and an image formation function R that does not, i.e., is view-independent. The transformation T_v is shared by all implementations of the rendering layer, so we will only discuss the key differences of R in the following and assume the rotation to have already been applied. Note that a rotation and a linear resampling are back-propagatable and provided, e.g., as torch.nn.functional.grid_sample. While we work in orthographic space, T_v could also be constructed to be perspective.
4 Rendering Layers
Rendering layers (Fig. 4) map 3D information to 2D images so they can be presented to a discriminator. As explained, we have already rotated the 3D volume (Fig. 4, a) into camera space for view direction v (Fig. 4, b), such that a pixel value p is to be computed from the voxel values d₁, …, d_n along its ray, and only those (Fig. 4, c). Consequently, a rendering layer is described by how it maps a sequence of voxels to a pixel value. Composing the full image just amounts to executing R for every pixel, over all voxels at that pixel.
Note that the rendering layer does not have any learnable parameters. It just serves as an additional constraint that allows the 3D generator to change with respect to the loss observed in 2D. We will now discuss several variants of R, implementing different forms of volume rendering.
Visual hull (VH)
Visual hull is the simplest variant. It converts binary density voxels into binary opacity images. A voxel value of 0 means empty space and a value of 1 means occupied, i.e., d_i ∈ {0, 1}. The output again is a binary value indicating if any voxel blocked the ray. It simply states

p = 1 − ∏_i (1 − d_i).
Note that the product operator is both back-propagatable and efficiently computable on a GPU using a parallel scan. We can apply this to learning 3D structure from binary 2D data such as segmented 2D images.
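In PyTorch, this image formation reduces to a product over the depth axis of the rotated volume; the tensor layout below is our choice:

```python
import torch

def visual_hull(volume):
    """Visual-hull rendering layer: per ray, 1 minus the product of
    (1 - occupancy) along depth. For binary occupancy this is 1 iff
    any voxel on the ray is occupied; it has no learnable parameters.

    volume: (B, 1, D, H, W) occupancy in [0, 1]; depth is the ray axis
    after the view rotation has been applied.
    """
    transparency = (1.0 - volume).prod(dim=2)   # (B, 1, H, W)
    return 1.0 - transparency
```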
The absorption-only model is the gradual variant of the visual hull. This allows for “softer” attenuation of rays. The same definition

p = 1 − ∏_i (1 − d_i)

can be applied, but if the densities d_i are fractional, the result is similar to an x-ray, i.e., p ∈ [0, 1]. This image formation allows learning from x-rays or other transparent 2D images. Typically, both are monochromatic, but a colored variant (e.g., x-rays at different wavelengths or RGB images of colored transparent objects) could technically be realized.
Emission-absorption allows the voxels not only to absorb light coming towards the observer but also to emit new light at any position. This interplay of emission and absorption can model occlusion, which we will see is useful to make 3D sense of a 3D world. Fig. 4 uses emission-absorption with high absorption, effectively realizing an opaque surface with visibility.
A typical choice is to have the absorption monochromatic and the emission chromatic, so the voxels carry four channels (one density d_i and three emission colors e_i). Consequently, an output image has three color channels.
The complete emission-absorption equation is

p = ∑_i e_i · d_i · ∏_{j<i} (1 − d_j).

While such equations are typically solved using ray-marching, they can be rewritten to become back-propagatable in practice: first, we note that the transmittance at voxel i is a product of one minus the density of all voxels before i. Such a cumulative product can, similar to a sum, be back-propagated and computed efficiently using parallel scans as well, e.g., using torch.cumprod. A numerical alternative, which performed similarly in our experiments, is to work in the log domain and use torch.cumsum.
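The cumulative-product formulation maps directly to torch.cumprod; the following is a minimal sketch where the tensor layout and the front-to-back depth convention are our choices:

```python
import torch

def emission_absorption(density, emission):
    """Emission-absorption rendering layer (sketch).

    density:  (B, 1, D, H, W) per-voxel absorption in [0, 1]
    emission: (B, 3, D, H, W) per-voxel RGB emission
    Depth index 0 is assumed closest to the camera.

    Transmittance up to voxel i is the cumulative product of (1 - density)
    of all voxels in front of i (torch.cumprod); each voxel contributes
    its emission weighted by its density and that transmittance.
    """
    # transmittance *before* each voxel: shift the cumulative product by one
    trans = torch.cumprod(1.0 - density, dim=2)
    trans = torch.cat([torch.ones_like(trans[:, :, :1]), trans[:, :, :-1]], dim=2)
    weights = trans * density                  # (B, 1, D, H, W)
    return (weights * emission).sum(dim=2)     # (B, 3, H, W)
```

The log-domain alternative would replace the cumulative product by torch.cumsum over log(1 − density), at the cost of handling density 1 specially.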
5.1 Data sets
We train on several synthetic and real data sets.
Our synthetic data set comprises 2D images of different modalities (opacity, RGB) that were rendered using ShapeNet. We chose the following classes: airplane, car, chair, rifle and lamp. Each sample is rendered from a random view (50 per object), with random natural illumination, using the three image formation methods (VH, AO, EA) we suggest, producing three 2D images. No 3D information is used as training data in our approach. We use volumetric projective texturing to propagate the appearance information from the thin 3D surface crust, as defined by ShapeNet's textures, into the 3D voxelization.
Furthermore, we also train on a synthetic x-ray dataset that consists of mammalian skulls. We used a subset (monkey skulls only) of that data set.
We use two rare classes: chanterelle (60 images) and tree (37 images; not strictly rare, but difficult to 3D-model). These images are RGBA, masked, on a white background.
We compare different alternative methods (Fig. 3) against our method. A key property of our approach is not to require access to the 3D volume. We investigate three variants, of which two require more supervision than our method.
First, we train a Direct method that avoids reconstructing a 3D volume altogether and directly produces a 2D image from the input image using a simple encoder-decoder network (Fig. 3, a). While this does not produce a 3D representation, and hence does not allow applications such as use in a modeling package or 3D printing, it indicates how our effort to construct a non-deep 3D representation can pay off, even if it is just to produce a novel-view 2D image.
Second, we investigate an Indirect method that does not have access to the 3D volume, but instead to multiple 2D images of that 3D volume in a known spatial relation (Fig. 3, b). Note that this is a stronger requirement than ours: our approach does not require any structure in the adversarial examples (geometry, view and light all change), while in this method only the view changes, in a prescribed way.
We investigate variants of our approach mainly in the way the novel views are generated. We explore two axes: the number of views and the update protocol, which either updates the views at fixed intervals or applies the umbrella strategy.
Please also note that for all but our rare classes (which only we can process), our approach can only ever hope to perform as well as those methods on our data. In other words, the chosen variants are upper bounds to our method, as they require access to more structured data.
We first report RMSE on 3D voxels. Regrettably, this does not capture well the effect on 2D images. Thus, we render both the reconstructed and the reference volume from the same 5 random views and compare their images using SSIM. For this re-rendering, we further employ four different rendering methods: the original image formation (IF), volume rendering (VOL), iso-surface rendering with a fixed iso-value (ISO) and a voxel rendering (VOX), all under random natural illumination.
Tbl. 1 summarizes our main results for the prototypical class airplane. We see that, overall, our 2D-supervised method produces 2D SSIM and 3D MSE competitive with the 3D-supervised methods.
In terms of 3D MSE, not surprisingly, the 3DGAN wins, as it has this measure in its loss. Our method performs slightly better than half as well. Note that 3D MSE is sensitive to small alignment errors that our critic does not, and also should not, consider. The Direct rows do not state 3D MSE, as that method produces 2D images only.
We see that our approach is close to the SSIM of the 3D-supervised methods, for the case of voxel rendering even surpassing them. The Direct method can only be applied to the IF visualization, since it does not produce 3D volumes that could be re-rendered. We see that direct regression always performs worse than ours in SSIM; typically, its results are blurry, while ours are sharp. Overall, the 3DGAN performs better than the Indirect method, but often their mutual difference is of similar magnitude as the difference of ours to either of them.
Concerning the image formation models, we see that the absolute SSIM values are best for AO, which is expected: VH asks for scalar density but has only a binary image; AO provides internal structures but only needs to produce scalar density; EA is hardest, as it needs to resolve both density and color. Nonetheless, the differences between us and the competitors are similar across the image formation models.
In Tbl. 2 we look into the performance across different classes. We see that our method produces acceptable SSIM across the board (compare to the 3DGAN / Indirect competitor SSIMs for the airplane class in Tbl. 1). rifle seems to work best: the approach quickly learns from 2D that a gun has an outer 3D shape with a revolute structure. chair remains a difficult class, likely due to its high intra-class variation.
Fig. 5 shows typical results for the reconstruction task. We see that our reconstruction can produce an airplane 3D model representative of the input 2D image. Most importantly, these 3D models look plausible from multiple views, not only from the input one, as seen from the second view column. We also see that the model captures the relevant variation, ranging from jet fighters over smaller sports aircraft to large transport types. chair is a particularly difficult class, but our results capture most variation in legs or thickness; small errors are present, mostly in the form of sporadic activations in free space. For rifle, the results turn out almost perfect, in agreement with the numbers reported before. In summary, we think the quality is comparable to supervised GANs, but without 3D supervision.
Which and how many views are needed?
The key to PlatonicGAN is 3D-to-2D projection, but multiple strategies to perform it are conceivable. Two parameters control its effect: the number of projections and the protocol for their update, which either updates at fixed intervals or follows the umbrella strategy. Tbl. 3 shows the error depending on different viewing protocols. We see that the optimal combination is a single view using the umbrella strategy.
Why not having a multi-view discriminator?
It is tempting to suggest a discriminator that does not only look at a single image, but at multiple views at the same time, to judge whether the generator result is plausible holistically. But while we can generate “fake” images from multiple views, this is not possible for the “real” natural images, which do not come in such a form. As a key advantage, ours only expects unstructured data: online repositories hold images with unknown camera, 3D geometry or illumination.
Results for rare classes are seen in Fig. 1 and Fig. 6. We see that our method produces plausible details from multiple views while respecting the input image, even in this difficult case. No metric can be applied to these data, as no 3D volume is available to compare in 3D or re-project.
Our supplemental materials show novel-view videos and more analysis; both data and network definitions will be made publicly available upon publication.
In this paper, we have presented PlatonicGAN, a new approach to learn 3D shape from unstructured collections of 2D images. The key to our “escape plan” is to train a 3D generator outside the cave that will fool a discriminator seeing projections inside the cave.
We have shown a family of rendering operators that can be GPU-efficiently back-propagated and account for occlusion and color. These support a range of input modalities, ranging from binary masks over opacity maps to RGB images with transparency. Our 3D reconstruction application is built on top of this idea to capture varied and detailed 3D shapes, including color, from 2D images. Training is exclusively performed on 2D images, enabling massive 2D image collections to contribute to generating 3D shapes.
Future work could include shading, which is related to gradients of density in classic volume rendering. Furthermore, any sort of back-propagatable rendering operator can be added; devising such operators is a key future challenge. Other adversarial applications, such as pure generation or filtering of 3D shapes, seem worth exploring.
While we combine 2D observations with 3D interpretations, similar relations might exist in higher dimensions, between 3D observations and 4D (3D shapes in motion), but also in lower dimensions, such as for 1D line scanners in robotics or 2D slices of 3D data as in tomography.
-  P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas. Learning representations and generative models for 3d point clouds. 2018.
-  J. Carreira, S. Vicente, L. Agapito, and J. Batista. Lifting object detection datasets into 3d. IEEE PAMI, 38(7):1342–55, 2016.
-  T. J. Cashman and A. W. Fitzgibbon. What shape are dolphins? building 3D morphable models from 2D images. PAMI, 35(1):232–44, 2013.
-  A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An information-rich 3D model repository. arXiv:1512.03012, 2015.
-  C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In ECCV, pages 628–44, 2016.
-  R. A. Drebin, L. Carpenter, and P. Hanrahan. Volume rendering. In Siggraph Computer Graphics, volume 22, pages 65–74, 1988.
-  S. A. Eslami, D. J. Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204–10, 2018.
-  H. Fan, H. Su, and L. Guibas. A point set generation network for 3D object reconstruction from a single image. arXiv:1612.00603, 2016.
-  H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d object reconstruction from a single image. In CVPR, volume 2, page 6, 2017.
-  R. Girdhar, D. F. Fouhey, M. Rodriguez, and A. Gupta. Learning a predictable and generative vector representation for objects. In ECCV, pages 484–99, 2016.
-  C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, pages 6602–6611, 2017.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–80, 2014.
-  P. Henzler, V. Rasche, T. Ropinski, and T. Ritschel. Single-Image Tomography: 3D Volumes from 2D Cranial X-Rays. Computer Graphics Forum (Proceedings of Eurographics 2018), 2018.
-  A. Kar, C. Häne, and J. Malik. Learning a multi-view stereo machine. In NIPS, pages 365–376, 2017.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv:1710.10196, 2017.
-  H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. In CVPR, pages 3907–16, 2018.
-  M. Kazhdan and H. Hoppe. Screened poisson surface reconstruction. ACM Transactions on Graphics (ToG), 32(3):29, 2013.
-  A. Laurentini. The visual hull concept for silhouette-based image understanding. IEEE PAMI, 16(2):150–62, 1994.
-  M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In ECCV, volume 8695, pages 154–69, 2014.
-  L. Mescheder, A. Geiger, and S. Nowozin. Which training methods for GANs do actually converge? In ICML, pages 3478–87, 2018.
-  C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. J. Guibas. Volumetric and multi-view cnns for object classification on 3D data. In CVPR, pages 5648–5656, 2016.
-  D. J. Rezende, S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3D structure from images. In NIPS, pages 4996–5004, 2016.
-  G. Riegler, A. O. Ulusoy, and A. Geiger. Octnet: Learning deep 3d representations at high resolutions. In CVPR, 2017.
-  D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al. Learning representations by back-propagating errors. Cognitive modeling, 5(3):1, 1988.
-  M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. arXiv:1703.09438, 2017.
-  S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
-  J. Varley, C. DeChant, A. Richardson, J. Ruales, and P. Allen. Shape completion enabled robotic grasping. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 2442–2447. IEEE, 2017.
-  H. Wang, J. Yang, W. Liang, and X. Tong. Deep single-view 3D object reconstruction with visual hull embedding. arXiv:1809.03451, 2018.
-  W. Wang, Q. Huang, S. You, C. Yang, and U. Neumann. Shape inpainting using 3d generative adversarial network and recurrent convolutional networks. arXiv:1711.06375, 2017.
-  E. H. Warmington, P. G. Rouse, and W. Rouse. Great dialogues of Plato. New American Library, 1956.
-  J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In NIPS, pages 540–550, 2017.
-  J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In NIPS, pages 82–90, 2016.
-  Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D Shapenets: A deep representation for volumetric shapes. In CVPR, pages 1912–20, 2015.
-  X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee. Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In NIPS, pages 1696–1704, 2016.
-  B. Yang, H. Wen, S. Wang, R. Clark, A. Markham, and N. Trigoni. 3d object reconstruction from a single depth view with adversarial learning. arXiv preprint arXiv:1708.07969, 2017.
-  T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.