1 Introduction
Given enough labelled data, deep neural networks can learn tasks such as object recognition and monocular depth estimation. However, doing so from unlabelled data is much more difficult. In fact, it is often unclear what can be learned when labels are missing. In this paper, we consider the problem of learning the 3D shape of object categories from raw images and seek to solve it by making minimal assumptions on the data. In particular, we do not wish to use any external image annotation.
Given multiple views of the same 3D object, techniques such as structure-from-motion (SFM) can be used to reconstruct the object’s 3D shape Faugeras and Luong (2001). In fact, visual geometry suggests that multiple views are not only sufficient, but also necessary for reconstruction. However, this requirement can be relaxed if one has prior information on the possible shapes of the objects. By learning such a prior, authors have demonstrated that monocular 3D reconstruction is in fact possible and practical Zhou et al. (2017); Ummenhofer et al. (2017). However, this does not clarify how the necessary prior can be acquired in the first place. Recent methods such as SFMLearner Zhou et al. (2017) have shown that the prior too can be extracted from a collection of unlabelled images with no externally-provided 3D information. However, these methods require multiple views of the same object, just as SFM does.
In this paper, we wish to relax these conditions even further and reconstruct objects from an unconstrained collection of images. By “unconstrained”, we mean that images are i.i.d. samples from a distribution of views of different object instances, such as a gallery of human faces. In particular, each image may contain a different object identity, preventing a direct application of the geometric principles leveraged by SFM and SFMLearner. Even so, we argue that these principles are still relevant and applicable, albeit in a statistical sense.
The recent work of Moniz et al. (2018); Kanazawa et al. (2018), and more generally non-rigid SFM approaches, have demonstrated that 3D reconstruction from unconstrained images is possible provided that at least 2D object keypoint annotations are available. However, the knowledge of keypoints captures a significant amount of information, as shown by the fact that learning a keypoint detector in an unsupervised manner is a very difficult problem in its own right Thewlis et al. (2017a). Ultimately, 3D reconstruction of deformable objects without keypoint annotations and from unconstrained images remains an open challenge.
In this work, we suggest that the constraints that are explicitly provided by 2D or 3D data annotations can be replaced by weaker constraints on the statistics of the data that require no data annotation of any kind. We capture such constraints as follows (section 3). First, we observe that understanding 3D geometry can explain much of the variation in a dataset of images. For example, the depth map extracted from an image of a human face depends on the face shape as well as the viewpoint; if the viewpoint is registered, then the variability is significantly reduced. We can exploit this fact by seeking a generative model that explains the data as a combination of three partially independent factors: viewpoint, geometry and texture. We pair the generative model with an encoder that can explain any given image of the object as a combination of these factors. The resulting autoencoder maps the data into a structured space. The structure is specified by how factors are combined: shape and viewpoint induce, based on the camera geometry, a 2D deformation of the canonical texture, which in turn matches the observed data.
This process can also be seen as extracting 3D information from 2D image deformations. However, deformations arising from small depth variations, which are characteristic of objects such as faces, are subtle. Illumination provides a complementary cue for high-frequency but shallow depth variations. Using a generative model allows us to integrate this cue in a simple manner: we estimate an additional factor, representing the main lighting direction, and combine the latter with the estimated depth and texture to infer the object shading. Finally, we propose to exploit the fact that many object classes of interest are symmetric, in most cases bilaterally. We show how this constraint can be leveraged by enforcing that, in the canonical modelling space, appearance and geometry are mirror-symmetric.
We show empirically (section 4) that these cues, when combined, lead to a good monocular reconstruction of 3D object categories such as human and animal faces and objects such as cars. Quantitatively, we test our model on a large dataset of synthetic faces generated using computer graphics, so that accurate ground-truth information is available for assessment. We also consider a benchmark dataset of real human faces for which 3D annotations are available and outperform a recent state-of-the-art method that uses keypoint supervision, while our method uses no supervision at all. We also test the components of our method via an ablation study and show the different benefits they bring to the quality of the reconstruction.
2 Related Work
The literature on estimating 3D structures from (collections of) images is vast. Here, we restrict the overview mostly to learning-based methods. Traditionally, 3D reconstruction can be achieved by means of multiple view geometry Hartley and Zisserman (2003), where correspondences over multiple views can be used to reconstruct the shape and the camera parameters. Another important cue to recover a surface from an image is the shading Woodham (1980). In fact, our model also uses the link between shape and shading to recover the geometry of objects. When multiple distinct views are not available, information can be gained from various other sources.
Zhou et al. (2017); Wang et al. (2018); Novotny et al. (2017); Agrawal et al. (2015) learn from videos, while Godard et al. (2017); Luo et al. (2018) train using stereo image pairs. Recently, methods have emerged that learn instance geometry not from multiple views but from collections of instance images. DAE Shu et al. (2018) learns to predict a deformation field by heavily constraining an autoencoder with a small bottleneck embedding. In a recent follow-up work, Sahasrabudhe et al. (2019) learn to disentangle the 3D mesh and the viewpoint from the deformation field. Similarly, SfSNet Sengupta et al. (2018) is partially supervised by synthetic ground-truth data and Kanazawa et al. (2018) needs foreground segmentation and 2D keypoints to learn a parametric 3D model. GANs have been proposed to learn a generative model for 3D shapes by discriminating the reprojection into images Kato and Harada (2019); Henzler et al. (2018); Szabó and Favaro (2018); Nguyen-Phuoc et al. (2019); Zhu et al. (2018). Shape models have been used as a form of supervised constraints to learn a mapping between images and shape parameters Wang et al. (2017); Gecer et al. (2019). Thewlis et al. (2018, 2017b) demonstrate that the symmetry of objects can be leveraged to learn a canonical representation. Depth can also be learned from keypoints Moniz et al. (2018); Kudo et al. (2018); Chen et al. (2019); Suwajanakorn et al. (2018), which serve as a form of correspondence even across instances of different objects.
Since our model generates images from an internal 3D representation, one part of the model is a differentiable renderer. However, with a traditional rendering pipeline, gradients across occlusions and object boundaries are undefined. Several soft relaxations have thus been proposed Kato et al. (2018); Liu et al. (2019); Loper and Black (2014). In this work we use an implementation (https://github.com/daniilidis-group/neural_renderer) of Kato et al. (2018).
3 Method
Our method takes as input an unconstrained collection of images of an object category, such as human faces, and returns as output a model that can explain each image as the combination of a 3D shape, a texture, an illumination and a viewpoint, as illustrated in fig. 1(a).
Formally, an image is a function defined on a lattice , or, equivalently, a tensor in . We assume that the image is roughly centered on an instance of the objects of interest. The goal is to learn a function , implemented as a neural network, that maps the image to four factors consisting of a depth map , an albedo image , a viewpoint and a global light direction .
In order to learn this disentangled representation without supervision, we task the model with the goal of reconstructing the input image from the four factors. The reconstruction is a differentiable operation composed of two steps: lighting and reprojection , as follows:
(1) 
The lighting function generates a version of the face based on the depth map , the light direction and the albedo as seen from a canonical viewpoint . For example, for faces a natural choice for the canonical viewpoint is a frontal view, since this minimizes self occlusions, but we let the network choose one automatically. The viewpoint then represents the transformation between the canonical view and the viewpoint of the actual input image . Then, the reprojection function simulates the effect of a viewpoint change and generates the image given the canonical depth and the shaded canonical image .
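The composition described above can be written compactly. The symbols below are placeholders chosen for this sketch and are assumptions rather than notation confirmed by the text: $\Phi$ for the encoder, $\Lambda$ for lighting, $\Pi$ for reprojection, $d, a, w, l$ for depth, albedo, viewpoint and light direction, $J$ for the shaded canonical image, and $I$, $\hat{I}$ for the input image and its reconstruction.

```latex
(d, a, w, l) = \Phi(I), \qquad
J = \Lambda(a, d, l), \qquad
\hat{I} = \Pi(J, d, w).
```

Read left to right: the encoder predicts the four factors, lighting shades the albedo in the canonical view, and reprojection moves the shaded canonical image to the viewpoint of the input.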
Next, we discuss the functions and in model (1) in detail.
Reprojection function and camera model.
The image is formed by a camera sensor looking at a 3D object. If we denote with a 3D point expressed in the reference frame of the camera, this is mapped to pixel by the following projection equation:
(2) 
This model assumes a perspective camera with field of view (FOV) in the horizontal direction. Given that the images are cropped around a particular object, we assume a relatively narrow FOV of . The object is assumed to be approximately at a distance of from the camera.
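As a rough illustration of this camera model, the sketch below builds the intrinsics of a pinhole camera from a horizontal FOV and projects a 3D point to a pixel. The function names, the principal point at the image centre, square pixels and the example values (a 128-pixel-wide image and a 20 degree FOV) are assumptions made for this sketch, not values taken from the text.

```python
import numpy as np


def intrinsics(width, height, fov_deg):
    """Pinhole intrinsics with the principal point at the image centre.

    fov_deg is the horizontal field of view in degrees (assumed convention).
    """
    f = (width - 1) / (2.0 * np.tan(np.deg2rad(fov_deg) / 2.0))
    cu, cv = (width - 1) / 2.0, (height - 1) / 2.0
    return np.array([[f, 0.0, cu],
                     [0.0, f, cv],
                     [0.0, 0.0, 1.0]])


def project(K, P):
    """Project a 3D point P (camera frame) to pixel coordinates (u, v)."""
    p = K @ P
    return p[:2] / p[2]


# Illustrative values only.
K = intrinsics(128, 96, 20.0)
centre = project(K, np.array([0.0, 0.0, 2.0]))  # a point on the optical axis
```

With this convention, a point on the optical axis lands on the image centre and a ray at half the FOV lands on the last pixel column.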
The depth map associates to each pixel a depth value in the canonical view. Inverting the camera model (2), this corresponds to the 3D point
The viewpoint represents a Euclidean transformation such that (this is the exponential map of ) and .
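To make the exponential map concrete, here is a minimal sketch that turns a viewpoint vector into a rotation matrix and a translation using Rodrigues' formula. It assumes the viewpoint is a 6-vector whose first three entries are an axis-angle rotation and whose last three are a translation; this layout and the function names are assumptions for illustration.

```python
import numpy as np


def so3_exp(omega):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    omega = np.asarray(omega, dtype=float)
    theta = np.linalg.norm(omega)
    if theta < 1e-12:
        return np.eye(3)  # zero rotation
    k = omega / theta
    # Skew-symmetric (cross-product) matrix of the unit axis k.
    Kx = np.array([[0.0, -k[2], k[1]],
                   [k[2], 0.0, -k[0]],
                   [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * Kx + (1.0 - np.cos(theta)) * (Kx @ Kx)


def viewpoint_to_rt(w):
    """Split a 6-vector into a rotation matrix and a translation (assumed layout)."""
    w = np.asarray(w, dtype=float)
    return so3_exp(w[:3]), w[3:]


# A quarter turn about the z axis followed by a translation.
R, T = viewpoint_to_rt([0.0, 0.0, np.pi / 2, 1.0, 2.0, 3.0])
```

A zero vector maps to the identity transformation, which matches the idea of a canonical viewpoint that needs no change.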
The map transforms 3D points from the canonical view to the actual view. Thus a pixel in the canonical view is mapped to the pixel in the actual view by the warping function given by:
(3) 
Finally, the reprojection function takes as input the depth and the viewpoint change and applies the resulting warp to the canonical image to obtain the actual image as
Notice that this requires computing the inverse of the warp . This issue is discussed in detail in section 3.1.
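The forward warp described above (back-project each canonical pixel with its depth, apply the rigid motion, project again) can be sketched as follows. This is an illustration under assumptions: the function name, the array layout and the use of a known intrinsics matrix `K`, rotation `R` and translation `T` are choices made here, and the sketch does not handle occlusions.

```python
import numpy as np


def warp_pixels(depth, K, R, T):
    """Map every canonical pixel (u, v) to its location (u', v') in the actual view.

    depth: (H, W) canonical depth map; K: (3, 3) intrinsics;
    R: (3, 3) rotation; T: (3,) translation.
    Returns an (H, W, 2) array of warped pixel coordinates.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grids, (H, W)
    p = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)  # homogeneous pixels
    rays = p @ np.linalg.inv(K).T                             # K^{-1} p for every pixel
    P = depth[..., None] * rays                               # back-projected 3D points
    P_new = P @ R.T + T                                       # rigid motion to the actual view
    q = P_new @ K.T                                           # project again
    return q[..., :2] / q[..., 2:3]


# Illustrative values: a flat surface at depth 2 shifted sideways by 0.1.
K = np.array([[100.0, 0.0, 3.5], [0.0, 100.0, 2.5], [0.0, 0.0, 1.0]])
depth = np.full((6, 8), 2.0)
warped = warp_pixels(depth, K, np.eye(3), np.array([0.1, 0.0, 0.0]))
```

The function sends canonical pixels to the actual view. Resampling an image needs the opposite direction, which is why the inverse warp is required.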
Lighting function .
The goal of the lighting function is to generate the canonical image as a combination of albedo, 3D shape and light direction. Note that the effect of lighting could be incorporated in the factor by interpreting the latter as a texture rather than as the object’s albedo. However, there are two good reasons for avoiding this. First, the albedo is often symmetric even if the illumination causes the corresponding texture to look asymmetric. Separating them allows us to more effectively incorporate the symmetry constraint described below. Second, shading provides an additional cue on the underlying 3D shape Woodham (1980); Horn (1975); Belhumeur et al. (1999). In particular, unlike the recent work of Shu et al. (2018) where a shading map is predicted independently from shape, our model computes the shading based on the predicted depth, constraining the two.
Formally, given the depth map , we derive the normal map by associating to each pixel a vector normal to the underlying 3D surface. In order to find this vector, we compute the vectors and tangent to the surface along the and directions. For example, the first one is:
Then the normal is obtained by taking the vector product .
The normal is multiplied by the light direction to obtain a value for the direct illumination and the latter is added to the ambient light. Finally, the result is multiplied by the albedo function to obtain the illuminated texture, as follows:
(4) 
Here and are the scalar coefficients weighting the ambient and direct terms.
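The shading steps above (tangents from the depth map, normal by cross product, direct term plus ambient term, multiplication by the albedo) can be sketched as below. Several details are assumptions of this sketch rather than statements from the text: finite differences via `np.gradient`, clamping the direct term at zero, the weight values 0.5, the function names, and the sign convention in which a fronto-parallel surface has normal (0, 0, 1).

```python
import numpy as np


def normals_from_depth(depth, K):
    """Unit normals (H, W, 3) from a depth map (H, W) and intrinsics K (3, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    p = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
    P = depth[..., None] * (p @ np.linalg.inv(K).T)   # 3D surface points
    t_u = np.gradient(P, axis=1)                      # tangent along image columns
    t_v = np.gradient(P, axis=0)                      # tangent along image rows
    n = np.cross(t_u, t_v)                            # normal as the vector product
    return n / np.linalg.norm(n, axis=-1, keepdims=True)


def shade(albedo, depth, light, K, k_ambient=0.5, k_direct=0.5):
    """Shaded canonical image: (ambient + direct * max(0, <light, normal>)) * albedo."""
    n = normals_from_depth(depth, K)
    light = np.asarray(light, dtype=float)
    light = light / np.linalg.norm(light)
    direct = np.clip(n @ light, 0.0, None)            # (H, W) cosine term, clamped
    return (k_ambient + k_direct * direct)[..., None] * albedo


# Illustrative values: a flat grey surface lit head-on.
K = np.array([[100.0, 0.0, 3.5], [0.0, 100.0, 2.5], [0.0, 0.0, 1.0]])
image = shade(np.full((6, 8, 3), 0.8), np.full((6, 8), 2.0), [0.0, 0.0, 1.0], K)
```

Because the shading is computed from the predicted depth, any change in depth changes the shaded image, which is what couples the two factors.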
3.1 Differentiable rendering layer
As noted in the previous section, the reprojection function warps the canonical image to generate the actual image . In CNNs, image warping is usually regarded as a simple operation that can be implemented efficiently using a bilinear resampling layer Jaderberg et al. (2015). However, this is true only if we can easily send pixels in the warped image back to pixels in the source image , a process also known as backward warping. Unfortunately, in our case the function obtained by eq. 3 sends pixels in the opposite direction.
Implementing a forward warping layer is surprisingly delicate. One way of approaching the problem is to regard this task as a special case of rendering a textured mesh. The recent Neural Mesh Renderer (NMR) of Kato et al. (2018) is a differentiable renderer of this type. In our case, however, the mesh has one vertex per pixel and each group of 2×2 adjacent pixels is tessellated by two triangles. Empirically, we found the quality of the texture gradients computed by NMR to be poor in this case, probably also due to the high frequency content of the texture image .
We solve the problem as follows. First, we use NMR to warp not the albedo , but the depth map itself, obtaining a version of the depth map as seen from the actual viewpoint. This has two advantages. First, NMR is much faster when the task is limited to rendering the depth map instead of warping an actual texture. Second, the gradients are more stable, probably also due to the comparatively smooth nature of the depth map compared to the texture image . Given the depth map , we then use the inverse of (3) to find the warp field from the observed viewpoint to the canonical viewpoint, and bilinearly resample the canonical image to obtain the reconstruction.
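The last step, bilinear resampling of the canonical image through a backward warp field, can be sketched in a few lines. This covers only the resampling and not the rendering of the depth map with NMR. The function name, the (H, W, C) layout and the border clamping are assumptions of this sketch.

```python
import numpy as np


def bilinear_sample(image, coords):
    """Backward warping by bilinear interpolation.

    image: (H, W, C) source image; coords: (H_out, W_out, 2) array holding, for
    each output pixel, the (u, v) location to read in the source image.
    Locations outside the image are clamped to the border.
    """
    H, W = image.shape[:2]
    u = np.clip(coords[..., 0], 0, W - 1)
    v = np.clip(coords[..., 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    au = (u - u0)[..., None]                          # horizontal weights
    av = (v - v0)[..., None]                          # vertical weights
    top = (1 - au) * image[v0, u0] + au * image[v0, u1]
    bottom = (1 - au) * image[v1, u0] + au * image[v1, u1]
    return (1 - av) * top + av * bottom


# Reading halfway between the first two pixels of the first row.
img = np.arange(12, dtype=float).reshape(3, 4, 1)
value = bilinear_sample(img, np.array([[[0.5, 0.0]]]))
```

The operation is piecewise linear in the sampling coordinates, so gradients can flow back to the warp field and hence to the predicted depth and viewpoint.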
Discussion.
Several alternative architectures were tested and discarded in favor of the one outlined above. Among those, one option is to task the network to estimate as well as . However, this requires ensuring that the two depth maps are compatible, which adds extra complexity to the model and did not work as well.
3.2 Symmetry
A constraint that is often useful in modelling object categories is the fact that these have a bilateral symmetry, both in shape and albedo. Under the assumption of bilateral symmetry, we are able to obtain a second virtual view of an object simply by flipping the image horizontally, as shown in fig. 1(b). Note that, if we are given the correspondence between symmetric points of the object (such as the corner of the two eyes, etc.), we could use this information to infer the object’s 3D shape Gao and Yuille (2017); Gordon (1990). While such correspondences are not given to us as the system is unsupervised, we estimate them implicitly by mapping the image to the canonical space.
In practice, there are various ways to enforce a symmetry constraint. For example, one can add a symmetry loss term to the learning objective as a regularizer. However, this requires balancing more terms in the objective. Instead, we incorporate symmetry by performing reconstruction from both the canonical image and its mirrored version.
In order to do so, we introduce the (horizontal) flipping operator, whose action on a tensor is given by
During training, we randomly choose to flip the canonical albedo and depth before we reconstruct the image using eq. 1. Implicitly and without introducing an additional loss term, this imposes several constraints on the model. Both depth and albedo will be predicted with horizontal symmetry by to overcome the confusion that is introduced by the flipping operation. Additionally, this constrains the canonical viewpoint to align the object’s plane of symmetry with the vertical centerline of the image. Finally, flipping helps to disentangle albedo and shading: if an object is lit from one side and the albedo is flipped, the target still needs to be lit from the same side, requiring the shading to arise from geometry and not from the texture.
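A minimal sketch of the flipping operator and of the random choice made during training is given below. The axis convention (width is the last axis), the flip probability of 0.5 and the function names are assumptions made for illustration.

```python
import numpy as np


def flip(x):
    """Mirror a tensor along its last axis, assumed to be the horizontal one."""
    return x[..., ::-1]


def maybe_flip(albedo, depth, rng, p=0.5):
    """Flip canonical albedo and depth together with probability p."""
    if rng.random() < p:
        return flip(albedo), flip(depth)
    return albedo, depth


# Illustrative call with a (C, H, W) albedo and an (H, W) depth map.
rng = np.random.default_rng(0)
albedo, depth = maybe_flip(np.zeros((3, 4, 4)), np.ones((4, 4)), rng)
```

Albedo and depth are flipped together, while the light direction and viewpoint are left untouched, so the reconstruction can only match the input if the canonical factors are themselves symmetric.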
3.3 Loss, regularizer, and objective function
The primary loss function of our model is the loss on the reconstruction and input image :
(5) 
However, this loss is sensitive to small geometric imperfections and tends to result in blurry reconstructions; to avoid that, we add a perceptual loss, which is more robust to such geometric imperfections and eventually leads to a much sharper canonical image. This is obtained by using an off-the-shelf image encoder (VGG16 in our case Simonyan and Zisserman (2015)), and is given by where is the feature map computed by the th layer of the encoder network.
We regularize the viewpoint by pulling its mean to zero, breaking the tie between equivalent rotations (which have a period of ) and aligning the canonical view to the mean viewpoint in the dataset. This is achieved by minimizing the function where is the viewpoint estimated for image in a batch of images. We also regularize the depth by shrinking its variance between faces. We do so via the regularization term where and are the depth maps obtained from a pair of example images and . Losses and regularizers are averaged over a batch, yielding the objective:
(6) 
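The terms of the objective can be illustrated with the sketch below. The exact formulas are not recoverable from the text, so the following are assumptions: an L1 photometric loss, a viewpoint regularizer equal to the squared norm of the batch-mean viewpoint, a depth regularizer equal to the mean squared difference between two depth maps, and unit weights `lam_vp` and `lam_d`. The perceptual loss is omitted because it needs a pretrained encoder.

```python
import numpy as np


def photometric_loss(recon, image):
    """Mean absolute difference between reconstruction and input (L1 assumed)."""
    return np.mean(np.abs(recon - image))


def viewpoint_regularizer(viewpoints):
    """viewpoints: (B, D). Squared norm of the batch-mean viewpoint."""
    return np.sum(np.mean(viewpoints, axis=0) ** 2)


def depth_regularizer(depth_a, depth_b):
    """Mean squared difference between the depth maps of two examples."""
    return np.mean((depth_a - depth_b) ** 2)


def objective(recon, image, viewpoints, depth_a, depth_b, lam_vp=1.0, lam_d=1.0):
    """Weighted sum of the terms above; the weights are hypothetical."""
    return (photometric_loss(recon, image)
            + lam_vp * viewpoint_regularizer(viewpoints)
            + lam_d * depth_regularizer(depth_a, depth_b))


# A batch of two opposite viewpoints has zero mean, so it is not penalised.
w = np.array([[0.1, -0.2, 0.3], [-0.1, 0.2, -0.3]])
penalty = viewpoint_regularizer(w)
```

Penalising the batch mean rather than each viewpoint leaves individual poses free while centring their distribution on the canonical view.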
3.4 Neural network architecture
We use different networks to extract depth, albedo, viewpoint and lighting from a single image of the object. The depth and albedo are generated by encoder-decoder networks, while viewpoint and lighting are regressed using simple encoder networks. In particular, we use DenseNet Huang et al. (2017) for albedo prediction, which is a deeper architecture than the standard encoder-decoder used for depth prediction, because we would like the albedo to capture more details than the depth. We do not use skip connections between encoder and decoder because the network is generating a different view, and thus pixel alignment is not desirable.
4 Experiments
We first analyze the contribution of the individual components of our model (1) and of the regularizers. We do so quantitatively, by using a synthetic face dataset where 3D ground truth is available to measure the quality of the predicted depth maps. However, we also show qualitatively that these metrics do not fully account for all the aspects that make a good 3D reconstruction, and we demonstrate that some components of our model are particularly good at addressing those. We also compare our method on real data to Moniz et al. (2018) who estimate depth for facial keypoints, and test the generalization of the model to objects other than human faces by training on synthetic ShapeNet cars and real cat faces.
For reproducibility and future comparisons, we describe network architecture details and hyperparameters in the supplementary material. We will release the code, trained models and the synthetic dataset upon acceptance of the paper.
4.1 Quantitative assessment and ablation
To evaluate the model quantitatively, we utilize synthetic data, where we know the ground truth depth. We follow in particular the protocol of Sengupta et al. (2018) to generate a large dataset of synthetic faces using the Basel Face Model Paysan et al. (2009). The faces are rendered with shapes, textures, illuminations, and rotations randomly sampled from the model. We use images from SUN Database Xiao et al. (2010) as background and render the images together with ground truth depth maps for evaluation.
Since the scale of 3D reconstruction from projective cameras is inherently ambiguous, we adjust for it in the evaluation. Specifically, we take the depth map predicted by the model in the canonical view, map it to a depth map in the actual view, and compare the latter to the ground-truth depth map using the scale-invariant error Eigen et al. (2014) where . Additionally, we report the mean angle deviation between normals computed from the ground-truth depth and from the predicted depth, which measures how well the surface is captured by the prediction.
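The two evaluation measures can be sketched as follows. The scale-invariant error is written here in one common form, the standard deviation of the log-depth difference; whether the evaluation uses exactly this form (for example, with or without the square root) is an assumption, as are the function names and the choice of degrees for the angle.

```python
import numpy as np


def scale_invariant_depth_error(pred, gt):
    """Standard deviation of log(pred) - log(gt); unchanged if pred is rescaled."""
    delta = np.log(pred) - np.log(gt)
    var = np.mean(delta ** 2) - np.mean(delta) ** 2
    return np.sqrt(max(var, 0.0))  # guard against tiny negative round-off


def mean_angle_deviation(n_pred, n_gt):
    """Mean angle in degrees between two normal maps of shape (..., 3)."""
    cos = np.sum(n_pred * n_gt, axis=-1)
    cos /= np.linalg.norm(n_pred, axis=-1) * np.linalg.norm(n_gt, axis=-1)
    return np.degrees(np.mean(np.arccos(np.clip(cos, -1.0, 1.0))))


# A prediction that is correct up to a global scale has (near) zero error.
gt = np.array([[1.0, 2.0], [3.0, 4.0]])
err = scale_invariant_depth_error(3.0 * gt, gt)
```

A global scale factor only adds a constant to the log difference, which the variance ignores, so the measure is insensitive to the scale ambiguity mentioned above.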
In table 2 we estimate upper bounds on the model performance by comparing it to supervised baselines using the same network architectures. We also see a significant improvement over various constant prediction baselines.
To understand the influence of the individual parts, we remove each one of them and evaluate the ablated model in fig. 3. However, since the error is computed in the original view of the image, it does not evaluate the quality of the 3D shape not visible from that vantage point. Thus, we also visualize the canonical albedo and the normal map computed from the depth map . We can see that all components reduce the error as well as improve the visual quality of the samples. The symmetry constraints for the albedo and depth have the strongest impact on the model, while the perceptual loss improves the quality of the reconstruction and helps to avoid local minima during training. The regularizers improve the canonical representation, and, lastly, lighting helps to resolve possible ambiguities in the geometric reconstruction, particularly in textureless regions.
4.2 Qualitative results
To evaluate the performance on real data, we conduct experiments using two datasets. CelebA Liu et al. (2015) contains over k images of real human faces, and 3DFAW Gross et al. (2010); Jeni et al. (2015); Zhang et al. (2014); Yin et al. (2008) contains k images with 3D keypoint annotations. We crop the faces from the original images using the provided keypoints, and follow the official train/val/test splits. In fig. 1 we show qualitative results on both datasets. The 3D shape of the faces is recovered very well by the model, including details of the nose, eyes and mouth, despite the presence of extreme facial expressions.
4.3 Comparison with the state of the art
To the best of our knowledge, there is no prior work on fully unsupervised dense object depth estimation that we can directly compare with. However, the DepthNet model of Moniz et al. (2018) predicts depth for selected facial keypoints given the 2D keypoint locations as input. Hence, we can evaluate the reconstruction obtained by our method on this sparse set of points. We also compare to the baselines MOFA Tewari et al. (2017) and AIGN Tung et al. (2017) reported in Moniz et al. (2018). For a fair comparison, we use their public code which computes the depth correlation score (between and ). We use the 2D keypoint locations to sample our predicted depth and then evaluate the same metric. The set of test images from 3DFAW Gross et al. (2010); Jeni et al. (2015); Zhang et al. (2014); Yin et al. (2008) and the preprocessing are identical to Moniz et al. (2018).
In table 2 we report the results from their paper and the slightly higher results we obtained from their publicly available implementation. The paper also evaluates a supervised model using a GAN discriminator trained with ground-truth depth information. Our fully unsupervised model outperforms DepthNet and reaches close-to-supervised performance, indicating that we learn reliable depth maps.
4.4 Generalization to other objects
To understand the generalization of the method to other symmetric objects, we train on two additional datasets. We use the cat dataset provided by Zhang et al. (2008), crop the cat heads using the keypoint annotations and split the images by :: into train, validation and test sets. For car images we render ShapeNet’s Chang et al. (2015) synthetic car models from various viewpoints and textures.
We are able to reconstruct both object categories well and the results are visualized in fig. 4. Although we assume Lambertian surfaces to estimate the shading, our model can reconstruct cat faces convincingly despite their fur which has complicated light transport mechanics. This shows that the other parts of the model constrain the shape enough to still converge to meaningful representations. Overall, the model is able to reconstruct cats and cars as well as human faces, showing that the method generalizes over object categories.
5 Conclusions
We have presented a method that can learn from an unconstrained image collection of single views of a given category to reconstruct the 3D shapes of individual instances. The model is fully unsupervised and learns based on a reconstruction loss, similar to an autoencoder. We have shown that lighting and symmetry are strong indicators for shape and help the model to converge to a meaningful reconstruction. Our model outperforms a current state-of-the-art method that uses 2D keypoint supervision. As for future work, the model currently represents 3D shape from a canonical viewpoint, which is sufficient for objects such as faces that have a roughly convex shape and a natural canonical viewpoint. In order to handle more complex objects, it may be possible to extend the model to use either a collection of canonical views or a 3D representation such as a mesh or a voxel map.
Acknowledgement
We gratefully thank Soumyadip Sengupta for sharing with us the code to generate synthetic face datasets, and members of Visual Geometry Group for insightful discussion. Shangzhe Wu is supported by Facebook Research. Christian Rupprecht is supported by ERC Stg Grant IDIU638009.
References
 Agrawal et al. (2015) Pulkit Agrawal, Joao Carreira, and Jitendra Malik. Learning to see by moving. In Proc. ICCV, pages 37–45. IEEE, 2015.
 Belhumeur et al. (1999) P. N. Belhumeur, D. J. Kriegman, and A. L. Yuille. The bas-relief ambiguity. IJCV, 35(1), 1999.
 Chang et al. (2015) A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. Shapenet: An information-rich 3d model repository. arXiv, abs/1512.03012, 2015.
 Chen et al. (2019) C.-H. Chen, A. Tyagi, A. Agrawal, D. Drover, M. V. Rohith, S. Stojanov, and J. M. Rehg. Unsupervised 3d pose estimation with geometric self-supervision. arXiv, abs/1904.04812, 2019.
 Eigen et al. (2014) D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In NeurIPS, 2014.
 Faugeras and Luong (2001) O. Faugeras and Q.-T. Luong. The Geometry of Multiple Images. MIT Press, 2001.
 Gao and Yuille (2017) Y. Gao and A. L. Yuille. Exploiting symmetry and/or manhattan properties for 3d object structure estimation from single and multiple images. In Proc. CVPR, 2017.
 Gecer et al. (2019) B. Gecer, S. Ploumpis, I. Kotsia, and S. Zafeiriou. Ganfit: Generative adversarial network fitting for high fidelity 3d face reconstruction. arXiv, abs/1902.05978, 2019.
 Godard et al. (2017) C. Godard, O. Mac Aodha, and G. J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proc. CVPR, 2017.
 Gordon (1990) G. G. Gordon. Shape from symmetry. In Proc. SPIE, 1990.
 Gross et al. (2010) R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. Image and Vision Computing, 2010.
 Hartley and Zisserman (2003) R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
 Henzler et al. (2018) P. Henzler, N. J. Mitra, and T. Ritschel. Escaping plato’s cave using adversarial training: 3d shape from unstructured 2d image collections. arXiv, abs/1811.11606, 2018.
 Horn (1975) B. Horn. Obtaining shape from shading information. In The Psychology of Computer Vision, 1975.
 Huang et al. (2017) G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proc. CVPR, 2017.
 Jaderberg et al. (2015) M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NeurIPS, 2015.
 Jeni et al. (2015) L. A. Jeni, J. F. Cohn, and T. Kanade. Dense 3d face alignment from 2d videos in real-time. In Proc. Int. Conf. Autom. Face and Gesture Recog., 2015.
 Kanazawa et al. (2018) A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In Proc. ECCV, 2018.
 Kato and Harada (2019) H. Kato and T. Harada. Learning view priors for single-view 3d reconstruction. In Proc. CVPR, 2019.
 Kato et al. (2018) H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer. In Proc. CVPR, 2018.
 Kudo et al. (2018) Y. Kudo, K. Ogaki, Y. Matsui, and Y. Odagiri. Unsupervised adversarial learning of 3d human pose from 2d joint locations. arXiv, abs/1803.08244, 2018.
 Liu et al. (2019) S. Liu, T. Li, W. Chen, and H. Li. Soft rasterizer: A differentiable renderer for image-based 3d reasoning. arXiv, abs/1904.01786, 2019.
 Liu et al. (2015) Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proc. ICCV, 2015.
 Loper and Black (2014) M. M. Loper and M. J. Black. OpenDR: An approximate differentiable renderer. In Proc. ECCV, 2014.
 Luo et al. (2018) Y. Luo, J. Ren, M. Lin, J. Pang, W. Sun, H. Li, and L. Lin. Single view stereo matching. In Proc. CVPR, 2018.
 Moniz et al. (2018) J. R. A. Moniz, C. Beckham, S. Rajotte, S. Honari, and C. Pal. Unsupervised depth estimation, 3d face rotation and replacement. In NeurIPS, 2018.
 Nguyen-Phuoc et al. (2019) T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y.-L. Yang. Hologan: Unsupervised learning of 3d representations from natural images. arXiv, abs/1904.01326, 2019.
 Novotny et al. (2017) D. Novotny, D. Larlus, and A. Vedaldi. Learning 3d object categories by looking around them. In Proc. ICCV, 2017.
 Paysan et al. (2009) P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3d face model for pose and illumination invariant face recognition. In The IEEE International Conference on Advanced Video and Signal Based Surveillance, 2009.
 Sahasrabudhe et al. (2019) M. Sahasrabudhe, Z. Shu, E. Bartrum, R. A. Guler, D. Samaras, and I. Kokkinos. Lifting autoencoders: Unsupervised learning of a fully-disentangled 3d morphable model using deep non-rigid structure from motion. arXiv, abs/1904.11960, 2019.
 Sengupta et al. (2018) S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs. Sfsnet: Learning shape, reflectance and illuminance of faces in the wild. In Proc. CVPR, 2018.
 Shu et al. (2018) Z. Shu, M. Sahasrabudhe, R. A. Guler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In Proc. ECCV, 2018.
 Simonyan and Zisserman (2015) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
 Suwajanakorn et al. (2018) S. Suwajanakorn, N. Snavely, J. Tompson, and M. Norouzi. Discovery of latent 3d keypoints via end-to-end geometric reasoning. In NeurIPS, 2018.
 Szabó and Favaro (2018) A. Szabó and P. Favaro. Unsupervised 3d shape learning from image collections in the wild. arXiv, abs/1811.10519, 2018.
 Tewari et al. (2017) A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Perez, and C. Theobalt. Mofa: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In Proc. ICCV, 2017.
 Thewlis et al. (2017a) J. Thewlis, H. Bilen, and A. Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In Proc. ICCV, 2017a.
 Thewlis et al. (2017b) J. Thewlis, H. Bilen, and A. Vedaldi. Unsupervised learning of object frames by dense equivariant image labelling. In NeurIPS, 2017b.
 Thewlis et al. (2018) J. Thewlis, H. Bilen, and A. Vedaldi. Modelling and unsupervised learning of symmetric deformable object categories. In NeurIPS, 2018.

Tung et al. (2017)
H.Y. F. Tung, A. W. Harley, W. Seto, and K. Fragkiadaki.
Adversarial inverse graphics networks: Learning 2dto3d lifting and imagetoimage translation from unpaired supervision.
In Proc. ICCV, 2017.  Ummenhofer et al. (2017) B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion network for learning monocular stereo. In Proc. CVPR, pages 5038–5047, 2017.
 Wang et al. (2018) C. Wang, J. Miguel Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In Proc. CVPR, 2018.
 Wang et al. (2017) M. Wang, Z. Shu, S. Cheng, Y. Panagakis, D. Samaras, and S. Zafeiriou. An adversarial neuro-tensorial approach for learning disentangled representations. IJCV, pages 1–20, 2017.
 Woodham (1980) R. J. Woodham. Photometric method for determining surface orientation from multiple images. Optical engineering, 19(1):191139, 1980.

 Xiao et al. (2010) J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Proc. CVPR, 2010.
 Yin et al. (2008) L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale. A high-resolution 3d dynamic facial expression database. In Proc. Int. Conf. Autom. Face and Gesture Recog., 2008.
 Zhang et al. (2008) W. Zhang, J. Sun, and X. Tang. Cat head detection - how to effectively exploit shape and texture features. In Proc. ECCV, 2008.
 Zhang et al. (2014) X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard. BP4D-Spontaneous: a high-resolution spontaneous 3d dynamic facial expression database. Image and Vision Computing, 32(10):692–706, 2014.
 Zhou et al. (2017) T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In Proc. CVPR, 2017.
 Zhu et al. (2018) J.-Y. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. B. Tenenbaum, and W. T. Freeman. Visual object networks: Image generation with disentangled 3D representations. In NeurIPS, 2018.
6 Appendix
6.1 Further Implementation Details
We will release the code and the datasets for future benchmarks upon acceptance of this paper.
Table 3 summarizes the number of images in each of the datasets used in this paper. We use an image size of in all experiments. We also report all hyperparameter settings in table 7. Our models were trained for around k iterations (e.g., epochs for the synthetic face dataset), which takes roughly one day on a Titan X Pascal GPU. To avoid border issues after the viewpoint transformation, we predict depth maps twice as large as the image and crop the center after warping.
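The crop step mentioned above can be illustrated with a minimal sketch. It assumes the map is a plain 2D list and shows only the center crop, not the warping; the function name `center_crop` and the toy 4x4 input are hypothetical and not taken from the released code.

```python
def center_crop(grid, out_h, out_w):
    """Return the central out_h x out_w window of a 2D list-of-lists."""
    in_h, in_w = len(grid), len(grid[0])
    if out_h > in_h or out_w > in_w:
        raise ValueError("crop window is larger than the input")
    # Offsets that place the window in the middle of the larger map.
    top = (in_h - out_h) // 2
    left = (in_w - out_w) // 2
    return [row[left:left + out_w] for row in grid[top:top + out_h]]


# Hypothetical example: a 4x4 map predicted at twice the 2x2 target size.
big = [[r * 4 + c for c in range(4)] for r in range(4)]
small = center_crop(big, 2, 2)
```

Predicting at twice the size means that pixels shifted in from outside the target frame by the viewpoint change still carry predicted values, so the cropped result has no undefined border.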
Architecture
We use standard encoder networks for both viewpoint and lighting prediction, and encoder-decoder networks for albedo and depth prediction. The architecture for each network is detailed in table 5, table 5, and table 7. Abbreviations of building blocks are defined as follows:

Conv(c_in, c_out, k, s, p): convolution with c_in input channels, c_out output channels, kernel size k, stride s and padding p.

BN: batch normalization.

DBE(c, n): dense encoder block with n convolutions with c channels, each followed by batch normalization and ReLU.

TBE(c_in, c_out, s): encoder transition block with convolutions with c_in input channels, c_out output channels and stride s, each followed by batch normalization and LeakyReLU.

DBD(c, n): dense decoder block with n deconvolutions with c channels, each followed by batch normalization and ReLU.

TBD(c_in, c_out, s): decoder transition block with deconvolutions with c_in input channels, c_out output channels and stride s, each followed by batch normalization and ReLU.
Table 3: Number of images in each dataset.
Dataset  Total  Train  Val  Test
Syn Face  
CelebA  
3DFAW  
Cats  
Cars 
Encoder 

Conv(3, 64, 4, 2, 1) + LeakyReLU(0.2) 
Conv(64, 128, 4, 2, 1) + BN + LeakyReLU(0.2) 
Conv(128, 256, 4, 2, 1) + BN + LeakyReLU(0.2) 
Conv(256, 512, 4, 2, 1) + BN + LeakyReLU(0.2) 
Conv(512, 512, 4, 2, 1) + BN + LeakyReLU(0.2) 
Conv(512, 128, 2, 1, 0) + BN + LeakyReLU(0.2) 
Decoder 
Deconv(128, 512, 2, 1, 0) + BN + ReLU 
Deconv(512, 512, 4, 2, 1) + BN + ReLU 
Deconv(512, 256, 4, 2, 1) + BN + ReLU 
Deconv(256, 128, 4, 2, 1) + BN + ReLU 
Deconv(128, 64, 4, 2, 1) + BN + ReLU 
Deconv(64, 64, 4, 2, 1) + BN + ReLU 
Conv(64, 1, 5, 1, 2) + Tanh 
Dense Encoder 

Conv(3, 64, 4, 2, 1) 
DBE(64, 6) + TBE(64, 128, 2) 
DBE(128, 12) + TBE(128, 256, 2) 
DBE(256, 24) + TBE(256, 512, 2) 
DBE(512, 16) + TBE(512, 128, 4) 
Sigmoid 
Dense Decoder 
Deconv(128, 512, 4, 1, 0) 
DBD(512, 16) + TBD(512, 256, 2) 
DBD(256, 24) + TBD(256, 128, 2) 
DBD(128, 12) + TBD(128, 64, 2) 
DBD(64, 6) + TBD(64, 64, 2) 
BN + ReLU + Conv(64, 3, 5, 1, 2) 
Tanh 
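As a sanity check on the Encoder and Decoder tables above, the spatial size of the feature maps can be traced layer by layer. The sketch below uses the standard output-size formulas for convolutions and transposed convolutions (dilation 1, no output padding). The 64-pixel input is an assumption, chosen because it is the size at which the listed encoder reduces to a 1x1 map; the function names are hypothetical.

```python
def conv_out(n, k, s, p):
    """Output size of a convolution with kernel k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1


def deconv_out(n, k, s, p):
    """Output size of a transposed convolution (no output padding)."""
    return (n - 1) * s - 2 * p + k


# (kernel, stride, padding) triples read off the Encoder and Decoder tables.
ENCODER_CONVS = [(4, 2, 1)] * 5 + [(2, 1, 0)]
DECODER_DECONVS = [(2, 1, 0)] + [(4, 2, 1)] * 5
FINAL_CONV = (5, 1, 2)


def trace_sizes(size):
    """Return the spatial size after the input and after each layer."""
    sizes = [size]
    for k, s, p in ENCODER_CONVS:
        sizes.append(conv_out(sizes[-1], k, s, p))
    for k, s, p in DECODER_DECONVS:
        sizes.append(deconv_out(sizes[-1], k, s, p))
    sizes.append(conv_out(sizes[-1], *FINAL_CONV))
    return sizes


sizes = trace_sizes(64)  # assumed input resolution
```

Under this assumption the encoder halves the resolution five times (64, 32, 16, 8, 4, 2) and the final 2x2 convolution yields a 1x1 code; the decoder mirrors this path back to 64, and the last 5x5 convolution with padding 2 preserves the size.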
Parameter  Value/Range 

Optimizer  Adam 
Learning rate  
Number of epochs  
Batch size  
Loss weight  
Loss weight  
Loss weight  
Loss weight  
Depth (Human face)  
Depth (Cat head)  
Depth (Car)  
Albedo  
Light coefficient  
Light coefficient  
Light direction x/y  
Viewpoint rotation x/y/z  
Viewpoint translation x/y/z 
Encoder 

Conv(3, 32, 4, 2, 1) + ReLU 
Conv(32, 64, 4, 2, 1) + ReLU 
Conv(64, 128, 4, 2, 1) + ReLU 
Conv(128, 256, 4, 1, 0) + ReLU 
Conv(256, 256, 4, 2, 1) + ReLU 
FC(256, output dim) + Tanh 
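The networks above end in a Tanh, whose output lies in [-1, 1], while the hyperparameter table lists a value range for each predicted quantity. One common way to connect the two is an affine rescaling of the Tanh output to the target interval. The sketch below illustrates that idea only; it is an assumption rather than a description of the authors' implementation, and the function name and the example range of -60 to 60 are hypothetical.

```python
import math


def squash_to_range(raw, lo, hi):
    """Map an unbounded activation to [lo, hi] via tanh and an affine rescale."""
    if lo >= hi:
        raise ValueError("lo must be smaller than hi")
    unit = math.tanh(raw)  # in [-1, 1]
    return lo + (unit + 1.0) * 0.5 * (hi - lo)


# Hypothetical range, for illustration only (not a value from the paper).
angle_mid = squash_to_range(0.0, -60.0, 60.0)
angle_max = squash_to_range(100.0, -60.0, 60.0)
```

A zero activation maps to the midpoint of the interval, and large activations saturate at the bounds, which keeps every prediction inside its stated range.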
6.2 More Qualitative Results











[Figures: additional qualitative results. Columns, left to right: input, recon, albedo, normal, shading, shaded, side.]









