1 Introduction
Computer vision can be understood as the task of inverse graphics, namely the recovery of the scene that underlies an observed image. The scene factors that govern image formation primarily include surface geometry, camera position, material properties and illumination. These are independent of each other, but jointly determine the observed image intensities.
In this work we incorporate these factors as disentangled variables in a deep generative model of an object category and tackle the problem of recovering all of them in an entirely unsupervised manner. We integrate in our network design ideas from classical computer vision, including structure-from-motion, spherical harmonic models of illumination and deformable models, and recover the three-dimensional geometry of a deformable object category in an entirely unsupervised manner from an unstructured collection of RGB images. We focus in particular on human faces and show that we can learn a three-dimensional morphable model of face geometry and appearance without access to any 3D training data or manual labels. We also show that weak supervision allows us to further disentangle identity and expression, leading to even more controllable 3D generative models.
The resulting model allows us to generate photorealistic images of persons in a fully controllable manner: we can manipulate 3D camera pose, expression, texture and illumination in terms of disentangled and interpretable low-dimensional variables.
Our starting point is the Deforming AutoEncoder (DAE) model introduced in [58] to learn an unsupervised deformable template model for an object category. DAEs incorporate deformations in the generative process of a deep autoencoder by associating pixels with the UV coordinates of a learned deformable template. As such, they disentangle appearance and shape variability and learn dense template-image correspondences in an unsupervised manner.
We first introduce Lifting AutoEncoders (LAEs) to recover, and then exploit, the underlying 3D geometry of an object category by interpreting the outputs of a DAE in terms of a 3D representation. For this we train a network to minimize a Non-Rigid SfM objective, which results in a low-dimensional morphable model of 3D shape, coupled with an estimate of the camera parameters. The resulting 3D reconstruction is coupled with a differentiable renderer [34] that propagates information from a 3D mesh to a 2D image, yielding a generative model for images that can be used for both image reconstruction and manipulation.
Our second contribution consists in exploiting the 3D nature of our novel generative model to further disentangle the image formation process. This is done in two complementary ways. For illumination modeling we use the 3D model to render normal maps and then shading images, which are combined with albedo maps to synthesize appearance. The resulting generative model incorporates spherical-harmonics-based [83, 74, 75] modeling of image formation, while still being end-to-end differentiable and controllable. For shape modeling we use sources of weak supervision to factor the shape variability into 3D pose, and non-rigid identity and expression, allowing us to control the expression or identity of a face by working with the appropriate latent variable code.
Finally, we combine our reconstruction-driven architecture with an adversarially trained refinement network, which allows us to generate photorealistic images as its output.
As a result of these advances we have a deep generative model that uses 3D geometry to model shape variability and provides us with a clearly disentangled representation of 3D shape in terms of identity, expression and camera pose and appearance in terms of albedo and illumination/shading. We report quantitative results on a 3D landmark localization task and show multiple qualitative results of controllable photorealistic image generation.
2 Previous work
The task of disentangling deep models can be understood as splitting the latent space of a network into independent sources of variation. In the case of learning generative models for computer vision, this amounts to uncovering the independent factors that contribute to image formation. This can both simplify learning, by injecting inductive biases about the data generation process, and lead to interpretable models that can be controlled by humans in terms of a limited number of degrees of freedom. This would, for instance, allow computer graphics to benefit from the advances in the learning of generative models.
Over the past few years rapid progress has been made in the direction of disentangling the latent space of deep models into dimensions that account for generic factors of variation, such as identity and low-dimensional transformations [14, 79, 46, 78, 61], or even non-rigid, dense deformations from appearance [86, 19, 64, 58, 76]. Several of these techniques have made it into some of the most compelling photorealistic, controllable generative models of object categories [52, 33].
Moving closer to graphics, recent works have aimed at exploiting our knowledge about image formation in generative modeling by replicating the inner workings of graphics engines in deep networks. On the synthesis side, geometry-driven generative models using intrinsic images [59, 5, 57] or the 2.5D image sketch [87] as inputs to image synthesis networks have been shown to deliver sharper, more controllable image and video [38] synthesis results. On the analysis side, several works have aimed at intrinsic image decomposition [7] using energy minimization, e.g. [20, 42]. The disentanglement of image formation into all of its constituent sources (surface normals, illumination and albedo) was first pursued in [6], where priors over the constituent variables were learned from generic scenes and then served as regularisers to complement the image reconstruction loss. More recently, deep learning-based works have aimed at learning the intrinsic image decomposition from synthetic supervision [47], self-supervision [29] or multi-view supervision [82].
These works can be understood in D. Marr’s terms as getting 2.5D proxies to 3D geometry, which could eventually lead to 3D reconstruction [80]: texture is determined by shading, shading is obtained from normals and illumination, and normals are obtained from the 3D geometry. This makes 3D geometry estimation the key to a thorough disentanglement of image formation.
Despite these advances, the disentanglement of the three-dimensional world geometry from the remaining aspects of image formation remains very recent in deep learning. Effectively all works addressing aspects related to 3D geometry rely on paired data for training, e.g. multiple views of the same object [71], videos [48] or some pre-existing 3D mesh representation that is the starting point for further disentanglement [21, 56, 81, 62] or self-supervision [85]. This however leaves open the question of how one can learn about the three-dimensional world simply by observing a set of unstructured images.
Very recently, a few works have started tackling the problem of recovering the three-dimensional geometry of objects from more limited information. In [31] the authors used segmentation masks and keypoints to learn a CNN-driven 3D morphable model of birds, trained in tandem with a differentiable renderer module [34]. Apart from the combination with an end-to-end learnable framework, this however requires the same level of manual annotation (keypoints and masks) that earlier works had used to lift object categories to 3D [13]. A similar approach has been proposed in [70] to learn morphable models from keypoint annotations.
The LiftNet architecture proposed more recently by [77] uses a 3D geometry-based reprojection loss to train a depth regression FCN by using correspondences of object instances during training. This however is missing the surface-based representation of a given category, and is using geometry only implicitly, in its loss function: the network itself is a standard FCN.
The unsupervised training of volumetric CNNs was originally proposed in [30] using toy examples and mostly binary masks. Most recently, a GAN-based volumetric model of object categories was introduced in [25], showing that one can recover 3D geometry from an unstructured photo collection using adversarial training. Still, this is far from a rendering pipeline, in the sense that the effects of illumination and texture are coupled together, and the volumetric representation implies limitations in resolution.
Even though these works present exciting progress in the direction of deep 3D reconstruction, they fall short of providing us with a model that operates like a full-blown rendering pipeline. By contrast, in our work we propose for the first time a deep learning-based method that recovers a three-dimensional, surface-based, deformable template of an object category from an unorganized set of images, leading to controllable photorealistic image synthesis.
We do so by relying on Non-Rigid Structure from Motion (NRSfM). Rigid SfM is a mature technology: efficient algorithms have existed for multiple decades [68, 24], systems for large-scale, city-level 3D reconstruction were introduced a decade ago [3], and high-performing systems are now publicly available [55]. Rigid SfM has very recently been revisited from the deep learning viewpoint, leading to exciting new results [72, 85].
In contrast, NRSfM is still a largely unsolved problem. Developed originally to establish a 3D model of a deformable object by observing its motion [10], it was subsequently developed to solve increasingly accurately the underlying mathematical optimization problems [69, 49, 4, 15], extending to dense reconstruction [18], lifting object categories from keypoints and masks [13, 31], incorporating spatio-temporal priors [60] and illumination models [45], while leading to impressively high-resolution 3D reconstruction results [22, 45, 26]. In [41] it has recently been proposed to represent non-rigid variability in terms of a deep architecture, but the work still relies on given point correspondences between instances of the same category. By contrast, our proposed method has a simple, linear model for the shape variability, as in classical morphable models, but establishes the correspondences automatically.
Earlier NRSfM-based work has shown that 3D morphable model learning is possible in particular for human faces [35, 37, 36] by using a carefully designed, flow-based algorithm to uncover the organization of the image collection, effectively weaving a network of connections between pixels of images, and feeding this into NRSfM. As we now show, this is no longer necessary: we delegate the task of establishing correspondences across pixels of multiple images to a Deforming AutoEncoder [58] and proceed to lifting images through an end-to-end trainable deep network.
Several other works have shown that combining a prior template about the object category shape with video allows for an improved 3D reconstruction of the underlying geometry, both for faces [67, 63, 43] and quadrupeds [8]. However, these methods still require multiple videos and a template, while our method does not. We intend to explore the use of video-based supervision in future work.
3 Lifting AutoEncoders
We start by briefly describing Deforming AutoEncoders, as these are the starting point of our work. We then turn to our novel contributions of 3D lifting in Sec. 3.2 and shape disentanglement in Sec. 4.2.
3.1 DAEs: from image collections to deformations
Deforming AutoEncoders, introduced in [58] and shown in Fig. 3, follow the deformable template paradigm and model image generation through a combination of appearance synthesis in a canonical coordinate system and a spatial deformation that warps the appearance (or, ‘texture’) to the observed image coordinates. The resulting model disentangles shape and appearance in an entirely unsupervised manner, while using solely an image reconstruction loss for training. Training a DAE is in principle an ill-posed problem, since the model could learn to model shape variability in terms of appearance and recover a constant, identity deformation, resulting in a standard AutoEncoder. This is handled in practice by forcing the network to model shape variability through the deformation branch by reducing the dimensionality of the latent vector for textures. Further details for training DAEs are provided in [58].
3.2 LAEs: 3D structure-from-deformations
We now turn to the problem of recovering the 3D geometry of an object category from an unstructured set of images. For this we rely on DAEs to identify corresponding points across this image set, and address our problem by training a network to minimize an objective function that is inspired by Non-Rigid Structure from Motion (NRSfM). Our central observation is that DAEs provide us with an image representation on which NRSfM optimization objectives can be easily applied. In particular, disentangling appearance and deformation labels all image positions that correspond to a single template point with a common, discovered value. LAEs take this a step further, and interpret the DAE’s UV decoding outputs as indicating the positions where an underlying 3D object surface position projects to the image plane. The task of an LAE is then to infer a 3D structure that can successfully project to all of the observed 2D points.
Given that we want to handle a deformable, non-rigid object category, we introduce a loss function that is inspired by Non-Rigid Structure from Motion, and optimize with respect to it. The variables involved in the optimization include (a) the statistical 3D shape representation, represented in terms of a linear basis, (b) the per-instance expansion coefficients on this basis and (c) the per-instance 3D camera parameters. We note that in standard NRSfM all of the observations come from a common instance that is observed in time; by contrast, in our case every training sample stems from a different instance of the same category, and it is only thanks to the DAE-based preprocessing that these distinct instances become commensurate.
3.3 3D Lifting Objective
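Our 3D structure inference task amounts to the recovery of a surface model that maps an intrinsic coordinate space to 3D coordinates: $S: \mathbb{R}^2 \rightarrow \mathbb{R}^3$. Even though the underlying model is continuous, our implementation is discrete: we consider a set of 3D points sampled uniformly on a cartesian grid in intrinsic coordinates,
$$\mathbf{S}_i = S(\mathbf{u}_i) \in \mathbb{R}^3, \quad i = 1, \ldots, M, \tag{1}$$
$$\mathbf{u}_i = (j_i \Delta, k_i \Delta), \quad j_i, k_i \in \mathbb{Z}, \tag{2}$$
where $\Delta$ determines the spatial resolution at which we discretize the surface. We parameterize the three-dimensional position of these vertices in terms of a low-dimensional linear model that captures the dominant modes of variation around a mean shape $\mathbf{B}_0$,
$$\mathbf{S}_i = \mathbf{B}_{0,i} + \sum_{s=1}^{K} c_s \mathbf{B}_{s,i}. \tag{3}$$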
In morphable models [73, 9] the mean shape and deformation basis elements are learned by PCA on a set of aligned 3D shapes, but in our case we discover them from 2D by solving an NRSfM minimization problem that involves the projection to an unknown camera viewpoint.
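In particular we consider scaled orthographic projection through a camera described by a rotation matrix $\mathbf{R}$ and translation vector $\mathbf{T}$. Under this assumption, the 3D surface points $\mathbf{S}_i$ project to the image points $\mathbf{x}_i$, given by
$$\mathbf{x}_i = \sigma \mathbf{P} \mathbf{R} \mathbf{S}_i + \mathbf{T}, \qquad \mathbf{P} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}, \tag{4}$$
where $\sigma$ defines a global scaling and $\mathbf{P}$ discards the depth coordinate.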
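As a minimal sketch of how the linear shape model of Eq. 3 and the scaled orthographic camera of Eq. 4 fit together, the following plain-Python example builds a shape from a mean and one basis element and projects it to the image plane. The function names, the toy three-vertex shape, the yaw-only rotation and the 2D translation applied after projection are illustrative assumptions, not the implementation used in our experiments.

```python
import math

def shape_from_basis(mean_shape, basis, coeffs):
    """Vertex i = mean vertex i + sum_s coeffs[s] * basis[s][i] (cf. Eq. 3)."""
    shape = []
    for i, mean_vertex in enumerate(mean_shape):
        vertex = list(mean_vertex)
        for c, basis_shape in zip(coeffs, basis):
            for d in range(3):
                vertex[d] += c * basis_shape[i][d]
        shape.append(tuple(vertex))
    return shape

def yaw_rotation(angle):
    """Rotation by `angle` radians about the vertical (y) axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def project(shape, rotation, translation, scale):
    """Scaled orthographic projection: rotate, drop depth, scale, shift (cf. Eq. 4)."""
    points = []
    for vertex in shape:
        rotated = [sum(rotation[r][d] * vertex[d] for d in range(3))
                   for r in range(3)]
        points.append((scale * rotated[0] + translation[0],
                       scale * rotated[1] + translation[1]))
    return points

# Toy data (hypothetical): three vertices and a single deformation mode.
mean_shape = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
basis = [[(0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.5, 0.0)]]
shape = shape_from_basis(mean_shape, basis, coeffs=[2.0])
frontal = project(shape, yaw_rotation(0.0), translation=(10.0, 20.0), scale=2.0)
```

Because the depth coordinate is discarded, a frontal view carries no information about the second vertex's depth; it only becomes observable once the camera rotates, which is why a collection of differently posed images is needed to constrain the 3D shape.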
We measure the quality of a 3D reconstruction in terms of the Euclidean distance between the predicted projection of a 3D point and its actual position in the image. In our case a 3D point $\mathbf{S}_i$ is associated with surface coordinate $\mathbf{u}_i$; we therefore penalize the distance of its projection $\mathbf{x}_i$ from the image position $\mathbf{p}_i$ that the DAE’s deformation decoder labels as $\mathbf{u}_i$:
$$\ell_i = \| \mathbf{x}_i - \mathbf{p}_i \|_2. \tag{5}$$
In practice we return the image point $\mathbf{p}_i$ where the DAE’s prediction is closest to $\mathbf{u}_i$; if no point is sufficiently close we declare that point $i$ is missing, setting a visibility variable $v_i$ to zero. We treat $\mathbf{p}_i$ and $v_i$ as data terms, which specify the constraints that our learned 3D model must meet: the 3D points must project to points $\mathbf{x}_i$ that lie close to their visible 2D counterparts, $\mathbf{p}_i$. We express this reprojection objective in terms of the remaining variables:
$$L(\mathbf{R}, \mathbf{T}, \sigma, \mathbf{c}, \mathbf{B}) = \sum_{i} v_i \| \mathbf{x}_i - \mathbf{p}_i \|_2, \tag{6}$$
where we have expressed $\mathbf{x}_i$ as a differentiable function of $\mathbf{R}, \mathbf{T}, \sigma, \mathbf{c}, \mathbf{B}$ through Eq. 4 and Eq. 3.
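For a set of $N$ images we have $N$ different camera and shape parameters $(\mathbf{R}_n, \mathbf{T}_n, \sigma_n, \mathbf{c}_n)$, since we consider a non-rigid object seen from different viewpoints. The basis elements $\mathbf{B}$ are however considered to be common across all images, since they describe the inherent shape variability of the whole category. Our 3D non-rigid reconstruction problem thus becomes:
$$L_{\text{3D}}\big(\{\mathbf{R}_n, \mathbf{T}_n, \sigma_n, \mathbf{c}_n\}_{n=1}^{N}, \mathbf{B}\big) = \sum_{n=1}^{N} L(\mathbf{R}_n, \mathbf{T}_n, \sigma_n, \mathbf{c}_n, \mathbf{B}). \tag{7}$$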
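The reprojection objective of Eq. 6 and its sum over the image collection in Eq. 7 can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the unsquared Euclidean distance follows the wording of the text, and the projections are passed in as precomputed 2D points rather than being derived from camera and shape parameters.

```python
import math

def reprojection_loss(projected, observed, visible):
    """Visibility-weighted sum of Euclidean distances for one image (cf. Eq. 6)."""
    return sum(v * math.dist(x, p)
               for x, p, v in zip(projected, observed, visible))

def total_loss(images):
    """Sum of the per-image losses over the whole collection (cf. Eq. 7)."""
    return sum(reprojection_loss(*image) for image in images)

# Each image: (projected 3D vertices, DAE correspondences, visibility flags).
image_a = ([(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)],
           [(0.0, 0.0), (0.0, 0.0), (9.0, 9.0)],
           [1, 1, 0])  # the last point has no DAE match and is ignored
image_b = ([(1.0, 0.0)], [(1.0, 2.0)], [1])
loss = total_loss([image_a, image_b])
```

The visibility flag removes a vertex from the objective entirely, so a point that the DAE cannot localize neither helps nor penalizes the fit.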
3.4 LAE learning via Deep NRSfM
Minimizing the objective of Eq. 7 amounts to the common Non-Rigid Structure-from-Motion objective [10, 69, 49, 4, 15]. Even though highly efficient and scalable algorithms have been proposed for its minimization, they could at most serve for initialization in our setting, since we want 3D Lifting to be a component of a larger deep generative model of images. In practice we do not use any such technique, in order to simplify our model’s training, which we implement as a single deep network training process.
The approach we take is to handle the shape basis $\mathbf{B}$ as the parameters of a linear ‘morphable’ layer, tasked with learning the shape model for our object category. We train this layer in tandem with complementary, multi-layer network branches that regress from the image to (a) the expansion coefficients $\mathbf{c}$, (b) the Euler angles/rotation matrix $\mathbf{R}$, and (c) the displacement vector $\mathbf{T}$ describing the camera position. In the limit of very large hidden vectors the related angle/displacement/coefficient heads could simply memorize the optimal values per image, as dictated by the optimization of Eq. 7. With a smaller number of hidden units these heads learn to successfully regress camera and shape vectors and can generalize to unseen images. As such, they are components of a larger deep network that can learn to reconstruct an image in 3D, a task we refer to as Deep NRSfM.
If we only train a network to optimize this objective we obtain a network that can interpret a given image in terms of its 3D geometry, as expressed by the 3D camera position (rigid pose) and the instance-specific expansion coefficients (non-rigid shape). Having established this, we can conclude the task of image synthesis by projecting the 3D surface back to 2D. For this we combine the 3D lifting network with a differentiable renderer [34], and bring the synthesized texture image in correspondence with the image coordinates. The resulting network is an end-to-end trainable pipeline for image generation that passes through a full-blown 3D reconstruction process.
Having established a controllable, 3D-based rendering pipeline, we turn to photorealistic synthesis. For this we further refine the rendered image with a U-Net [53] architecture that takes as input the reconstructed image and improves its visual plausibility. This refinement module is trained using two losses: first a reconstruction loss with respect to the input image and second an adversarial loss to provide photorealism. The results of this module are demonstrated in Figure 7: we see that, while keeping the image generation process intact, we achieve a substantially more realistic synthesis.
4 GeometryBased Disentanglement
A Lifting AutoEncoder provides us with a disentangled representation of images in terms of 3D rotation, non-rigid deformation, and texture, leading to controllable image synthesis.
In this section we show that having access to the underlying 3D scene behind an image allows us to further decompose the image generation into distinct, controllable sub-models, in the same way that one would do within a graphics engine. These contributions rely on certain assumptions and data that are reasonable for human faces, but could also apply to several other categories.
We first describe in Sec. 4.1 how surface-based normal estimation allows us to disentangle appearance into albedo and shading using a physics-based model of illumination. In Sec. 4.2 we then turn to learning a more fine-grained model of 3D shape and use weak supervision to disentangle per-instance non-rigid shape into expression and identity.
4.1 LAE-lux: Disentangling Shading and Albedo
Given the 3D reconstruction of a face we can use certain assumptions about image formation that lead to physically plausible illumination modeling. For this we extend LAE with albedo-shading disentangling, giving rise to LAE-lux, where we explicitly model illumination.
As in several recent works [59, 57] we consider a Lambertian reflectance model for human faces and adopt the Spherical Harmonic model to model the effects of illumination on appearance [83, 74, 75]. We pursue the intrinsic decomposition [7] of the canonical texture $\mathcal{T}$ into albedo, $\mathcal{A}$, and shading, $\mathcal{S}$:
$$\mathcal{T} = \mathcal{A} \odot \mathcal{S}, \tag{8}$$
where $\odot$ denotes the Hadamard product, by constraining the shading image to be connected to the normals delivered by the LAE surface.
In particular, denoting by $\mathbf{L}$ the representation of the scene-specific spherical harmonic illumination vector, and by $\mathbf{H}(\mathbf{u})$ the representation of the local normal field on the first 9 spherical harmonic coefficients, we consider that the local shading, $\mathcal{S}(\mathbf{u})$, is expressed as an inner product:
$$\mathcal{S}(\mathbf{u}) = \langle \mathbf{L}, \mathbf{H}(\mathbf{u}) \rangle. \tag{9}$$
As such the shading field can be obtained by a linear layer that is driven by regressed illumination coefficients $\mathbf{L}$ and the surface-based harmonic field, $\mathbf{H}$. Given $\mathcal{S}$, the texture can then be obtained from albedo and shading images according to Eq. 8.
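The shading model of Eq. 9 and the composition of Eq. 8 can be illustrated with a short sketch. The nine-term polynomial basis below is one commonly used form of the second-order spherical harmonics of a unit normal with the normalization constants omitted; its ordering, the function names and the toy illumination vector are assumptions made for illustration only.

```python
def harmonic_basis(normal):
    """Nine second-order harmonic terms of a unit normal (constants omitted)."""
    nx, ny, nz = normal
    return [1.0, nx, ny, nz,
            nx * ny, nx * nz, ny * nz,
            nx * nx - ny * ny, 3.0 * nz * nz - 1.0]

def shade(normal, illumination):
    """Shading as the inner product of illumination and harmonic field (cf. Eq. 9)."""
    return sum(h * l for h, l in zip(harmonic_basis(normal), illumination))

def compose_texture(albedo, shading):
    """Per-pixel (Hadamard) product of albedo and shading (cf. Eq. 8)."""
    return [a * s for a, s in zip(albedo, shading)]

# Hypothetical illumination: an ambient term plus light arriving along +z.
illumination = [0.5, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
normals = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]  # facing the light, facing sideways
shading = [shade(n, illumination) for n in normals]
texture = compose_texture([0.8, 0.4], shading)
```

Since the shading is linear in the illumination vector, relighting a face amounts to changing nine numbers while the albedo and the normals stay fixed.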
In practice, the normal field we estimate is not as detailed as would be needed, e.g. to capture sharp corners, while the illumination coefficients can be inaccurate. To compensate for this, we first render an estimate of the shading, $\mathcal{S}_{\text{rendered}}$, with spherical harmonics parameters and normal maps and then use a U-Net to refine it, obtaining the adapted shading $\mathcal{S}_{\text{adapted}}$.
In our experiments we have initialized LAE-lux with a converged LAE, discarded the last layer of the LAE’s texture prediction and replaced it with the intrinsic predictor outlined above. The albedo image is obtained through an albedo decoder that has an identical architecture to the texture decoder in DAE. The latent code for albedo and the spherical harmonics parameters are obtained as separate linear layers that process the penultimate layer of the texture encoder.
In training, only the texture decoders are updated while other encoding and decoding networks are fixed. When instead training everything jointly from scratch we observed implausible disentanglement results, presumably due to the ill-posed nature of the decomposition problem.
Given that the shading-albedo decomposition is an ill-posed problem, we further use a combination of losses that capture increasingly detailed prior knowledge about the desired solution. First, as in [59] we employ intrinsic image-based smoothness losses on albedo and shading:
$$E_{\text{smooth}} = \lambda_{\text{shade}} \| \nabla \mathcal{S} \|_2^2 + \lambda_{\text{albedo}} \| \nabla \mathcal{A} \|_1, \tag{10}$$
where $\nabla$ represents the spatial gradient, which means that we allow the albedo to have sharp discontinuities, while the shading image should have mostly smooth variations [54]. In our experiments, the weights $\lambda_{\text{shade}}$ and $\lambda_{\text{albedo}}$ are kept fixed.
Second, we compute a deterministic estimate $\hat{\mathbf{L}}$ of the illumination parameters and penalize its distance to the regressed illumination values $\mathbf{L}$:
$$E_{\mathbf{L}} = \| \mathbf{L} - \hat{\mathbf{L}} \|_2^2. \tag{11}$$
More specifically, $\hat{\mathbf{L}}$ is based on the crude assumption that the face’s albedo is constant, $\hat{\mathcal{A}} = \text{const}$, where we treat albedo as grayscale. Even though clearly very rough, this assumption captures the fact that a face is largely uniform, and allows us to compute a proxy to the shading in terms of $\hat{\mathcal{S}} = \mathcal{T} \oslash \hat{\mathcal{A}}$, where $\oslash$ denotes Hadamard division. We subsequently compute the approximation $\hat{\mathbf{L}}$ from $\hat{\mathcal{S}}$ and the harmonic field $\mathbf{H}$ using least squares. For face images, similar to [59], $\hat{\mathbf{L}}$ serves as a reasonably rough approximation of the illumination coefficients and is used for weak supervision in Eq. 11.
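Finally, the shading consistency loss regularizes the U-Net, and is designed to encourage the U-Net-based adapted shading $\mathcal{S}_{\text{adapted}}$ to be consistent with the shading $\mathcal{S}_{\text{rendered}}$ rendered from the spherical harmonics representation:
$$E_{\text{consistency}} = \text{Huber}(\mathcal{S}_{\text{adapted}} - \mathcal{S}_{\text{rendered}}), \tag{12}$$
where we use the Huber loss for robust regression, since $\mathcal{S}_{\text{rendered}}$ can contain some outlier pixels due to an imperfect 3D shape.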
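To make the robustness argument concrete, here is a small sketch of a Huber-based shading consistency term. The threshold `delta`, the averaging over pixels and the function names are assumptions for illustration; the text above only specifies that a Huber loss is used.

```python
def huber(residual, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * magnitude ** 2
    return delta * (magnitude - 0.5 * delta)

def shading_consistency(adapted, rendered, delta=1.0):
    """Mean Huber penalty between adapted and rendered shading values."""
    penalties = [huber(a - r, delta) for a, r in zip(adapted, rendered)]
    return sum(penalties) / len(penalties)

adapted = [0.5, 0.5, 0.5, 0.5]
rendered = [0.5, 1.0, 0.0, 3.5]  # the last value mimics an outlier pixel
loss = shading_consistency(adapted, rendered)
```

With a squared loss the outlier pixel would contribute 4.5 on its own, whereas the Huber penalty grows only linearly and contributes 2.5, so a few badly rendered pixels have less influence on the refinement network.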
4.2 Disentangling Expression, Identity and Pose
Having outlined our geometrydriven model for disentangling appearance variability into shading and albedo, we now turn to the task of disentangling the sources of shape variability.
In particular, we consider that face shape, as observed in an image, is the composite effect of camera pose, identity and expression. Without some guidance the parameters controlling shape can be mixed, for instance accounting for the effects of camera rotation through non-rigid deformations of the face.
We start by allowing our representation to separately model identity and expression, and then turn to forcing it to disentangle pose, identity and expression.
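For a given identity we can understand expression-based shape variability in terms of deviation from a neutral pose. We can consider that a reasonable approximation to this consists in using a separate linear basis for identity, $\mathbf{B}^{\text{id}}$, and another for expression, $\mathbf{B}^{\text{exp}}$, which amounts to the following model:
$$\mathbf{S}_i = \mathbf{B}_{0,i} + \sum_{s} c^{\text{id}}_s \mathbf{B}^{\text{id}}_{s,i} + \sum_{s} c^{\text{exp}}_s \mathbf{B}^{\text{exp}}_{s,i}. \tag{13}$$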
Even though the model is still linear and is at first sight equivalent, clearly separating the two subspaces means that we can control them through side information. For instance when watching a video of a single person, or a single person from multiple viewpoints, one can enforce the identity expansion coefficients to remain constant through a siamese loss [40]. This would force the training to model all of the person-specific variability through the remaining subspace, by changing the respective coefficients per image.
Here we use the MultiPIE [23] dataset to help disentangle the latent representation of person identity, facial expression, and pose (camera). MultiPIE is captured under a controlled environment and contains image pairs acquired under identical conditions with differences only in (1) facial expression, (2) camera position, and (3) illumination conditions. We use this dataset to disentangle the latent representation for shape into distinct components.
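We denote by $\mathbf{z}$ the concatenation of all shape parameters, $\mathbf{z} = (\mathbf{z}_{\text{pose}}, \mathbf{z}_{\text{id}}, \mathbf{z}_{\text{exp}})$, and turn to the task of forcing the different components of $\mathbf{z}$ to behave as expected. We use facial expression disentangling as an example, and follow a similar procedure for identity and pose disentangling. Given an image $I$ with known expression exp, we sample two more images. The first, $I^{+}$, has the same facial expression but different identity, pose, and illumination conditions. The second, $I^{-}$, has a different facial expression but the same identity, pose and illumination condition as $I$. We use siamese training to encourage $I$ and $I^{+}$ to have similar latent representations for facial expression, and a triplet loss with margin $m$ to ensure that $I$ and $I^{+}$ are closer in expression space than $I$ and $I^{-}$:
$$E_{\text{siamese}} = \| \mathbf{z}_{\text{exp}}(I) - \mathbf{z}_{\text{exp}}(I^{+}) \|_2^2, \tag{14}$$
$$E_{\text{triplet}} = \max\big(0,\; m + \| \mathbf{z}_{\text{exp}}(I) - \mathbf{z}_{\text{exp}}(I^{+}) \|_2 - \| \mathbf{z}_{\text{exp}}(I) - \mathbf{z}_{\text{exp}}(I^{-}) \|_2\big), \tag{15}$$
$$E_{\text{exp}} = E_{\text{siamese}} + E_{\text{triplet}}. \tag{16}$$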
Following a similar collection of triplets for the remaining sources of variability, we disentangle the latent code for shape in terms of camera pose, identity, and expression. With MultiPIE, the overall disentanglement objective for shape is hence
$$E_{\text{disentangle}} = E_{\text{exp}} + E_{\text{identity}} + E_{\text{pose}}, \tag{17}$$
where $E_{\text{identity}}$ and $E_{\text{pose}}$ are defined similarly to $E_{\text{exp}}$. In our experiments, this loss is weighted by a scaling parameter $\lambda_{\text{disentangle}}$.
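The siamese and triplet terms used for expression disentangling can be sketched on toy latent codes as follows. The margin value, the use of squared distance in the siamese term and all function names are assumptions for illustration; the same pattern would be repeated for the identity and pose codes.

```python
import math

def siamese_loss(code_a, code_b):
    """Squared distance pulling two codes with the same expression together."""
    return math.dist(code_a, code_b) ** 2

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(0.0, math.dist(anchor, positive)
               - math.dist(anchor, negative) + margin)

def expression_loss(anchor, positive, negative, margin=1.0):
    """Sum of the siamese and triplet terms for one sampled triplet."""
    return (siamese_loss(anchor, positive)
            + triplet_loss(anchor, positive, negative, margin))

# Toy 2D expression codes (hypothetical values).
anchor = (0.0, 0.0)       # image with a known expression
same_expr = (0.0, 1.0)    # same expression, different identity and pose
other_expr = (3.0, 4.0)   # different expression, same identity and pose
well_separated = expression_loss(anchor, same_expr, other_expr)
entangled = expression_loss(anchor, other_expr, same_expr)
```

In the well-separated case only the siamese term is active, while swapping the roles of the two samples activates the triplet term as well, which is the signal that pushes identity and pose information out of the expression code.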
4.3 Complete Objective
Having introduced the losses that we use for disentangling, we now turn to forming our complete training objective.
We control the model learning with a regularization loss defined as follows:
(18) 
where is the scaling parameter in Eq. 4 and is the nonrigid deviation from the mean shape, . We use , and in all our experiments.
Combining this with the reprojection loss, , defined in Eq. 7, we can write the complete objective function, which is trained endtoend:
(19) 
In our experiments, we used the scaling factor for the 3D reprojection loss, . This relatively high scaling factor was chosen so that the reprojection loss is not overpowered by other losses at later training iterations.
For training the LAELux, we also add the albedoshading disentanglement losses, summarised by
(20) 
5 Experiments
5.1 Architectural Choices
Our encoder and decoder architectures are similar to the ones employed in [58], but working on images of size pixels instead of . We use convolutional neural networks with five stridedConv-batchNorm-leakyReLU layers in image encoders, which regress the expansion coefficients. Image decoders consist similarly of five stridedDeconv-batchNorm-ReLU layers.
In all of these experiments the training process was started with a base learning rate of , which was reduced by a factor of every fifty epochs of training. We used the Adam optimizer [39] and a batch size of 64. All training images were cropped and resized to a shape of pixels, while a mesh of size was used in training. This allowed us to sample one keypoint for every four pixels in the UV space, making the mesh fairly high resolution. The mesh was initialized as a Gaussian surface, and was initially positioned so that it faces toward the camera.
5.2 Datasets
We now note the face datasets that we used for our experiments. Certain among them contain side information, for instance multiple views of the same person, or videos of the same person. This side information was used for expression-identity disentanglement experiments, but not for the 3D lifting part. For the reconstruction results our algorithms were only provided with unstructured datasets, unless otherwise noted.

CelebA [44]: This dataset contains about 200,000 in-the-wild images, and is one of the datasets we use to train our DAE. A subset of this dataset, MAFL [84], was also released which contains annotations for five facial landmarks. We use the training set of MAFL in our evaluation experiments, and report results on the test set. Further, as MAFL is a subset of CelebA, we removed the images in the MAFL test set from the CelebA training set before training the DAE.

MultiPIE [23]: MultiPIE contains images of 337 subjects of 7 facial expressions, each of which is captured under 15 viewpoints and 19 illumination conditions simultaneously.

AFLW2000-3D [88]: This dataset consists of 3D fitted faces for the first 2000 images of the AFLW dataset. In this paper, we employ it for evaluation of our learned shapes using 3D landmark localization errors.
5.3 Qualitative Results
In this section, we show examples of the learned 3D shapes. Figure 5 shows visualizations of reconstructed faces from various yaw angles using a model that was trained only on CelebA images. We see that the model learns a shape that expresses the input well. However, without pose information from MultiPIE, and given the completely unsupervised nature of our alignment, it is not able to properly decode side poses. This drawback is quickly overcome when we add weak pose supervision from the MultiPIE dataset, as seen in Figure 7.
6 Face manipulation results
In this section, we show some results of manipulating the expression and pose latent spaces. In Figure 9 (b), we visualize the decoded 3D shape from input images in 9 (a) from various camera angles. Furthermore, in Figure 9 (d), we show results after passing the visualizations in Figure 9 (b) through the refinement network.
Similarly, in Figures 10 and 11 (a)-(e), we interpolate over the expression latent space from each of the images in (a) to the image in (b), and visualize the shape at each intermediate step in (c), the output in (d), and the refined output in (e).
Finally, in Figure 6, we interpolate over all three latent spaces—texture, pose, and shape.
Method  NME 

Thewlis (2017) [66]  
Thewlis (2018) [65]  
Jakub (2018) [28]  
Shu (2018), DAE, no regressor [58]  
Shu (2018), DAE, with regressor [58]  
LAE, CelebA (no regressor)  
LAE, CelebA (with regressor) 
6.1 Landmark Localization
6.2 Albedoshading disentanglement
In Fig. 12 we show that, with the disentangled physical representation for illumination, we can hallucinate changes of illumination with LAE-lux.
6.3 Quantitative Analysis: Landmark Localization
We evaluate our approach quantitatively in terms of landmark localization. Specifically, we evaluate on two datasets: the MAFL test set for 2D landmarks, and AFLW2000-3D for 3D shape. In both cases, as we do not train with ground-truth landmarks, we manually annotate, only once, the necessary landmarks on the base shape as linear combinations of one or more mesh vertices. That is to say, each landmark location corresponds to a linear combination of the locations of several vertices.
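The annotation scheme described above can be illustrated with a short sketch in which a landmark is stored once as a set of vertex weights and then read off any predicted mesh. The function name, the vertex indices and the weights are hypothetical.

```python
def landmark_position(vertices, weights):
    """Landmark as a weighted combination of mesh vertices.

    `weights` maps a vertex index to its weight; it is annotated once on the
    base shape and reused unchanged for every predicted mesh.
    """
    return tuple(sum(w * vertices[i][d] for i, w in weights.items())
                 for d in range(3))

# Hypothetical annotation: a landmark halfway between vertices 1 and 2.
nose_tip_weights = {1: 0.5, 2: 0.5}
predicted_mesh = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 2.0)]
nose_tip = landmark_position(predicted_mesh, nose_tip_weights)
```

Because the weights are tied to vertex indices rather than to image positions, the same annotation follows the mesh as it deforms and rotates from image to image.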
We use five landmarks for the MAFL test set, namely the two eyes, the tip of the nose, and the ends of the mouth. Similarly to [66, 65, 58], we evaluate the extent to which landmarks are captured by our 3D shape model by training a linear regressor to predict them given the locations of the mesh vertices in 3D.
We observe from Table 1 that our system performs on par with the DAE, which is our starting model and as such serves as the upper bound on the performance that we can attain. This shows that we perform the lifting operation successfully without sacrificing localization accuracy. The small increase in error can be attributed to the fact that perfect reconstruction is nearly impossible with a low-dimensional shape model. Furthermore, we use a feed-forward, single-shot camera and shape regression network, while in principle this is a problem that could require iterative model-fitting techniques to align a 3D deformable model to 2D landmarks [51].
We report localization results in 3D on 21 landmarks that feature in the AFLW2000-3D dataset. As our unsupervised system is often unable to locate human ears, the learned face model does not account for them in the UV space. This makes it impossible to evaluate landmark localization for points that lie on or near the ears, which is the case for two of these landmarks. Hence, for the AFLW2000-3D dataset, we report localization accuracies only for 19 landmarks. Furthermore, as an evaluation of the discovered shape, we also show landmark localization results after rigid alignment (without reflection) of the predicted landmarks with the ground truth. We perform Procrustes analysis with and without adding rotation to the alignment, the latter giving us an evaluation of the accuracy of pose estimation as well.
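The rotation-free variant of this alignment can be sketched as follows, under the assumption that it consists of a least-squares translation and isotropic scale only; whether scale is part of the protocol is not stated in the text. The variant with rotation would additionally require an orthogonal fit (for example via an SVD), which is omitted here. Function names are illustrative.

```python
def centroid(points):
    """Mean position of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def align_without_rotation(pred, gt):
    """Map pred onto gt with the least-squares translation and isotropic scale.

    No rotation is estimated, so any residual error also reflects pose error.
    """
    mp, mg = centroid(pred), centroid(gt)
    pc = [tuple(p[k] - mp[k] for k in range(3)) for p in pred]
    gc = [tuple(g[k] - mg[k] for k in range(3)) for g in gt]
    num = sum(a[k] * b[k] for a, b in zip(pc, gc) for k in range(3))
    den = sum(a[k] ** 2 for a in pc for k in range(3))
    if den == 0.0:
        raise ValueError("predicted landmarks are degenerate (all identical)")
    scale = num / den
    return [tuple(scale * a[k] + mg[k] for k in range(3)) for a in pc]


# Toy check: a scaled and shifted copy of the ground truth aligns exactly.
gt_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
pred_pts = [(2 * x + 3, 2 * y - 1, 2 * z + 5) for x, y, z in gt_pts]
aligned = align_without_rotation(pred_pts, gt_pts)
```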
Table 2 also demonstrates the gain achieved by adding weak supervision via the Multi-PIE dataset. We see that the mean NMEs for LAEs trained with and without the Multi-PIE dataset increase as the yaw angle increases. This is also visible in our qualitative results shown in Fig. 7, where we visualize the discovered shapes for both of these cases.
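For reference, a normalized mean error (NME) of the kind reported in the tables can be computed as in the sketch below. The normalizing length is passed in explicitly because the text does not specify it; the inter-ocular distance used in the toy example is an assumption, as are the landmark coordinates.

```python
import math


def nme(pred, gt, normalizer):
    """Mean Euclidean landmark error divided by a normalizing length."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("pred and gt must be non-empty and of equal length")
    if normalizer <= 0:
        raise ValueError("normalizer must be positive")
    errors = [
        math.sqrt(sum((p - g) ** 2 for p, g in zip(pl, gl)))
        for pl, gl in zip(pred, gt)
    ]
    return sum(errors) / (len(errors) * normalizer)


# Toy 2D example with three landmarks; the first two play the role of the eyes.
gt_lm = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0)]
pred_lm = [(3.0, 4.0), (10.0, 0.0), (5.0, 5.0)]
inter_ocular = math.sqrt((gt_lm[1][0] - gt_lm[0][0]) ** 2 + (gt_lm[1][1] - gt_lm[0][1]) ** 2)
score = nme(pred_lm, gt_lm, inter_ocular)
```

The same function applies to 3D landmarks, since the distance is computed over however many coordinates each point has.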
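Method  Rotation  Yaw angle

All
3DDFA [88] (supervised)  Y
3DDFA [88] (supervised)  N
PRNet [17] (supervised)  Y
PRNet [17] (supervised)  N
3DFAN [12] (supervised)  Y
3DFAN [12] (supervised)  N
LAE (64) CelebA  Y
LAE (64) CelebA  N
LAE (128) CelebA  Y
LAE (128) CelebA  N
LAE (128) Multi-PIE  Y
LAE (128) Multi-PIE  N
LAE (128) CelebA+Multi-PIE  Y
LAE (128) CelebA+Multi-PIE  N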
7 Conclusion
In this work we have introduced an unsupervised method for lifting an object category into a 3D representation, allowing us to learn a 3D morphable model of faces from an unorganized photo collection. We have shown that we can use the resulting model for controllable manipulation and editing of observed images.
Deep image-based generative models have shown the ability to deliver photorealistic synthesis results with substantially more diverse categories than faces [11, 32]. We anticipate that their combination with 3D representations like LAEs will further unleash their potential for controllable image synthesis.
Appendix A Additional Details
In this section, we note some additional implementation details.
A.1 Data Processing
In our experiments, we used images of size pixels, which were cropped from the CelebA and Multi-PIE datasets using ground-truth bounding boxes.
For CelebA images, the cropping was performed by extracting a square patch around the face with side length equal to the length of the longer side of the bounding box. The patch was then adjusted so that it lies entirely inside the image (by translating it horizontally or vertically, or scaling it down if necessary). Finally, we tightened the resulting box by pixels from each side, as the bounding boxes are quite loose crops, and resized the resulting square image to . We use all images from CelebA for training (about images) except the MAFL test set, which is contained entirely in CelebA ( images).
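The CelebA cropping steps can be sketched as below. This is a minimal illustration, not our exact preprocessing code: the function name is hypothetical, the patch is assumed to be centered on the bounding box, and `margin` stands in for the tightening amount as a free parameter.

```python
def square_crop(bbox, image_size, margin=0):
    """Compute a square crop (left, top, side) from a face bounding box.

    bbox:       (x, y, w, h) of the bounding box.
    image_size: (width, height) of the full image.
    margin:     pixels removed from each side (hypothetical placeholder value).
    """
    x, y, w, h = bbox
    img_w, img_h = image_size
    # Side equals the longer bbox side, scaled down if it cannot fit the image.
    side = min(max(w, h), img_w, img_h)
    if side <= 2 * margin:
        raise ValueError("margin is too large for this crop")
    # Center the square on the bounding box (an assumption of this sketch).
    left = x + w / 2.0 - side / 2.0
    top = y + h / 2.0 - side / 2.0
    # Translate the square so that it lies entirely inside the image.
    left = min(max(left, 0.0), img_w - side)
    top = min(max(top, 0.0), img_h - side)
    # Tighten the loose box by `margin` pixels on every side.
    return left + margin, top + margin, side - 2 * margin


inside = square_crop((10, 20, 40, 60), (100, 100))
shifted = square_crop((70, 10, 40, 60), (100, 100), margin=5)
shrunk = square_crop((0, 0, 200, 100), (120, 150))
```

The returned square would then be resized to the network input resolution.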
For the Multi-PIE dataset, we crop the face images according to landmark positions on the eyes and the corners of the mouth, and the width of the frontal face. Specifically, we use the mean coordinates of the 4 landmarks as the center of the crop, and use the width of the face as the width of the images. We use the method proposed in [12] to detect the landmarks. For each person, the crop is identical across all illumination conditions for the same camera.
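A minimal sketch of this landmark-based crop follows. The function name and the example coordinates are hypothetical, and the crop is assumed to be square with side equal to the face width.

```python
def landmark_crop(landmarks, face_width):
    """Square crop (left, top, side) centered on the mean of the landmarks.

    landmarks:  list of (x, y) points (two eyes and two mouth corners).
    face_width: width of the frontal face, used as the crop side.
    """
    cx = sum(p[0] for p in landmarks) / len(landmarks)
    cy = sum(p[1] for p in landmarks) / len(landmarks)
    return cx - face_width / 2.0, cy - face_width / 2.0, face_width


# Hypothetical detections for one person and one camera.
crop = landmark_crop([(40, 50), (60, 50), (45, 80), (55, 80)], face_width=80)
```

Since the crop is identical across illumination conditions, it would be computed once per person and camera and then reused for all corresponding images.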
A.2 Architecture Details
We used convolutional encoders and decoders similar to the ones described in [58]. We detail the architectures here again for completeness. The convolutional encoder architecture is:
Conv(32)-LeakyReLU-Conv(64)-BN-LeakyReLU-Conv(128)-BN-LeakyReLU-Conv(256)-BN-LeakyReLU-Conv(256)-BN-LeakyReLU-Conv(Nz)-Sigmoid;
while the convolutional decoder architecture is:
ConvT(512)-BN-ReLU-ConvT(256)-BN-ReLU-ConvT(128)-BN-ReLU-ConvT(64)-BN-ReLU-ConvT(32)-BN-ReLU-ConvT(32)-BN-ReLU-ConvT(Nc)-Threshold(0,1).
A.3 Refinement Networks
The refinement setup consists of a generator network and a discriminator network. The generator is a standard U-Net [53] for images that are downsampled to in the innermost latent layer.
The discriminator is a PatchGAN discriminator [27] with the following architecture:
Conv(64)-LeakyReLU-Conv(128)-BN-LeakyReLU-Conv(256)-BN-LeakyReLU-Conv(512)-BN-LeakyReLU-Conv(1)
A.4 Implementation Details
We implemented our system in Python 3.6 using the PyTorch library. We use convolutional, activation, and batch norm layers predefined in the torch.nn module, and take advantage of the Autograd [50] framework to take care of the gradients required by backpropagation.
A.5 Rotation Modeling
Modelling rotations using quaternions has several advantages over modelling them using Euler angles, including computational ease, less ambiguity, and compact representation [16]. Quaternions were also employed by [31] to model mesh rotations. Following these works, we also use quaternions in our framework to model rotations, by regressing them from the camera latent space, and normalizing them to unit length.
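The normalization step and the conversion of a unit quaternion to a rotation matrix can be sketched as follows. We assume the common (w, x, y, z) component ordering, which is not specified in the text; the function names are illustrative and the input stands for a raw, unnormalized regression output.

```python
import math


def normalize_quaternion(q):
    """Scale a 4-vector to unit length so that it represents a rotation."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero quaternion")
    return tuple(c / norm for c in q)


def quaternion_to_matrix(q):
    """Convert a quaternion in (w, x, y, z) order to a 3x3 rotation matrix."""
    w, x, y, z = normalize_quaternion(q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


# An unnormalized output that corresponds to a 90-degree rotation about z.
R = quaternion_to_matrix((1.0, 0.0, 0.0, 1.0))
# The identity rotation, given with an arbitrary positive scale.
I = quaternion_to_matrix((2.0, 0.0, 0.0, 0.0))
```

Normalizing inside the conversion means any nonzero regressed 4-vector yields a valid rotation, which is one reason the representation is convenient to regress.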
A.6 The Neural Mesh Renderer
The Neural Mesh Renderer [34] is a recently proposed module that can be inserted into a neural network to enable end-to-end training with a rendering operation. The renderer provides approximate gradients, which make it possible to learn texture and shape given the output rendering. The original module was released in Chainer [1], but we use a publicly available PyTorch port of this module [2]. The renderer in our framework accepts a texture image, the mean shape, the deviation from the mean shape, and the camera parameters, and outputs a 2D reconstruction of the original image.
A.7 Training Procedure
To train the LAE, we first train a DAE on the training data. We then fix the DAE and use it to extract dense correspondences between the image space and the canonical space. These correspondences are used in the objective of the 3D reprojection loss (Equations 6 and 7 in the paper).
To obtain image-specific camera, translation, and shape estimates, we train another convolutional encoder. This encoder learns a disentangled latent space where the shape estimates and the camera and translation estimates are encoded by different vectors. For the Multi-PIE experiments, the shape latent vector is further divided into identity and expression vectors. We use linear layers to regress camera, translation, and shape estimates from their latent encodings.
We train our system using the Adam [39] optimizer for all learnable parameters. We start with a learning rate of , which is decayed every training epochs by a factor of . We train for a total of epochs.
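The step-wise decay described here can be written as a small function. The numeric values in the example are placeholders chosen only to make the arithmetic easy to follow and are not the settings used in our experiments; we also assume that the rate is multiplied by the decay factor at each step.

```python
def step_decay_lr(initial_lr, decay_factor, decay_every, epoch):
    """Learning rate at a given epoch under step-wise decay.

    The rate is multiplied by decay_factor once every decay_every epochs.
    All arguments are hypothetical placeholders in this sketch.
    """
    if decay_every <= 0:
        raise ValueError("decay_every must be positive")
    return initial_lr * decay_factor ** (epoch // decay_every)


# Placeholder schedule: start at 1.0 and halve every 10 epochs.
schedule = [step_decay_lr(1.0, 0.5, 10, e) for e in (0, 9, 10, 25)]
```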
References
 [1] The neural mesh renderer on github. https://github.com/hiroharukato/neural_renderer.
 [2] A pytorch port of the neural mesh renderer on github. https://github.com/daniilidisgroup/neural_renderer.
 [3] S. Agarwal, N. Snavely, I. Simon, S. M. Seitz, and R. Szeliski. Building rome in a day. In IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27  October 4, 2009, pages 72–79, 2009.
 [4] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade. Nonrigid structure from motion in trajectory space. In Advances in neural information processing systems, pages 41–48, 2009.
 [5] H. A. Alhaija, S. K. Mustikovela, A. Geiger, and C. Rother. Geometric image synthesis. CoRR, abs/1809.04696, 2018.
 [6] J. Barron and J. Malik. Shape, illumination, and reflectance from shading. Technical Report UCB/EECS2013117, EECS Department, University of California, Berkeley, May 2013.
 [7] H. Barrow, J. Tenenbaum, A. Hanson, and E. Riseman. Recovering intrinsic scene characteristics. Comput. Vis. Syst, 2:3–26, 1978.
 [8] B. Biggs, T. Roddick, A. W. Fitzgibbon, and R. Cipolla. Creatures great and SMAL: recovering the shape and motion of animals from video. CoRR, abs/1811.05804, 2018.
 [9] J. Booth, A. Roussos, A. Ponniah, D. Dunaway, and S. Zafeiriou. Large scale 3d morphable models. International Journal of Computer Vision, 2018.
 [10] C. Bregler, A. Hertzmann, and H. Biermann. Recovering nonrigid 3d shape from image streams. In cvpr, volume 2, page 2690. Citeseer, 2000.
 [11] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
 [12] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In International Conference on Computer Vision, 2017.
 [13] J. Carreira, S. Vicente, L. Agapito, and J. Batista. Lifting object detection datasets into 3d. IEEE transactions on pattern analysis and machine intelligence, 38(7):1342–1355, 2016.
 [14] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Neural Information Processing Systems, 2016.
 [15] Y. Dai, H. Li, and M. He. A simple priorfree method for nonrigid structurefrommotion factorization. International Journal of Computer Vision, 107(2):101–122, 2014.
 [16] E. B. Dam, M. Koch, and M. Lillholm. Quaternions, interpolation and animation. Technical report, 1998.
 [17] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou. Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV, 2018.

 [18] R. Garg, A. Roussos, and L. Agapito. Dense variational reconstruction of nonrigid surfaces from monocular video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1272–1279, 2013.
 [19] U. Gaur and B. S. Manjunath. Weakly supervised manifold learning for dense semantic object correspondence. In ICCV, 2017.
 [20] P. V. Gehler, C. Rother, M. Kiefel, L. Zhang, and B. Schölkopf. Recovering intrinsic images with a global sparsity prior on reflectance. In NIPS, 2011.
 [21] K. Genova, F. Cole, A. Maschinot, A. Sarna, D. Vlasic, and W. T. Freeman. Unsupervised training for 3d morphable model regression. In CVPR, 2018.
 [22] P. F. Gotardo, T. Simon, Y. Sheikh, and I. Matthews. Photogeometric scene flow for highdetail dynamic 3d reconstruction. In Proceedings of the IEEE International Conference on Computer Vision, pages 846–854, 2015.
 [23] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multipie. Image Vision Comput., 28(5):807–813, May 2010.
 [24] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. 2003.
 [25] P. Henzler, N. Mitra, and T. Ritschel. Escaping plato’s cave using adversarial training: 3d shape from unstructured 2d image collections. arXiv preprint arXiv:1811.11606, 2018.
 [26] M. Hernandez, T. Hassner, J. Choi, and G. G. Medioni. Accurate 3d face reconstruction via prior constrained structure from motion. Computers & Graphics, 66:14–22, 2017.
 [27] P. Isola, J.Y. Zhu, T. Zhou, and A. A. Efros. Imagetoimage translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
 [28] T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi. Unsupervised learning of object landmarks through conditional image generation. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 4016–4027. Curran Associates, Inc., 2018.
 [29] M. Janner, J. Wu, T. D. Kulkarni, I. Yildirim, and J. Tenenbaum. Selfsupervised intrinsic image decomposition. In NIPS, 2017.
 [30] D. Jimenez Rezende, S. M. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg, and N. Heess. Unsupervised learning of 3d structure from images. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4996–5004. Curran Associates, Inc., 2016.
 [31] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning categoryspecific mesh reconstruction from image collections. In ECCV, 2018.
 [32] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
 [33] T. Karras, S. Laine, and T. Aila. A stylebased generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2018.
 [34] H. Kato, Y. Ushiku, and T. Harada. Neural 3d mesh renderer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [35] I. KemelmacherShlizerman. Internet based morphable model. In Proceedings of the IEEE International Conference on Computer Vision, pages 3256–3263, 2013.
 [36] I. KemelmacherShlizerman and S. M. Seitz. Face reconstruction in the wild. In 2011 International Conference on Computer Vision, pages 1746–1753. IEEE, 2011.
 [37] I. KemelmacherShlizerman and S. M. Seitz. Collection flow. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1792–1799. IEEE, 2012.
 [38] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Niessner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep video portraits. ACM Trans. Graph., 37(4):163:1–163:14, July 2018.
 [39] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [40] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for oneshot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
 [41] C. Kong and S. Lucey. Deep interpretable nonrigid structure from motion. CoRR, abs/1902.10840, 2019.
 [42] N. Kong, P. V. Gehler, and M. J. Black. Intrinsic video. In Computer Vision  ECCV 2014  13th European Conference, Zurich, Switzerland, September 612, 2014, Proceedings, Part II, pages 360–375, 2014.
 [43] M. R. Koujan and A. Roussos. Combining dense nonrigid structure from motion and 3d morphable models for monocular 4d face reconstruction. In CVMP, 2018.
 [44] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
 [45] Q. LiuYin, R. Yu, L. Agapito, A. Fitzgibbon, and C. Russell. Better together: Joint reasoning for nonrigid 3d reconstruction with specularities and shading. arXiv preprint arXiv:1708.01654, 2017.

 [46] R. Memisevic and G. E. Hinton. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 2010.
 [47] T. Narihira, M. Maire, and S. X. Yu. Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. In ICCV, 2015.
 [48] D. Novotny, D. Larlus, and A. Vedaldi. Learning 3d object categories by looking around them. In Proceedings of the IEEE International Conference on Computer Vision, pages 5218–5227, 2017.
 [49] M. Paladini, A. Del Bue, M. Stosic, M. Dodig, J. Xavier, and L. Agapito. Factorization for nonrigid and articulated structure using metric projections. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2898–2905. IEEE, 2009.
 [50] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
 [51] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis. 6dof object pose from semantic keypoints. In ICRA, 2017.
 [52] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. MorenoNoguer. Ganimation: Anatomicallyaware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 818–833, 2018.
 [53] O. Ronneberger, P. Fischer, and T. Brox. Unet: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computerassisted intervention, pages 234–241. Springer, 2015.
 [54] D. Samaras, D. Metaxas, P. Fua, and Y. G. Leclerc. Variable albedo surface reconstruction from stereo and shape from shading. In IEEE International Conference on Computer Vision and Pattern Recognition, pages I: 480–487, 2000.
 [55] J. L. Schönberger and J.M. Frahm. Structurefrommotion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [56] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs. Sfsnet : Learning shape, reflectance and illuminance of faces in the wild. In CVPR, 2018.
 [57] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs. Sfsnet: Learning shape, reflectance and illuminance of faces in the wild. In Computer Vision and Pattern Recognition (CVPR), 2018.
 [58] Z. Shu, M. Sahasrabudhe, R. A. Güler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In European Conference on Computer Vision, 2018.
 [59] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In CVPR, 2017.
 [60] T. Simon, J. Valmadre, I. Matthews, and Y. Sheikh. Separable spatiotemporal priors for convex reconstruction of timevarying 3d point clouds. In European Conference on Computer Vision, pages 204–219. Springer, 2014.
 [61] M. Sundermeyer, Z.C. Marton, M. Durner, M. Brucker, and R. Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 699–715, 2018.
 [62] A. Tewari, F. Bernard, P. Garrido, G. Bharaj, M. Elgharib, H. Seidel, P. Pérez, M. Zollhöfer, and C. Theobalt. FML: face model learning from videos. CoRR, abs/1812.07603, 2018.
 [63] A. Tewari, F. Bernard, P. Garrido, G. Bharaj, M. Elgharib, H. Seidel, P. Pérez, M. Zollhöfer, and C. Theobalt. FML: face model learning from videos. CoRR, abs/1812.07603, 2018.
 [64] J. Thewlis, H. Bilen, and A. Vedaldi. Unsupervised learning of object frames by dense equivariant image labelling. In NIPS, 2017.
 [65] J. Thewlis, H. Bilen, and A. Vedaldi. Unsupervised learning of object frames by dense equivariant image labelling. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 844–855. Curran Associates, Inc., 2017.
 [66] J. Thewlis, H. Bilen, and A. Vedaldi. Unsupervised learning of object landmarks by factorized spatial embeddings. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 3229–3238, Oct 2017.
 [67] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2face: Realtime face capture and reenactment of RGB videos. In CVPR, 2016.
 [68] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9(2):137–154, 1992.
 [69] L. Torresani, A. Hertzmann, and C. Bregler. Nonrigid structurefrommotion: Estimating shape and motion with hierarchical priors. IEEE transactions on pattern analysis and machine intelligence, 30(5):878–892, 2008.
 [70] L. Tran and X. Liu. Nonlinear 3d face morphable model. In Proceedings of IEEE Computer Vision and Pattern Recognition, Salt Lake City, UT, June 2018.
 [71] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multiview supervision for singleview reconstruction via differentiable ray consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2626–2634, 2017.
 [72] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5038–5047, 2017.
 [73] T. Vetter, M. J. Jones, and T. A. Poggio. A bootstrapping algorithm for learning linear models of object classes. In 1997 Conference on Computer Vision and Pattern Recognition (CVPR ’97), June 1719, 1997, San Juan, Puerto Rico, pages 40–46, 1997.
 [74] Y. Wang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras. Face relighting from a single image under harsh lighting conditions. In IEEE International Conference on Computer Vision and Pattern Recognition, 2007.
 [75] Y. Wang, L. Zhang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras. Face relighting from a single image under arbitrary unknown lighting conditions. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(11):1968 –1984, nov. 2009.
 [76] O. Wiles, A. Sophia Koepke, and A. Zisserman. X2face: A network for controlling face generation using images, audio, and pose codes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 670–686, 2018.
 [77] O. Wiles and A. Zisserman. 3d surface reconstruction by pointillism. In ECCV Workshop on Geometry Meets Deep Learning, 2018.
 [78] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In CVPR, 2016.
 [79] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Interpretable transformations with encoderdecoder networks. In CVPR, 2017.
 [80] J. Wu, Y. Wang, T. Xue, X. Sun, W. T. Freeman, and J. B. Tenenbaum. Marrnet: 3d shape reconstruction via 2.5d sketches. In NIPS, 2017.
 [81] S. Yao, T. M. Hsu, J.Y. Zhu, J. Wu, A. Torralba, B. Freeman, and J. Tenenbaum. 3daware scene manipulation via inverse graphics. In Advances in Neural Information Processing Systems, pages 1891–1902, 2018.
 [82] Y. Yu and W. A. P. Smith. Inverserendernet: Learning single image inverse rendering. CoRR, abs/1811.12328, 2018.
 [83] L. Zhang, S. Wang, and D. Samaras. Face synthesis and recognition under arbitrary unknown lighting using a spherical harmonic basis morphable model. In IEEE International Conference on Computer Vision and Pattern Recognition, pages II:209–216, 2005.
 [84] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multitask learning. In European Conference on Computer Vision 2014, 2014.
 [85] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and egomotion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1851–1858, 2017.
 [86] T. Zhou, P. Krähenbühl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3dguided cycle consistency. In CVPR, 2016.
 [87] J. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. B. Tenenbaum, and W. T. Freeman. Visual object networks: Image generation with disentangled 3d representation. CoRR, abs/1812.02725, 2018.
 [88] X. Zhu, Z. Lei, S. Z. Li, et al. Face alignment in full pose range: A 3d total solution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.