1 Introduction
Disentangling factors of variation is important for the broader goal of controlling and understanding deep networks, but also for applications such as image manipulation through interpretable operations. Progress in disentangling the latent space of deep generative models has facilitated the separation of latent image representations into dimensions that account for independent factors of variation, such as identity, illumination, normals, and spatial support [1, 2, 3, 4], low-dimensional transformations, such as rotation, translation, or scaling [5, 6, 7], or finer levels of variation, including age, gender, wearing glasses, or other attributes, e.g. [2, 8], for particular classes, such as faces.
Shape variation is more challenging, as it amounts to a transformation of a function's domain rather than its values. Even simple, supervised additive models of shape result in complex nonlinear optimization problems [9, 10]. Despite this challenge, several works in the previous decade aimed at learning shape/appearance factorizations in an unsupervised manner by exploring groupwise image alignment [11, 12, 13, 14]. In the context of deep learning, several works have aimed at incorporating deformations and alignment in a supervised setting, including Spatial Transformers [15], Deep Epitomic Networks [16], Deformable CNNs [17], Mass Displacement Networks [18], Mnemonic Descent [19], and DenseReg [20]. These works have shown that one can improve the accuracy of both classification and localization tasks by injecting deformations and alignment within traditional CNN architectures.
Turning to unsupervised deep learning, even though most works focus on rigid or low-dimensional parametric deformations, e.g. [5, 6], several works have attempted to incorporate richer non-rigid deformations within learning. One thread of work has aimed at dynamically rerouting the processing of information within the network's graph based on the input, starting from neural computation arguments [21, 22, 23] and eventually translating into concrete algorithms, such as the 'capsule' works of [24, 25] that bind neurons on-the-fly. Still, these works lack a transparent, parametric handling of non-rigid deformations. Working in a more geometric direction, several works have recently aimed at recovering dense correspondences between pairs [26] or sets of RGB images, as e.g. in the recent works of [27, 28]. These works, however, do not have the notion of a reference coordinate system ('template') to which images can be mapped, which makes image generation and manipulation harder. More recently, [29] used the equivariance principle to align sets of images to a common coordinate system, but did not develop this into a full-blown generative model of images.
Our work pushes the envelope of this line of research by following the deformable template paradigm [30, 31, 9, 32, 10]. In particular, we consider that object instances are obtained by deforming a prototypical object, or 'template', through dense, diffeomorphic deformation fields. This makes it possible to factor object variability within a category into variations that are associated with spatial transformations, generally linked to the object's 2D/3D shape, and variations that are associated with appearance (or 'texture' in graphics), e.g. due to facial hair, skin color, or illumination. We consider that both sources of variation can be modelled in terms of a low-dimensional latent code that is learnable in an unsupervised manner from images. We achieve disentangling by breaking this latent code into separate parts that are fed into separate decoder networks that deliver appearance and deformation estimates. Even though one could hope that a generic convolutional architecture would learn to represent such effects, we argue that explicitly injecting this inductive bias into a network helps training, while also yielding control over the generative process.
Our main contributions in this work can be summarized as follows:
First, we introduce the Deforming Autoencoder architecture, bringing together the deformable modeling paradigm with unsupervised deep learning. We treat the template-to-image correspondence task as that of predicting a smooth and invertible transformation. As shown in Fig. 1, our network predicts this transformation field along with the template-aligned appearance, and subsequently deforms the synthesized appearance to generate an image similar to its input. This allows for a disentanglement of the shape and appearance parts of image generation by explicitly modelling the effects of image deformation during the decoding stage.
Second, we explore different ways in which deformations can be represented and predicted by the decoder. Instead of building a generic deformation model, we compose a global, affine deformation field with a non-rigid field that is synthesized by a convolutional decoder network. We develop a method that allows us to constrain the synthesized field to be a diffeomorphism, namely an invertible and smooth transformation, and show that it simplifies training and improves accuracy. We also show that class-related information can be exploited, when available, to learn better deformation models: this yields sharper images and can be used to learn models that jointly account for multiple classes, e.g. all MNIST digits.
Third, we show that disentangling appearance from deformation comes with several advantages for modeling and manipulating images. Through disentangling we obtain clearly better synthesis results when manipulating images for tasks such as expression, pose, or identity interpolation, compared to standard autoencoder architectures. Along the same lines, we show that accounting for deformations facilitates a further disentangling of the appearance components into an intrinsic shading-albedo decomposition, which completely fails when naively performed in the original image coordinates. This allows us to perform re-shading through simple operations on the latent shading coordinate space.
We complement these qualitative results with a quantitative analysis of the learned model in terms of landmark localization accuracy. We show that our method does not fall far below supervised methods and outperforms by a margin the latest state-of-the-art work on self-supervised correspondence estimation [29], even though we never explicitly trained our network for correspondence estimation, but rather only aimed at reconstructing pixel intensities.
2 Deforming Autoencoders
Our architecture embodies the deformable template paradigm in an autoencoder architecture. The premise of our work is that image generation can be interpreted as the combination of two processes: a synthesis of appearance on a deformation-free coordinate system ('template'), followed by a subsequent deformation that introduces shape variability. Denoting by $T(p)$ the value of the synthesized appearance (or texture) at coordinate $p$ and by $W(p)$ the estimated deformation field, we consider that the observed image, $I(p)$, can be reconstructed as follows:

$I(p) = T(W(p))$,    (1)

namely the image appearance at position $p$ is obtained by looking up the synthesized appearance at position $W(p)$. This is implemented in terms of a spatial transformer layer [15] that allows us to pass gradients through the warping process.
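For concreteness, the lookup of Eq. (1) can be realized with a differentiable bilinear sampler; a minimal PyTorch sketch (tensor shapes and function names are our assumptions, not the paper's released code) follows:

```python
import torch
import torch.nn.functional as F

def warp_texture(texture: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Differentiable lookup I(p) = T(W(p)).

    texture: (B, C, H, W) template-aligned appearance T.
    grid:    (B, H, W, 2) sampling coordinates W(p), normalized to [-1, 1].
    """
    # Bilinear sampling lets gradients flow to both the texture decoder and
    # the deformation decoder, as in a spatial transformer [15].
    return F.grid_sample(texture, grid, mode='bilinear', align_corners=False)
```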
The appearance and deformation functions are synthesized by independent decoder networks. The inputs to the decoders are delivered by a joint encoder network that takes the observed image as input and delivers a low-dimensional latent representation, $Z$, of shape and appearance. This is split into two parts, $Z = [Z_T, Z_S]$, which feed into the appearance and shape networks respectively, providing us with a clear separation of shape and appearance.
2.1 Deformation field modeling
Rather than leave deformation modeling entirely to back-propagation, we use some domain knowledge to simplify and accelerate learning. The first observation is that global deformation aspects can be expressed using low-dimensional linear models. We account for global deformations by an affine Spatial Transformer layer, which uses a six-dimensional input to synthesize a deformation field as an expansion on a fixed basis [15]. This means that the shape representation $Z_S$ described above is decomposed into two parts, $Z_S = [Z_A, Z_W]$, where $Z_A$ accounts for the affine part and $Z_W$ for the non-rigid, learned part of the deformation field. These deformation fields are generated by separate decoders and are composed so that the affine transformation warps the detailed non-rigid warps to the image positions where they should apply. This is also a common decomposition in deformable models for faces [9, 10].
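As an illustration, one way to realize this composition is to apply the affine map on top of the coordinates produced by the non-rigid decoder; a hedged PyTorch sketch (tensor names, shapes, and the composition order are our assumptions) is:

```python
import torch

def compose_warps(theta: torch.Tensor, local_grid: torch.Tensor) -> torch.Tensor:
    """Apply a global affine map to a decoded non-rigid warping grid.

    theta:      (B, 2, 3) affine parameters decoded from Z_A.
    local_grid: (B, H, W, 2) non-rigid sampling grid decoded from Z_W.
    """
    A, t = theta[:, :, :2], theta[:, :, 2]               # linear part, translation
    out = torch.einsum('bij,bhwj->bhwi', A, local_grid)  # A @ W_local(p), per pixel
    return out + t[:, None, None, :]                     # + t

# This realizes the composition W(p) = A(W_local(p)): the affine transform
# positions the detailed non-rigid warps where they should apply.
```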
Turning to local deformation effects, we quickly realized that not every deformation field is plausible. Without appropriate regularization, we would often obtain deformation fields that expand small areas to occupy whole regions, and/or are non-diffeomorphic, meaning that the deformation can spread a connected texture pattern to a disconnected image area (Fig. 2(f)).
To prevent this problem, instead of making the shape decoder CNN directly predict the local warping field $W$, we consider a 'differential decoder' that generates the spatial gradients of the warping field: $\nabla_x W_x$ and $\nabla_y W_y$, where $W_x, W_y$ denote the horizontal and vertical components of the warping field and $\nabla_x, \nabla_y$ the respective spatial derivatives. These two quantities measure the offsets between consecutive pixels: for instance, $\nabla_x W_x = 1$ amounts to translation along the horizontal axis, $\nabla_x W_x = 2$ amounts to horizontal scaling by a factor of 2, while $\nabla_x W_x = -1$ amounts to left-right flipping; a similar behavior is associated with $\nabla_y W_y$ in the vertical axis. We note that global rotations are handled by the affine warping field, and the cross terms $\nabla_x W_y, \nabla_y W_x$ are associated with small local rotations of minor importance; we therefore focus on $\nabla_x W_x$ and $\nabla_y W_y$.
Having access to these two values gives us a handle on the deformation field, since we can prevent folding and excessive stretching by controlling $\nabla_x W_x$ and $\nabla_y W_y$.
In particular, we pass the outputs of our differential decoder through a Rectified Linear Unit (ReLU) module, which enforces positive horizontal offsets between horizontally adjacent pixels and positive vertical offsets between vertically adjacent pixels. We subsequently apply a spatial integration layer, implemented as a fixed network layer, on top of the output of the ReLU layer to reconstruct the warping field from its spatial gradient. By doing so, the deformation module enforces the generation of smooth and regular warping fields that avoid self-crossings. In practice we found that also clipping the decoded offsets at a maximal value $\tau$ significantly eases training, which amounts to replacing the ReLU layer, $\max(x, 0)$, with a $\min(\max(x, 0), \tau)$ layer. In our experiments, we set $\tau$ inversely proportional to $N$, where $N$ denotes the number of pixels along one dimension of the image.
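A minimal sketch of this 'clip then integrate' module follows (PyTorch; the module name, tensor shapes, and the use of cumulative sums for the fixed integration layer are our assumptions):

```python
import torch
import torch.nn as nn

class IntegralWarp(nn.Module):
    """Turn decoded warp gradients into a monotone, fold-free warping field."""

    def __init__(self, tau: float):
        super().__init__()
        self.tau = tau  # maximal per-pixel offset

    def forward(self, grad_x: torch.Tensor, grad_y: torch.Tensor):
        # grad_x, grad_y: (B, 1, H, W) raw decoder outputs for the spatial
        # gradients of W_x and W_y. Clamping to [0, tau] is the replacement
        # of the ReLU described above.
        gx = torch.clamp(grad_x, min=0.0, max=self.tau)
        gy = torch.clamp(grad_y, min=0.0, max=self.tau)
        # Spatial integration: cumulative sums reconstruct W from its gradient;
        # non-negative increments guarantee monotonicity, hence no self-crossing.
        wx = torch.cumsum(gx, dim=3)  # integrate along x
        wy = torch.cumsum(gy, dim=2)  # integrate along y
        return wx, wy
```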
2.2 Class-aware Deforming Autoencoder

We can require our network's latent representation to be predictive not only of shape and appearance, but also of instance class, if that is available during training. We note that this information, being discrete, may be easier to acquire than the actual deformation field, which would require manual landmark annotation. For instance, for faces such discrete information could represent the expression or a person's identity.
In particular, we consider that the latent representation can be decomposed as $Z = [Z_T, Z_S, Z_C]$, where $Z_T, Z_S$ are, as previously, the appearance- and shape-related parts of the representation, respectively, while $Z_C$ is fed as input to a sub-network trained to predict the class associated with the input image. Apart from assisting the classification task, the latent vector $Z_C$ is fed into both the appearance and shape decoders. Intuitively, this allows our decoder networks to learn a mixture model that is conditioned on class information, rather than treating the joint, multimodal distribution through a monolithic model. Even though the class label is only used during training, and not for reconstruction, our experimental results show that a network trained with class supervision can deliver more accurate synthesis results.
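A sketch of the resulting latent split and classification head (the dimensions and the one-layer classifier are illustrative assumptions):

```python
import torch
import torch.nn as nn

class ClassAwareSplit(nn.Module):
    """Split Z into [Z_T, Z_S, Z_C] and classify from Z_C."""

    def __init__(self, dim_t=16, dim_s=128, dim_c=16, n_classes=10):
        super().__init__()
        self.dims = (dim_t, dim_s, dim_c)
        self.classifier = nn.Linear(dim_c, n_classes)  # trained with cross-entropy

    def forward(self, z: torch.Tensor):
        z_t, z_s, z_c = torch.split(z, self.dims, dim=1)
        logits = self.classifier(z_c)
        # Z_C conditions both decoders, so each learns a class-conditioned
        # mixture component rather than a monolithic multimodal model.
        appearance_in = torch.cat([z_t, z_c], dim=1)
        shape_in = torch.cat([z_s, z_c], dim=1)
        return appearance_in, shape_in, logits
```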
2.3 Intrinsic Deforming Autoencoder: Deformation, Albedo and Shading Decomposition
Having outlined Deforming Autoencoders, we now use them to model complex physical image signals, such as illumination effects, without a supervision signal. For this we design the Intrinsic Deforming Autoencoder (Intrinsic-DAE) to model shading and albedo for in-the-wild face images. As shown in Fig. 4(a), we introduce two separate decoders for shading $S$ and albedo $A$, each of which has the same structure as the original texture decoder. The texture is computed as $T = S \odot A$, where $\odot$ denotes the Hadamard product.
In order to model the physical properties of shading and albedo, we follow the intrinsic decomposition regularization used in [2]: we apply an L2 smoothness loss on the gradient of $S$, meaning that shading is expected to be smooth, while leaving the albedo unconstrained. As shown in Fig. 4 and more extensively in the experimental results section, when used in tandem with a Deforming Autoencoder this allows us to successfully decompose a face image into shape, albedo, and shading components, while a standard autoencoder completely fails to decompose unaligned images into shading and albedo.
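A short sketch of the intrinsic texture head and the shading smoothness penalty (the loss weight and finite-difference form are our assumptions):

```python
import torch

def intrinsic_texture(shading: torch.Tensor, albedo: torch.Tensor, lam: float = 1e-6):
    """Compose T = S (Hadamard) A and penalize non-smooth shading."""
    texture = shading * albedo  # element-wise (Hadamard) product
    # L2 penalty on horizontal and vertical shading gradients; the albedo
    # is left unconstrained, following [2].
    ds_x = shading[:, :, :, 1:] - shading[:, :, :, :-1]
    ds_y = shading[:, :, 1:, :] - shading[:, :, :-1, :]
    e_smooth = lam * (ds_x.pow(2).mean() + ds_y.pow(2).mean())
    return texture, e_smooth
```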
2.4 Training
Our objective function is formed as the sum of three losses, combining the reconstruction error with the regularization terms required for the modules described above. Concretely, the loss of the deforming autoencoder can be written as

$E_{DAE} = E_{Reconstruction} + E_{Warp}$,    (2)

where the reconstruction loss is defined as the standard $\ell_2$ loss between the input image $I$ and its reconstruction $\hat{I}$,

$E_{Reconstruction} = \| I - \hat{I} \|_2^2$,    (3)

and the warping loss is decomposed as follows:

$E_{Warp} = E_{Smooth} + E_{BiasReduce}$.    (4)

In particular, the smoothness cost, $E_{Smooth}$, penalizes quickly-changing deformations encoded by the local warping field. It is measured in terms of the total variation norm of the horizontal and vertical differential warping fields and is given by

$E_{Smooth} = \lambda_1 \left( \| \nabla (\nabla_x W_x) \|_1 + \| \nabla (\nabla_y W_y) \|_1 \right)$,    (5)

where $\lambda_1$ is a weighting constant. Finally, $E_{BiasReduce}$ aims at removing any systematic bias introduced by the fitting process, e.g. the average template becoming small, or a distorted version of the data. It consists of regularization on (1) the affine parameters, defined as the L2 distance between the predicted parameters $S_A$ and the identity affine transform $S_0$, and (2) the free-form deformations, defined as the L2 distance between the average deformation grid within a minibatch, $\bar{W}$, and the identity grid $W_0$:

$E_{BiasReduce} = \lambda_2 \left( \| S_A - S_0 \|_2^2 + \| \bar{W} - W_0 \|_2^2 \right)$,    (6)

where $\lambda_2$ is a weighting constant.
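The objective of Eqs. (2)-(6) can be assembled as in the following sketch (the finite-difference TV norm, identity-grid construction, and loss weights are our assumptions):

```python
import torch
import torch.nn.functional as F

def tv_norm(g: torch.Tensor) -> torch.Tensor:
    # total variation of a (B, 1, H, W) differential warping field
    return (g[:, :, :, 1:] - g[:, :, :, :-1]).abs().mean() + \
           (g[:, :, 1:, :] - g[:, :, :-1, :]).abs().mean()

def dae_loss(img, recon, grad_x, grad_y, theta, grid, lam1=1e-6, lam2=1e-2):
    e_recon = (img - recon).pow(2).mean()                  # Eq. (3)
    e_smooth = lam1 * (tv_norm(grad_x) + tv_norm(grad_y))  # Eq. (5)
    # Eq. (6): keep the affine parameters close to identity and the
    # batch-average deformation grid close to the identity grid.
    B, H, W, _ = grid.shape
    s0 = torch.tensor([[1., 0., 0.], [0., 1., 0.]], device=grid.device)
    w0 = F.affine_grid(s0.unsqueeze(0), size=(1, 1, H, W), align_corners=False)
    e_bias = lam2 * ((theta - s0).pow(2).mean() +
                     (grid.mean(dim=0, keepdim=True) - w0).pow(2).mean())
    return e_recon + e_smooth + e_bias                     # Eq. (2)
```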
In the class-aware variant described in Sec. 2.2 we augment the loss above with a cross-entropy loss evaluated on the classification network's outputs, while for the Intrinsic-DAE we add the following shading-smoothness objective during training:

$E_{Intrinsic} = \lambda_S \| \nabla S \|_2^2$,

where $\lambda_S = 10^{-6}$.
We experiment with two types of architectures. The majority of our results are obtained with a standard autoencoder architecture, where both the encoder and the decoders are CNNs built from standard convolution-BatchNorm-ReLU blocks. The number of filters and the texture bottleneck capacity vary per experiment, image resolution, and dataset, as detailed in Appendix 0.A.
Following the recent work on densely connected convolutional networks [33], we have also experimented with incorporating dense connections into our encoder and decoder architectures (with no skip connections over the bottleneck layer for the latent representations). In particular, we follow the architecture of DenseNet-121, but without the $1 \times 1$ convolutional layers inside each dense block. Dense connections have been shown to better exploit larger datasets, as indicated in our quantitative analysis of unsupervised face alignment. We call this version of the deforming autoencoder Dense-DAE.
3 Experiments
To demonstrate the properties of our deformation disentangling network, we conduct experiments on the following three datasets:

Deformed MNIST. A synthetic dataset designed specifically to explore the deformation modelling power of our network. Deformed MNIST consists of handwritten MNIST images randomly distorted using a mixture of sinusoidal waveforms.

MUG facial expression dataset [34]. This dataset consists of videos of individuals performing facial expressions in front of a simple blue background, with minor translation. The dataset also offers frames from the videos, classified according to facial expression as well as subject.

Faces-in-the-wild. We use the MAFL [35] and CelebA [36] face datasets, which contain in-the-wild faces under large variations of pose, illumination, and expression.
Using these datasets, we experimentally explore the ability of the unsupervised appearance-shape (or texture-deformation) disentangling network in terms of 1) unsupervised image alignment/appearance inference; 2) learning semantically meaningful manifolds for shape and appearance; 3) decomposition into illumination intrinsics (shading, albedo); and 4) unsupervised landmark detection, as detailed below. We intend to make all of the code of our system publicly available in order to facilitate the reproduction of our results.
3.1 Unsupervised Appearance Inference
We first use our network to model canonical appearance and deformation for single-category objects. For this purpose, we demonstrate results on the MNIST and MUG facial expression datasets (Figs. 5, 6, 7).
We observe that by heavily limiting the dimension of the appearance representation $Z_T$ (1 in Fig. 5 and 0 in Fig. 7), we can successfully infer a canonical appearance for such a class. In Fig. 5, all different types of handwritten '3' digits are aligned to a simple canonical shape. In Fig. 7, by limiting the dimension of $Z_T$ to 0, the network learns to encode a single texture image for all expressions and successfully distills expression-related information exclusively into the shape space. In Fig. 7(b) we show that by interpolating the learned latent representations, we can generate meaningful shape interpolations that mimic facial expressions.
In cases where data has a multimodal distribution exhibiting multiple different canonical appearances, e.g., multiclass MNIST digit images, learning a single appearance is less meaningful and often challenging (Fig. 6(b)). In such cases, utilizing class information (Sec. 2.2) significantly improves the quality of multimodal appearance learning (Fig. 6(d)). As the network learns to classify the images implicitly in its latent space, it learns to generate a single canonical appearance for each class. Misclassified data will be decoded into an incorrect class: the image at position (2,4) in Fig. 6(c,d) is interpreted as a 6.
We now demonstrate the effectiveness of texture inference on in-the-wild human faces. Using the MAFL face dataset, we show that our network is able to align faces to a common texture space under various poses, illumination conditions, and facial expressions (Fig. 10(d)). The aligned textures retain the information of the input image, such as lighting, gender, and facial hair, without any relevant supervisory training signal. We further demonstrate alignment on the 11k Hands dataset [37], where we align palmar images of the left hands of several subjects (Fig. 8). This property of our network is especially useful for applications such as computer graphics, where establishing correspondences (a UV map) between instances of an object class is important but usually difficult.
3.2 Autoencoders vs. Deforming Autoencoders
We show the ability of our network to learn meaningful deformation representations without supervision. We compare our disentangling network with a plain autoencoder (Fig. 9). Contrary to our network, which disentangles an image into a template texture and a deformation field, the autoencoder is trained to encode the whole image in a single latent representation, i.e., the bottleneck.
We train both networks on the MAFL faces-in-the-wild dataset. To evaluate the learned representation, we conduct manifold traversal (i.e., latent representation interpolation) between two randomly sampled face images: given a source face image $I_1$ and a target image $I_2$, we first compute their latent representations $Z_1$ and $Z_2$. We use $Z^{DAE} = [Z_T, Z_S]$ to denote the latent representation in our network and $Z^{AE}$ for the latent representation learned by a plain autoencoder. We then conduct linear interpolation between the two representations: $Z_\alpha = (1 - \alpha) Z_1 + \alpha Z_2$, with $\alpha \in [0, 1]$. We subsequently reconstruct the image from $Z_\alpha$ using the corresponding decoder(s), as shown in Fig. 9.
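The traversal protocol amounts to the following sketch (function names are placeholders for our trained modules):

```python
import torch

def traverse(encode, decode, img1, img2, steps: int = 8):
    """Linearly interpolate latent codes and decode each intermediate point."""
    z1, z2 = encode(img1), encode(img2)
    frames = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        z = (1.0 - alpha) * z1 + alpha * z2
        # For the DAE, `decode` synthesizes the texture, predicts the warp,
        # and resamples; interpolating only Z_S or only Z_T yields the
        # shape-only and texture-only traversals discussed below.
        frames.append(decode(z))
    return frames
```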
By traversing the learned deformation representation only, we can change the shape and pose of a face while maintaining its texture (Fig. 9(1)); interpolating the texture representation results in pose-aligned texture transfer (Fig. 9(2)); traversing both representations generates a smooth morphing from one image to another (Fig. 9(3,5,7)). Compared to interpolation using the autoencoder (Fig. 9(4,6,8)), which often exhibits artifacts, our traversal stays on the semantic manifold of faces and generates sharp facial features.
3.3 Intrinsic Deforming Autoencoders
Having demonstrated the disentanglement abilities of Deforming Autoencoders, we now explore those of the Intrinsic-DAE described in Sec. 2.3. Using only the reconstruction and regularization losses, the Intrinsic-DAE is able to generate convincing shading and albedo estimates without direct supervision (Fig. 10(b) to (g)). Without the 'learning-to-align' property, a baseline autoencoder with the same intrinsic decomposition design (Fig. 4(b)) cannot decompose the image into plausible shading and albedo components (Fig. 10(h),(i),(j)).
In addition, we show that by manipulating the learned latent representation of shading, the Intrinsic-DAE allows us to simulate illumination effects for face images, such as interpolating lighting directions (Fig. 11).
When trained with reconstruction losses, autoencoder-like architectures are prone to generating smooth images that lack visual realism (Fig. 10). Inspired by the success of generative adversarial networks (GANs) [38], we follow previous work [2] in adopting an adversarial loss to generate visually realistic images: we train the Intrinsic-DAE with an extra adversarial loss term $E_{Adv}$ applied on the final output. The loss function becomes

$E = E_{DAE} + E_{Intrinsic} + \lambda_{Adv} E_{Adv}$.    (7)

In practice, we apply a PatchGAN [39, 40] as the discriminator. We found that the adversarial loss improves the visual sharpness of the reconstruction while deformation, shading, and albedo are still successfully disentangled (Fig. 12).
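A hedged sketch of the adversarial term (the discriminator architecture and the weight are assumptions; see [39, 40] for PatchGAN details):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def adversarial_losses(D: nn.Module, real: torch.Tensor, fake: torch.Tensor,
                       lam_adv: float = 0.1):
    """PatchGAN-style losses: D maps images to a grid of real/fake logits."""
    d_real = D(real)
    d_fake = D(fake.detach())  # do not backprop into the generator here
    d_loss = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    g_logits = D(fake)         # generator update: try to fool the discriminator
    g_loss = lam_adv * bce(g_logits, torch.ones_like(g_logits))
    return d_loss, g_loss
```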
3.4 Unsupervised alignment evaluation
Having qualitatively analyzed the disentanglement capabilities of our networks, we now quantify their performance on the task of unsupervised image alignment. We report the performance of our face DAE's alignment through landmark detection on face images, specifically the eyes, the nose, and the corners of the mouth. We report performance on the MAFL dataset, which contains manually annotated landmark locations for 19,000 training and 1,000 test images. In our experiments, we use a model trained on the CelebA dataset without any form of supervision to estimate deformation fields on the MAFL training set. Following the evaluation protocol of the work we directly compare to [29], we train a landmark regressor post hoc on these deformation fields using the provided annotations. We use landmark locations from the MAFL training set as training data for this regressor, but do not pass gradients to the Deforming Autoencoder, which thereby remains fixed to the model learned without supervision. The regressor is a 2-layer fully connected neural network. Its inputs are flattened deformation fields, which are fed to a 100-dimensional hidden layer, followed by a ReLU and a 10-dimensional output layer that predicts the spatial coordinates $(x, y)$ of the five landmarks corresponding to the eyes, nose, and mouth corners. We use the L1 loss as the objective function for this regression task.

At test time, we predict landmark locations using the trained regressor and the deformation fields estimated on the MAFL test set. In Table 1 we report the mean error in landmark localization as a percentage of the inter-ocular distance. As the deformation field determines the alignment in texture space, it serves as an effective mapping between landmark locations on the aligned texture and those on the original, unaligned faces. Hence, the mean error we report directly quantifies the quality of the (unsupervised) face alignment.
Table 1: Mean landmark localization error on the MAFL test set, as a percentage of the inter-ocular distance. The first three model variants are trained on MAFL, the last two on CelebA; the final column additionally uses the landmark regressor.

MAFL  |  MAFL  |  MAFL  |  CelebA  |  CelebA, with regressor
14.13  |  9.89  |  8.50  |  7.54  |  5.96
In Table 2 we compare with the results of the best current method for self-supervised image registration [29]. We observe that by better modeling the deformation space we quickly bridge the gap in performance, even though we never explicitly train to learn correspondences.
Table 2: Mean landmark localization error (% of inter-ocular distance) on MAFL for DAE and Dense-DAE variants (column labels give the variant or latent dimension), compared to the supervised TCDCN [41] and to Thewlis et al. [29].

DAE: 32NR: 10.24  |  32Res: 9.93  |  16: 5.71  |  32: 5.96  |  64: 5.70  |  96: 6.46
Dense-DAE: 16: 6.85  |  64: 5.50  |  96: 5.45
TCDCN [41]: 7.95  |  Thewlis et al. [29]: 5.83
4 Conclusion and Future Work
In this paper we have developed deep autoencoders that disentangle shape and appearance in latent representation space. We have shown that this method can be used for unsupervised groupwise image alignment. Our experiments with expression morphing, image manipulation such as shape and appearance interpolation, and unsupervised landmark localization show the generality of our approach. We have shown that bringing images into a canonical coordinate system allows for a more extensive form of image disentangling, facilitating the estimation of decompositions into shape, albedo, and shading without any form of supervision. We expect that this will lead in the future to a full-fledged disentanglement into normals, illumination, and 3D geometry.
5 Acknowledgment
This work was supported by a gift from Adobe, NSF grants CNS-1718014 and DMS-1737876, the Partner University Fund, and the SUNY2020 Infrastructure Transportation Security Center.
References
 [1] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In: NIPS. (2016)
 [2] Shu, Z., Yumer, E., Hadap, S., Sunkavalli, K., Shechtman, E., Samaras, D.: Neural face editing with intrinsic image disentangling. In: CVPR. (2017)
 [3] Worrall, D.E., Garbin, S.J., Turmukhambetov, D., Brostow, G.J.: Interpretable transformations with encoderdecoder networks. In: CVPR. (2017)
 [4] Sengupta, S., Kanazawa, A., Castillo, C.D., Jacobs, D.: Sfsnet: Learning shape, reflectance and illuminance of faces in the wild. arXiv preprint arXiv:1712.01261 (2017)

 [5] Memisevic, R., Hinton, G.E.: Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation (2010)
 [6] Worrall, D.E., Garbin, S.J., Turmukhambetov, D., Brostow, G.J.: Harmonic networks: Deep translation and rotation equivariance. (2016)
 [7] Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded image generation network for novel 3D view synthesis. In: CVPR. (2017) 702–711
 [8] Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., Ranzato, M.: Fader networks: Manipulating images by sliding attributes. CoRR abs/1706.00409 (2017)
 [9] Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. In: European conference on computer vision, Springer (1998)
 [10] Matthews, I., Baker, S.: Active appearance models revisited. IJCV (2004)
 [11] LearnedMiller, E.G.: Data driven image models through continuous joint alignment. PAMI (2006)
 [12] Kokkinos, I., Yuille, A.L.: Unsupervised learning of object deformation models. In: ICCV. (2007)
 [13] Frey, B.J., Jojic, N.: Transformation-invariant clustering using the EM algorithm. PAMI 25(1) (2003) 1–17
 [14] Jojic, N., Frey, B.J., Kannan, A.: Epitomic analysis of appearance and shape. In: ICCV. (2003) 34–43
 [15] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. CoRR abs/1506.02025 (2015)
 [16] Papandreou, G., Kokkinos, I., Savalle, P.: Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In: CVPR. (2015)
 [17] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV. (2017)
 [18] Neverova, N., Kokkinos, I.: Mass displacement networks. arXiv (2017)
 [19] Trigeorgis, G., Snape, P., Nicolaou, M.A., Antonakos, E., Zafeiriou, S.: Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In: CVPR. (2016)
 [20] Güler, R.A., Trigeorgis, G., Antonakos, E., Snape, P., Zafeiriou, S., Kokkinos, I.: DenseReg: Fully convolutional dense shape regression in-the-wild. In: CVPR. (2017)

 [21] Hinton, G.E.: A parallel computation that assigns canonical object-based frames of reference. In: IJCAI. (1981) 683–685
 [22] Olshausen, B.A., Anderson, C.H., Essen, D.C.V.: A multiscale dynamic routing circuit for forming size- and position-invariant object representations. Journal of Computational Neuroscience 2(1) (1995) 45–62
 [23] von der Malsburg, C.: The correlation theory of brain function. Internal Report 81-2, Max-Planck-Institute for Biophysical Chemistry, Göttingen (1981)

 [24] Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. In: ICANN. (2011) 44–51
 [25] Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. CoRR abs/1710.09829 (2017)
 [26] Bristow, H., Valmadre, J., Lucey, S.: Dense semantic correspondence where every pixel is a classifier. In: ICCV. (2015)
 [27] Zhou, T., Krähenbühl, P., Aubry, M., Huang, Q., Efros, A.A.: Learning dense correspondence via 3D-guided cycle consistency. In: CVPR. (2016)
 [28] Gaur, U., Manjunath, B.S.: Weakly supervised manifold learning for dense semantic object correspondence. In: ICCV. (2017)
 [29] Thewlis, J., Bilen, H., Vedaldi, A.: Unsupervised object learning from dense equivariant image labelling. In: NIPS. (2017)
 [30] Amit, Y., Grenander, U., Piccioni, M.: Structural image restoration through deformable templates. Journal of the American Statistical Association 86(414) (1991)

 [31] Yuille, A.L.: Deformable templates for face recognition. Journal of Cognitive Neuroscience 3(1) (1991)
 [32] Blanz, V., Vetter, T.: Face recognition based on fitting a 3D morphable model. PAMI 25(9) (2003) 1063–1074
 [33] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017)
 [34] Aifanti, N., Papachristou, C., Delopoulos, A.: The MUG facial expression database. In: 11th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 2010), Desenzano del Garda, Italy, IEEE (2010)
 [35] Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: ECCV. (2014) 94–108
 [36] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision (ICCV). (2015)
 [37] Afifi, M.: Gender recognition and biometric identification using a large dataset of hand images. CoRR abs/1711.04322 (2017)
 [38] Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. (2014) 2672–2680
 [39] Li, C., Wand, M.: Precomputed realtime texture synthesis with markovian generative adversarial networks. In: European Conference on Computer Vision, Springer (2016) 702–716
 [40] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arXiv (2016)
 [41] Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Learning deep representation for face alignment with auxiliary attributes. IEEE transactions on pattern analysis and machine intelligence 38(5) (2016) 918–930
Appendix 0.A Architectural Details
0.A.1 Convolutional Encoders and Decoders
In our experiments, where input images are of size $N \times N \times N_c$ ($N_c$ is 1 for MNIST and 3 for faces), we use identical architectures for the convolutional encoders and decoders; a PyTorch-style transcription of the encoder is sketched after the notation list below.
The encoder architecture is

Conv(32) -> LeakyReLU -> Conv(64) -> BN -> LeakyReLU -> Conv(128) -> BN -> LeakyReLU -> Conv(256) -> BN -> LeakyReLU -> Conv(Nz) -> Sigmoid,

while the decoder architecture is

ConvT(256) -> BN -> ReLU -> ConvT(128) -> BN -> ReLU -> ConvT(64) -> BN -> ReLU -> ConvT(32) -> BN -> ReLU -> ConvT(32) -> BN -> ReLU -> ConvT(Nc) -> Threshold(0,1),
where

Conv(n): convolution layer with n output feature maps;

ConvT(n): transposed convolution (deconvolution) layer with n output feature maps;

BN: batch normalization layer;

Nz: latent representation dimension;

Nc: number of output image channels.
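For concreteness, a direct PyTorch transcription of the encoder string above (the kernel size, stride, and padding are our assumptions; 4x4 kernels with stride 2 are a common choice for such downsampling encoders):

```python
import torch.nn as nn

Nc, Nz = 3, 128  # example values: RGB faces, 128-dimensional latent code

def conv(cin: int, cout: int) -> nn.Conv2d:
    return nn.Conv2d(cin, cout, kernel_size=4, stride=2, padding=1)

encoder = nn.Sequential(
    conv(Nc, 32), nn.LeakyReLU(0.2),
    conv(32, 64), nn.BatchNorm2d(64), nn.LeakyReLU(0.2),
    conv(64, 128), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
    conv(128, 256), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
    conv(256, Nz), nn.Sigmoid(),
)
```

The decoder mirrors this layout with transposed convolutions, following the decoder string above.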
0.A.2 DenseNet-style Encoders and Decoders
For DenseNet-style architectures, we employ dense convolutional connections. The architecture for the encoder is

BN -> ReLU -> Conv(32) -> DBE(32,6) -> TBE(32,64,2) -> DBE(64,12) -> TBE(64,128,2) -> DBE(128,24) -> TBE(128,256,2) -> DBE(256,16) -> TBE(256,Nz,4) -> Sigmoid,

whereas the architecture for the decoder is

BN -> Tanh -> ConvT(256) -> DBD(256,16) -> TBD(256,128) -> DBD(128,24) -> TBD(128,64) -> DBD(64,12) -> TBD(64,32) -> DBD(32,6) -> TBD(32,32) -> BN -> Tanh -> ConvT(Nc) -> Threshold(0,1),
where

DBE(n,k): a dense encoder block with k convolutions, each with n channels;

TBE(m,n,p): an encoder transition block of convolutions with m input channels and n output channels, including a max-pooling operation of size p;

DBD(n,k): a dense decoder block with k transposed convolution operations, each with n channels;

TBD(m,n): a decoder transition block of transposed convolutions with m input channels and n output channels.
Appendix 0.B Ablation Study
0.B.1 Dimension of $Z_T$

In this section, we show experimental results on single-class deformed MNIST images of the digit 3 (Fig. 14), as well as on in-the-wild faces (without masking) from the MAFL dataset (Fig. 15), to demonstrate the effect of varying the dimension of $Z_T$.
(Figure 14: inputs with reconstructions, textures, and warping fields (x and y) for $Z_T$ of dimension 0, 1, 4, 8, and 16.)
(Figure 15: inputs with reconstructions and textures for $Z_T$ of dimension 0, 2, 4, 16, 32, and 128.)
0.B.2 Methods for Deformation Modeling
In this section, we demonstrate the effect of using different warping modules.
We first show additional comparisons between our proposed affine + integral warping and a non-rigid warping field directly output by a convolutional decoder for non-rigid deformation modeling (Fig. 16).
We visualize the utility of the affine and integral warping modules in our network on face images (Fig. 17). We can see that the affine transformation handles global pose variation (Fig. 17(b)) but not local non-rigid deformation. Our proposed integral warping module aligns the faces in a non-rigid manner (Fig. 17(c)). Incorporating both deformation modules improves the non-rigid alignment (Fig. 17(d)).
(Figure 16: input images (a), with reconstructions, textures, and horizontal warping fields for the two deformation-modeling variants (b) and (c).)


(Figure 17: input images (a), with reconstructions and textures for affine-only (b), integral-only (c), and affine + integral (d) warping.)
Appendix 0.C Latent Manifold Traversal
We provide additional results and comparisons with a plain autoencoder on traversing the learned manifolds. In addition to Figure 13 in our manuscript, we provide two more sets of results in Figures 18 and 19. Compared to a plain autoencoder, our deforming autoencoder not only generates better reconstructions, but also learns a better face manifold: interpolating between learned latent representations generates sharper and more realistic face images. For this experiment, we use the convolutional encoder and decoder architecture described in Sec. 0.A.1.
Appendix 0.D Intrinsic Decomposition with DAE
In Fig. 21 we provide additional results of unsupervised intrinsic disentangling for faces-in-the-wild using the Intrinsic-DAE. Using the architecture and objective functions described in Sec. 2.3 of the main paper, the network learns to bring faces under different poses and illumination conditions (Fig. 21(a)) to a canonical view (Fig. 21(d)), while separating the shading (Fig. 21(b)) and albedo (Fig. 21(c)) components in the canonical view using two independent decoders. With the deformation learned by the deformation decoder, we can warp the aligned shading and albedo back to the original view of the input image (Fig. 21(e,f)).
In Fig. 22, we provide additional results for changing the lighting direction of a face image using the Intrinsic-DAE. We show that even without explicit modeling of geometry, we can simulate smooth and plausible lighting direction changes by interpolating the learned latent representation for shading (Fig. 22a(4), b(4)).
For the Intrinsic-DAE, we use the DenseNet architecture for the encoders and decoders (Sec. 0.A.2). The network is trained on a subset of images from the CelebA dataset. The dimensions of the latent representations are 16 for albedo, 16 for shading, and 128 for the deformation field.