Disentangling factors of variation is important for the broader goal of controlling and understanding deep networks, but also for applications such as image manipulation through interpretable operations. Progress in disentangling the latent space of deep generative models has facilitated the separation of latent image representations into dimensions that account for independent factors of variation, such as identity, illumination, normals, and spatial support [1, 2, 3, 4]; low-dimensional transformations, such as rotations, translation, or scaling [5, 6, 7]; or finer levels of variation, including age, gender, wearing glasses, or other attributes, e.g. [2, 8], for particular classes such as faces.
Shape variation is more challenging as it amounts to a transformation of a function’s domain, rather than its values. Even simple, supervised additive models of shape result in complex nonlinear optimization problems [9, 10]. Despite this challenge, several works in the previous decade aimed at learning shape/appearance factorizations in an unsupervised manner, exploring groupwise image alignment [11, 12, 13, 14].
In the context of deep learning, several works have aimed at incorporating deformations and alignment in a supervised setting, including Spatial Transformers [15], Deep Epitomic Networks [16], Deformable CNNs [17], Mass Displacement Networks [18], Mnemonic Descent [19], and Densereg [20]. These works have shown that one can improve the accuracy of both classification and localization tasks by injecting deformations and alignment within traditional CNN architectures.
Turning to unsupervised deep learning, even though most works focus on rigid or low-dimensional parametric deformations, e.g. [5, 6], several works have attempted to incorporate richer non-rigid deformations within learning. One thread of works has aimed at dynamically rerouting the processing of information within the network’s graph based on the input, starting from neural computation arguments [21, 22, 23] and eventually translating into concrete algorithms, such as the ‘capsule’ works of [24, 25] that bind neurons on-the-fly. Still, these works lack a transparent, parametric handling of non-rigid deformations. Working in a more geometric direction, several works have recently aimed at recovering dense correspondences between pairs or sets of RGB images, as e.g. in the recent works of [27, 28]. These works, however, do not have the notion of a reference coordinate system (‘template’) to which images can get mapped - this makes image generation and manipulation harder. More recently, [29] use the equivariance principle in order to align sets of images to a common coordinate system, but do not develop this into a full-blown generative model of images.
In this work, we consider that object instances are obtained by deforming a prototypical object, or ‘template’, through dense, diffeomorphic deformation fields. This makes it possible to factor object variability within a category into variations that are associated with spatial transformations, generally linked to the object’s 2D/3D shape, and variations that are associated with appearance (or ‘texture’ in graphics), e.g. due to facial hair, skin color, or illumination. In particular, we consider that both sources of variation can be modelled in terms of a low-dimensional latent code that is learnable in an unsupervised manner from images. We achieve disentangling by breaking this latent code into separate parts that are fed into separate decoder networks that deliver appearance and deformation estimates. Even though one could hope that a generic convolutional architecture will learn to represent such effects, we argue that explicitly injecting this inductive bias in a network can help with the training, while also yielding control over the generative process.
Our main contributions in this work can be summarized as follows:
First, we introduce the Deforming Autoencoder architecture, bringing together the deformable modeling paradigm with unsupervised deep learning. We treat the template-to-image correspondence task as that of predicting a smooth and invertible transformation. As shown in Fig. 1, our network predicts this transformation field along with the template-aligned appearance and subsequently deforms the synthesized appearance to generate an image similar to its input. This allows for a disentanglement of the shape and appearance parts of image generation by explicitly modelling the effects of image deformation during the decoding stage.
Second, we explore different ways in which deformations can be represented and predicted by the decoder. Instead of building a generic deformation model, we compose a global affine deformation field with a non-rigid field that is synthesized by a convolutional decoder network. We develop a method that allows us to constrain the synthesized field to be a diffeomorphism, namely an invertible and smooth transformation, and show that it simplifies training and improves accuracy. We also show that class-related information can be exploited, when available, to learn better deformation models: this yields sharper images and can be used to learn models that jointly account for multiple classes - e.g. all MNIST digits.
Third, we show that disentangling appearance from deformation comes with several advantages when it comes to modeling and manipulating images. By using disentangling we obtain clearly better synthesis results when manipulating images for tasks such as expression, pose or identity interpolation when compared to standard autoencoder architectures. Along the same lines, we show that accounting for deformations facilitates a further disentangling of the appearance components into an intrinsic, shading-albedo decomposition which completely fails when naively performed in the original image coordinates. This allows us to perform re-shading through simple operations on the latent shading coordinate space.
We complement these qualitative results with a quantitative analysis of the learned model in terms of landmark localization accuracy. We show that our method is not far below supervised methods and outperforms by a margin the latest state-of-the-art work on self-supervised correspondence estimation [29], even though we never explicitly trained our network for correspondence estimation, but rather only aimed at reconstructing pixel intensities.
2 Deforming Autoencoders
Our architecture embodies the deformable template paradigm in an autoencoder architecture. The premise of our work is that image generation can be interpreted as the combination of two processes: a synthesis of appearance on a deformation-free coordinate system (‘template’), followed by a subsequent deformation that introduces shape variability. Denoting by $T(\mathbf{p})$ the value of the synthesized appearance (or texture) at coordinate $\mathbf{p} = (x, y)$ and by $W(\mathbf{p})$ the estimated deformation field, we consider that the observed image, $I(\mathbf{p})$, can be reconstructed as follows:
$$I(\mathbf{p}) \simeq T(W(\mathbf{p})),$$
namely the image appearance at position $\mathbf{p}$ is obtained by looking up the synthesized appearance at position $W(\mathbf{p})$. This is implemented in terms of a spatial transformer layer [15] that allows us to pass gradients through the warping process.
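The lookup described above can be illustrated with a minimal sketch. It assumes bilinear interpolation, pixel-unit coordinates, and clamping at the border; none of these details are stated in the text, and the function names `bilinear_sample` and `warp` are hypothetical. Only the forward pass is shown, not the gradient computation that a spatial transformer layer provides.

```python
import math

def bilinear_sample(texture, x, y):
    """Sample a 2D texture (list of rows) at a real-valued position (x, y)."""
    h, w = len(texture), len(texture[0])
    # Assumption: positions outside the texture are clamped to its border.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * texture[y0][x0] + ax * texture[y0][x1]
    bottom = (1 - ax) * texture[y1][x0] + ax * texture[y1][x1]
    return (1 - ay) * top + ay * bottom

def warp(texture, warp_x, warp_y):
    """Output pixel (x, y) looks up the texture at (warp_x[y][x], warp_y[y][x])."""
    h, w = len(warp_x), len(warp_x[0])
    return [[bilinear_sample(texture, warp_x[y][x], warp_y[y][x])
             for x in range(w)] for y in range(h)]

texture = [[0.0, 1.0, 2.0],
           [3.0, 4.0, 5.0]]
# The identity field maps every pixel to itself.
identity_x = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]
identity_y = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
same = warp(texture, identity_x, identity_y)
# A half-pixel horizontal offset blends neighbouring texture values.
shifted_x = [[x + 0.5 for x in row] for row in identity_x]
shifted = warp(texture, shifted_x, identity_y)
```

With the identity field the output reproduces the texture, while the half-pixel offset averages horizontally adjacent values, which is what makes the lookup differentiable with respect to the field in practice.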
The appearance and deformation functions are synthesized by independent decoder networks. The inputs to the decoders are delivered by a joint encoder network that takes as input the observed image and delivers a low-dimensional latent representation, $Z$, of shape and appearance. This is split into two parts, $Z = [Z_T, Z_S]$, which feed into the appearance and shape networks respectively, providing us with a clear separation of shape and appearance.
2.1 Deformation field modeling
Rather than leave deformation modeling entirely to back-propagation, we use some domain knowledge to simplify and accelerate learning. The first observation is that global aspects can be expressed using low-dimensional linear models. We account for global deformations by an affine Spatial Transformer layer that uses a six-dimensional input to synthesize a deformation field as an expansion on a fixed basis [15]. This means that the shape representation $Z_S$ described above is decomposed into two parts, $Z_S = [Z_W, Z_A]$, where $Z_A$ accounts for the affine, and $Z_W$ for the non-rigid, learned part of the deformation field. These deformation fields are generated by separate decoders, and are composed, so that the affine transformation warps the detailed non-rigid warps to the image positions where they should apply. This is also a common decomposition in deformable models for faces [9, 10].
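As an illustration of how six parameters expand into a dense field on a fixed basis, the following sketch uses the three basis fields $x$, $y$ and $1$ on a grid normalised to $[-1, 1]$. The parameter ordering, the normalisation, and the names `normalized_coords` and `affine_field` are assumptions made for this example only.

```python
def normalized_coords(n):
    """Evenly spaced coordinates in [-1, 1] for n >= 2 pixels."""
    return [2.0 * i / (n - 1) - 1.0 for i in range(n)]

def affine_field(params, height, width):
    """Expand six affine parameters into dense x and y lookup fields.

    Assumed ordering: (a, b, tx, c, d, ty), so that
    field_x = a*x + b*y + tx and field_y = c*x + d*y + ty.
    """
    a, b, tx, c, d, ty = params
    xs, ys = normalized_coords(width), normalized_coords(height)
    field_x = [[a * x + b * y + tx for x in xs] for y in ys]
    field_y = [[c * x + d * y + ty for x in xs] for y in ys]
    return field_x, field_y

IDENTITY = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0)
id_x, id_y = affine_field(IDENTITY, 2, 3)
# Zooming in by a factor of 2 with a small horizontal translation.
zoom_x, zoom_y = affine_field((0.5, 0.0, 0.25, 0.0, 0.5, 0.0), 2, 3)
```

The identity parameters return the sampling grid itself, which is the reference point used when regularising the affine part toward the identity transform.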
Turning to local deformation effects, we quickly realized that not every deformation field is plausible. Without appropriate regularization we would often obtain deformation fields that could expand small areas to occupy whole regions, and/or would be non-diffeomorphic, meaning that the deformation could spread a connected texture pattern to a disconnected image area (Figure 2-(f)).
To prevent this problem, instead of making the shape decoder CNN directly predict the local warping field $W(\mathbf{p}) = (W^x(x, y), W^y(x, y))$, we consider a ‘differential decoder’ that generates the spatial gradient of the warping field: $\nabla_x W^x$ and $\nabla_y W^y$, where $\nabla_c$ denotes the $c$-th component of the spatial gradient vector. These two quantities measure the displacement of consecutive pixels - for instance $\nabla_x W^x = 1$ amounts to translation in the horizontal axis, $\nabla_x W^x = 2$ amounts to horizontal scaling by a factor of 2, while $\nabla_x W^x = -1$ amounts to left-right flipping; a similar behavior is associated with $\nabla_y W^y$ in the vertical axis. We note that global rotations are handled by the affine warping field, and the cross terms $\nabla_x W^y$ and $\nabla_y W^x$ are associated with small local rotations of minor importance - we therefore focus on $\nabla_x W^x$ and $\nabla_y W^y$.
Having access to these two values gives us a handle on the deformation field, since we can prevent folding/excessive stretching by controlling $\nabla_x W^x$ and $\nabla_y W^y$.
In particular, we pass the outputs of our differential decoder through a Rectified Linear Unit (ReLU) module, which enforces positive horizontal offsets on horizontally adjacent pixels, and positive vertical offsets on vertically adjacent pixels. We subsequently apply a spatial integration layer, implemented in terms of a fixed network layer, on top of the output of the ReLU layer to reconstruct the warping field from its spatial gradient. By doing so, the new deformation module enforces the generation of smooth and regular warping fields that avoid self-crossings. In practice we found that also clipping the decoded offsets by a maximal value $\delta$ significantly eases the training, which amounts to replacing the ReLU layer with a layer that clips its output to $[0, \delta]$. In our experiments, we set $\delta$ as a function of $w$, where $w$ denotes the number of pixels along one dimension of the image.
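The clip-then-integrate step can be sketched as follows, assuming that the fixed spatial integration layer is a cumulative sum along the corresponding axis and that the clipping bound is passed in as `delta`. The helper names and the value `delta=1.0` are hypothetical and chosen only for illustration.

```python
from itertools import accumulate

def hard_clip(value, delta):
    """Clip a raw offset to [0, delta]; with a very large delta this is a ReLU."""
    return min(max(value, 0.0), delta)

def integrate_horizontal(raw_offsets, delta):
    """Clip horizontal offsets, then cumulatively sum them along each row."""
    return [list(accumulate(hard_clip(v, delta) for v in row))
            for row in raw_offsets]

def integrate_vertical(raw_offsets, delta):
    """Clip vertical offsets, then cumulatively sum them along each column."""
    columns = zip(*raw_offsets)
    summed = [list(accumulate(hard_clip(v, delta) for v in col))
              for col in columns]
    return [list(row) for row in zip(*summed)]

# Raw decoder outputs may be negative or very large before clipping.
raw_x = [[0.5, -1.0, 2.0, 0.25]]
warp_x = integrate_horizontal(raw_x, delta=1.0)

raw_y = [[0.5], [-1.0], [2.0]]
warp_y = integrate_vertical(raw_y, delta=1.0)
```

Because every clipped offset is non-negative, the integrated coordinates never decrease along a row or column, so the field cannot fold back on itself; the upper bound limits how far apart two neighbouring pixels can be mapped.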
2.2 Class-aware Deforming Autoencoder
We can require our network’s latent representation to be predictive of not only shape and appearance, but also of instance class, if that is available during training. We note that this information, being discrete may be easier to acquire than the actual deformation field, which would require manual landmark annotation. For instance, for faces such discrete information could represent the expression or a person’s identity.
In particular we consider that the latent representation can be decomposed as follows: $Z = [Z_T, Z_C, Z_S]$, where $Z_T, Z_S$ are as previously the appearance- and shape-related parts of the representation, respectively, while $Z_C$ is fed as input to a sub-network trained to predict the class associated with the input image. Apart from assisting the classification task, the latent vector $Z_C$ is fed into both the appearance and shape decoders. Intuitively this allows our decoder network to learn a mixture model that is conditioned on class information, rather than treating the joint, multi-modal distribution through a monolithic model. Even though the class label is only used during training, and not for reconstruction, our experimental results show that a network trained with class supervision can deliver more accurate synthesis results.
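A small sketch of the routing described above: the latent vector is sliced into appearance, class and shape parts, and the class part is passed to the classifier and to both decoders. Concatenation as the way of feeding the class code to the decoders, the part sizes, and the names `split_latent` and `decoder_inputs` are assumptions of this example.

```python
def split_latent(z, n_texture, n_class, n_shape):
    """Slice a latent vector into (appearance, class, shape) parts."""
    if len(z) != n_texture + n_class + n_shape:
        raise ValueError("latent size does not match the requested split")
    z_t = z[:n_texture]
    z_c = z[n_texture:n_texture + n_class]
    z_s = z[n_texture + n_class:]
    return z_t, z_c, z_s

def decoder_inputs(z_t, z_c, z_s):
    """Route the class code to the classifier and to both decoders."""
    return {
        "classifier": list(z_c),
        "texture_decoder": z_t + z_c,  # assumption: simple concatenation
        "shape_decoder": z_s + z_c,
    }

z = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
z_t, z_c, z_s = split_latent(z, n_texture=2, n_class=1, n_shape=3)
inputs = decoder_inputs(z_t, z_c, z_s)
```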
2.3 Intrinsic Deforming Autoencoder: Deformation, Albedo and Shading Decomposition
Having outlined Deforming Autoencoders, we now use a Deforming Autoencoder to model complex physical image signals, such as illumination effects, without a supervision signal. For this we design the Intrinsic Deforming Autoencoder, named Intrinsic-DAE, to model shading and albedo for in-the-wild face images. As shown in Fig. 4-(a), we introduce two separate decoders for shading $S$ and albedo $A$, each of which has the same structure as the original texture decoder. The texture is computed by $T = S \circ A$, where $\circ$ denotes the Hadamard product.
In order to model the physical properties of shading and albedo, we follow the intrinsic decomposition regularization loss used in [2]: we apply the L2 smoothness loss on $\nabla S$, meaning that shading is expected to be smooth, while leaving albedo unconstrained. As shown in Fig. 4 and more extensively in the experimental results section, when used in tandem with a Deforming Autoencoder this allows us to successfully decompose a face image into shape, albedo, and shading components, while a standard Autoencoder completely fails at decomposing unaligned images into shading and albedo.
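The two ingredients above, the element-wise product of shading and albedo and a smoothness penalty on the shading gradient, can be sketched as below. Forward differences as the discrete gradient and the absence of a weighting factor are assumptions of this example; `hadamard` and `l2_smoothness` are hypothetical names.

```python
def hadamard(shading, albedo):
    """Element-wise product of two equally sized 2D maps."""
    return [[s * a for s, a in zip(s_row, a_row)]
            for s_row, a_row in zip(shading, albedo)]

def l2_smoothness(field):
    """Sum of squared forward differences along both image axes."""
    h, w = len(field), len(field[0])
    total = 0.0
    for r in range(h):
        for c in range(w):
            if c + 1 < w:
                total += (field[r][c + 1] - field[r][c]) ** 2
            if r + 1 < h:
                total += (field[r + 1][c] - field[r][c]) ** 2
    return total

albedo = [[0.2, 0.8], [0.4, 0.6]]
flat_shading = [[1.0, 1.0], [1.0, 1.0]]
ramp_shading = [[0.5, 1.0], [0.5, 1.0]]

texture = hadamard(ramp_shading, albedo)
flat_penalty = l2_smoothness(flat_shading)
ramp_penalty = l2_smoothness(ramp_shading)
```

A constant shading map incurs no penalty, while any spatial change in shading is penalised; albedo is left free to carry sharp detail.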
Our objective function is formed as the sum of three losses, combining the reconstruction error with the regularization terms required for the modules described above. Concretely, the loss of the deforming autoencoder can be written as
$$E_{\text{DAE}} = E_{\text{Reconstruction}} + E_{\text{Warp}},$$
where the reconstruction loss is defined as the standard $\ell_2$ loss
$$E_{\text{Reconstruction}} = \| I_{\text{Output}} - I_{\text{Input}} \|^2,$$
and the warping loss is decomposed as follows:
$$E_{\text{Warp}} = E_{\text{Smooth}} + E_{\text{BiasReduce}}.$$
In particular the smoothness cost, $E_{\text{Smooth}}$, penalizes quickly-changing deformations encoded by the local warping field. It is measured in terms of the total variation norm of the horizontal and vertical differential warping fields, and is given by
$$E_{\text{Smooth}} = \lambda_1 \left( \| \nabla W^x(x, y) \|_1 + \| \nabla W^y(x, y) \|_1 \right),$$
where $\lambda_1$ is a weighting coefficient. Finally, $E_{\text{BiasReduce}}$ aims at removing any systematic bias introduced by the fitting process, e.g. the average template becoming small, or a distorted version of the data. It consists of regularization on (1) the affine parameters, defined as the L2-distance between $S_A$ and $S_0$, with $S_0$ being the identity affine transform, and (2) the free-form deformations, defined as the L2-distance between the average deformation grid within a minibatch, $\bar{W}$, and the identity grid $W_0$:
$$E_{\text{BiasReduce}} = \lambda_2 \| S_A - S_0 \|^2 + \lambda_2' \| \bar{W} - W_0 \|^2.$$
In the class-aware variant described in Sec. 2.2 we augment the loss above with the cross-entropy loss evaluated on the classification network’s outputs, while for Intrinsic-DAE, we add the following objective function in training:
$$E_{\text{Shading}} = \lambda_3 \| \nabla S \|_2.$$
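The warping regularisers can be sketched numerically as follows. This example reads the smoothness cost as the anisotropic total variation (L1 norm of forward differences) of the two differential fields, treats the deformation grid as a single coordinate channel for brevity, and uses made-up weights; all function names and all numeric values are illustrative assumptions, not the settings used in the experiments.

```python
def total_variation(field):
    """L1 norm of forward differences along both axes of a 2D field."""
    h, w = len(field), len(field[0])
    total = 0.0
    for r in range(h):
        for c in range(w):
            if c + 1 < w:
                total += abs(field[r][c + 1] - field[r][c])
            if r + 1 < h:
                total += abs(field[r + 1][c] - field[r][c])
    return total

def flatten(grid):
    return [v for row in grid for v in row]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_grid(grids):
    """Average deformation grid over a minibatch, returned flattened."""
    n = len(grids)
    return [sum(vals) / n for vals in zip(*(flatten(g) for g in grids))]

def warp_loss(diff_x, diff_y, affine, identity_affine, batch_grids,
              identity_grid, lam_smooth, lam_affine, lam_grid):
    smooth = lam_smooth * (total_variation(diff_x) + total_variation(diff_y))
    bias = (lam_affine * squared_distance(affine, identity_affine)
            + lam_grid * squared_distance(mean_grid(batch_grids),
                                          flatten(identity_grid)))
    return smooth + bias

identity_affine = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
identity_grid = [[0.0, 1.0]]
# Two grids whose minibatch average equals the identity grid.
batch = [[[-0.5, 1.5]], [[0.5, 0.5]]]
constant = [[1.0, 1.0], [1.0, 1.0]]
varying = [[1.0, 2.0], [1.0, 2.0]]
shifted_affine = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

no_penalty = warp_loss(constant, constant, identity_affine, identity_affine,
                       batch, identity_grid, 0.5, 0.5, 0.5)
penalty = warp_loss(varying, constant, shifted_affine, identity_affine,
                    batch, identity_grid, 0.5, 0.5, 0.5)
```

Note that the grid term looks only at the minibatch average, so individual images may deform freely as long as the deformations do not drift systematically away from the identity.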
We experiment with two types of architectures; the majority of our results are obtained with a standard auto-encoder architecture, where both encoder and decoders are CNNs with standard convolution-BatchNorm-ReLU blocks. The number of filters and the texture bottleneck capacity can vary per experiment, image resolution, and dataset, as detailed in the Appendix 0.A.
Following the recent work on densely connected convolutional networks [33], we have also experimented with incorporating dense connections into our encoder and decoder architectures (with no skip connections over the bottleneck layer for latent representations). In particular, we follow the architecture of DenseNet-121, but without the $1 \times 1$ convolutional layers inside each dense block. These have been shown to better exploit larger datasets, as indicated in the quantitative analysis of unsupervised face alignment. We call this version of the deforming autoencoder Dense-DAE.
To demonstrate the properties of our deformation disentangling network, we conduct experiments on the following three datasets:
Deformed MNIST. A synthetic dataset designed specifically to explore the deformation modelling power of our network. Deformed MNIST consists of handwritten MNIST images randomly distorted using a mixture of sinusoidal waveforms.
Using these datasets we experimentally explored the ability of the unsupervised appearance-shape (or texture-deformation) disentangling network on 1) unsupervised image alignment/appearance inference; 2) learning semantically meaningful manifolds for shape and appearance; 3) decomposition into illumination intrinsics (shading, albedo); 4) unsupervised landmark detection, as detailed below. We intend to make all of the code of our system publicly available in order to facilitate the reproduction of our results.
3.1 Unsupervised Appearance Inference
We first use our network to model canonical appearance and deformation for single category objects. For this purpose, we demonstrate the results on the MNIST and MUG facial expression datasets (Fig. 5, 6, 7).
We observe that by heavily limiting the size of $Z_T$ (1 in Fig. 5 and 0 in Fig. 7), we can successfully infer a canonical appearance for such a class. In Fig. 5, all different types of handwritten digits ’3’ are aligned to a simple canonical shape. In Fig. 7, by limiting the dimension of $Z_T$ to $0$, the network learns to encode a single texture image for all expressions, and successfully distills expression-related information exclusively in the shape space. In Fig. 7-(b) we show that by interpolating the learned latent representations, we can generate meaningful shape interpolations that mimic facial expressions.
In cases where data has a multi-modal distribution exhibiting multiple different canonical appearances, e.g., multi-class MNIST digit images, learning a single appearance is less meaningful and often challenging (Fig. 6-(b)). In such cases, utilizing class information (Sec. 2.2) significantly improves the quality of multi-modal appearance learning (Fig. 6-(d)). As the network learns to classify the images implicitly in its latent space, it learns to generate a single canonical appearance for each class. Misclassified data will be decoded into an incorrect class: the image at position (2,4) in Fig. 6-(c,d) is interpreted as a 6.
We now demonstrate the effectiveness of texture inference using our network on in-the-wild human faces. Using the MAFL face dataset, we show that our network is able to align the faces to a common texture space under various poses, illumination conditions, or facial expressions (Fig. 10-(d)). The aligned textures retain the information of the input image such as lighting, gender, and facial hair, without a relevant supervision training signal. We further demonstrate the alignment on the 11k Hands dataset [37], where we align palmar images of the left hand of several subjects (Fig. 8). This property of our network is especially useful for applications such as computer graphics, where establishing correspondences (UV map) between a class of objects is important but usually difficult.
3.2 Autoencoders vs. Deforming Autoencoders
We show the ability of our network to learn meaningful deformation representations without supervision. We compare our disentangling network with a plain auto-encoder (Fig. 9). Contrary to our network which disentangles an image into a template texture and a deformation field, the auto-encoder is trained to encode all of the image in a single latent representation, i.e., the bottleneck.
We train both networks on the MAFL faces-in-the-wild dataset. To evaluate the learned representation, we conduct manifold traversal (i.e., latent representation interpolation) between two randomly sampled face images: given a source face image $I^s$ and a target image $I^t$, we first compute their latent representations. We use $Z_T(I)$ and $Z_S(I)$ to denote the latent representations in our network for $I$, and $Z_{ae}(I)$ for the latent representation learned by a plain autoencoder. We then conduct linear interpolation on $Z$, between $Z^s$ and $Z^t$: $Z^\lambda = \lambda Z^s + (1 - \lambda) Z^t$. We subsequently reconstruct the image $I^\lambda$ from $Z^\lambda$ using the corresponding decoder(s), as shown in Fig. 9.
By traversing the learned deformation representation only, we can change the shape and pose of a face while maintaining its texture (Fig. 9-(1)); interpolating the texture representation results in pose-aligned texture transfer (Fig. 9-(2)); traversing on both representations will generate a smooth deformation from one image to another (Fig. 9-(3,5,7)). Compared to the interpolation using the autoencoder (Fig. 9-(4,6,8)), which often exhibits artifacts, our traversal stays on the semantic manifold of faces and generates sharp facial features.
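A minimal sketch of such a traversal, assuming the convex combination $\lambda z_s + (1 - \lambda) z_t$ of a source and a target code and evenly spaced values of $\lambda$. The dictionary layout of the codes and the function names are hypothetical; decoding the interpolated codes back to images is left out.

```python
def interpolate(z_source, z_target, lam):
    """Convex combination: lam = 1 gives the source, lam = 0 the target."""
    return [lam * s + (1.0 - lam) * t for s, t in zip(z_source, z_target)]

def traverse_shape_only(code_source, code_target, steps):
    """Keep the source texture code fixed and move only the shape code."""
    path = []
    for k in range(steps):
        lam = 1.0 - k / (steps - 1)
        path.append({
            "texture": list(code_source["texture"]),
            "shape": interpolate(code_source["shape"],
                                 code_target["shape"], lam),
        })
    return path

source = {"texture": [0.0, 1.0], "shape": [0.0, 2.0]}
target = {"texture": [1.0, 0.0], "shape": [1.0, 4.0]}
path = traverse_shape_only(source, target, steps=3)
```

Swapping the roles of the two keys gives the texture-only traversal, and interpolating both parts at once gives the full morph between the two images.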
3.3 Intrinsic Deforming Autoencoders
Having demonstrated the disentanglement abilities of Deforming Autoencoders, we now explore the disentanglement capabilities of Intrinsic-DAE described in Sec. 2.3. Using only the $E_{\text{DAE}}$ and $E_{\text{Shading}}$ regularization losses, the Intrinsic-DAE is able to generate convincing shading and albedo estimates without direct supervision (Fig. 10-(b) to (g)). Without the “learning-to-align” property, a baseline autoencoder structure with an intrinsic decomposition design (Fig. 4-(b)) cannot decompose the image into plausible shading and albedo components (Fig. 10-(h),(i),(j)).
In addition, we show that by manipulating the learned latent representation of $S$, Intrinsic-DAE allows us to simulate illumination effects for face images, such as interpolating lighting directions (Fig. 11).
Training with reconstruction losses, autoencoder-like architectures are prone to generating smooth images which lack visual realism (Fig. 10). Inspired by the success of generative adversarial networks (GANs) [38], we follow previous work where an adversarial loss is adopted to generate visually realistic images: we train the Intrinsic-DAE with an extra adversarial loss term applied on the final output, which is added with a fixed weight to the loss function above.
In practice, we apply a PatchGAN [39, 40] as the discriminator. We found that the adversarial loss improves the visual sharpness of the reconstruction while the deformation and shading are still successfully disentangled (Fig. 12).
3.4 Unsupervised alignment evaluation
Having qualitatively analyzed the disentanglement capabilities of our networks, we now turn to quantifying their performance on the task of unsupervised image alignment. We report the performance of our face DAE’s alignment on landmark detection on face images, specifically the eyes, the nose, and the corners of the mouth. We report performance on the MAFL dataset, which contains manually annotated landmark locations for 19,000 training and 1,000 test images. In our experiments, we use a model trained on the CelebA dataset without any form of supervision to estimate deformation fields on the MAFL training set. Following the evaluation protocol of the work that we directly compare to [29], we train a landmark regressor post-hoc on these deformation fields using the provided annotations. We use landmark locations from the MAFL training set as training data for this regressor, but do not pass gradients to the Deforming Autoencoder, which thereby remains fixed to the model learned without supervision. The regressor is a 2-layer fully-connected neural network. Its inputs are flattened deformation fields, which are provided as input to a 100-dimensional hidden layer, followed by a ReLU and a 10-D output layer to predict the spatial coordinates $(x, y)$ for five landmarks corresponding to the eyes, nose, and mouth corners. We use the L1 loss as the objective function for this regression task.
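The forward pass of such a regressor can be sketched as below: one hidden layer of 100 units with a ReLU, a 10-dimensional output, and a mean absolute error. The input size `FIELD_SIZE`, the uniform initialisation, and all function names are assumptions for illustration; the training loop is omitted.

```python
import random

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Hypothetical initialisation: uniform weights, zero biases."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def regress_landmarks(flat_field, layer1, layer2):
    """Flattened field -> 100 hidden units -> 10 outputs (x, y for 5 points)."""
    hidden = relu(linear(flat_field, *layer1))
    return linear(hidden, *layer2)

def l1_loss(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

rng = random.Random(0)
FIELD_SIZE = 2 * 8 * 8  # assumption: a small 8x8 field with x and y channels
layer1 = init_layer(FIELD_SIZE, 100, rng)
layer2 = init_layer(100, 10, rng)
flat_field = [rng.random() for _ in range(FIELD_SIZE)]
prediction = regress_landmarks(flat_field, layer1, layer2)
loss = l1_loss(prediction, [0.5] * 10)
```

Only the two regressor layers would receive gradient updates in this setup; the deformation fields are treated as fixed inputs.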
In testing, we predict landmark locations using the trained regressor and the deformation fields on the MAFL test set. In Table 1 we report the mean error in landmark localization as a percentage of the inter-ocular distance. As the deformation field determines the alignment in the texture space, it serves as an effective mapping between landmark locations on the aligned texture and those on the original, unaligned faces. Hence, the mean error we report directly quantifies the quality of the (unsupervised) face alignment.
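The error measure can be computed as in the following sketch, which assumes that the first two landmarks are the eyes and that the normalising distance is taken between the ground-truth eye positions. The function name and the coordinates are made up for the example.

```python
import math

def normalized_landmark_error(pred, truth, left_eye=0, right_eye=1):
    """Mean point-to-point error as a percentage of inter-ocular distance."""
    inter_ocular = math.hypot(truth[left_eye][0] - truth[right_eye][0],
                              truth[left_eye][1] - truth[right_eye][1])
    mean_error = sum(math.hypot(p[0] - t[0], p[1] - t[1])
                     for p, t in zip(pred, truth)) / len(truth)
    return 100.0 * mean_error / inter_ocular

# Five landmarks: two eyes, nose, two mouth corners (made-up coordinates).
truth = [(0.0, 0.0), (10.0, 0.0), (5.0, 5.0), (2.0, 9.0), (8.0, 9.0)]
# Every prediction is off by exactly one pixel horizontally.
pred = [(x + 1.0, y) for x, y in truth]
error_percent = normalized_landmark_error(pred, truth)
```

Here a uniform one-pixel error with eyes ten pixels apart gives an error of 10 percent, which makes the measure independent of face size in the image.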
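[Table 1: mean landmark localization error as a percentage of the inter-ocular distance. Recoverable column labels: three configurations trained on MAFL, one on CelebA, and one on CelebA with Regressor.]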
In Table 2 we compare with the results of the best current method for semi-supervised image registration [29]. We observe that by better modeling of the deformation space we quickly bridge the gap in performance, even though we never explicitly trained to learn correspondences.
[Table 2: landmark localization error for DAE, Dense-DAE, TCDCN, and Thewlis et al.]
4 Conclusion and Future Work
In this paper we have developed deep autoencoders that can disentangle shape and appearance in latent representation space. We have shown that this method can be used for unsupervised groupwise image alignment. Our experiments with expression morphing in humans, image manipulation, such as shape and appearance interpolation, as well as unsupervised landmark localization, show the generality of our approach. We have shown that bringing images in a canonical coordinate system allows for a more extensive form of image disentangling, facilitating the estimation of decompositions into shape, albedo and shading without any form of supervision. We expect that this will lead in the future to a full-fledged disentanglement into normals, illumination, and 3D geometry.
This work was supported by a gift from Adobe, NSF grants CNS-1718014 and DMS 1737876, the Partner University Fund, and the SUNY2020 Infrastructure Transportation Security Center.
[1] Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., Abbeel, P.: Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In: NIPS. (2016)
[2] Shu, Z., Yumer, E., Hadap, S., Sunkavalli, K., Shechtman, E., Samaras, D.: Neural face editing with intrinsic image disentangling. In: CVPR. (2017)
[3] Worrall, D.E., Garbin, S.J., Turmukhambetov, D., Brostow, G.J.: Interpretable transformations with encoder-decoder networks. In: CVPR. (2017)
[4] Sengupta, S., Kanazawa, A., Castillo, C.D., Jacobs, D.: Sfsnet: Learning shape, reflectance and illuminance of faces in the wild. arXiv preprint arXiv:1712.01261 (2017)
[5] Memisevic, R., Hinton, G.E.: Learning to represent spatial transformations with factored higher-order boltzmann machines. Neural Computation (2010)
[6] Worrall, D.E., Garbin, S.J., Turmukhambetov, D., Brostow, G.J.: Harmonic networks: Deep translation and rotation equivariance. (2016)
[7] Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded image generation network for novel 3d view synthesis.
[8] Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., Ranzato, M.: Fader networks: Manipulating images by sliding attributes. CoRR abs/1706.00409 (2017)
[9] Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. In: European conference on computer vision, Springer (1998)
[10] Matthews, I., Baker, S.: Active appearance models revisited. IJCV (2004)
[11] Learned-Miller, E.G.: Data driven image models through continuous joint alignment. PAMI (2006)
[12] Kokkinos, I., Yuille, A.L.: Unsupervised learning of object deformation models. In: ICCV. (2007)
[13] Frey, B.J., Jojic, N.: Transformation-invariant clustering using the EM algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 25(1) (2003) 1–17
[14] Jojic, N., Frey, B.J., Kannan, A.: Epitomic analysis of appearance and shape. In: 9th IEEE International Conference on Computer Vision (ICCV 2003), 14-17 October 2003, Nice, France. (2003) 34–43
[15] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. CoRR abs/1506.02025 (2015)
[16] Papandreou, G., Kokkinos, I., Savalle, P.: Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In: CVPR. (2015)
[17] Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV. (2017)
[18] Neverova, N., Kokkinos, I.: Mass displacement networks. Arxiv (2017)
[19] Trigeorgis, G., Snape, P., Nicolaou, M.A., Antonakos, E., Zafeiriou, S.: Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In: Proceedings of IEEE International Conference on Computer Vision & Pattern Recognition. (2016)
[20] Güler, R.A., Trigeorgis, G., Antonakos, E., Snape, P., Zafeiriou, S., Kokkinos, I.: Densereg: Fully convolutional dense shape regression in-the-wild. CVPR (2017)
[21] A parallel computation that assigns canonical object-based frames of In: Proceedings of the 7th International Joint Conference on Artificial Intelligence, IJCAI ’81, Vancouver, BC, Canada, August 24-28, 1981. (1981) 683–685
[22] Olshausen, B.A., Anderson, C.H., Essen, D.C.V.: A multiscale dynamic routing circuit for forming size- and position-invariant object representations. Journal of Computational Neuroscience 2(1) (1995) 45–62
[23] Malsburg, C.: The correlation theory of brain function. In: Internal Report 81-2. Gottingen Max-Planck-Institute for Biophysical Chemistry. (1981)
[24] Hinton, G.E., Krizhevsky, A., Wang, S.D.: In: Artificial Neural Networks and Machine Learning - ICANN 2011 - 21st International Conference on Artificial Neural Networks, Espoo, Finland, June 14-17, 2011, Proceedings, Part I. (2011) 44–51
[25] Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. CoRR abs/1710.09829 (2017)
[26] Bristow, H., Valmadre, J., Lucey, S.: Dense semantic correspondence where every pixel is a classifier. In: ICCV. (2015)
[27] Zhou, T., Krähenbühl, P., Aubry, M., Huang, Q., Efros, A.A.: Learning dense correspondence via 3d-guided cycle consistency. In: CVPR. (2016)
[28] Gaur, U., Manjunath, B.S.: Weakly supervised manifold learning for dense semantic object correspondence. In: ICCV. (2017)
[29] Thewlis, J., Bilen, H., Vedaldi, A.: Unsupervised object learning from dense equivariant image labelling. (2017)
[30] Amit, Y., Grenander, U., Piccioni, M.: Structural image restoration through deformable templates. Journal of the American Statistical Association 86(414) (1991)
[31] Deformable templates for face recognition. Journal of Cognitive Neuroscience 3(1) (1991)
[32] Blanz, V.T., Vetter, T.: Face recognition based on fitting a 3D morphable model. 25(9) (2003) 1063–1074
[33] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2017)
[34] 11th International Workshop on Image Analysis for Multimedia Interactive Services, WIAMIS 2010, Desenzano del Garda, Italy, April 12-14, 2010, IEEE (2010)
[35] Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI. (2014) 94–108
[36] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision (ICCV). (2015)
[37] Afifi, M.: Gender recognition and biometric identification using a large dataset of hand images. CoRR abs/1711.04322 (2017)
[38] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. (2014) 2672–2680
[39] Li, C., Wand, M.: Precomputed real-time texture synthesis with markovian generative adversarial networks. In: European Conference on Computer Vision, Springer (2016) 702–716
[40] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arxiv (2016)
[41] Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Learning deep representation for face alignment with auxiliary attributes. IEEE transactions on pattern analysis and machine intelligence 38(5) (2016) 918–930
Appendix 0.A Architectural Details
0.A.1 Convolutional Encoders and Decoders
In our experiments, where input images have Nc channels (Nc is 1 for MNIST and 3 for faces), we use identical architectures for convolutional encoders and decoders.
The encoder architecture is
Conv(32)-LeakyReLU-Conv(64)-BN-LeakyReLU-Conv(128)-BN-LeakyReLU-Conv(256)-BN-LeakyReLU-Conv(Nz)-Sigmoid;
while the decoder architecture is
ConvT(256)-BN-ReLU-ConvT(128)-BN-ReLU-ConvT(64)-BN-ReLU-ConvT(32)-BN-ReLU-ConvT(32)-BN-ReLU-ConvT(Nc)-Threshold(0,1),
where
Conv(n): convolution layer with n output feature maps;
ConvT(n): transposed convolution (deconvolution) layer with n output feature maps;
BN: batch normalization layer;
Nz: latent representation dimension;
Nc: number of output image channels.
0.a.2 DenseNet-style Encoders and Decoders
For DenseNet-style architectures, we employ dense convolutional connections. The architecture for the encoder is
BN-ReLU-Conv(32)-DBE(32,6)-TBE(32,64,2)-DBE(64,12)-TBE(64,128,2)-DBE(128,24)-TBE(128,256,2)-DBE(256,16)-TBE(256,Nz,4)-Sigmoid;
whereas the architecture for the decoder is
BN-Tanh-ConvT(256)-DBD(256,16)-TBD(256,128)-DBD(128,24)-TBD(128,64)-DBD(64,12)-TBD(64,32)-DBD(32,6)-TBD(32,32)-BN-Tanh-ConvT(Nc)-Threshold(0,1),
where
DBE(n,k): a dense encoder block with k convolutions with n channels;
TBE(m,n,p): an encoder transition block of convolutions with m input channels and n output channels, which also includes a max-pooling operation of size p;
DBD(n,k): a dense decoder block with k transposed convolution operations with n channels.
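As a sanity check on the DenseNet-style encoder specification, the block parameters can be read off the chain and tallied. The sketch below is illustrative only: the helper `summarize_dense_encoder` is hypothetical, and the reported pooling factor counts only the max-pooling sizes p of the transition blocks, since any stride inside the convolutions is not specified by the notation.

```python
import re

# Dense encoder chain copied from the text above.
DENSE_ENCODER = ("BN-ReLU-Conv(32)-DBE(32,6)-TBE(32,64,2)-DBE(64,12)-"
                 "TBE(64,128,2)-DBE(128,24)-TBE(128,256,2)-DBE(256,16)-"
                 "TBE(256,Nz,4)-Sigmoid")


def summarize_dense_encoder(spec):
    """Tally dense-block convolutions and transition-block pooling."""
    dense = [(int(n), int(k))
             for n, k in re.findall(r"DBE\((\d+),(\d+)\)", spec)]
    trans = re.findall(r"TBE\((\w+),(\w+),(\d+)\)", spec)

    # Product of the max-pooling sizes p over all transition blocks.
    pooling_factor = 1
    for _, _, p in trans:
        pooling_factor *= int(p)

    # Each transition block should read the width of the dense block before
    # it, and (except the last, which outputs Nz) feed the next dense block.
    inputs_match = all(int(m) == n for (m, _, _), (n, _) in zip(trans, dense))
    outputs_match = all(int(out) == n
                        for (_, out, _), (n, _) in zip(trans[:-1], dense[1:]))

    return {
        "dense_convs": sum(k for _, k in dense),
        "pooling_factor": pooling_factor,
        "channels_consistent": inputs_match and outputs_match,
    }


summary = summarize_dense_encoder(DENSE_ENCODER)
```

Under this reading, the four dense blocks contain 6 + 12 + 24 + 16 convolutions, and the four transition blocks pool by 2, 2, 2 and 4 respectively.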
[Tables: per-layer output size and operation for the Conv Encoder and Conv Decoder, and for the Dense Conv Encoder and Dense Conv Decoder.]
Appendix 0.B Ablation Study
0.b.1 Dimension of
In this section, we show experimental results on single deformed MNIST images of the digit 3 (Figure 14) as well as in-the-wild faces (without masking) from the MAFL dataset (Figure 15) to demonstrate the effect of varying the dimension of .
[Figure columns: reconstruction, texture, warping (x), warping (y).]
0.b.2 Methods for deformation modeling
In this section, we demonstrate the effect of using different warping modules.
We first show additional comparisons between using our proposed affine + integral warping and a non-rigid warping field directly output from a convolutional decoder for non-rigid deformation modeling (Figure 16).
We also visualize the contribution of the affine and integral warping modules in our network on face images (Figure 17). The affine transformation handles global pose variation (Figure 17-(b)) but not local non-rigid deformation. Our proposed integral warping module aligns the faces in a non-rigid manner (Figure 17-(c)). Incorporating both deformation modules improves the non-rigid alignment (Figure 17-(d)).
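To make the difference between the two modules concrete, the following one-dimensional sketch contrasts an affine map with an integral-style warp. It rests on several assumptions that are not stated in this appendix: the integral warp is modeled as a cumulative sum of non-negative increments (which yields a monotone, non-folding sampling grid), the grid is normalized to [0, 1], and the affine map is applied after the integral warp. All function names are hypothetical.

```python
from itertools import accumulate


def integral_warp_1d(raw_offsets):
    """Map unconstrained offsets to a monotone sampling grid ending at 1."""
    increments = [max(0.0, r) for r in raw_offsets]  # keep steps non-negative
    total = sum(increments)
    if total == 0.0:
        raise ValueError("at least one offset must be positive")
    return [g / total for g in accumulate(increments)]


def affine_1d(grid, scale, shift):
    """Global transformation: the same scale and shift at every position."""
    return [scale * g + shift for g in grid]


def sample_1d(signal, coords):
    """Linearly interpolate `signal` at normalized coordinates in [0, 1]."""
    n = len(signal)
    out = []
    for c in coords:
        pos = min(max(c, 0.0), 1.0) * (n - 1)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return out


texture = [0.0, 10.0, 20.0, 30.0]
grid = integral_warp_1d([1.0, -2.0, 1.0, 2.0])      # local, non-rigid part
grid_affine = affine_1d(grid, scale=0.5, shift=0.25)  # global part
warped = sample_1d(texture, grid)
```

In this toy setting the affine map can only stretch or shift the whole grid, whereas the cumulative sum lets neighbouring positions move by different amounts while preserving their order.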
Appendix 0.C Latent Manifold Traversal
We provide additional results and comparisons with a plain autoencoder on traversing the learned manifolds. In addition to Figure 13 in our manuscript, we provide two more sets of results in Figure 18 and Figure 19. Compared to a plain autoencoder, our deforming autoencoder not only generates better reconstructions but also learns a better face manifold: interpolating between learned latent representations generates sharper and more realistic face images. For this experiment, we use the convolutional encoder and decoder architecture described in Sec. 0.A.1.
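A traversal of this kind can be sketched as interpolation between two latent codes, each intermediate code then being passed through the decoder. The snippet below assumes plain linear interpolation, which is an illustrative choice and not necessarily the exact scheme used for the figures; `interpolate_latents` is a hypothetical helper and the two-dimensional codes are toy values.

```python
def interpolate_latents(z_start, z_end, num_steps):
    """Return `num_steps` codes evenly spaced on the segment z_start -> z_end."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if len(z_start) != len(z_end):
        raise ValueError("latent codes must have the same dimension")
    path = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # t runs from 0.0 to 1.0 inclusive
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path


# Toy two-dimensional codes; real codes would come from the encoder.
path = interpolate_latents([0.0, 1.0], [1.0, 3.0], num_steps=5)
```

Decoding each element of `path` in order would produce one row of a traversal figure.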
Appendix 0.D Intrinsic Decomposition with DAE
In Fig. 21 we provide additional results of unsupervised intrinsic disentangling for faces in the wild using Intrinsic-DAE. Using the architecture and objective functions described in Sec. 2.3 of the main paper, the network learns to bring faces under different poses and illumination conditions, shown in Fig. 21-(a), to a canonical view, shown in Fig. 21-(d), while separating the shading, shown in Fig. 21-(b), and albedo, shown in Fig. 21-(c), components in the canonical view using two independent decoders. With the learned deformation from the deformation decoder, we can warp the aligned shading and aligned albedo back to the original view of the input image, as shown in Fig. 21-(e,f).
In Fig. 22, we provide additional results for “changing lighting direction” of a face image using Intrinsic-DAE. We show that even without explicit modeling of geometry, we can simulate smooth and reasonable lighting direction changes in the image by interpolating the learned latent representation for shading, as shown in Fig. 22-a-(4),b-(4).
For Intrinsic-DAE, we use the DenseNet architecture for the encoders and decoders (Sec. 0.A.2). The network is trained with a subset of images in the CelebA dataset. The dimensions of the latent representations are: 16 for albedo, 16 for shading, and 128 for the deformation field.
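The relighting experiment amounts to changing only the shading code while holding the albedo and deformation codes fixed. The sketch below illustrates this under assumptions that go beyond the text: the three codes are taken to be slices of one concatenated 160-dimensional vector in the order albedo, shading, deformation, and the blend is linear. The names `split_latent` and `relight` are hypothetical.

```python
# Latent sizes as stated in the text; the concatenation order is assumed.
LATENT_SIZES = {"albedo": 16, "shading": 16, "deformation": 128}


def split_latent(z):
    """Slice one concatenated code into its named parts."""
    expected = sum(LATENT_SIZES.values())
    if len(z) != expected:
        raise ValueError(f"expected {expected} values, got {len(z)}")
    parts, start = {}, 0
    for name, size in LATENT_SIZES.items():
        parts[name] = z[start:start + size]
        start += size
    return parts


def relight(source, target, t):
    """Blend only the shading code; albedo and deformation stay as in source."""
    blended = dict(source)
    blended["shading"] = [(1.0 - t) * a + t * b
                          for a, b in zip(source["shading"], target["shading"])]
    return blended


# Toy codes; real codes would come from the encoder.
source = split_latent([0.0] * 160)
target = split_latent([1.0] * 160)
halfway = relight(source, target, t=0.5)
```

Sweeping `t` from 0 to 1 and decoding each result would give a sequence in which only the lighting changes.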