GANimation: Anatomically-aware Facial Animation from a Single Image (ECCV'18 Oral) [PyTorch]
Recent advances in Generative Adversarial Networks (GANs) have shown impressive results for task of facial expression synthesis. The most successful architecture is StarGAN, that conditions GANs generation process with images of a specific domain, namely a set of images of persons sharing the same expression. While effective, this approach can only generate a discrete number of expressions, determined by the content of the dataset. To address this limitation, in this paper, we introduce a novel GAN conditioning scheme based on Action Units (AU) annotations, which describes in a continuous manifold the anatomical facial movements defining a human expression. Our approach allows controlling the magnitude of activation of each AU and combine several of them. Additionally, we propose a fully unsupervised strategy to train the model, that only requires images annotated with their activated AUs, and exploit attention mechanisms that make our network robust to changing backgrounds and lighting conditions. Extensive evaluation show that our approach goes beyond competing conditional generators both in the capability to synthesize a much wider range of expressions ruled by anatomically feasible muscle movements, as in the capacity of dealing with images in the wild.READ FULL TEXT VIEW PDF
GANimation: Anatomically-aware Facial Animation from a Single Image (ECCV'18 Oral) [PyTorch]
Masking GAN - Image attribute mask generation
A Pytorch implementation of "Unsupervised Attention-Guided Image-to-Image Translation"
A tensorflow implementation of GANimation
A GAN architecture conditioned on Action Units (AU) annotations generating facial expressions in a continuous domain.
Being able to automatically animate the facial expression from a single image would open the door to many new exciting applications in different areas, including the movie industry, photography technologies, fashion and e-commerce business, to name but a few. As Generative and Adversarial Networks have become more prevalent, this task has experienced significant advances, with architectures such as StarGAN , which is able not only to synthesize novel expressions, but also to change other attributes of the face, such as age, hair color or gender. Despite its generality, StarGAN can only change a particular aspect of a face among a discrete number of attributes defined by the annotation granularity of the dataset. For instance, for the facial expression synthesis task,  is trained on the RaFD  dataset which has only 8 binary labels for facial expressions, namely sad, neutral, angry, contemptuous, disgusted, surprised, fearful and happy.
Facial expressions, however, are the result of the combined and coordinated action of facial muscles that cannot be categorized in a discrete and low number of classes. Ekman and Friesen  developed the Facial Action Coding System (FACS) for describing facial expressions in terms of the so-called Action Units (AUs), which are anatomically related to the contractions of specific facial muscles. Although the number of action units is relatively small (30 AUs were found to be anatomically related to the contraction of specific facial muscles), more than 7,000 different AU combinations have been observed . For example, the facial expression for fear is generally produced with activations: Inner Brow Raiser (AU1), Outer Brow Raiser (AU2), Brow Lowerer (AU4), Upper Lid Raiser (AU5), Lid Tightener (AU7), Lip Stretcher (AU20) and Jaw Drop (AU26) . Depending on the magnitude of each AU, the expression will transmit the emotion of fear to a greater or lesser extent.
In this paper we aim at building a model for synthetic facial animation with the level of expressiveness of FACS, and being able to generate anatomically-aware expressions in a continuous domain, without the need of obtaining any facial landmarks . For this purpose we leverage on the recent EmotioNet dataset , which consists of one million images of facial expressions (we use 200,000 of them) of emotion in the wild annotated with discrete AUs activations 111The dataset was re-annotated with  to obtain continuous activation annotations.. We build a GAN architecture which, instead of being conditioned with images of a specific domain as in 
, it is conditioned on a one-dimensional vector indicating the presence/absence and the magnitude of each action unit. We train this architecture in an unsupervised manner that only requires images with their activated AUs. To circumvent the need for pairs of training images of the same person under different expressions, we split the problem in two main stages. First, we consider an AU-conditioned bidirectional adversarial architecture which, given a single training photo, initially renders a new image under the desired expression. This synthesized image is then rendered-back to the original pose, hence being directly comparable to the input image. We incorporate very recent losses to assess the photorealism of the generated image. Additionally, our system also goes beyond state-of-the-art in that it can handle images under changing backgrounds and illumination conditions. We achieve this by means of an attention layer that focuses the action of the network only in those regions of the image that are relevant to convey the novel expression.
As a result, we build an anatomically coherent facial expression synthesis method, able to render images in a continuous domain, and which can handle images in the wild with complex backgrounds and illumination conditions. As we will show in the results section, it compares favorably to other conditioned-GANs schemes, both in terms of the visual quality of the results, and the possibilities of generation. Figure 1 shows some example of the results we obtain, in which given one input image, we gradually change the magnitude of activation of the AUs used to produce a smile.
Generative Adversarial Networks.
GANs are a powerful class of generative models based on game theory. A typical GAN optimization consists in simultaneously training a generator network to produce realistic fake samples and a discriminator network trained to distinguish between real and fake data. This idea is embedded by the so-calledadversarial loss. Recent works [1, 9] have shown improved stability relaying on the continuous Earth Mover Distance metric, which we shall use in this paper to train our model. GANs have been shown to produce very realistic images with a high level of detail and have been successfully used for image translation [38, 10, 13], face generation [12, 28]
, super-resolution imaging[34, 18], indoor scene modeling [12, 33] and human poses editing .
Conditional GANs. An active area of research is designing GAN models that incorporate conditions and constraints into the generation process. Prior studies have explored combining several conditions, such as text descriptions [29, 39, 37] and class information [24, 23]. Particularly interesting for this work are those methods exploring image based conditioning as in image super-resolution , future frame prediction , image in-painting 10] and multi-target domain transfer .
Unpaired Image-to-Image Translation. As in our framework, several works have also tackled the problem of using unpaired training data. First attempts  relied on Markov random field priors for Bayesian based generation models using images from the marginal distributions in individual domains. Others explored enhancing GANS with Variational Auto-Encoder strategies [21, 15]. Later, several works [25, 19] have exploited the idea of driving the system to produce mappings transforming the style without altering the original input image content. Our approach is more related to those works exploiting cycle consistency to preserve key attributes between the input and the mapped image, such as CycleGAN , DiscoGAN  and StarGAN .
Face Image Manipulation.
Face generation and editing is a well-studied topic in computer vision and generative models. Most works have tackled the task on attribute editing[17, 26, 31] trying to modify attribute categories such as adding glasses, changing color hair, gender swapping and aging. The works that are most related to ours are those synthesizing facial expressions. Early approaches addressed the problem using mass-and-spring models to physically approximate skin and muscle movement . The problem with this approach is that is difficult to generate natural looking facial expressions as there are many subtle skin movements that are difficult to render with simple spring models. Another line of research relied on 2D and 3D morphings , but produced strong artifacts around the region boundaries and was not able to model illumination changes.
train highly complex convolutional networks able to work with images in the wild. However, these approaches have been conditioned on discrete emotion categories (e.g., happy, neutral, and sad). Instead, our model resumes the idea of modeling skin and muscles, but we integrate it in modern deep learning machinery. More specifically, we learn a GAN model conditioned on a continuous embedding of muscle movements, allowing to generate a large range of anatomically possible face expressions as well as smooth facial movement transitions in video sequences.
Let us define an input RGB image as , captured under an arbitrary facial expression. Every gesture expression is encoded by means of a set of action units , where each denotes a normalized value between 0 and 1 to module the magnitude of the
-th action unit. It is worth pointing out that thanks to this continuous representation, a natural interpolation can be done between different expressions, allowing to render a wide range of realistic and smooth facial expressions.
Our aim is to learn a mapping to translate into an output image conditioned on an action-unit target
, i.e., we seek to estimate the mapping. To this end, we propose to train in an unsupervised manner, using training triplets , where the target vectors are randomly generated. Importantly, we neither require pairs of images of the same person under different expressions, nor the expected target image .
This section describes our novel approach to generate photo-realistic conditioned images, which, as shown in Fig. 2, consists of two main modules. On the one hand, a generator is trained to realistically transform the facial expression in image to the desired . Note that is applied twice, first to map the input image , and then to render it back . On the other hand, we use a WGAN-GP  based critic to evaluate the quality of the generated image as well as its expression.
Generator. Let be the generator block. Since it will be applied bidirectionally (i.e., to map either input image to desired expression and vice-versa) in the following discussion we use subscripts and to indicate origin and final.
Given the image and the -vector encoding the desired expression, we form the input of generator as a concatenation , where has been represented as arrays of size .
One key ingredient of our system is to make focus only on those regions of the image that are responsible of synthesizing the novel expression and keep the rest elements of the image such as hair, glasses, hats or jewelery untouched. For this purpose, we have embedded an attention mechanism to the generator. Concretely, instead of regressing a full image, our generator outputs two masks, a color mask and attention mask . The final image can be obtained as:
where and . The mask indicates to which extend each pixel of the contributes to the output image . In this way, the generator does not need to render static elements, and can focus exclusively on the pixels defining the facial movements, leading to sharper and more realistic synthetic images. This process is depicted in Fig. 3.
Conditional Critic. This is a network trained to evaluate the generated images in terms of their photo-realism and desired expression fulfillment. The structure of resembles that of the PatchGan  network mapping from the input image to a matrix , where
represents the probability of the overlapping patchto be real. Also, to evaluate its conditioning, on top of it we add an auxiliary regression head that estimates the AUs activations in the image.
The loss function we define contains four terms, namely animage adversarial loss  with the modification proposed by Gulrajani et al.  that pushes the distribution of the generated images to the distribution of the training images; the attention loss to drive the attention masks to be smooth and prevent them from saturating; the conditional expression loss that conditions the expression of the generated images to be similar to the desired one; and the identity loss that favors to preserve the person texture identity.
. Specifically, the original GAN formulation is based on the Jensen-Shannon (JS) divergence loss function and aims to maximize the probability of correctly classifying real and rendered images while the generator tries to foul the discriminator. This loss is potentially not continuous with respect to the generator’s parameters and can locally saturate leading to vanishing gradients in the discriminator. This is addressed in WGAN by replacing JS with the continuous Earth Mover Distance. To maintain a Lipschitz constraint, WGAN-GP  proposes to add a gradient penalty for the critic network computed as the norm of the gradients with respect to the critic input.
Formally, let be the input image with the initial condition , the desired final condition, the data distribution of the input image, and the random interpolation distribution. Then, the critic loss we use is:
where is a penalty coefficient.
Attention Loss. When training the model we do not have ground-truth annotation for the attention masks . Similarly as for the color masks , they are learned from the resulting gradients of the critic module and the rest of the losses. However, the attention masks can easily saturate to which makes that , that is, the generator has no effect. To prevent this situation, we regularize the mask with a -weight penalty. Also, to enforce smooth spatial color transformation when combining the pixel from the input image and the color transformation , we perform a Total Variation Regularization over . The attention loss can therefore be defined as:
where and is the entry of . is a penalty coefficient.
Conditional Expression Loss. While reducing the image adversarial loss, the generator must also reduce the error produced by the AUs regression head on top of . In this way, not only learns to render realistic samples but also learns to satisfy the target facial expression encoded by . This loss is defined with two components: an AUs regression loss with fake images used to optimize G, and an AUs regression loss of real images used to learn the regression head on top of D. This loss is computed as:
Identity Loss. With the previously defined losses the generator is enforced to generate photo-realistic face transformations. However, without ground-truth supervision, there is no constraint to guarantee that the face in both the input and output images correspond to the same person. Using a cycle consistency loss  we force the generator to maintain the identity of each individual by penalizing the difference between the original image and its reconstruction:
To produce realistic images it is critical for the generator to model both low and high frequencies. Our PatchGan based critic already enforces high-frequency correctness by restricting our attention to the structure in local image patches. To also capture low-frequencies it is sufficient to use -norm. In preliminary experiments, we also tried replacing -norm with a more sophisticated Perceptual  loss, although we did not observe improved performance.
Full Loss. To generate the target image , we build a loss function by linearly combining all previous partial losses:
where , and are the hyper-parameters that control the relative importance of every loss term. Finally, we can define the following minimax problem:
where draws samples from the data distribution. Additionally, we constrain our discriminator to lie in , that represents the set of 1-Lipschitz functions.
Our generator builds upon the variation of the network from Johnson et al.  proposed by  as it proved to achieve impressive results for image-to-image mapping. We have slightly modified it by substituting the last convolutional layer with two parallel convolutional layers, one to regress the color mask and the other to define the attention mask
. We also observed that changing batch normalization in the generator by instance normalization improved training stability. For the critic we have adopted thePatchGan architecture of , but removing feature normalization. Otherwise, when computing the gradient penalty, the norm of the critic’s gradient would be computed with respect to the entire batch and not with respect to each input independently.
with learning rate of 0.0001, beta1 0.5, beta2 0.999 and batch size 25. We train for 30 epochs and linearly decay the rate to zero over the last 10 epochs. Every 5 optimization steps of the critic network we perform a single optimization step of the generator. The weight coefficients for the loss terms in Eq. (5) are set to , , , , . To improve stability we tried updating the critic using a buffer with generated images in different updates of the generator as proposed in  but we did not observe performance improvement. The model takes two days to train with a single GeForce® GTX 1080 Ti GPU.
This section provides a thorough evaluation of our system. We first test the main component, namely the single and multiple AUs editing. We then compare our model against current competing techniques in the task of discrete emotions editing and demonstrate our model’s ability to deal with images in the wild and its capability to generate a wide range of anatomically coherent face transformations. Finally, we discuss the model’s limitations and failure cases.
It is worth noting that in some of the experiments the input faces are not cropped. In this cases we first use a detector 222We use the face detector from https://github.com/ageitgey/face_recognition. to localize and crop the face, apply the expression transformation to that area with Eq. (1), and finally place the generated face back to its original position in the image. The attention mechanism guaranties a smooth transition between the morphed cropped face and the original image. As we shall see later, this three steps process results on higher resolution images compared to previous models. Supplementary material can be found on http://www.albertpumarola.com/research/GANimation/.
We first evaluate our model’s ability to activate AUs at different intensities while preserving the person’s identity. Figure 4 shows a subset of 9 AUs individually transformed with four levels of intensity (0, 0.33, 0.66, 1). For the case of 0 intensity it is desired not to change the corresponding AU. The model properly handles this situation and generates an identical copy of the input image for every case. The ability to apply an identity transformation is essential to ensure that non-desired facial movement will not be introduced.
For the non-zero cases, it can be observed how each AU is progressively accentuated. Note the difference between generated images at intensity 0 and 1. The model convincingly renders complex facial movements which in most cases are difficult to distinguish from real images. It is also worth mentioning that the independence of facial muscle cluster is properly learned by the generator. AUs relative to the eyes and half-upper part of the face (AUs 1, 2, 4, 5, 45) do not affect the muscles of the mouth. Equivalently, mouth related transformations (AUs 10, 12, 15, 25) do not affect eyes nor eyebrow muscles.
Fig. 5 displays, for the same experiment, the attention and color masks that produced the final result . Note how the model has learned to focus its attention (darker area) onto the corresponding AU in an unsupervised manner. In this way, it relieves the color mask from having to accurately regress each pixel value. Only the pixels relevant to the expression change are carefully estimated, the rest are just noise. For example, the attention is clearly obviating background pixels allowing to directly copy them from the original image. This is a key ingredient to later being able to handle images in the wild (see Section 6.5).
We next push the limits of our model and evaluate it in editing multiple AUs. Additionally, we also assess its ability to interpolate between two expressions. The results of this experiment are shown in Fig. 1, the first column is the original image with expression , and the right-most column is a synthetically generated image conditioned on a target expression . The rest of columns result from evaluating the generator conditioned with a linear interpolation of the original and target expressions: . The outcomes show a very remarkable smooth an consistent transformation across frames. We have intentionally selected challenging samples to show robustness to light conditions and even, as in the case of the avatar, to non-real world data distributions which were not previously seen by the model. These results are encouraging to further extend the model to video generation in future works.
We next compare our approach, against the baselines DIAT , CycleGAN , IcGAN  and StarGAN . For a fair comparison, we adopt the results of these methods trained by the most recent work, StarGAN, on the task of rendering discrete emotions categories (e.g., happy, sad and fearful) in the RaFD dataset . Since DIAT  and CycleGAN  do not allow conditioning, they were independently trained for every possible pair of source/target emotions. We next briefly discuss the main aspects of each approach:
DIAT . Given an input image and a reference image , DIAT learns a GAN model to render the attributes of domain in the image while conserving the person’s identity. It is trained with the classic adversarial loss and a cycle loss to preserve the person’s identity.
CycleGAN . Similar to DIAT , CycleGAN also learns the mapping between two domains and . To train the domain transfer, it uses a regularization term denoted cycle consistency loss combining two cycles: and .
IcGAN . Given an input image, IcGAN uses a pretrained encoder-decoder to encode the image into a latent representation in concatenation with an expression vector to then reconstruct the original image. It can modify the expression by replacing with the desired expression before going through the decoder.
StarGAN . An extension of cycle loss for simultaneously training between multiple datasets with different data domains. It uses a mask vector to ignore unspecified labels and optimize only on known ground-truth labels. It yields more realistic results when training simultaneously with multiple datasets.
Our model differs from these approaches in two main aspects. First, we do not condition the model on discrete emotions categories, but we learn a basis of anatomically feasible warps that allows generating a continuum of expressions. Secondly, the use of the attention mask allows applying the transformation only on the cropped face, and put it back onto the original image without producing any artifact. As shown in Fig. 6, besides estimating more visually compelling images than other approaches, this results on images of higher spatial resolution.
Given a single image, we next use our model to produce a wide range of anatomically feasible face expressions while conserving the person’s identity. In Fig. 7 all faces are the result of conditioning the input image in the top-left corner with a desired face configuration defined by only 14 AUs. Note the large variability of anatomically feasible expressions that can be synthesized with only 14 AUs.
As previously seen in Fig. 5, the attention mechanism not only learns to focus on specific areas of the face but also allows merging the original and generated image background. This allows our approach to be easily applied to images in the wild while still obtaining high resolution images. For these images we follow the detection and cropping scheme we described before. Fig. 8 shows two examples on these challenging images. Note how the attention mask allows for a smooth and unnoticeable merge between the entire frame and the generated faces.
We next push the limits of our network and discuss the model limitations. We have split success cases into six categories which we summarize in Fig. 9-top. The first two examples (top-row) correspond to human-like sculptures and non-realistic drawings. In both cases, the generator is able to maintain the artistic effects of the original image. Also, note how the attention mask ignores artifacts such as the pixels occluded by the glasses. The third example shows robustness to non-homogeneous textures across the face. Observe that the model is not trying to homogenize the texture by adding/removing the beard’s hair. The middle-right category relates to anthropomorphic faces with non-real textures. As for the Avatar image, the network is able to warp the face without affecting its texture. The next category is related to non-standard illuminations/colors for which the model has already been shown robust in Fig. 1. The last and most surprising category is face-sketches (bottom-right). Although the generated face suffers from some artifacts, it is still impressive how the proposed method is still capable of finding sufficient features on the face to transform its expression from worried to excited. The second case shows failures with non-previously seen occlusions such as an eye patch causing artifacts in the missing face attributes.
We have also categorized the failure cases in Fig. 9-bottom, all of them presumably due to insufficient training data. The first case is related to errors in the attention mechanism when given extreme input expressions. The attention does not weight sufficiently the color transformation causing transparencies.
The model also fails when dealing with non-human anthropomorphic distributions as in the case of cyclopes. Lastly, we tested the model behavior when dealing with animals and observed artifacts like human face features.
We have presented a novel GAN model for face animation in the wild that can be trained in a fully unsupervised manner. It advances current works which, so far, had only addressed the problem for discrete emotions category editing and portrait images. Our model encodes anatomically consistent face deformations parameterized by means of AUs. Conditioning the GAN model on these AUs allows the generator to render a wide range of expressions by simple interpolation. Additionally, we embed an attention model within the network which allows focusing only on those regions of the image relevant for every specific expression. By doing this, we can easily process images in the wild, with distracting backgrounds and illumination artifacts. We have exhaustively evaluated the model capabilities and limits in the EmotioNet  and RaFD  datasets as well as in images from movies. The results are very promising, and show smooth transitions between different expressions. This opens the possibility of applying our approach to video sequences, which we plan to do in the future.
Acknowledgments: This work is partially supported by the Spanish Ministry of Economy and Competitiveness under projects HuMoUR TIN2017-90086-R, ColRobTransp DPI2016-78957 and María de Maeztu Seal of Excellence MDM-2016-0656; by the EU project AEROARMS ICT-2014-1-644271; and by the Grant R01-DC- 014498 of the National Institute of Health. We also thank Nvidia for hardware donation under the GPU Grant Program.
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
Larsen, A.B.L., Sønderby, S.K., Larochelle, H., Winther, O.: Autoencoding beyond pixels using a learned similarity metric. In: ICML (2016)