
ICface: Interpretable and Controllable Face Reenactment Using GANs

This paper presents a generic face animator that is able to control the pose and expressions of a given face image. The animation is driven by human interpretable control signals consisting of head pose angles and Action Unit (AU) values. The control information can be obtained from multiple sources, including external driving videos and manual controls. Due to the interpretable nature of the driving signal, one can easily mix information from multiple sources (e.g. pose from one image and expression from another) and apply selective post-production editing. The proposed face animator is implemented as a two-stage neural network model that is learned in a self-supervised manner using a large video collection. The proposed Interpretable and Controllable face reenactment network (ICface) is compared to state-of-the-art neural-network-based face animation techniques in multiple tasks. The results indicate that ICface produces better visual quality while being more versatile than most of the comparison methods. The introduced model could provide a lightweight and easy-to-use tool for a multitude of advanced image and video editing tasks.


I Introduction

Creating a realistic animated video from a single face image is a challenging task. It involves both rotating the face in 3D space and synthesising detailed deformations caused by changes in the facial expression. A lightweight and easy-to-use tool for this type of manipulation would have numerous applications in the animation industry, movie post-production, virtual reality, photography technology, video editing and interactive system design, among others.

Several recent works have proposed automated face manipulation techniques. A commonly used procedure is to take a source face and a set of desired facial attributes (e.g. pose) as input and to produce a face image depicting the source identity with the desired attributes. The source face is usually specified by one or more example images depicting the selected person. The facial attributes can be represented by categorical variables, continuous parameters, or by another face image (referred to as a driving image) with the desired pose and expression.

Traditionally, face manipulation systems fit a detailed 3D face model to the source image(s), which is later used to render the manipulated outputs. If the animation is driven by another face image, that image must also be modelled to extract the necessary control parameters. Although these methods have reported impressive results (see e.g. Face2Face [25], [14]), they require complex 3D face models and considerable effort to capture all the subtle movements in the face.

Recent works [30, 32] have studied the possibility of bypassing the explicit 3D model fitting. Instead, the animation is directly formulated as an end-to-end learning problem, where the necessary model is obtained implicitly from a large data collection. Unfortunately, such an implicit model usually lacks interpretability and does not easily allow selective editing or combining driving information from multiple sources. For example, it is not possible to generate an image which has all other attributes from the driving face, except for an extra smile. Another challenge is to obtain an expression and pose representation that is independent of the driving face identity. This disentanglement problem is difficult to solve in a fully unsupervised setup, and therefore we often see identity specific information of the driving face "leaking" into the generated output. This may limit the relevant use cases to a few identities or to faces with comparable size and shape.

In this paper, we propose a generative adversarial network (GAN) based system that is able to reenact realistic emotions and head poses for a wide range of source and driving identities. Our approach further allows selective editing of the attributes (e.g. rotating the head, closing the eyelids, etc.) to produce novel face movements that were not seen in the original driving face. The proposed method offers extensive human interpretable control for obtaining more versatile and higher quality face animation than the previous approaches. Figure 1 depicts a set of example results generated by manipulating a single source image with different mixtures of driving information.

The proposed face manipulation process consists of two stages: 1) extracting the facial attributes (emotion and pose) from the given driving image, and 2) transferring the obtained attributes to the source image to produce a photorealistic animation. We implement the first step by representing the emotions and facial movements in terms of Action Units (AUs) [8] and head pose angles (pitch, yaw and roll). The AU activations [8] model specific muscle activities, and different combinations of them produce different facial expressions [8, 23]. Our main motivation is that such attributes are relatively straightforward to extract from any facial image using publicly available software, and that this representation is fairly independent of the identity specific characteristics of the face. We also believe that these parameters are capable of capturing the necessary and meaningful facial movements. This hypothesis is further supported by our experiments.

We formulate the second stage of the face animation process using a conditional generative model on the given source image and facial attribute vector. In order to eliminate the current expression of the source face, we first map the input image to a neutral state representing a frontal pose and zero AU values. Afterwards, the neutral image is mapped to the final output depicting the desired combination of driving attributes (e.g. obtained from driving faces or defined manually). As a result, we obtain a model called Interpretable and Controllable face reenactment network (ICface). The program code of our method will be publicly available upon the acceptance of the paper.

We make the following three contributions. i) We propose a data driven and GAN based face animation system that is applicable to a large number of source and driving identities. ii) The proposed system is driven by human interpretable control signals obtainable from multiple sources such as external driving videos and manual controls. iii) We demonstrate our system in multiple tasks including face reenactment, facial emotion synthesis, and multi-view image generation from single-view input. The presented qualitative results outperform several recent (and often purpose-built) state-of-the-art works.

Fig. 2: The overall architecture of the proposed model (ICface) for face animation. In the training phase, we select two frames from the same video and denote them as the source and driving image. The generator takes the encoded source image and the neutral facial attributes as input and produces an image representing the source identity with a central pose and neutral expression (neutral image). In the second phase, the generator takes the encoded neutral image and the attributes extracted from the driving image as input and produces an image representing the source identity with the desired attribute parameters. The generators are trained using multiple loss functions implemented using the discriminator D (see Section III for details). In addition, since the driving and source images have the same identity, a direct pixel based reconstruction loss can also be utilized. Note that this is assumed to be true only during training; in the test case the identities are likely to be different.

II Related Work

The proposed approach is mainly related to face manipulation methods that use deep neural networks and generative adversarial networks. Therefore, we concentrate on reviewing the most relevant literature under this scope.

II-A Face Manipulation by Generative Networks

Deep neural networks are popular tools for controlling head pose, facial expressions, eye gaze, etc. Many works [9, 31, 15, 29] approach the problem in a supervised paradigm that requires plenty of annotated training samples. While such data is expensive to obtain, the recent literature proposes several unsupervised and self-supervised alternatives [34, 5, 6]. Specifically, Mechrez et al. [17] propose a new contextual loss that compares two unpaired images and transfers attributes from one image to another. Their method was shown to be suitable for single image animation, style transfer and face puppeteering.

In [24], the face editing was approached by decomposing the face into a texture image and a deformation field. After decomposition, the deformation field could be manipulated to obtain desired facial expression, pose, etc. However, this is a difficult task, partially because the field is highly dependent on the identity of the person. Therefore, it would be hard to transfer attributes from another face image.

Finally, X2Face [30] proposes a generalized facial reenactor that is able to control the source face using a driving video, audio, or pose parameters. The transferred facial features are automatically learned from the training data and thus lack a clear visual interpretation (e.g. closed eyes or a smile). The approach may also leak some identity specific information from the driving frames to the output. X2Face seems to work best if the driving and source images are from the same person.

II-B Face Manipulation with GANs

The conditional variant of the Generative Adversarial Network (GAN) [19, 11] has received plenty of attention in image-to-image domain translation with paired data [13], unpaired data [35], or both [28]. Similar GAN based approaches are widely used for facial attribute manipulation in many supervised and unsupervised settings [7, 4, 26, 22]. The most common approach is to condition the generator on discrete attributes such as blond or black hair, happy or angry, glasses or no glasses, and so on. A recent work [23] proposed a method called GANimation that is capable of generating a wide range of continuous emotions and expressions for a given face image. They utilized the well known concept of action units (AUs) as a conditioning vector and obtained very appealing results. Similar results are achieved on portrait face images in [10, 1]. Unfortunately, unlike our method, these approaches are not suitable for full reenactment, where the head pose has to be modified. Moreover, they add the new attribute directly onto the existing expression of the source image, which can be problematic to handle.

In addition to facial expression manipulation, GANs have been applied to the full face reenactment task. For instance, CycleGAN [33] can be utilised to transform expressions and pose between an image pair (see examples in [30]). Similarly, ReenactGAN [32] is able to perform face reenactment, but only for a limited number of source identities. In contrast, our GAN based approach generalizes to a large number of source identities and driving videos. Furthermore, our method is based on interpretable and continuous attributes that are extracted from the driving video. The flexible interface allows the user to easily mix the attributes from multiple driving videos (e.g. head pose from one and expression from another) and to edit them manually at will (e.g. open the eyes or add a smile). In the extreme case, the attributes could be defined without any driving video at all.

III Method

The goal of our method is to animate a given source face in accordance with a facial attribute vector that consists of the head pose parameters p and the action unit (AU) activations a. More specifically, the head pose is determined by three angles (pitch, yaw, and roll) and the AUs represent the activations of 17 facial muscles [8]. In total, the attribute vector consists of 20 values determining the pose and the expression of a face. In the following, we briefly outline the workflow of our method. The subsequent sections and Figure 2 provide further details of the architecture and training procedure. The specific implementation details are found in the supplementary material.

In the first stage, we encode the input image (size W×H×3) into an equal-size feature representation (W×H×3) using an encoder network. The obtained features are then concatenated with the neutral facial attribute vector, in which the head pose is set to the central pose and all AU activations are zero. This is done by first spatially replicating the attribute vector and then concatenating the replicated attributes (W×H×20) channel-wise with the feature representation. The resulting representation (W×H×23) is subsequently fed to the neutralisation network that aims at producing a frontal face image (W×H×3) depicting the source identity with a neutral facial expression.
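To make the replication-and-concatenation step concrete, the following PyTorch-style sketch shows one way it could be implemented; the tensor names, batch dimension and the 128×128 resolution are illustrative assumptions rather than details taken from the paper.

```python
import torch

def concat_attributes(features: torch.Tensor, attrs: torch.Tensor) -> torch.Tensor:
    """Spatially replicate a per-image attribute vector and concatenate it
    channel-wise with an image-sized feature map.

    features: (B, 3, H, W) feature tensor produced by the encoder
    attrs:    (B, 20) attribute vector (3 pose angles + 17 AU activations)
    returns:  (B, 23, H, W) tensor fed to the neutraliser / generator
    """
    B, _, H, W = features.shape
    attrs_map = attrs.view(B, -1, 1, 1).expand(B, attrs.shape[1], H, W)
    return torch.cat([features, attrs_map], dim=1)

# Example with the neutral attribute vector (central pose, zero AUs);
# the exact numerical values of the neutral pose are placeholders.
features = torch.randn(2, 3, 128, 128)
neutral_attrs = torch.zeros(2, 20)
combined = concat_attributes(features, neutral_attrs)
print(combined.shape)   # torch.Size([2, 23, 128, 128])
```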

In the second stage, we encode the obtained neutral (source) face image using the same encoder network as for the input image. The obtained features are concatenated with the driving attribute vector that determines the desired output pose and AU values. The concatenation is done in the same fashion as in the first stage. In our experiments, we used OpenFace [2, 3] to extract the pose and AUs when necessary. The concatenated result is passed to the generator network, which produces the final animated output (W×H×3) depicting the original source identity with the desired facial attributes.
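For illustration, a 20-dimensional control vector of this kind could be assembled from an OpenFace 2.0 output file roughly as follows; the column names follow common OpenFace 2.0 conventions, and the helper itself is a hypothetical sketch rather than the exact pipeline used in the paper.

```python
import numpy as np
import pandas as pd

# 17 action-unit intensity columns produced by OpenFace 2.0.
AU_COLS = ["AU01_r", "AU02_r", "AU04_r", "AU05_r", "AU06_r", "AU07_r",
           "AU09_r", "AU10_r", "AU12_r", "AU14_r", "AU15_r", "AU17_r",
           "AU20_r", "AU23_r", "AU25_r", "AU26_r", "AU45_r"]
POSE_COLS = ["pose_Rx", "pose_Ry", "pose_Rz"]   # head rotation (pitch, yaw, roll)

def attribute_vector(openface_csv: str, frame: int = 0) -> np.ndarray:
    """Build a 20-D control vector (3 pose angles + 17 AU activations)
    for one frame of an OpenFace output file."""
    df = pd.read_csv(openface_csv)
    df.columns = [c.strip() for c in df.columns]   # OpenFace pads column names
    row = df.iloc[frame]
    pose = row[POSE_COLS].to_numpy(dtype=np.float32)
    aus = row[AU_COLS].to_numpy(dtype=np.float32)
    return np.concatenate([pose, aus])             # shape (20,)
```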

Fig. 3: Qualitative results for the face reenactment on VoxCeleb [21] test set. The images illustrate the reenactment result for four different source identities. For each source, the results correspond to: ICface (first row), DAE [24] (second row), X2Face [30] (third row), and the driving frames (last row). The performance differences are best illustrated in cases with large differences in pose and face shape between source and driving frames.

III-A Architecture

The architecture of our model consists of four different sub-networks: image encoder, face neutraliser, face generator, and discriminator. In the following, we will briefly explain the structure of each component.

Image Encoder

It was found to be beneficial to convert the input face image into a feature representation before feeding it to the generator network. This representation is obtained using an encoder network that maps the input image into an equal-size feature tensor. The encoder has an hourglass architecture consisting of convolution and deconvolution layers with normalization and activation layers. We note that the encoder is used in two places during the animation process, and these instances share the same parameters.

Neutralizer

The neutralizer is a generator network that transforms the feature representation into a canonical face that depicts the same identity as the input and has a central pose with a neutral expression. The network consists of strided convolutions, residual blocks and deconvolution layers. The overall structure is inspired by the generator architecture of CycleGAN [33].

Generator

The generator network transforms the feature representation of the neutral face into the final reenacted output image. The output image is expected to depict the source identity with the pose and expression defined by the driving attribute vector. The architecture of the network is similar to that of the neutralizer.

Discriminator

The discriminator network performs three tasks simultaneously: i) it evaluates the realism of the neutral and reenacted images through the realism branch $D_r$; ii) it predicts the facial attributes (head pose and AUs) through the attribute branch $D_a$; iii) it classifies the identity of the generated face with respect to the training identities through a fully connected layer with a softmax activation, denoted $D_i$. The branches $D_r$ and $D_a$ consist of a convolution block with a sigmoid activation. The overall architecture of the discriminator is similar to PatchGAN [33], consisting of strided convolution and activation layers. The same discriminator with identical weights is used for both the neutral and the reenacted image.

(a) Expression Reenactment
(b) Pose Reenactment
(c) Mixed Reenactment
Fig. 4: Results for selective editing of facial attributes in face reenactment. (a-b) illustrate emotion and pose reenactment for various source images (extreme left column) and driving images (top row). (c) illustrates mixed reenactment by combining various attributes from source (extreme left) and two driving images (top row). The proposed method produces good quality results and provides control over the animation process, unlike other methods. More results are in the supplementary material.

III-B Training the Model

Following [30], we train our model using the VoxCeleb dataset [21], which contains short clips extracted from interview videos. Furthermore, Nagrani et al. [20] provide face detections and face tracks for the VoxCeleb dataset, which we utilise in our work. As in [30], we extract two frames from the same face track and feed one of them to our model as the source image. Then, we extract the pose and AUs from the second frame and feed them into our model as the driving attributes. Since both frames originate from the same face track and depict the same identity, the output of our model should be identical to the second frame, which can therefore be treated as a pixel-wise ground truth during training. We describe the individual loss functions in the following paragraphs.
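A minimal sketch of this sampling scheme is shown below; the directory layout, file naming and the attribute loader are placeholders, since the paper does not specify how the preprocessed VoxCeleb face tracks are stored.

```python
import random
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms as T

class FaceTrackPairs(Dataset):
    """Samples (source frame, driving frame, driving attributes) triplets in
    which both frames come from the same face track (hypothetical layout:
    one directory of frames per track)."""

    def __init__(self, track_dirs, attr_loader, image_size=128):
        self.track_dirs = list(track_dirs)    # one directory per face track
        self.attr_loader = attr_loader        # callable: frame path -> 20-D tensor
        self.tf = T.Compose([T.Resize((image_size, image_size)), T.ToTensor()])

    def __len__(self):
        return len(self.track_dirs)

    def __getitem__(self, idx):
        frames = sorted(Path(self.track_dirs[idx]).glob("*.jpg"))
        src_path, drv_path = random.sample(frames, 2)   # two frames, same identity
        source = self.tf(Image.open(src_path).convert("RGB"))
        driving = self.tf(Image.open(drv_path).convert("RGB"))
        drv_attrs = self.attr_loader(drv_path)   # pose + AUs of the driving frame
        # The driving frame doubles as the pixel-wise ground truth during training.
        return source, driving, drv_attrs
```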

Adversarial Loss

The adversarial loss is a crucial component for obtaining photorealistic output images. The generator maps the feature representation into the domain of real images. If $x$ is a sample from the training set of real images and $\hat{x}_d$ is the reenacted output, then the realism branch $D_r$ of the discriminator has to distinguish between $x$ and $\hat{x}_d$. The corresponding loss function can be expressed as:

$$\mathcal{L}_{adv} = \mathbb{E}_{x}\big[\log D_r(x)\big] + \mathbb{E}\big[\log\big(1 - D_r(\hat{x}_d)\big)\big] \qquad (1)$$

A similar loss function is also formulated for the generated neutral image $\hat{x}_n$.

Facial attribute reconstruction loss

The generators aim at producing photorealistic face images, but they need to do so in accordance with the given facial attribute vectors (the neutral attributes for the neutraliser and the driving attributes for the generator). To this end, we extend the discriminator with the branch $D_a$ that regresses the facial attributes from the generated images and compares them to the given target values. The corresponding loss function is expressed as:

$$\mathcal{L}_{att} = \big\| D_a(\hat{x}_d) - f_d \big\|_2^2 \qquad (2)$$

where $I_d$ is the driving image with attributes $f_d$; an analogous term with the neutral attributes is used for the neutral image $\hat{x}_n$.

Identity classification loss

The goal of our system is to generate an output image retaining the identity of the source person. To encourage this, the identity branch $D_i$ of the discriminator aims at determining the identity of the generated output. The corresponding cross-entropy loss function is defined as:

$$\mathcal{L}_{id} = -\log D_i\big(y_s \mid \hat{x}_d\big) \qquad (3)$$

where $y_s$ is the identity label of the source person among the training identities.

Reconstruction loss

Due to the specific training procedure described above, we have access to a pixel-wise ground truth for the output. We take advantage of this by applying an L1 loss between the output and the ground truth. Furthermore, we stabilize the training by using an image generated with the neutral attributes as a pseudo ground truth for the neutral image. The corresponding loss function is defined as

$$\mathcal{L}_{rec} = \big\| \hat{x}_d - I_d \big\|_1 + \big\| \hat{x}_n - \tilde{I}_n \big\|_1 \qquad (4)$$

where $\tilde{I}_n$ denotes the pseudo ground truth for the neutral image.

The complete loss function:

The full objective function of the proposed model is obtained as a weighted combination of the individual loss functions defined above. The corresponding full loss function, with $\lambda_{att}$, $\lambda_{id}$ and $\lambda_{rec}$ as regularization parameters, is expressed as

$$\mathcal{L} = \mathcal{L}_{adv} + \lambda_{att}\,\mathcal{L}_{att} + \lambda_{id}\,\mathcal{L}_{id} + \lambda_{rec}\,\mathcal{L}_{rec} \qquad (5)$$
Fig. 5: Results for multi-view face generation from a single view. In each block, the first row corresponds to CR-GAN [26] and the second row corresponds to ICface. Note that each block contains the same identity with different crop sizes, as the two methods are trained with different image crops. The proposed architecture produces semantically consistent facial rotations by preserving the identity and expressions better than CR-GAN [26]. The last two rows correspond to multi-view images generated by ICface by varying pitch and roll, respectively, which is not possible with CR-GAN [26].

IV Experiments

This section presents the experimental evaluation of the proposed ICface method. In all experiments, we use a single ICface model that is trained using the publicly available VoxCeleb video dataset [21]. The video frames are extracted using the preprocessing techniques presented in [20] and resized to a fixed resolution for further processing. We used 75% of the data for training and the rest for validation and testing. Each component of the attribute vector is normalized to a fixed interval. The neutral attribute vector contains the central head pose parameters and zeros for the AUs. More architectural and training details are provided in the supplementary material.

IV-A Face Reenactment

In face reenactment, the task is to generate a realistic face image depicting a given source person with the same pose and expression as in a given driving face image. The source identity is usually specified by one or more example face images (one in our case). Figure 3 illustrates several reenactment outputs using different source and driving images. We compare our results with two recent methods: X2Face [30] and DAE [24]. We further refer to the supplementary material for additional reenactment examples.

The results indicate that our model is able to retain the source identity relatively well. Moreover, we see that the facial expression of the driving face is reproduced with decent accuracy. Recall that our model transfers the pose and expression information via three angles and 17 AU activations. Our hypothesis was that these values are adequate for representing the necessary facial attributes. This assumption is further supported by the results in Figure 3. Another important aspect of using pose angles and AUs is that they are independent of the identity. For this reason, the driving identity is not "leaking" into the output face (see also the results of the other experiments). Moreover, our model neutralises the source image from its prior pose and expression, which helps in reenacting new facial movements from the driving images. We assess this further in Section IV-C.

Comparison to X2Face [30]: X2Face disentangles the identity (texture and shape of the face) and the facial movements (expressions and head pose) using the Embedding and Driving networks, respectively. Both of these models are trained in an unsupervised manner, which makes it difficult to prevent all movement and identity leakage through the respective networks. These types of artefacts are visible in some of the examples in Figure 3. We further note that the X2Face results are produced using three source images, whereas our model uses only a single source. Additionally, the adversarial training of our system seems to lead to more vivid and sharp results than X2Face.

Comparison to DAE [24]:

DAE proposes a special autoencoder architecture to disentangle the appearance and facial movements into a texture image and deformation fields. We trained their model on the VoxCeleb dataset [21] using the publicly available code from the original authors. For reenactment, we first decomposed both the source and driving images into the corresponding appearances and deformations. Then we reconstructed the output face using the appearance of the source image and the deformation of the driving image. The obtained results are presented in Figure 3.

The DAE is clearly capable of replicating the local expression of the driving face. However, it often fails to transfer the head pose and identity accurately. The head pose related artefacts are best observed when the pose difference between the source and driving faces is large. These challenges might be related to the fact that the deformation field is not free from identity specific characteristics.

IV-B Controllable Face Reenactment

Pure face reenactment animates the source face by copying the pose and expression from the driving face. In practice, it might be difficult to obtain any single driving image that contains all the desired attributes. The challenge is further emphasised if one aims at generating an animated video. Moreover, even if one could record the proper driving frames, one may still wish to perform post-production editing on the result. This type of editing is hard to implement with previous methods like X2Face [30] and DAE [24], since the facial representation is learned implicitly and lacks clear interpretability. Instead, the head pose angles and AUs utilised in our approach provide a human interpretable and easy-to-use interface for selective editing. Moreover, this representation allows mixing attributes from different driving images in a controlled way, as sketched below. Figures 1 and 8 illustrate multiple examples where we have mixed the driving information from different sources. The supplementary material contains further example cases.
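Because the control signal is simply a 20-dimensional vector (three pose angles followed by 17 AU activations), mixing attributes amounts to splicing vectors. The sketch below illustrates the idea under the attribute ordering assumed in the earlier parsing example; the AU12 index and the "smile" edit are hypothetical.

```python
import numpy as np

def mix_attributes(pose_driver: np.ndarray, expr_driver: np.ndarray,
                   smile_boost: float = 0.0) -> np.ndarray:
    """Combine the head pose from one 20-D attribute vector with the AUs from
    another, optionally editing a single AU by hand (AU12, the lip corner
    puller, is used here as an illustrative 'smile' control)."""
    mixed = np.empty(20, dtype=np.float32)
    mixed[:3] = pose_driver[:3]      # pitch, yaw, roll from the first driver
    mixed[3:] = expr_driver[3:]      # 17 AU activations from the second driver
    mixed[3 + 8] += smile_boost      # AU12 at index 8 of the AU block (assumed ordering)
    return mixed
```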

Fig. 6: Results for manipulating emotion in face images. For each source image, the first row is generated using ICface, the second row using GANimation [23], and the third row contains the driving images. As ICface first neutralises the source image, it is evident that it produces better emotion reenactment when the source has an initial expression (first and third row).

IV-C Facial Expression Manipulation

In this experiment, we concentrate on assessing how the proposed model can be used to transfer the expression from the driving face to the source face, while keeping the head pose fixed. We compare our results to GANimation [23], which is a purpose-built method for manipulating the facial expression (i.e. it is not able to modify the pose). Similarly to us, they utilise action units to define the expression. Figure 6 illustrates example results for the proposed ICface and GANimation. The latter method seems to have challenges when the source face has an intense prior expression that is different from the driving expression. In contrast, our model neutralises the source face before applying the driving attributes. This approach seems to lead to better performance in cases where the source has an intense expression. Although GANimation is trained for a more restricted task, our model produces comparable or better results.

IV-D Multi-view Face Generation

Another interesting aspect of face manipulation is the ability to change the viewpoint of a given source face. Previous works have studied this as an independent problem. We compare the performance of our model in this task with the recently proposed Complete-Representation GAN (CR-GAN) [26]. CR-GAN is a purpose-built method for producing multi-view face images from a single view input. The model is further restricted to consider only the yaw angle of the face. The results in Figure 5 were obtained using the CR-GAN implementation from the original authors. Their implementation was trained on the CelebA dataset [16], and therefore we used the CelebA test set to produce these examples. We note that we did not re-train or fine-tune our model on CelebA. The results indicate that our model is able to perform facial rotation with relatively small semantic distortions. Moreover, the last two rows of Figure 5 depict rotation along the pitch and roll axes, which is not achievable with CR-GAN. We believe that our two-stage approach is well suited for this type of rotation task.

Fig. 7: Results for generating a neutral face from a single source image. The proposed method produces good image quality even with extreme head poses (third row).

IV-E Identity Disentanglement from Face Attributes

Finally, in Figure 7, we demonstrate the performance of our neutraliser network. The neutraliser is trained to produce a template face, with a frontal pose and no expression, from a single source image. We believe that the effective neutralisation of the input face is one of the key reasons why our system produces high quality results in multiple tasks. Figure 7 also illustrates the neutral images (or texture images) produced by the baseline methods. One of these is DR-GAN [27], which is a purpose-built face frontalisation method (i.e. it does not change the expression). ICface successfully neutralises the face while keeping the identity intact, even if the source has an extreme pose and expression.

V Conclusion

In this paper, we proposed a generic face animator that is able to control the pose and expression of a given face image. The animation was controlled using human interpretable attributes consisting of head pose angles and action unit activations. The selected attributes enabled selective manual editing as well as mixing the control signal from several different sources (e.g. multiple driving frames). One of the key ideas in our approach was to transform the source face into a canonical representation that acts as a template for the subsequent animation steps. Our model was demonstrated in numerous face animation tasks including face reenactment, selective expression manipulation, 3D face rotation, and face frontalisation. The results were compared with several recent methods, some of which were purpose-built for a single application. In the experiments, the proposed ICface model was able to produce high quality results for a variety of source and driving identities. Future work includes increasing the resolution of the output images and further improving the performance with extreme poses that have only a few training samples.

References

  • [1] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and M. F. Cohen. Bringing portraits to life. ACM Trans. Graph., 36(6):196:1–196:13, Nov. 2017.
  • [2] T. Baltrusaitis, A. Zadeh, Y. Lim, and L. Morency. OpenFace 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2018), pages 59–66, Los Alamitos, CA, USA, May 2018. IEEE Computer Society.
  • [3] T. Baltrušaitis, M. Mahmoud, and P. Robinson. Cross-dataset learning and person-specific normalisation for automatic action unit detection. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 06, pages 1–6, May 2015.
  • [4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 2180–2188, USA, 2016. Curran Associates Inc.
  • [5] Y.-C. Chen, H. Lin, M. Shu, R. Li, X. Tao, X. Shen, Y. Ye, and J. Jia. Facelet-bank for fast portrait manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3541–3549, 2018.
  • [6] J. S. Chung, A. Jamaludin, and A. Zisserman. You said that? In British Machine Vision Conference, 2017.
  • [7] H. Ding, K. Sricharan, and R. Chellappa. Exprgan: Facial expression editing with controllable expression intensity. AAAI, 2018.
  • [8] P. Ekman and W. Friesen. Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto, 1978.
  • [9] Y. Ganin, D. Kononenko, D. Sungatullina, and V. Lempitsky. Deepwarp: Photorealistic image resynthesis for gaze manipulation. In European Conference on Computer Vision, pages 311–326. Springer, 2016.
  • [10] J. Geng, T. Shao, Y. Zheng, Y. Weng, and K. Zhou. Warp-guided gans for single-photo facial animation. ACM Trans. Graph., 37(6):231:1–231:12, Dec. 2018.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [12] C. Igel and M. Hüsken. Improving the rprop learning algorithm, 2000.
  • [13] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 2017.
  • [14] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt. Deep Video Portraits. ACM Transactions on Graphics 2018 (TOG), 2018.
  • [15] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In Advances in neural information processing systems, pages 2539–2547, 2015.
  • [16] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  • [17] R. Mechrez, I. Talmi, and L. Zelnik-Manor. The contextual loss for image transformation with non-aligned data. In ECCV, 2018.
  • [18] L. Mescheder, S. Nowozin, and A. Geiger. Which training methods for GANs do actually converge? In International Conference on Machine Learning (ICML), 2018.
  • [19] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [20] A. Nagrani, S. Albanie, and A. Zisserman. Seeing voices and hearing faces: Cross-modal biometric matching. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [21] A. Nagrani, J. S. Chung, and A. Zisserman. Voxceleb: a large-scale speaker identification dataset. In INTERSPEECH, 2017.
  • [22] G. Perarnau, J. van de Weijer, B. Raducanu, and J. M. Álvarez. Invertible Conditional GANs for image editing. In NIPS Workshop on Adversarial Training, 2016.
  • [23] A. Pumarola, A. Agudo, A. Martinez, A. Sanfeliu, and F. Moreno-Noguer. Ganimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [24] Z. Shu, M. Sahasrabudhe, R. Alp Güler, D. Samaras, N. Paragios, and I. Kokkinos. Deforming autoencoders: Unsupervised disentangling of shape and appearance. In Computer Vision – ECCV 2018, pages 664–680. Springer International Publishing, 2018.
  • [25] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner. Face2Face: Real-time Face Capture and Reenactment of RGB Videos. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2016.
  • [26] Y. Tian, X. Peng, L. Zhao, S. Zhang, and D. N. Metaxas. Cr-gan: Learning complete representations for multi-view generation. In IJCAI, 2018.
  • [27] L. Tran, X. Yin, and X. Liu. Disentangled representation learning GAN for pose-invariant face recognition. In Proceedings of IEEE Computer Vision and Pattern Recognition, Honolulu, HI, July 2017.
  • [28] S. Tripathy, J. Kannala, and E. Rahtu. Learning image-to-image translation using paired and unpaired training samples. In ACCV, 2018.
  • [29] P. Upchurch, J. R. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Q. Weinberger. Deep feature interpolation for image content changes. In CVPR, pages 6090–6099, 2017.
  • [30] O. Wiles, A. Koepke, and A. Zisserman. X2face: A network for controlling face generation by using images, audio, and pose codes. In European Conference on Computer Vision, 2018.
  • [31] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Interpretable transformations with encoder-decoder networks. In The IEEE International Conference on Computer Vision (ICCV), volume 4, 2017.
  • [32] W. Wu, Y. Zhang, C. Li, C. Qian, and C. C. Loy. Reenactgan: Learning to reenact faces via boundary transfer. In ECCV, 2018.
  • [33] R. Xu, Z. Zhou, W. Zhang, and Y. Yu. Face transfer with generative adversarial network. arXiv preprint arXiv:1710.06090, 2017.
  • [34] R. A. Yeh, Z. Liu, D. B. Goldman, and A. Agarwala. Semantic facial expression editing using autoencoded flow. 2016.
  • [35] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.

VI Supplementary Material

In this material, we provide the training details and the qualitative results of the proposed ICface model for controllable face reenactment, as discussed in Sections III and IV, respectively. The program code will be made publicly available upon the acceptance of the paper to help replicate the results.

It is highly recommended to watch the supplementary video for additional results.

Training Details:

Our model is trained to optimize the loss function in eq. (5) with regularization parameter values of 200, 10 and 40, as described in Section III. We have used the Dirac GAN [18] variant for adversarial training with a regularization parameter of 0.0005 for the gradient penalty on real data. The RMSprop optimizer [12] is used with an initial learning rate of 0.0001 for all our experiments. Finally, we trained our network for a total of 15 epochs with a batch size of 16 and then evaluated it on the test set.
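For illustration, the training objective and optimiser settings above could be wired together roughly as in the following PyTorch sketch. The mapping of the weights 200, 10 and 40 to individual loss terms is not stated in the text, so the assignment below is an assumption, and the adversarial term is written with a standard non-saturating BCE formulation.

```python
import torch
import torch.nn.functional as F

# Placeholder loss weights: the paper lists 200, 10 and 40 but does not state
# which term each value belongs to, so this assignment is an assumption.
LAMBDA_REC, LAMBDA_ATT, LAMBDA_ID = 200.0, 10.0, 40.0

def total_loss(d_fake_logits, pred_attrs, target_attrs,
               id_logits, id_labels, generated, ground_truth):
    """Weighted combination of the four ICface training losses (cf. eq. (5)),
    written with standard PyTorch primitives as an illustration."""
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))   # fool the realism branch
    att = F.mse_loss(pred_attrs, target_attrs)           # facial attribute regression
    idn = F.cross_entropy(id_logits, id_labels)          # identity classification
    rec = F.l1_loss(generated, ground_truth)             # pixel-wise reconstruction
    return adv + LAMBDA_ATT * att + LAMBDA_ID * idn + LAMBDA_REC * rec

# Optimiser settings reported above: RMSprop with learning rate 1e-4;
# generator_params is a placeholder for the parameters of the encoder and generators.
# optimizer = torch.optim.RMSprop(generator_params, lr=1e-4)
```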



Controllable Face Reenactment:

In face reenactment, it is impossible to generate expressions and facial movements that are not present in the driving images. ICface can generate a diverse set of expressions and poses irrespective of the nature of the driving images. Hence, it is possible to selectively combine the poses and expressions from multiple frames to animate the source image. This kind of controllable editing of the facial animation is not possible with other contemporary approaches. Figure 8 presents qualitative results for controllable editing in face reenactment by the proposed model.

VII Architecture

Our architecture is based on CycleGAN [35] and the details are as follows:

Architectures of the Generators:

c7s1-128,d256,d512,R512,R512,R512,R512,R512,
R512,u256,u128,c7s1-3.

where c7s1-k stands for a 7×7 Convolution-BatchNorm-ReLU layer with k filters and stride 1, dk denotes a Convolution-BatchNorm-ReLU layer with k filters and stride 2, and Rk signifies a residual block built from convolutions, where k stands for the number of filters in both cases. The uk stands for a fractional-strided Convolution-BatchNorm-ReLU layer with k filters and stride 1/2. Finally, we have used reflection padding as suggested in [35].
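A possible PyTorch realisation of this generator specification is sketched below; the Tanh output layer, the exact kernel sizes and the 23-channel, 128×128 input are assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn as nn

def c7s1(in_ch, out_ch, final=False):
    """7x7 Convolution-BatchNorm-ReLU with stride 1 and reflection padding.
    The final layer uses Tanh instead of BatchNorm-ReLU (an assumption
    borrowed from the CycleGAN generator)."""
    layers = [nn.ReflectionPad2d(3), nn.Conv2d(in_ch, out_ch, 7, stride=1)]
    layers += [nn.Tanh()] if final else [nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return layers

def down(in_ch, out_ch):   # dk: Convolution-BatchNorm-ReLU with stride 2
    return [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]

def up(in_ch, out_ch):     # uk: fractionally strided Convolution-BatchNorm-ReLU
    return [nn.ConvTranspose2d(in_ch, out_ch, 3, stride=2, padding=1, output_padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]

class ResBlock(nn.Module):  # Rk: residual block with k filters
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(ch, ch, 3), nn.BatchNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

# c7s1-128, d256, d512, six R512 blocks, u256, u128, c7s1-3, with a 23-channel
# input (3 encoded-image channels + 20 replicated attributes).
generator = nn.Sequential(
    *c7s1(23, 128), *down(128, 256), *down(256, 512),
    *[ResBlock(512) for _ in range(6)],
    *up(512, 256), *up(256, 128), *c7s1(128, 3, final=True))

print(generator(torch.randn(1, 23, 128, 128)).shape)   # torch.Size([1, 3, 128, 128])
```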

Architectures of the Discriminator:

c256-c256-c512-c512-c1024-c1024-c2048-c2048-c128-[c1 or Fc939 or Fc20]

where ck denotes a Convolution-ReLU layer with k filters and Fck stands for a fully connected layer with k output neurons. The c128 layer acts as a feature extractor, which is then fed to c1 for the adversarial loss, to Fc20 for the facial attribute reconstruction loss, and to Fc939 for the identity classification loss. Note that we have not used a ReLU layer between two consecutive layers with the same k values.
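The following sketch illustrates one possible reading of this specification, with a shared convolutional trunk and the three heads described above; the kernel sizes, the stride pattern and the use of raw logits (rather than an explicit sigmoid) are assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Shared trunk (c256-...-c2048-c128) with three heads: c1 for the
    adversarial (realism) output, Fc20 for attribute regression and Fc939
    for identity classification. Strides and kernel sizes are placeholders."""

    def __init__(self, num_ids=939, num_attrs=20):
        super().__init__()
        widths = [256, 256, 512, 512, 1024, 1024, 2048, 2048, 128]
        layers, in_ch = [], 3
        for i, w in enumerate(widths):
            stride = 2 if w != in_ch else 1   # downsample when the width changes (assumed)
            layers.append(nn.Conv2d(in_ch, w, 3, stride=stride, padding=1))
            # No ReLU between two consecutive layers with the same width.
            if i + 1 >= len(widths) or widths[i + 1] != w:
                layers.append(nn.ReLU(inplace=True))
            in_ch = w
        self.trunk = nn.Sequential(*layers)
        self.realism = nn.Conv2d(128, 1, 3, padding=1)                       # c1
        self.attrs = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_attrs))   # Fc20
        self.identity = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_ids))  # Fc939

    def forward(self, x):
        h = self.trunk(x)
        return self.realism(h), self.attrs(h), self.identity(h)

disc = Discriminator()
real, attrs, ident = disc(torch.randn(1, 3, 128, 128))
print(real.shape, attrs.shape, ident.shape)   # patch map, (1, 20), (1, 939)
```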

Architectures of the Image Encoder:

c64-c64-c128-c3

where ck stands for a Convolution-BatchNorm-ReLU layer with k filters, stride 1 and padding 1. Note that we have not used the BatchNorm-ReLU layers in the first and last convolution layers.
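A minimal sketch of this encoder, assuming 3×3 kernels, is given below; the input resolution is a placeholder.

```python
import torch
import torch.nn as nn

# c64-c64-c128-c3: 3x3 convolutions with stride 1 and padding 1, so the output
# has the same spatial size as the input. BatchNorm-ReLU is omitted on the
# first and last layers, as stated above.
encoder = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=1, padding=1),                         # c64 (no norm/act)
    nn.Conv2d(64, 64, 3, 1, 1), nn.BatchNorm2d(64), nn.ReLU(True),    # c64
    nn.Conv2d(64, 128, 3, 1, 1), nn.BatchNorm2d(128), nn.ReLU(True),  # c128
    nn.Conv2d(128, 3, 3, 1, 1),                                       # c3 (no norm/act)
)

x = torch.randn(1, 3, 128, 128)
print(encoder(x).shape)   # torch.Size([1, 3, 128, 128]) -- same size as the input
```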

(a) Expression Reenactment
(b) Pose Reenactment
(c) Mixed Reenactment
Fig. 8: Results for selective editing of facial attributes in face reenactment. (a-b) illustrate emotion and pose reenactment for various source images (extreme left column) and driving images (top row). (c) illustrates mixed reenactment by combining various attributes from source (extreme left) and two driving images (top row). The proposed method produces good quality results and provides control over the animation process, unlike other methods.