All-In-One: Facial Expression Transfer, Editing and Recognition Using A Single Network

11/16/2019 ∙ by Kamran Ali, et al. ∙ University of Central Florida

In this paper, we present a unified architecture known as Transfer-Editing and Recognition Generative Adversarial Network (TER-GAN) which can be used: 1. to transfer facial expressions from one identity to another identity, known as Facial Expression Transfer (FET), 2. to transform the expression of a given image to a target expression, while preserving the identity of the image, known as Facial Expression Editing (FEE), and 3. to recognize the facial expression of a face image, known as Facial Expression Recognition (FER). In TER-GAN, we combine the capabilities of generative models to generate synthetic images, while learning important information about the input images during the reconstruction process. More specifically, two encoders are used in TER-GAN to encode identity and expression information from two input images, and a synthetic expression image is generated by the decoder part of TER-GAN. To improve the feature disentanglement and extraction process, we also introduce a novel expression consistency loss and an identity consistency loss which exploit extra expression and identity information from generated images. Experimental results show that the proposed method can be used for efficient facial expression transfer, facial expression editing and facial expression recognition. In order to evaluate the proposed technique and to compare our results with state-of-the-art methods, we have used the Oulu-CASIA dataset for our experiments.


1 Introduction

Facial expression synthesis and manipulation is a challenging task because it requires the disentanglement of facial expression features from identity information. It has recently gained a great deal of attention from the computer vision research community due to the exciting research challenges it offers, apart from its many applications, e.g., facial animation, human-computer interaction, entertainment and facial reenactment [32].

Figure 1: TER-GAN takes a source image and a target image as input and extracts expression information from the former and identity information from the latter. The encoded information is then used to generate an expression image that contains the expression of the source image while the identity of the target image is preserved.

Many techniques have been developed for facial expression manipulation and editing. These techniques can be divided into two categories: graphic-based techniques [38], [21], [40] and generative methods [31], [26], [3], [6], [28], [22], [17]. In the first category, image warping is used to synthesize expression images by modeling the variations of a face during facial expressions. In the second category, deep generative models are used to generate synthesized expression images. In [6], an expression controller module is trained using a GAN-based architecture to generate expression images of various intensities. Similarly, in [4], a unified GAN framework is used to transfer expressions from one image domain to another. Ali and Hughes [1] concatenated a one-hot identity vector with the expression information extracted from the encoder to generate an expression image. Pumarola et al. [23] and Shao et al. [28] exploited Facial Action Units (AUs), and the expression synthesis process is guided by the learned AU features. Similarly, in [30] and [25], facial landmarks are used to produce synthesized expression images.

Existing facial expression synthesis techniques have the capability to transform the expression of a given image; however, there are two main problems with these methods: 1. they require auxiliary information in the form of an expression code, facial landmarks or action unit information to synthesize an expression image, and 2. many of these techniques fail to preserve the identity of the given image, because they fail to disentangle expression features from the identity representation. Hence, during the facial expression transfer process, the identity information of the source image is usually leaked through the expression feature vector, which degrades the identity of the generated images [36]. Similarly, in [30] and [25], it is very difficult to synthesize an expression image using the landmark information of a source image whose facial shape differs from the target face. To reduce identity information leakage, in [1] and [6] the expression synthesis process is conditioned on expression and identity codes rather than on extracted features. However, identity and expression representations are too rich to be captured by one-hot vectors.

To overcome these problems, we propose a Transfer-Editing and Recognition Generative Adversarial Network (TER-GAN) that automatically and explicitly extracts a disentangled expression representation from a source image and disentangled identity features from a target image, and uses them to synthesize a photo-realistic expression image that preserves the identity of the target image, without requiring any auxiliary information such as an expression or identity code, facial landmarks or action units. The overall framework of our proposed technique is shown in Figure 1. TER-GAN has two main objectives: 1. to automatically extract disentangled expression features and identity features from a source image and a target image, respectively, and 2. to synthesize a photo-realistic expression image containing the expression of the source image while preserving the identity of the target image.

To achieve these objectives, we employ a Generative Adversarial Network (GAN) with an encoder-decoder based generator. As opposed to previous generative expression synthesis and manipulation architectures [6], [17], [35], instead of using a single encoder, we employ two encoders in our generator. TER-GAN takes two images as input, a source image and a target image. The first encoder encodes an expression representation from the source image, and the second encoder extracts identity features from the target image. The expression representation is then concatenated with the identity features, and the concatenated feature vector is fed to the decoder to synthesize an expression image that contains the expression of the source image while preserving the identity of the target image. In order to further improve the quality of the extracted expression and identity features, we make use of synthetic expression images along with real images and introduce two adversarial losses at the output of each encoder: an adversarial expression consistency loss and an adversarial identity consistency loss. Our experimental results show that these two consistency losses help in extracting effective expression and identity features through which we can generate synthetic images that preserve the identity of the target image. Moreover, to generate more realistic synthesized expression images, we use a multi-class classifier as our discriminator. The main contributions of our paper are as follows:

  • We present a novel unified architecture, Transfer-Editing and Recognition Generative Adversarial Network (TER-GAN), that can be used efficiently for three purposes: facial expression transfer, facial expression editing and facial expression recognition, without requiring any explicit expression or identity code or any other auxiliary information such as facial landmarks or action units to guide the synthesis process, while preserving the identity information.

  • Apart from the encoder-decoder architecture of TER-GAN, our adversarial expression and identity consistency losses also ensure that the expression and identity features are disentangled, and this disentanglement of features helps in synthesizing expression images that preserve the identity information of the target image.

  • In order to deal with small expression datasets, TER-GAN learns to extract expression and identity representations using the information contained in synthesized images as well.

  • We show that the disentangled expression embedding learned by TER-GAN can be effectively used for facial expression recognition.

Figure 2: The overall architecture of our TER-GAN (best viewed in color)

2 Related Work

In this section we first review previous facial expression synthesis and manipulation techniques, and then discuss conventional feature disentanglement techniques.

2.1 Facial Expression Manipulation

There are many facial expression manipulation techniques proposed in the literature, some of which combine computer graphics methods such as 2D or 3D image warping [9], flow mapping [38] and image rendering [37] with computer vision algorithms to generate synthesized expression images. Although these techniques are able to produce photo-realistic, high-resolution synthesized images, the main problem with these approaches is their high computational cost. In contrast to conventional facial expression synthesis techniques, recently proposed methods are mostly based on conditional Generative Adversarial Networks (cGANs).

Some earlier techniques such as [16], [14] and [42] used deterministic target expressions as one-hot vectors and generated synthesized images conditioned on discrete facial expressions. While the generated images are of better quality, these techniques can only generate discrete expressions. To overcome this problem, Ding et al. [6] proposed ExprGAN to synthesize an expression image conditioned on a real-valued vector that contains more complex information such as intensity variation. Similarly, in [30] and [24], the image synthesis process is conditioned on geometry information in the form of facial landmarks. Choi et al. [5] proposed StarGAN to employ domain information and translate an image into a corresponding domain. Royer et al. [27] presented XGAN to translate attributes from one domain to another by using an adversarial classifier on top of the encoded layers, but their architecture is too simple to generate photo-realistic expression images. In another work, Pumarola et al. [23] used AUs as conditional labels to synthesize an expression image. All of these techniques require explicit expression, AU or landmark information to guide the expression synthesis process.

2.2 Feature Disentanglement

Many techniques have been developed in the past to disentangle an image into its different representations. These techniques are based on the idea of learning by reconstruction and therefore often involve an encoder-decoder structure coupled with GANs. For instance, Tran et al. [34] proposed a disentangled representation learning GAN for pose-invariant face recognition, in which the face identity representation is disentangled from the pose information by explicitly providing the pose information to the generator. Similarly, Lee et al. [15] generated photo-realistic images from various domains by disentangling the factors of variation in an input image. Shu et al. [29] proposed an unsupervised generative model to disentangle shape from appearance information. All of these methods learn the disentangled representation by explicitly providing information about the other factors of variation (those not of interest) that constitute an image. In contrast, our method automatically extracts the factors of variation from input images using two complementary feature learning techniques: learning by reconstruction in an encoder-decoder based GAN set-up, and applying an adversarial expression consistency loss and an adversarial identity consistency loss at each encoder.

3 Overview of TER-GAN

The main objective of TER-GAN is twofold: first, to extract descriptive, discriminative and disentangled expression and identity representations from input images, and second, to generate synthesized expression images using the extracted expression and identity information in such a way that the expression and identity information of the input images is preserved. To do this, as opposed to previous facial expression synthesis architectures, where expression or identity information is explicitly provided in the form of expression or identity codes, TER-GAN uses two encoders to automatically extract expression and identity information. It has been reported in the literature that face images lie on a manifold [35]; therefore, we argue that representing a face image with an identity code in the form of a one-hot vector is not enough to capture the fine details of a target face. Similarly, a single expression code is not sufficient to represent the wide range of intensities of a facial expression. In order to learn efficient expression and identity representations, we use, in addition to real images, the synthetic expression images generated by the decoder of our generator. In this manner, an expression-invariant identity embedding and an identity-invariant expression embedding are learned jointly using two additional adversarial losses on top of the representation layer of each encoder, apart from the adversarial loss imposed on the overall generator.

3.1 Network Architecture

The input to TER-GAN is a source image and a target image. These two images are fed to two different encoders: the expression encoder maps the source image to an expression representation, while the identity encoder projects the target image to an identity embedding. The concatenation of the two embeddings bridges the two encoders with a decoder. The objective of the decoder is to synthesize an expression image having the expression of the source image and the identity of the target image. To further improve the quality of the synthesized images, an adversarial loss is imposed on the generator by using a multi-class CNN as our discriminator. In order to generate synthetic images with the desired expressions and identities, our discriminator performs identity and expression classification, apart from classifying between real and fake images. The overall architecture of the proposed TER-GAN is shown in Figure 2. Two additional discriminators are used with the two encoders, respectively, to learn an identity-invariant expression embedding at the FC layer of the expression encoder and an expression-invariant identity embedding at the FC layer of the identity encoder. The adversarial learning scheme of the identity-invariant expression embedding and the expression-invariant identity embedding is shown with a brown-colored dashed line in Figure 2.
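To make the data flow concrete, the following is a minimal PyTorch sketch of the generator under the layer sizes given in Section 4.1 (a 30-dimensional expression embedding and a 50-dimensional identity embedding); the module names, the 128x128 input size, kernel sizes and normalization layers are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def conv_block(c_in, c_out):
    # One downsampling block: strided convolution + batch norm + ReLU (layout assumed).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


class Encoder(nn.Module):
    """Five downsampling blocks followed by an FC embedding layer (Section 4.1)."""

    def __init__(self, emb_dim):
        super().__init__()
        chans = [3, 64, 128, 256, 512, 1024]
        self.blocks = nn.Sequential(*[conv_block(a, b) for a, b in zip(chans, chans[1:])])
        self.fc = nn.Linear(1024 * 4 * 4, emb_dim)  # assumes 128x128 input images

    def forward(self, x):
        return self.fc(self.blocks(x).flatten(1))


class Decoder(nn.Module):
    """Maps the concatenated (expression, identity) embedding back to an image."""

    def __init__(self, emb_dim=30 + 50):
        super().__init__()
        self.fc = nn.Linear(emb_dim, 512 * 8 * 8)
        chans = [512, 256, 128, 64, 3]
        ups = []
        for a, b in zip(chans, chans[1:]):
            ups += [nn.ConvTranspose2d(a, b, 4, stride=2, padding=1), nn.ReLU(inplace=True)]
        ups[-1] = nn.Tanh()  # final activation outputs an image in [-1, 1]
        self.ups = nn.Sequential(*ups)

    def forward(self, z):
        return self.ups(self.fc(z).view(-1, 512, 8, 8))


E_expr, E_id, G_dec = Encoder(30), Encoder(50), Decoder()


def generate(x_source, x_target):
    f_e = E_expr(x_source)                  # identity-invariant expression embedding (30-d)
    f_i = E_id(x_target)                    # expression-invariant identity embedding (50-d)
    return G_dec(torch.cat([f_e, f_i], 1))  # output: source expression, target identity
```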

3.1.1 Discriminator

The discriminator of TER-GAN is a multi-task CNN that aims for three objectives: 1. to classify between real and fake images, 2. to classify facial expressions, and 3. to classify the identities of expression images. To achieve these objectives, the discriminator is divided into two parts: an expression branch and an identity branch. The expression branch has one output per expression class (the six basic expressions in our case) plus an additional dimension used to differentiate between real and fake images, while the identity branch has one output per identity and classifies the identities of expression images. The overall objective function of our discriminator is given by the following equation:

The first part of this objective maximizes the probability of the source image and the target image being classified to their true expression label and true identity label, respectively, while the second part corresponds to the objective of the discriminator to maximize the probability of classifying the generated image as fake.
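A minimal PyTorch sketch of this multi-task discriminator and its loss is given below. The shared trunk and head widths follow Section 4.1 (four convolutional blocks with 16, 32, 64 and 128 channels, a shared 1024-d FC layer, and two branches with 512/256-d FC layers); the 128x128 input size, default class counts and loss form are assumptions based on the textual description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskDiscriminator(nn.Module):
    """Shared convolutional trunk with two heads: expression + real/fake, and identity."""

    def __init__(self, n_expr=6, n_id=80):  # class counts are placeholders
        super().__init__()
        chans = [3, 16, 32, 64, 128]
        trunk = []
        for a, b in zip(chans, chans[1:]):
            trunk += [nn.Conv2d(a, b, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True)]
        self.trunk = nn.Sequential(*trunk, nn.Flatten(), nn.Linear(128 * 8 * 8, 1024))
        self.expr_head = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(True),
                                       nn.Linear(512, 256), nn.ReLU(True),
                                       nn.Linear(256, n_expr + 1))  # extra "fake" class
        self.id_head = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(True),
                                     nn.Linear(512, 256), nn.ReLU(True),
                                     nn.Linear(256, n_id))

    def forward(self, x):
        h = self.trunk(x)
        return self.expr_head(h), self.id_head(h)


def d_loss(D, x_src, y_expr, x_tgt, y_id, x_fake):
    """Discriminator objective as described in the text: classify the source image to its
    true expression, the target image to its true identity, and the generated image to
    the additional 'fake' class (cross-entropy form assumed)."""
    expr_logits_src, _ = D(x_src)
    _, id_logits_tgt = D(x_tgt)
    expr_logits_fake, _ = D(x_fake.detach())  # detach: this loss only updates D
    fake_label = torch.full((x_fake.size(0),), expr_logits_fake.size(1) - 1,
                            dtype=torch.long, device=x_fake.device)
    return (F.cross_entropy(expr_logits_src, y_expr)
            + F.cross_entropy(id_logits_tgt, y_id)
            + F.cross_entropy(expr_logits_fake, fake_label))
```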

3.1.2 Generator

In previous facial expression synthesis and manipulation methods [6], [33], [2], one encoder is used to extract feature information from an input image, while a conditional code is explicitly fed to the network to guide the facial expression synthesis process. In TER-GAN, however, the main objective of the generator is to efficiently extract expression and identity representations from the source image and the target image, respectively, and to generate an image that fools the discriminator into classifying it to the expression of the source image and the identity of the target image. Therefore, the generator in TER-GAN consists of two encoders and a decoder. The objective function of the generator is given by the following equation:

(2)

The adversarial loss is given below:

(3)
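As an illustration of this adversarial term, here is a minimal sketch; the discriminator `D` is assumed to return (expression logits, identity logits) as in the discriminator sketch above, and the cross-entropy form against the source expression and target identity labels follows the textual description rather than the authors' exact formulation.

```python
import torch.nn.functional as F


def g_adv_loss(D, x_fake, y_expr_src, y_id_tgt):
    """Generator-side adversarial term: the generated image should be classified by the
    discriminator to the source expression label and the target identity label."""
    expr_logits, id_logits = D(x_fake)  # no detach: gradients flow back into the generator
    return F.cross_entropy(expr_logits, y_expr_src) + F.cross_entropy(id_logits, y_id_tgt)
```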

In order to improve the capability of both of our encoders to extract identity-invariant expression features and expression-invariant identity features, we introduce two additional adversarial losses on top of the representation layer of each encoder: an adversarial expression consistency loss at the expression encoder and an adversarial identity consistency loss at the identity encoder.

Expression encoder: The main objective of the expression encoder is to extract an expression representation from the input source image. To achieve this goal, apart from employing learning by reconstruction, we propose an adversarial expression consistency loss at the expression encoder, which does not require any paired data and helps in learning an identity-invariant expression representation in a self-supervised manner. Specifically, since the input source image and the generated image share the same expression information but have different identities, we leverage these two images to learn an identity-invariant expression embedding. To do this, a discriminator is trained on top of the expression embedding to classify whether the encoded features are extracted from the source image or the generated image. To learn an identity-invariant expression embedding, this discriminator strives to maximize its classification accuracy, while the expression encoder is trained to confuse it by minimizing its accuracy. The optimization function is given by the equation below:

(4)

where the classification term is a cross-entropy loss.
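A minimal sketch of this consistency loss follows, assuming a small MLP discriminator on the 30-d expression embedding (three FC layers with 32, 16 and 1 channels, per Section 4.1) and a binary cross-entropy instance of the classification loss; the two-term formulation shown here is equivalent to the gradient-reversal implementation described later. The identity consistency loss of the next paragraph mirrors this construction on the 50-d identity embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Small binary discriminator on top of the expression embedding (sizes from Section 4.1).
D_e = nn.Sequential(nn.Linear(30, 32), nn.ReLU(True),
                    nn.Linear(32, 16), nn.ReLU(True),
                    nn.Linear(16, 1))


def expr_consistency_losses(E_expr, D_e, x_source, x_generated):
    """D_e tries to tell whether an embedding comes from the real source image or the
    generated image; E_expr is trained to make the two indistinguishable, which removes
    identity cues (the only factor differing between the two images)."""
    f_real = E_expr(x_source)
    f_fake = E_expr(x_generated.detach())   # detach the image so the decoder is untouched here
    real_logit = D_e(f_real.detach())       # detached embeddings: this term only updates D_e
    fake_logit = D_e(f_fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
              + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))
    # Encoder term: fool D_e (in practice realized with a gradient reversal layer).
    enc_logit = D_e(E_expr(x_generated.detach()))
    e_loss = F.binary_cross_entropy_with_logits(enc_logit, torch.ones_like(enc_logit))
    return d_loss, e_loss
```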

Identity encoder: The target image is fed to the identity encoder, which extracts an identity representation for image synthesis. The target image can have any expression, or it can be a neutral image, since it is only used to obtain identity information. Therefore, to extract an expression-invariant identity representation, we employ an adversarial identity consistency loss on top of the identity representation layer of the identity encoder. The synthesized image, which has the same identity as the target image, is fed to the identity encoder along with the input target image to learn the expression-invariant identity embedding. This goal is achieved by using a discriminator on top of the identity embedding, which is trained to recognize whether the encoded identity representation comes from the target image or the synthesized image. This discriminator strives to maximize its classification accuracy, while the identity encoder aims to confuse it by minimizing its accuracy. The optimization function is given by the equation below:

(5)

where the classification term is a cross-entropy loss.

Decoder: The input to the decoder is the concatenation of the expression and identity embeddings, from which the decoder generates a synthesized image having the expression encoded in the expression embedding and the identity represented by the identity embedding. For this purpose, two pixel-wise reconstruction losses are used: 1. an identity reconstruction loss between the input target image and the output image, and 2. an expression reconstruction loss between the input source image and the output image.

(6)
(7)
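A minimal sketch of these two terms, assuming an L1 pixel distance (the text only states that the losses are pixel-wise):

```python
import torch.nn.functional as F


def reconstruction_losses(x_out, x_source, x_target):
    """Pixel-wise reconstruction terms as described above (L1 distance assumed)."""
    loss_id_rec = F.l1_loss(x_out, x_target)    # identity reconstruction vs. target image
    loss_expr_rec = F.l1_loss(x_out, x_source)  # expression reconstruction vs. source image
    return loss_id_rec, loss_expr_rec
```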

However, it is known that pixel-level metrics are not optimal for image comparison, especially at the semantic level [27], [6]. Therefore, to further preserve the expression information between the source image and the synthesized image, a pre-trained version of the expression encoder is used to enforce expression similarity in feature space:

(8)

where the first term denotes the feature maps extracted at a given layer of the pre-trained expression encoder and the second the corresponding layer weight. The activations at all five convolutional layers of the network are used.

Similarly, to further preserve the identity information between the target image and the synthesized image, the pre-trained version of the identity encoder is used to extract identity features from various intermediate layers. The feature-level identity-preserving loss is given by the following equation:

(9)
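Both feature-level terms can be sketched with one helper, assuming a per-layer L1 distance between intermediate activations of the frozen pre-trained encoder, weighted per layer (the distance and the weighting scheme are assumptions; only the layer-weighted structure comes from the text).

```python
import torch.nn.functional as F


def feature_preserving_loss(encoder_blocks, x_generated, x_reference, layer_weights):
    """Weighted sum of per-layer distances between intermediate feature maps of a frozen,
    pre-trained encoder. `encoder_blocks` is an iterable over the encoder's five
    convolutional blocks (e.g. the `blocks` attribute of the Encoder sketch above)."""
    loss = 0.0
    h_gen, h_ref = x_generated, x_reference
    for block, w in zip(encoder_blocks, layer_weights):
        h_gen, h_ref = block(h_gen), block(h_ref)
        loss = loss + w * F.l1_loss(h_gen, h_ref.detach())
    return loss


# Usage (names illustrative): expression similarity against the source image,
# identity similarity against the target image.
# loss_expr_feat = feature_preserving_loss(E_expr_pretrained.blocks, x_out, x_source, w_e)
# loss_id_feat   = feature_preserving_loss(E_id_pretrained.blocks,   x_out, x_target, w_i)
```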

3.1.3 Total Loss

The overall objective function of TER-GAN is the weighted sum of all the losses discussed in the previous sections:

(10)
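Schematically, the total objective is a weighted combination of the terms above; the weight values are set empirically (Section 4.1) and are not reproduced here.

```python
def total_generator_loss(losses, weights):
    """Weighted sum of the generator-side terms named in the text: adversarial loss,
    expression/identity consistency losses, the two pixel-wise reconstruction losses and
    the two feature-level preserving losses. Dictionary keys are illustrative."""
    return sum(weights[name] * value for name, value in losses.items())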

3.1.4 Training the Network

Since the quality of the learned identity-invariant expression embedding and expression-invariant identity embedding also depends on the quality of the generated images, TER-GAN, having multiple loss functions, is trained with a curriculum strategy [6], [36] consisting of two training stages. In the first stage, the expression encoder is pre-trained in a supervised manner to classify facial expressions, and the identity encoder is pre-trained in a fully supervised way to recognize identities. The classification layers of these two networks are then discarded, and the remaining layers are attached to the overall TER-GAN architecture. In the second stage, the entire TER-GAN architecture is trained with two parameter-update steps per iteration. In the first step, the expression and identity features are extracted from the source image and the target image by the two encoders, concatenated, and fed to the decoder to synthesize an expression image. In the second step, the generated output image is fed back into the two encoders along with their corresponding input images to further improve the quality of the embeddings and to learn, in an adversarial manner, an identity-invariant expression representation and an expression-invariant identity embedding.
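A high-level sketch of this curriculum, assuming the encoders are pre-trained as plain classifiers with a temporary linear head (the head is discarded afterwards); optimizer settings and the stage-2 schedule are only schematic.

```python
import torch
import torch.nn as nn


def pretrain_classifier(encoder, n_classes, loader, epochs=10, lr=2e-4):
    """Stage 1: supervised pre-training of an encoder (expression labels for the
    expression encoder, identity labels for the identity encoder)."""
    head = nn.Linear(encoder.fc.out_features, n_classes)  # assumes the Encoder sketch above
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = ce(head(encoder(x)), y)
            loss.backward()
            opt.step()
    return encoder  # the classification head is discarded; only the encoder is kept


# Stage 2 (per iteration, schematically):
#   1) encode f_e = E_expr(x_src), f_i = E_id(x_tgt); decode x_out = G_dec(cat(f_e, f_i));
#      update the generator with the adversarial, reconstruction and feature-preserving losses.
#   2) feed x_out back into both encoders together with x_src / x_tgt and update the
#      embedding discriminators and encoders with the two consistency losses.
```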

Figure 3: TER-GAN encodes expression information from the source image and identity information from the target image, and generates an output expression image having the expression of the source image while preserving the identity of the target image.
Figure 4: Visual comparison of facial expression editing techniques. For each input (target) image, the corresponding synthetic expression images are generated. We compare our results with CAAE and ExprGAN.

The objective of training TER-GAN is to minimize equation 10. Specifically, the adversarial consistency losses and the generator adversarial loss lead to a min-max optimization problem, which we handle by incorporating a gradient reversal layer [8], [27] into the TER-GAN architecture. A gradient reversal layer is placed between the expression embedding and its embedding discriminator to implement the adversarial training scheme for the expression consistency loss. The gradient reversal layer does not affect the forward pass during training, but it inverts the sign of the gradient during back-propagation, which practically implements the min-max training scheme. A gradient reversal layer is likewise used between the identity embedding and its embedding discriminator to implement the adversarial training scheme for the identity consistency loss.
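The gradient reversal layer [8] has a standard PyTorch implementation as a custom autograd function; the version below is a generic sketch, not the authors' code.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward
    pass, so the encoder below it is updated to maximize the embedding discriminator's
    loss (the min-max scheme described above)."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)


# Placed between an encoder's embedding and its embedding discriminator, e.g.:
# logits = D_e(grad_reverse(E_expr(x), lam))
```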

4 Experiments

In this section we first describe the implementation details, then present the experimental results of TER-GAN, and finally demonstrate its applications: facial expression transfer, facial expression editing and facial expression recognition.

4.1 Implementation Details

The proposed technique is evaluated on the widely used Oulu-CASIA [44] dataset. First, the Convolutional Experts Constrained Local Model (CE-CLM) is used to detect facial landmarks for face detection and alignment. To avoid overfitting on this small dataset, data augmentation is performed to increase the number of training images. From each image, five samples are cropped at the center and the four corner locations. Image rotation is then applied to each of those cropped samples at four angles, followed by horizontal flipping of each rotated image; in this way the size of the dataset is increased to five times the original dataset size.
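A torchvision-style sketch of this augmentation pipeline is given below; the crop size and the specific rotation angles are placeholders, since their values are not reproduced in the text above.

```python
import torchvision.transforms.functional as TF
from torchvision import transforms


def augment(image, crop_size=128, angles=(-15, -10, 10, 15)):
    """Center + four-corner crops, then rotation at four angles, then horizontal flipping.
    `crop_size` and `angles` are placeholder values."""
    crops = transforms.FiveCrop(crop_size)(image)  # four corners + center
    augmented = []
    for crop in crops:
        for angle in angles:
            rotated = TF.rotate(crop, angle)
            augmented.extend([rotated, TF.hflip(rotated)])
    return augmented
```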

Method Setting Accuracy (%)
LBP-TOP [45] Dynamic 68.13
HOG 3D [13] Dynamic 70.63
STM-Explet [18] Dynamic 74.59
Atlases [10] Dynamic 75.52
DTAGN [11] Dynamic 81.46
FN2EN [7] Static 87.71
PPDN [46] Static 84.59
DeRL [39] Static 88.0
CNN (baseline) Static 73.14
TER-GAN (ours) Static 89.65
Table 1: Oulu-CASIA: accuracy (%) for six-expression classification.

TER-GAN is initially pre-trained on the BU-4DFE [41] dataset, which consists of 60,600 images from 101 identities. Six image sequences are captured for each identity, corresponding to the six basic expressions. Each sequence starts from a neutral expression, reaches the peak expression in the middle, and ends at a neutral expression. The middle, peak-expression images are extracted to construct the dataset.

Figure 5: Expression feature space. Each color represents a different expression. Fig. 5a shows the expression distribution of features obtained from the pre-trained expression encoder (CNN baseline). Fig. 5b depicts the expression distribution of features obtained from the expression encoder after training in the TER-GAN set-up. (Best viewed in color)

The pre-trained TER-GAN is then fine-tuned on the Oulu-CASIA (OC) dataset. The OC dataset contains 480 video sequences captured under three different illumination conditions using two different cameras. In this experiment, only images captured under the strong illumination condition with the VIS camera are used. There are 80 identities in the OC dataset, and each identity has six video sequences corresponding to the six basic expressions. Each video sequence starts with a neutral image and ends at the peak expression image. In this experiment, the last three frames of each sequence are selected to construct the dataset. An identity-independent training-testing split is used to evaluate the proposed method.

The architecture of both encoders is based on five downsampling blocks, each consisting of a stride-1 convolution. The channel dimensions are 64, 128, 256, 512 and 1024, followed by a 30-dimensional FC layer for the expression feature vector in the expression encoder and a 50-dimensional FC layer for the identity representation in the identity encoder. The decoder is built on five upsampling blocks containing a stride-1 convolution, with channel dimensions 512, 256, 128, 64 and 3. As opposed to previous GAN architectures with a multi-task CNN based discriminator [33], [1], in TER-GAN the initial downsampling convolutional layers and an FC layer of the discriminator are shared between the expression and identity branches in order to reduce the computational cost. More specifically, four CNN blocks with 16, 32, 64 and 128 channels and a 1024-dimensional FC layer are shared between the two parts. The discriminator is then divided into two branches, each with two additional FC layers of 512 and 256 channels, followed by an expression classification layer in the expression branch and an identity classification layer in the identity branch. The architectures of the two embedding discriminators are identical, consisting of three FC layers with 32, 16 and 1 channels.

TER-GAN is trained using the Adam optimizer [12], with a batch size of 64 and a learning rate of 0.0002. The values of the hyper-parameters and the weights of the individual terms in the total loss are set empirically.

4.2 Facial Expression Transfer

In this section, we demonstrate our model’s ability to transfer facial expressions from a source image to a target image. As opposed to previous methods [6], where a separate expression classifier is used to extract an expression label in order to transfer a facial expression from one image to another, TER-GAN performs the facial expression transfer task in an end-to-end manner. To do this, the source image is fed to the expression encoder and the target image to the identity encoder, extracting expression information from the former and identity features from the latter. The expression information is then concatenated with the identity features, and the concatenated feature vector is fed to the decoder to synthesize an expression image having the expression of the source image and the identity of the target image. Figure 3 shows that TER-GAN transfers the facial expressions of the source images quite accurately, while also preserving the identities specified by the target images.

4.3 Facial Expression Editing

In this section we demonstrate the capability of the proposed TER-GAN to edit the expression of a given image. Unlike previous facial expression editing methods such as [6], where an expression code is explicitly fed to the network, TER-GAN extracts the expression information from another expression image (the source image), which encodes richer expression information than a plain expression label. In this experiment, the identity feature is extracted from the given image (the target image) by feeding it to the identity encoder, while the expression information is automatically extracted from another image of the same identity but with a different expression (the source image). The two embeddings are concatenated and fed to the decoder to generate a synthetic image that carries the expression taken from the source image while preserving the identity information. The experimental results are shown in Figure 4, where the first column corresponds to the input (target) image, and the ground-truth images in the top row of the right column represent the source images used to extract the corresponding expression information. Although our main objective differs from that of previous methods [6], [43], and the TER-GAN architecture has more functionality and is more complex than the models proposed in [6] and [43], we compare our results with [6] and [43] for the sake of comparison. As can be seen in Figure 4, TER-GAN not only synthesizes an image of the desired expression but also preserves the identity information better than [6] and [43]. To reduce the computational cost and model complexity, we have not used the variation regularization [20] employed in [6] to reduce spike artifacts in the reconstructed images; we believe that incorporating it into TER-GAN would further enhance the visual quality of our synthesized images.

4.4 Facial Expression Recognition

In this section we demonstrate the ability of TER-GAN to disentangle the expression information of an expression image from its identity information. One of the major issues with conventional FER techniques is that the representation used for facial expression recognition contains identity information as well as expression information; as a result, FER performance degrades on identities unseen during training. Therefore, to obtain identity-free expression information, the expression encoder of TER-GAN is detached from the rest of the architecture after training and used to perform facial expression recognition. Specifically, an expression image is fed to the detached encoder, the output expression representation is extracted, and this feature vector is fed to a shallow classifier for facial expression recognition.
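A minimal sketch of this recognition set-up, assuming the trained expression encoder is frozen and a small (illustrative) MLP head serves as the shallow classifier:

```python
import torch
import torch.nn as nn


class ShallowExpressionClassifier(nn.Module):
    """Frozen expression encoder followed by a small classifier (layout assumed)."""

    def __init__(self, encoder, n_expr=6, emb_dim=30):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad = False          # the encoder is not fine-tuned
        self.head = nn.Sequential(nn.Linear(emb_dim, 16), nn.ReLU(True),
                                  nn.Linear(16, n_expr))

    def forward(self, x):
        with torch.no_grad():
            f_e = self.encoder(x)            # identity-free expression embedding
        return self.head(f_e)
```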

To evaluate the performance of the proposed disentangled facial expression recognition technique, we conducted an eight-fold cross-validation on the Oulu-CASIA dataset. Table 1 shows the average accuracy obtained using the proposed technique. The reported results show that TER-GAN outperforms state-of-the-art techniques, including GAN-based methods such as DeRL [39] and deep CNN-based techniques such as DTAGN [11], FN2EN [7] and PPDN [46]. Although we use only static images to extract expression information for FER, our method outperforms techniques such as DTAGN [11], Atlases [10], STM-Explet [18], HOG 3D [13] and LBP-TOP [45] that exploit the temporal information of video sequences.

4.5 Expression Feature Visualization

In this part, we demonstrate that the expression representation learned by the proposed TER-GAN is disentangled from identity information. To do this, we first extract the expression feature vector from the expression encoder and then employ t-SNE [19] to project the 30-dimensional feature vector onto a two-dimensional space for visualization. The 2D expression feature space is shown in Figure 5. For comparison with our CNN baseline, we conduct two experiments. In the first, we extract 30-dimensional expression features from the CNN baseline network (the pre-trained expression encoder) and visualize them in a 2D feature space using t-SNE. The result is shown in Figure 5(a): the expression features are entangled with each other in the expression feature space, which indicates that the CNN baseline fails to disentangle expression information from identity features. In the second experiment, we use the expression encoder trained end-to-end in the TER-GAN set-up to extract the expression representation and project it to the 2D feature space using t-SNE. The result is shown in Figure 5(b): the expression features are organized into six clusters corresponding to the six basic expressions, which indicates that the expression information is effectively disentangled from identity information.
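The visualization step is a standard use of scikit-learn's t-SNE; a generic sketch (perplexity and plotting details are arbitrary choices, not the paper's):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt


def plot_expression_space(features, labels, title):
    """Project 30-d expression embeddings to 2D with t-SNE [19] and color by expression.
    `features` is an (N, 30) array, `labels` an (N,) array of expression indices."""
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=8)
    plt.title(title)
    plt.show()
```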

5 Conclusion

In this paper we have proposed a unified expression transfer, editing and recognition architecture, TER-GAN, which has two objectives: 1) to extract efficient and disentangled expression and identity features from input images, and 2) to employ the extracted expression and identity representations for realistic-looking expression synthesis that preserves the identity information of the target (given) image. This goal is achieved by encoding the expression information of a source image and the identity information of a target image with two dedicated encoders; the two feature vectors are then combined to generate an expression image through the decoder part of TER-GAN. In order to further improve the expression and identity feature extraction process, we have introduced novel expression and identity consistency losses. Experimental results show that the proposed method can be used for efficient facial expression transfer and facial expression editing, and that the disentangled feature representation can be used for facial expression recognition.

References

  • [1] K. Ali and C. E. Hughes (2019) Facial expression recognition using disentangled adversarial learning. arXiv preprint arXiv:1909.13135. Cited by: §1, §1, §4.1.
  • [2] K. Ali, I. Isler, and C. Hughes (2019) Facial expression recognition using human to animated-character expression translation. arXiv preprint arXiv:1910.05595. Cited by: §3.1.2.
  • [3] B. Cheung, J. A. Livezey, A. K. Bansal, and B. A. Olshausen (2014) Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583. Cited by: §1.
  • [4] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797. Cited by: §1.
  • [5] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8789–8797. Cited by: §2.1.
  • [6] H. Ding, K. Sricharan, and R. Chellappa (2018) ExprGAN: facial expression editing with controllable expression intensity. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1, §1, §1, §2.1, §3.1.2, §3.1.2, §3.1.4, §4.2, §4.3.
  • [7] H. Ding, S. K. Zhou, and R. Chellappa (2017) Facenet2expnet: regularizing a deep face recognition net for expression recognition. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 118–126. Cited by: §4.4, Table 1.
  • [8] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §3.1.4.
  • [9] P. Garrido, L. Valgaerts, O. Rehmsen, T. Thormahlen, P. Perez, and C. Theobalt (2014) Automatic face reenactment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4217–4224. Cited by: §2.1.
  • [10] Y. Guo, G. Zhao, and M. Pietikäinen (2012) Dynamic facial expression recognition using longitudinal facial expression atlases. In European Conference on Computer Vision, pp. 631–644. Cited by: §4.4, Table 1.
  • [11] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim (2015) Joint fine-tuning in deep neural networks for facial expression recognition. In Proceedings of the IEEE international conference on computer vision, pp. 2983–2991. Cited by: §4.4, Table 1.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [13] A. Klaser, M. Marszałek, and C. Schmid (2008) A spatio-temporal descriptor based on 3d-gradients. Cited by: §4.4, Table 1.
  • [14] Y. Lai and S. Lai (2018) Emotion-preserving representation learning via generative adversarial network for multi-view facial expression recognition. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 263–270. Cited by: §2.1.
  • [15] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51. Cited by: §2.2.
  • [16] M. Li, W. Zuo, and D. Zhang (2016) Deep identity-aware transfer of facial attributes. arXiv preprint arXiv:1610.05586. Cited by: §2.1.
  • [17] A. Lindt, P. Barros, H. Siqueira, and S. Wermter (2019) Facial expression editing with continuous emotion labels. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pp. 1–8. Cited by: §1, §1.
  • [18] M. Liu, S. Shan, R. Wang, and X. Chen (2014) Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1749–1756. Cited by: §4.4, Table 1.
  • [19] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: §4.5.
  • [20] A. Mahendran and A. Vedaldi (2015) Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5188–5196. Cited by: §4.3.
  • [21] U. Mohammed, S. J. Prince, and J. Kautz (2009) Visio-lization: generating novel facial images. ACM Transactions on Graphics (TOG) 28 (3), pp. 57. Cited by: §1.
  • [22] N. Otberdout, M. Daoudi, A. Kacem, L. Ballihi, and S. Berretti (2019) Dynamic facial expression generation on hilbert hypersphere with conditional wasserstein generative adversarial nets. arXiv preprint arXiv:1907.10087. Cited by: §1.
  • [23] A. Pumarola, A. Agudo, A. M. Martinez, A. Sanfeliu, and F. Moreno-Noguer (2018) Ganimation: anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 818–833. Cited by: §1.
  • [24] F. Qiao, N. Yao, Z. Jiao, Z. Li, H. Chen, and H. Wang (2018) Geometry-contrastive generative adversarial network for facial expression synthesis. CoRR abs/1802.01822. Cited by: §2.1.
  • [25] F. Qiao, N. Yao, Z. Jiao, Z. Li, H. Chen, and H. Wang (2018) Emotional facial expression transfer from a single image via generative adversarial nets. Computer Animation and Virtual Worlds 29 (3-4), pp. e1819. Cited by: §1, §1.
  • [26] S. Reed, K. Sohn, Y. Zhang, and H. Lee (2014) Learning to disentangle factors of variation with manifold interaction. In International Conference on Machine Learning, pp. 1431–1439. Cited by: §1.
  • [27] A. Royer, K. Bousmalis, S. Gouws, F. Bertsch, I. Mosseri, F. Cole, and K. Murphy (2017) Xgan: unsupervised image-to-image translation for many-to-many mappings. arXiv preprint arXiv:1711.05139. Cited by: §3.1.2, §3.1.4.
  • [28] Z. Shao, H. Zhu, J. Tang, X. Lu, and L. Ma (2019) Explicit facial expression transfer via fine-grained semantic representations. arXiv preprint arXiv:1909.02967. Cited by: §1.
  • [29] Z. Shu, M. Sahasrabudhe, R. Alp Guler, D. Samaras, N. Paragios, and I. Kokkinos (2018) Deforming autoencoders: unsupervised disentangling of shape and appearance. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 650–665. Cited by: §2.2.
  • [30] L. Song, Z. Lu, R. He, Z. Sun, and T. Tan (2018) Geometry guided adversarial facial expression synthesis. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 627–635. Cited by: §1, §1, §2.1.
  • [31] J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson (2008) Generating facial expressions with deep belief nets. In Affective computing, Cited by: §1.
  • [32] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner (2016) Face2face: real-time face capture and reenactment of rgb videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395. Cited by: §1.
  • [33] L. Tran, X. Yin, and X. Liu (2017) Disentangled representation learning gan for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1415–1424. Cited by: §3.1.2, §4.1.
  • [34] L. Tran, X. Yin, and X. Liu (2017) Disentangled representation learning gan for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1415–1424. Cited by: §2.2.
  • [35] X. Wang, W. Li, G. Mu, D. Huang, and Y. Wang (2018) Facial expression synthesis by u-net conditional generative adversarial networks. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, pp. 283–290. Cited by: §1, §3.
  • [36] O. Wiles, A. Sophia Koepke, and A. Zisserman (2018) X2face: a network for controlling face generation using images, audio, and pose codes. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 670–686. Cited by: §1, §3.1.4.
  • [37] F. Yang, L. Bourdev, E. Shechtman, J. Wang, and D. Metaxas (2012) Facial expression editing in video using a temporally-smooth factorization. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 861–868. Cited by: §2.1.
  • [38] F. Yang, J. Wang, E. Shechtman, L. Bourdev, and D. Metaxas (2011) Expression flow for 3d-aware face component transfer. ACM transactions on graphics (TOG) 30 (4), pp. 60. Cited by: §1, §2.1.
  • [39] H. Yang, U. Ciftci, and L. Yin (2018) Facial expression recognition by de-expression residue learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2168–2177. Cited by: §4.4, Table 1.
  • [40] R. Yeh, Z. Liu, D. B. Goldman, and A. Agarwala (2016) Semantic facial expression editing using autoencoded flow. arXiv preprint arXiv:1611.09961. Cited by: §1.
  • [41] L. Yin, X. Chen, Y. Sun, T. Worm, and M. Reale (2008) A high-resolution 3D dynamic facial expression database. In IEEE International Conference on Automatic Face and Gesture Recognition, Amsterdam, The Netherlands, Vol. 126. Cited by: §4.1.
  • [42] F. Zhang, T. Zhang, Q. Mao, and C. Xu (2018) Joint pose and expression modeling for facial expression recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3359–3368. Cited by: §2.1.
  • [43] Z. Zhang, Y. Song, and H. Qi (2017) Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5810–5818. Cited by: §4.3.
  • [44] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. PietikäInen (2011) Facial expression recognition from near-infrared videos. Image and Vision Computing 29 (9), pp. 607–619. Cited by: §4.1.
  • [45] G. Zhao and M. Pietikainen (2007) Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Transactions on Pattern Analysis & Machine Intelligence (6), pp. 915–928. Cited by: §4.4, Table 1.
  • [46] X. Zhao, X. Liang, L. Liu, T. Li, Y. Han, N. Vasconcelos, and S. Yan (2016) Peak-piloted deep network for facial expression recognition. In European conference on computer vision, pp. 425–442. Cited by: §4.4, Table 1.