
Dynamic Facial Expression Generation on Hilbert Hypersphere with Conditional Wasserstein Generative Adversarial Nets

by   Naima Otberdout, et al.

In this work, we propose a novel approach for generating videos of the six basic facial expressions given a neutral face image. We propose to exploit the face geometry by modeling the motion of facial landmarks as curves encoded as points on a hypersphere. By proposing a conditional version of the manifold-valued Wasserstein generative adversarial network (GAN) for motion generation on the hypersphere, we learn the distribution of facial expression dynamics of different classes, from which we synthesize new facial expression motions. The resulting motions can be transformed to sequences of landmarks and then to image sequences by editing the texture information using another conditional Generative Adversarial Network. To the best of our knowledge, this is the first work that explores manifold-valued representations with GAN to address the problem of dynamic facial expression generation. We evaluate our proposed approach both quantitatively and qualitatively on two public datasets: Oulu-CASIA and MUG Facial Expression. Our experimental results demonstrate the effectiveness of our approach in generating realistic videos with continuous motion, realistic appearance and identity preservation. We also show the efficiency of our framework for dynamic facial expression generation, dynamic facial expression transfer and data augmentation for training improved emotion recognition models.



1 Introduction

Fig. 1: Given a neutral image, our proposed model is able to generate sequences of facial landmarks for different facial expressions and transform them to videos.
Fig. 2: Overview of the proposed approach. The general architecture consists of two GAN models; MotionGAN synthesizes facial expression motion corresponding to the desired expression from noise. The resulting motion encoded by the Square-Root Velocity Function is then applied to neutral face landmark configurations to generate a sequence of facial landmarks. Finally, TextureGAN transforms the sequence of landmarks to a sequence of frames corresponding to the input identity.

For several decades, automated facial expression analysis has been widely studied in computer vision research [30, 8]. Indeed, facial expressions play a vital role in social interaction and have many potential applications, ranging from human-computer interaction to medical and psychological investigations. For a long time, most works tackled the problem of facial expression recognition, while the more challenging problem of facial expression generation was less addressed in the state-of-the-art. Recently, with the success of Generative Adversarial Networks [16] for image generation, considerable progress has been made on this task. These networks, which learn to produce samples similar to a given data distribution through a two-player game, have proven powerful in many image generation tasks, including static facial expression synthesis [11]. However, the majority of the proposed solutions have addressed static facial expression generation, while dynamic facial expression synthesis remains less addressed. Given that facial expressions are better described by a dynamic process than by a static one, new solutions are needed to synthesize realistic facial expression videos. This is a more challenging problem due to the difficulty of generating the dynamic evolution of facial expressions. Indeed, generating a facial expression video from a single face photo is a one-to-many problem, where the output has many more unknowns than the input, which does not include any temporal information. Besides, in this problem it is not enough to generate realistic images: the generated frames must also change smoothly over time, without sudden changes. The question addressed in this paper is: given a single neutral face, can we generate diverse facial expression videos conditioned on a facial expression label, as shown in Figure 1?
Answering this question involves three tasks to be learned by the generative model: (1) the dynamic evolution of the facial expressions, to synthesize video frames with continuous and smooth changes, which requires a robust and efficient motion representation; (2) the video appearance, to synthesize realistic images; and (3) how to change the facial expression while preserving the identity and its characteristics (e.g., eyeglasses, beard, etc.).

In this paper, we address these problems by proposing a new approach that utilizes a trajectory-based representation of landmark sequences on the hypersphere manifold to learn and generate video motions. The pipeline of the proposed framework is shown in Figure 2. First, a landmark sequence is encoded compactly as a point on the hypersphere manifold. Next, a new conditional version of Wasserstein GAN for manifold-valued data, which we call MotionGAN, is used to learn the distribution of the resulting representations. After training MotionGAN, we generate new facial expression motions corresponding to one of the six basic facial expressions. These motions can then be applied to any neutral facial landmarks configuration to generate a sequence of facial landmarks. Finally, this sequence is fed to a Pix2Pix network [18], which generates realistic textured facial expression videos.

In summary, the novel aspects and contributions of this paper are as follows:

  1. Manifold-valued representations for GANs: We represent a sequence of facial expression landmarks as curves that can be mapped to points on the hypersphere manifold. To the best of our knowledge, this is the first work that explores manifold-valued representations with GAN to address the problem of dynamic facial expression generation;

  2. A conditional manifold-valued Wasserstein GAN: We generalize manifold-valued Wasserstein GAN to supervised generation by proposing a conditional version of this GAN;

  3. Dynamic synthesis of the six basic facial expressions from a single face: By combining two disparate ideas, GANs and Riemannian geometry tools, we propose a facial expression generation approach that achieves good results for dynamic facial expression editing, facial expression transfer and data augmentation.

We evaluate our proposed approach both quantitatively and qualitatively on two public databases: the Oulu-CASIA and MUG Facial Expression datasets. Our experimental results demonstrate the effectiveness of our approach in generating realistic videos with continuous motion, realistic appearance and identity preservation. Besides, we show how our approach can be used for dynamic facial expression transfer. We also demonstrate the usefulness of the generated data by using them for data augmentation to train a recurrent neural network classifier. The rest of the paper is organized as follows: In Section 2, we review some recent works that have tackled the same or related problems. In Section 3, we present and formulate our problem. Then, MotionGAN and TextureGAN details are provided in Sections 4 and 5, respectively. In Section 6, we present an extensive experimental evaluation of the proposed approach as well as its application to facial expression transfer and data augmentation. Lastly, conclusions and discussion are reported in Section 7.

2 Related work

Significant progress has recently been made in image generation using deep generative models, especially with Generative Adversarial Networks (GANs) [16]. Facial expression editing is one of the numerous fields that have benefited from this progress. Indeed, many solutions have been proposed recently to synthesize facial expression images or to transfer facial expressions from one identity to another using GANs. In this section, we review some of these solutions, which we divide into two groups. First, we review some recent advances in static facial expression synthesis with GANs. Then, we discuss relevant works that tackled the problem of dynamic facial expression generation. Before discussing these works, we introduce GANs and their different architectures, including their generalized version for manifold-valued data.

Generative Adversarial Networks: Recently, GANs have proven extremely effective for synthesizing realistic images. GANs have been used for several applications including image synthesis [33], image-to-image translation [18, 54, 7], image super-resolution [24, 6], facial expression transfer [12, 36, 52], face aging [48, 45], face pose manipulation [46], etc. The general idea of GANs consists of training two neural networks, a generator and a discriminator, in a minimax two-player game. The training process aims to learn the natural image distribution by forcing the generator to output samples indistinguishable from natural images. After the great success of GANs, several variants of the architecture have been proposed, including the conditional GAN (CGAN) [27], which uses a condition label to guide the generation process. Based on CGAN, Isola et al. [18] proposed the pix2pix network, which has been widely used in the state-of-the-art [42, 54]. It is a conditional GAN, where the generator is a U-Net [34], and the discriminator adopts a patch-based fully convolutional network [25]. This network uses an L1 reconstruction loss to force the generated samples to be locally close to the ground truth, and an adversarial loss to correct the blur effect of the L1 loss and synthesize realistic images. Theoretically, all the GAN architectures discussed above minimize the Jensen-Shannon divergence between the true and the generated data distributions. By contrast, the Wasserstein GAN [4] minimizes the Wasserstein-1 distance between the two distributions. While GAN techniques have shown great success for real-valued image generation, they have rarely been applied to manifold-valued images. This idea has been discussed in [17], where a generalization of Wasserstein GANs to manifold-valued data was proposed in the context of static image generation. This poses several challenges, such as the adaptation of cost functions and optimization algorithms to manifold-valued data, and the generalization of the classic distribution distance to manifolds.
To the best of our knowledge, our work is the first one that proposes a generalization of the Conditional Wasserstein GAN for manifold-valued data to address the problem of dynamic facial expression generation.

Facial Expression Image Generation: Conditional GANs [27] have been exploited in many works to generate realistic face images corresponding to a given expression. Following a first research direction, some works integrated the target expression in a GAN as a deterministic one-hot vector that encodes the target class, thus generating faces conditioned on discrete emotion states [47, 23]. While this solution is simpler, it generates only discrete facial expressions, which significantly reduces the diversity of the generated samples. To tackle this issue, Ding et al. [12] proposed an Expressive GAN (ExprGAN) for facial expression editing that uses an expression controller module. This module is a real-valued vector conditioned on the label that encodes more complex information about the target expression, such as its intensity variation. Following a different direction, other works take advantage of facial geometry and encode the target expression through facial landmark positions [36, 32]. This solution is more flexible, allowing the synthesis of continuous facial expressions, and can be exploited for facial expression transfer as well. The main idea consists of encoding facial landmark positions in heatmaps that provide a per-pixel likelihood for these point locations. These heatmaps are usually fed to the GAN as an additional channel concatenated with the face image, in order to synthesize this face performing the expression encoded in the heatmap. This strategy was adopted by many works, including Song et al. [36], who proposed a geometry-guided GAN architecture for facial expression editing and removal. The face geometry is used as a condition introduced in the network to control the process of expression generation or expression removal. In the same direction, Qiao et al. [32] proposed a Geometry-Contrastive GAN (GC-GAN) to transfer facial expressions across different identities using face geometry. Motivated by the ideas discussed above, we also take advantage of the face geometry by encoding the expression in a landmarks configuration. However, unlike all previous works that generate static images of facial expressions, we tackle the problem of facial expression video synthesis, which is more challenging because the temporal dynamics of the video have to be captured and generated.
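The heatmap encoding described above can be illustrated with a minimal NumPy sketch. It assumes an isotropic Gaussian around each landmark and a pixel-wise maximum to merge landmarks into one channel; the function name `landmark_heatmap` and the parameter `sigma` are illustrative choices, not taken from the cited works.

```python
import numpy as np

def landmark_heatmap(landmarks, height, width, sigma=2.0):
    """Render (x, y) landmarks as one heatmap channel with values in [0, 1].

    Assumption: each landmark contributes an isotropic Gaussian bump and
    bumps are merged with a pixel-wise maximum.
    """
    ys, xs = np.mgrid[0:height, 0:width]          # pixel coordinate grids
    heat = np.zeros((height, width), dtype=float)
    for x, y in np.asarray(landmarks, dtype=float):
        bump = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heat = np.maximum(heat, bump)
    return heat

# Concatenate the heatmap with an RGB face image as an extra channel.
image = np.zeros((16, 16, 3))
heat = landmark_heatmap([[5, 3], [10, 12]], 16, 16)
conditioned_input = np.concatenate([image, heat[..., None]], axis=-1)
```

The resulting four-channel array is the kind of input a landmark-conditioned generator would receive.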

Video Generation with GAN: Despite the remarkable success achieved in image synthesis, video generation is still a more challenging task, due to the difficulty of generating the temporal motion of the video. Some recent works have started to explore GANs for video generation. While some works tackled the problem of future frame prediction [15, 51, 28], other works generate a whole video starting from one image. These works can be roughly divided into two groups according to the way they handle and generate the video motion. In the first group, a spatio-temporal network is used to generate all video frames at the same time. This first category includes the methods in [41] and [35], which exploit 3D spatio-temporal GANs to generate multiple frames of the video simultaneously. However, these approaches usually synthesize videos with poor image quality. To tackle this issue, other works proposed to generate the video motion separately, and then generate the video frames sequentially. This is achieved by using Recurrent Neural Networks (RNNs). In [38], Tulyakov et al. proposed the MoCoGAN framework for video generation. The idea consists of decomposing the video into content and motion information, where the video motion is learned by a Gated Recurrent Unit (GRU) network and the video frames are generated sequentially by a GAN. This approach was applied in different applications, including dynamic facial expression synthesis. However, the generated images present content and motion artifacts and fail to capture fine details of the facial expressions. In [43], Wang et al. exploited facial landmarks for generating smile videos. They also modeled the temporal evolution of the smile expression using an RNN that produces landmark sequences, which are later translated to image sequences using a GAN. This work generates videos with acceptable image quality, but it focuses only on the smile expression, which is simpler to capture and generate than other facial expressions.
Following these last approaches, we also guide the generation process of dynamic facial expressions by facial landmark sequences. However, unlike the previous works, instead of directly generating facial landmark sequences, we generate the motion of these landmarks, which can then be applied to any facial landmarks configuration to generate videos of different identities performing this motion. This is achieved by modeling landmark sequences as curves that can be mapped to points on a manifold. We explicitly exploit the geometry of this manifold to generate the six facial expression sequences as new points on this manifold. To the best of our knowledge, this is the first study exploring manifold-valued representations to generate video motions with GAN, while the common solutions to this task in the state-of-the-art are based on RNNs.

3 Problem Statement

Given a neutral face image with its landmarks, and a label for the desired expression, we aim to generate a facial expression sequence of a given length that corresponds to the identity of the input image and to the desired expression. To achieve this goal, we seek a mapping between this input and its associated facial expression sequence. To reduce the complexity of the task, we define this mapping as a composition of two functions: the first learns the distribution of facial expression dynamics to generate the temporal evolution of the sequence, while the second learns how to synthesize its texture information. Since the visual signal in a video can be divided into content (texture) and dynamics, such decomposition facilitates the problem of video generation and allows us to design functions that focus on learning less complicated tasks. Moreover, this decomposition allows us to apply the same generated dynamics to different identities to generate videos of different persons performing the same facial expression. It is also possible to apply different dynamics to the same identity to generate videos of the same person performing different facial expressions.

To define the first function, we completely ignore the texture information and focus only on facial landmarks. Let us consider a set of training image sequences and its corresponding set of sequences of landmark configurations, in which each landmark configuration is extracted from one facial image. From the set of landmark configuration sequences, this function can learn the distribution of facial expression dynamics. The key idea here is to model the temporal evolution of landmark configurations as time-dependent 2D curves, which can be efficiently represented as single, compact points on the hypersphere manifold [37]. By doing so, the landmark sequences are considered as manifold-valued data, and the geometric properties of this manifold are exploited to define a manifold-valued GAN, MotionGAN, for learning the distribution of facial expression dynamics. As manifold-valued data lie on Riemannian manifolds rather than in Euclidean space, the definition of a manifold-valued data distribution is different from that of a real-valued data distribution; hence, traditional GANs cannot be applied directly in this case. They would generate new samples that do not lie on the hypersphere manifold, in contrast to the real training dynamics, which are represented on that manifold. By proposing a conditional version of the manifold-valued Wasserstein GAN introduced in [17] on the hypersphere manifold, our approach learns the manifold-valued distribution and generates the dynamics of new facial expressions, which can be used to generate a new sequence of landmarks. Finally, another GAN, called TextureGAN, is used to define the second function, which learns from training data to translate the sequence of landmark configurations to a sequence of video frames.

The overall architecture as illustrated in Figure 2 consists of three blocks. First, MotionGAN learns the motion of landmarks from facial landmark sequences of the training set and generates new conditional facial expressions motions. The second block generates a sequence of landmarks corresponding to the facial expression dynamics generated by MotionGAN and a neutral landmark configuration. The last block translates the landmark sequence to a video: it receives a neutral face image and a sequence of landmarks and produces their corresponding facial expression video. In the following sections, we will provide more details about each one of these blocks.

4 MotionGAN: A Conditional manifold-valued GAN for Facial Expression motions Generation

Facial landmarks have been widely used for facial expression analysis [20, 19]. Indeed, facial landmarks are considered a powerful tool to capture the geometric features of the face and encode both the appearance and the dynamics of the facial expression. Besides, facial landmarks have recently been adopted as guiding information for facial expression synthesis in several works [36, 43]. The idea consists of using facial landmarks either as an additional image channel concatenated directly with the input face, or as a vector of landmark coordinates used as a guide during the generation process.

Based on this motivation, we also take advantage of the geometric information provided by facial landmarks in two ways. On one hand, we exploit the evolution of landmark locations to encode the dynamics of the face. This is achieved by modeling the facial landmarks evolution as a curve that can be represented and analyzed in a Riemannian manifold. By exploiting these curves, we train a generative adversarial network to generate new facial dynamics that we can transform to new dynamic landmark configurations of any face given its neutral landmarks configuration. On the other hand, following the state-of-the-art, we use the generated landmarks configurations to guide the generation of the final video frames by adding the texture information in the input image of the GAN to the geometric features provided by landmarks.

In this section, we present more details about the first stage of our method, which consists of modeling, analyzing and generating dynamic facial landmarks using MotionGAN. We first introduce the geometric framework used for data modeling, analysis and representation in a Riemannian manifold. Then, we present the architecture and the details about the learning process of MotionGAN.

4.1 MotionGAN Data Modeling

Let us represent a facial expression video of $k$ frames by its corresponding sequence of landmark configurations $Z = \{Z_1, \dots, Z_k\}$, where each configuration $Z_i$ is an $n \times 2$ matrix of rank 2 encoding the 2D positions of $n$ distinct landmark points. Following such representation, we model the sequence $Z$ as a curve $\alpha$ represented by a continuous parameterized function $\alpha(t) \in \mathbb{R}^{2n}$, $t \in [0, 1]$. These representations greatly simplify the problem of landmark sequence generation, given that each curve can be mapped to one single point on a given manifold. More formally, each curve $\alpha$ can be represented by its Square-Root Velocity Function (SRVF) $q$ [37], according to

$$ q(t) = \frac{\dot{\alpha}(t)}{\sqrt{\|\dot{\alpha}(t)\|}}, \tag{1} $$

where $\|\cdot\|$ is the $\ell^2$-norm in $\mathbb{R}^{2n}$, and $q(t) = 0$ wherever $\dot{\alpha}(t) = 0$. The effectiveness of this representation for shape analysis has been demonstrated for 3D facial curves [14] and action recognition [10]. This representation encodes the temporal evolution of the facial landmark configurations, and so the dynamics of the facial expression. In [37], the authors proposed to remove the scale variability of the resulting curves by scaling the $\mathbb{L}^2$-norm of these functions to 1 (i.e., $\|q\|_{\mathbb{L}^2} = 1$). Accordingly, the space of the resulting representations becomes a unit hypersphere $\mathcal{C}$ in the Hilbert manifold $\mathbb{L}^2([0, 1], \mathbb{R}^{2n})$, and each landmark configuration sequence becomes a point on this spherical manifold. Consequently, we reduce the problem of landmark sequence generation to a problem of generating points on the spherical manifold $\mathcal{C}$.
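As a rough illustration of the SRVF encoding, the sketch below maps a sampled landmark sequence to a unit-norm SRVF. It assumes frames sampled uniformly on $[0, 1]$, finite-difference velocities, and trapezoidal integration for the $\mathbb{L}^2$ norm; the names `srvf_from_landmarks` and `l2_inner` are hypothetical helpers rather than the authors' implementation.

```python
import numpy as np

def l2_inner(f, g, t):
    """Trapezoidal approximation of the L2 inner product of two sampled curves."""
    y = np.sum(f * g, axis=1)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(t)))

def srvf_from_landmarks(Z, eps=1e-8):
    """Map a landmark sequence Z of shape (k, n, 2) to a unit-norm SRVF.

    Returns q of shape (k, 2n) and the time grid t of shape (k,).
    Assumption: the k frames are sampled uniformly on [0, 1].
    """
    Z = np.asarray(Z, dtype=float)
    k = Z.shape[0]
    t = np.linspace(0.0, 1.0, k)
    alpha = Z.reshape(k, -1)                       # curve samples in R^{2n}
    vel = np.gradient(alpha, t, axis=0)            # finite-difference velocity
    speed = np.linalg.norm(vel, axis=1, keepdims=True)
    # q(t) = velocity / sqrt(speed), and 0 where the curve does not move
    q = np.where(speed > eps, vel / np.sqrt(np.maximum(speed, eps)), 0.0)
    norm = np.sqrt(l2_inner(q, q, t))
    if norm < eps:
        raise ValueError("static sequence: the SRVF is identically zero")
    return q / norm, t                             # point on the unit hypersphere

rng = np.random.default_rng(0)
Z = np.cumsum(rng.normal(size=(30, 68, 2)), axis=0)   # synthetic moving landmarks
q, t = srvf_from_landmarks(Z)
```

After the final rescaling, the discretized $\mathbb{L}^2$ norm of `q` equals one, so each sequence becomes a single point on the (discretized) hypersphere.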

The first step needed in analyzing and comparing these representations consists of parametrizing them. Indeed, due to the different execution rates of the facial expressions, aligning and parametrizing these curves is crucial for comparing them efficiently. Formally, given two curves $\alpha_1$ and $\alpha_2$, their corresponding SRVFs $q_1$ and $q_2$ can be registered by finding the non-linear reparametrization function $\gamma^* : [0, 1] \to [0, 1]$ that optimally registers the two curves, allowing their rate-invariant comparison. The optimal parametrization function can be found by using a Dynamic Programming algorithm as explained in [37]. After registration, we can efficiently compute the distance between the two registered SRVFs, still denoted $q_1$ and $q_2$, according to

$$ d_{\mathcal{C}}(q_1, q_2) = \cos^{-1}\big(\langle q_1, q_2 \rangle\big), \tag{2} $$

where $\langle \cdot, \cdot \rangle$ denotes the $\mathbb{L}^2$ inner product.

This distance quantifies the similarity between the two curves on the hypersphere manifold $\mathcal{C}$. As explained in [37], it is invariant to rotation and scaling, and it also accounts for the stretching and the bending of the curves. In our approach, we also need to define the statistical mean of these representations in order to define a representative element of a specific group (e.g., a representative curve of the happy expression). To this end, we introduce the Riemannian center of mass, also known as the Karcher mean [21], which can be used to compute an average element of a set of points on the hypersphere manifold. More formally, we define the Karcher mean of a set of points $\{q_1, \dots, q_m\}$ on the hypersphere manifold according to $\mu = \arg\min_{\mu' \in \mathcal{C}} \sum_{i=1}^{m} d_{\mathcal{C}}^2(\mu', q_i)$. In our work, we compute the Karcher mean to derive a representative curve of each facial expression, which is used to align the other curves. We also exploit the Karcher mean to define a reference point $\mu$ at which we define the tangent space of $\mathcal{C}$ that will be used in the training of MotionGAN. In the next section, we will show how to use all the mathematical tools introduced here to train MotionGAN to generate new points in $\mathcal{C}$ that encode the dynamics of new facial expressions.
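The geodesic distance and the Karcher mean can be sketched numerically as follows. This is a minimal illustration under stated assumptions: points are unit vectors in $\mathbb{R}^d$ (a discretized stand-in for the Hilbert sphere), the Euclidean dot product replaces the $\mathbb{L}^2$ inner product, and the mean is found with the usual fixed-point iteration that averages tangent vectors. The helper names and the parameters `n_iter`, `step` and `tol` are illustrative.

```python
import numpy as np

def geodesic_distance(q1, q2):
    """Arc length between two unit vectors."""
    return float(np.arccos(np.clip(np.dot(q1, q2), -1.0, 1.0)))

def log_map(p, q):
    """Tangent vector at p pointing toward q, with length equal to their distance."""
    d = geodesic_distance(p, q)
    if d < 1e-12:
        return np.zeros_like(p)
    return (d / np.sin(d)) * (q - np.cos(d) * p)

def exp_map(p, s):
    """Point reached from p by following the tangent vector s along a great circle."""
    n = np.linalg.norm(s)
    if n < 1e-12:
        return p.copy()
    return np.cos(n) * p + np.sin(n) * (s / n)

def karcher_mean(points, n_iter=100, step=1.0, tol=1e-10):
    """Iteratively minimize the sum of squared geodesic distances to the points."""
    mu = points[0] / np.linalg.norm(points[0])
    for _ in range(n_iter):
        mean_tangent = np.mean([log_map(mu, q) for q in points], axis=0)
        if np.linalg.norm(mean_tangent) < tol:
            break
        mu = exp_map(mu, step * mean_tangent)
    return mu

# Four points placed symmetrically around the pole p0: their mean is p0.
p0 = np.array([1.0, 0.0, 0.0])
s1 = np.array([0.0, 0.3, 0.0])
s2 = np.array([0.0, 0.0, 0.2])
pts = np.array([exp_map(p0, s1), exp_map(p0, -s1), exp_map(p0, s2), exp_map(p0, -s2)])
mu = karcher_mean(pts)
```

The iteration is only guaranteed to behave well when the points are concentrated in a small region of the sphere, which is the regime assumed here.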

After defining the mathematical tools used to analyze and generate facial expression dynamics on the hypersphere manifold $\mathcal{C}$, we need to recover the landmarks configuration sequence that corresponds to a newly generated point in $\mathcal{C}$. Conveniently, for each SRVF $q$ there exists a unique curve $\alpha$, up to a translation, such that the given $q$ is the SRVF of that $\alpha$. Formally, the curve can be recovered, up to a translation, using

$$ \alpha(t) = \int_0^t \|q(s)\|\, q(s)\, ds + \alpha(0), \tag{3} $$

where $\alpha(0)$ represents the landmarks configuration of the initial frame. According to this equation, we can apply the generated facial expression dynamics encoded in $q$ to any identity. Indeed, by using the landmarks configuration of any identity as the initial condition in Eq. (3), we can recover the sequence of landmark configurations corresponding to this identity performing the motion encoded in $q$.
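The recovery step of Eq. (3) can be sketched with a cumulative trapezoidal integral. The sketch assumes the SRVF is sampled on a uniform grid over $[0, 1]$; `curve_from_srvf` is a hypothetical helper name. Because the SRVF is rescaled to unit norm before generation, the recovered displacements are defined only up to that global scale.

```python
import numpy as np

def curve_from_srvf(q, alpha0):
    """Integrate ||q(s)|| q(s) from 0 to t and add the initial configuration.

    q: (k, d) SRVF samples on a uniform grid over [0, 1] (assumption).
    alpha0: (d,) flattened landmarks of the first frame.
    Returns alpha of shape (k, d) with alpha[0] equal to alpha0.
    """
    q = np.asarray(q, dtype=float)
    k, d = q.shape
    t = np.linspace(0.0, 1.0, k)
    integrand = np.linalg.norm(q, axis=1, keepdims=True) * q
    steps = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)[:, None]
    displacement = np.vstack([np.zeros((1, d)), np.cumsum(steps, axis=0)])
    return displacement + np.asarray(alpha0, dtype=float)

# A constant-velocity curve alpha(t) = alpha0 + t * v has SRVF v / sqrt(||v||).
v = np.array([3.0, 4.0])
q_const = np.tile(v / np.sqrt(np.linalg.norm(v)), (11, 1))
alpha0 = np.array([1.0, -1.0])
alpha = curve_from_srvf(q_const, alpha0)
```

Passing a different `alpha0` with the same `q` applies the same motion to another identity's neutral landmarks, which is the property the text relies on.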

4.2 MotionGAN Network

Given a set of training samples of facial landmark configuration sequences with their associated facial expression classes, we compute the corresponding set of parametrized SRVFs to train the MotionGAN model.

Given that SRVF representations are manifold-valued data that lie on a Riemannian manifold rather than in Euclidean space, we propose MotionGAN, a conditional version of Wasserstein GAN for manifold-valued data, to learn the distribution of SRVFs associated with each emotion class. This GAN extends CGAN from Euclidean space to the hypersphere manifold. We exploit the logarithm and exponential maps defined for the hypersphere manifold, given later by Eq. (6) and Eq. (7), respectively, to optimize the Wasserstein distance between the distribution of the generated manifold-valued data and that of the real manifold-valued data under adversarial training. MotionGAN maps a random vector to an SRVF point on the hypersphere. It consists of two adversarial models: a generative model that captures the data distribution of the facial expression dynamics encoded in the SRVFs, and a discriminative model that estimates the probability that a sample comes from the training data rather than from the generator. The goal of training these two models is to learn a function that maps a noise vector sampled from a normal distribution to an SRVF encoding the dynamic evolution of facial landmarks.

Fig. 3: Overview of MotionGAN, the Conditional Wasserstein GAN used for motion generation.

4.3 Loss Function

The global objective function used to train MotionGAN is a weighted sum of three loss functions: the adversarial loss, the reconstruction loss in the tangent space, and the reconstruction loss on the hypersphere.

Regarding the adversarial loss, we propose the conditional version of the objective function proposed in [17], which generalizes the objective function of Wasserstein GAN [4] to hypersphere manifold-valued data. It relies on the logarithm map $\log_p(\cdot)$ and the exponential map $\exp_p(\cdot)$ defined for the hypersphere manifold $\mathcal{C}$ at a particular point $p$. The logarithm map projects an SRVF $q$ from the hypersphere to the tangent space at $p$. This tangent space is a vector space, where any deep network can be applied directly, while the exponential map transforms a tangent vector $s$ back to the hypersphere manifold. The logarithm and exponential maps for the hypersphere manifold are defined by:

$$ \log_p(q) = \frac{d_{\mathcal{C}}(q, p)}{\sin\big(d_{\mathcal{C}}(q, p)\big)}\,\big(q - \cos\big(d_{\mathcal{C}}(q, p)\big)\, p\big), \tag{6} $$

$$ \exp_p(s) = \cos(\|s\|)\, p + \sin(\|s\|)\, \frac{s}{\|s\|}, \tag{7} $$

where $d_{\mathcal{C}}(q, p)$ represents the distance between $q$ and $p$ in $\mathcal{C}$ defined by Eq. (2).

In the adversarial loss of Eq. (5), the generator input is random noise, and the gradient penalty is evaluated on random samples drawn uniformly along straight lines between pairs of points sampled from the real distribution and the generated distribution, both expressed in the tangent space. The gradient in the penalty term is taken with respect to these interpolated samples.
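For orientation, one plausible way to write this objective is the standard WGAN gradient-penalty form, with the generator output passed through the exponential and logarithm maps at a reference point $\mu$. The symbols $G$, $D$, $c$ (expression label), $\lambda$ (penalty weight), $\tau$ and $\tilde{x}$ are notational assumptions made for this sketch, and the exact form and sign convention used by the authors may differ.

```latex
\begin{align}
\mathcal{L}_{\mathrm{adv}} &=
  \mathbb{E}_{z}\Big[ D\big(\log_{\mu}(\exp_{\mu}(G(z, c))),\, c\big) \Big]
  - \mathbb{E}_{q}\Big[ D\big(\log_{\mu}(q),\, c\big) \Big]
  + \lambda\, \mathbb{E}_{\tilde{x}}\Big[ \big( \|\nabla_{\tilde{x}} D(\tilde{x}, c)\|_2 - 1 \big)^2 \Big], \\
\tilde{x} &= \tau\, \log_{\mu}(q) + (1 - \tau)\, \log_{\mu}\big(\exp_{\mu}(G(z, c))\big),
  \qquad \tau \sim \mathcal{U}[0, 1].
\end{align}
```

In this form the discriminator minimizes $\mathcal{L}_{\mathrm{adv}}$ while the generator minimizes the negative of the first expectation; the interpolation is a straight line because it is performed in the tangent space, which is a vector space.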

In addition to the adversarial loss, we use the two reconstruction losses. The first one measures the distance, in the tangent space, between the tangent vector of the ground truth SRVF and its associated reconstructed vector; it is computed as the norm of the difference between these two tangent vectors. The second one quantifies the similarity between the generated SRVF and its corresponding ground truth on the hypersphere; it is computed as the geodesic distance between them, given by Eq. (2).
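The logarithm and exponential maps and the two reconstruction losses can be sketched together in NumPy. The sketch treats points as unit vectors in $\mathbb{R}^d$ and assumes the generator output `s_fake` already lies in the tangent space at the reference point `p`. The choice of the L1 norm (`norm_ord=1`) for the tangent-space loss is an assumption, since the norm type is not recoverable from the text; all function names are illustrative.

```python
import numpy as np

def geodesic_distance(q1, q2):
    """Arc length between two unit vectors."""
    return float(np.arccos(np.clip(np.dot(q1, q2), -1.0, 1.0)))

def log_map(p, q):
    """Project the sphere point q onto the tangent space at p."""
    d = geodesic_distance(q, p)
    if d < 1e-12:
        return np.zeros_like(p)
    return (d / np.sin(d)) * (q - np.cos(d) * p)

def exp_map(p, s):
    """Map the tangent vector s at p back onto the sphere."""
    n = np.linalg.norm(s)
    if n < 1e-12:
        return p.copy()
    return np.cos(n) * p + np.sin(n) * (s / n)

def tangent_reconstruction_loss(p, s_fake, q_real, norm_ord=1):
    """Norm of the difference between re-projected fake and real tangent vectors."""
    s_back = log_map(p, exp_map(p, s_fake))
    return float(np.linalg.norm(s_back - log_map(p, q_real), ord=norm_ord))

def sphere_reconstruction_loss(p, s_fake, q_real):
    """Geodesic distance between the generated point and its ground truth."""
    return geodesic_distance(exp_map(p, s_fake), q_real)

p = np.array([1.0, 0.0, 0.0, 0.0])        # reference point on the sphere
s = np.array([0.0, 0.5, -0.2, 0.1])       # tangent vector at p (orthogonal to p)
q = exp_map(p, s)                          # matching ground-truth point
```

For a tangent vector shorter than $\pi$, `log_map(p, exp_map(p, s))` returns `s`, so both losses vanish when the generated vector reproduces the ground truth.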

According to the objective function, the optimization of MotionGAN is done in the tangent space of the hypersphere at a reference point. Since the tangent space is a linear vector space, any regular network can be directly applied in it. This is achieved by exploiting the logarithm map, which maps the SRVFs of the database from the hypersphere to the tangent space; the resulting tangent vectors form the real data of the network. On the other hand, the exponential map transforms the generated fake data from the tangent space to the hypersphere, and the logarithm map then transforms the data back to the tangent space, which forces the generator to produce data on the desired tangent space of the sphere. The discriminator takes as input the real data and the generated data, both lying on the tangent space of the sphere, and tries to distinguish between them. At the end of the training, the generator has learned the distribution of the real data and generates data similar to the real ones on the desired tangent space of the sphere. The tangent space used in our training is the tangent space of the hypersphere at the Karcher mean of the training data. In the following, we use "the tangent space of the hypersphere" to refer to this tangent space.

After training, we use the resulting generator to generate a point on the tangent space of the hypersphere conditioned on the desired expression. We then use the exponential map to find its corresponding point on the sphere, which is an SRVF that encodes a dynamic evolution of the desired facial expression. Finally, we use Eq. (3) with a neutral landmarks configuration of any identity to generate the sequence of landmark configurations corresponding to this identity performing the desired facial expression encoded in the generated SRVF. In the following, we present the module that transforms the sequence of landmarks to a sequence of frames.

Algorithm 1 outlines the steps used to train MotionGAN, while Algorithm 2 summarizes the steps needed to generate a new sequence of facial landmark configurations using the trained MotionGAN.

Input: training data with their corresponding labels; initial discriminator parameters; initial generator parameters; learning rate; batch size; number of discriminator iterations per generator iteration; balance parameter of the gradient norm penalty; number of iterations for the generator.

for each generator iteration do
    for each discriminator iteration do
        Sample a minibatch of noise samples from the noise prior
        Sample a minibatch of examples from the real data distribution
        Compute the stochastic gradient of Eq. (5) with respect to the discriminator parameters
        Update the discriminator parameters with the Adam optimizer
    end for
    Sample a minibatch of noise samples from the noise prior
    Compute the stochastic gradient of the generator objective with respect to the generator parameters
    Update the generator parameters with the Adam optimizer
end for
Algorithm 1 Shape Conditional Wasserstein GAN training

Input: generator trained with Algorithm 1; neutral landmarks configuration; desired facial expression.

1: Sample a random noise vector from the noise prior
2: Generate a point on the tangent space of the hypersphere
3: Map the generated point to the hypersphere using Eq. (7) to obtain an SRVF
4: Generate the sequence of landmarks using Eq. (3), with the neutral landmarks configuration as initial condition
Algorithm 2 Landmarks Sequence Generation
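The last step of Algorithm 2 can be illustrated with a short NumPy sketch. It assumes the standard discrete SRVF definition (velocity divided by the square root of its norm) and its inverse by cumulative integration, with forward differences on a uniform time grid. The equations referenced in the text are not reproduced in this section, so the discretization, the function names, and the omission of the unit-norm scaling that places SRVFs on the hypersphere are all assumptions of this sketch.

```python
import numpy as np

def curve_to_srvf(curve):
    """curve: array (T, d), one flattened landmark configuration per frame."""
    n_frames = curve.shape[0]
    dt = 1.0 / (n_frames - 1)
    velocity = np.diff(curve, axis=0) / dt  # (T-1, d) forward differences
    speed = np.linalg.norm(velocity, axis=1, keepdims=True)
    # Guard against division by zero for static frames.
    return velocity / np.sqrt(np.maximum(speed, 1e-12))

def srvf_to_curve(srvf, start):
    """Integrate an SRVF back to a curve that starts at `start` (shape (d,))."""
    n_frames = srvf.shape[0] + 1
    dt = 1.0 / (n_frames - 1)
    velocity = srvf * np.linalg.norm(srvf, axis=1, keepdims=True)
    steps = np.cumsum(velocity * dt, axis=0)
    return np.vstack([start, start + steps])
```

Calling `srvf_to_curve(q, neutral)` with a different `neutral` configuration applies the same frame-to-frame displacements to another identity, which is the property the generation step relies on.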

5 TextureGAN: Landmarks Sequence to Video Generation

After learning the distribution of facial expression dynamics from the facial landmarks, we need to add the texture information to synthesize the final facial expression video. To this end, we use TextureGAN, which is a conditional GAN for real-valued data. This GAN uses a map of a facial landmarks configuration as a guide to generate the face of the input image with the facial expression corresponding to that configuration. Since the temporal information has already been handled by MotionGAN, TextureGAN focuses only on the texture information in individual static images. However, even though it generates static images, its role is more complicated than generating a face with a certain facial expression, as done in other static state-of-the-art approaches [12] that focus on generating six or seven basic facial expressions. Indeed, TextureGAN also has to generate the intermediate frames of the video, which do not necessarily correspond to any of the basic facial expressions. Moreover, the generated frames have to show smooth, continuous changes over time without sudden transitions. To this end, we chose to guide the generation of the facial expression with facial landmarks, which simplifies the task of generating intermediate frames and allows us to generate continuous expressions with smooth changes over time. This result is not possible, for example, when using a condition in the form of a one-hot vector.

We are given a set of training image sequences and the corresponding set of landmark-configuration sequences, in which each landmark configuration is paired with its facial image, together with the input neutral face image of each training sequence. These sets are used to train the generator and the discriminator of TextureGAN to learn the mapping from a neutral face image and a target landmark configuration to the corresponding expressive face image. To this end, we combine reconstruction and adversarial losses. The global objective function of TextureGAN is a weighted sum of three loss functions: an adversarial loss, an identity loss and a reconstruction loss, such that,


where the adversarial loss is given by,


To further preserve the face identity in the generated frames, we use an identity loss that enforces the similarity of identity features between the input and the output faces. To this end, we use the VGG-face [31] model trained for face recognition to extract identity features, and maximize the similarity between them. The identity loss is given by,


where the norm is the L1 norm and the features are those extracted from the convolutional layers of VGG-face. The layers used are conv1, conv2, conv3, conv4 and conv5. In order to keep the generated frames close to the ground truth, we add a reconstruction loss to the global objective function. The reconstruction loss is given by,
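As an illustration of how the three terms described in this section fit together, the following NumPy sketch combines them into a weighted sum. It is not a reproduction of the paper's equations: the identity term is taken as a sum of L1 distances over per-layer feature arrays, the reconstruction term is assumed to be a mean absolute pixel difference, and the weights `w_adv`, `w_id` and `w_rec` are hypothetical placeholders, since the actual weighting values are not given here.

```python
import numpy as np

def identity_loss(feats_input, feats_output):
    """Sum of L1 distances between per-layer identity features.

    feats_input / feats_output: lists of arrays, one per layer (e.g. conv1..conv5).
    """
    return sum(np.abs(a - b).sum() for a, b in zip(feats_input, feats_output))

def reconstruction_loss(generated, target):
    """Assumption: mean absolute pixel difference between frame and ground truth."""
    return np.abs(generated - target).mean()

def texture_gan_objective(adv, idl, rec, w_adv=1.0, w_id=1.0, w_rec=1.0):
    """Weighted sum of adversarial, identity and reconstruction terms."""
    return w_adv * adv + w_id * idl + w_rec * rec
```

In practice the feature lists would come from a pretrained face recognition network, and the adversarial term would be supplied by the discriminator; both are left outside this sketch.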


6 Experiments

We performed different experiments to evaluate our approach. We first describe the used benchmarks and the experimental setup. Then, we present a quantitative and a qualitative evaluation for each part of our method including motion generation and video synthesis. We also introduce an ablation study to show the importance of each component of our approach. Finally, we introduce and evaluate the usefulness of our proposed method for two other applications: data augmentation for facial expression recognition and dynamic facial expression transfer.

6.1 Datasets

Oulu-CASIA [49]: This dataset contains over 480 videos of 80 subjects. Each subject has six videos corresponding to the six basic emotion labels. All videos begin with a neutral expression and end with the apex of the corresponding expression. The first part of the subjects was used for training, while the last part was used as the test set.

MUG Facial Expression [2]: This database includes videos of 86 subjects. Each video consists of 50 to 160 frames. We used only the sequences representing one of the six basic facial expressions: anger, disgust, fear, happiness, sadness, and surprise. Following [50], we split the dataset into three parts in a subject-independent manner. The first two parts were used for training, while the last part was used as the test set. The beginning and the end of each video in this database correspond to neutral expressions. Accordingly, we used only the first half of each video, which starts from a neutral expression and ends with a peak expression.

Extended Cohn Kanade (CK+) [26]: This dataset comprises 327 sequences of posed expressions, annotated with seven expression labels, from which we selected the six basic expressions: anger, disgust, fear, happiness, sadness, and surprise. Each sequence starts with a neutral expression and reaches the peak in the last frame. We use this dataset for data augmentation in the training of MotionGAN.

6.2 Implementation Details

Preprocessing: For all the images used in our experiments, we cropped the face regions using OpenFace [5] and rescaled them to a common size. Then, we normalized all videos to a fixed number of frames using the approach proposed in [53]. For MotionGAN data, we used OpenFace [5] to extract the 2D coordinates of facial landmarks from each video frame. These landmarks were then arranged in a matrix representing a curve, whose dimensions are determined by the video length and the number of landmarks. Equation (1) was used to compute the Square Root Velocity Function (SRVF) of the resulting curves. By computing the Karcher mean of the SRVFs belonging to the same class, we obtain a representative element of each expression class. Then, we align each training SRVF with the representative element of its class. The resulting SRVFs were used to train MotionGAN, which can then produce new SRVFs corresponding to new facial expression dynamics. The generated SRVFs were then transformed to facial landmark sequences using Equation (3) with any neutral facial landmarks configuration. TextureGAN was trained with pre-processed faces and guided by their ground truth landmark sequences, while the landmark sequences generated by MotionGAN were used in the test stage. Indeed, we avoid using landmark sequences generated by MotionGAN during TextureGAN training, since MotionGAN starts from noise and therefore generates random motions for the same expression. Thus, we cannot be sure to generate the exact sequence corresponding to the ground truth video that is used to minimize the reconstruction loss of TextureGAN. To define the target pose for TextureGAN, we encoded facial landmark locations in heatmaps that were used as additional image channels, concatenated with the input face image. Each heatmap is a multi-channel image with the same size as the input face image, where each channel encodes the location of one of the 68 facial landmarks, and the value of each pixel in a channel corresponds to the likelihood that the landmark is located at that pixel.
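The heatmap encoding of the target pose can be sketched as follows. The text only states that each channel holds a per-pixel likelihood for one landmark; the isotropic Gaussian shape, the `sigma` value and the function names below are assumptions made for illustration.

```python
import numpy as np

def landmarks_to_heatmaps(landmarks, height, width, sigma=2.0):
    """landmarks: (K, 2) array of (x, y) pixel coordinates.

    Returns an array of shape (height, width, K), one channel per landmark.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.empty((height, width, len(landmarks)), dtype=np.float32)
    for k, (x, y) in enumerate(landmarks):
        # Assumed Gaussian bump centred on the landmark location.
        heatmaps[..., k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return heatmaps

def build_generator_input(image, landmarks, sigma=2.0):
    """Concatenate the face image (H, W, C) with the K landmark channels."""
    height, width = image.shape[:2]
    heatmaps = landmarks_to_heatmaps(landmarks, height, width, sigma)
    return np.concatenate([image, heatmaps], axis=-1)
```

With a three-channel face image and 68 landmarks, the resulting input has 71 channels.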

MotionGAN architecture details: The MotionGAN architecture consists of multiple upsampling and downsampling blocks. The generator takes as input a vector of size 128 sampled from a normal distribution and concatenated with the input label, which is encoded as a one-hot vector of size 6 (the number of classes). This vector is processed by one fully connected layer and five upsampling blocks with 530, 274, 146, 64 and 1 output channels. Each upsampling block consists of nearest-neighbor upsampling followed by a convolution. The outputs of the first four convolution layers are activated by the ReLU function and concatenated with the label vector, which is transformed to one channel, while the last convolution layer uses the hyperbolic tangent. The final output of the MotionGAN generator is one matrix for each noise sample. The discriminator of MotionGAN consists of three downsampling blocks with 64, 32 and 16 output channels. Each block is a strided convolution layer followed by batch normalization and ReLU activation. These layers are followed by two fully connected (FC) layers with 1024 and 1 outputs. The first FC layer uses Leaky ReLU and batch normalization. Except for the last FC layer, all outputs are concatenated with the label, either in the form of one channel for the convolution layer outputs or as a one-hot vector for the fully connected layers.
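The conditioning of the generator input described above (a 128-dimensional noise vector concatenated with a 6-dimensional one-hot label) can be sketched in a few lines of NumPy. The function name and the use of `numpy.random.Generator` are illustrative choices and not part of the original implementation.

```python
import numpy as np

def make_generator_input(label, num_classes=6, noise_dim=128, rng=None):
    """Return a vector of size noise_dim + num_classes: normal noise, then one-hot label."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(noise_dim)
    one_hot = np.zeros(num_classes)
    one_hot[label] = 1.0
    return np.concatenate([noise, one_hot])
```

Sampling several noise vectors with the same `label` is what yields different motions for the same expression class.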

Fig. 4: Visualization of some facial landmark sequences. The left block shows landmark sequences obtained with the generated SRVF using MotionGAN applied to neutral landmark configurations. The right block shows some landmark sequences from the Oulu-CASIA dataset used in the training of MotionGAN. Each row corresponds to one facial expression. The original sequences contain 32 frames from which 7 frames were selected to visualize each sequence.

TextureGAN architecture details: The TextureGAN generator takes as input a pre-processed face image concatenated with a heatmap encoding the target pose. We base our TextureGAN architecture on the image-to-image translation network proposed in [18]. The generator is composed of an encoder and a decoder that have symmetric architectures. The encoder consists of six convolutional layers with 64, 128, 256, 1024, 1024 and 1024 output channels. Each convolutional layer is followed by ReLU activation and batch normalization. Following the U-Net network [34] and [18], we add skip connections between each layer i and layer n − i, where n is the total number of network layers. The skip connections concatenate all channels at layer i with those at layer n − i. To avoid overfitting, we add dropout after the first three convolutional layers. The architecture of the decoder is symmetric to that of the encoder.

The discriminator of TextureGAN contains four strided convolutional layers with 64, 128, 256 and 1024 output channels. Each convolutional layer is followed by batch normalization and Leaky ReLU activation, except the first one, which does not use batch normalization. The network ends with one fully connected layer with 1 output, followed by sigmoid activation.

We train the networks using the Adam optimizer [22], with a learning rate of 0.0002 and a mini-batch size of 64 for TextureGAN and 128 for MotionGAN. The hyper-parameters of MotionGAN were set empirically, while those of TextureGAN were kept fixed. For data augmentation, we perform random flipping of the input images. The two networks were trained separately: MotionGAN for 200 epochs and TextureGAN for 400 epochs. We implemented our models with the TensorFlow [1] framework, based on the implementation of [18].

6.3 Evaluation

We assessed the performance of the proposed approach by evaluating the generated videos. First, we evaluated quantitatively and qualitatively the facial expression dynamics generated by MotionGAN; then, we assessed the quality of the videos generated by TextureGAN.

6.3.1 Landmark Sequence Generation

Qualitative results: In order to qualitatively evaluate the generated facial expression dynamics (i.e., SRVF), we applied them to a neutral facial landmark configuration following Eq. (3). This results in sequences of facial landmarks that follow the dynamics encoded in the SRVFs. In Figure 4, we show some generated facial landmarks for six basic facial expressions. The visualized landmark sequences show that MotionGAN is able to generate realistic facial expression dynamics from noise that are comparable with the ground truth sequences. Moreover, Figure 4 shows that the proposed MotionGAN generator is able to synthesize facial expressions dynamics (and their associated facial landmark configurations) corresponding to different conditioning labels.

Quantitative Results: In order to quantitatively assess the facial expression dynamics generated by MotionGAN, and hence their associated landmark sequences, we use the geodesic distance between SRVFs on the hypersphere given by Eq. (2). This distance allows us to quantify the similarity between the generated sequences and those of the ground truth; it also allows us to measure the similarity between sequences of the same facial expression and those of different facial expressions.

Accordingly, we used the MotionGAN generator to synthesize 64 facial expression dynamics (i.e., SRVFs) for each of the six basic facial expressions, which results in 384 generated SRVFs. After computing the geodesic distances between all these generated samples, we used Multidimensional Scaling (MDS) [9] to visualize them in a two-dimensional space. In Figure 5, we show the 2D visualization of the generated samples as well as those of the databases used to train MotionGAN. The first aspect that can be noted is the effectiveness of the representation used to encode the motion of the facial expressions. Indeed, the plots show that SRVFs can easily differentiate facial expression classes, while keeping close the classes that are most similar to each other (e.g., Fear and Surprise). In addition, we notice that most of the generated expressions are well separated, which demonstrates that MotionGAN learns the distribution of each class and is capable of generating samples conditioned on the input labels. Moreover, this visualization shows the high diversity of the generated motions, which allows us to generate different dynamics for the same facial expression.

(a) Training SRVF data from the Oulu-CASIA and CK+ datasets used in MotionGAN training
(b) Generated SRVF data with MotionGAN
(c) Superposition of the training and the generated data
Fig. 5: 2D visualization of the SRVF data using MDS based on the geodesic distance on the hypersphere.
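The visualization procedure (pairwise geodesic distances followed by MDS) can be sketched with NumPy. The sketch assumes that SRVFs are stored as unit-norm row vectors, so that the geodesic distance on the hypersphere is the arc length `arccos` of the inner product, and it uses classical (Torgerson) MDS; the specific MDS variant used in the experiments is not stated, so this choice is an assumption.

```python
import numpy as np

def geodesic_distance_matrix(points):
    """points: (N, d) array with unit-norm rows; returns (N, N) arc-length distances."""
    gram = np.clip(points @ points.T, -1.0, 1.0)
    return np.arccos(gram)

def classical_mds(dist, n_components=2):
    """Embed N items in n_components dimensions from an (N, N) distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (dist ** 2) @ centering  # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest eigenvalues first
    scale = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * scale
```

A 2D scatter plot of `classical_mds(geodesic_distance_matrix(srvfs))`, colored by expression label, would give a figure of the same kind as Fig. 5.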

6.3.2 Video Generation

After evaluating the facial expressions dynamics generated by MotionGAN, we aim to evaluate the quality of the final generated videos. The generated videos in the following experiments are guided by landmarks sequences resulting from applying the generated SRVFs on neutral input facial landmark configurations.

Qualitative Results: In Figure 6, we show some videos generated with our approach. First, we generate facial expression dynamics from noise using MotionGAN; then we apply these dynamics to the facial landmarks configuration of the face at hand. This results in a sequence of facial landmarks that guides TextureGAN in generating the final video from the input face image. All the faces used in the following results are taken from the test set of the corresponding dataset. Since the split of the datasets was subject independent, none of the faces used here was seen by the model during training. The resulting videos demonstrate that our method is able to synthesize realistic videos of previously unseen faces with smooth continuity and without abrupt transitions. Moreover, the results show that our model preserves the characteristics of the input faces, such as identity and eyeglasses, while it can also generate the teeth region, which is not visible in the source image, for the happy and surprise expressions. Some examples of animated videos are shown in this link.

In Figure 7, we compare some frames generated by our model with three state-of-the-art video generation approaches: MoCoGAN [38], VGAN [41], and TGAN [35]. For each approach, we show the synthesized videos of two identities performing two facial expressions in the first two rows. Then, we report our generated videos for the same identities and the same expressions in the last two rows. We observe that our model generates more realistic images with fewer content and motion artifacts, while the other approaches generate blurry images and fail to keep fine details of the expressions. The dynamics of our generated videos also look more natural and change more smoothly than those generated by the other approaches.

Fig. 6: Generated videos by MotionGAN and TextureGAN. The image sequences were randomly selected and the reported images are taken every 5 frames. Some examples of animated videos are shown in this link.
(a) Generated by MoCoGAN [38]
(b) Generated by VGAN [41]
(c) Generated by TGAN [35]
(d) Generated by MotionGAN and TextureGAN
Fig. 7: Qualitative comparison with the state-of-the-art on the MUG Facial Expression database. The samples generated by our model are randomly selected, while those generated with state-of-the-art approaches are taken from [38]. The state-of-the-art samples are generated with a model trained on the MUG Facial Expression database, while our samples are generated with our model trained on the Oulu-CASIA dataset.

Quantitative Results: To evaluate our model quantitatively, we used the average content distance (ACD) [38] and its ACD-I variant proposed in [50]. These metrics measure the content consistency of a generated video based on how well the video preserves the identity of the input face. The ACD metric is computed using OpenFace [3], a deep model trained for face recognition. This model extracts identity features in the form of a vector from each frame of the video. The ACD is then given by the average of the distances between the vectors of consecutive frames. The ACD-I metric is the average distance between each generated frame and the original input frame. It is used to assess how well the facial identity is captured in the generated video. In Tables I and II, we compare our results with the state-of-the-art based on the ACD and ACD-I metrics on the MUG facial expression dataset [2]. In Table I, we report the ACD obtained with 256 generated videos following the setting of [38]. We further report the ACD of the training set as reference. In Table II, we compare the content consistency of our generated videos, using ACD and ACD-I, with other state-of-the-art approaches following the setting of [50], where we generate a video of each of the six basic facial expressions for each subject of the test set of the MUG facial expression dataset.

The quantitative results show that our approach achieves the best ACD in Table I (0.187) and the best ACD-I in Table II (0.147), while its ACD in Table II (0.115) is second only to that of FRGAN (0.107). This demonstrates that it generates videos with consistent content while preserving the identity of the input face, which is consistent with the qualitative results discussed previously.
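Given per-frame identity feature vectors, the two metrics described above reduce to a few lines of NumPy. The sketch below assumes the Euclidean distance between feature vectors and takes the features as precomputed inputs; the feature extractor itself (OpenFace in the experiments) is not part of the sketch, and the function names are illustrative.

```python
import numpy as np

def acd(frame_feats):
    """Average distance between identity features of consecutive frames.

    frame_feats: array (T, d), one feature vector per generated frame.
    """
    diffs = np.diff(frame_feats, axis=0)
    return float(np.linalg.norm(diffs, axis=1).mean())

def acd_i(frame_feats, input_feat):
    """Average distance between each generated frame and the input frame features."""
    return float(np.linalg.norm(frame_feats - input_feat, axis=1).mean())
```

Lower values indicate better content consistency for both metrics, which is why the reference rows computed on real videos have the smallest values in the tables.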

Identity features visualization: Here, we visualize the identity features of the generated videos. To this end, we used the OpenFace [3] model to generate identity features for each frame of the videos. The average Euclidean distance between these features is then used with multidimensional scaling to visualize, in a two dimensional space, the identity features of the generated videos of six identities chosen randomly. Figure 8 supports our previous results and shows that our model preserves the identity information of the input face.

Fig. 8: 2D visualization of the identity feature space of six subjects chosen randomly from the MUG (top) and Oulu-CASIA (bottom) datasets. The neutral face images in the plot show the identity of the subjects, while the colored dots indicate the generated expressive ones.

| Approach | ACD |
|---|---|
| Reference | 0.116 |
| VGAN [41] | 0.322 |
| TGAN [35] | 0.305 |
| MoCoGAN [38] | 0.201 |
| Ours | 0.187 |

TABLE I: Quantitative comparison with state-of-the-art based on the ACD metric

| Approach | ACD | ACD-I |
|---|---|---|
| Reference | 0.098 | 0.109 |
| MCNet [39] | 0.322 | 0.545 |
| Villegas et al. [40] | 0.130 | 0.683 |
| MoCoGAN [38] | 0.205 | 0.291 |
| FRGAN [50] | 0.107 | 0.184 |
| Ours | 0.115 | 0.147 |

TABLE II: Quantitative comparison with state-of-the-art based on the ACD and ACD-I metrics following the FRGAN [50] setting

6.4 Ablation Study

In order to evaluate the effect of each component of our model, we conducted an ablation study on the Oulu-CASIA dataset. In this study, we used four commonly used metrics: peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [44], ACD and ACD-I. While ACD and ACD-I measure content consistency over the generated videos, as discussed previously, PSNR and SSIM reflect the quality of the generated frames by measuring pixel-level similarity between the generated videos and their ground truth. The ablation study was performed using three models, each trained without one component of the full model. In the first model, we discard the identity loss during the training of TextureGAN, while, in the second model, we discard the reconstruction loss during the training of MotionGAN. The last model uses ground truth landmark sequences to guide TextureGAN instead of those generated by MotionGAN. Results are shown in Table III. From these results, we notice that using the identity loss during TextureGAN training improves the content consistency of the generated videos, as indicated by the lower ACD and ACD-I values of the full model (0.016 and 0.107) compared to the model trained without it (0.0193 and 0.143). This is expected, as the ACD and ACD-I metrics are based on the similarity of identity features. This result demonstrates that the identity loss helps to maintain the subject identity throughout the generated videos. The model trained without the reconstruction loss shows the worst PSNR and SSIM scores and higher ACD and ACD-I compared to our full model, which demonstrates the usefulness of this loss in training a good landmark sequence generator that leads to good video quality. Regarding the third model, which uses ground truth landmarks to guide TextureGAN, we notice that it achieves the best ACD-I and higher PSNR and SSIM than our full model, while the results remain close to those of the full model. This indicates that MotionGAN is capable of generating landmark sequences that are very similar to those of the ground truth.

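| Model | ACD | ACD-I | PSNR | SSIM |
|---|---|---|---|---|
| w/o identity loss | 0.0193 | 0.143 | 26.05 | 0.908 |
| w/o reconstruction loss | 0.017 | 0.127 | 24.180 | 0.886 |
| w/o MotionGAN | 0.016 | 0.10 | 25.98 | 0.90 |
| Full model | 0.016 | 0.107 | 24.443 | 0.891 |

TABLE III: Ablation study on the Oulu-CASIA dataset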
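For reference, the PSNR metric reported in the ablation study follows the standard definition, 10 log10(MAX² / MSE). The sketch below assumes 8-bit images (peak value 255) unless another `max_value` is passed; it is a generic implementation and not the evaluation code used for the table.

```python
import numpy as np

def psnr(generated, reference, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    diff = generated.astype(np.float64) - reference.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

For a video, the per-frame values would typically be averaged over all generated frames and their ground-truth counterparts.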

6.5 Data Augmentation for Facial Expressions Recognition

In order to quantitatively demonstrate the usefulness of the synthesized facial expression videos, we trained a facial expression classifier based on a one-layer Long Short-Term Memory (LSTM) network [13]. This LSTM model was used to classify the videos of the test set of the Oulu-CASIA dataset. The input features of the LSTM network are facial expression features learned by the CNN model used in [29]. This CNN was trained for facial expression recognition from static images on the training set of the Oulu-CASIA dataset. Using this CNN, we extracted a feature vector from the last fully connected layer for each video frame; we then used these feature vectors as inputs to the LSTM model to classify the whole video.

We first trained the LSTM on the training set, and we considered the accuracy achieved by this model on the test set as the baseline. Then, we re-trained the model from scratch with augmented training data. In successive experiments, we multiplied the number of original training samples by 4, 6, 10 and 15. The generated data used for augmentation correspond to the same identities as the original training data performing new motions generated by our MotionGAN. The results obtained in these experiments are reported in Table IV. We notice that multiplying the number of training samples by 4, 6 and 10 using our generated samples improves the accuracy from 87.5% to 90.62%, 91.66% and 92.7%, respectively, while the results saturate when the original data are multiplied by more than 10. These results show the usefulness of our approach in generating new samples comparable with real ones, which can be exploited for video data augmentation to train improved emotion recognition models.

| # Training samples | Accuracy (%) |
|---|---|
| Original data (baseline) | 87.5 |
| 4 × training data | 90.62 |
| 6 × training data | 91.66 |
| 10 × training data | 92.7 |
| 15 × training data | 92.7 |

TABLE IV: Expression recognition accuracy using a different number of synthesized videos

6.6 Facial Expression Transfer

Facial expression transfer aims to transfer facial expressions from a source subject to a target subject. The newly synthesized expressions of the target subject should preserve the target identity and exhibit emotions similar to those of the source subject. In addition to facial expression synthesis, our proposed model can also be used for dynamic facial expression transfer. In Figure 9, we show that our model is able to transfer a dynamic facial expression from a source identity to a target identity. This is achieved by encoding the motion of the source identity in an SRVF representation; then, using Eq. (3), we map this motion to the neutral landmarks configuration of the target identity. Finally, the resulting landmark sequences are used to guide TextureGAN to generate the final video of the target identity performing the dynamic expression of the source identity. In Figure 9, we used an identity from the Oulu-CASIA database performing four facial expressions: disgust, fear, sadness and surprise. Then, we transferred each of these expressions to two identities in the MUG dataset. Results show that our model is able to transfer the facial expressions to different identities with good image quality, while keeping the characteristics of the target identity. As in facial expression generation, the model can also synthesize the teeth region that was hidden in the input face.

Fig. 9: Facial expression transfer. Expressions from the left column taken from the Oulu-CASIA dataset are transferred to faces in the middle column taken from the MUG dataset. Each expression was transferred to two different identities and the results of the transfer are shown in the right column.

7 Conclusion

In this paper, we addressed the difficult task of dynamic facial expression generation; in particular, we showed how to generate the six prototypical facial expressions given a neutral face. To this end, we proposed a novel framework that processes facial expression dynamics and face appearance separately using two different GAN architectures. The temporal information of the video was first represented by the evolution of the facial landmarks, which was encoded as a point on a hypersphere manifold. In order to learn the distribution of these manifold-valued data, we proposed a conditional version of the manifold-valued Wasserstein GAN, which is used to generate, from noise, new facial expression motions corresponding to a given emotion state. The resulting motions can be applied to any neutral facial landmarks configuration to generate a sequence of facial landmarks performing the generated motion. Finally, the second conditional real-valued GAN transforms the generated facial landmark sequences to video frames by adding the texture information. The reported experiments on two public datasets demonstrate the effectiveness of our approach in facial expression editing, facial expression transfer and data augmentation.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  • [2] N. Aifanti, C. Papachristou, and A. Delopoulos. The mug facial expression database. In Int. Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), pages 1–4. IEEE, 2010.
  • [3] B. Amos, B. Ludwiczuk, M. Satyanarayanan, et al. Openface: A general-purpose face recognition library with mobile applications. CMU School of Computer Science, 6, 2016.
  • [4] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In Int. Conf. on Machine Learning (ICML), volume 70, pages 214–223, Aug 2017.
  • [5] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency. Openface 2.0: Facial behavior analysis toolkit. In IEEE Int. Conf. on Automatic Face & Gesture Recognition (FG), pages 59–66. IEEE, 2018.
  • [6] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang. FSRNet: End-to-end learning face super-resolution with facial priors. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [7] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 8789–8797, 2018.
  • [8] I. Cohen, N. Sebe, A. Garg, L. S. Chen, and T. S. Huang. Facial expression recognition from video sequences: temporal and static modeling. Computer Vision and Image Understanding, 91(1):160 – 187, 2003. Special Issue on Face Recognition.
  • [9] T. F. Cox and M. A. Cox. Multidimensional scaling. Chapman and hall/CRC, 2000.
  • [10] M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, and A. D. Bimbo. 3-d human action recognition by shape analysis of motion trajectories on riemannian manifold. IEEE Trans. Cybernetics, 45(7):1340–1352, 2015.
  • [11] H. Ding, K. Sricharan, and R. Chellappa. Exprgan: Facial expression editing with controllable expression intensity. CoRR, abs/1709.03842, 2017.
  • [12] H. Ding, K. Sricharan, and R. Chellappa. Exprgan: Facial expression editing with controllable expression intensity. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [13] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634, 2015.
  • [14] H. Drira, B. Ben Amor, A. Srivastava, M. Daoudi, and R. Slama. 3D face recognition under expressions, occlusions, and pose variations. IEEE Trans. Pattern Analysis and Machine Intelligence, 35(9):2270–2283, 2013.
  • [15] L. Fan, W. Huang, C. Gan, J. Huang, and B. Gong. Controllable image-to-video translation: A case study on facial expression generation. arXiv preprint arXiv:1808.02992, 2018.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
  • [17] Z. Huang, J. Wu, and L. V. Gool. Manifold-valued image generation with wasserstein adversarial networks. In AAAI Conf. on Artificial Intelligence, 2019.
  • [18] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
  • [19] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In IEEE Int. Conf. on Computer Vision (ICCV), pages 2983–2991, 2015.
  • [20] A. Kacem, M. Daoudi, B. B. Amor, S. Berretti, and J. C. Alvarez-Paiva. A novel geometric framework on gram matrix trajectories for human behavior understanding. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2018.
  • [21] H. Karcher. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30:509–541, 1977.
  • [22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [23] Y.-H. Lai and S.-H. Lai. Emotion-preserving representation learning via generative adversarial network for multi-view facial expression recognition. In IEEE Int. Conf. on Automatic Face & Gesture Recognition (FG), pages 263–270. IEEE, 2018.
  • [24] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 4681–4690, 2017.
  • [25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
  • [26] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In IEEE Conf. on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 94–101, 2010.
  • [27] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [28] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems (NIPS), pages 2863–2871, 2015.
  • [29] N. Otberdout, A. Kacem, M. Daoudi, L. Ballihi, and S. Berretti. Deep covariance descriptors for facial expression recognition. In British Machine Vision Conference (BMVC), Sept. 2018.
  • [30] M. Pantic and I. Patras. Dynamics of facial expression: recognition of facial actions and their temporal segments from face profile image sequences. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 36(2):433–449, April 2006.
  • [31] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conf. (BMVC), pages 41.1–41.12. BMVA Press, 2015.
  • [32] F. Qiao, N. Yao, Z. Jiao, Z. Li, H. Chen, and H. Wang. Geometry-contrastive generative adversarial network for facial expression synthesis. CoRR, abs/1802.01822, 2018.
  • [33] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [34] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Int. Conf. on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [35] M. Saito, E. Matsumoto, and S. Saito. Temporal generative adversarial nets with singular value clipping. In IEEE Int. Conf. on Computer Vision (ICCV), pages 2849–2858, 2017.
  • [36] L. Song, Z. Lu, R. He, Z. Sun, and T. Tan. Geometry guided adversarial facial expression synthesis. In ACM Conference on Multimedia, pages 627–635. ACM, 2018.
  • [37] A. Srivastava, E. Klassen, S. Joshi, and I. Jermyn. Shape analysis of elastic curves in Euclidean spaces. IEEE Trans. on Pattern Analysis and Machine Intelligence, 33(7):1415–1428, 2011.
  • [38] S. Tulyakov, M. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 1526–1535, 2018.
  • [39] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. Int. Conf. on Learning Representations (ICLR), 2017.
  • [40] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee. Learning to generate long-term future via hierarchical prediction. In Int. Conf. on Machine Learning (ICML), volume 70, pages 3560–3569. JMLR.org, 2017.
  • [41] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), pages 613–621. 2016.
  • [42] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional GANs. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 8798–8807, 2018.
  • [43] W. Wang, X. Alameda-Pineda, D. Xu, P. Fua, E. Ricci, and N. Sebe. Every smile is unique: Landmark-guided diverse smile generation. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 7083–7092, 2018.
  • [44] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE Trans. on Image Processing, 13(4):600–612, 2004.
  • [45] H. Yang, D. Huang, Y. Wang, and A. K. Jain. Learning face age progression: A pyramid architecture of GANs. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 31–39, 2018.
  • [46] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker. Towards large-pose face frontalization in the wild. In IEEE Int. Conf. on Computer Vision (ICCV), pages 3990–3999, 2017.
  • [47] F. Zhang, T. Zhang, Q. Mao, and C. Xu. Joint pose and expression modeling for facial expression recognition. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3359–3368, June 2018.
  • [48] Z. Zhang, Y. Song, and H. Qi. Age progression/regression by conditional adversarial autoencoder. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 5810–5818, 2017.
  • [49] G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. Pietikäinen. Facial expression recognition from near-infrared videos. Image and Vision Computing, 29(9):607–619, 2011.
  • [50] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. Metaxas. Learning to forecast and refine residual motion for image-to-video generation. In European Conf. on Computer Vision (ECCV), pages 387–403, 2018.
  • [51] Y. Zhou and T. L. Berg. Learning temporal transformations from time-lapse videos. In European Conf. on Computer Vision (ECCV), pages 262–277. Springer, 2016.
  • [52] Y. Zhou and B. E. Shi. Photorealistic facial expression synthesis by the conditional difference adversarial autoencoder. In Int. Conf. on Affective Computing and Intelligent Interaction (ACII), pages 370–376. IEEE, 2017.
  • [53] Z. Zhou, G. Zhao, and M. Pietikäinen. Towards a practical lipreading system. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 137–144, 2011.
  • [54] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE Int. Conf. on Computer Vision (ICCV), pages 2223–2232, 2017.