Controllable Image-to-Video Translation: A Case Study on Facial Expression Generation

08/09/2018 ∙ by Lijie Fan, et al.

Recent advances in deep learning have made it possible to generate photo-realistic images with neural networks and even to extrapolate video frames from an input video clip. In this paper, for the sake of both furthering this exploration and our own interest in a realistic application, we study image-to-video translation, focusing on videos of facial expressions. Compared with image-to-image translation, this problem challenges deep neural networks with an additional temporal dimension. Moreover, its single input image defeats most existing video generation methods, which rely on recurrent models. We propose a user-controllable approach that generates video clips of various lengths from a single face image. The lengths and types of the expressions are controlled by users. To this end, we design a novel neural network architecture that incorporates the user input into its skip connections and propose several improvements to the adversarial training method for the neural network. Experiments and user studies verify the effectiveness of our approach. In particular, we highlight that even for face images in the wild (downloaded from the Web and the authors' own photos), our model can generate high-quality facial expression videos, of which about 50% are labeled as real by Amazon Mechanical Turk workers.




1 Introduction

Upon observing the accomplishments of deep neural networks in a variety of subfields of AI, researchers have taken a keen interest in pushing their boundaries forward. Among the new domains in which they have recently achieved remarkable results, photo-realistic image generation [1, 2] and image-to-image translation [3, 4] are two well-known examples. Both were considered very difficult in general because the desired output is extremely high-dimensional, incurring the curse of dimensionality for conventional generative models. In this paper, for the sake of both furthering this exploration and our interest in a realistic application, we propose to study image-to-video translation, which challenges the deep models with yet another dimension, the temporal one. We focus on a special case study: how to generate video clips of rich facial expressions from a single profile photo of the neutral expression.

Image-to-video translation might seem like an ill-posed problem because the output has many more unknowns to fill in than the input provides. Although there have been some works on video generation [5, 6, 7, 8, 9, 10, 11], they usually take multiple video frames as input and then extrapolate the future from the recurrent pattern inferred from that input. This prevents them from tackling image-to-video translation, whose input supplies no temporal cue at all. Moreover, it is especially difficult to generate satisfying video clips of facial expressions, for two reasons. One is that humans are familiar with and sensitive to facial expressions: any artifacts, whether in the spatial dimensions or along the temporal dimension, could be noticed by users. The other is that the face identity is supposed to be preserved in the generated video clips. In other words, the neural network cannot simply memorize the faces seen in the training stage but must instead learn the “imagination” capabilities needed to handle new faces in the deployment stage.

Despite the difficulties discussed above, we believe it is feasible to tackle image-to-video translation at least in the particular domain of facial expression generation. First, different people express emotions in similar manners. For instance, people often open their mouths when they become excited or surprised. Second, the expressions are often “unimodal” for a fixed type of emotion. In other words, there exists a procedure of gradual change from the neutral mode to the peak state of an expression. For instance, a person's degree of happiness increases monotonically until it reaches the largest degree of expression. Third, the human face in a profile photo draws the majority of users’ attention, leaving the quality of the generated background less important. All these characteristics significantly reduce the variability of the video frames, making image-to-video translation plausible.

In this paper, we propose a user-controllable approach to the image-to-video translation. Given a single profile photo as input and a target expression (e.g., happiness), our model generates several video clips of various lengths. We allow users to conveniently control the length of a video clip by specifying an array of real numbers between 0 and 1. Each number indicates the expression degree (e.g., 0.6 out of 1) the corresponding frame is supposed to depict. Moreover, our approach can generate a video frame of a particular degree of laughing, for example, without the need of rendering the frames before it. In contrast, most existing video generation methods [5, 6, 7, 8, 9, 10, 11] cannot due to their recurrent generators. Two notable exceptions are [12] and [13]. However, their goals differ from ours; the former predicts the probabilistic future of the input while the latter takes as input both a video frame and sparse trajectories.

We design our deep neural network and the training losses in the following manner in order to achieve the aforementioned properties. The frame generator consists of three modules: a base encoder, a residual encoder, and a decoder taking as input from both encoders. We weigh the skip connections between the residual encoder and the decoder using the expression degrees supplied by users in the test stage. In the training stage, we infer the degrees by assigning 0 to the neutral expression frame, 1 to the frame of the peak expression, and then numbers between 0 and 1 to the frames in between in proportion to their distances to the neutral frame. We train our model following the practice of generative adversarial nets [1] with the following improvements. Noting the importance of the mouth region in expressing emotions, we use a separate discriminator to take care of it. Besides, we regularize the change between adjacent frames to ensure smoothness along the temporal dimension. Finally, we augment the main task of frame generation by predicting the face landmarks.

Extensive experiments and user studies verify that the video clips generated by our approach are of superior quality over those by the competing methods. We would like to highlight that, by even inputting the face images in the wild (downloaded from the Web and the authors’ own photos), our model can generate almost realistic facial expression videos, of which around 50% are labeled as real by Amazon Mechanical Turk workers.

2 Related Work

Image-to-Image Translation. Image-to-image translation has regained much attention due to the recent advances of deep generative models [1]. Earlier work usually formulated this task as per-pixel classification or regression [14], where the training loss conditioned on the input image is applied to each pixel, as in conditional random fields [15] and the nonparametric loss [16]. More recent approaches apply the conditional GAN as a structured loss to penalize the joint configuration of the output, such as the pix2pix framework by [17]. Subsequently, the translation between two unpaired domains has also been studied, for example by CycleGAN [4] and the unsupervised domain adaptation method in [18]. Compared with these, our image-to-video task is more challenging because the temporal dynamics have to be captured.

Video Generation. Predicting the future may benefit many applications, such as learning feature representations [19, 20, 21] and interactions [22]. Previous works on video generation can be roughly divided into two categories: unconditional video generation and video prediction. The first mainly generates short video clips from random vectors sampled from a prior distribution [23, 9]. VGAN [23] does this by separately generating the static background and the foreground. MoCoGAN [9] decomposes the motion and content into two subspaces, where the motion trajectory is learned by a Recurrent Neural Network (RNN). The second category, i.e., video prediction, aims at extrapolating or interpolating video frames from the observed frames [5, 6, 7, 8]. Early work focuses on small patches [24]. Owing to the development of deep learning, recent approaches in video prediction have shifted from predicting patches to full-frame prediction [5]. For example, [6] proposed an adversarial loss for video prediction and a multi-scale network architecture that results in high-quality prediction for a few time steps in natural video. Upon observing that the frame prediction quality of [6] degrades quickly, the HP method by [8] generates long-term future frames by first learning the evolution of the high-level structure (e.g., the pose) with an RNN and then constructing the current image frame conditioned on the predicted high-level structure and an image in the past. A more recent work by [13] attempts to control the video prediction by using user-defined sparse trajectories. Our image-to-video translation is in the same vein as video prediction, but we emphasize some of its unique characteristics. First, our task requires a single input image rather than multiple video frames, opening the door to more potential applications. Second, unlike [5, 6, 7, 8], where recurrent models are applied, our method can skip an arbitrary number of frames during inference and training.

Facial Attribute Manipulation. Several works [25, 26] address the manipulation of facial images. The study by [25] addresses face attribute manipulation by modifying a face image according to attributes. The approach by [26] performs attribute-guided face image generation on unpaired image data. Since both methods mainly target static face generation, they are not naturally applicable to our task of generating continuous videos of facial expressions.

3 Approach

We first formalize the image-to-video translation problem and then describe our approach in detail.

3.1 Problem formulation

Given an input image $I \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ are respectively the height and width of the image, our goal is to generate a sequence of video frames $\{V(a) := f(I, a);\ a \in [0, 1]\}$, where $f(I, a)$ denotes the model to be learned. Note that the variable $a$, called an action variable, takes continuous values between 0 and 1, implying that there could be an infinite number of frames in the generated video clip. In practice, we allow users to give an arbitrary number of values to $a$ and, for each of them, our model generates a frame. For simplicity, we use a separate model for each type of facial expression to describe our approach.

We demand the following properties from the model $f(I, a)$. It is supposed to reconstruct the input image when $a = 0$, i.e., $f(I, 0) = I$. Besides, the function has to be smooth with respect to the input $a$. In other words, the generated video frames $f(I, a)$ and $f(I, a + \Delta a)$ should be visually similar when $\Delta a$ is small. The larger $a$ is, the bigger the change of the generated frame from the original image $I$. In the case of facial expression generation, we let $a = 1$ be the peak state of the expression (e.g., the state in which the mouth opens the widest when a person laughs).

The way we formalize the frame generator $f(I, a)$ implies several advantages over the popular recurrent models for video generation. First, the generation process is controllable. One may control the total number of frames by supplying the proper number of values for the action variable $a$. One may also tune the position of the peak state of the expression in the video. For instance, an array of monotonically increasing values lets the subject of the input image express an emotion from mild to strong, while a unimodal array that rises to a peak and then decreases makes the subject express the emotion to the fullest and then cool down. Besides, the frames to be generated are independent of each other, which places fewer demands on the format of the training data; temporal smoothness is instead enforced by a regularization term. Finally, this model structure also benefits the optimization procedure because we do not need to backpropagate gradients through time, avoiding the potential pitfall of vanishing gradients.
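The two kinds of action arrays mentioned above can be sketched as follows. This is only an illustration: the function names `ramp` and `rise_and_fall` and the even spacing of the values are assumptions, not part of the described method, which accepts any user-supplied values in [0, 1].

```python
def ramp(num_frames):
    """Monotonically increasing degrees from 0 (neutral) to 1 (peak)."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    return [i / (num_frames - 1) for i in range(num_frames)]


def rise_and_fall(num_frames):
    """Unimodal degrees: rise to the peak, then return to neutral."""
    if num_frames < 3:
        raise ValueError("need at least three frames")
    up = ramp((num_frames + 1) // 2)
    down = up[::-1]
    # For an odd length the peak appears once; for an even length it is held twice.
    return up + (down[1:] if num_frames % 2 else down)


mild_to_strong = ramp(5)           # one value per frame to generate
peak_in_middle = rise_and_fall(5)  # expression builds up, then cools down
```

The length of the array fixes the number of frames, and the position of the largest value fixes where the peak expression appears in the clip.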

Figure 1: Illustration of our model. It consists of two encoders $E_0$ and $E_1$, one decoder $D$, and two discriminators $D_g$ and $D_l$.

3.2 Network design for the video frame generator

Figure 1 sketches the neural network modules we designed for the video frame generator $f(I, a)$. It is mainly composed of three modules: a base encoder, a residual encoder, and a decoder. In addition, there are two discriminators for the purpose of generative adversarial training. We employ Convolution-BatchNorm-ReLU layers in these modules.

Generator. Considering that $f(I, 0) = I$, a straightforward construction of $f$ is to linearly combine the input image with a residual term. However, adding the two in the pixel space would incur severe artifacts. Instead, we perform linear aggregation in the feature space. Denote by $E_0$ and $E_1$ the base encoder and the residual encoder, respectively, where the former extracts the feature hierarchy for self-reconstruction and the latter takes care of the change that is useful for constructing the future frames. Concretely, we have the following,

$$E(I, a) = E_0(I) + a \, E_1(I), \tag{1}$$

where the variate $a$ explicitly determines the intensity of the shift off the base encoder. Note that the summation in eq. 1 is layer by layer (cf. Figure 1). The resulting feature hierarchy is then fed to the decoder $D$ for video frame generation, i.e.,

$$f(I, a) = D\big(E_0(I) + a \, E_1(I)\big),$$

where the decoder mirrors the base encoder’s architecture and takes as input the feature hierarchy in the reverse order (cf. Figure 1).
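To make the layer-by-layer aggregation concrete, here is a minimal sketch in plain Python. The feature hierarchies are toy nested lists standing in for the outputs of the base and residual encoders; the function name `aggregate` and the toy values are hypothetical, and a real implementation would operate on convolutional feature maps.

```python
def aggregate(base_feats, residual_feats, a):
    """Combine two feature hierarchies layer by layer: base + a * residual."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("the action variable must lie in [0, 1]")
    return [
        [b + a * r for b, r in zip(base_layer, residual_layer)]
        for base_layer, residual_layer in zip(base_feats, residual_feats)
    ]


# Toy two-layer hierarchies (assumed values, for illustration only).
base = [[1.0, 2.0], [3.0]]
residual = [[0.5, -1.0], [2.0]]

neutral = aggregate(base, residual, 0.0)  # equals the base features
halfway = aggregate(base, residual, 0.5)
peak = aggregate(base, residual, 1.0)
```

With the action variable at 0 the residual branch contributes nothing, so the decoder sees only the base features, which is what makes self-reconstruction of the input image possible.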

Discriminators. We use two discriminators for the purpose of adversarial training: a global discriminator $D_g$ and a local one $D_l$. The global discriminator contrasts the generated video frame with the groundtruth frame. This is a standard and effective practice in video generation [23, 9] and image-to-image translation [17]. In addition, we employ a local discriminator to take special account of certain local parts of interest. Taking the smile expression as an example, the mouth region is the most active part and deserves more detailed synthesis than the others. We first compute a mask as the convex hull of the detected facial landmarks around the subject’s mouth and then use the mask to extract the mouth regions from both the groundtruth and the generated frames. The local discriminator is then applied to the resulting pairs.
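The mask construction can be sketched with a standard convex hull routine (Andrew's monotone chain, chosen here for brevity; the source does not say which algorithm is used). The landmark coordinates below are toy values, and the function names are hypothetical. The sketch assumes at least three non-collinear landmarks.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside(hull, p):
    """True if p lies on or inside the counter-clockwise convex polygon."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def region_mask(landmarks, height, width):
    """Binary mask (rows of 0/1) covering the convex hull of the landmarks."""
    hull = convex_hull(landmarks)
    return [[1 if inside(hull, (x, y)) else 0 for x in range(width)]
            for y in range(height)]


# Toy "mouth" landmarks as (x, y) pixel coordinates on a 5x5 image.
mask = region_mask([(1, 1), (3, 1), (3, 3), (1, 3), (2, 2)], 5, 5)
```

Multiplying a frame element-wise by such a mask zeroes out everything except the region handed to the local discriminator.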

3.3 Training loss

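We prepare training data in the following manner. Given a video clip of length $T$, assume it has been labeled such that the 1st frame is in the neutral expression state and the $T$-th is at the peak of the expression. We assign the coefficient $(t-1)/(T-1)$ to the $t$-th frame of this clip. Denote by $V(a)$ one of these groundtruth frames, where $a$ is its coefficient. We train our neural network using the adversarial loss

$$\mathcal{L}_g = \mathbb{E}\big[\log D_g(V(a))\big] + \mathbb{E}\big[\log\big(1 - D_g(f(I, a))\big)\big]$$

for the global discriminator $D_g$ and

$$\mathcal{L}_l = \mathbb{E}\big[\log D_l(M \circ V(a))\big] + \mathbb{E}\big[\log\big(1 - D_l(M \circ f(I, a))\big)\big]$$

for the local discriminator $D_l$, where $M$ is the mask to crop out the local patch of interest and $\circ$ denotes element-wise multiplication. In line with the previous work [17], we find that it is beneficial to augment the adversarial loss with the reconstruction error: $\mathcal{L}_r = \| f(I, a) - V(a) \|_1$.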
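As a small worked example of the data preparation, the sketch below assigns a coefficient to every frame of a labeled clip and evaluates a reconstruction error, optionally restricted by a binary mask. Images are flattened to lists of floats for brevity. The function names are hypothetical, and the use of the L1 distance for the reconstruction error is an assumption borrowed from [17].

```python
def frame_coefficients(clip_length):
    """Coefficient (t - 1) / (T - 1) for frames t = 1..T of a labeled clip."""
    if clip_length < 2:
        raise ValueError("a clip needs a neutral frame and a peak frame")
    return [(t - 1) / (clip_length - 1) for t in range(1, clip_length + 1)]


def l1_error(generated, target, mask=None):
    """Sum of absolute differences, optionally weighted by a 0/1 mask."""
    total = 0.0
    for i, (g, v) in enumerate(zip(generated, target)):
        weight = 1.0 if mask is None else mask[i]
        total += weight * abs(g - v)
    return total


coefficients = frame_coefficients(5)  # neutral frame gets 0, peak frame gets 1
full_error = l1_error([1.0, 2.0, 3.0], [1.0, 0.0, 5.0])
masked_error = l1_error([1.0, 2.0, 3.0], [1.0, 0.0, 5.0], mask=[0, 1, 0])
```

The same masking idea applies to the local adversarial loss: only the pixels selected by the mask reach the local discriminator.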

Temporal continuity. The generative adversarial training of the neural network may result in mode collapse (different modes collapse to a mixed mode that does not exist in the real data) and mode dropping (the generator fails to capture some of the modes). Whereas the reconstruction loss alleviates these issues to some extent, it is defined at a particular time step and does not track the temporal continuity in the video. We propose to regularize the difference between nearby video frames generated by the network. The regularization both helps prevent the mode dropping issue and makes the generated video clips smooth over time. It is defined as below,

$$\mathcal{R}_t = \big\| f(I, a - \Delta a) - f(I, a) \big\| + \big\| f(I, a + \Delta a) - f(I, a) \big\|,$$

where $\Delta a$ is a small increment. The training is still frame-wise and efficient as the frames $f(I, a - \Delta a)$, $f(I, a)$, and $f(I, a + \Delta a)$ are computed independently from each other.
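A minimal sketch of such a regularizer is given below. It assumes the penalty sums the L1 differences between the frame at the current action value and its two neighbors; the exact norm, the clamping of the neighbors to [0, 1], and the toy generator are all assumptions made for illustration.

```python
def temporal_penalty(generate, image, a, delta):
    """Penalize differences between a frame and its two temporal neighbors."""
    center = generate(image, a)
    before = generate(image, max(0.0, a - delta))  # clamp to the valid range
    after = generate(image, min(1.0, a + delta))

    def l1(u, v):
        return sum(abs(x - y) for x, y in zip(u, v))

    return l1(before, center) + l1(after, center)


def toy_generator(image, a):
    """Hypothetical stand-in for the frame generator: brightens pixels with a."""
    return [pixel + a for pixel in image]


penalty = temporal_penalty(toy_generator, [0.0, 0.0], a=0.5, delta=0.1)
```

Each of the three frames is produced by an independent forward pass, so no recurrence or backpropagation through time is involved.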

Facial landmark prediction. As discussed earlier, we use facial landmarks to extract the local regions of interest for the local discriminator. In our experiments, we use the Dlib Library [28] to detect 68 landmarks from any groundtruth video frame. These landmarks are supposed to be at the same locations in the correspondingly synthesized video frame. Therefore, we stack another 68-dimensional channel on top of the second-to-last layer of the global decoder to predict the landmarks, forcing the generator to provide details of the face. This loss is denoted by $\mathcal{L}_{lm}(\hat{\ell}, \ell)$, where $\hat{\ell}$ and $\ell$ are the predicted and groundtruth landmarks, respectively.

Putting the above together, we train our neural networks by alternating between optimizing the generator and the discriminators in order to solve the following problem,

$$\min_{E_0, E_1, D} \; \max_{D_g, D_l} \; \mathcal{L}_g + \mathcal{L}_l + \mathcal{L}_r + \mathcal{L}_{lm} + \mathcal{R}_t.$$

In the experiments, we use slightly different weights in front of the loss and regularization terms.

3.4 Jointly learning the models of different types of facial expressions

Thus far, we have assumed a separate model for each type of facial expression. It is straightforward to extend it to handle $n$ types of emotions jointly:

$$f(I, \mathbf{a}) = D\Big(E_0(I) + \sum_{k=1}^{n} a_k \, E_k(I)\Big),$$

where $\mathbf{a} = (a_1, \ldots, a_n)$ is an $n$-dimensional vector with each dimension standing for one emotion type. Since each training video clip contains one type of emotion, only one entry of the vector $\mathbf{a}$ is non-zero in the training stage. At the test stage, however, we examine the effect of mixing some emotions by allowing non-zero values in multiple entries of the vector $\mathbf{a}$. Note that different types of emotions share the same base encoder $E_0$ (as well as the decoder $D$ and the discriminators $D_g$ and $D_l$) and differ only by the residual encoders $E_1, \ldots, E_n$.
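The joint aggregation can be sketched in the same toy setting as a weighted sum over one residual hierarchy per emotion. The function name `aggregate_joint`, the emotion ordering, and the feature values are hypothetical.

```python
def aggregate_joint(base_feats, residual_feats_per_emotion, action):
    """Layer-wise base features plus one weighted residual per emotion."""
    if len(residual_feats_per_emotion) != len(action):
        raise ValueError("need exactly one action entry per emotion")
    combined = [list(layer) for layer in base_feats]  # copy the base hierarchy
    for weight, residual_feats in zip(action, residual_feats_per_emotion):
        for out_layer, res_layer in zip(combined, residual_feats):
            for i, value in enumerate(res_layer):
                out_layer[i] += weight * value
    return combined


base = [[1.0, 1.0]]
residuals = [
    [[1.0, 0.0]],  # assumed residual features for emotion 1 (e.g., "happy")
    [[0.0, 2.0]],  # assumed residual features for emotion 2 (e.g., "angry")
]

training_style = aggregate_joint(base, residuals, [1.0, 0.0])  # one-hot
mixed = aggregate_joint(base, residuals, [0.5, 0.5])           # test-time mix
```

A one-hot action vector reproduces the single-emotion case, while several non-zero entries blend the residuals of different emotions.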

4 Experiments

Given a neutral face image and a target expression (e.g., smile), we generate a video clip to simulate how the face will change towards the target expression. We not only use the public CK+ [29] dataset for model training but also significantly extend it in scale. The new, larger-scale dataset is named CK++. To better evaluate the performance of our method, we further collect a set of raw face images from the Web. We then generate facial expression videos from these collected photos and submit them to AMT for rating.

CK+. The Extended Cohn-Kanade (CK+) dataset [29] is a widely used dataset for facial emotion analysis. It contains 593 videos of 8 different emotion categories (including the neutral category) and 123 subjects. Each video frame is provided with a 68-point facial landmark label. We use three major categories (i.e., “happy”, “angry”, and “surprised”) in this paper.

CK++. Most images in CK+ are gray-scale. We augment CK+ by additionally collecting facial expression videos in RGB. The videos are recorded by a fixed camera from 65 volunteers, 32 males and 33 females. Each volunteer is asked to perform each of the “happy”, “angry”, and “surprised” expressions at least twice. We manually remove the redundant frames before the initial neutral state and after the stationary peak state. We also remove the videos that contain severe head movement or blurry faces. There are 214, 167, and 177 video clips for the “happy”, “angry”, and “surprised” expressions, respectively. On average, each clip has 21 frames. Finally, we use the Dlib Library [28] to detect 68 landmarks in each of the frames.

4.1 Implementation Details

For our encoders, we employ architectures with eight downsampling layers and the Leaky-ReLU activation function. The decoder mirrors the encoder’s architecture with eight upsampling layers but uses the ReLU activation function. Inspired by the U-Net [30], we further add skip connections between intermediate layers of the encoders and the decoder (cf. Figure 1). Both the global and local discriminators are constructed by concatenating 3 convolution layers.

We use 10 video clips from the CK++ dataset for validation and all the others for training. Our network is trained from scratch with all parameters initialized from a normal distribution. For each training batch, we randomly sample a video clip and then use its first frame as the input image to train the network. All images are resized to 289x289 and randomly cropped to 256x256 before being fed into the network. The Adam optimizer is used in the experiments, with an initial learning rate of 0.0002. The whole training process takes 2100 epochs, where one epoch means a complete pass over the training data. As discussed in § 3.2, training the local discriminator requires a mask to crop the local regions of interest. Since the mouth is the most expressive region, we crop it out using the convex hull of the landmarks around the mouth. We set the small increment $\Delta a$ of the temporal regularization to a fixed value.
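The random-crop step of the preprocessing can be sketched as follows. The resizing to 289x289 is assumed to have been done already by an image library, and the helper names `random_crop_box` and `crop` are hypothetical.

```python
import random


def random_crop_box(full=289, crop_size=256, rng=random):
    """Pick a random crop window as (top, left, bottom, right)."""
    top = rng.randint(0, full - crop_size)   # randint is inclusive on both ends
    left = rng.randint(0, full - crop_size)
    return top, left, top + crop_size, left + crop_size


def crop(image, box):
    """Crop an image stored as a list of rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


resized = [[0] * 289 for _ in range(289)]          # placeholder 289x289 image
box = random_crop_box(rng=random.Random(0))        # seeded for repeatability
patch = crop(resized, box)
```

In a training pipeline the same crop window would typically be applied to the input frame, the groundtruth frame, and the landmark coordinates so that they stay aligned; that pairing is an assumption here, not something the text states.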

As our task of controllable image-to-video translation is new, there is no directly comparable method in the literature. Conservatively, we adapt two previous methods to our experiments: Hierarchical Prediction (HP) [8] and Convolutional LSTM (ConvLSTM) [31]. In particular, we make the following changes to HP and ConvLSTM to fit them to our problem: 1) Since both HP and ConvLSTM use an LSTM to recursively generate video frames, we have to fix the length of the video sequence to be generated. We do so by uniformly sampling 10 frames per video clip. 2) We train a separate model for each target expression. 3) We replace the ConvNet in ConvLSTM and the Visual-Structure module in HP with U-Net because their default architectures do not work well.

Figure 2: Visualization for the “happy” expression by different methods.
Figure 3: The L2 norm between the landmarks in each frame and the initial frame over time. The left and right sub-figures are on the training and validation samples, respectively.

4.2 Evaluations

Visualization. For a fair comparison, we let our method output the same number of frames as HP and ConvLSTM do (e.g., 10) by supplying the corresponding number of values for the action variable $a$. Figure 2 displays the generated video frames of the “happy” emotion for two persons, one seen at training and the other unseen. We can see that both our method and the LSTM-based baseline models perform well on the training image. However, for the person from the validation set, our model clearly outperforms the baselines in terms of both image quality and temporal continuity.

Analysis on temporal continuity. To evaluate temporal continuity quantitatively, we extract facial landmarks from each generated video frame (i.e., a 68x2-dimensional vector) and then compute the L2 distance between the landmarks of each frame and those of the initial one. Figure 3 plots the distances versus the time steps for the same training and validation samples as in Figure 2. We find that the curve of our approach aligns well with that of the groundtruth video frames. The face in the image sequence generated by ConvLSTM does not seem to move much, as the facial keypoints remain almost in the same place throughout the generation process. For the HP baseline, there is a sudden change between the generated frames and the initial frame, which means the expressions generated by HP lack temporal continuity. In fact, in our validation data, we see many cases where the person in the video generated by HP seems to smile a little, go back, and then smile again, a phenomenon we want to avoid in our task. For our proposed model, in contrast, the L2 distance between the keypoints in each frame and those in the initial frame grows steadily without decreasing, showing that the images generated by our method have good temporal continuity.
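The continuity measure described above can be sketched in a few lines. Landmarks are given as lists of (x, y) pairs per frame; the toy clips use two landmarks instead of 68, and the function names are hypothetical.

```python
import math


def landmark_displacement(frames):
    """L2 distance between each frame's landmarks and those of the first frame."""
    reference = frames[0]
    return [
        math.sqrt(sum((x - x0) ** 2 + (y - y0) ** 2
                      for (x, y), (x0, y0) in zip(frame, reference)))
        for frame in frames
    ]


def is_non_decreasing(values, tolerance=1e-9):
    """True if the curve never drops, i.e. the expression never 'goes back'."""
    return all(b >= a - tolerance for a, b in zip(values, values[1:]))


steady_clip = [[(0, 0), (1, 0)], [(0, 3), (1, 4)], [(0, 6), (1, 8)]]
curve = landmark_displacement(steady_clip)

# A clip whose landmarks move away and then partly return to the start.
wavering_clip = [[(0, 0), (1, 0)], [(0, 3), (1, 4)], [(0, 0), (1, 2)]]
wavering_curve = landmark_displacement(wavering_clip)
```

A curve that rises and then falls back corresponds to the "smile a little, go back, and then smile again" behavior noted for the HP baseline, whereas a non-decreasing curve indicates a steadily developing expression.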

AMT results. Following [8], we also conduct user studies to compare the results of different methods. For this purpose, we build the test data by downloading face images from the Web without any post-processing. We then generate facial expression videos of “happy”, “surprise”, and “angry” with our approach and the two baselines. We pair the video clip generated by our method with the clip generated by one of the baselines for the same input image and the same emotion, and then ask an AMT worker to choose which one is more realistic in terms of temporal continuity, image quality, naturalness of the expression, etc. The first two rows of Table 1 show that users prefer our results to those of either baseline by a large margin. We also ask users to choose the most realistic clip out of three, generated respectively by our model and the two baselines. As shown in the last row of Table 1, our results are again selected significantly more often than the other two.

In addition, we perform a more challenging test by mixing the simulated videos by our method with real videos, and then asking an AMT worker to judge if the displayed video is real or not. As reported in Table 2, the results are encouraging, as nearly 50% of our generated videos from the test faces are labeled as real by AMT workers.

Figure 4: Ablation studies on our method.
Figure 5: Evaluation on multiple-label models. (a) Videos generated by single-label models; (b) Videos generated by multi-label models; (c) Transferring the “angry” expression to the “happy” one by controlling the action variate in the multi-label model.

"Which video looks more realistic?" | Happy | Surprise | Angry | Mean
--- | --- | --- | --- | ---
Prefers ours over ConvLSTM | 83.8% | 82.2% | 83.0% | 83.0%
Prefers ours over HP | 77.3% | 69.7% | 67.1% | 71.4%
Prefers ours over both baselines | 69.9% | 63.7% | 61.8% | 65.1%

Table 1: Comparison of AMT results between our method and the two baselines.

"Is this video real?" | Happy | Surprise | Angry | Mean
--- | --- | --- | --- | ---
Training Videos | 64.5% | 59.7% | 57.3% | 60.5%
Testing Videos | 49.4% | 52.2% | 48.3% | 49.9%

Table 2: AMT results on how many videos generated by our method can fool the workers.

4.3 Other analysis

Ablation Studies. We run ablation studies to examine the key components of our approach, as illustrated in Figure 4. We implement three variants of our method: without the local discriminator, without landmark prediction, and without the temporal continuity regularization. (I) Viewers tend to focus on the mouth when they first see a video, so a local discriminator on the mouth makes the video look more realistic. Without the local discriminator, the result in Figure 4 (f) shows blurring artifacts compared with the full model. (II) Landmark prediction provides a higher-level regularization that helps the model place facial features at the right locations and avoids generating duplicated features in the same image, which reduces blurring artifacts and makes the generated images clearer and more plausible. (III) As the example in Figure 4 (e) shows, the temporal regularization not only forces the movement to be continuous and free of sudden changes, but also makes the generated images clearer.

Controlling Action Variable. One of the most interesting parts of our approach is that we can control the lengths of the videos through the action variable $a$. We provide demos on controlling the action variable in the supplementary materials.

Jointly modeling multiple types of expressions. As described in § 3.4, our model can learn different types of emotions simultaneously. As a result, we may mix different emotions by providing more than one non-zero entry in the vector $\mathbf{a}$. We first show that modeling different types of expressions simultaneously in one neural network gives better results than learning a separate model for each. To show this, we build a training set from the faces with the “happy” and “angry” emotions. Each training person has one of the two emotions but not both. For example, the two persons in Figure 5 have only the “angry” expression. Figure 5 (a) and (b) demonstrate that the “happy” videos generated by jointly modeling the two emotions are more realistic than those generated by the models of individual emotions. We conjecture that this is due to the strong correlation between emotions, which enables information sharing between the residual encoders.

Another interesting application for the joint model is that it can easily perform transfer between two different emotions by using proper values of the action variable, as illustrated in Figure 5 (c). More results are included in the supplementary materials.

5 Conclusion

In this paper, we study image-to-video translation with a special focus on facial expression videos. We propose a user-controllable approach that generates video clips of various lengths and different target expressions from a single face image. Both the lengths and types of the expressions can be controlled by users. To this end, we design a novel neural network architecture that incorporates the user input and also propose several improvements to the adversarial training method for the neural networks. Experiments and user studies verify the effectiveness of our approach. It would be interesting to investigate image-to-video translation in domains other than facial expressions in future work. In addition, we will explore the potential of progressive training [2] for generating higher-definition video clips from a single input image.