VGAN-Based Image Representation Learning for Privacy-Preserving Facial Expression Recognition, CVPRW, 2018
Reliable facial expression recognition plays a critical role in human-machine interactions. However, most of the facial expression analysis methodologies proposed to date pay little or no attention to the protection of a user's privacy. In this paper, we propose a Privacy-Preserving Representation-Learning Variational Generative Adversarial Network (PPRL-VGAN) to learn an image representation that is explicitly disentangled from the identity information. At the same time, this representation is discriminative from the standpoint of facial expression recognition and generative as it allows expression-equivalent face image synthesis. We evaluate the proposed model on two public datasets under various threat scenarios. Quantitative and qualitative results demonstrate that our approach strikes a balance between the preservation of privacy and data utility. We further demonstrate that our model can be effectively applied to other tasks such as expression morphing and image completion.
The recent proliferation of sensors in living spaces is propelling the development of “smart” rooms that can sense and interact with occupants to deliver a number of benefits such as improvements in energy efficiency, health outcomes, and productivity. Automatic facial expression recognition is an important component of human-machine interaction. To date, a wide variety of methods have been proposed to accomplish this; however, they typically rely on high-resolution images and ignore the visual privacy of users. Growing privacy concerns will prove to be a major deterrent to the widespread adoption of camera-equipped smart rooms and the attainment of their concomitant benefits. Therefore, reliable and accurate privacy-preserving methodologies for facial expression recognition are needed.
One approach to increase visual privacy is to reduce identity traits within a face image via modification or redaction methods such as pixelization or blurring. However, this will also reduce the visual quality of the modified image and an algorithm’s ability to accurately recognize the facial expression from it. Another, extreme, approach is to withhold the face image altogether and release only an estimate of the facial expression. While this approach guarantees visual privacy, it provides no visual utility. In order to strike a balance between privacy and data utility, we propose a third, radically different approach: seamlessly replace the user identity in an image without significantly degrading its visual quality or the ability to accurately infer facial expression. We leverage variational generative-adversarial networks (VGANs) to learn an identity-invariant representation of an image while enabling the synthesis of a utility-equivalent, realistic version of this image with a different identity (Fig. 1). We call this framework Privacy-Preserving Representation-Learning Variational Generative Adversarial Network (PPRL-VGAN). Beyond its application to privacy-preserving visual analytics, our approach could also be used to generate realistic avatars for animation and gaming.
Our proposed framework combines the generative power of two models: the Variational Auto-Encoder (VAE) and the Generative Adversarial Network (GAN). A VAE consists of two networks: the encoder, which maps a data sample to a latent representation, and the decoder, which maps this representation back to data space. VAE networks are trained by minimizing a cost function that encourages learning a latent representation which leads to realistic data synthesis while ensuring sufficient diversity in the synthesized data. Like a VAE, a GAN also consists of two networks: a generator network ($G$), which aims to synthesize realistic data from a random noise input vector, and a discriminator network ($D$), which aims to differentiate between real and synthetic data. GANs are trained via a game between $G$ and $D$ in which $G$ aims to fool $D$ into believing that the data samples it synthesizes are real, and $D$ aims to accurately distinguish between real and “fake” samples. In this work, we combine VAEs with GANs by replacing the generator in a conventional GAN, which uses random noise as input, with a VAE encoder-decoder pair, which takes a real image as input and outputs a synthesized image. As shown in Fig. 2, the encoder learns a mapping from a face image $I$ to a latent representation $f(I)$. The representation is subsequently fed into the decoder to synthesize a face image with some target identity (specified by identity code $c$) but with the same facial expression as the input image. The discriminator $D$ includes multiple classifiers that are trained to (i) distinguish real face images from synthesized ones, (ii) recognize the identity of the person in a face image and (iii) recognize the expression in a face image. During training, feedback signals from $D$ guide $G$ to create realistic expression-preserving face images. In addition, as the identity of the synthesized images is determined by the identity code $c$, the network will learn to disentangle the identity-related information from the latent representation.
This paper makes the following contributions:
We propose a framework for learning an identity-invariant representation for a face image. This representation is discriminative for facial expression recognition and generative for expression-preserving, identity-altered face image synthesis.
We thoroughly evaluate our approach under three threat scenarios to demonstrate that our method strikes a balance between privacy and data utility.
We demonstrate that our model can synthesize new face images with or without an input image, and illustrate how our model can also be applied to other image processing tasks such as expression morphing and image completion.
Privacy-Preserving Visual Analytics: There is a growing body of research on methods to perform various visual analysis tasks from data in a manner that does not disclose the subject’s identity. According to how privacy is protected, the literature can be broadly classified into reversible and irreversible approaches.
Reversible methods include scrambling and encryption [12, 32, 35, 36] that permit exact data recovery, but are also prone to exposing the original data to possible hacks. In particular, methods for recognizing facial expression directly in the encrypted domain have been proposed [3, 28]. However, these methods rely upon public-key homomorphic cryptosystems, such as Paillier, which are known to be computationally heavy due to their use of large encryption and decryption keys. In order to relieve the computational burden, lightweight algorithms based on randomization techniques have been proposed in [29]. Although the methods proposed in [3, 28, 29] perform well for facial expression recognition in the encrypted domain, no tests have been conducted to ascertain whether the identity information is indeed removed in the encrypted domain. It is unclear whether a classifier that is trained on encrypted-domain images will fail to recognize the identity of a person from the encrypted image.
Irreversible methods include image processing and filtering techniques [7, 8, 11, 15, 19, 26, 31]. However, it has been shown that simple filtering methods do not fool identity-recognition algorithms if they are trained using images that have the same distortion as the test images . A face de-identification method was proposed in  wherein several face images with appearance attributes similar to the target image are fused by minimizing a cost function promoting attribute preservation and de-identification. A recent line of irreversible methods makes use of adversarial networks [6, 27, 30]. In , the focus is on full-body de-identification without an additional utility criterion such as accuracy of facial expression. Their methodology also relies upon a segmentation algorithm to accurately extract the silhouette of the person to be de-identified. Moreover, the synthesized images are blurry. While  uses adversarial networks to jointly optimize privacy and utility objectives, it focuses on the relatively simple task of detecting and removing a QR code embedded in an image. Moreover, the synthesized images are poor-quality renderings of the input image. The approach in  is similar in spirit to  but the output is not required to look realistic. Our approach differs from these methods in that we use a VAE within a GAN in order to explicitly learn an identity-invariant facial expression representation with the explicit goal of expression-preserving identity replacement in the synthesized output image which is required to look realistic. As we show, our learned representation is not only discriminative for expression recognition, but also robust to both human and algorithm-based privacy attacks. Our framework can also be used for other tasks such as expression morphing.
Disentangled Representation Learning: A number of models have been proposed in the literature to learn a so-called “disentangled representation”. In early work, a bilinear model was proposed to separate content and style for face and text images. An autoencoder (AE) augmented with simple regularization terms during training was proposed and demonstrated to discover and explicitly learn various latent factors of variation. Methods proposed in [17, 21] use VAEs in a semi-supervised manner. Their models disentangle label information from the latent representation by providing additional labels as input to the decoder. However, methods based on AE/VAE tend to produce blurry images due to the pixel-wise reconstruction error used in the loss function. Our model may be viewed as replacing the image reconstruction error with an adversarial loss to improve the visual quality of synthesized images. Recently, a two-stage pipeline was proposed to learn disentangled image representations of background, foreground, and pose to generate novel person images. However, this method requires a pre-processing step to estimate a coarse pose mask of the input image.
Among works on disentangled representation learning, perhaps the closest to ours are those in [22, 34]. The approach proposed in [22] addresses the problem of disentanglement by combining a deep convolutional VAE with a form of adversarial training. It can disentangle the latent factors of variation within a labeled dataset, and separate them into complementary codes. However, it has not been tested on a real-world dataset. Our approach is different from that in [22] as we completely discard the VAE’s reconstruction error in the objective function. Instead, we employ the adversarial loss from a GAN for high-quality image synthesis and improved representation learning. In [34], a disentangled representation-learning GAN was proposed for pose-invariant face recognition. The proposed model is a fusion of an AE and a GAN. It explicitly disentangles the identity representation from pose variation by passing a pose code to the decoder during training. The major difference between this model and ours is that in PPRL-VGAN we use a VAE instead of an AE, which permits learning a probability distribution over the latent space. This enables our model to synthesize new images without an input image; all we need to do is generate a latent vector from the prior distribution and pass it to the decoder along with an identity code.
A VAE network consists of two neural networks: an encoder network ($\mathrm{Enc}$) and a decoder network ($\mathrm{Dec}$). The encoder is a randomized mapping of a data sample $x$ to a latent representation $z$ while the decoder is a randomized mapping from a latent representation back to data space:

$$z \sim \mathrm{Enc}(x) = q(z|x), \tag{1}$$

$$x \sim \mathrm{Dec}(z) = p(x|z). \tag{2}$$

In practice, these randomized mappings are implemented via deterministic maps (given by the neural networks) with additional inputs which provide the source of randomness. For example, it is common to set $z = \mu(x) + R(x)\,\epsilon$ where the vector $\mu(x)$ and the square matrix $R(x)$ are the outputs of a neural network with input $x$, and $\epsilon \sim \mathcal{N}(0, I)$, a standard multivariate Gaussian, is the source of randomness. Then, $q(z|x) = \mathcal{N}(\mu(x), R(x)R(x)^\top)$. VAE networks are trained by minimizing a cost function which is additive over all training data samples. The cost function for a single data sample $x$ is given by

$$\mathcal{L}_{\mathrm{VAE}}(x) = -\mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] + \mathrm{KL}\big(q(z|x) \,\|\, p(z)\big), \tag{3}$$

where $\mathrm{KL}$ is the Kullback-Leibler divergence and $p(z)$, the marginal distribution of the latent representation, is typically taken to be $\mathcal{N}(0, I)$. The first term encourages the decoder to assign higher probability to the observed data samples $x$. In practice, the expectation in the first term is replaced by an empirical average across a small batch of independent and identically distributed $z \sim q(z|x)$ for a given $x$. The $\mathrm{KL}$ term encourages the encoder distribution $q(z|x)$ to be close to a target $p(z)$ which has sufficient spread (diversity) in the latent space. The $\mathrm{KL}$ term has a closed analytic form since both its arguments are Gaussian. The total cost across all data samples is typically minimized via mini-batch gradient descent.
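The reparameterized sampling and the closed-form divergence term described above can be illustrated with a minimal sketch. It assumes a diagonal covariance (a per-dimension standard deviation `sigma` rather than a general square matrix) and a standard normal prior; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    Assumption: diagonal covariance, so sigma holds one standard
    deviation per latent dimension.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(
        m * m + s * s - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma)
    )


rng = random.Random(0)   # fixed seed so the draw is repeatable
mu = [0.5, -1.0]         # illustrative encoder mean
sigma = [1.0, 0.5]       # illustrative encoder standard deviations
z = reparameterize(mu, sigma, rng)
kl = kl_to_standard_normal(mu, sigma)
```

The divergence is zero exactly when the encoder distribution equals the prior (zero mean, unit variance) and grows as the mean drifts or the variance shrinks, which is how this term keeps the latent codes spread out.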
A standard GAN consists of a generator neural network $G$ and a discriminator neural network $D$ that are trained by making them compete in a two-player min-max game. The discriminator network $D$ adjusts its weights so as to reliably distinguish real data samples $x$ from fake data samples $G(z)$ generated by passing $z$, randomly sampled from some distribution $p(z)$, through the generator network $G$. The generator network $G$ adjusts its weights to fool $D$. The discriminator $D$ assigns probability $D(x)$ to the event that $x$ is a “real” training data sample and the probability $1 - D(x)$ to the event that $x$ is a “fake” sample synthesized by the generator. The two networks are trained iteratively using a loss function given by

$$\mathcal{L}_{\mathrm{GAN}} = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big], \tag{4}$$

with $G$ aiming to minimize and $D$ aiming to maximize it. In practice, the expectations are replaced by empirical averages over a mini-batch of samples and the loss function is alternately minimized and maximized from one mini-batch to the next as in mini-batch gradient descent.
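As a small illustration of the mini-batch approximation mentioned above, the sketch below evaluates the empirical two-player objective from discriminator outputs. The function name and the probabilities are hypothetical toy values, not results from the paper.

```python
import math


def gan_value(d_real, d_fake):
    """Mini-batch estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples.
    d_fake: discriminator outputs D(G(z)) on synthesized samples.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well yields a high value ...
confident = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
# ... while a fully fooled discriminator (0.5 everywhere) yields a lower one.
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

The discriminator pushes this value up by separating the two kinds of samples, while the generator pushes it down by making its outputs harder to tell apart from real data.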
Given a face image $I$ with an identity label $y^{id} \in \{1, \dots, N_{id}\}$ and an expression label $y^{ep} \in \{1, \dots, N_{ep}\}$, where $N_{id}$ and $N_{ep}$ are the numbers of distinct subjects and facial expressions, respectively, the proposed model has two objectives: 1) to learn an identity-invariant face image representation for facial expression recognition, and 2) to synthesize a realistic face image with the same facial expression as in $I$ and a target identity specified by a one-hot encoded identity code $c$.
Discriminator: Different from the discriminator network in a conventional GAN, the discriminator $D$ in PPRL-VGAN is a multi-task classifier consisting of three separate neural networks (Fig. 2): 1) the network $D^1$ classifies an input face image as real or synthetic, 2) the network $D^2$ estimates the identity of the person in the input face image, and 3) the network $D^3$ classifies the facial expression in the input face image. The weights of the networks in $D$ are trained to classify real face image inputs as real and accurately recognize the person’s identity and the facial expression. They are also trained to classify synthetic image inputs as fake. This is accomplished by adjusting the network weights to maximize the following discriminator cost function:
where $D^2_k(\cdot)$ and $D^3_k(\cdot)$ are the predicted probabilities of the $k$th class for identity and facial expression, respectively. The tuning parameters $\lambda^D_1$, $\lambda^D_2$, and $\lambda^D_3$ control the relative importance between image quality, identity recognition, and expression recognition objectives.
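One form of the discriminator cost that is consistent with the description above is sketched below. The notation is an assumption made for illustration: $I$ is a real image with labels $y^{id}$ and $y^{ep}$, $G(I, c)$ is the image synthesized for identity code $c$, $D^1(\cdot)$ is the probability that the input is real, $D^2_k(\cdot)$ and $D^3_k(\cdot)$ are the predicted probabilities of the $k$th identity and expression class, and $\lambda^D_1$, $\lambda^D_2$, $\lambda^D_3$ are the tuning parameters.

```latex
\mathcal{L}_D =
  \lambda^D_1 \Big( \mathbb{E}_{I}\big[\log D^1(I)\big]
    + \mathbb{E}_{I,c}\big[\log\big(1 - D^1(G(I,c))\big)\big] \Big)
  + \lambda^D_2 \, \mathbb{E}_{I}\big[\log D^2_{y^{id}}(I)\big]
  + \lambda^D_3 \, \mathbb{E}_{I}\big[\log D^3_{y^{ep}}(I)\big]
```

Maximizing the first group rewards telling real images from synthesized ones, and maximizing the last two terms rewards correct identity and expression predictions on real images.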
Generator: In contrast to the generator in a conventional GAN which directly maps a “noise” vector to a synthesized image, the generator $G$ in a PPRL-VGAN maps a real input image $I$ with identity $y^{id}$ and expression $y^{ep}$ to a synthesized output image $G(I, c)$ with a target identity specified by $c$ and the same expression $y^{ep}$. This is accomplished via a VAE-like encoder-decoder structure. Specifically, the encoder aims to learn an image representation $f(I)$ from $I$ via a randomized mapping $q(f(I)|I)$ parameterized by the weights of the encoder neural network. Similarly to a VAE, the cost function for training the generator includes the $\mathrm{KL}$ divergence between a prior distribution $p(f(I))$ on the latent space and the conditional distribution $q(f(I)|I)$. Training attempts to minimize this term. The generator cost function also includes a term that encourages the decoder to learn to synthesize a face image that can fool $D$ into classifying it as a real face image having the same facial expression as the input image $I$, but with a target identity determined by $c$. Specifically, the generator network weights are adjusted during training to minimize the following generator cost function:
where $\lambda^G_1$, $\lambda^G_2$, $\lambda^G_3$, and $\lambda^G_4$ are tuning parameters of the loss functions for $D^1$, $D^2$, $D^3$, and the $\mathrm{KL}$ divergence, respectively. A key difference compared to the cost in Eq. 3 is that the first term (reconstruction error) in Eq. 3 has been replaced with a perceptual loss term for the discriminator in Eq. 6.
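One form of the generator cost that is consistent with the description above is sketched below, again under assumed notation: $G(I, c)$ is the synthesized image, $t(c)$ denotes the index of the non-zero entry of the one-hot code $c$, $f(I)$ is the latent representation, and $\lambda^G_1, \dots, \lambda^G_4$ are the tuning parameters.

```latex
\mathcal{L}_G =
  \lambda^G_1 \, \mathbb{E}_{I,c}\big[\log\big(1 - D^1(G(I,c))\big)\big]
  - \lambda^G_2 \, \mathbb{E}_{I,c}\big[\log D^2_{t(c)}(G(I,c))\big]
  - \lambda^G_3 \, \mathbb{E}_{I,c}\big[\log D^3_{y^{ep}}(G(I,c))\big]
  + \lambda^G_4 \, \mathrm{KL}\big(q(f(I)|I) \,\|\, p(f(I))\big)
```

Minimizing the first term pushes synthesized images to be classified as real, the second and third terms push them to be recognized with the target identity and the input expression, and the last term keeps the encoder distribution close to the prior.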
Training alternates between maximizing Eq. 5 with respect to the weights of the networks in $D$ and minimizing Eq. 6 with respect to the weights of the networks in $G$. As the target identity code $c$ ranges over all $N_{id}$ distinct subjects, $N_{id}$ synthetic images are produced for each training or test image $I$. As in the training of VAEs and GANs, the expectations are approximated by empirical averages computed from a mini-batch of training examples. Over successive training epochs, $G$ learns to fit the true data distribution and create a realistic face image $G(I, c)$ that can fool $D^1$, having the same facial expression as the input image, which can be correctly recognized by $D^3$, and the identity specified by $c$, which can be correctly recognized by $D^2$. As the latent code $c$ determines the identity of $G(I, c)$, the encoder is encouraged to disentangle the identity information from $f(I)$. Moreover, as $G(I, c)$ retains information about facial expression, the encoder is also encouraged to embed as many expression attributes as possible into $f(I)$. As a consequence, $f(I)$ is a generative representation that is not only invariant to identity, but also discriminative for facial expression recognition.
In order to validate the effectiveness of the proposed model, we conducted experiments on two public facial expression datasets: FERG and MUG. FERG is a database of cartoon characters with annotated facial expressions containing 55,769 annotated face images of six characters. The images for each character are grouped into 7 types of cardinal expressions, viz. anger, disgust, fear, joy, neutral, sadness and surprise. The MUG database is video-based. It consists of realistic image sequences of 86 subjects performing the same 7 cardinal expressions. For the sake of computational efficiency, we chose the 8 subjects having the most image samples as our training and testing data. In each image sequence, we removed the first and last 20 frames, which mostly correspond to the neutral expression. We used 11,549 images in total. In experiments with both datasets, we randomly selected (without replacement) images of each expression from each subject for the training set. The remaining images were used as testing data. We also resized each RGB image to -pixel resolution.
We used the same network architecture for both datasets. Details of the PPRL-VGAN structure are listed in Table 1. We implemented our algorithm in Keras and trained all networks from scratch. The weights were initialized to be zero-mean Gaussian with a small standard deviation of . We used a batch size of 256 and performed batch normalization after each convolutional/deconvolutional layer except the last deconvolutional layer in the decoder. We set for LeakyReLUs across the network. We used the RMSprop optimizer with a learning rate of . We observed that network training is very sensitive to the choice of the tuning parameters in the generator and discriminator cost functions. We optimized these parameters using grid search. We found that the following values: , , for discriminator training and , , , for generator training work well. In conventional GANs, it is common to optimize the discriminator more frequently than the generator. However, we update the generator twice as frequently as the discriminator in training because the class labels used in PPRL-VGAN provide additional labeled data that help the discriminator training.
Table 1: Details of the PPRL-VGAN structure.
|Layer||Encoder||Decoder||Discriminator|
|1||conv. , BNorm, LeakyReLU||2048 FC layers , LeakyReLU||conv, BNorm, LeakyReLU|
|2||conv. , BNorm, LeakyReLU||deconv. , BNorm, LeakyReLU||conv, BNorm, LeakyReLU|
|3||conv. , BNorm, LeakyReLU||deconv. , BNorm, LeakyReLU||conv, BNorm, LeakyReLU|
|4||conv. , BNorm, LeakyReLU||deconv. , BNorm, LeakyReLU||conv, BNorm, LeakyReLU|
|5||128 fully-connected (FC), Linear||deconv, tanh||256 fully-connected, LeakyReLU|
|6|| || ||$D^1$: 1 FC, $D^2$: $N_{id}$ FC, $D^3$: $N_{ep}$ FC|
The source code, additional implementation details and more experimental results are available on our project website.
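The update schedule described above, with the generator updated twice as often as the discriminator, can be sketched as follows. This is only one possible reading: it assumes the networks take turns on successive mini-batches in a repeating D, G, G pattern, and the function name and parameter are hypothetical rather than taken from the released code.

```python
def update_schedule(num_batches, generator_steps_per_discriminator_step=2):
    """Return which network is updated on each successive mini-batch.

    Assumption: one discriminator update is followed by a fixed number
    of generator updates, and the pattern then repeats.
    """
    period = generator_steps_per_discriminator_step + 1
    schedule = []
    for batch_index in range(num_batches):
        # The first batch of every period updates D; the rest update G.
        schedule.append("D" if batch_index % period == 0 else "G")
    return schedule


schedule = update_schedule(6)
```

With six mini-batches this gives two discriminator updates and four generator updates, matching the two-to-one ratio stated in the text.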
We evaluate privacy-preserving performance of the proposed PPRL-VGAN under three threat scenarios.
Attack scenario i: This is a simple scenario in which the attacker has access to the unaltered training set. However, the attacker’s test set consists of all images in the original test set after they have been passed through the trained PPRL-VGAN network. Thus, the attacker never gets to see the original test image $I$ but only its privacy-protected version $G(I, c)$. Also, the test set for the attacker contains all $N_{id}$ distinct privacy-protected versions of each $I$ corresponding to the $N_{id}$ distinct values of the identity code $c$.
Attack scenario ii: This is a more challenging scenario (from the perspective of protecting privacy) where the attacker has access to the privacy-protected training images $G(I, c)$ and knows their underlying ground-truth identities $y^{id}$. Therefore, the attacker can train an identifier on training images that have the same type of identity-protecting transformation as the test images. If the proposed privacy-preserving transformation is weak and the identifier has sufficient learning capacity, it may be possible for a trained identifier to correctly predict the underlying ground-truth identity even from a privacy-protected test image. Similarly to scenario i, there are $N_{id}$ privacy-protected images for each training and testing image.
Attack scenario iii: In this scenario, the attacker gets access to the encoder network and can obtain the latent representation $f(I)$ for any image $I$. Then, if the produced latent representation is not void of identity traits, the attacker can train an identifier using the latent representations of the training images and apply it to those of the test images for identification. Although this scenario is more challenging than scenario ii, because the attacker can access the “more pristine” $f(I)$, there are fewer training and test samples available since the identity code $c$ does not enter into the picture and thus there is no $N_{id}$-fold dataset expansion. Moreover, whereas $G(I, c)$ resembles a real image, $f(I)$ need not (and typically does not).
In terms of utility, we train a dedicated facial expression classifier in each scenario with the available format of training data and the corresponding ground-truth expression labels. Then, we apply this classifier to test data and measure the facial expression recognition performance.
We first conduct a detailed evaluation of the proposed framework with respect to privacy preservation and data utility. We use correct classification rate (CCR) in person identification to measure how much privacy is preserved (the lower the CCR, the better) and also in facial expression recognition to measure the utility of data (the higher the CCR, the better). Table 2 summarizes the performance of the proposed approach on the FERG and MUG datasets under a privacy-unconstrained scenario (training and testing sets are both unaltered), under a random-guessing attack and under the three attack scenarios described earlier. In each scenario, the identification and facial expression are estimated separately by different neural network classifiers.
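Table 2: Identification and facial expression recognition CCR on the FERG and MUG datasets under the privacy-unconstrained scenario, a random-guessing attack, and attack scenarios i, ii, and iii.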
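The evaluation metric described above can be sketched in a few lines. The label lists below are toy values invented for illustration, not data from the experiments; only the class counts (6 FERG characters, 8 selected MUG subjects) come from the text.

```python
def correct_classification_rate(predicted, actual):
    """Fraction of samples whose predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def random_guess_rate(num_classes):
    """Expected CCR of a uniform random guess over num_classes classes."""
    return 1.0 / num_classes


# Toy labels: three of the four predictions are correct.
ccr = correct_classification_rate([0, 1, 2, 2], [0, 1, 1, 2])
ferg_chance = random_guess_rate(6)  # 6 characters in FERG
mug_chance = random_guess_rate(8)   # 8 selected subjects in MUG
```

For identification, a CCR near the chance level indicates that little identity information can be recovered; for expression recognition, a high CCR indicates preserved utility.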
For attack scenario i, we train an identifier using the original training set and apply it to privacy-protected test images. The identifier has the same structure as the identity classifier in the discriminator (Fig. 2). We first observe that the identification CCRs for FERG and MUG are both close to a random guess (16.67% for FERG since there are 6 characters and 12.5% for MUG since we selected 8 subjects). However, the same classifier applied to the privacy-unconstrained test images achieves far higher identification performance on both datasets. Such a huge performance gap confirms that the proposed model effectively protects users’ privacy when the attacker has no information about the applied privacy-preserving transformation. For utility evaluation, we train a dedicated facial expression classifier, with the same structure as the expression classifier in the discriminator, using the original training images with their expression labels, and test it on the privacy-protected images. The resulting expression recognition accuracies on FERG and MUG are close to those achieved in the privacy-unconstrained scenario, which indicates that the synthesized images look realistic and retain the expression of the input images.
In attack scenario ii, we use the privacy-protected training data and the corresponding ground-truth identity labels to train an identity recognizer and the ground-truth expressions to train a facial expression classifier (having the same architectures as in scenario i). We first observe that the identification accuracy in scenario ii is higher than that of a random guess for both datasets, which suggests that some identity-related information is leaked into the synthesized images, but this is still much lower than in the privacy-unconstrained scenario. With respect to facial expression recognition, the performance in scenario ii is consistently better than that in scenario i. This is likely because the number of training samples in scenario ii is $N_{id}$ times that in scenario i, which benefits the training of the facial expression classifier.
In attack scenario iii, we assume the attacker can access the latent representations of the training and probe images. We simulate this attack scenario by training an identifier using the latent representations of the training images and testing it on those of the test images. However, as the latent representation $f(I)$ is a 1-D vector, the 2-D ConvNet classifiers we used before are not suitable. We have experimented with 3 classifiers for $f(I)$, namely a Support Vector Machine (SVM), a customized 1-D ConvNet and a customized Artificial Neural Network (ANN). The customized ANN (3 hidden layers, each with 256 nodes) performed best in terms of identification and expression recognition accuracy. Therefore, only results for the customized ANN classifier are reported. As shown in Table 2, the identification performance is reduced in comparison with scenario ii. However, the expression recognition performance in scenario iii is the best among the three attack scenarios. Effectively, this suggests that the learned image representation contains crucial facial expression information, but is largely disentangled from the identity information.
Identity Replacement/Expression Transfer: In addition to producing an identity-invariant image representation, PPRL-VGAN can be applied to an input face image of any identity to synthesize a realistic, expression-equivalent output face image of a target identity specified by the latent code $c$ (see Fig. 3). This may also be equivalently viewed as “transferring” an expression from one face to another. Unlike in a standard GAN, the synthesized image contains a lot of detail about the target identity due to the incorporation of the identifier and the expression classifier in the discriminator.
Face Image Synthesis without Input Image: Once trained, our model can also synthesize face images without using an input image. This is due to the constraint we impose on the encoder which forces the distribution of the latent representation to follow a prior distribution (in our experiments: $\mathcal{N}(0, I)$). To generate a new face image, we simply sample a latent vector from the prior distribution and concatenate it with an identity code. Then, we feed the concatenated vector into the decoder for image generation. As shown in Fig. 4, the synthesized images are realistic and the identities are consistent with the identity code $c$. While the current model is incapable of controlling the facial expression of a generated image when no input image is given, we believe the synthesized images are useful for other applications, e.g., augmenting the original dataset.
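The sampling step described above can be sketched as follows. The decoder itself is not modeled here; the sketch only builds the vector that would be fed to it. A standard normal prior, a latent dimension of 128, and 6 identities (as in FERG) are assumptions made for illustration, and all function names are hypothetical.

```python
import random


def one_hot(index, num_classes):
    """One-hot identity code with a 1 at position `index`."""
    code = [0.0] * num_classes
    code[index] = 1.0
    return code


def decoder_input(latent_dim, identity_index, num_identities, rng):
    """Sample a latent vector from N(0, I) and append the identity code."""
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    return latent + one_hot(identity_index, num_identities)


rng = random.Random(0)  # fixed seed so the draw is repeatable
vector = decoder_input(latent_dim=128, identity_index=2,
                       num_identities=6, rng=rng)
```

Changing `identity_index` while reusing the same latent sample would request the same underlying content rendered with a different target identity.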
Face Image Synthesis for Left-Out Expression: In order to further evaluate the generative capacity of PPRL-VGAN, we conducted experiments where we intentionally left out all samples of a specific facial expression $e$ from subject $s$ in training (images of expression $e$ from other subjects are still used) and then synthesized the left-out expression for subject $s$ after the model had been trained. This was done by feeding the generator an image with expression $e$ from subject $s'$, $s' \neq s$, and an identity code $c$ with the $s$th entry equal to 1 and all other entries 0.
Figure 5 shows examples of left-out expression synthesis. While artifacts are clearly visible, the synthesized images capture the essential traits of a left-out expression, thus validating the generative capacity of PPRL-VGAN.
Facial expression morphing is a challenging problem because a human face is highly non-rigid and significantly deforms across expressions. Most methods perform face morphing in image space. Here, we leverage the latent representation and apply linear interpolation in latent space. Let $I_1$, $I_2$ be a pair of source images with different expressions for subject $s$ and $f(I_1)$, $f(I_2)$ their corresponding latent representations. First, we linearly interpolate $f(I_1)$ and $f(I_2)$ in the latent space to obtain a series of new representations as follows:
Then, we feed each interpolated representation and the identity code $c$ into the decoder to synthesize images. Figure 6 shows two examples of expression morphing. We can see that in both cases, the facial expression changes gradually from left to right. These smooth semantic changes indicate that the model is able to capture salient expression characteristics in the latent representation.
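The linear interpolation in latent space can be sketched as below, using the convex combination $(1 - \alpha) f(I_1) + \alpha f(I_2)$ with $\alpha$ evenly spaced in $[0, 1]$. The even spacing, the direction of the interpolation, the function name, and the two-dimensional toy vectors are assumptions for illustration.

```python
def interpolate_latents(f1, f2, num_steps):
    """Return num_steps latent vectors moving linearly from f1 to f2."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    series = []
    for step in range(num_steps):
        alpha = step / (num_steps - 1)  # 0.0 at f1, 1.0 at f2
        series.append(
            [(1.0 - alpha) * a + alpha * b for a, b in zip(f1, f2)]
        )
    return series


# Toy 2-D latent codes standing in for two expressions of one subject.
series = interpolate_latents([0.0, 2.0], [1.0, 0.0], num_steps=5)
```

Each vector in `series` would then be paired with the same identity code and passed to the decoder, producing one frame of the morphing sequence.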
Image completion: PPRL-VGAN can also be applied to an image completion task. We tested two different masks (Fig. 7): one covering the eyebrows, eyes and nose, and the other covering the mouth (each mask occupies of the image). To complete the missing content of a query image $I$ of subject $s$, we first pass $I$ to the encoder to produce a latent representation $f(I)$. Then, we feed $f(I)$ and the identity code $c$ of subject $s$ to the decoder to synthesize a new image $G(I, c)$. Finally, we replace the missing pixel values of $I$ with values from the corresponding locations in $G(I, c)$.
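The final pixel-replacement step can be sketched as follows. Images are represented as nested lists of pixel values and the mask as nested booleans; this representation, the function name, and the tiny 2x2 example are assumptions for illustration, and the synthesized image is taken as given rather than produced by a trained model.

```python
def complete_image(query, synthesized, missing_mask):
    """Fill the masked pixels of `query` with pixels from `synthesized`.

    missing_mask[r][c] is True where the query pixel is missing.
    """
    return [
        [syn if missing else orig
         for orig, syn, missing in zip(q_row, s_row, m_row)]
        for q_row, s_row, m_row in zip(query, synthesized, missing_mask)
    ]


query = [[10, 20], [0, 0]]           # bottom row is masked out
synthesized = [[11, 21], [30, 40]]   # stand-in for the decoder output
missing_mask = [[False, False], [True, True]]
completed = complete_image(query, synthesized, missing_mask)
```

Only the masked locations take values from the synthesized image, so any misalignment between the two faces shows up at the mask boundary.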
Examples of both successful and unsuccessful image completions are shown in Fig. 7. Figure 7(a) shows examples for which our model was able to accurately estimate the missing image content. This demonstrates that our model learns correlations between different facial features, for example that opening the mouth is likely to appear jointly with raising eyebrows. However, our model occasionally fails (Fig. 7(b)). One possible reason for this is that some critical facial features (e.g., lowered eyebrows and narrowed eyes in the angry expression) are missing. A distortion may also occur when a face in the synthesized images is not accurately aligned with the one in the query image.
We presented a PPRL-VGAN for privacy-preserving facial expression recognition and face image synthesis. We proposed a novel architecture combining a VAE and a GAN to create an identity-invariant representation of a face image that also permits synthesis of an expression-preserving and realistic version. Experimental results on two public facial expression datasets demonstrate that our approach strikes a balance between privacy preservation and data utility. In addition, the proposed model can support a variety of applications like expression morphing and image completion. Generalizing the proposed framework to handle input images from unseen persons is part of our ongoing research.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research.
Modeling stylized character expressions via deep learning. In Asian Conference on Computer Vision, pages 136–153. Springer, 2016.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, volume 1, page 4, 2017.
Neural Networks for Machine Learning, Coursera lecture 6e, 2012.