Facial expression is one of the most effective and natural ways for human beings to express their emotions. Identifying and synthesizing facial expressions enables human-computer interaction systems to better understand and simulate human behaviors. In spite of their effectiveness, facial expression recognition algorithms based on deep neural networks rely heavily on large numbers of training samples with clean labels. Annotating a large-scale database is time-consuming and often impractical. To overcome this limitation, a number of studies [Meng2017, Pons2018, Ranjan2017] have focused on multi-task learning for facial expression recognition (FER), which trains the models under the regularization of auxiliary tasks. Owing to this regularization effect, multi-task learning (MTL) is an effective strategy for tackling the overfitting caused by insufficient training samples. Moreover, learning multiple tasks simultaneously in one network can improve performance by transferring beneficial knowledge from relevant auxiliary tasks to the main task. Specifically, in FER, single-task networks can learn highly discriminative features with respect to facial expressions. However, the learnt features do not sufficiently account for nuisance factors, such as subject identity, head pose, and illumination, which results in poor generalization in practical applications. MTL therefore contributes to a more robust solution with better generalization for FER tasks.
There are two main underlying problems in current MTL-based FER algorithms, i.e. the design of the auxiliary tasks, and the building of connections between different tasks. In terms of auxiliary task design, numerous studies have proposed auxiliary tasks such as expression-related hidden unit detection [ReedLearningTD2014], identity classification [ZhangFacialER2017], and landmark localization [DevriesMulti-taskLF2014]. In this paper, we propose to learn facial expression synthesis (FES) as the auxiliary task for FER. FES aims to synthesize a facial expression image based on a guiding expression label. We employ a patch-based conditional generative adversarial network (cGAN) [isola2017image] to learn the FES task. In addition to being highly correlated with FER, FES can generate extra training samples and balance the training dataset, which can greatly enhance the performance of a deep facial expression recognition framework.
Establishing task interaction is another important factor when building multi-task networks, because the interaction directly affects the information flow between different tasks. Conventional algorithms apply the hard parameter-sharing approach, which shares the feature maps at the bottom layers of a network and separates the branches for different tasks at the top layers, such as the two-head structure for classification and localization in object detection [Ren-2015-FRT]. In spite of its simplicity, the hard parameter-sharing approach cannot differentiate helpful information from harmful information between tasks. To address this issue, we propose a novel multi-task network for FER and FES, namely the facial expression recognition and synthesis network (FERSNet), with a soft parameter-sharing mechanism that effectively selects useful features from different tasks and different layers. The main contributions of this paper can be summarized as follows:
We propose a novel multi-task convolutional neural network, with the convolutional feature leaky unit, to selectively transfer the beneficial features between the facial expression recognition task and the facial expression synthesis task.
We employ the facial expression synthesis branch to enlarge and balance the training dataset for further improving the generalization ability of the proposed algorithm.
We conduct experiments to demonstrate that the proposed multi-task network achieves promising performance in recognizing and synthesizing facial expression images.
II Related Work
II-1 Multi-task learning for FER
Multi-task learning for FER has been widely studied over the past few decades. Previous works on multi-task-based FER attempt to combine FER with other facial image analysis tasks, in order to obtain a more robust representation of facial expressions in the feature space. Meng et al. [Meng2017] proposed a two-stream network to extract identity-invariant expression features for emotion classification. Pons and Masip [Pons2018] suggested that jointly learning a model for FER and facial action unit detection can significantly improve the FER performance. Moreover, Ranjan et al. [Ranjan2017] proposed a multi-branch network to solve diverse facial image analysis tasks simultaneously. Zhang et al. [ZhangFacialER2017] proposed a multi-signal CNN under the supervision of the FER and face verification tasks, which forces the model to learn more discriminative features with respect to facial expressions. Ming et al. [DMTL2019] proposed a multi-task network with dynamic weights for the FER and face recognition tasks to enhance the model performance. However, the above-mentioned studies do not consider feature selection when sharing information between different tasks, which may greatly degrade their performance, because useless or even harmful information is transferred.
II-2 Facial expression synthesis
Facial expression synthesis is another widely studied topic in the field of facial image analysis. With the development of the generative adversarial network (GAN) [NIPS2014_5423], facial expression editing and synthesis have achieved appealing performance. Zhang and Song [zhang2017age] proposed a conditional adversarial autoencoder (CAAE) for synthesizing facial images with different expressions and ages. ExprGAN [ding2017exprgan] adopted the conditional GAN strategy to produce facial images with different expression intensities. Moreover, Choi et al. proposed StarGAN to achieve multi-domain image-to-image translation for facial image synthesis. In addition, the geometric-guided methods [Geo1, Geo2, Geo3] employed shape-aware supervision from facial landmarks for expression editing, which achieved state-of-the-art performance in facial expression transfer. However, few existing works employ the soft parameter-sharing strategy to enhance the quality of the synthetic facial images with the FER regularization.
II-3 Feature selection mechanism
The selection mechanism plays an important role in multi-task learning and has been widely applied in natural language processing (NLP). Ruder et al. [Ruder2017SluiceNL] presented Sluice Networks, in which a subspace combination approach was proposed to determine the information flow between different tasks. Moreover, Xiao et al. [xiao-etal-2018-learning] took advantage of the gated recurrent unit (GRU) and proposed a leaky unit with the property of remembering and forgetting information, which achieved state-of-the-art performance in text classification. However, different from the NLP tasks, facial image analysis tasks usually fuse local features to form a global representation of the image. Therefore, in this paper, we propose a convolutional feature leaky unit (ConvFLU) to perform feature selection between different tasks and layers.
III The Proposed Method
The proposed FERSNet solves the FER and FES tasks in parallel, with reliable knowledge transfer between them. Fig. 1 illustrates the pipeline of FERSNet. The branch at the top aims to recognize the expression of the input facial image, while the branch at the bottom aims to generate a new facial expression image with the same identity as the input, based on the target-expression label. The two branches are connected by a set of soft parameter-sharing blocks, i.e. ConvFLUs. In this section, we first introduce the proposed multi-task framework for the FER and FES tasks, and then we present the details of ConvFLU for selective feature sharing. Finally, we show the learning strategy for the proposed multi-task network to solve the FER and FES problems simultaneously.
As presented in Fig. 1, the proposed two-stream network consists of two branches, one for FER and one for FES. The first four convolutional blocks in the FER and FES branches, which aim to extract discriminative features for the respective tasks, are connected with ConvFLUs. Based on the extracted FER features, the classifier predicts the emotion label of the input facial image. In the FES branch, a second input, the target-expression label, which controls the expression of the synthetic image, is fed to the transformer. The transformer encodes the information of the target-expression label to produce a feature map with the same size as the extracted FES features. The two feature maps are fused by channel concatenation. Finally, the decoder reconstructs the image from the fused features to mislead a patch-based discriminator [isola2017image].
In summary, the proposed FERSNet takes a facial image and a target-expression label as inputs to predict the expression label of the input image and synthesize another facial image with the expected (target) expression.
III-B Convolutional Feature Leaky Unit
Inspired by the GRU in [cho-2014-learning], we propose a convolutional feature leaky unit, which inherits the effective property of remembering and forgetting features. We revise the original structure of the GRU and employ it to filter out useless and harmful features when transferring information between the different tasks. The structure of the proposed ConvFLU is illustrated in Fig. 2. We consider transferring the features from one task (the source) to the other task (the target) at a given layer. It is worth noting that there are two ConvFLUs in one transference block, as shown in Fig. 1. Therefore, the information flow between FER and FES is bidirectional.
The leaky gate for transferring information from the source task to the target task is computed by applying trainable convolutional kernels to the concatenation of the input feature maps of the two tasks at that layer. A new feature map is then generated from the leaky gate, using two further sets of trainable convolutional kernels to combine the feature maps of the two tasks, with the gate applied by element-wise multiplication. The leaky gate thus controls the information leakage from the source task to the target task. We further consider a memory gate, computed with its own trainable convolutional kernels, which determines the information that should be remembered from the previous feature map. The final feature map, which merges both sources of information, is then computed under the control of the memory gate.
With the leaky gate and the memory gate, ConvFLU is able to select the beneficial features for FER and FES from each other. The values in the leaky gates determine how much information a branch utilizes from the other task, and the values in the memory gates determine how much of its own information the branch preserves for the corresponding task. We further visualize the leaky gate and the memory gate in Fig. 3 to better illustrate the selective feature-sharing strategy.
It can be seen from the figure that the FER task mainly focuses on the mouth and the eye regions to predict the expression label, while the FES task tends to preserve the identity information when synthesizing a new expression. Therefore, the proposed ConvFLU can be regarded as an attention mechanism, based on task correlation, for performing selective feature sharing.
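The gating described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it assumes a GRU-style form (sigmoid gates, a tanh candidate, and a convex combination controlled by the memory gate), uses 1x1 convolutions in place of the unspecified kernel sizes, and all function and parameter names (`conv_flu`, `w_leak`, `w_own`, `w_other`, `w_mem`) are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1x1(kernel, feats):
    """1x1 convolution. kernel: [C_out][C_in]; feats: [C_in][H][W]."""
    height, width = len(feats[0]), len(feats[0][0])
    return [
        [
            [sum(k * feats[c][y][x] for c, k in enumerate(row)) for x in range(width)]
            for y in range(height)
        ]
        for row in kernel
    ]


def zip_maps(fn, *maps):
    """Apply fn element-wise over feature maps of identical shape [C][H][W]."""
    return [
        [[fn(*vals) for vals in zip(*rows)] for rows in zip(*chans)]
        for chans in zip(*maps)
    ]


def conv_flu(f_own, f_other, w_leak, w_own, w_other, w_mem):
    """Assumed GRU-style leaky unit moving features from f_other into f_own."""
    both = f_own + f_other  # list concatenation acts as channel concatenation
    # Leaky gate: how much of the other task's features may leak in.
    leak = zip_maps(sigmoid, conv1x1(w_leak, both))
    gated_other = zip_maps(lambda g, v: g * v, leak, f_other)
    # Candidate feature map combining own features and gated other features.
    candidate = zip_maps(
        lambda a, b: math.tanh(a + b),
        conv1x1(w_own, f_own),
        conv1x1(w_other, gated_other),
    )
    # Memory gate: how much of the branch's own features is preserved.
    mem = zip_maps(sigmoid, conv1x1(w_mem, both))
    return zip_maps(
        lambda m, own, new: m * own + (1.0 - m) * new, mem, f_own, candidate
    )


# Usage with one channel and a 1x2 feature map per task.
fer_feats = [[[1.0, 2.0]]]
fes_feats = [[[3.0, -1.0]]]
fused = conv_flu(fer_feats, fes_feats, [[0.0, 0.0]], [[0.0]], [[0.0]], [[0.0, 0.0]])
```

Under this assumed form, a transference block would call `conv_flu` twice with separate weights, once in each direction, which matches the bidirectional information flow described in the text.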
III-C Learning for FER and FES
In this paper, we solve FER and FES simultaneously with the proposed FERSNet. For the FER task, the classifier in the top branch consists of fully connected layers and is trained under a standard cross-entropy loss, computed over all training samples from the output probability vector of each sample and its corresponding ground-truth label.
For the FES task in the bottom branch, we consider the loss from a patch-based discriminator [isola2017image], i.e. the GAN loss, and also adopt a reconstruction loss to synthesize visually pleasant facial images. The GAN loss is defined over the real target image, the original facial image, and the target-expression label, and involves the generator and the discriminator in the network. Specifically, the generator consists of all the modules in FERSNet for producing the synthetic face. The reconstruction loss is defined as the mean squared error between the target and the synthetic facial image.
In addition, we follow the learning strategy in [Geo1, Geo2, Geo3], and employ a cycle consistency loss and an identity preserving loss to further improve the synthesis accuracy. The cycle consistency loss guarantees the consistency between the source image and the cycle-reconstructed image. In other words, if we transform the synthetic face back to the original expression via FERSNet, the original facial image should be recovered. Moreover, the identity preserving loss is defined as the distance between the features extracted from the original face and the synthetic face with the pre-trained model-B of the Light CNN [Wu2015ALC]. As the Light CNN aims to recognize identity from facial images, it can extract the most prominent features for identity discrimination.
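As a rough illustration of the two losses that the text specifies fully, the sketch below computes a standard cross-entropy over predicted probability vectors and a mean squared error between a target and a synthetic image. The weighted sum in `total_objective` is only an assumption about how the individual terms might be combined; the loss names and weights used with it are hypothetical and are not taken from the paper.

```python
import math


def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the ground-truth class."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)


def mse(target, synthetic):
    """Mean squared error between two flattened images of equal length."""
    return sum((t - s) ** 2 for t, s in zip(target, synthetic)) / len(target)


def total_objective(losses, weights):
    """Assumed combination: a weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in losses.items())


# Usage with toy values (two samples, two classes; two-pixel images).
fer_loss = cross_entropy([[0.9, 0.1], [0.2, 0.8]], [0, 1])
rec_loss = mse([0.0, 1.0], [0.1, 0.9])
objective = total_objective(
    {"fer": fer_loss, "rec": rec_loss},  # hypothetical term names
    {"fer": 1.0, "rec": 1.0},            # hypothetical weights
)
```

The GAN, cycle consistency, and identity preserving terms would enter the same kind of sum, but they depend on the discriminator and the pre-trained Light CNN and are therefore left out of this sketch.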
In this section, we present the implementation details and the experiment settings for evaluating FERSNet. We also compare FERSNet with other state-of-the-art methods for facial expression recognition and synthesis.
IV-A Implementation Details
As shown in Fig. 1, the FER branch employs four convolutional blocks. Each convolutional block consists of two convolutional layers, two batch normalization [BN] layers, two ReLU [ReLU] layers, and one max pooling layer. For the FES branch, we employ eight convolutional blocks. The first four blocks have the same structure as those of the FER branch. The next four blocks form the decoder in Fig. 1, in which we replace the max pooling layer with a deconvolutional layer to upscale the feature maps for reconstructing the target image. The classifier consists of two fully connected layers and produces a vector indicating the predicted emotion probabilities. The transformer performs the inverse mapping, generating a feature map from the target-expression vector, and thus consists of fully connected layers and deconvolutional layers. All the convolutional kernels in the network share the same size, padding, and stride, except for those in the ConvFLUs, which use their own kernel size and padding. The max pooling layers and the deconvolutional layers use strided kernels to rescale the feature maps. The number of filters is fixed in the convolutional layers of the FER and FES feature extractors.
In the training phase, we randomly select a target label with the corresponding facial image for each training sample. We apply face alignment to all the facial images, based on the method in [bulat2017far], and resize each aligned face to a fixed resolution. We then randomly crop a region from each aligned facial image, and apply random mirroring and random rotation to the cropped images to obtain the final training samples. For each testing image, we apply the same face alignment method [bulat2017far], resize the aligned face, and adopt the center-crop approach to generate the final testing sample. It is worth noting that these pre-processing and augmentation approaches are commonly used by the methods [8038215, Jung_2015_ICCV, Yang2018, LBVCNN] compared in our experiments.
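The cropping and mirroring steps can be sketched on images stored as nested lists. This is an illustration only: the crop size is a free parameter because the actual resolutions are not given here, the helper names are hypothetical, the mirroring probability of 0.5 is an assumption, and the random rotation is omitted because it requires interpolation.

```python
import random


def center_crop(image, size):
    """Deterministic crop used for testing images."""
    height, width = len(image), len(image[0])
    top, left = (height - size) // 2, (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def random_crop(image, size, rng):
    """Random crop used for training images."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)   # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def random_mirror(image, rng):
    """Flip the image horizontally with an assumed probability of 0.5."""
    return [row[::-1] for row in image] if rng.random() < 0.5 else image


def training_sample(image, size, rng):
    return random_mirror(random_crop(image, size, rng), rng)


# Usage on a toy 4x4 "image" whose pixel values are 0..15.
toy = [[4 * r + c for c in range(4)] for r in range(4)]
test_view = center_crop(toy, 2)
train_view = training_sample(toy, 2, random.Random(0))
```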
We implement FERSNet with PyTorch [paszke2017automatic]. During training, we use a fixed batch size, and the learning rate is decreased over the training epochs. We adopt the Adam [Adam] optimizer to minimize the objective function defined in Eq. (10), with three of its parameters empirically set to fixed values. A further parameter is set to 0.1 at the beginning, and is gradually increased during the training process. We train the network on an NVIDIA GeForce GTX 1080 Ti GPU, and it takes about 5 hours to train one FERSNet model. The code will be available at https://github.com/RickZ1010/Deep-Multitask-Learning-For-FER-and-FES-based-on-Selective-Feature-Sharing.
IV-B Evaluation on FER
As the proposed FERSNet aims to recognize and synthesize facial expression images, the training dataset is required to contain both the expression and the identity information. Therefore, we employ three commonly used facial expression benchmarks, namely the Extended Cohn-Kanade dataset (CK+) [KanadeCK+], the Oulu-CASIA NIR&VIS facial expression database (Oulu-CASIA) [ZhaoOulu], and the MMI facial expression database (MMI), to evaluate the emotion recognition performance of FERSNet. In addition, we mainly consider six standard facial expressions, i.e. anger (An), disgust (Di), fear (Fe), happiness (Ha), sadness (Sa) and surprise (Su), in our experiments, because these six expressions are universal among humans, irrespective of their age, gender and race [FACS]. Moreover, one additional emotion class, i.e. contempt (Co), is included when evaluating FERSNet on CK+. The number of video sequences in each database is summarized in terms of the expression labels in Table I.
The Extended Cohn-Kanade (CK+) dataset [KanadeCK+] consists of 593 video sequences collected from 123 subjects. We use the video sequences with the provided seven expression labels, and select the last three peak frames as the emotional faces, which results in 981 images in total. We further split the images into 10 folds based on the identity, and perform the 10-fold identity-independent cross-validation to train FERSNet using 90% of the samples and test its performance using the remaining 10% of the samples. The final recognition results are obtained by averaging the accuracy over the 10 runs.
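The identity-independent protocol can be sketched as follows. The essential property is that all images of a subject fall into the same fold, so no identity appears in both the training and the testing split. The greedy balancing of fold sizes is an assumption, since the text does not state how subjects are assigned to folds, and `evaluate` is a hypothetical callback that trains on one index list and returns the accuracy on the other.

```python
from collections import defaultdict


def identity_folds(subject_ids, num_folds=10):
    """Assign sample indices to folds so that no subject spans two folds."""
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)
    folds = [[] for _ in range(num_folds)]
    # Assumed heuristic: largest subjects first, each into the smallest fold.
    ordered = sorted(by_subject, key=lambda s: (-len(by_subject[s]), str(s)))
    for subject in ordered:
        smallest = min(range(num_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_subject[subject])
    return folds


def cross_validate(subject_ids, evaluate, num_folds=10):
    """Average the score returned by evaluate(train_idx, test_idx) over folds."""
    folds = identity_folds(subject_ids, num_folds)
    scores = []
    for held_out in range(num_folds):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / len(scores)


# Usage: 20 hypothetical subjects with three peak frames each.
toy_ids = [subject for subject in range(20) for _ in range(3)]
toy_folds = identity_folds(toy_ids)
```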
| Method | | Input | Accuracy (%) |
|---|---|---|---|
| LBP-TOP [zhao2007dynamic] | ✗ | Image sequence | 88.99 |
| HOG 3D [klaser2008] | ✗ | Image sequence | 91.44 |
| 3DCNN [3DCNN] | ✗ | Image sequence | 85.9 |
| IACNN | ✓ | Single image | 95.37 |
| DTAGN [Jung_2015_ICCV] | ✓ | Image sequence | 97.25 |
| IPA2LT [Zeng_2018_ECCV] | ✗ | Single image | 91.67 |
| DeRL [Yang2018] | ✓ | Single image | 97.30 |
| LBVCNN [LBVCNN] | ✓ | Image sequence | 97.38 |
| DMT-CNN [DMTL2019] | ✓ | Single image | 97.55 |
| FERSNet (BU-4DFE) | ✓ | Single image | 97.85 |
The comparison results on CK+ are listed in Table II. As DeRL [Yang2018] was pre-trained on the BU-4DFE dataset, we established two versions of FERSNet, one trained from scratch and one pre-trained on BU-4DFE, to make the comparison with DeRL fair. The BU-4DFE dataset contains images from 101 subjects with the six standard emotion labels. We follow the settings in Sec. IV-A to pre-train FERSNet on BU-4DFE. In addition, the original DMT-CNN is designed for recognizing eight facial expressions in CK+, i.e. the original seven expressions and the neutral face. To make a fair comparison, we established and re-trained the DMT-CNN model, based on the settings in [DMTL2019], to recognize the original seven expressions in CK+. It can be observed from the table that the proposed FERSNet outperforms all the other methods on CK+. Compared to DeRL, which is also a “GAN + Classifier” method for FER, FERSNet obtains an accuracy improvement of about 0.55%.
The Oulu-CASIA [ZhaoOulu] database consists of video sequences captured under three different illumination conditions. In our experiments, we only use the 480 video sequences taken from 80 subjects under the strong illumination condition. There are six emotion labels in Oulu-CASIA, i.e. anger (An), disgust (Di), fear (Fe), happiness (Ha), sadness (Sa) and surprise (Su). For each video sequence, we select the last three peak frames with the provided emotion label to form the dataset, which results in 1440 images in total. Similar to CK+, the 10-fold identity-independent cross-validation is performed to evaluate FERSNet on Oulu-CASIA.
| Method | | Input | Accuracy (%) |
|---|---|---|---|
| LBP-TOP [zhao2007dynamic] | ✗ | Image sequence | 68.13 |
| HOG 3D [klaser2008] | ✗ | Image sequence | 70.63 |
| STM-Explet | ✗ | Image sequence | 74.59 |
| DTAGN [Jung_2015_ICCV] | ✓ | Image sequence | 81.46 |
| IPA2LT [Zeng_2018_ECCV] | ✗ | Single image | 61.02 |
| DeRL [Yang2018] | ✓ | Single image | 88.0 |
| LBVCNN [LBVCNN] | ✓ | Image sequence | 82.41 |
| DMT-CNN [DMTL2019] | ✓ | Single image | 87.5 |
| ExprGAN [ding2017exprgan] | ✓ | Single image | 84.72 |
| FERSNet (BU-4DFE) | ✓ | Single image | 89.23 |
The comparison results on Oulu-CASIA are summarized in Table III. FERSNet pre-trained on BU-4DFE outperforms all the competitors, and it surpasses DeRL by about 1.2%. In addition, we observe that with the pre-training on BU-4DFE, FERSNet acquires a larger accuracy improvement of about 6% on Oulu-CASIA, and the FERSNet model trained from scratch achieves an accuracy of 83.47%.
The MMI database consists of 236 video sequences, recorded from 31 subjects. Each video sequence is labelled as one of the six standard emotions. We select the 208 video sequences with frontal-view faces. Because the label is provided for each sequence and the peak expression mainly appears in the middle, we select the three frames in the middle of each sequence, which results in 624 images in total. We also follow the 10-fold identity-independent cross-validation strategy to evaluate the performance of FERSNet on MMI. The final accuracy is averaged over the 10 runs on MMI. We present the comparison results in Table IV.
| Method | | Input | Accuracy (%) |
|---|---|---|---|
| LBP-TOP [zhao2007dynamic] | ✗ | Image sequence | 59.51 |
| HOG 3D [klaser2008] | ✗ | Image sequence | 60.89 |
| STM-Explet | ✗ | Image sequence | 75.12 |
| DTAGN [Jung_2015_ICCV] | ✓ | Image sequence | 70.24 |
| IACNN | ✓ | Single image | 71.55 |
| DeRL [Yang2018] | ✓ | Single image | 73.23 |
| LBVCNN [LBVCNN] | ✓ | Image sequence | 76.28 |
| FERSNet (BU-4DFE) | ✓ | Single image | 75.32 |
It can be seen from the table that the sequence-based method LBVCNN [LBVCNN] achieves higher accuracy than the proposed FERSNet. Sequence-based methods exploit temporal information, while FERSNet only considers the spatial information from a static image. We achieve about 75.3% on MMI, which is very close to the sequence-based methods and still surpasses DeRL by about 2%.
In the above experiments, FERSNet consistently outperforms DeRL on CK+, Oulu-CASIA, and MMI, which shows the effectiveness of jointly learning FER and FES. Compared to the de-expression strategy in DeRL [Yang2018], the proposed facial expression editing (synthesizing) strategy is a more general case for transferring expression information, and thus FERSNet produces better results.
IV-C Evaluation on FES
We evaluate the performance of FERSNet on synthesizing facial images with the expected expression. In the FES experiments, we mainly consider the six basic emotions, and we compare the visual results of FERSNet with those of other generative models, including StarGAN, ExprGAN [ding2017exprgan], CycleGAN [CycleGAN2017] and CAAE [zhang2017age], on CK+ [KanadeCK+] and Oulu-CASIA [ZhaoOulu]. Among them, ExprGAN and StarGAN are conditional generative frameworks, which are specially designed for facial expression synthesis. It is worth noting that the geometric-guided methods [Geo1, Geo2, Geo3] are not included in the comparison, because they use much stronger supervision than the proposed algorithm. The qualitative results are presented in Fig. 4, which shows that CAAE and CycleGAN fail to produce satisfactory facial images, as their outputs contain visible distortions. ExprGAN and StarGAN generally produce results comparable to those of our proposed method. Nevertheless, the proposed FERSNet better maintains the color consistency on Oulu-CASIA. In addition, it can be observed from Fig. 4 that FERSNet exhibits a better ability to generate the eye and the mouth regions, and to preserve the identity information. This is because the leaky gate and the memory gate in ConvFLU can more effectively transfer and memorize the related features for the FES task. In summary, FERSNet synthesizes more visually pleasant images with fewer artefacts and less blur.
To further validate FERSNet on the FES task, we follow the evaluation approach in StarGAN and ExprGAN [ding2017exprgan], and establish a quantitative comparison. Specifically, we train the above-mentioned FES methods on the training set, and then perform the expression synthesis on the unseen testing set. The synthetic facial images are fed to an expression recognition network. A higher recognition accuracy on the synthetic facial images indicates more realistic expression synthesis, because the generated images lie on the same manifold as natural expressions. It is worth noting that the expression recognition network is independently trained on the original training set, containing real facial images only. Then, we employ this network to recognize the synthetic images from the different generative models. The results are summarized in Table V. We also adopt the 10-fold identity-independent cross-validation strategy in this experiment, and the final recognition results are obtained by averaging the accuracy over the 10 runs.
As shown in the table, the proposed FERSNet achieves the highest accuracy, which demonstrates its superiority in synthesizing more realistic facial expression images.
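The quantitative protocol above reduces to a simple measurement, sketched here under stated assumptions: `classify` stands for the independently trained expression recognition network and is a hypothetical callable that maps an image to a predicted label; a synthetic image counts as correct when the prediction matches the target expression it was generated for.

```python
def synthesis_accuracy(classify, synthetic_images, target_labels):
    """Fraction of synthetic images recognized as their target expression."""
    correct = sum(
        1
        for image, target in zip(synthetic_images, target_labels)
        if classify(image) == target
    )
    return correct / len(target_labels)


def mean_over_folds(per_fold_accuracies):
    """Final score: accuracy averaged over the cross-validation runs."""
    return sum(per_fold_accuracies) / len(per_fold_accuracies)


# Usage with a stand-in classifier that reads the label stored in a toy image.
toy_images = [{"label": "Ha"}, {"label": "Sa"}, {"label": "Su"}, {"label": "Ha"}]
toy_targets = ["Ha", "Sa", "An", "Ha"]
fold_accuracy = synthesis_accuracy(lambda img: img["label"], toy_images, toy_targets)
```

Because the classifier never sees synthetic images during training, a high score cannot come from the classifier adapting to generator artefacts, which is why the text stresses that it is trained on real images only.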
IV-D Ablation Study
To analyze the proposed method comprehensively, we present an ablation study to investigate the effectiveness of the different novel designs in FERSNet. Specifically, we establish the FERSNet model without the FES regularization and without ConvFLU, respectively. We further employ the FES branch as a data augmentation approach for the FER task. The results are listed in Table VI.
| Method | CK+ | Oulu-CASIA | MMI |
|---|---|---|---|
| FERSNet w/o MTL | 94.70 | 73.33 | 63.78 |
| FERSNet w/o ConvFLU | 95.21 | 77.92 | 69.07 |
| FERSNet w/ FES-DA | 97.75 | 87.64 | 73.87 |
The FERSNet model trained without the FES regularization becomes a single-task network, which does not adopt the multi-task learning strategy (FERSNet w/o MTL). The FERSNet model without ConvFLU (FERSNet w/o ConvFLU) becomes a hard parameter-sharing multi-task network, which lacks the ability to select beneficial information when transferring features. FERSNet with FES data augmentation (FERSNet w/ FES-DA) is the original FERSNet further fine-tuned on the synthetic samples produced by the FES branch. In other words, we employ the FES branch as a data augmentation approach for fine-tuning the network. As listed in Table I, MMI and CK+ are highly imbalanced datasets, and therefore we employ the FES branch to enlarge the training set and balance the number of samples in each emotion class. For each training sample, we synthesize all the possible expression images from it to make the training set completely balanced. In Table VI, we present the results of the “FERSNet w/ FES-DA” model using 24K images for training. It is worth noting that all the methods in Table VI are based on the original FERSNet, without pre-training on BU-4DFE. To make a fair comparison, we enlarge the kernel size and network depth in “FERSNet w/o MTL” and “FERSNet w/o ConvFLU” to make their model capacity equal to or larger than that of the original FERSNet.
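The balancing effect of this augmentation can be seen in a short sketch. Here `synthesize` is a hypothetical stand-in for the trained FES branch, and keeping the real image for its own label (rather than re-synthesizing it) is an assumption. Since every sample contributes exactly one image per expression class, each class ends up with as many images as there are original samples.

```python
from collections import Counter


def balanced_augmentation(samples, labels, classes, synthesize):
    """Return (image, label) pairs with every class equally represented."""
    augmented = []
    for image, label in zip(samples, labels):
        for target in classes:
            # Keep the real image for its own label, synthesize the others.
            new_image = image if target == label else synthesize(image, target)
            augmented.append((new_image, target))
    return augmented


# Usage with an imbalanced toy set: two "An" samples and one "Ha" sample.
toy_samples = ["a", "b", "c"]
toy_labels = ["An", "An", "Ha"]
toy_classes = ["An", "Ha", "Su"]
toy_augmented = balanced_augmentation(
    toy_samples, toy_labels, toy_classes, lambda img, target: f"{img}->{target}"
)
class_counts = Counter(label for _, label in toy_augmented)
```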
It is obvious from the table that jointly learning with FES contributes significantly to the FER performance, as the original FERSNet outperforms “FERSNet w/o MTL” by about 3%, 10%, and 8% on CK+, Oulu-CASIA, and MMI, respectively. In addition, the proposed soft parameter-sharing strategy can further enhance the performance of MTL, because, without ConvFLU, the performance will be degraded by more than 2% on the three datasets. More importantly, the FES branch can serve as a data augmentation method for FER. The original FERSNet, fine-tuned on the synthetic samples, acquires a better generalization ability for FER. With the FES data augmentation, the model can further obtain an accuracy gain of about 0.4%, 4%, and 2.5% on CK+, Oulu-CASIA, and MMI, respectively.
In this paper, we have proposed a multi-task network, namely FERSNet, for facial expression recognition (FER) and facial expression synthesis (FES). FERSNet aims to solve the two tasks in parallel with the proposed convolutional feature leaky units (ConvFLU). ConvFLU adopts a soft parameter-sharing strategy, in order to filter out the useless and harmful features, when transferring information between FER and FES. Moreover, we further employ the FES branch for data augmentation to enlarge and balance the training dataset. This augmentation approach contributes to a better generalization of FERSNet for recognizing facial expressions in real-world applications. We evaluate the proposed method on three commonly used benchmarks. The experimental results have demonstrated that the proposed method achieves state-of-the-art performance, which makes it a potential solution to practical facial image analysis problems.