Facial activity, one of the most important cues for sensing human emotion and intention, has been extensively studied over the past decades. A robust system for facial activity analysis should be able to recognize the basic expressions [lucey2010extended], i.e., anger, disgust, fear, happy, sad, and surprise, and/or compound expressions from facial images. With the development of human-computer interaction, a highly accurate and efficient facial expression recognition (FER) system is desired in various real scenarios, such as remote education, online entertainment, and intelligent autonomous transportation.
As pointed out by [Liu2014], FER can be categorized as a standard image-level classification problem that consists of three major steps, i.e., feature learning, feature selection, and classifier construction. Over the past decades, a great number of works have addressed these three aspects to construct a robust FER system. Before the emergence of deep learning methods, researchers manually designed various hand-crafted features, such as local binary patterns (LBP), histograms of oriented gradients (HoG), and the scale-invariant feature transform (SIFT), to extract discriminative information from the given inputs. By modeling the appearance and geometry changes of faces when a target expression is activated, works utilizing hand-crafted features [zhang1998comparison, zhang2005active, tian2002evaluation, eckhardt2009towards, yang2007boosting, hu2008multi, dahmane2011emotion, senechal2011combining, valstar2012meta] achieved great advances in recognizing facial expressions from images collected in controlled environments. However, such manually designed features depend heavily on human expertise and are task-specific, and therefore their generalization ability is often questioned. Recently, with the application of deep learning techniques, e.g., Convolutional Neural Networks (CNNs), to FER system construction [meng2017identity, Zhang_2018_CVPR, yang2018facial], the above-mentioned problems have been greatly alleviated.
Compared with previous works based on hand-crafted features, CNN-based methods for facial expression recognition bring significant benefits. First, CNNs integrate feature learning and classifier construction in a unified framework, rather than treating them individually and independently. The whole learning process becomes highly efficient, and the features and classifier can be optimized jointly in an end-to-end manner. Second, CNNs are hierarchical structures, which makes the features learned in higher layers more structural and semantic.
However, the successful application of CNNs to various vision problems does not come for free. To train a robust network for a specific recognition task, a large amount of high-quality labeled data is required. For example, to train a CNN for an object localization task, the object class and the corresponding spatial position (bounding box) must be provided. In real scenarios, collecting a sufficient amount of data with high-quality labels is quite difficult, if not impossible. On the one hand, it is not always easy to acquire enough data for some tasks due to privacy or safety concerns. On the other hand, manually labeling the collected data is time-consuming and labor-intensive.
In facial expression recognition, CNNs face a similar problem: the shortage of data with high-quality labels. Most released datasets [lucey2010extended, pantic2005web, lyons1998japanese] for facial expression recognition are collected in a controlled environment. In such an environment, the variations introduced by factors such as head pose and lighting conditions are limited and fail to represent real scenarios. In other words, facial expression data collected in a lab-controlled setting differ significantly from facial expression data collected in the wild. This well-known “domain shift” issue degrades the generalization of the trained network to “unseen” data. Moreover, due to privacy concerns, collecting facial expression data in a lab-controlled environment is difficult and time-consuming, which limits such datasets to a small size (thousands or even hundreds of samples).
Inspired by the progress of semi-supervised and unsupervised learning, researchers in facial expression recognition have turned to the huge number of facial images on the Internet to solve the long-standing data shortage problem. However, how to take advantage of such huge-scale data together with the small labeled data is still under exploration. One plausible way to make full use of the huge-scale data is to use them for unsupervised learning, e.g., pre-training the network [Li2020] to provide a good initialization for the subsequent fine-tuning step. An alternative, and possibly better, way is to perform semi/omni-supervised learning, exploiting the existing labeled data together with large-scale unlabeled data to train the network. As pointed out by [radosavovic2018data], an omni-supervised learner, which utilizes both large-scale unlabeled data and small labeled data, can outperform its counterpart that utilizes labeled data alone.
To this end, we propose a simple yet effective omni-supervised baseline for facial expression recognition, in which we exploit useful knowledge from a large-scale unlabeled dataset to improve the performance of the facial expression learner. Under the omni-supervised learning setting, we annotate the unlabeled data automatically. After discarding unreliable samples, which might introduce noise into training, a large labeled dataset (nearly 140K samples) of good quality is constructed. As our experimental results show, training with this constructed dataset together with the manually labeled training set greatly boosts the recognition accuracy on the test set.
Although the newly constructed dataset can significantly improve the performance of FER, it also brings new issues. First, the constructed dataset is large (over a hundred thousand samples), which makes training time-consuming and computationally demanding. Thus, it may not be affordable for research groups with limited computational resources. Second, since the constructed dataset is annotated automatically without human intervention, it may contain incorrectly labeled samples, which introduce noise. To tackle these issues, we propose to utilize the dataset distillation strategy [wang2018dataset] to distill the task-related knowledge from the constructed dataset and compress it into a small set of images (in our case, one image per category). Compared with the full constructed dataset, the distilled version adds an almost negligible computational cost during training. Despite its small size, the distilled dataset is still capable of boosting the recognition accuracy.
In general, we list our main contributions in this work as follows:
We construct a large-scale dataset and exploit it for omni-supervised facial expression recognition, aiming to boost the recognition accuracy. Unlike previous works that utilize an unlabeled dataset to pre-train the network and provide an initialization for downstream fine-tuning, our method employs an omni-supervised learning setting that takes advantage of the constructed huge-scale dataset as well as the small labeled dataset to strengthen the network.
To improve generalization, decrease the training cost, and balance the class ratios, we propose to utilize the dataset distillation strategy to summarize the knowledge from the constructed dataset. Specifically, with dataset distillation, the constructed dataset of around 140K images is compressed into a significantly smaller image set with one image per class. The highly summarized knowledge in the distilled set can be exploited together with the small labeled data to strengthen the network and improve its generalization.
We conduct extensive experiments on challenging FER datasets collected in the wild, demonstrating the superiority of the simple yet effective baseline method. More importantly, we also conduct cross-dataset evaluations, i.e., training a CNN on a source dataset and testing it on a different target dataset, which is quite challenging due to the well-known “domain shift” issue. Our performance in this cross-dataset setting demonstrates the superior generalization of the proposed method.
2 Related Work
This section focuses on facial expression recognition and omni-supervised learning, the two topics most closely related to this article.
2.1 Facial Expression Recognition
As pointed out in [Liu2014], facial expression recognition, as a standard image-level classification problem, “can be performed by three major steps: feature learning/extraction, feature selection, and classifier construction.” Most previous works focus on how to extract or learn distinctive representations from the given input. Before 2014, the majority of existing methods utilized handcrafted features [zhang1998comparison, zhang2005active, tian2002evaluation, eckhardt2009towards, yang2007boosting, hu2008multi, dahmane2011emotion, senechal2011combining, valstar2012meta] or shallow features learned from data [zafeiriou2010sparse, ying2010facial, liu2013improving, zhong2012learning]. [zhang1998comparison, zhang2005active, tian2002evaluation, yang2007boosting] propose Gabor-wavelet-based features to capture geometric information and achieve high sensitivity to facial expressions. [eckhardt2009towards, yang2007boosting] utilize boosting methods to learn a set of discriminative features from the given samples. [hu2008multi, dahmane2011emotion] take advantage of Histograms of Oriented Gradients (HOG) to model the appearance changes when target expressions are activated. [senechal2011combining, valstar2012meta] use Local Binary Patterns (LBP) to investigate the role of feature representation in facial expression analysis. Following a data-driven mechanism, [zafeiriou2010sparse, ying2010facial, liu2013improving, zhong2012learning] propose to learn underlying basic patterns from the input to model the variations in facial expressions. Readers may refer to [zeng2008survey, sariyanidi2014automatic, zhang2017facial], which provide extensive literature reviews of works utilizing traditional handcrafted features and shallow features learned from data.
With the development of modern Convolutional Neural Networks (CNNs), various deep learning methods have been applied to facial expression recognition. Recent deep-learning-based works [Liu2014, mollahosseini2016going, meng2017identity, Zhang_2018_CVPR, yang2018facial, li2018occlusion] outperform previous works that do not use deep learning. [Liu2014] proposes to unify a Deep Belief Network and boosting to perform feature learning, feature selection, and classifier construction jointly. [mollahosseini2016going] constructs an Inception-style network and achieves promising recognition rates on seven public facial expression databases. To focus on the variations introduced by expressions, [meng2017identity, Zhang_2018_CVPR, yang2018facial] propose to disentangle expression-sensitive knowledge from expression-insensitive knowledge. To suppress the variations introduced by different identities, [meng2017identity] proposes an identity-aware convolutional neural network to separate identity-sensitive knowledge from expression-sensitive knowledge in the input. [Zhang_2018_CVPR] proposes to use generative adversarial networks (GANs) to disentangle expression variations from pose variations. [yang2018facial] proposes to extract expressive facial information with a dedicated de-expression module. To recognize facial expressions in various unconstrained conditions, [li2018occlusion] proposes to utilize attention mechanisms in CNNs, and [Pan2019] utilizes a privileged learning mechanism for recognizing occluded facial expressions. Interested readers can refer to [Huang2020] for a systematic review of deep facial expression recognition.
2.2 Semi/Omni-supervised Learning
A successful CNN-based feature learning process depends heavily on large-scale training data with high-quality labels. Unfortunately, manually labeling huge amounts of data with high quality is time-consuming and labor-intensive. Without enough labeled data, it is difficult to learn highly distinctive features that generalize well. To deal with this limitation, semi-supervised learning [zhu2005semi] is integrated into the CNN learning process to improve the final performance. The earliest approach to semi-supervised learning generates pseudo labels for unannotated samples with any trained machine learning model at hand, e.g., a support vector machine, AdaBoost, or a CNN. Samples whose pseudo labels have sufficiently high confidence scores are combined with the labeled data to “refine” the learning model. Recently, a few works [tarvainen2017mean, laine2016temporal, DBLP:journals/corr/abs-1809-09925, radosavovic2018data] have introduced deep learning into the semi-supervised learning framework, trying to enlarge the amount of data used for training the network. Without changing the network architecture, [tarvainen2017mean] obtains a strong model with fewer labeled data by averaging model weights over time. The high-level idea of [laine2016temporal] is similar to that of [tarvainen2017mean]; the difference is that [laine2016temporal] ensembles the outputs, rather than the model weights, over time. [DBLP:journals/corr/abs-1809-09925] combines a graph convolutional network (GCN) with Mean Teacher to capitalize on the information from unlabeled data. [radosavovic2018data] successfully applies the proposed data distillation method to challenging data collected in real scenarios.
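The two ideas in this paragraph, keeping only confidently pseudo-labeled samples and averaging model weights over time, can be illustrated with a minimal sketch. The function names, the 0.9 confidence threshold, and the smoothing factor `alpha` are illustrative assumptions and are not taken from the cited works.

```python
def select_confident(probabilities, threshold=0.9):
    """Return (sample index, pseudo label) pairs whose top class score reaches the threshold.

    `probabilities` is a list of per-sample class probability lists.
    The threshold value is an assumption made for this illustration.
    """
    selected = []
    for index, probs in enumerate(probabilities):
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((index, label))
    return selected


def ema_update(teacher, student, alpha=0.99):
    """Exponential moving average of weights, in the spirit of temporal weight averaging.

    Computes teacher <- alpha * teacher + (1 - alpha) * student, element by element.
    """
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


scores = [[0.95, 0.05], [0.60, 0.40], [0.02, 0.98]]
confident = select_confident(scores, threshold=0.9)
averaged = ema_update([1.0, 0.0], [0.0, 1.0], alpha=0.9)
```

In this toy input, the second sample is dropped because its best score (0.60) is below the threshold, while the other two keep their pseudo labels.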
To boost the discriminative capability of the learned features, previous works either propose new loss and regularization terms to guide the learning process [meng2017identity], or introduce huge-scale unlabeled data to pre-train the network and provide a good initialization for downstream fine-tuning on labeled data [cai2018probabilistic]. Our work is more related to the latter. Both strategies have their limitations. Designing a new loss or regularization term is not a trivial task and might complicate the network training process. Introducing huge-scale unlabeled data to pre-train the network consumes additional computational resources and time. Moreover, there usually exists a domain gap between the unlabeled data and the labeled data; for example, unlabeled data might be collected in real scenarios while labeled data might be collected in the lab. Our method differs from both strategies: it conducts omni-supervised learning by utilizing a huge-scale dataset, which is constructed automatically, together with the original labeled dataset. Our experiments show that omni-supervised learning outperforms the supervised learning setting, which uses only the small original labeled dataset. Further, to reduce the training cost and minimize the negative influence of noise from incorrectly labeled data, we utilize dataset distillation to compress the huge-scale collected data into a highly compressed and refined small set.
In this section, we describe the proposed method step by step. As shown in Fig. 1, we first utilize the small labeled data to train a primitive classifier (Sec. 3.2). Second, this primitive classifier is utilized to guide the selection of the most confident unlabeled data from large-scale sets (Sec. 3.3). Specifically, unlike previous works that choose confident samples based on their category likelihood, the proposed baseline method relies on the similarity between high-level features extracted from the layer before the softmax output layer. The selected samples are assembled into a new large-scale dataset, which is used together with the original small labeled dataset to strengthen the network. Further, to decrease the computation cost, save training time, and improve the generalization of the network, we propose to utilize the dataset distillation strategy to compress the huge-scale constructed dataset into a very small set with one image per class (Sec. 3.4).
3.1 Task Definition
Facial expression recognition aims to learn a mapping function that maps an input image to a latent representation. This latent representation should be discriminative enough to distinguish samples from different classes. In this article, the mapping is performed by a Convolutional Neural Network (CNN) learned from labeled data by minimizing a loss function over the classifier parameters together with a regularization term.
In this work, we utilize knowledge distillation (Dis(.)) to compress the knowledge from the additionally collected, automatically labeled data. We fuse the distilled knowledge into the training process to boost the recognition power of the network. Since the knowledge is distilled and highly compressed, the introduced computation cost becomes negligible. The training objective is thereby extended to cover the distilled data as well as the labeled data.
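As a point of reference, the two objectives described in this subsection can be written in one plausible form. The notation is assumed here for illustration and is not taken from the source: $F(\cdot;\theta)$ is the CNN with parameters $\theta$, $\mathcal{D}_l$ is the labeled set, $\mathcal{D}_a$ is the automatically labeled auxiliary set, $\mathcal{L}$ is the loss function, and $R$ is the regularization term.

```latex
% Supervised objective on the labeled data only (assumed notation)
\theta^{*} = \arg\min_{\theta} \sum_{(x_i, y_i) \in \mathcal{D}_l}
             \mathcal{L}\big(F(x_i; \theta),\, y_i\big) + R(\theta)

% Objective after fusing the distilled auxiliary data (assumed notation)
\theta^{*} = \arg\min_{\theta} \sum_{(x, y) \in \mathcal{D}_l \cup \mathrm{Dis}(\mathcal{D}_a)}
             \mathcal{L}\big(F(x; \theta),\, y\big) + R(\theta)
```

Under this reading, the only change between the two objectives is the training set: the distilled set $\mathrm{Dis}(\mathcal{D}_a)$ is tiny, so the sum grows by only a handful of terms.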
3.2 Primitive Learner Training on Labeled Data
As shown in Fig. 1 (a), in the first step, we utilize the available labeled data, i.e., $\mathcal{D}_l = \{(x_i, y_i)\}_{i=1}^{N}$, to train a primitive learner $P_{\theta}$, where $\theta$ denotes the model parameters. For convenience in the following discussion, we refer to this dataset as Anchor Data (AD). Given an input $x_i$, the trained primitive learner needs to predict the correct label, i.e., $y_i$. For a fair comparison, we follow the common practice of [meng2017identity, cai2018probabilistic] and choose VGG-Face [Parkhi15] and ResNet [he2016deep] to train our primitive learner. Both architectures have demonstrated their effectiveness for visual recognition problems [Parkhi15, he2016deep, hu2018squeeze, luo2019significance]. To adapt these architectures to our target task, i.e., recognizing basic facial expressions, we replace their final classifier layers and change the number of outputs to the number of target facial expression classes. In this article, the number of outputs is set to 7 since there are seven facial expressions to detect, i.e., anger, happy, disgust, fear, sad, surprise, and neutral. In this step, the primitive learner is trained only on annotated facial expression data. Because manual annotation is highly labor-intensive, the number of labeled samples is small, typically in the thousands. A network trained on such limited labeled data usually has limited generalization ability.
3.3 Auxiliary Samples Collection from Unlabeled Data
This work aims to improve the recognition power of the primitive learner by using the distilled knowledge from huge-scale unlabeled data, denoted as $\mathcal{D}_u$. One straightforward approach is to feed all the unlabeled data into the primitive learner $P_{\theta}$ to generate pseudo labels $\hat{y}^{u}$ as well as the latent representations $f^{u}$ in a high-dimensional feature space. The unlabeled data and their pseudo labels can then be fed into $P_{\theta}$ to update the model parameters. Concretely, to generate the pseudo label for an unlabeled sample $x^{u}$, a softmax classifier can be used to map the corresponding latent feature $f^{u}$ into a probability score by the following formulation:
$$p(k \mid x^{u}) = \frac{\exp(w_k^{\top} f^{u})}{\sum_{j=1}^{K} \exp(w_j^{\top} f^{u})},$$
where $w_k$ is the weight vector for the $k$-th class and $K$ is the number of classes. In our experiments, $K$ is set to 7.
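The softmax-based pseudo-labeling described above can be sketched in a few lines. This is an illustration only: the function names and the toy weight vectors are assumptions, and a real primitive learner would supply the latent feature and the per-class weights.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(feature, class_weights):
    """Return (label, probability) for the most likely class.

    `class_weights` holds one weight vector per class; each logit is the
    inner product between that vector and the latent feature.
    """
    logits = [sum(w * f for w, f in zip(weights, feature)) for weights in class_weights]
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


# Toy example with three classes and a 2-D latent feature.
toy_weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
label, confidence = pseudo_label([1.0, 0.0], toy_weights)
```

The returned probability is the score that likelihood-based selection schemes would threshold.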
Utilizing unlabeled data with pseudo labels generated from probability scores to update the original network has been explored in previous works on semi-supervised learning [zhu2005semi] and unsupervised learning [Luo_2019_CVPR]. However, the labeled and unlabeled data in FER are generally collected in different environments with different cameras, which makes the conditional distributions of those databases differ. Under this circumstance, directly combining the pseudo-labeled unlabeled data with the original labeled data might introduce unexpected noise into network training and consequently degrade the final performance. This phenomenon is known as “domain shift” and has been studied in previous works [Li2020]. As pointed out by [Li2020], in such a case, the generated probability scores and the corresponding pseudo labels for the unlabeled data might be error-prone and mislead the network updating process.
As concluded in [zhang2019category], the learned latent features of samples in the same class tend to cluster together. The dimension of the latent feature is usually much higher than that of the probability score, which makes latent features more robust in the learning process. Therefore, we follow a strategy similar to that of [zhang2019category] to distill clean and relevant knowledge from the unlabeled data under the guidance of the labeled data.
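For each labeled sample $x_i$, we feed it into the primitive network to calculate its feature representation $f_i$. Since we have the ground-truth label $y_i$ for each labeled sample, we can calculate the centroid $c_k$ in the feature space for each class by the following formulation:
$$c_k = \frac{\sum_{i=1}^{N} \mathbb{1}(y_i = k)\, f_i}{\sum_{i=1}^{N} \mathbb{1}(y_i = k)},$$
where $\mathbb{1}(\cdot)$ is 1 when its argument is true and 0 otherwise, and $k$ is the class index for basic facial expressions.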
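The per-class centroid computation can be sketched as follows. The function name and the toy features are assumptions for illustration; in practice the features would come from the primitive network.

```python
def class_centroids(features, labels, num_classes):
    """Average the feature vectors of each class.

    Returns one centroid per class, or None for a class with no samples.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for feature, label in zip(features, labels):
        counts[label] += 1
        for d, value in enumerate(feature):
            sums[label][d] += value
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else None
        for k in range(num_classes)
    ]


toy_features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
toy_labels = [0, 0, 1]
centroids = class_centroids(toy_features, toy_labels, num_classes=3)
```

Class 2 has no labeled samples in this toy input, so its centroid is left undefined rather than set to zero.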
As shown in Fig. 1 (b) and (c), for each unlabeled sample $x^{u}$, we also feed it into the primitive network to produce its feature representation $f^{u}$, and then compute the distance between $f^{u}$ and each facial expression centroid, i.e., $c_k$. Unlike [zhang2019category], which employs the Euclidean distance, we utilize the cosine distance $d(f^{u}, c_k)$, which can be calculated as follows:
$$d(f^{u}, c_k) = 1 - \frac{f^{u} \cdot c_k}{\lVert f^{u} \rVert \, \lVert c_k \rVert}.$$
Then, we set the pseudo label for each unlabeled sample based on the following criterion:
$$\hat{y}^{u} = \arg\min_{k} d(f^{u}, c_k),$$
where $k$ is the class index for the facial expressions.
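A minimal sketch of centroid-based pseudo-labeling with the cosine distance is given below. The optional `max_distance` cut-off for discarding uncertain samples is a hypothetical parameter added for illustration; the source does not state the exact rejection rule. Feature vectors and centroids are assumed to be non-zero.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus cosine similarity (vectors must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def assign_pseudo_label(feature, centroids, max_distance=None):
    """Pick the class whose centroid is closest in cosine distance.

    Returns (label, distance); label is None when the closest centroid is
    farther than the hypothetical `max_distance` cut-off.
    """
    distances = [cosine_distance(feature, centroid) for centroid in centroids]
    label = min(range(len(distances)), key=distances.__getitem__)
    if max_distance is not None and distances[label] > max_distance:
        return None, distances[label]
    return label, distances[label]


toy_centroids = [[1.0, 0.0], [0.0, 1.0]]
near_label, near_distance = assign_pseudo_label([0.9, 0.1], toy_centroids)
rejected_label, _ = assign_pseudo_label([1.0, 1.0], toy_centroids, max_distance=0.1)
```

The second query sits exactly between the two centroids, so it is rejected under the cut-off; sorting accepted samples by distance would give a confidence ranking.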
Compared with unlabeled samples selected by their conditional likelihood scores, pseudo labels generated from high-dimensional features are more robust and less error-prone. The selected samples from the unlabeled database are combined with those from the labeled database to help update the model by “learning with more trustable data”. For simplicity in the following discussion, we refer to the samples selected from an unlabeled database as auxiliary samples (ASs). In our experiments, the number of auxiliary samples selected from the unlabeled MS-Celeb-1M is around 140K. We show examples of selected auxiliary samples in Figure 3, in which the selected samples are sorted by their confidence values.
3.4 Knowledge Distillation from Auxiliary Samples
The benefit of the auxiliary samples drawn from the Internet-scale unlabeled dataset is demonstrated by our experimental results in Sections 4.4.1 and 4.4.2. However, the number of auxiliary samples can be huge, e.g., around 140K in our case. This number keeps increasing if we continue to introduce more unlabeled samples from the Internet. Under this circumstance, directly combining the auxiliary samples with the labeled data to assist network training becomes computationally expensive. As shown in Figure 5, the computation time for each epoch increases considerably when the large number of auxiliary samples is employed directly. In addition, since noise could also be introduced by the auxiliary samples selected from unlabeled data, further effort is required to find a more effective way to exploit the useful knowledge and compress it to improve generalization.
As shown in Fig. 1 (d), to achieve this goal, we employ dataset distillation [wang2018dataset] to distill and compress the useful knowledge from the selected auxiliary samples. Dataset distillation synthesizes a small group of data, which need not strictly follow the distribution of the original data. Nevertheless, training on the synthesized data can still approximate training on the original data. The synthetic data are obtained by minimizing a modified objective function, which is briefly described as follows.
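We start from standard training. At each training step $t$, standard training with the original dataset ($\mathbf{x} = \{x_i\}_{i=1}^{N}$) utilizes stochastic gradient descent or one of its variants to update the model parameters as:
$$\theta_{t+1} = \theta_t - \eta \nabla_{\theta_t} \ell(\mathbf{x}_t, \theta_t),$$
where $\theta_t$ denotes the model parameters at training step $t$, $\mathbf{x}_t$ is the data sampled at step $t$, $\ell$ is the loss function, and $\eta$ is the learning rate.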
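The update rule can be made concrete with a small sketch on a toy linear model with a squared-error loss. The model, the loss, and the function names are assumptions chosen only to keep the example self-contained; the batch plays the role of the data sampled at the current step.

```python
def batch_gradient(theta, batch):
    """Gradient of the mean of 0.5 * (theta . x - y)^2 over the batch."""
    grad = [0.0] * len(theta)
    for features, target in batch:
        error = sum(t * f for t, f in zip(theta, features)) - target
        for d, f in enumerate(features):
            grad[d] += error * f / len(batch)
    return grad


def sgd_step(theta, batch, learning_rate):
    """One update: theta <- theta - learning_rate * gradient."""
    grad = batch_gradient(theta, batch)
    return [t - learning_rate * g for t, g in zip(theta, grad)]


# Toy data generated by y = 2 * x; the ideal parameter is therefore 2.0.
toy_batch = [([1.0], 2.0), ([2.0], 4.0)]
theta_one_step = sgd_step([0.0], toy_batch, learning_rate=0.1)

theta_many_steps = [0.0]
for _ in range(100):
    theta_many_steps = sgd_step(theta_many_steps, toy_batch, learning_rate=0.1)
```

A single step moves the parameter only part of the way toward the optimum, which is why many steps are needed when training on the full data.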
When the original dataset is large ($N$ is a big number), millions of update steps might be needed for the model parameters to converge, which consumes heavy computational resources and time. To address this limitation, dataset distillation instead proposes to distill a tiny group of training data $\tilde{\mathbf{x}} = \{\tilde{x}_i\}_{i=1}^{M}$ from the original large training dataset of size $N$. The distilled data are significantly smaller than the original data, i.e., $M \ll N$. Given the distilled data, it becomes possible to locate good model parameters in very few stochastic gradient descent steps without sacrificing much performance.
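Assuming that only a single SGD step is needed in the ideal case, as in [wang2018dataset], the synthetic distilled data can be located by minimizing the following objective function:
$$\tilde{\mathbf{x}}^{*} = \arg\min_{\tilde{\mathbf{x}}} \ell(\mathbf{x}, \theta_1) = \arg\min_{\tilde{\mathbf{x}}} \ell\big(\mathbf{x},\, \theta_0 - \tilde{\eta} \nabla_{\theta_0} \ell(\tilde{\mathbf{x}}, \theta_0)\big),$$
where $\theta_0$ is an initialization, $\mathbf{x}$ is the original data, which is often huge compared with $\tilde{\mathbf{x}}$, and $\tilde{\eta}$ is the learning rate. (In [wang2018dataset], $\tilde{\eta}$ is also optimized through the objective function; we omit it here to simplify the discussion.)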
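The one-step objective can be illustrated on a deliberately tiny problem: a scalar linear model, a single synthetic point, and an outer gradient estimated by finite differences. All of these choices (the model, the fixed synthetic label, the step sizes, and the function names) are assumptions for illustration and do not reproduce the image-level procedure of [wang2018dataset].

```python
def inner_step(theta0, x_tilde, y_tilde, lr):
    """One SGD step on the synthetic point for the model y = theta * x."""
    grad = (theta0 * x_tilde - y_tilde) * x_tilde
    return theta0 - lr * grad


def real_loss(theta, data):
    """Mean squared-error loss of the model on the real data."""
    return sum(0.5 * (theta * x - y) ** 2 for x, y in data) / len(data)


def distill_point(data, theta0=0.0, y_tilde=1.0, lr=0.5,
                  outer_lr=0.1, steps=300, eps=1e-5):
    """Learn a synthetic input so that one inner step fits the real data."""
    x_tilde = 1.0

    def objective(candidate):
        # Loss on the real data after a single step on the synthetic point.
        return real_loss(inner_step(theta0, candidate, y_tilde, lr), data)

    for _ in range(steps):
        # Central finite difference stands in for backpropagation here.
        grad = (objective(x_tilde + eps) - objective(x_tilde - eps)) / (2.0 * eps)
        x_tilde -= outer_lr * grad
    return x_tilde


real_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2 * x
x_star = distill_point(real_data)
theta_after_one_step = inner_step(0.0, x_star, 1.0, 0.5)
```

After optimization, a single gradient step on the one synthetic point lands close to the parameter that fits all the real data, which is the behavior the objective is designed to produce.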
In summary, utilizing the distilled images rather than the large original collected dataset has three advantages: 1) the distilled set is small, and therefore introducing it into training does not increase the computation cost significantly; 2) compared with the raw auxiliary samples, the distilled images are highly compressed and generalize well, which our experiments show to benefit the cross-dataset evaluation settings; 3) the numbers of samples per class in the distilled data are balanced, which has the potential to stabilize the training process.
4 Experimental Results
This section evaluates the performance of the proposed method to demonstrate its effectiveness. First, an inner-database evaluation on two in-the-wild datasets is conducted to demonstrate the efficacy of our collected large-scale dataset as well as its distilled version; second, a cross-database evaluation that takes advantage of the distilled auxiliary samples is conducted to demonstrate the generalization ability of our method.
We conduct extensive experiments to demonstrate the effectiveness of the proposed omni-supervised learning baseline. Specifically, two different architectures, i.e., ResNet-34 and VGG-Face, are evaluated on five datasets, including two real-world datasets, i.e., FER2013 [goodfellow2015challenges] and Real-world Affective Face Database (RAF-DB) 2.0 [li2018reliable], and three lab-controlled datasets, i.e., Extended Cohn-Kanade (CK+) [lucey2010extended], Japanese Female Facial Expression (JAFFE) [lyons1998japanese], and MMI [pantic2005web]. Auxiliary samples are selected from MS-Celeb-1M [guo2016ms], which does not have any expression labels and is therefore treated as an unlabeled database.
4.1 Experimental Settings
We employ two evaluation settings, i.e., inner-database evaluation and cross-database evaluation.
Inner-database evaluation: The training set and test set are from the same database. For example, for an inner-database evaluation on RAF-DB 2.0, we use the RAF-DB 2.0 training set and additional images to train the network, and then evaluate the trained network on the RAF-DB 2.0 test set.
Cross-database evaluation: The training set and test set are from different databases. We conduct the inner-database evaluation on two in-the-wild databases, i.e., RAF-DB 2.0 and FER2013, and one lab-controlled database, i.e., CK+. For the cross-database evaluation, we train on RAF-DB 2.0 and test on CK+, JAFFE, and MMI, following [Li2020] for a fair comparison.
We employ the following two real-world datasets and three lab-controlled datasets to conduct a comprehensive evaluation for validating the efficacy of our method.
RAF-DB 2.0 [li2018reliable] contains 29,672 facial images collected from the Internet. It is a real-world database consisting of highly diverse samples. To obtain reliable labels for those samples, manual crowd-sourced annotation is conducted in [li2018reliable]. There are two label sets in this dataset, i.e., the basic expression label set and the compound emotion label set. In this article, we focus on recognizing the basic expressions and therefore only utilize the basic expression label set, which contains 15,339 images divided into a training set (12,271 images) and a test set (3,068 images).
FER2013 [goodfellow2015challenges] was constructed for the ICML 2013 Challenges in Representation Learning and contains 28,709 training images, 3,589 validation images, and 3,589 test images with basic expression labels. After being collected from the Internet automatically with the Google search engine, all images were aligned and resized to 48×48 pixels.
CK+ [lucey2010extended] is a laboratory-controlled database that has been widely used in previous FER works and contains 593 video sequences collected from 123 subjects. Each sequence shifts from a neutral expression in the first frame to the peak expression in the last frame. Among the 593 video sequences, 327 are labeled with basic expressions based on the Facial Action Coding System (FACS). Since CK+ does not provide an official training/validation/test split, for a fair comparison we follow the setting of the previous work [Liu2014] to prepare the data. Specifically, for each labeled sequence, we take the first frame as the neutral face and the last three frames as peak frames with the corresponding label, resulting in 1,308 images in total. The 1,308 images are then divided into 10 groups for 10-fold cross-validation experiments.
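The frame selection and the resulting image count (327 sequences × 4 frames = 1,308 images) can be checked with a short sketch. The sequences below are synthetic stand-ins, and the round-robin assignment of whole sequences to groups is an assumption chosen so that frames from one sequence stay in the same group; the source does not spell out the exact split rule.

```python
def select_frames(frames, label):
    """Keep the first frame as 'neutral' and the last three frames with the sequence label."""
    return [(frames[0], "neutral")] + [(frame, label) for frame in frames[-3:]]


def split_into_groups(sequences, num_groups=10):
    """Round-robin assignment of whole sequences to groups (assumed split rule)."""
    return [sequences[i::num_groups] for i in range(num_groups)]


# Synthetic stand-in for the 327 labeled sequences: (frame identifiers, label).
sequences = [([f"seq{s}_frame{f}" for f in range(10)], "happy") for s in range(327)]
groups = split_into_groups(sequences)
images_per_group = [
    sum(len(select_frames(frames, label)) for frames, label in group)
    for group in groups
]
total_images = sum(images_per_group)
```

With this rule, each group holds 32 or 33 sequences, i.e., 128 or 132 images, and each group serves once as the test fold.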
MMI [pantic2005web] is constructed in a lab-controlled environment and contains 326 sequences in total, among which 213 sequences have basic expression labels. Unlike CK+, MMI is onset-apex-offset labeled: each sequence begins with a neutral expression, reaches the peak expression in the middle, and then returns to a neutral expression. As pointed out by [Huang2020], the subjects in MMI may perform the same expression in different ways, and occlusions, e.g., glasses and mustaches, are present for some subjects.
JAFFE [lyons1998japanese] is a laboratory-controlled database containing 213 samples from 10 Japanese females, where each subject performs basic expressions 3 or 4 times. Due to the limited number of samples in this database, we utilize a leave-one-subject-out experimental setting.
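A leave-one-subject-out protocol can be sketched as a generator that holds out all samples of one subject per round. The `(subject, sample)` tuple layout and the toy identifiers are assumptions for illustration.

```python
def leave_one_subject_out(samples):
    """Yield (held-out subject, training samples, test samples) for every subject.

    Each sample is a (subject_id, payload) tuple; the payload is not inspected.
    """
    subjects = sorted({subject for subject, _ in samples})
    for held_out in subjects:
        train = [sample for sample in samples if sample[0] != held_out]
        test = [sample for sample in samples if sample[0] == held_out]
        yield held_out, train, test


toy_samples = [("s1", "img_a"), ("s1", "img_b"), ("s2", "img_c"), ("s3", "img_d")]
folds = list(leave_one_subject_out(toy_samples))
```

For a database with 10 subjects, this yields 10 rounds, and the reported accuracy would typically be aggregated over all rounds.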
MS-Celeb-1M [guo2016ms] is a benchmark for celebrity recognition. All images in MS-Celeb-1M are collected from the Internet. Version 1 of this dataset contains around ten million images of one million celebrities captured in real scenarios. Due to its huge scale and variations, MS-Celeb-1M is one of the most challenging datasets. We utilize the pre-processed data provided at https://github.com/ZhaoJ9014/face.evoLVe.PyTorch, in which all the images are aligned by MTCNN and resized to a fixed size. There is no expression label in this database, and therefore we treat it as an unlabeled database for our target task, i.e., facial expression recognition.
In Figure. 2, we show some example images from the six databases utilized in this article. Those databases are collected in different scenarios, and therefore they have their own data bias. Specifically, there are huge variations between different datasets, which are introduced by various factors, e.g., illuminations, poses, subjects, scales. For example, all faces in CK+ and JAFFE are frontal, while in RAF-DB, FER-2013, large face pose variations exist. In FER-2013, CK+, and JAFFE, there is little color information to utilize. The way to express “anger” in RAF-DB, FER-2013 are quite different from that in CK+ and JAFFE. Based on the above observations, it is hard to directly utilize the data from the unlabeled database, i.e., MS-Celeb-1M, to provide auxiliary information to train a network to achieve higher recognition accuracy on the labeled databases. We choose RAF-DB 2.0 as the anchor database in order to make a fair comparison with previous work [Li2020].
4.3 Implementation Details
In our experiments, we utilize two different architectures, i.e., VGG-Face [Parkhi15] and ResNet-34 [he2016deep], to test the efficiency and generality of our method. We choose these two architectures for two reasons: first, they are frequently used in previous FER works, so adopting them allows a fair comparison; second, they are representatives of single-branch architectures (VGG-Face) and multiple-branch architectures (ResNet). All the images utilized in our experiments are aligned by MTCNN  and resized to . Random horizontal flipping with 50% probability is used for data augmentation. During training, we utilize stochastic gradient descent with a momentum of 0.9 to optimize the network. The initial learning rate is set to 0.001 and is multiplied by 0.1 every ten epochs. Unless otherwise noted, the total number of epochs is set to 25. We implement the experiments in PyTorch [NEURIPS2019_9015] and run all settings on a workstation with four NVIDIA RTX 2080Ti GPU cards.
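To make the training schedule concrete, the following dependency-free sketch reproduces the step learning-rate decay and the random horizontal flip described above. The function names are hypothetical and the image is represented as a plain list of pixel rows. In PyTorch, the same settings would plausibly map to `torch.optim.SGD(..., lr=0.001, momentum=0.9)`, `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)`, and `torchvision.transforms.RandomHorizontalFlip(p=0.5)`, although the exact calls used in the experiments are not stated and this mapping is an assumption.

```python
import random

def step_lr(epoch, base_lr=0.001, gamma=0.1, step_size=10):
    """Learning rate at a given epoch: multiplied by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)

def random_horizontal_flip(image, p=0.5, rng=random):
    """Mirror an image (a list of pixel rows) left-to-right with probability p."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image

# Learning rate for each of the 25 epochs:
# epochs 0-9 use 1e-3, epochs 10-19 use 1e-4, epochs 20-24 use 1e-5.
schedule = [step_lr(epoch) for epoch in range(25)]

# Flip a tiny 2x3 "image" with certainty to show the effect.
flipped = random_horizontal_flip([[1, 2, 3], [4, 5, 6]], p=1.0)
print(flipped)  # [[3, 2, 1], [6, 5, 4]]
```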
4.4 Performance Evaluation
This section verifies the efficacy of our method in terms of both discriminative ability and generalization capability. First, to show the effectiveness of our method, we conduct an inner-dataset evaluation, demonstrating that the selected auxiliary samples and their distilled version can boost the recognition accuracy; second, we conduct a cross-dataset evaluation to examine the generalization capability of our method. The performance is evaluated by the mean classification accuracy.
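The term "mean classification accuracy" admits two common readings: the overall fraction of correctly classified samples, or the average of the per-class accuracies. Which one is used is not specified here, so the sketch below computes both on toy predictions; the function names and the toy labels are hypothetical.

```python
from collections import defaultdict

def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-class accuracies, so every class counts equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy, imbalanced example: three "happy" samples and one "sad" sample.
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "happy", "anger"]

print(overall_accuracy(y_true, y_pred))         # 0.75
print(mean_per_class_accuracy(y_true, y_pred))  # 0.5
```

The two numbers diverge on imbalanced data, which is why it matters which definition a reported table follows.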
Under the inner-dataset setting, we train a primitive learner on the training set of the given source database, and then utilize the primitive learner to guide the selection of auxiliary samples from the unlabeled dataset. The knowledge from the selected auxiliary samples is injected into the learning process to improve the recognition accuracy, as demonstrated by our experimental results in Sec. 4.4.1 and 4.4.2. To further improve the generality of the auxiliary samples while decreasing the introduced computational cost, we utilize the dataset distillation strategy [wang2018dataset] to distill the knowledge from the selected auxiliary samples. The distilled auxiliary samples are highly expressive, with one image per class in our case. To make the discussion easier, we call the auxiliary samples before dataset distillation vanilla auxiliary samples, and the auxiliary samples after dataset distillation distilled auxiliary samples.
4.4.1 Inner-dataset Evaluations
We conduct the inner-dataset evaluations on two in-the-wild databases, i.e., RAF-DB 2.0 and FER-2013, and one lab-controlled database, i.e., CK+. We report the comparisons with state-of-the-art methods in Table. I, II, and III.
In Table. I, we evaluate our method on RAF-DB 2.0, which is the latest in-the-wild dataset. RAF-DB is utilized as the anchor dataset for training the primitive learner. Auxiliary samples are selected from MS1M. The auxiliary samples after distillation are named Distilled Auxiliary Samples (DAS), while the auxiliary samples before distillation are named Vanilla Auxiliary Samples (VAS). Performances on two architectures, i.e., VGG-Face and ResNet-34, are reported. The selected auxiliary samples are shown in Figure. 3.
Based on Table. I, we can find that after the introduction of the auxiliary samples provided by unlabeled MS-Celeb-1M, the proposed method outperforms the previous works. Specifically, compared to the previous work utilizing new loss functions [li2017reliable], our method does not need any new loss functions but still achieves a better performance with both architectures. Compared to [cai2018probabilistic], which introduces more related labels into training, our method still outperforms it by almost in terms of average accuracy. Moreover, it should be noted that the DAS set is small (7 samples in our experiment), and therefore the additional computational cost is small, as shown in Figure. 5. Despite this small additional cost, the assistance brought by DAS is still comparable with that brought by VAS (86.55 vs. 85.84 on VGG-Face, 85.24 vs. 85.84 on ResNet-34).
Table. II gives results on another in-the-wild dataset, i.e., FER-2013, which was constructed for an expression recognition challenge. In this setting, we test two anchor cases: one utilizes RAF-DB as the anchor dataset, and the other utilizes FER-2013 as the anchor dataset. The auxiliary samples are selected from MS-Celeb-1M in both cases. From Table. II, we can find that when introducing additional knowledge from MS-Celeb-1M, our vanilla and distilled results on VGG-Face are better (72.59 with VAS + VGG-Face and RAF-DB 2.0 as the anchor, 73.27 with DAS + VGG-Face and FER-2013 as the anchor) or comparable (72.12 with DAS + VGG-Face and RAF-DB 2.0 as the anchor, 72.08 with VAS + VGG-Face and FER-2013 as the anchor) compared with previous works. We also notice that under two settings, the performance is inferior to previous works by around 1%. Our conjecture is that this might come from the large variations in this dataset, including poses, illuminations, and scales.
We also evaluate our method on CK+, which is the most widely used database collected in a lab-controlled environment. For CK+, we only conduct the experiment using VGG-Face and report the performance in Table. III. Since the data size of CK+ (around 1K) is much smaller than that of the auxiliary samples (over 130K) selected with the RAF-DB 2.0 anchor, we do not conduct the experiment on CK+ with vanilla auxiliary samples, but only report the accuracy using distilled auxiliary samples. From Table. III, without introducing any new loss terms, our proposed method utilizing the knowledge from seven distilled auxiliary samples boosts the accuracy on this lab-controlled dataset from 93.42% (baseline) to 95.35%, and is comparable with the previous state-of-the-art method using additional attribute information [cai2018probabilistic].
4.4.2 Cross-dataset Evaluations
The cross-dataset evaluations and comparisons with previous state-of-the-art methods are shown in Table. IV-VII. We conduct cross-dataset evaluations on three lab-controlled databases, i.e., CK+, JAFFE, and MMI, and one in-the-wild database, i.e., FER-2013. The difference between our method and previous cross-database works is that our method only employs seven distilled auxiliary samples and combines them with the source dataset for training, without modifying the architecture, introducing new loss terms, or using large amounts of extra training data. Since the number of distilled auxiliary samples is small, i.e., only seven, the extra computational cost is negligible compared with the original training setting. In our experiment, VGG-Face is chosen as our backbone, RAF-DB 2.0 is utilized as the anchor database, and distilled auxiliary samples (DAS) are selected from MS1M and RAF-DB, which are named DAS-MS1M and DAS-RAF, respectively.
Based on experimental results, we can find that our method outperforms the previous works in most cases, from transferring to lab-controlled databases to transferring to in-the-wild databases, despite the fact that previous works utilize more datasets as the source domain or more complex network architectures. We will give more detailed analysis on the results for each dataset in the following paragraphs.
As shown in Table. IV, our method achieves much better performance for facial expression recognition than previous works [mollahosseini2016going, hasani2017spatio, wen2017ensemble, wang2018unsupervised, Li2020]. Note that CK+ is collected in a lab-controlled environment and is quite similar to other lab-controlled datasets, e.g., JAFFE and MMI. Utilizing lab-controlled datasets as source training datasets, as done in [mollahosseini2016going, hasani2017spatio], was expected to minimize the impact of domain shift. [wen2017ensemble, wang2018unsupervised, Li2020] tried to utilize datasets covering more variations to boost the generality of the trained network. However, compared with those previous works, our method improves the cross-dataset recognition accuracy on CK+ with only seven additional distilled auxiliary samples introduced into the training set, without any network architecture modification or any additional loss/regularization term. Compared with the latest work [Li2020], which uses the setting most similar to ours, our method achieves a higher performance (+1.41%).
The performance on the JAFFE dataset is reported in Table. V. As pointed out by previous works [Li2020], there is a high bias in this extremely small dataset (213 images). Due to this bias, the cross-dataset accuracy on this dataset is lower than that on CK+. However, our method still outperforms most of the previous works by large margins.
As shown in Table. VI, for the MMI dataset, our method improves the performance from 60.82 to 63.34 after introducing distilled auxiliary samples from the unlabeled dataset. This performance improvement comes at a negligible computational cost without any network architecture modifications. [zavarez2017cross] reported a higher accuracy using more extra data. In contrast, our method only employs the source dataset for training.
We also conduct a cross-dataset evaluation on an in-the-wild dataset, i.e., FER-2013. FER-2013 was constructed for challenges and has huge variations in ways of expression, head poses, scales, etc. Due to the heavy variations in FER-2013, the performance under the cross-dataset setting is also inferior to that on CK+. From Table. VII, we can find that the knowledge extracted from 6 datasets [mollahosseini2016going] fails to provide the same assistance as the work using one in-the-wild database [Li2020].
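Table I: Accuracy (%) on RAF-DB 2.0.

|Method|Accuracy (%)|
|---|---|
|VAS + VGG-F (Ours)|85.84|
|VAS + ResNet-34 (Ours)|85.84|
|DAS + VGG-F (Ours)|86.55|
|DAS + ResNet-34 (Ours)|85.24|

Table II: Accuracy (%) on FER-2013.

|Method|Anchor|Accuracy (%)|
|---|---|---|
|Ron et al. [breuer2017deep]|-|72.1|
|VAS + VGG-F (Ours)|RAF-DB 2.0|72.59|
|VAS + ResNet-34 (Ours)|RAF-DB 2.0|70.31|
|DAS + VGG-F (Ours)|RAF-DB 2.0|72.12|
|DAS + ResNet-34 (Ours)|RAF-DB 2.0|71.23|
|VAS + VGG-F (Ours)|FER-2013|72.08|
|VAS + ResNet-34 (Ours)|FER-2013|70.98|
|DAS + VGG-F (Ours)|FER-2013|73.27|
|DAS + ResNet-34 (Ours)|FER-2013|71.20|

Table III: Accuracy (%) on CK+.

|Method|Accuracy (%)|
|---|---|
|VGG-F (baseline) [cai2018probabilistic]|93.42|
|DAS + VGG-F (Ours)|95.35|

Table IV: Cross-dataset accuracy (%) on CK+.

|Method|Source|Target|Accuracy (%)|
|---|---|---|---|
|DAS-RAF + VGG-F|RAF-DB2.0|CK+|79.33|
|DAS-MS1M+DAS-RAF + VGG-F|RAF-DB2.0|CK+|79.41|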
We visualize selected auxiliary examples from MS-Celeb-1M in Figure. 3, and the distilled auxiliary samples in Figure. 4. In total, MS-Celeb-1M provides auxiliary samples to assist the downstream learning process.
Each row of Figure. 3 corresponds to one facial expression. The images in each row are sorted by the distance between their feature and the nearest facial expression centroid. The facial expression centroid is calculated using labeled samples (RAF-DB 2.0) by Eq. 4-8. Although there are large variations in MS-Celeb-1M and a large domain shift between MS-Celeb-1M and the labeled FER databases, the auxiliary samples selected by our method from MS-Celeb-1M are highly related to facial expression recognition. Among the images from MS-Celeb-1M shown in Fig. 3, despite the heavy illumination changes in the first column, second row (anger), the large poses in the first and third columns, sixth row (sad), and the occlusion in the fourth column, first row (neutral), our method still assigns correct expression labels to those samples. The selected auxiliary samples from the unlabeled database consequently play a significant role in updating the primitive learner, as demonstrated in Sec. 4.4.1 and 4.4.2.
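The centroid-based ordering used for Figure. 3 can be illustrated with a small stand-in. The sketch below is not the formulation of Eq. 4-8, which is not reproduced here: it assumes that each centroid is the plain mean of the labeled features of a class and that the distance is Euclidean. The helper names and the two-dimensional toy features are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def assign_and_rank(labeled, unlabeled):
    """Assign each unlabeled feature to its nearest class centroid.

    `labeled` maps a class label to a list of feature vectors.
    `unlabeled` maps a sample name to a feature vector.
    Returns a dict: label -> list of (distance, name), closest first.
    """
    centroids = {label: centroid(vecs) for label, vecs in labeled.items()}
    ranked = {label: [] for label in centroids}
    for name, feature in unlabeled.items():
        label, dist = min(
            ((lab, math.dist(feature, cen)) for lab, cen in centroids.items()),
            key=lambda pair: pair[1],
        )
        ranked[label].append((dist, name))
    for label in ranked:
        ranked[label].sort()
    return ranked

# Toy 2-D features standing in for network embeddings.
labeled = {"happy": [[0, 0], [0, 2]], "sad": [[10, 10], [10, 12]]}
unlabeled = {"a": [0, 1], "b": [1, 1], "c": [9, 11], "d": [3, 1]}

ranked = assign_and_rank(labeled, unlabeled)
print([name for _, name in ranked["happy"]])  # ['a', 'b', 'd']
print([name for _, name in ranked["sad"]])    # ['c']
```

Samples at the front of each list lie closest to their centroid, which corresponds to the left-most images of a row under this simplified reading.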
Do the distilled auxiliary samples really contain underlying patterns? Based on the visualized images in Figure. 4, it is difficult to find any semantic meaning in them. All of the distilled images are full of similar textures, which raises the question of whether those images can provide useful knowledge for our target tasks. We conduct a simple experiment to test whether there are any underlying patterns in those distilled images. We save the intermediate distilled images every ten epochs, obtaining 42 distilled images in total. We split them into a training set and a testing set with a ratio of 5:1, and then use the two sets to train and evaluate a CNN. Our intuition is this: if those distilled auxiliary samples do not have any regular patterns corresponding to classes, then the CNN cannot learn anything from them. What we observe is shown in Figure. 7. From this figure, we can find that the CNN trained on the distilled auxiliary samples converges very fast. The converged CNN can easily predict the images in the test set correctly, which means that there indeed exist some underlying patterns corresponding to different classes in each image.
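With 42 distilled images and a 5:1 ratio, the split yields 35 training images and 7 testing images. A minimal sketch of such a split follows; the helper name `split_by_ratio`, the image identifiers, the toy class labels, and the use of a plain random (non-stratified) shuffle are assumptions, since the exact splitting procedure is not described.

```python
import random

def split_by_ratio(items, train_parts=5, test_parts=1, seed=0):
    """Shuffle items and split them into train/test sets by the given ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:n_train], shuffled[n_train:]

# Toy stand-in for the 42 saved distilled images: (identifier, class index).
distilled = [(f"distilled_{i}", i % 7) for i in range(42)]

train_set, test_set = split_by_ratio(distilled)
print(len(train_set), len(test_set))  # 35 7
```

A purely random split may leave some class out of the small test set, so a per-class (stratified) split would be a reasonable alternative.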
Visualization of failure cases. In Fig. 6, we visualize some failure cases of our method. As can be seen in this figure, some of them, e.g., the images in the first row, are misclassified due to large head poses. The error introduced by large head poses could be reduced by employing more advanced face alignment methods. For the images in the second row, poor lighting conditions cause the network to make incorrect predictions. In the third row, the occlusions on the face regions make the expressions hard to recognize, in some cases even for humans.
In this article, we propose a simple yet effective omni-supervised baseline for facial expression recognition. Unlike previous works that require a pretraining phase using unlabeled data sets and a different set of losses to achieve better performance, we exploit the useful knowledge from a large-scale unlabeled data set to enhance the final performance. The proposed method is simple yet effective, without any modification of the network architecture or introduction of new loss terms. We have demonstrated that the distilled knowledge from large-scale unlabeled data has high generality by achieving significant improvements in inner-dataset and cross-dataset evaluations for facial expression recognition.