A Scalable Approach for Facial Action Unit Classifier Training Using Noisy Data for Pre-Training

11/14/2019, by Alberto Fung, et al.

Machine learning systems are being used to automate many types of laborious labeling tasks. Facial action coding is an example of such a labeling task that requires copious amounts of time and an above-average level of human domain expertise. In recent years, the use of end-to-end deep neural networks has led to significant improvements in action unit recognition performance, and many network architectures have been proposed. Do the more complex deep neural network (DNN) architectures perform sufficiently well to justify the additional complexity? We show that pre-training on a large, diverse set of noisy data can result in even a simple CNN model improving over the current state-of-the-art DNN architectures. The average F1-score achieved with our proposed method on the DISFA dataset is 0.60, compared to a previous state-of-the-art of 0.57. Additionally, we show how the number of subjects and the number of images used for pre-training impact model performance. The approach that we have outlined is open-source, highly scalable, and not dependent on the model architecture. We release the code and data: https://github.com/facialactionpretrain/facs.


I Introduction

Facial expressions can convey information about a person’s perceived emotional state [10], their intentions [14, 23, 47], and even their physical state [44]. Proper understanding of facial expressions is a vital aspect of human interaction and social communication [11, 14]. Given the significance of facial actions, there is great interest in building assistive technologies and computer systems that can leverage signals from the face. Many of the benefits offered by facial coding rely on the ability to code large amounts of image or video data. For example, analyses of facial actions are being used to help drive increased understanding of psychology [16] and complex medical conditions, such as psychosis (e.g., schizophrenia) [56] and depression [51], and can even provide a means for objective measurement of pain [27]. In each of these cases, individual differences and contextual information add considerable variability to what is displayed on the face. Relying on manual coding severely limits how effectively research can be translated into practice, as the signal-to-noise ratio is often small. In contrast, with large-scale analyses, significant trends can be observed even in the presence of noise [16, 38].

Fig. 1: An overview of our approach using automatic annotation to generate a large and diverse dataset for pre-training a FACS AU detector. The weights are then used to initialize the network for fine-tuning with a “clean” dataset of manually annotated images. We show that pre-training on 100,000s of images of automatically annotated data produces a final network that achieves state-of-the-art performance, even when the network architecture is relatively simple. We release the code and data.

The Facial Action Coding System (FACS) [11] is the most widely used and comprehensive taxonomy for describing facial behavior as specific combinations of building blocks known as action units. However, FACS coding is a time-consuming task: estimates put the time required to code a one-minute video at 30 minutes or more [57]. The demand for FACS in commercial [37] and research [56, 51, 27] applications is high. Machine learning and computer vision systems are being used to automate many types of laborious labeling tasks (e.g., object and scene labeling). Facial action coding is an example of a labeling task that requires copious amounts of time and domain-specific expertise. Therefore, training computer vision algorithms to automate facial action coding is very attractive.

Automated FACS coding has a long history; comprehensive summaries of this work are available from Martinez et al. [33] and Cohn and de la Torre [8]. More recently, research has focused on using deep neural networks, specifically convolutional neural networks (CNNs), for detecting AUs [19, 26, 5, 42]. These methods use deep representation learning to effectively detect the presence of facial action units. However, many machine learning methods (especially deep learning approaches) are “data hungry”, with performance monotonically increasing with the number of training samples [48]. Given the time-consuming nature of AU coding and the expertise it requires, data sources that provide large numbers of training examples are limited [34, 8, 39]. Most publicly available AU datasets contain examples from at most a few hundred subjects, and some only feature a few dozen. Additionally, to achieve current state-of-the-art performance, most published methods have adapted their CNN architectures to utilize additional features for representation learning [9, 42, 31]. These modifications add complexity to model training as well as computational cost at inference time. But are they really necessary to achieve good performance, and are they the most efficient way to achieve generalization? As network architectures and training procedures become more complex, there is growing concern about the reproducibility and replicability of machine learning findings [25]. This is not helped by the lack of open data and published code, but perhaps most significantly by insufficient documentation of training parameters and environments [20, 53, 35].

Recognizing these limitations, we propose a simple end-to-end pipeline to train a FACS AU classifier using a standard convolutional neural network (CNN) and automatic annotations for pre-training. We show that pre-training with these noisy labels can be beneficial: after fine-tuning on a set of clean, manually labeled data, the learned features generalize well and are discriminative for AU detection (see Figure 1), with the final model even performing better than the original algorithm used to generate the noisy labels.

The concept of pre-training is widely used in machine learning to help improve generalization [12] and has previously been applied to facial action unit classification [28]. Often pre-training is performed with a proxy task, such as face recognition [24], because available datasets for these tasks are much larger. We show that using this simple and fundamental approach with a large set of automatically AU-labeled images can outperform the current state-of-the-art models. In addition, we systematically examined how varying the number of images, as well as the number of individuals, in the pre-training stage impacts the final AU classification results.

In summary, this paper has the following contributions:

  1. To present a large set of automatically FACS-annotated images with gender, nationality and biographical meta-data.

  2. To propose a simple pipeline of pre-training and fine-tuning a CNN classifier in an end-to-end fashion for detecting the presence of facial action units that produces state-of-the-art performance.

  3. To conduct experiments to systematically investigate the effect of (1) the number of pre-training images and (2) the number of pre-training images of different people.

The dataset, code, and relevant documentation are available for public use and can be found at: https://github.com/facialactionpretrain/facs

II Related Work

Companies now offer public software development kits (SDKs) and application programming interfaces (APIs) for FACS AU coding [40]. These computer vision techniques have been applied toward automating FACS coding for a number of applications; see [33, 52] for surveys. The performance of these algorithms is highly dependent on the volume of curated training data that is available [48, 59]. A number of public databases are available and have been used to progress the field of automated facial action detection in recent years [8, 32, 55, 45, 34, 36]. However, many of these datasets were collected under controlled conditions with individuals performing posed expressions, or have examples from a limited number of different individuals. A 2015 study [15] showed that AU detection performance increased with a greater number of training examples from different individuals, emphasizing the need for diversity in training sets. Datasets like EmotioNet [13] have tried to address the issue of scarcity. EmotioNet comprises roughly 1 million images (950,000 labeled by algorithm and 50,000 labeled manually) annotated for 11 AUs. EmotioNet is an impressive resource and has been shown to be effective even though the accuracy of its annotations is only 80%, as reported by the authors. However, the dataset is not readily available to all researchers (including those in industry labs).

Type | Filter, stride, (drop%) | Output (N, W, H)
Input | - | 3, 64, 64
Conv1-1/ReLU | 3x3, stride = 1 | 64, 64, 64
Conv1-2/ReLU | 3x3, stride = 1 | 64, 64, 64
MaxPool1/Drop1 | 2x2, stride = 2, (0.25) | 64, 32, 32
Conv2-1/ReLU | 3x3, stride = 1 | 128, 32, 32
Conv2-2/ReLU | 3x3, stride = 1 | 128, 32, 32
MaxPool2/Drop2 | 2x2, stride = 2, (0.25) | 128, 16, 16
Conv3-1/ReLU | 3x3, stride = 1 | 256, 16, 16
Conv3-2/ReLU | 3x3, stride = 1 | 256, 16, 16
Conv3-3/ReLU | 3x3, stride = 1 | 256, 16, 16
MaxPool3/Drop3 | 2x2, stride = 2, (0.25) | 256, 8, 8
Conv4-1/ReLU | 3x3, stride = 1 | 256, 8, 8
Conv4-2/ReLU | 3x3, stride = 1 | 256, 8, 8
Conv4-3/ReLU | 3x3, stride = 1 | 256, 8, 8
MaxPool4/Drop4 | 2x2, stride = 2, (0.25) | 256, 4, 4
FC5/ReLU + Drop5 | (0.5) | 1024
FC6/ReLU + Drop6 | (0.5) | 1024
Output/Sigmoid | - | 17 (12*)
TABLE I: A detailed architectural description of the modified VGG13 network we used. * denotes the output vector size for the fine-tuning stage.
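The architecture in Table I can be sketched in code. The following is a minimal PyTorch rendering (an assumption; the authors implemented their models in CNTK) with the layer sizes taken directly from the table; the class name `VGG13AU` and helper `conv_block` are names introduced here for illustration.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs):
    """n_convs 3x3 stride-1 conv+ReLU layers, then 2x2 max pooling and dropout."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers += [nn.MaxPool2d(2, 2), nn.Dropout(0.25)]
    return layers

class VGG13AU(nn.Module):
    """Modified VGG13 from Table I: 17 AU outputs for pre-training, 12 for fine-tuning."""
    def __init__(self, n_aus=17):
        super().__init__()
        self.features = nn.Sequential(
            *conv_block(3, 64, 2),     # 64x64 -> 32x32
            *conv_block(64, 128, 2),   # 32x32 -> 16x16
            *conv_block(128, 256, 3),  # 16x16 -> 8x8
            *conv_block(256, 256, 3),  # 8x8 -> 4x4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 4 * 4, 1024), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.Linear(1024, n_aus),  # sigmoid is applied inside the BCE loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

A forward pass on a batch of 64x64 images yields one logit per AU, matching the table's 17 (or 12*) output units.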

Group | No. Subjects | No. Images
Total | 1,995 | 162,070
Men | 1,070 | 82,685
Women | 925 | 79,385
N. America / Europe | 1,575 | 128,031
S. Asia | 97 | 9,349
S. Africa | 21 | 1,439
Mid. East / C. Asia / N. Africa | 26 | 2,250
Central / South America | 82 | 6,254
N.E. Asia | 178 | 13,214
S.E. Asia | 16 | 1,533
TABLE II: The distribution of gender and geographical region (based on nationality) of the subjects in the initial pre-training dataset, drawn from MS-Celeb-1M.
Fig. 2: Images of faces labeled via OpenFace 2.0 with positive examples of each action unit (for AUs: 1, 2, 4, 6, 9, 12, 25, 26). The labels are noisy, especially in the case of co-occurring AUs. Our experiments show that pre-training on these data can still lead to improvements in the performance of the final AU classifier.

Traditional feature representation-based AU detection methods focus on discriminative handcrafted features like those from Gabor wavelet [54] and geometry features  [1]. However, the effectiveness of these features can be limited. More recently, approaches based on deep convolutional neural networks have been shown to outperform traditional approaches for AU detection [19, 58, 30, 42]. Starting from straight-forward end-to-end convolutional neural networks researchers have made adaptations to the architectures to achieve the current state-of-the-art performance. Zhao et al. [58] used a network (DRML) with a proposed region layer that captures local structural information in different facial regions. Li et al. [30] proposed using an adaptive region of interest cropping network to learn separate filters for different regions and merged them with an LSTM-based temporal fusion approach for AU detection. Shao et al. [49] proposed an end-to-end deep learning framework for joint AU detection and face alignment. The joint learning of the two tasks and sharing features was enabled by initializing the attention maps with the face alignment results. Additionally, they also proposed an adaptive attention learning module to localize ROIs of AUs adaptively to extract better local features. Recently, Niu et al. [42] proposed an end-to-end framework which consists of a stem network for shared feature learning, a local relationship learning network, and a person-specific shape regularization network. The combination of these three modules have shown to produce state-of-the-art performance.

Pre-training, even unsupervised pre-training, has been shown to be effective in many deep learning tasks as it supports better generalization of the resulting model [12]. Pre-training can become particularly important in contexts where labeled training data is sparse, which is the case for facial action unit recognition [28]. Inspired by this, we propose to use a large set of automatically annotated noisy AU data to pre-train a familiar and relatively simple CNN architecture, then fine-tune it with clean manually labeled data. We find that this approach can still outperform state-of-the-art models by a reasonable margin.

F1 Score | AU01 | AU02 | AU04 | AU06 | AU09 | AU12 | AU25 | AU26 | Avg
OpenFace 2.0 [2] | 41.9 | [34.7] | 66.7 | 42.6 | 36.3 | 60.8 | 90.3 | 55.8 | 53.6
LP-NET [42] | 29.9 | 24.7 | 72.7 | 46.8 | 49.6 | [72.9] | 93.8 | 65.0 | 56.9
Ours | [41.5] | 49.5 | [70.2] | [46.2] | [47.9] | 75.6 | [90.7] | [57.6] | 59.9
TABLE III: F1-frame score (in %) as reported by LP-NET [42], OpenFace 2.0 [2], and for our proposed method on the DISFA dataset. The best score is in bold, and brackets denote the second best.
N Examples | AU01 | AU02 | AU04 | AU06 | AU09 | AU12 | AU25 | AU26 | Avg
1,000 | 0 | 0.1 | 35.4 | 33.7 | 0.4 | 69.3 | 71.8 | 13.5 | 28.0
2,000 | 19.8 | 20.7 | 54.1 | 41.0 | 11.7 | 73.0 | 79.6 | 17.9 | 39.7
10,000 | 8.5 | 35.9 | 61.6 | 46.7 | 34.7 | 73.7 | 83.0 | 12.9 | 44.6
162,070 (all) | 41.5 | 49.5 | 70.2 | 46.2 | 47.9 | 75.6 | 90.7 | 57.6 | 59.9
TABLE IV: The F1-frame score on the DISFA dataset for each model using N image examples in the pre-training dataset.
N Individuals | AU01 | AU02 | AU04 | AU06 | AU09 | AU12 | AU25 | AU26 | Avg
12 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
200 | 19.0 | 38.8 | 64.6 | 41.3 | 23.4 | 74.9 | 84.5 | 18.8 | 45.7
600 | 27.9 | 21.8 | 68.5 | 50.2 | 43.8 | 74.2 | 89.0 | 19.2 | 49.3
1,000 | 48.0 | 54.6 | 70.8 | 49.2 | 45.2 | 73.9 | 88.9 | 37.4 | 58.5
1,995 (all) | 41.5 | 49.5 | 70.2 | 46.2 | 47.9 | 75.6 | 90.7 | 57.6 | 59.9
TABLE V: The F1-frame score on the DISFA dataset for each model using N unique people in the pre-training dataset.

III Data

It is well established that a multi-layer feed-forward network with non-linear activation functions can be a universal approximator [22, 29, 18]. Networks with deeper architectures are beneficial for learning the kinds of complicated functions that can represent high-level abstractions in vision, language, and other domains [4]. Training these models can be challenging, since the objective function is a highly non-convex function of the parameters, with potentially many distinct local minima in the model parameter space [3]. During training, these deep models are sensitive to the initial weights; poor initialization can lead to slow training, “vanishing gradients”, “dead neurons”, and/or even numerical problems. While methods for weight initialization have been proposed to help the training process [17, 41], they do not aim at improving generalization [43]. Our proposed method is to pre-train with noisy, openly available, automatically annotated AU data; the learned weights are then reused as initial weights for the fine-tuning stage with clean, manually labeled data.

III-A Pre-Training Set

For the pre-training stage we employed the large-scale, publicly available MS-Celeb-1M dataset [21]. The dataset contains over 10 million images of 1 million unique individuals retrieved from popular search engines. We used biographical data, obtained from an Internet knowledge database, to determine the gender and nationality of these individuals. From this dataset we randomly sampled over 160K images for annotation, to be used for pre-training our model. An even gender split was maintained during sampling. Table II shows the distribution of gender and geographical region (based on the subject’s nationality) from which the images were taken. As the distribution of subjects in the MS-Celeb dataset was heavily skewed towards those from North America and Europe, and we sampled uniformly at random from this set, our data was also heavily skewed. Future work will consider the impact of balancing this set by region and gender, rather than just gender.

The set of images sampled from the MS-Celeb-1M dataset for pre-training was automatically annotated using OpenFace 2.0 [2]. OpenFace 2.0 is an open-source toolkit capable of facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. OpenFace 2.0 gives estimates of both the intensity and presence of each action unit in a face image. Intensities of 17 AUs are given as a regression output from 0 to 100, while the presence of 18 AUs is given as a binary value (0 absent, 1 present). For our model development, we focused only on the presence of 17 AUs (excluding AU45). From the initial set of over 170,000 images, OpenFace 2.0 successfully annotated over 160,000. These were then used as the image set to pre-train our model. For training, we used a 95/5 training/test split. We have created a separate download link for these data with the automatic annotations. The data can be found at the URL:

https://github.com/facialactionpretrain/facs
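Constructing the noisy pre-training targets from the OpenFace 2.0 output can be sketched as follows. The `AUxx_c` presence column naming follows OpenFace's CSV convention, and the 17-AU list is OpenFace's presence set once AU45 is excluded; treat both as assumptions when adapting this to your own annotation run.

```python
# The 17 AUs kept for pre-training (OpenFace 2.0's presence AUs, minus AU45).
AUS = ["01", "02", "04", "05", "06", "07", "09", "10", "12",
       "14", "15", "17", "20", "23", "25", "26", "28"]

def presence_vector(row):
    """Turn one OpenFace 2.0 output row (a dict of column name -> value)
    into a binary multi-label target over the 17 AUs (1 = present)."""
    return [int(float(row[f"AU{au}_c"]) > 0.5) for au in AUS]
```

Each annotated image then contributes one 17-dimensional binary vector as its (noisy) label for the pre-training stage.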

III-B Fine-Tuning and Testing

For fine-tuning our model we employed the DISFA dataset [34]. The dataset contains videos of 27 young adults (15 males and 12 females) who were asked to watch a 4-minute video clip intended to elicit spontaneous expressions of emotion. Each video frame was manually coded for the presence, absence, and intensity level (0 to 5) of 12 AUs. For our experiments, frames with intensity levels equal to or greater than 2 were labeled as positive examples and the rest were labeled as negative examples. About 130,000 frames were used in the final experiments.
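The binarization rule above is simple enough to state in one line of code; this small helper (a name introduced here, not from the paper's release) makes the threshold explicit:

```python
def disfa_frame_labels(intensities, threshold=2):
    """Binarize one DISFA frame's 12 AU intensity codes (0-5):
    intensity >= 2 counts as a positive example, per the text above."""
    return [1 if v >= threshold else 0 for v in intensities]
```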

IV Proposed Method and Experimental Setup

Table I shows a detailed overview of the modified VGG13 architecture used for our facial AU detection model. The model contains convolutional layers with max pooling. We used dropout to help avoid overfitting [50]. The convolutional layers were followed by two fully connected layers. As described below, the final fully connected layer was replaced between the pre-training and fine-tuning steps.

IV-A Pre-Training

For the pre-training stage, we used an Adam optimizer with a learning rate of 0.005, a momentum rate of 0.9, and a batch size of 32. The network was trained until either convergence or a maximum of 500 epochs was reached.

IV-B Fine-Tuning

For the fine-tuning stage, we replaced the final fully connected (FC) output layer from our pre-trained model with a new output layer mapping to the 12 annotated AUs in the DISFA dataset. For training, a subject-independent 3-fold cross-validation protocol was used. The network was optimized with an Adam optimizer with a learning rate of 0.0001, a momentum rate of 0.9, and a batch size of 32. We employed an early stopping criterion of 10 epochs. Both the pre-training and fine-tuning networks used a binary cross-entropy loss function. All implementations were created using CNTK [46].
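The head replacement and fine-tuning setup can be sketched as follows, in PyTorch for concreteness (an assumption; the paper used CNTK). The "momentum rate of 0.9" is read here as Adam's first-moment coefficient, and the assumption that the output layer is the last module of a `model.classifier` Sequential is ours:

```python
import torch
import torch.nn as nn

def finetune_setup(model, n_features=1024, n_aus=12, lr=1e-4):
    """Swap the pre-trained model's final FC layer for a fresh 12-AU head
    and build the fine-tuning optimizer and loss."""
    model.classifier[-1] = nn.Linear(n_features, n_aus)  # new randomly initialized head
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    criterion = nn.BCEWithLogitsLoss()  # binary cross entropy over the AU logits
    return model, optimizer, criterion
```

All other layers keep their pre-trained weights; only the new head starts from a random initialization before fine-tuning on DISFA.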

IV-C Image Pre-processing

IV-C1 Pre-Training

All input images were mean-normalized, converted to single-channel grayscale, and resized to 64x64. Random horizontal flip, rotation, skew, and scale were used for data augmentation. Image resizing and data augmentation were performed online during training.

Fig. 3: ROC AUC, PR AUC and F1 scores of our model as a function of the number of automatically annotated images used at the pre-training step. The performance of LP-Net and OpenFace 2.0 are shown for comparison.
Fig. 4: ROC AUC, PR AUC and F1 scores of our model as a function of the number of automatically annotated images of different people used at the pre-training step. The performance of LP-Net and OpenFace 2.0 are shown for comparison.

IV-C2 Fine-Tuning

OpenCV’s [7] deep neural network (DNN) face detector was used to locate and crop out the face in each frame of the DISFA dataset. The cropped images were then zero-padded to maintain a square (1:1) aspect ratio, mean-normalized, converted to single-channel grayscale, and resized to 64x64. Random horizontal flip, rotation, skew, and scale were used for data augmentation. As with pre-training, image resizing and data augmentation were performed online during training.

IV-D Evaluation Metrics

For comparison with other methods, performance was evaluated using the F1-score. The F1-score is the harmonic mean of precision and recall and was calculated by binarizing the output at a threshold of 0.5. We also computed the areas under the precision-recall (PR) curve and the receiver operating characteristic (ROC) curve to capture overall model performance, not just performance at a specific operating point. All metrics were computed per AU and then averaged across AUs.
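The evaluation described above can be sketched with scikit-learn's metric functions (an assumption about tooling; the paper does not say how its metrics were computed):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

def au_metrics(y_true, y_prob, threshold=0.5):
    """Per-AU F1 (outputs binarized at 0.5), ROC AUC, and PR AUC,
    each averaged across AUs as in the paper's evaluation protocol."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    n_aus = y_true.shape[1]
    f1 = np.mean([f1_score(y_true[:, i], y_pred[:, i]) for i in range(n_aus)])
    roc = np.mean([roc_auc_score(y_true[:, i], y_prob[:, i]) for i in range(n_aus)])
    pr = np.mean([average_precision_score(y_true[:, i], y_prob[:, i])
                  for i in range(n_aus)])
    return {"F1": float(f1), "ROC_AUC": float(roc), "PR_AUC": float(pr)}
```

Note that `average_precision_score` is a step-wise approximation of the PR-curve area; an interpolated PR AUC would give slightly different numbers.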

IV-E Varying Number of Images & People during Pre-Training

Additional experiments were conducted to further understand the effects of independently varying the number of images, as well as the number of individuals pictured, during the pre-training stage. From the set of 162,070 OpenFace 2.0-annotated images, subsets of 1,000, 2,000, and 10,000 face images were created by uniform random sampling, without regard to the individuals in the images. Another set of subsets was then created with images sampled specifically from 12, 200, 600, and 1,000 different people. These subsets were used to pre-train different VGG13 models. As with the original 162,070 images, a gender-balanced split was maintained for each subset. Identical fine-tuning procedures were performed on the different pre-trained models using the DISFA dataset and the subject-independent 3-fold cross-validation scheme described above.
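The two sampling schemes (by image count, and by subject count) can be sketched as below. The `(image_id, subject_id, gender)` record format and the function names are assumptions for illustration; only the gender-balanced, uniform-random sampling logic comes from the text.

```python
import random
from collections import defaultdict

def sample_images(records, n, seed=0):
    """Gender-balanced uniform subset of n images, ignoring subject identity."""
    rng = random.Random(seed)
    men = [r for r in records if r[2] == "M"]
    women = [r for r in records if r[2] == "F"]
    return rng.sample(men, n // 2) + rng.sample(women, n - n // 2)

def sample_subjects(records, n_subjects, seed=0):
    """All images from a gender-balanced random pick of n_subjects people."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for r in records:
        by_subject[r[1]].append(r)
    men = sorted({r[1] for r in records if r[2] == "M"})
    women = sorted({r[1] for r in records if r[2] == "F"})
    chosen = (rng.sample(men, n_subjects // 2)
              + rng.sample(women, n_subjects - n_subjects // 2))
    return [img for s in chosen for img in by_subject[s]]
```

The first scheme holds the image count fixed while subject diversity varies freely; the second holds the subject count fixed, which is what separates the two experiments.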

V Results

V-A Comparison with State-of-the-Art

The results of our model were compared against the current state-of-the-art LP-Net [42] and OpenFace 2.0 [2] using the F1-score metric (as this is the metric used consistently across all the papers). To maintain consistency with the authors of LP-Net, 8 of the 12 AUs were used for the comparison. Our approach achieves the best performance in terms of overall average F1-score and achieves either the best or second-best performance for the AUs annotated in DISFA. The overall average F1-score, PR area, and ROC area achieved on the DISFA dataset with our model were 0.60, 0.45, and 0.86, respectively. Table III shows the results of the comparisons on the DISFA dataset. These results suggest that our method generalizes well and can achieve state-of-the-art performance even when compared to methods using more complex network architectures. The performance improvement on AU01 and AU02 was particularly large with the pre-training approach; the F1-score for AU02 was over 100% higher than that of LP-Net [42]. Perhaps this is because the pre-training allowed the algorithm to “see” more examples of people with different facial appearances.

V-B The Effect of Number of Examples on Performance

Results from our systematic experiments designed to understand the effects of varying the number of images and individuals can be seen in Figures 3 and 4. Numerical results can be seen in Tables IV and V, respectively.

Specifically, Figure 3 shows the F1, ROC, and PR metrics when pre-training with different numbers of total image examples (with equal numbers of men and women). The F1-score performance of LP-Net and OpenFace 2.0 is shown for reference. The performance metrics for LP-Net are those reported in [42], while the F1-score of the OpenFace 2.0 classifier was calculated on the DISFA dataset to provide the appropriate comparison. The results show that as the number of images increases, performance increases monotonically and in a close-to-linear fashion.

V-C The Effect of Number of Subjects on Performance

Figure 4 shows the F1, ROC, and PR metrics when pre-training with images of different numbers of people (with equal numbers of men and women). The F1-score performance of LP-Net and OpenFace 2.0 is shown for reference. The performance metrics for LP-Net are those reported in [42], while the F1-score of the OpenFace 2.0 classifier was calculated on the DISFA dataset to provide the appropriate comparison. Again, the results show that as the number of images of different people increases, performance also increases monotonically, initially at a steeper rate than in the previous plot. This suggests that additional images of different people are more beneficial than additional images of the same people.

VI Discussion and Conclusion

Facial action coding is an important tool in facial analysis and affective computing. FACS provides a useful and objective coding mechanism. However, coding images and videos is extremely labor intensive and requires a level of training that many researchers may not have. Machine learning techniques are being successfully utilized for FACS recognition. We show that pre-training on a diverse but noisy set of images can lead to simple network architectures outperforming more complex architectures and obtaining state-of-the-art results. We find that when the labels used for pre-training are generated with an existing set of AU detectors (in this case OpenFace 2.0), the final model even outperforms the original detectors. This suggests that the model is able to learn to generalize from these noisy data and form representations that are more effective. This is most likely because the model is able to further separate the signal (AU appearances) from noise (different face shapes, head poses, and other variations) by observing a very large number of different faces. Our results are supported by the fact that when we experiment with different numbers of unique individuals in the pre-training set (see Figure 4), the performance increases dramatically with diversity, even though the label quality from the automated algorithm (OpenFace 2.0) is still imperfect. One could describe this as a form of semi-supervised learning. It suggests that increasing the diversity (number of subjects) in a pre-training set, while losing some accuracy in the labels, is still beneficial compared to training only on a smaller set of more accurately labeled images. The process of creating this pre-training data is very scalable, and we release our set with this paper.

Our results are in agreement with Girard et al. [15], who showed that the performance of an AU classifier trained on appearance-based features greatly increased as the number of subjects in the training set increased. In their experiment, increasing the training data with manually labeled AUs from 8 to 64 subjects resulted in a significant increase in classification performance. This further supports our hypothesis that pre-training on a dataset containing images of a very large number of unique individuals can help the model learn representations that separate signal from noise, even when the labels are noisy.

The approach that we have outlined in this paper uses an open-source automated FACS annotation tool and face images scraped from the Internet. As such, it is a highly scalable method. While we showed the efficacy of this approach with a simple CNN, it is in principle independent of the network architecture. We hope that by releasing these data and code, other researchers can leverage the benefits of noisy pre-training.

As a proof-of-concept, we only used about 160,000 images from the MS-Celeb dataset for this work. However, the dataset contains a full 10 million images from 1 million individuals. The pre-training dataset could be extended and our results suggest that there may still be further performance gains that can be achieved by simply increasing the size of the pre-training set (see Figure 3).

VII Distribution

We release the dataset used in this paper alongside the code. The dataset may be used for academic and commercial research purposes. The license details, the permissible uses of the data, and the appropriate citation can be found at: https://github.com/facialactionpretrain/facs. The dataset is available for distribution to researchers online.

A summary of the dataset is included below:

Images. 162,070 RGB images of aligned and cropped faces of 1,995 subjects. The images are of celebrities, which allows us to attach biographical data as described below.

Biography. We searched the Google Knowledge database for the name of each celebrity in the images. This biographical text is included in the dataset in JSON format.

Gender. We inferred gender from pronoun counts in the biographical text, using NLTK [6] and named-entity analysis.

Nationality/Country of Origin. We queried the biographical data for each subject using NLTK and named-entity analysis to extract the nationality.

References

  • [1] T. Baltrušaitis, M. Mahmoud, and P. Robinson (2015) Cross-dataset learning and person-specific normalisation for automatic action unit detection. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 6, pp. 1–6. Cited by: §II.
  • [2] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L. Morency (2018) Openface 2.0: facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 59–66. Cited by: TABLE III, §III-A, §V-A.
  • [3] Y. Bengio, Y. LeCun, et al. (2007) Scaling learning algorithms towards ai. Large-scale kernel machines 34 (5), pp. 1–41. Cited by: §III.
  • [4] Y. Bengio et al. (2009) Learning deep architectures for ai. Foundations and trends® in Machine Learning 2 (1), pp. 1–127. Cited by: §III.
  • [5] C. F. Benitez-Quiroz, Y. Wang, and A. M. Martinez (2017) Recognition of action units in the wild with deep nets and a new global-local loss.. In ICCV, pp. 3990–3999. Cited by: §I.
  • [6] S. Bird, E. Klein, and E. Loper (2009) Natural language processing with Python: Analyzing text with the natural language toolkit. O’Reilly Media. Cited by: §VII.
  • [7] G. Bradski (2000) The OpenCV Library. Dr. Dobb’s Journal of Software Tools. Cited by: §IV-C2.
  • [8] J. F. Cohn and F. De la Torre (2014) Automated face analysis for affective computing. In The Oxford handbook of affective computing, pp. 131. Cited by: §I, §II.
  • [9] C. Corneanu, M. Madadi, and S. Escalera (2018) Deep structure inference network for facial action unit recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 298–313. Cited by: §I.
  • [10] P. Ekman (1965) Differential communication of affect by head and body cues.. Journal of personality and social psychology 2 (5), pp. 726. Cited by: §I.
  • [11] R. Ekman (1997) What the face reveals: basic and applied studies of spontaneous expression using the facial action coding system (facs). Oxford University Press, USA. Cited by: §I, §I.
  • [12] D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio (2010) Why does unsupervised pre-training help deep learning?. Journal of Machine Learning Research 11 (Feb), pp. 625–660. Cited by: §I, §II.
  • [13] C. Fabian Benitez-Quiroz, R. Srinivasan, and A. M. Martinez (2016) Emotionet: an accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5562–5570. Cited by: §II.
  • [14] A. J. Fridlund (2014) Human facial expression: an evolutionary view. Academic Press. Cited by: §I.
  • [15] J. M. Girard, J. F. Cohn, L. A. Jeni, S. Lucey, and F. De la Torre (2015) How much training data for facial action unit detection?. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 1, pp. 1–8. Cited by: §II, §VI.
  • [16] J. M. Girard and D. McDuff (2017) Historical heterogeneity predicts smiling: evidence from large-scale observational analyses. In 2017 12th IEEE International Conference on automatic face & gesture recognition (FG 2017), pp. 719–726. Cited by: §I.
  • [17] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §III.
  • [18] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: §III.
  • [19] A. Gudi, H. E. Tasli, T. M. Den Uyl, and A. Maroulis (2015) Deep learning based facs action unit occurrence and intensity estimation. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 6, pp. 1–5. Cited by: §I, §II.
  • [20] O. E. Gundersen and S. Kjensmo (2018) State of the art: reproducibility in artificial intelligence. Cited by: §I.
  • [21] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao (2016) Ms-celeb-1m: a dataset and benchmark for large-scale face recognition. In European Conference on Computer Vision, pp. 87–102. Cited by: §III-A.
  • [22] K. Hornik, M. Stinchcombe, and H. White (1989) Multilayer feedforward networks are universal approximators. Neural networks 2 (5), pp. 359–366. Cited by: §III.
  • [23] G. Horstmann (2003) What do facial expressions convey: feeling states, behavioral intentions, or actions requests?. Emotion 3 (2), pp. 150. Cited by: §I.
  • [24] Y. Huang and H. Lu (2016) Hybrid hypergraph construction for facial expression recognition. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 4142–4147. Cited by: §I.
  • [25] M. Hutson (2018) Artificial intelligence faces reproducibility crisis.. Science (New York, NY) 359 (6377), pp. 725. Cited by: §I.
  • [26] S. Jaiswal and M. Valstar (2016) Deep learning the dynamic appearance and shape of facial action units. In 2016 IEEE winter conference on applications of computer vision (WACV), pp. 1–8. Cited by: §I.
  • [27] S. Kaltwang, O. Rudovic, and M. Pantic (2012) Continuous pain intensity estimation from facial expressions. In International Symposium on Visual Computing, pp. 368–377. Cited by: §I, §I.
  • [28] P. Khorrami, T. Paine, and T. Huang (2015) Do deep neural networks learn facial action units when doing expression recognition?. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 19–27. Cited by: §I, §II.
  • [29] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken (1993) Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks 6 (6), pp. 861–867. Cited by: §III.
  • [30] W. Li, F. Abtahi, and Z. Zhu (2017) Action unit detection with region adaptation, multi-labeling learning and optimal temporal fusing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1841–1850. Cited by: §II.
  • [31] Y. Li, J. Zeng, S. Shan, and X. Chen (2019) Self-supervised representation learning from videos for facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10924–10933. Cited by: §I.
  • [32] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews (2010) The extended cohn-kanade dataset (ck+): a complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, pp. 94–101. Cited by: §II.
  • [33] B. Martinez, M. F. Valstar, B. Jiang, and M. Pantic (2017) Automatic analysis of facial actions: a survey. IEEE transactions on affective computing. Cited by: §I, §II.
  • [34] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn (2013) Disfa: a spontaneous facial action intensity database. IEEE Transactions on Affective Computing 4 (2), pp. 151–160. Cited by: §I, §II, §III-B.
  • [35] M. McDermott, S. Wang, N. Marinsek, R. Ranganath, M. Ghassemi, and L. Foschini (2019) Reproducibility in machine learning for health. arXiv preprint arXiv:1907.01463. Cited by: §I.
  • [36] D. McDuff, M. Amr, and R. El Kaliouby (2018) Am-fed+: an extended dataset of naturalistic facial expressions collected in everyday settings. IEEE Transactions on Affective Computing 10 (1), pp. 7–17. Cited by: §II.
  • [37] D. McDuff and R. El Kaliouby (2016) Applications of automated facial coding in media measurement. IEEE transactions on affective computing 8 (2), pp. 148–160. Cited by: §I.
  • [38] D. McDuff and J. M. Girard (2019) Democratizing psychological insights from analysis of nonverbal behavior. Cited by: §I.
  • [39] D. McDuff, R. Kaliouby, T. Senechal, M. Amr, J. Cohn, and R. Picard (2013) Affectiva-mit facial expression dataset (am-fed): naturalistic and spontaneous facial expressions collected. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 881–888. Cited by: §I.
  • [40] D. McDuff, A. Mahmoud, M. Mavadati, M. Amr, J. Turcot, and R. El Kaliouby (2016) AFFDEX sdk: a cross-platform real-time multi-face expression recognition toolkit. In Proceedings of the 2016 CHI conference extended abstracts on human factors in computing systems, pp. 3723–3726. Cited by: §II.
  • [41] D. Mishkin and J. Matas (2015) All you need is a good init. arXiv preprint arXiv:1511.06422. Cited by: §III.
  • [42] X. Niu, H. Han, S. Yang, Y. Huang, and S. Shan (2019) Local relationship learning with person-specific shape regularization for facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11917–11926. Cited by: §I, TABLE III, §II, §V-A, §V-B, §V-C.
  • [43] A. Y. Peng, Y. S. Koh, P. Riddle, and B. Pfahringer (2018) Using supervised pretraining to improve generalization of neural networks on binary classification problems. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 410–425. Cited by: §III.
  • [44] T. K. Pitcairn, S. Clemie, J. M. Gray, and B. Pentland (1990) Non-verbal cues in the self-presentation of parkinsonian patients. British Journal of Clinical Psychology 29 (2), pp. 177–184. Cited by: §I.
  • [45] A. Savran, N. Alyüz, H. Dibeklioğlu, O. Çeliktutan, B. Gökberk, B. Sankur, and L. Akarun (2008) Bosphorus database for 3d face analysis. In European Workshop on Biometrics and Identity Management, pp. 47–56. Cited by: §II.
  • [46] F. Seide and A. Agarwal (2016) CNTK: microsoft’s open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2135–2135. Cited by: §IV-B.
  • [47] E. Seidel, U. Habel, M. Kirschner, R. C. Gur, and B. Derntl (2010) The impact of facial emotional expressions on behavioral tendencies in women and men.. Journal of Experimental Psychology: Human Perception and Performance 36 (2), pp. 500. Cited by: §I.
  • [48] T. Senechal, D. McDuff, and R. Kaliouby (2015) Facial action unit detection using active learning and an efficient non-linear kernel approximation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 10–18. Cited by: §I, §II.
  • [49] Z. Shao, Z. Liu, J. Cai, and L. Ma (2018) Deep adaptive attention for joint facial action unit detection and face alignment. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 705–720. Cited by: §II.
  • [50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §IV.
  • [51] G. Stratou, S. Scherer, J. Gratch, and L. Morency (2013) Automatic nonverbal behavior indicators of depression and ptsd: exploring gender differences. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pp. 147–152. Cited by: §I, §I.
  • [52] M. Takalkar, M. Xu, Q. Wu, and Z. Chaczko (2018) A survey: facial micro-expression recognition. Multimedia Tools and Applications 77 (15), pp. 19301–19325. Cited by: §II.
  • [53] R. Tatman, J. VanderPlas, and S. Dane (2018) A practical taxonomy of reproducibility for machine learning research. Cited by: §I.
  • [54] M. Valstar and M. Pantic (2006) Fully automatic facial action unit detection and temporal analysis. In 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW’06), pp. 149–149. Cited by: §II.
  • [55] M. Valstar and M. Pantic (2010) Induced disgust, happiness and surprise: an addition to the mmi facial expression database. In Proc. 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, pp. 65. Cited by: §II.
  • [56] S. Vijay, T. Baltrušaitis, L. Pennant, D. Ongür, J. T. Baker, and L. Morency (2016) Computational study of psychosis symptoms and facial expressions. In Computing and Mental Health Workshop at CHI, Cited by: §I, §I.
  • [57] K. Zhao, W. Chu, and A. M. Martinez (2018) Learning facial action units from web images with scalable weakly supervised clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2090–2099. Cited by: §I.
  • [58] K. Zhao, W. Chu, and H. Zhang (2016) Deep region and multi-label learning for facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3391–3399. Cited by: §II.
  • [59] X. Zhu, C. Vondrick, D. Ramanan, and C. C. Fowlkes (2012) Do we need more training data or better models for object detection?.. In BMVC, Vol. 3, pp. 5. Cited by: §II.