Experiment files for the paper "Do Deep Neural Networks Learn Facial Action Units When Doing Expression Recognition?", available here: http://arxiv.org/abs/1510.02969
Despite being the appearance-based classifier of choice in recent years, relatively few works have examined how much convolutional neural networks (CNNs) can improve performance on accepted expression recognition benchmarks and, more importantly, what they actually learn. In this work, not only do we show that CNNs can achieve strong performance, but we also introduce an approach to decipher which portions of the face influence the CNN's predictions. First, we train a zero-bias CNN on facial expression data and achieve, to our knowledge, state-of-the-art performance on two expression recognition benchmarks: the extended Cohn-Kanade (CK+) dataset and the Toronto Face Dataset (TFD). We then qualitatively analyze the network by visualizing the spatial patterns that maximally excite different neurons in the convolutional layers and show how they resemble Facial Action Units (FAUs). Finally, we use the FAU labels provided in the CK+ dataset to verify that the FAUs observed in our filter visualizations indeed align with the subject's facial movements.
Facial expressions provide a natural and compact way for humans to convey their emotional state to another party. Therefore, designing accurate facial expression recognition algorithms is crucial to the development of interactive computer systems in artificial intelligence. Extensive work in this area has found that only a small number of facial regions change as a person changes expression, and that these regions are located around the subject’s eyes, nose and mouth. Paul Ekman proposed the Facial Action Coding System (FACS), which enumerated these regions and described how every facial expression can be described as the combination of multiple action units (AUs), each corresponding to a particular muscle group in the face. However, having a computer accurately learn the parts of the face that convey emotion has proven to be a non-trivial task.
Previous work in facial expression recognition can be split into two broad categories: AU-based/rule-based methods and appearance-based methods. AU-based methods [29, 30] detect the presence of individual AUs explicitly and then classify a person’s emotion based on the combinations originally proposed by Friesen and Ekman. Unfortunately, each AU detector required careful hand-engineering to ensure good performance. On the other hand, appearance-based methods [1, 2, 31, 33] model a person’s expression from their general facial shape and texture.
In the last few years, many well-established problems in computer vision have greatly benefited from the rise of convolutional neural networks (CNNs) as an appearance-based classifier. Tasks such as object recognition, object detection, and face recognition have seen huge boosts in performance on several accepted benchmarks. Unfortunately, other tasks such as facial expression recognition have not experienced performance gains of the same magnitude. Little work has been done to see how much deep CNNs can help on accepted expression recognition benchmarks.
In this paper, we seek the answer to the following questions: Can CNNs improve performance on emotion recognition datasets/baselines, and what do they learn? We propose to do this by training a CNN on established facial expression datasets and then analyzing what it learns by visualizing the individual filters in the network. In this work, we apply the visualization techniques proposed by Zeiler and Fergus and by Springenberg et al., where individual neurons in the network are excited and their corresponding spatial patterns are displayed in pixel space using a deconvolutional network. When visualizing these discriminative spatial patterns, we find that many of the filters are excited by regions in the face that correspond to Facial Action Units (FAUs). A subset of these spatial patterns is shown in Figure 1.
Thus, the main contributions of this paper are as follows:
We show that CNNs trained for the emotion recognition task learn features that correspond strongly with the FAUs proposed by Ekman. We demonstrate this result by first visualizing the spatial patterns that maximally excite different filters in the convolutional layers of our networks, and then using the ground truth FAU labels to verify that the FAUs observed in the filter visualizations align with the subject’s facial movements.
In most facial expression recognition systems, the main machinery matches quite nicely with the traditional machine learning pipeline. More specifically, a face image is passed to a classifier that tries to categorize it as one of several (typically 7) expression classes: 1. anger, 2. disgust, 3. fear, 4. neutral, 5. happy, 6. sad, and 7. surprise. In most cases, prior to being passed to the classifier, the face image is pre-processed and given to a feature extractor. Up until rather recently, most appearance-based expression recognition techniques relied on hand-crafted features, specifically Gabor wavelets [1, 2], Haar features and LBP features, in order to make representations of different expression classes more discriminative.
For some time, systems based on hand-crafted features were able to achieve impressive results on accepted expression recognition benchmarks such as the Japanese Female Facial Expression (JAFFE) database, the extended Cohn-Kanade (CK+) dataset, and the Multi-PIE dataset. However, the recent success of deep neural networks has caused many researchers to explore feature representations that are learned from data. Not surprisingly, almost all of the methods used some form of unsupervised pre-training/learning to initialize their models. We hypothesize this may be because the scarcity of labeled data prevented the authors from training a completely supervised model that did not experience heavy overfitting.
In one such work, the authors trained a multi-layer boosted deep belief network (BDBN) and achieved state-of-the-art accuracy on the CK+ and JAFFE datasets. Meanwhile, in another, the authors used a convolutional contractive auto-encoder (CAE) as their underlying unsupervised model. They then applied a semi-supervised encoding function called Contractive Discriminant Analysis (CDA) to separate discriminative expression features from the unsupervised representation.
A few works based on unsupervised deep learning have also tried to analyze the relationship between FAUs and the learned feature representations. In [15, 16], the authors learned a patch-based filter bank using K-means as their low-level feature. These features were then used to select receptive fields corresponding to specific FAUs, which were subsequently passed to multi-layer restricted Boltzmann machines (RBMs) for classification. The FAU receptive fields were selected using a mutual information criterion between the image feature and the expression label. An earlier work by Susskind et al. showed that the first-layer features of a deep belief network trained to generate facial expression images appeared to learn filters that were sensitive to face parts. We conduct a similar analysis, except that we use a CNN as our underlying model and we visualize the spatial patterns that excite higher-level neurons in the network.
To the authors’ knowledge, the only works that previously applied CNNs to expression data were those of Kahou et al. [13, 12] and Jung et al. In [13, 12], the authors developed a system for doing audio/visual emotion recognition for the Emotion Recognition in the Wild Challenge (EmotiW) [6, 5], while Jung et al. trained a network that incorporated both appearance and geometric features when doing recognition. However, one key point is that these works dealt with emotion recognition of video / image sequence data and therefore actively incorporated temporal data when computing their predictions.
In contrast, our work deals with emotion recognition from a single image, and will focus on analyzing the features learned by the network. Thus, not only will we demonstrate the effectiveness of CNNs on existing emotion classification baselines but we will also qualitatively show that the network is able to learn patterns in the face images that correspond to Facial Action Units (FAUs).
For all of the experiments we present in this paper, we use a classic feed-forward convolutional neural network. The networks we use, shown visually in Figure 2, consist of three convolutional layers with 64, 128, and 256 filters, respectively, each with filter sizes of 5x5 and followed by ReLU (Rectified Linear Unit) activation functions. Max pooling layers are placed after the first two convolutional layers while quadrant pooling is applied after the third. The quadrant pooling layer is then followed by a fully-connected layer with 300 hidden units and, finally, a softmax layer for classification. The softmax layer contains between 6 and 8 outputs, corresponding to the number of expressions present in the training set.
One modification that we introduce to the classical configuration is that we ignore the biases of the convolutional layers. This idea was introduced first by Memisevic et al. for fully-connected networks and later extended by Paine et al. to convolutional layers. In our experiments, we found that ignoring the bias allowed our network to train very quickly while simultaneously reducing the number of parameters to learn.
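As a rough illustration of the parameter savings, the sketch below counts the weights and the omitted biases of the three convolutional layers. It assumes a single grayscale input channel (the images are described as grayscale); the layer names and the helper `conv_param_counts` are illustrative and not part of the original code.

```python
# Weight and bias counts for the three 5x5 convolutional layers.
# Assumption: one grayscale input channel; layer names are illustrative.
layers = [("conv1", 1, 64), ("conv2", 64, 128), ("conv3", 128, 256)]
kernel_h = kernel_w = 5

def conv_param_counts(layers, kh, kw):
    """Return (total_weights, total_biases) for a stack of conv layers."""
    weights = sum(kh * kw * c_in * c_out for _, c_in, c_out in layers)
    biases = sum(c_out for _, _, c_out in layers)  # one bias per filter
    return weights, biases

weights, biases = conv_param_counts(layers, kernel_h, kernel_w)
print(f"conv weights: {weights}, biases dropped: {biases}")
```

Under these assumptions the biases are a very small share of the convolutional parameters, so the reduction in parameter count is modest compared with the weights themselves.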
When training our network, we train from scratch using stochastic gradient descent with a batch size of 64, momentum set to 0.9, and a weight decay parameter of 1e-5. We use a constant learning rate of 0.01 and do not use any form of annealing. The parameters of each layer are randomly initialized by drawing from a Gaussian distribution with zero mean and a standard deviation set from the number of input connections to each layer and a factor k drawn uniformly from a fixed range.
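The update rule implied by these hyperparameters can be sketched as follows. This is one common formulation of SGD with momentum and L2 weight decay, written in NumPy; whether the training library applies weight decay and momentum in exactly this form is an assumption, and `sgd_momentum_step` is a hypothetical helper.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9, weight_decay=1e-5):
    """One SGD step; weight decay is folded into the gradient as an L2 term."""
    g = grad + weight_decay * w          # gradient of loss plus L2 penalty
    v_new = momentum * v - lr * g        # velocity accumulates past gradients
    return w + v_new, v_new              # constant learning rate, no annealing

# Toy usage on a single weight vector with a made-up gradient.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
w, v = sgd_momentum_step(w, v, grad=np.array([2.0, 0.5]))
print(w, v)
```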
We also use dropout and various forms of data augmentation to regularize our network and combat overfitting. We apply dropout to the fully-connected layer with a probability of 0.5 (i.e. each neuron’s output is set to zero with probability 0.5). For data augmentation, we apply a random transformation to each input image consisting of: translations, horizontal flips, rotations, scaling, and pixel intensity augmentation. All of our models were trained using the anna software library (https://github.com/ifp-uiuc/anna).
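A minimal NumPy sketch of two of these regularizers is given below: dropout with probability 0.5, and an augmentation that combines a random horizontal flip with a random translation. The rescaling of surviving units ("inverted" dropout), the zero padding, and the shift range `max_shift` are assumptions made for illustration; rotations, scaling, and intensity changes are omitted.

```python
import numpy as np

def dropout(h, p=0.5, rng=None):
    """Zero each unit with probability p; survivors are rescaled (assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

def random_flip_translate(img, max_shift=4, rng=None):
    """Random horizontal flip followed by a zero-padded integer translation."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < 0.5:
        img = img[:, ::-1]                        # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = img.shape
    padded = np.pad(img, max_shift, mode="constant")
    top, left = max_shift + dy, max_shift + dx    # crop window offset
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(0)
augmented = random_flip_translate(np.ones((96, 96)), rng=rng)
print(augmented.shape)
```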
We use two facial expression datasets in our experiments: the extended Cohn-Kanade database (CK+) and the Toronto Face Dataset (TFD). The CK+ database contains 327 image sequences, each of which is assigned one of 7 expression labels: anger, contempt, disgust, fear, happy, sad, and surprise. For fair comparison, we follow the protocol used by previous works [15, 17], and use the first frame of each sequence as a neutral frame in addition to the last three expressive frames to form our dataset. This leads to a total of 1308 images (4 frames from each of the 327 sequences) and 8 classes. We then split the frames into 10 subject-independent subsets, following prior work, and perform 10-fold cross-validation.
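The frame count and the idea of a subject-independent split can be sketched as below. The random assignment of subjects to folds is an assumption made for illustration and is not necessarily the split used in the cited protocol; `subject_independent_folds` is a hypothetical helper.

```python
import numpy as np

# 1 neutral frame + 3 expressive frames per sequence.
n_sequences, frames_per_sequence = 327, 1 + 3
n_images = n_sequences * frames_per_sequence  # 1308

def subject_independent_folds(subject_ids, n_folds=10, seed=0):
    """Assign every frame of a subject to the same fold (assumed random split)."""
    ids = np.asarray(subject_ids).tolist()
    subjects = np.unique(ids)
    np.random.default_rng(seed).shuffle(subjects)
    fold_of = {s: i % n_folds for i, s in enumerate(subjects.tolist())}
    return np.array([fold_of[s] for s in ids])

# Toy usage: 25 made-up subjects with 4 frames each.
toy_subjects = np.repeat(np.arange(25), 4)
folds = subject_independent_folds(toy_subjects)
print(n_images, np.bincount(folds))
```

Because folds are assigned per subject rather than per frame, no subject appears in both the training and the test portion of any fold.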
TFD is an amalgamation of several facial expression datasets. It contains 4178 images annotated with one of 7 expression labels: anger, disgust, fear, happy, neutral, sad, and surprise. The labeled samples are divided into 5 folds, each containing a train, validation, and test set. We train all of our models using just the training set of each fold, pick the best performing model using each split’s validation set, then we evaluate on each split’s test set and average the results over all 5 folds.
In both datasets, the images are grayscale and are of size 96x96 pixels. In the case of TFD, the faces have already been detected and normalized such that all of the subjects’ eyes are the same distance apart and have the same vertical coordinates. Meanwhile for the CK+ dataset, we simply detect the face in the 640x480 image and resize it to 96x96. The only other pre-processing we employ is patch-wise mean subtraction and scaling to unit variance.
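The normalization step can be sketched as follows. The sketch treats each 96x96 image as the unit of normalization, which is an assumption about what "patch-wise" means here; the small constant `eps` is also an assumption, added to avoid division by zero.

```python
import numpy as np

def normalize_image(img, eps=1e-8):
    """Subtract the image mean and scale to unit variance (per-image assumption)."""
    img = np.asarray(img, dtype=np.float64)
    return (img - img.mean()) / (img.std() + eps)

# Toy usage on a random grayscale image with values in [0, 255].
rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(96, 96))
normalized = normalize_image(face)
print(normalized.mean(), normalized.std())
```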
First, we analyze the discriminative ability of the CNN by assessing its performance on the TFD dataset. Table 1 shows the recognition accuracy obtained when training a zero-bias CNN from a random initialization with no other regularization as well as CNNs that have dropout (D), data augmentation (A) or both (AD). We also include recognition accuracies from previous methods. From the results in Table 1, there are two main observations: (i) not surprisingly, regularization significantly boosts performance, and (ii) data augmentation improves performance over the regular CNN more than dropout does. Furthermore, when both dropout and data augmentation are used, our model is able to exceed the previous state-of-the-art performance on TFD.
|Zero-bias CNN||78.2% ± 5.7%|
|Zero-bias CNN+D||82.3% ± 4.0%|
|Zero-bias CNN+A||94.6% ± 3.3%|
|Zero-bias CNN+AD||95.1% ± 3.1%|
|Zero-bias CNN+AD||95.7% ± 2.5%|
We now present our results on the CK+ dataset. The CK+ dataset usually contains eight labels (anger, contempt, disgust, fear, happy, neutral, sad, and surprise). However, many works [34, 24, 17] ignore the samples labeled as neutral or contempt, and only evaluate on the six basic emotions. Therefore, to ensure fair comparison, we trained two separate models. We present the eight class model results in Table 2 and the six class model results in Table 3. For the eight class model, we conduct the same study we did on the TFD and we observe rather similar results. Once again, regularization appears to play a significant role in obtaining good performance. Data augmentation gives a significant boost in performance, as does its combination with dropout. For the eight class and six class models, we achieve state-of-the-art and near state-of-the-art accuracy respectively on the CK+ dataset.
Now, with a strong discriminative model in hand, we will analyze which facial regions the neural network identifies as the most discriminative when performing classification. To do this, we employ the visualization technique presented by Zeiler and Fergus.
For each dataset, we consider the third convolutional layer and for each filter, we find the N images in the chosen split’s training set that generated the strongest magnitude response. We then leave the strongest neuron high and set all other activations to zero and use the deconvolutional network to reconstruct the region in pixel space. For our experiments, we chose N=10 training images.
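The selection and masking steps described above can be sketched in NumPy as follows. The array layout `(num_images, num_filters, H, W)` and the helper names are assumptions for illustration; the deconvolutional reconstruction itself is not shown.

```python
import numpy as np

def top_n_images(acts, filter_idx, n=10):
    """Indices of the n images whose strongest response for one filter is largest.

    acts: array of shape (num_images, num_filters, H, W) (assumed layout).
    """
    per_image = np.abs(acts[:, filter_idx]).reshape(acts.shape[0], -1)
    strength = per_image.max(axis=1)              # strongest magnitude per image
    return np.argsort(strength)[::-1][:n]

def keep_strongest(layer_acts, filter_idx):
    """Keep only the strongest activation of one filter; zero everything else.

    layer_acts: array of shape (num_filters, H, W) for a single image.
    """
    out = np.zeros_like(layer_acts)
    fmap = layer_acts[filter_idx]
    pos = np.unravel_index(np.argmax(np.abs(fmap)), fmap.shape)
    out[(filter_idx,) + pos] = fmap[pos]
    return out

# Toy usage with random activations for 50 images and 256 filters.
rng = np.random.default_rng(0)
acts = rng.random((50, 256, 6, 6))
best = top_n_images(acts, filter_idx=3, n=10)
masked = keep_strongest(acts[best[0]], filter_idx=3)
print(best, np.count_nonzero(masked))
```

The masked array is what would be passed back through the deconvolutional network to obtain the pixel-space pattern for that image and filter.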
We further refine our reconstructions by employing a technique called "Guided Backpropagation" proposed by Springenberg et al. "Guided Backpropagation" aims to improve the reconstructed spatial patterns by not solely relying on the masked activations given by the top-level signal during deconvolution but by also incorporating knowledge of which activations were suppressed during the forward pass. Therefore, each layer’s output during the deconvolution stage is masked twice: (i) once by the ReLU of the deconvolutional layer and (ii) again by the mask generated by the ReLU of the layer’s matching convolutional layer in the forward pass.
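The double masking can be written compactly for a single ReLU layer. The sketch below is a minimal NumPy illustration of the rule, not the code used in the experiments; `guided_relu_backward` is a hypothetical helper name.

```python
import numpy as np

def relu_forward(x):
    """Standard ReLU used in the forward pass."""
    return np.maximum(x, 0.0)

def guided_relu_backward(top_signal, forward_input):
    """Backward pass through a ReLU under guided backpropagation.

    The signal is kept only where (i) the signal coming from above is
    positive and (ii) the forward-pass input to the ReLU was positive.
    """
    deconv_mask = top_signal > 0        # ReLU of the deconvolutional layer
    forward_mask = forward_input > 0    # mask from the matching forward ReLU
    return top_signal * deconv_mask * forward_mask

x = np.array([-1.0, 2.0, 3.0])          # forward-pass inputs to the ReLU
top = np.array([5.0, -4.0, 6.0])        # signal arriving from the layer above
print(guided_relu_backward(top, x))     # only the last entry survives
```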
First, we will analyze patterns discovered in the Toronto Face Dataset (TFD). In Figure 3, we select 10 of the 256 filters in the third convolutional layer and for each filter, we present the spatial patterns of the top-10 images in the training set. From these images, the reader can see that several of the filters appear to be sensitive to regions that align with several of the Facial Action Units such as: AU12: Lip Corner Puller (row 1), AU4: Brow Lowerer (row 4), and AU15: Lip Corner Depressor (row 9).
Next, we display the patterns discovered in the CK+ dataset. In Figure 4, we, once again, select 10 of the 256 filters in the third convolutional layer and for each filter, we present the spatial patterns of the top-10 images in the training set. The reader will notice that the CK+ discriminative spatial patterns are very clearly defined and correspond nicely with Facial Action Units such as: AU12: Lip Corner Puller (rows 2, 6, and 9), AU9: Nose Wrinkler (row 3) and AU27: Mouth Stretch (row 8).
|1||AU25: Lips Part|
|2||AU12: Lip Corner Puller|
|3||AU9: Nose Wrinkler|
|4||AU5: Upper Lid Raiser|
|5||AU17: Chin Raiser|
|6||AU12: Lip Corner Puller|
|7||AU24: Lip Pressor|
|8||AU27: Mouth Stretch|
|9||AU12: Lip Corner Puller|
|10||AU1: Inner Brow Raiser|
In addition to categorical labels (anger, disgust, etc.), the CK+ dataset also contains labels that denote which FAUs are present in each image sequence. Using these labels, we now present a preliminary experiment to verify that the filter activations/spatial patterns learned by the CNN indeed match with the actual FAUs shown by the subject in the image. Our experiment aims to answer the following question: For a particular filter i, which FAU j has samples whose activation values most strongly differ from the activations of samples that do not contain FAU j, and does that FAU accurately correspond with the visual spatial patterns that maximally excite filter i?
Given a training set of M images ($X = \{x_1, \dots, x_M\}$) and their corresponding FAU labels ($Y = \{y_1, \dots, y_M\}$), let $F_{\ell i}(x)$ be the activations of sample $x$ at layer $\ell$ for filter $i$. Since we are examining the 3rd convolutional layer in the network, we set $\ell = 3$. Then, for each of the 10 filters visualized in Figure 4, we do the following:
We consider a particular FAU j and place the samples that contain j in set S where: $S = \{x_m \mid j \in y_m\}, \; m \in \{1, \dots, M\}$
We then build a histogram of the maximum activations of the samples that contained FAU j: $Q_{ij} = P(\max F_{3i}(x) \mid x \in S)$
We then, similarly, build a distribution over maximum activations of the samples that do not contain FAU j: $R_{ij} = P(\max F_{3i}(x) \mid x \notin S)$
We compute the KL divergence between $Q_{ij}$ and $R_{ij}$, $D_{KL}(Q_{ij} \,\|\, R_{ij})$, and repeat the process for all of the other FAUs.
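The procedure above can be sketched in NumPy as follows: split the per-image maximum activations of one filter by whether the image contains a given FAU, histogram both groups on shared bins, and compute the KL divergence between the two histograms. The number of bins, the additive smoothing `eps`, and the direction of the divergence (with-FAU relative to without-FAU) are assumptions; `kl_between_groups` is a hypothetical helper.

```python
import numpy as np

def kl_between_groups(max_acts, has_fau, bins=20, eps=1e-8):
    """KL divergence between max-activation histograms with and without an FAU."""
    max_acts = np.asarray(max_acts, dtype=float)
    has_fau = np.asarray(has_fau, dtype=bool)
    edges = np.histogram_bin_edges(max_acts, bins=bins)   # shared bin edges
    q, _ = np.histogram(max_acts[has_fau], bins=edges)    # samples with the FAU
    r, _ = np.histogram(max_acts[~has_fau], bins=edges)   # samples without it
    q = (q + eps) / (q + eps).sum()                       # smooth and normalize
    r = (r + eps) / (r + eps).sum()
    return float(np.sum(q * np.log(q / r)))

# Toy usage: made-up activations where FAU-positive samples respond more strongly.
rng = np.random.default_rng(0)
labels = rng.random(500) < 0.3
activations = rng.normal(loc=np.where(labels, 3.0, 1.0), scale=0.5)
print(kl_between_groups(activations, labels))
```

Repeating this for every FAU and taking the one with the largest divergence gives, for each filter, the FAU whose presence most changes that filter's response.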
Figure 5 shows the bar charts of the KL divergences computed for all of the FAUs for each of the 10 filters displayed in Figure 4. The FAU with the largest KL divergence value is denoted in red and its corresponding name is documented in Table 4 for each filter. From these results, we see that in the majority of the cases, the FAUs listed in Table 4 match the facial regions visualized in Figure 4. This means that the samples that appear to strongly influence the activations of these particular filters are indeed those that possess the AU shown in the corresponding filter visualizations. Thus, we show that certain neurons in the neural network implicitly learn to detect specific FAUs in face images when given a relatively "loose" supervisory signal (i.e. emotion type: anger, happy, sad, etc.).
What is most encouraging is that these results appear to confirm our intuitions about how CNNs work as appearance-based classifiers. For instance, filters 2, 6, and 9 appear to be very sensitive to patterns that correspond to AU 12. This is not surprising, as AU 12 (Lip Corner Puller) is almost always associated with smiles and, as the visualizations in Figure 4 show, a subject often shows their teeth when smiling, a highly distinctive appearance cue. Similarly, for filter 8, it is not surprising that FAU 25 (Lips Part) and FAU 27 (Mouth Stretch) had the most different activation distributions, given that the filter’s spatial patterns corresponded to the "O" shape made by the mouth region in surprised faces, another visually salient cue.
In this work, we showed both qualitatively and quantitatively that CNNs trained to do emotion recognition are indeed able to model high-level features that strongly correspond to FAUs. Qualitatively, we showed which portions of the face yielded the most discriminative information by visualizing the spatial patterns that maximally excited different filters in the convolutional layers of our learned networks. Meanwhile, quantitatively, we correlated the numerical activations of the visualized filters with the subject’s actual facial movements using the FAU labels given in the CK+ dataset. Finally, we demonstrated how a zero-bias CNN can achieve state-of-the-art recognition accuracy on the extended Cohn-Kanade (CK+) dataset and the Toronto Face Dataset (TFD).
This work was supported in part by MIT Lincoln Laboratory. The Tesla K40 GPU used for this research was donated by the NVIDIA Corporation. The authors would also like to thank Dr. Kevin Brady, Dr. Charlie Dagli, Professor Yun Fu, and Professor Usman Tariq for their insightful comments and suggestions with regards to this work.