Coherence Constraints in Facial Expression Recognition

10/17/2018, by Lisa Graziani et al., Università di Siena and UNIFI

Recognizing facial expressions from static images or video sequences is a widely studied but still challenging problem. The recent progresses obtained by deep neural architectures, or by ensembles of heterogeneous models, have shown that integrating multiple input representations leads to state-of-the-art results. In particular, the appearance and the shape of the input face, or the representations of some face parts, are commonly used to boost the quality of the recognizer. This paper investigates the application of Convolutional Neural Networks (CNNs) with the aim of building a versatile recognizer of expressions in static images that can be further applied to video sequences. We first study the importance of different face parts in the recognition task, focussing on appearance and shape-related features. Then we cast the learning problem in the Semi-Supervised setting, exploiting video data, where only a few frames are supervised. The unsupervised portion of the training data is used to enforce three types of coherence, namely temporal coherence, coherence among the predictions on the face parts and coherence between appearance and shape-based representation. Our experimental analysis shows that coherence constraints can improve the quality of the expression recognizer, thus offering a suitable basis to profitably exploit unsupervised video sequences. Finally we present some examples with occlusions where the shape-based predictor performs better than the appearance one.


1 Introduction

Facial expression recognition is the problem of detecting emotions in facial images or videos. Research on this problem involves the psychology community as well as the computer science and artificial intelligence communities. Although this task is widely studied and much progress has been made, it still remains a challenging problem, due to the variability and complexity of facial expressions. As a matter of fact, facial expressions can be categorized with respect to multiple classes of emotions. The most widely followed approach, due to the studies of Paul Ekman [2], considers six basic emotions plus the neutral case, while other scientists provided more fine-grained descriptions [14]. Facial features of expressions are mostly located around the mouth, the nose, and the eyes, and their locations are essential in explaining and categorizing expressions [1]. Despite the large number of advanced psychological experiments on human perception and recognition of emotions, it is easy to see that different face parts have a different impact on the way humans recognize emotions: consider, for example, the role of the eyebrows when we are angry, or the way we move our mouth when we are happy or surprised.

We can find several approaches that exploit Machine Learning with the aim of learning to categorize emotions from examples. Most of them use still images [10, 13], while several more recent works also consider video sequences where actors start with a neutral expression and generate a non-neutral one [9, 16]. The learning framework is usually fully supervised, and supervision is either about each training image or about each video sequence. Works that exploit video data focus on the importance of the temporal evolution of the input face. The system proposed by Fan and Tjahjadi [3] processes four sub-regions of the face: forehead, eyes/eyebrows, nose and mouth. They used an extension of the spatial pyramid histogram of gradients and dense optical flow to extract spatial and dynamic features from video sequences, and adopted a multi-class SVM-based classifier with a one-to-one strategy to recognise facial expressions. Jung et al. [7] propose a neural-network-based method where two different networks are exploited: the first one extracts appearance features from image sequences, learning temporal correlations, while the other network extracts shape features from a set of facial landmarks. The two nets are combined to yield the final decision on the emotion class. Happy and Routray [5] identify salient areas with generalized discriminative features for expression classification. They only use appearance-based features, and they do not consider the time domain. The framework of Jain et al. [6] recognizes facial expressions from video sequences by modeling temporal variations within shapes. They show that shape provides important information that is sometimes hard to grasp from appearance only. Zhang et al. [16] propose a mixed model which includes a “temporal” and a “spatial” network. The former captures dynamic features from consecutive frames, while the latter extracts static features from still frames. More generally, we can roughly characterize the popular trends in the existing literature by the usage of (i.) appearance-related (i.e., visual) features, (ii.) shape-related features, (iii.) features from face parts, (iv.) the temporal domain (i.e., video data).

This paper investigates the application of a pool of Convolutional Neural Networks (CNNs) with the aim of building recognizers of expressions in static images, that can be further applied to video sequences. We consider both (i.) appearance and (ii.) shape features, but, differently from most of the existing works, we do not hand-engineer shape features, and we let the CNNs learn the right representations from special shape-only images. We show that shape-based representation can help the expression recognition when there are some occlusions on the face. We propose a model that considers (iii.) sub-parts of the face in addition to the entire face, motivated by the need of gaining deeper insights in the role of each component. Then, we move to the Semi-Supervised setting, exploiting (iv.) video data. The unsupervised portion of the training data is used to enforce “temporal coherence” among consecutive frames, “part coherence” in each frame, i.e., a coherent prediction among the CNNs that operate on the different face parts, and coherence between appearance and shape-based representation for each face part. Our experimental analysis shows that coherence constraints can improve the quality of the expression recognizer, thus offering a suitable basis to profitably exploit unsupervised video sequences.

Finally, we present some examples to show that the shape-based representation can help to detect the right expression when the appearance-based representation fails, such as in the presence of occlusions of some face parts, like the mouth or the nose.

This paper is organized as follows. The next Section formalizes the problem of facial expression recognition. Section 3 introduces our model. The role of coherence is described in Section 4, while experiments are collected in Section 5. Section 6 reports some experiments on frames with occlusions, and Section 7 draws conclusions and outlines future work.

2 Facial Expression Recognition

The task of facial expression recognition that we consider in this paper consists in building a classifier that predicts one of the six universal emotions [2], that are anger, disgust, fear, happiness, sadness, surprise, plus the neutral case, and that we collect into the set $\mathcal{E}$ of seven classes, codified with integer indices. The most popular inputs of the recognizer are images of faces, represented in foreground, usually with frontal orientation. When video data is considered, the recognition problem focusses on short video clips where a transition from the neutral state toward one of the six emotions is recorded. Processing videos instead of still images can improve the recognition performance, because facial expressions involve variations of the facial muscles along the temporal dimension. However, classifiers that are specifically trained to build a latent representation from a video clip before taking a decision [7] cannot be immediately applied to classify images. Differently, image-based classifiers can process the single frames $I_t$ of a video ($t$ being the time index) to produce a final decision over a time window, so they are more versatile from the point of view of ease of deployment in different real-world applications. The facial expression recognition problem is usually faced in the “Fully-Supervised” setting, and, in the case of videos, the available datasets are composed of labeled video clips where we do not have access to the labelings of the single frames (see, e.g., CK+ http://www.consortium.ri.cmu.edu/ckagree/, Oulu-CASIA http://www.cse.oulu.fi/CMV/Downloads/Oulu-CASIA, MMI https://mmifacedb.eu/). Nonetheless, obtaining supervised data is costly, while nowadays it is pretty easy to have access to collections of unsupervised frontal view faces (web, social networks, smartphones, …) or unsupervised video recordings (video conference/call applications). This suggests that studying the “Semi-Supervised” setting, where a portion of the training data is labeled and a larger portion is unsupervised, can be a promising way to approach the recognition task.

Motivated by the need of building a versatile emotion recognition system, we focus on a predictor that operates on still images and that we can use to make predictions on video data. The system can be trained exploiting both video and image data in a Semi-Supervised setting, taking advantage of the temporal evolution described by the video format. In detail, we consider a classifier $c(\cdot)$ that produces a decision for each input image $I$, or for a set of consecutive frames belonging to a time window $W$ (that covers a video clip, for example),

$e = c(I)$, (1)
$e = \mathrm{majority}\left(\{c(I_t),\ t \in W\}\right)$, (2)

where majority is the majority-voting function, that returns the most frequent prediction in the time window $W$. Differently from the existing approaches, our system can be trained using labeled and unlabeled image databases, collected in $\mathcal{I}$, or labeled and unlabeled frames extracted from the previously described labeled video sequences, collected in $\mathcal{V}$. Due to the aforementioned properties of the existing video datasets (containing transitions from neutral to a certain emotion), we can artificially generate $\mathcal{V}$ by labeling as neutral the very first frames of each video clip, and by assigning the provided video label to the last frames of the sequence. The frames in the internal portion of the sequence are not labeled. Formally, we have

$\mathcal{I} = \{(I_k, y_k),\ k = 1, \ldots, \ell\} \cup \{I_k,\ k = \ell+1, \ldots, n\}$,

where $y_k$ is the image label, and the rightmost set is fully unlabeled. Then,

$\mathcal{V} = \{V_j,\ j = 1, \ldots, v\}$,

where $v$ is the number of available video clips and $V_j$ is a sequence extracted from the $j$-th clip,

$V_j = \{(I_{(j,t)}, \text{neutral}),\ t \leq \eta\} \oplus \{I_{(j,t)},\ \eta < t \leq T_j - \eta'\} \oplus \{(I_{(j,t)}, y_j),\ t > T_j - \eta'\}$,

$\oplus$ being the sequence concatenation operator, $I_{(j,t)}$ the $t$-th frame of the $j$-th video (of length $T_j$), and $\eta$, $\eta'$ arbitrarily chosen. In this case $y_j$ is the label provided with the video clip (neutral is the identifier of the neutral class). We notice that $\mathcal{V}$ is more informed than $\mathcal{I}$, since it also stores the image/frame order and the frame grouping with respect to the videos. For this reason, we can consider $\mathcal{I}$ to be an instance of the more general representation $\mathcal{V}$, and in the rest of the paper we will focus on data represented as in $\mathcal{V}$ without reducing the generality of what we described so far, and we will compactly indicate it with $\mathcal{D}$.
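To make the labeling scheme concrete, the following minimal Python sketch turns a labeled clip into the partially labeled frame sequence described above; the counts of initial and final labeled frames and the neutral-class identifier are hypothetical parameters, not values from the paper.

```python
# Minimal sketch of the clip-labeling scheme described above (assumed parameters).
NEUTRAL = 0  # identifier of the neutral class (assumption)

def label_clip_frames(frames, clip_label, n_neutral=2, n_emotion=2):
    """Return (frame, label) pairs; label is None for the unsupervised middle part."""
    labeled = []
    for t, frame in enumerate(frames):
        if t < n_neutral:
            y = NEUTRAL                      # first frames: neutral expression
        elif t >= len(frames) - n_emotion:
            y = clip_label                   # last frames: peak of the clip emotion
        else:
            y = None                         # internal frames: unlabeled
        labeled.append((frame, y))
    return labeled
```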

3 Model

Our model is based on CNNs that process two categories of representations of the input image/frame $I$. Such categories consist in appearance-based (i.e., visual) representations and shape-based representations.

In both cases, we do not consider the whole image $I$, but only the rectangular area that is covered by the target face. We localize the face first, and then we crop the image accordingly. This choice is crucial when processing inputs with multiple faces or when the face is not well positioned at the center of the image (or, more generally, at a position incoherent with the training data). The appearance-based representation of the face is simply a grayscale instance of the cropped face. In the case of the shape-based representation, we still focus on the same cropped region, but we extract a set of shape features that essentially describe the contours of the face parts, and that, in this work, consist of a set of facial landmark points. However, instead of stacking their 2D coordinates into a vector (which is only possible if the set of points is consistent among different faces), we consider a more generic approach in which the shape is simply represented by an artificial image with uniform background, in which the landmark points are depicted at their coordinates. This allows us to treat the shape in a way that is similar to what we do with the appearance, and it opens the possibility of providing different shape “sketches” that are not only based on landmark points (but also on contour lines, for example).

In order to study the effects of the different face parts in the recognition process, we computed the appearance and shape representations for the face (as just described) and for all the face parts: mouth, nose, eyes, eyebrows. We localized the face area and a set of 68 landmark points using the localizer of Viola and Jones [15] and a landmark detector [8] (we used OpenCV https://opencv.org/ and the “dlib” library http://dlib.net/). The detector uses the classic Histogram of Oriented Gradients (HOG) features combined with a linear classifier, an image pyramid, and a sliding window detection scheme. Cropping around each set of part-related landmarks (adding a small padding), we obtained 7 instances of appearance-based representations of the input and 8 shape-based ones, since in the case of shape we also included the landmarks associated to the jaw contour. Figure 1 shows the overall representations that we generate. We resized each representation to a fixed, part-specific resolution (the face, mouth, eye, eyebrow, nose, and jaw areas each have their own size in pixels).

Figure 1: Representations extracted from an input image. On the left there are the 7 appearance-based representations. On the right there are the 8 shape-based representations, that we implement by sketching landmark points in artificial images.
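As a concrete illustration of this extraction pipeline, here is a minimal sketch using OpenCV's Haar-cascade (Viola-Jones) face detector and dlib's 68-point landmark predictor, which the paper mentions using; the function name, the output resolution, the model file path, and the cropping details are our own assumptions.

```python
# A minimal sketch, assuming OpenCV's Haar face detector and dlib's 68-point
# landmark model (shape_predictor_68_face_landmarks.dat) are available locally.
# It crops the face and renders the shape-only "sketch" image used as input.
import cv2
import dlib
import numpy as np

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
landmark_predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def appearance_and_shape(gray_image, out_size=(64, 64)):
    # Detect the face and refine it with the 68 facial landmarks
    x, y, w, h = face_detector.detectMultiScale(gray_image, 1.1, 5)[0]
    rect = dlib.rectangle(int(x), int(y), int(x + w), int(y + h))
    landmarks = landmark_predictor(gray_image, rect)
    points = [(landmarks.part(i).x, landmarks.part(i).y) for i in range(68)]

    # Appearance: grayscale crop of the face area
    appearance = cv2.resize(gray_image[y:y + h, x:x + w], out_size)

    # Shape: artificial image with uniform background and landmark points drawn on it
    sketch = np.zeros_like(gray_image)
    for px, py in points:
        cv2.circle(sketch, (px, py), 1, 255, -1)
    shape = cv2.resize(sketch[y:y + h, x:x + w], out_size)
    return appearance, shape
```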

We implemented a pool of 15 CNNs, each of them processing one of the aforementioned representations (Figure 2). The generic CNN associated to the $i$-th representation has two convolutional layers followed by max pooling, and some fully connected layers terminated with a softmax activation that outputs a probability distribution over the emotions in $\mathcal{E}$. We indicate with $p_i(\cdot)$ the function computed by such CNN. All the hidden neural units have ReLU activation functions. The face-related CNNs have 32 and 64 filters on the two convolutional layers, respectively, and two fully connected layers (the first one with 64 neurons). The other CNNs, that are based on inputs with smaller sizes, exploit 16 and 32 filters, and a single fully connected layer.

The output of each of the 15 CNNs, when followed by an arg max operation (assuming 1-based indexing), is a possible instance of the function $c$ in Eq. (1) and Eq. (2). Formally, for a given $i$,

$c(I) = \arg\max_{j} \left[ p_i(x^i) \right]_j$,

where $x^i$ is the $i$-th representation of the input $I$, and $p_i(\cdot)$ outputs a vector of size 7 that sums to 1. Even if our final goal is to focus on the case in which $i$ is the index of the full-face-based classifier, in Section 5 we will evaluate the quality of multiple instances of $c$, considering the predictors on the face parts too. In the next Section we will introduce a link between the full face and the face parts.

Figure 2: Structure of CNNs employed.
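The following sketch mirrors the architecture just described in tf.keras; the filter counts and the 64-unit fully connected layer follow the text, while the kernel sizes, pooling sizes, and input resolution are assumptions of ours.

```python
# A minimal tf.keras sketch of one per-representation CNN (face-level variant).
import tensorflow as tf

NUM_CLASSES = 7  # six emotions plus the neutral case

def build_part_cnn(input_shape=(64, 64, 1), filters=(32, 64), fc_units=64):
    """Two conv + max-pooling blocks, ReLU activations, FC layer, softmax output."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(filters[0], 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(filters[1], 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(fc_units, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # p_i(.) over emotions
    ])

face_cnn = build_part_cnn()                     # 32/64 filters, as in the text
mouth_cnn = build_part_cnn(filters=(16, 32))    # smaller parts use 16/32 filters
```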

4 Learning by Enforcing Coherence

We trained the pool of CNNs by minimizing an objective function involving the cross-entropy between the outputs of the networks and the available labels (one-hot encoding), considering the training data $\mathcal{D}$. The cross-entropy only exploits the labeled pairs in $\mathcal{D}$. However, our objective function also includes the penalties associated to the fulfilment of “coherence constraints” that we enforce on all the samples of $\mathcal{D}$, being them labeled or not. We have considered three types of coherence, namely “temporal coherence”, “coherence among the predictions on the face parts” and “coherence between appearance and shape”. The first one enforces the CNNs to be coherent over time for each video sequence, i.e., it enforces the predictions to smoothly change along the time axis. This constraint introduces a regularizing effect, since it prevents the system from developing unstable models that abruptly change their decisions among consecutive frames (we remark that the enforcement of the coherence constraints only happens at training time). The part-based coherence enforces the full-face-representation-based classifier to take decisions that are coherent with the ones taken (on average) by the other part-based classifiers (and vice-versa). The idea behind this constraint is that the committee of the local (i.e., part-based) predictors could provide important fine-grained information that the global (face-based) predictor might not have been able to capture. The coherence between appearance and shape enforces the prediction of the appearance-based classifier to be coherent with the prediction of the shape-based classifier for each part (excluding the jaw).

We already experimented with some related constraints in the case of multi-view object recognition [12], and these ideas are borrowed from the generic framework of “Learning from Constraints” [4], where a predictor is constrained exploiting high-level knowledge on the task at hand, bridging the symbolic and sub-symbolic worlds.

In detail, given three scalars $\lambda_t$, $\lambda_p$, $\lambda_c$ that weigh the importance of the coherence (soft) constraints, we define our objective function as the sum of the contributions (cross-entropy, temporal coherence, part coherence) of the appearance and shape representations and of the coherence between appearance and shape. We write each contribution for the appearance-based representation (for the shape-based one it is equivalent):

$C^{ce} = \sum_{i} \sum_{k \in \mathcal{S}} w_k \, H\!\left(y_k, p_i(x_k^i)\right)$,
$C^{t} = \sum_{i} \sum_{j} \sum_{t} \left(1 - p_i(x_{(j,t)}^i)^{\top} \, p_i(x_{(j,t+1)}^i)\right)$,
$C^{p} = \sum_{k} \left(1 - p_{face}(x_k^{face})^{\top} \, \tfrac{1}{m-1} \sum_{i \neq face} p_i(x_k^i)\right)$. (3)

The index $i$ spans over the 7 appearance-based classifiers (or the 8 shape-based classifiers), whose number we indicate with $m$. The index $k$ spans over all the pairs in $\mathcal{D}$ (whose $i$-th representation we denote with $x_k^i$), and, for the sake of simplicity, we used the notation $k \in \mathcal{S}$ to indicate that we consider only the labeled examples, with one-hot labels $y_k$ and cross-entropy $H(\cdot,\cdot)$. The scalar weights $w_k$ are used to give custom weights to the examples, and we used them to give more importance to the classes that are less represented in $\mathcal{D}$. The notation $(j,t)$ is the index of the $t$-th frame in the $j$-th video sequence belonging to $\mathcal{D}$. Finally, $face$ is used to indicate the index associated with the full-face input, and $(\cdot)^{\top}$ is the transpose operator.

We notice that since $p_i(\cdot)$ is a probability distribution, the dot products involving two instances of $p_i$ reach their maximum value of 1 when such instances are equivalent one-hot distributions (and the coherence constraints are fulfilled). The temporal constraint involves dot products between the predictions on pairs of consecutive frames in the same video clip. We kept the same structure to build the part-based constraint, where the averaging operation on the part-based classifiers is evident when $\tfrac{1}{m-1}$ is moved right before the second term of the dot product.

Then we define the loss for the appearance-based representation (equivalently for the shape-based one) as the weighted sum of the three contributions defined above:

$L_{app} = C^{ce}_{app} + \lambda_t \, C^{t}_{app} + \lambda_p \, C^{p}_{app}$.

Now we introduce the coherence between appearance and shape:

$C^{as} = \sum_{i \neq jaw} \sum_{k} \left(1 - p_i^{app}(x_k^{i})^{\top} \, p_i^{sh}(x_k^{i})\right)$, (4)

where $p_i^{app}$ and $p_i^{sh}$ are the appearance-based and shape-based classifiers of the $i$-th part. We excluded the jaw, because we do not have an appearance representation of it. The final loss is the sum of all the contributions just defined:

$L = L_{app} + L_{sh} + \lambda_c \, C^{as}$. (5)
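A TensorFlow sketch of our reading of these penalties is given below; the exact weighting and normalization used in the paper may differ from this reconstruction, and all function names are ours.

```python
# Minimal sketch of the coherence penalties of Eq. (3)-(5); `*_preds` are softmax outputs.
import tensorflow as tf

def weighted_cross_entropy(one_hot_labels, preds, example_weights):
    # Supervised term, computed only on the labeled examples
    ce = tf.keras.losses.categorical_crossentropy(one_hot_labels, preds)
    return tf.reduce_sum(example_weights * ce)

def temporal_coherence(preds_t, preds_t_plus_1):
    # 1 - p(t)^T p(t+1), summed over consecutive frame pairs of the same clip
    return tf.reduce_sum(1.0 - tf.reduce_sum(preds_t * preds_t_plus_1, axis=-1))

def part_coherence(face_preds, part_preds_list):
    # 1 - p_face^T (average of the part-based predictions)
    avg_parts = tf.reduce_mean(tf.stack(part_preds_list, axis=0), axis=0)
    return tf.reduce_sum(1.0 - tf.reduce_sum(face_preds * avg_parts, axis=-1))

def appearance_shape_coherence(app_preds, shape_preds):
    # 1 - p_app^T p_shape, per face part (jaw excluded upstream)
    return tf.reduce_sum(1.0 - tf.reduce_sum(app_preds * shape_preds, axis=-1))
```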

5 Experimental Results

In order to validate our model, we used the popular Extended Cohn-Kanade dataset (CK+) [11]. It consists of 593 short video sequences, in which 120 subjects (of different ages and genders) generate expressions belonging to the following list: anger, contempt, disgust, fear, happiness, sadness and surprise. We excluded the sequences associated to “contempt”, which is not included in the six universal emotions. The video sequences are composed of 10-60 frames; they start with a neutral expression and they end with the peak of one of the previously listed expressions. Each sequence is associated with an emotion label.

In order to build the Semi-Supervised set described in Section 2, we selected the values of $\eta$ and $\eta'$, i.e., the numbers of initial frames labeled as neutral and of final frames labeled with the clip emotion. We generated 5 randomizations of the whole dataset, and divided each of them into training, validation, and test sets, keeping the original distribution of the classes in each set. The validation data was used to validate the model parameters and excluded from training. The test partition was used to measure the quality of the model, and the results presented in this Section are averaged over the 5 test partitions (when available, we also report the standard deviation in brackets). Each collection of training data consists of frames organized into sequences, out of which only a portion is labeled, and the validation data is organized analogously. Since examples from the “neutral” class are much more represented than the other examples, we set the weight $w_k$ in Eq. (3) to a smaller value when the $k$-th example belongs to the neutral class than when it belongs to the other classes. Initially we excluded the coherence between appearance and shape, setting its weight $\lambda_c$ to zero. We selected the optimal $\lambda_t$ and $\lambda_p$ by a grid-search, measuring frame-level accuracy (i.e., only the labeled validation frames are considered). We implemented our model using TensorFlow, and we minimized Eq. (5) by the Adam-based optimizer (with a fixed starting learning rate), mini-batches of size 96, and we trained the model for multiple epochs, stopping the procedure when the validation error started increasing.
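The optimization setup can be sketched as follows; the Adam optimizer, the mini-batch size of 96, and early stopping on the validation error come from the text, while the loop structure, the default learning rate, and all function names are our own assumptions (the `total_loss` callable stands for Eq. (5) built from the terms sketched in Section 4).

```python
# Minimal sketch of the training procedure described above (assumed helper names).
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()   # starting learning rate left at its default here
BATCH_SIZE = 96

def train_step(models, batch, total_loss):
    variables = [v for m in models for v in m.trainable_variables]
    with tf.GradientTape() as tape:
        loss = total_loss(models, batch)          # Eq. (5) on the current mini-batch
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss

def fit(models, batches_fn, total_loss, validation_error_fn, max_epochs=100):
    best_val = float("inf")
    for _ in range(max_epochs):
        for batch in batches_fn(BATCH_SIZE):
            train_step(models, batch, total_loss)
        val = validation_error_fn(models)
        if val > best_val:                        # stop when validation error starts increasing
            break
        best_val = val
```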

We performed experiments comparing a system with no coherence constraints (all the coherence weights set to zero) with other models that include either temporal or part-based coherence. We compared the case of single-frame-level predictions (where only the labeled portion of the test set is considered) and the case of video-sequence-level predictions, following the decision rules of Eq. (1) and Eq. (2), respectively (where the time window $W$ covers the full video sequence). Since examples of the different classes are not balanced in the given dataset, and in order to provide a more informative set of results, we measured two types of accuracies, namely Micro and Macro accuracies. The former is simply the percentage of correctly labeled frames/sequences, while the latter is the average of the percentages of correctly labeled frames/sequences in each emotion class.
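These two measures, together with the majority vote of Eq. (2), can be made explicit with a short sketch; the implementation details are ours.

```python
# Micro accuracy (fraction of correct predictions), macro accuracy (mean of
# per-class accuracies), and the sequence-level majority vote of Eq. (2).
from collections import Counter, defaultdict

def majority_vote(frame_predictions):
    return Counter(frame_predictions).most_common(1)[0][0]

def micro_accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_accuracy(y_true, y_pred):
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)
```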

Table 1 shows the results we obtain when testing the classifiers that operate on the full-face inputs, considering both appearance and shape representations. We also report results of an additional classifier obtained by averaging the outputs of the full set of classifiers (thus mixing appearance and shape data).

Images
% Micro Acc % Macro Acc
None Part Temp None Part Temp
Face (app.) 78.9 (3.6) 78.0 (2.0) 81.1 (3.0) 71.2 (2.8) 72.8 (2.2) 72.2 (7.4)
Face (shape) 71.8 (3.0) 71.9 (3.1) 72.5 (2.9) 61.1 (2.9) 61.3 (3.0) 62.1 (2.7)
Avg 73.7 (4.1) 71.4 (3.1) 72.1 (4.8) 71.9 (3.9) 70.2 (3.3) 69.7 (3.7)
Videos
% Micro Acc % Macro Acc
None Part Temp None Part Temp
Face (app.) 75.3 (5.1) 77.0 (3.4) 80.0 (2.9) 64.0 (3.2) 66.8 (3.1) 64.4 (10.3)
Face (shape) 68.5 (3.0) 68.1 (3.1) 69.4 (2.9) 54.0 (2.9) 53.5 (3.0) 55.5 (2.7)
Avg 78.3 (4.9) 77.9 (2.5) 80.4 (5.5) 65.6 (6.5) 65.9 (3.9) 64.8 (7.4)
Table 1: Micro and macro accuracies (std dev. in brackets) at image and video (sequence) level of the full-face-based classifiers (appearance and shape representations) and of an ensemble of the classifiers (average of outputs, both shape and appearance). Results without coherence constraints (None), with Part-based coherence and Temp-oral coherence (results where coherence improves the accuracy are in bold).

Temporal coherence always improves the quality of the face-based classifiers, by up to almost 5 points in the case of sequences (micro). In the case of macro-accuracy we observe larger standard deviations, that are due to the effects of the predictions on the classes with a smaller number of examples. Such classes are less frequently predicted, and asking for a strong temporal regularization sometimes further reduces such frequency. Coherence among parts helps in a less evident manner, especially when using shapes. Shape is less informative than appearance, resulting in a performance drop of about 7 points. The average-based classifier is only in some cases better than the face-based ones. Constraints are less effective in this case (even if we get a strong micro accuracy of 80.4% in videos + temporal coherence). This suggests that mixing the classifiers together is not a promising direction, mostly because some of them have low performances that can degrade the average quality of the system.

Images
% Micro Acc % Macro Acc
None Part Temp None Part Temp
Appearance:
Mouth 70.5 (3.5) 68.6 (3.0) 72.8 (2.6) 71.5 (6.7) 70.8 (5.8) 73.3 (4.4)
Left-eye 42.3 (6.0) 41.4 (6.0) 40.0 (4.2) 41.3 (6.5) 39.1 (4.9) 38.5 (3.9)
Right-eye 42.0 (5.6) 42.0 (7.3) 40.6 (5.2) 40.8 (5.7) 40.5 (5.7) 38.8 (5.6)
Left-eyebrow 40.5 (6.8) 37.7 (7.3) 38.4 (9.1) 40.1 (6.1) 37.4 (7.5) 37.6 (8.4)
Right-eyebrow 40.1 (2.5) 39.7 (2.4) 40.4 (2.9) 40.1 (3.5) 39.5 (2.8) 40.3 (3.1)
Nose 43.6 (2.9) 44.1 (5.5) 43.4 (4.0) 41.6 (3.4) 42.4 (4.8) 42.0 (3.7)
Shape:
Mouth 64.3 (2.3) 63.8 (3.5) 63.4 (3.2) 64.4 (4.7) 63.4 (4.8) 66.2 (4.9)
Left-eye 35.8 (3.4) 34.5 (3.7) 35.2 (2.6) 33.2 (3.9) 33.0 (3.4) 32.5 (2.3)
Right-eye 40.7 (3.2) 40.6 (2.7) 41.5 (3.0) 36.9 (2.4) 37.2 (2.1) 37.9 (2.0)
Left-eyebrow 31.2 (4.4) 31.0 (3.8) 30.1 (3.5) 31.8 (1.8) 31.9 (2.0) 31.7 (3.7)
Right-eyebrow 34.3 (4.2) 33.9 (3.7) 34.1 (3.5) 34.3 (5.2) 33.4 (4.5) 33.6 (4.9)
Nose 30.8 (3.7) 30.4 (3.2) 30.9 (4.2) 30.6 (5.6) 31.0 (5.0) 31.6 (5.2)
Jaw 37.4 (3.7) 37.2 (3.7) 37.0 (3.5) 34.1 (4.6) 34.9 (4.3) 33.8 (4.0)
Videos
% Micro Acc % Macro Acc
None Part Temp None Part Temp
Appearance:
Mouth 77.5 (7.7) 72.3 (9.0) 75.7 (6.4) 73.0 (9.5) 66.4 (8.4) 69.9 (8.7)
Left-eye 49.4 (8.4) 50.6 (4.1) 47.2 (5.9) 42.7 (5.8) 41.3 (2.7) 40.2 (6.3)
Right-eye 46.8 (2.3) 47.2 (4.9) 47.7 (2.9) 39.8 (1.7) 39.2 (3.0) 38.9 (3.7)
Left-eyebrow 43.0 (9.7) 41.7 (9.2) 42.1 (11.1) 35.2 (7.7) 34.3 (9.1) 34.3 (9.6)
Right-eyebrow 43.4 (4.6) 42.5 (5.5) 43.8 (3.2) 36.5 (6.6) 35.6 (6.8) 35.9 (4.0)
Nose 44.3 (4.9) 47.7 (5.1) 47.2 (2.8) 35.4 (4.3) 38.8 (4.3) 38.9 (3.1)
Shape:
Mouth 71.9 (2.5) 74.0 (3.7) 70.6 (2.8) 64.3 (4.2) 66.1 (6.0) 67.3 (5.0)
Left-eye 45.1 (5.8) 44.7 (8.5) 45.1 (4.5) 36.6 (7.1) 37.2 (6.1) 38.3 (4.1)
Right-eye 51.9 (2.2) 52.8 (3.7) 56.2 (3.7) 39.4 (3.1) 41.5 (3.3) 44.9 (3.9)
Left-eyebrow 36.2 (6.7) 34.5 (3.4) 34.9 (3.5) 28.7 (5.1) 28.7 (3.0) 29.3 (4.1)
Right-eyebrow 40.4 (5.0) 40.0 (5.9) 41.3 (6.7) 33.9 (5.6) 33.1 (5.0) 33.8 (7.0)
Nose 37.5 (5.0) 35.7 (3.7) 34.0 (1.4) 31.4 (5.4) 28.5 (5.6) 31.8(4.4)
Jaw 40.9 (2.5) 40.9 (2.1) 40.0 (3.7) 30.5 (2.5) 31.3 (2.7) 29.8 (2.7)
Table 2: Micro and macro accuracies (std dev. in brackets) at image and video level of all the part-based classifiers (appearance and shape representation). Results without coherence constraints (None), with Part-based coherence and Temp-oral coherence (results where coherence improves the accuracy are in bold).

To gain better insights about the last comment, Table 2 reports the accuracies for all the part-based classifiers. The mouth area is a very effective input for facial expression recognition, that can sometimes compete with the full face. This is more evident in the case of videos, when comparing the shape-based representations of face and mouth. As expected, the other parts are worse than the full face, since they are just local views. The addition of both coherences sparsely helps in improving the local classifiers, with a preference toward temporal coherence. The worst results are obtained by eyebrows and nose in shape-based classification. Interestingly, the eye-based predictors score the most effective results after face and mouth in video sequences. While their appearance representation is altered when the eyes get closed, their shape representation is more stable. The results on left eye and right eye are a bit different, and this is due to the fact that wrinkles can be asymmetric, or that an eye can be closed, or to variations of lighting and pose. This analysis suggests that an accurate choice of a sub-portion of the face parts could significantly help the part-based coherence constraint (since some of the parts are not very informative).

We deepened the analysis of the temporally-constrained classifiers in the case of making predictions on video sequences. Since the number of sequences is small, we selected the optimal $\lambda_t$ using image-level predictions on the validation data (as already stated), obtaining different optimal values in the case of micro and macro accuracy. Figure 3 reports the performances on videos for different values of $\lambda_t$ (appearance only). We can see that the distributions of the performances are multimodal, and if we focus on the macro accuracy we observe that we could have obtained much better results with different values of $\lambda_t$. This suggests that the validation procedure has room for being improved in the case of video data.

Figure 3: Micro and macro accuracies in the case of video data, full-face-based classifier (appearance), for different values of $\lambda_t$. The black-bordered bars are the results we reported in Table 1.

Temporal coherence yields homogeneous predictions over the sequences, without oscillations along the temporal axis. In Figure 4 we show an example in which temporal coherence produces a uniform trend in the predictions on the sequence (the “surprise” emotion is sketched). In fact, the model with the best $\lambda_t$ predicts “neutral” in the first frames of the sequence and “surprise” in the last ones. Differently, the model without temporal coherence produces an oscillating trend on the sequence, also predicting wrong emotions such as “disgust” and “anger”.

Figure 4: Predictions in a sequence that starts with neutral expression and develops in surprise. For each frame we report the prediction of the model without coherences (top) and the prediction with temporal coherence (bottom). The wrong predictions are in red.

In Table 3 we show the results on the single emotion classes for face and mouth appearance-based classification, focussing on the case where no coherence is introduced and on the ones with a selection of the best $\lambda_p$ and $\lambda_t$ from the previously described experiments. The “fear” and “sadness” classes are difficult to classify because they do not involve strong facial movements, while “happiness” and “surprise” are easy to recognize. The mouth-based model has difficulties with the “neutral” class, since some emotions do not evidently alter the mouth area (the face model does not show this issue). On the “sadness” class, where the face-based model scores low accuracies, the mouth-based classifier performs much better. This suggests that the face-related network has difficulties in developing a generalizable representation of the whole face to identify “sadness”. Larger training sets could help in this case.

Images
Anger Disgust Fear Happiness Sadness Surprise Neutral
face None 73.7 69.2 56.1 92.5 29.5 96.1 81.1
face Part 68.0 78.2 75.2 98.2 24.3 97.4 68.6
face Temp 77.1 81.8 50.0 97.5 26.2 95.5 81.8
mouth None 66.4 69.6 59.4 92.7 75.6 96.6 40.2
mouth Part 66.4 81.8 65.1 95.0 59.6 95.5 32.2
mouth Temp 67.8 80.4 58.8 94.8 72.0 95.2 44.1
Videos
Anger Disgust Fear Happiness Sadness Surprise Neutral
face None 77.1 62.2 33.3 90.9 25.0 95.4
face Part 68.6 71.1 53.3 90.9 20.0 96.9
face Temp 77.1 73.3 40.0 98.2 25.0 96.9
mouth None 77.1 73.3 46.7 78.2 75.0 87.7
mouth Part 62.9 77.8 46.7 74.6 55.0 81.5
mouth Temp 74.3 77.8 40.0 74.6 65.0 87.7
Table 3: Accuracies on each class of full-face and mouth classifiers (appearance). Results without coherence constraints, with Part-based coherence and Temporal coherence (results where coherence improves the accuracy are in bold).

Temporal coherence shows better performance in “neutral”, “anger” (image-level only), and “disgust” emotions. It is also helpful in the “happiness” class, where the face model performs a close-to-flawless classification. Introducing coherence among parts improves the recognition of “disgust”, “fear” (face only), “happiness” (image-level only), and it slightly improves the accuracy of “surprise” for the face-based predictor.

In addition to these results, we report that eye-based recognition reaches very good results on the “surprise” class; the accuracy of the right-eye classifier with temporal coherence is particularly high on this class. This is due to the fact that the eyes in surprised expressions are wide open, and thus easily recognizable. Differently, the “neutral” class is not recognizable at all from the eyebrows. Nose-based classification (appearance) reaches an accuracy of 79.4% with temporal coherence on the “disgust” class, where the nose is wrinkled.

At a later stage, building on the previously reported experiments, we ran further experiments with the coherence between appearance and shape of Eq. (4), varying the values of its weight $\lambda_c$. We found that the new coherence further improves the accuracies with respect to the best model with temporal coherence only. As we can see in Table 4, the model with the best $\lambda_t$ and $\lambda_c$ for full-face classification on the appearance-based representation is sometimes better than the model with the best $\lambda_t$ only. Coherence between appearance and shape improves the micro and macro accuracies and the classification of some emotions, such as “anger”, “disgust” (by up to 6.7%), “fear” and “surprise”, at the sequence level. At the frame level it improves the accuracy on the “neutral” class by up to 4%.

Images
Micro Macro Anger Disgust Fear Happiness Sadness Surprise Neutral
face None 78.9 71.2 73.7 69.2 56.1 92.5 29.5 96.1 81.1
face Temp 81.1 72.9 77.1 81.8 50.0 97.5 26.2 95.5 81.9
face Temp+app-shape 80.7 72.9 73.7 81.2 49.0 94.8 31.3 94.6 85.8
Videos
Micro Macro Anger Disgust Fear Happiness Sadness Surprise Neutral
face None 75.3 64.0 77.1 62.2 33.3 90.9 25.0 95.4
face Temp 80.0 68.4 77.1 73.3 40.0 98.2 25.0 96.9
face Temp+app-shape 80.4 69.9 80.0 80.0 46.7 89.1 25.0 98.5
Table 4: Micro and macro accuracies and accuracies on each class of the full-face classifier (appearance). Results without coherence constraints, with Temporal coherence only, and with Temporal coherence plus coherence between appearance and shape (results where the coherence between appearance and shape improves the accuracy with respect to temporal coherence only are in bold).

As a final comment, we also performed some experiments with both temporal and part-based coherence activated, and others involving the three coherences together, but they were not better than the “best” ones that we obtained by activating temporal coherence and the coherence between appearance and shape.

6 Occlusions

Shape-based representation can help to recognize emotions when there are occlusions or different illumination conditions on the face. We performed some tests on images with occlusions: we took the last frame (the most expressive one) of each sequence of the CK+ dataset, so we obtained 309 images, and we covered some parts of the face, such as the mouth or the nose. We made predictions on these modified images (appearance and shape-based representations) and we found that the shape-based predictor on the face is sometimes better than the appearance-based one. In Table 5 we report the accuracies associated with the cases in which the shape-based representation performs better than the appearance-based one (full-face classifier). For each emotion and for each part, the accuracies are the percentages of right predictions on the frames with that part covered. For “anger”, when the mouth is covered, the accuracy of the appearance-based classifier is only 35.6%, while for the shape-based one it is 75.6%. As we have seen in Section 5, this emotion is not easy to recognize, and covering an important part such as the mouth makes the task more difficult. Differently, the shape-based classifier can capture more robust features that go beyond the appearance, even when the mouth is occluded. Another considerable improvement of shape with respect to appearance happens when the nose is covered in images with happy expressions. In this case the appearance-based classifier is sometimes confused with “fear”, where the mouth is in general open as in “happiness”.

emotion covered part app. acc. shape acc.
anger mouth 35.6 75.6
disgust mouth 61.0 78.0
disgust nose 81.4 83.1
happiness nose 66.7 98.6
sadness mouth 67.9 71.4
sadness nose 53.6 75.0
surprise nose 95.2 98.8
Table 5: Accuracies of full-face classifier (appearance and shape) on images with occlusions.
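The occlusion tests described above can be sketched as follows: the bounding box of a face part is covered with a uniform patch before computing the appearance representation. The patch value, the padding, and the choice of the mouth indices (48-67 in the standard 68-point annotation) are our own assumptions.

```python
# Minimal sketch of the occlusion test: cover the bounding box of a face part.
import numpy as np

MOUTH_IDX = range(48, 68)  # mouth landmarks in the standard 68-point annotation

def occlude_part(gray_image, landmark_points, part_idx=MOUTH_IDX, pad=5, value=128):
    xs = [landmark_points[i][0] for i in part_idx]
    ys = [landmark_points[i][1] for i in part_idx]
    x0, x1 = max(min(xs) - pad, 0), max(xs) + pad
    y0, y1 = max(min(ys) - pad, 0), max(ys) + pad
    occluded = gray_image.copy()
    occluded[y0:y1, x0:x1] = value   # cover the part with a flat gray patch
    return occluded
```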

In Figure 5 we report some examples with occlusions in which the shape-based classifier predicts the right emotion, while the appearance-based one is wrong. The first example from the left represents “anger”, whereas the appearance-based classifier predicts “fear” when the mouth is covered. Anger and fear show most of their differences in the mouth area: in an angry expression the lips are tight, while in fear the mouth is slightly open. In the second example, representing “disgust”, where the wrinkled mouth is covered, the appearance classifier predicts “fear”. In the third instance the occlusion is on the wrinkled nose typical of the disgusted expression. In the fourth example, covering the nose, the appearance-based classifier predicts “fear” instead of “happiness”, because it focuses on the slightly open mouth, without considering the more relaxed nose. In the last example, depicting “sadness” with the mouth covered, the appearance classifier, which focuses on the open eyes and on the slightly raised eyebrows, predicts “surprise”, since it cannot see whether the mouth is wide open or turned down.

Figure 5: Examples of images with occlusions where the shape-based classifier predicts the right emotion whereas the appearance-based classifier is wrong. From top to bottom: the original image (appearance), the images with occlusion (appearance, shape), the right prediction of the shape-based classifier (green), and the wrong prediction of the appearance-based classifier (red).

7 Conclusions and Future Work

We presented a Convolutional Neural Network (CNN)-based approach to Facial Expression Recognition. Our model is based on a pool of CNNs that process distinct face parts, represented using visual (appearance) or shape-only features. In the latter case, we treated shape as a generic input of the learnable model, without manually engineering its representation. The shape-based representation can help to detect the right expression when the appearance-based representation fails, for example in the presence of occlusions on the face or of different illumination levels.

We studied the importance of the different representations on the task at hand, showing an analysis that involved all the considered face parts, and reporting results of experiments on a popular dataset composed of six basic emotions, plus the neutral case. We proposed the introduction of coherence constraints among the face-part predictors, between predictions at consecutive time instants, and between the appearance and shape representations, casting the learning problem in the Semi-Supervised setting and using video data. Our results have shown that using unsupervised training data paired with coherence constraints improves the quality of the recognizer, especially when temporal coherence is combined with the coherence between appearance and shape. Our future work will include a more detailed study of the face-part coherence, selecting only the most promising face parts according to the results of this study. We will also use larger collections of data, to grasp the importance of large-scale unsupervised data obtained from video conferences.

References

  • [1] Duchenne, G.B., de Boulogne, G.B.D.: The mechanism of human facial expression. Cambridge University Press (1990)
  • [2] Ekman, P., Friesen, W.V.: Constants across cultures in the face and emotion. Journal of Personality and Social Psychology 17(2), 124 (1971)
  • [3] Fan, X., Tjahjadi, T.: A spatial-temporal framework based on histogram of gradients and optical flow for facial expression recognition in video sequences. Pattern Recognition 48(11), 3407–3416 (2015)

  • [4] Gnecco, G., Gori, M., Melacci, S., Sanguineti, M.: Foundations of support constraint machines. Neural Computation 27(2), 388–480 (2015)
  • [5] Happy, S., Routray, A.: Automatic facial expression recognition using features of salient facial patches. IEEE Transactions on Affective Computing 6(1), 1–12 (2015)
  • [6] Jain, S., Hu, C., Aggarwal, J.K.: Facial expression recognition with temporal modeling of shapes. In: Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. pp. 1642–1649. IEEE (2011)

  • [7] Jung, H., Lee, S., Yim, J., Park, S., Kim, J.: Joint fine-tuning in deep neural networks for facial expression recognition. In: Computer Vision (ICCV), 2015 IEEE International Conference on. pp. 2983–2991. IEEE (2015)
  • [8] Kazemi, V., Sullivan, J.: One millisecond face alignment with an ensemble of regression trees. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1867–1874 (2014)
  • [9] Long, F., Bartlett, M.S.: Video-based facial expression recognition using learned spatiotemporal pyramid sparse coding features. Neurocomputing 173, 2049–2054 (2016)
  • [10] Lopes, A.T., de Aguiar, E., De Souza, A.F., Oliveira-Santos, T.: Facial expression recognition with convolutional neural networks: coping with few data and the training sample order. Pattern Recognition 61, 610–628 (2017)
  • [11] Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The Extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In: Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on. pp. 94–101. IEEE (2010)
  • [12] Melacci, S., Maggini, M., Gori, M.: Semi-supervised learning with constraints for multi-view object recognition. In: International Conference on Artificial Neural Networks. pp. 653–662. Springer (2009)

  • [13] Mollahosseini, A., Chan, D., Mahoor, M.H.: Going deeper in facial expression recognition using deep neural networks. In: Applications of Computer Vision (WACV), 2016 IEEE Winter Conference on. pp. 1–10. IEEE (2016)
  • [14] Plutchik, R.: The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist 89(4), 344–350 (2001)
  • [15] Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on. vol. 1, pp. I–I. IEEE (2001)
  • [16] Zhang, K., Huang, Y., Du, Y., Wang, L.: Facial expression recognition based on deep evolutional spatial-temporal networks. IEEE Transactions on Image Processing 26(9), 4193–4203 (2017)