Learning Grimaces by Watching TV

10/07/2016 ∙ by Samuel Albanie, et al. ∙ 0

Unlike computer vision systems, which require explicit supervision, humans can learn facial expressions by observing people in their environment. In this paper, we look at how similar capabilities could be developed in machine vision. As a starting point, we consider the problem of relating facial expressions to objectively measurable events occurring in videos. In particular, we consider a gameshow in which contestants play to win significant sums of money. We extract events affecting the game and corresponding facial expressions objectively and automatically from the videos, obtaining large quantities of labelled data for our study. We also develop, using benchmarks such as FER and SFEW 2.0, state-of-the-art deep neural networks for facial expression recognition, showing that pre-training on face verification data can be highly beneficial for this task. Then, we extend these models to use facial expressions to predict events in videos and learn nameable expressions from them. The dataset and emotion recognition models are available at http://www.robots.ox.ac.uk/~vgg/data/facevalue


1 Introduction

Humans make extensive use of facial expressions in order to communicate. Facial expressions are complementary to other channels such as speech and gestures, and often convey information that cannot be recovered from the other two alone. Thus, understanding facial expressions is often necessary to properly understand images and videos of people.

The general approach to facial expression recognition is to label a dataset of faces with either nameable expressions (e.g. happiness, sadness, disgust, anger) or facial action units (movements of facial muscles such as tightening the lips or raising an upper eyelid) and then learn a corresponding classifier, for example by using a deep neural network. In contrast, humans need not be explicitly told what facial expressions mean, but can learn this by associating facial expressions with how people react to particular events or situations. (Footnote 1: Generating certain facial expressions is an innate ability; however, recognizing facial expressions is a learned skill.)

Figure 1: FaceValue dataset. We study facial expressions from objectively-measurable events occurring in the “Deal or No Deal” gameshow. Top: detection of an event at one round of the game. Left: a box is opened, revealing to the contestant that her prize is not the one shown in the box. Since this is a low amount, well below the expected value of her prize, this is a “good” event for the contestant. Right: the contestant’s face, intuitively expressing happiness, is detected. Note also the overlay for the revealed amount disappearing from one frame to the next; our system can automatically read such cues to track the state of the game. Bottom: four example tracks, the top two for “good” events and the bottom two for “bad” events, as defined in the text.

In order to investigate whether algorithms can also learn facial expressions by establishing similar associations, in this paper we look at the problem of relating facial expressions to objectively-quantifiable contextual events in videos. The main difficulty of this task is that there is only a weak correlation between an event occurring in a video and a person showing a particular facial expression. However, learning facial expressions in this manner has three important benefits. The first one is that it grounds the problem on objectively-measurable quantities, whereas labelling emotions or even facial action units is often ambiguous. The second benefit is that contextual information can often be labelled in videos fully or partially automatically, obviating the cost of collecting large quantities of human-annotated data for data-hungry machine learning algorithms. Finally, the third advantage is that the ultimate goal of face recognition in applications is not so much to describe a face, but to infer from it information about a situation or event, which is tackled directly by our study.

Concretely, our first contribution (Sect. 2; Fig. 1) is to develop a novel dataset, FaceValue, of faces extracted from videos together with objectively-measurable contextual events. The dataset is based on the “Deal or No Deal” TV program, a popular game where contestants can win or lose significant sums of money. Using a semi-automatic procedure, we extract significant events in the game along with the player (and public) reaction. We use this data to predict from facial expressions whether events are “good” or “bad” for the contestant. To the best of our knowledge, this is the first example of leveraging gameshows in facial expression understanding and the first study aiming to relate facial expressions to people’s activities.

Our second contribution is to carefully assess the difficulty of this problem by establishing a human baseline and by extending the latter to existing expression recognition datasets for comparison (Sect. 3). We also develop a number of state-of-the-art expression recognition models (Sect. 4) and show that excellent performance can be obtained by transferring deep neural networks from face verification to expression recognition. Our final contribution is to extend such systems to the problem of recognising FaceValue events from facial expressions (Sect. 5). We develop simple but effective pooling strategies to handle face tracks, integrating them in deep neural network architectures. With these, we show that it is not only possible to predict events from facial expressions, but also to learn nameable expressions by looking at people spontaneously reacting to events in TV programs.

1.1 Related work

| Dataset | Size | Labelling Technique | Expressions | Labels |
|---|---|---|---|---|
| FER | 35,887 Faces | Internet search | Mixed | 6+1 emotions |
| AFEW 5.0 | 1,426 Clips | Subtitles | Acted | 6+1 emotions |
| SFEW 2.0 | 1,635 Faces | Subtitles | Acted | 6+1 emotions |
| AM-FED | 168,359 Faces | Human experts | Spontaneous | FACS |
| FaceValue (ours) | 192,030 Faces | Metadata extraction | Spontaneous | Event Outcome |

Table 1: Comparison of emotion-based datasets of faces in challenging conditions.

Facial expressions are a non-verbal mode of communication complementary to speech and gestures [Ekman and Friesen, 1969b; Attardo et al., 2003]. They can be produced unintentionally [Ekman and Friesen, 1969a], revealing hidden states of the actor in pain or deception detection [Besel and Yuille, 2010]. Facial expressions are commercially valuable, attracting increasing investment from advertising agencies that seek to understand and manipulate the consumer response to a product [El Kaliouby et al., 2012] and corresponding regulatory attention [Rep. Capuano and Rep. Jones, introduced in US House of Representatives, 02/27/2015].

Face-related tasks such as face detection, verification and recognition have long been researched in computer vision with the creation of several labelled datasets: FDDB [Jain and Learned-Miller, 2010], AFW [Zhu and Ramanan, 2012] and AFLW [Koestinger et al., 2011] for face detection; and LFW [Huang et al., 2007] and VGG-Face [Parkhi et al., 2015a] for face recognition and verification. Face detectors and identity recognizers can now rival the performance of humans [Schroff et al., 2015]. Facial expression recognition has also received significant attention in computer vision, but it presents a number of additional subtleties and difficulties which are not found in face detection or recognition. The main challenge is the consistent labelling of facial expressions, which is difficult due to the subjective nature of the task. A number of coding systems have been developed in an attempt to label facial expressions objectively, usually at the level of atomic facial movements, but even human experts are not infallible in generating such annotations. Furthermore, getting these experts to annotate a dataset is expensive and difficult to scale [McDuff et al., 2013]. Another issue is the “authenticity” of facial expressions, arising from the fact that several datasets are acted [Sebe et al., 2007], either specifically for data collection [Lyons et al., 1998; Lucey et al., 2010; Gross et al., 2010] or indirectly as data is extracted from movies [Dhall et al., 2015]. Our FaceValue dataset sidesteps these problems by recording spontaneous reactions to objectively-occurring events in videos.

Examples of datasets which contain challenging variations in pose, lighting conditions and subjects are given in Table 1. Of these, two in particular have received significant research interest as popular benchmarks for facial expression recognition. The Static Facial Expression in the Wild 2.0 (SFEW-2.0) data [Dhall et al., 2011b] (used in the EmotiW challenges [Dhall et al., 2015]) consists of images from movies which collectively contain 1,635 faces labelled with seven emotions (this dataset was constructed by selectively extracting individual frames from AFEW-5.0 [Dhall et al., 2012]). The Facial Expression Recognition 2013 (FER-2013) dataset [Goodfellow et al., 2015], which formed the basis of a large Kaggle competition, contains 35k images labelled with the same seven emotions. These datasets were used to develop several state-of-the-art emotion recognition systems. Among the top-performing ones, the authors of [Yu and Zhang, 2015] and [Kim et al., 2016] propose ensembles of deep networks trained on the FER and SFEW-2.0 data. There are also several commercial implementations of expression recognition, such as CMU’s IntraFace [de la Torre et al., 2015] and the Affectiva face software.

2 FaceValue: expressions in context

In this section we describe the FaceValue dataset (Fig. 1) and how it was collected.

Data source.

The “Deal or No Deal” TV game show was selected as the basis for our data for a number of reasons. (Footnote 2: Outside of computer vision, the interesting decision making dynamics of contestants in a high-stakes environment during the “Deal or No Deal” game show have attracted research by economists [Post et al., 2008].) First, it contains a very significant amount of data. The show has been running nearly daily in the UK for the past eleven years, totalling 2,929 episodes. Each episode focuses on a different player and lasts for about forty minutes. Furthermore, the same or very similar shows are or were aired in dozens of other countries. Second, the game is based on simple rules and a sequence of discrete events that are in most cases easily identifiable as positive or negative for the player, and hence can be expected to induce a corresponding emotion and facial expression. Furthermore, these events are easily detectable by parsing textual overlays in the show or other simple patterns. Third, since there is a single player, it is easy to identify the person that is directly affected by the events in the video, and the camera tends to focus on his/her face.

An example of the in-game footage and data extraction pipeline is shown in Fig. 1. The rules of the game are easily explained. There is a set $P_0$ of possible cash prizes, ranging from 1p up to £250,000. Initially the player is assigned a prize $x \in P_0$ but does not know its value. Then, at each round $t$ of the game the player randomly extracts (realised as opening a box, see Fig. 1 top-left) one of the prizes $p_t \neq x$ from $P_{t-1}$ and reveals it, resulting in a smaller set $P_t = P_{t-1} \setminus \{p_t\}$ of possible prizes. Through this process of elimination the player obtains information about his/her prize $x$. Occasionally the player is offered the opportunity to leave the game with a prize $d_t$ (“deal”) determined by the game’s host or to continue playing (“no deal”) and eventually leave with $x$.

The expected value of the win at time $t$ is $E_t = \operatorname{mean}(P_t)$. When a prize $p_t$ is removed from $P_{t-1}$, the player perceives this as a “good” event if $E_t > E_{t-1}$, which requires $p_t < E_{t-1}$, and a “bad” event otherwise. In practice we conservatively apply a stricter threshold on $p_t$ relative to $E_{t-1}$ for a good event. Interestingly, the game is continued even after the player has taken a “deal”; in this case the roles of “good” and “bad” events are reversed as the player hopes that the accepted deal is higher than the prize he/she gave up.
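The condition for a “good” event follows from one line of algebra. As a sketch, write $n$ for the number of prizes still in play before the round, $E_{t-1}$ for their mean and $p_t$ for the prize that is removed (the symbol $n$ is introduced here only for illustration):

```latex
\begin{align}
E_t &= \frac{n\,E_{t-1} - p_t}{n-1}, \\
E_t - E_{t-1} &= \frac{n\,E_{t-1} - p_t - (n-1)\,E_{t-1}}{n-1}
              = \frac{E_{t-1} - p_t}{n-1}.
\end{align}
```

Since $n > 1$, the expected win increases exactly when the removed prize is below the previous expected value, and the size of the change shrinks as more prizes remain in play.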

Dataset content.

The data in FaceValue is defined as follows. Faces are detected for about seven seconds right after a new prize is revealed. These faces are collected in a “face track” $F_t$. Furthermore, the face track is assigned the binary label

$$y_t = \delta_t \, g_t,$$

where $g_t$ is $+1$ if the event is “good” and $-1$ if it is “bad”, and $\delta_t$ is $+1$ if the deal was not taken so far, and $-1$ otherwise. Note that there are several levels of indirection between $y_t$ and a particular expression being shown in $F_t$. For example, a player may not perceive a good or bad event according to this simple model, or could be responding to a stroke of bad luck with an ironic smile. The labels themselves, however, are completely objective.
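The labelling rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name, the `margin` parameter (a stand-in for the conservative threshold, whose value is not given here) and the example prize list are all assumptions.

```python
from statistics import mean


def label_event(remaining, removed, deal_taken=False, margin=1.0):
    """Return +1 for a "good" event and -1 for a "bad" one.

    remaining:  prizes still in play before the box is opened (includes `removed`).
    removed:    value of the prize revealed in this round.
    deal_taken: True once the contestant has accepted a deal (roles are reversed).
    margin:     hypothetical factor; values below 1 make the "good" test stricter.
    """
    expected_before = mean(remaining)
    good = removed < margin * expected_before
    sign = 1 if good else -1
    # After a deal, losing a high prize is good news for the contestant.
    return -sign if deal_taken else sign


# Illustrative prize set; only the 1p and 250,000 end points come from the text.
prizes = [0.01, 1, 100, 1000, 250000]
low_prize_label = label_event(prizes, 1)
high_prize_label = label_event(prizes, 250000)
after_deal_label = label_event(prizes, 1, deal_taken=True)
```

Keeping the deal flag as a separate sign flip mirrors the text: the same event detector can be reused before and after a deal, and only the interpretation of the outcome changes.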

Data is extracted from 102 episodes of the show, resulting in 192,030 frames distributed over 2,118 labelled face tracks. Shows are divided into training, validation and test sets, which also means that mostly different identities are contained in the different subsets.

Data extraction.

One advantage of studying facial expressions from contextual events is that these are often easy to detect automatically. In our case, we take advantage of two facts. First, when a prize is removed from the set of remaining prizes, this is shown in the game as a box being opened (Fig. 1 top-left). This scene, which occurs systematically, is easy to detect and is used to mark the start of an event. Next, the camera moves onto the contestant (Fig. 1 top-middle) to capture his/her reaction. Faces are extracted from the seven seconds that immediately follow the event using the face detector of [King, 2009] and are stored as part of the face track. Occasionally the camera may capture the reaction of a member of the public; while it would be easy to distinguish different identities (e.g. by using the VGG-Faces model of Sect. 4), we prefer not to as the public is sympathetic with the contestant and tends to react in a similar manner, improving the diversity of the collected data. Finally, the value of the prize being removed can be extracted either from the opened box using a text spotting system or, more easily, by looking at which overlay is removed (Fig. 1 top-right). After automatic extraction, the data was fully checked manually for errors to ensure its quality.

3 Benchmark data and human baselines

As FaceValue defines a new task in facial expression interpretation, in this section we establish a human baseline as a point of comparison with computer vision algorithm performance. In order to compare FaceValue to existing facial expression recognition problems we establish similar baselines for two standard expression recognition datasets, FER and SFEW 2.0, introduced below.

Benchmark datasets: FER and SFEW 2.0.

The FER-2013 data [Goodfellow et al., 2015] contains 48×48 pixel images obtained by querying Google image search for 184 emotion-related keywords. The dataset contains 35,887 images divided into 4,953 “anger”, 547 “disgust”, 5,121 “fear”, 8,989 “happiness”, 6,077 “sadness”, 4,002 “surprise” and 6,198 “neutral”, further split into training (28,709), public test (3,589) and private test (3,589) sets. Goodfellow et al. [2015] note that this data is likely to contain label errors. However, their own human study obtained an average prediction accuracy of 65%, which is comparable to the performance obtained by expert annotators on a smaller but manually-curated subset of 1,500 acted images.

The SFEW-2.0 data [Dhall et al., 2011b] contains selected frames from different videos of the Acted Facial Expressions in the Wild (AFEW) dataset [Dhall et al., 2011a] assigned to either: 225 “angry”, 75 “disgust”, 124 “fear”, 256 “happy”, 228 “neutral”, 234 “sad” and 150 “surprise”. The training, validation and test splits are provided as part of the EmotiW challenge [Dhall et al., 2015] and are adopted here. The AFEW data was collected by searching movie closed captions for emotion-related keywords and then manually curating the results, generating a smaller number of labelled instances than FER.

Human baselines.

For each dataset we consider a pool of annotators, most of whom are not computer vision experts, and ask them to predict the label associated with each face. In order to motivate annotators to be as accurate as possible, we pose the annotation process as a challenge. The goal is to guess the ground-truth label of an image, and a score displaying the annotator’s prediction accuracy is constantly updated. Ultimately, the annotators’ performances are entered in a leaderboard. We found that this simple idea significantly improved the annotators’ performance.

The dataset instances selected for the annotation tasks were constructed as follows. From FER, a random sample of 500 faces was extracted from the Public Test set. From SFEW 2.0, the full Validation set (383 samples) was used (faces were extracted from each image as described in section 4). From FaceValue, a random sample of 250 face tracks was extracted from the validation set, each of which was transformed into an animated GIF to allow annotators to see the face motion. Performance on each dataset was evaluated by partitioning into five folds, each of which was annotated by a separate pool. Every face instance across the three datasets received at least four annotations.

On FER, our annotators achieved lower performance than results previously reported in [Goodfellow et al., 2015] (58.2% overall accuracy vs 65%). However, we also noted a significant variance between annotators, which means that at least some of them were able to match or exceed the 65% mark. The unevenness of the annotators shows how difficult or ambiguous this task can be even for motivated humans. The annotators found SFEW-2.0 a more challenging task, obtaining a lower average accuracy overall. One possible reason for this difference is the manner in which the datasets were constructed. FER faces were retrieved using Internet search queries which likely returned fairly representative examples of each expression; in contrast SFEW images were extracted from movies. On FaceValue, since the classification task was binary, the ROC-AUC was computed for each annotator in addition to the accuracy, to facilitate a comparison with algorithmic approaches; the annotator averages are reported in Table 4. The relatively low scores of humans on each dataset illustrate the particularly challenging nature of the task. This difficulty is underlined by the low levels of inter-annotator agreement (measured using Fleiss’ kappa) on the three datasets of 0.574, 0.424 and 0.491 respectively.
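For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Fleiss' kappa in plain Python. It assumes every item received the same number of ratings, which is a simplification: the text only states that each face received at least four annotations, and how unequal counts were handled is not described.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a list of per-item category counts.

    counts[i][j] is the number of annotators who assigned item i to category j.
    Every item is assumed to have the same total number of ratings.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Per-item agreement: fraction of annotator pairs that agree on the item.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_categories = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in p_categories)

    return (p_observed - p_expected) / (1.0 - p_expected)


# Two items, four annotators, two categories.
perfect = fleiss_kappa([[4, 0], [0, 4]])
split = fleiss_kappa([[2, 2], [2, 2]])
```

A value of 1 means the annotators always agree, while values at or below 0 mean agreement is no better than chance, which gives a scale for reading the reported figures.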

4 Expression recognition networks

In this section we develop state-of-the-art models for facial expression recognition on the two popular emotion recognition benchmarks of Sect. 3, namely FER and SFEW 2.0. Deep networks are currently the state-of-the-art models for emotion recognition, topping two of the last three editions of the Emotion Recognition in the Wild (EmotiW) contest [Levi and Hassner, 2015]. While the standard approach is to learn large ensembles of deep networks [Kim et al., 2016; Yu and Zhang, 2015], here we show that a single network can in fact be competitive with or better than such ensembles if trained effectively. In order to do so we expand the available training data by pre-training models on other face recognition tasks, and in particular face identity verification, using the recent VGG-Faces dataset [Parkhi et al., 2015b].

Architectures and training.

We base our models on four standard CNN architectures: AlexNet [Krizhevsky et al., 2012], VGG-M [Chatfield et al., 2014], VGG-VD-16 [Simonyan and Zisserman, 2015] and ResNet-50 [He et al., 2016]. AlexNet is used as a reference baseline and is pre-trained on the ImageNet ILSVRC data [Russakovsky et al., 2014]. VGG-VD-16 is pre-trained on a recent dataset for face verification called VGG-Faces [Parkhi et al., 2015b]. This model achieves near state-of-the-art verification performance on the LFW [Huang et al., 2007] benchmark; however, it is also extremely expensive. Thus, we also train a smaller network, based on the VGG-M configuration. All models are trained with batch normalization [Ioffe and Szegedy, 2015] and are implemented in the MatConvNet framework [Vedaldi and Lenc, 2015].

Statistics such as image resolution and the usage of colour in the target datasets, and FER in particular, differ substantially from LFW and VGG-Faces. Nevertheless, we found that simply rescaling the smaller FER images to the higher VGG-Faces resolution together with duplicating the grayscale intensities for the three colour channels produced excellent results. We also experimented with the other approach of pretraining by reducing the resolution and removing colour information from VGG-Faces; while this resulted in very competitive and more efficient networks, the full resolution models were still a little more accurate and are used in the rest of the work.
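The input adaptation described above (upscaling the small grayscale FER faces and copying the intensities into three colour channels) can be sketched with NumPy as follows. The 224×224 target size and the nearest-neighbour resampling are assumptions made to keep the example dependency-free; the text does not state which resolution or interpolation was used.

```python
import numpy as np


def fer_to_network_input(gray, size=224):
    """Upscale a 2-D grayscale face to size x size and copy it into 3 channels.

    Nearest-neighbour resampling is used only to keep the sketch simple;
    the interpolation used by the authors is not stated in the text.
    """
    gray = np.asarray(gray)
    height, width = gray.shape
    # For every output row/column, pick the source row/column it falls into.
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    upscaled = gray[rows[:, None], cols[None, :]]
    # Duplicate the intensities for the three colour channels.
    return np.repeat(upscaled[:, :, None], 3, axis=2)


# Synthetic 48x48 "face" used purely as a placeholder input.
face = (np.arange(48 * 48) % 256).astype(np.uint8).reshape(48, 48)
network_input = fer_to_network_input(face)
```

Adapting the target data to the pre-trained network, rather than the other way round, means the face verification weights can be reused without retraining at a lower resolution.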

After pre-training, each model is trained on the FER or SFEW 2.0 training set with a fine tuning ratio of 0.1. This is obtained by retaining all layers but the last one, which performs $K$-way classification, where $K$ is the number of possible facial expression classes.

Figure 2: Accuracy on FER-2013 of different CNN models and training strategies.

| Model | Pretraining | Test (Public) | Test (Private) |
|---|---|---|---|
| AlexNet | ImageNet | 62.44% | 63.28% |
| VGG-M | ImageNet | 66.04% | 67.57% |
| Resnet-50 | ImageNet | 67.79% | 69.02% |
| VGG-VD-16 | ImageNet | 66.92% | 70.38% |
| AlexNet | VGGFaces | 70.47% | 71.44% |
| VGG-M | VGGFaces | 71.08% | 72.08% |
| Resnet-50 | VGGFaces | 69.23% | 70.33% |
| VGG-VD-16 | VGGFaces | 72.05% | 72.89% |
| [Kim et al., 2016] (best single CNN) | - | - | 70.58% |
| [Kim et al., 2016] (ensemble) | - | - | 72.72% |

Figure 3: Accuracy on SFEW-2.0 of different CNN models and training strategies.

| Model | Pretraining | Val | Test |
|---|---|---|---|
| AlexNet | VGGFaces | 37.67% | - |
| VGG-M | VGGFaces | 42.90% | - |
| Resnet-50 | VGGFaces | 47.48% | - |
| VGG-VD-16 | VGGFaces | 43.58% | - |
| AlexNet | VGGFaces+FER | 38.07% | 50.81% |
| VGG-M | VGGFaces+FER | 47.02% | 53.49% |
| Resnet-50 | VGGFaces+FER | 50.91% | 45.97% |
| VGG-VD-16 | VGGFaces+FER | 54.82% | 59.41% |
| [Yu and Zhang, 2015] (best single CNN) | FER combined | 52.29% | 58.06% |
| [Kim et al., 2016] (best single CNN) | FER + TFD | 52.50% | 57.3% |
| [Yu and Zhang, 2015] (ensemble) | FER combined | 55.96% | 61.29% |
| [Kim et al., 2016] (ensemble) | FER + TFD | 52.80% | 61.6% |

Figure 4: Visualizations of the FER emotions for the VGG-VD-16 model. Panels: Anger, Disgust, Fear, Happiness, Neutral, Sadness, Surprise.

Results.

Figure 2 compares the different architectures and the state of the art on FER. For ensemble models from the literature, both the best single CNN and the full ensemble are reported. The best previous result on FER is 72.72% accuracy, obtained using the hierarchical committee of deep CNNs described in [Kim et al., 2016], combining more than 36 different models. By comparison, VGG-VD-16 pre-trained on VGG-Faces achieves a slightly superior performance at 72.89%. VGG-M achieves nearly the same performance (72.08%) at a substantially reduced computational cost. We also note the importance of choosing a face-related pre-training set, as pre-training on ImageNet loses 3-4% of performance.

Figure 3 reports the results on the SFEW-2.0 dataset. Since the dataset itself consists of labelled scene images, we use the faces extracted by the accurate face detection pipeline described in [Yu and Zhang, 2015], which applies an ensemble of face detectors [Zhang and Zhang, 2014; Chen et al., 2014; Zhu and Ramanan, 2012]. As SFEW is much smaller than FER, pre-training is in this case much more important. Relative to the best result achieved by any of the four models pre-trained with ImageNet only, pre-training on VGG-Faces produced substantially better results (+10%), and pre-training on VGG-Faces and FER-Train produced better results still (+18%). The best single model, VGG-VD-16, achieves better performance than existing single and ensemble networks (+2.5%) on the validation set, and better performance than all but the ensembles of [Yu and Zhang, 2015; Kim et al., 2016] on the test set (-2%).

Visualizations.

While CNNs perform well, it is often difficult to understand what they are learning given their black-box nature. Here we use the technique of [Mahendran and Vedaldi, 2016] to visualize the best FER/SFEW model. This technique seeks to find an image $\mathbf{x}$ which, under certain regularity assumptions, maximizes the CNN confidence that $\mathbf{x}$ represents a given emotion $c$. Results are reported in Fig. 4 for the VGG-VD-16 model trained on the FER dataset. Notably, the reconstructed pictures are mosaics of parts representative of the corresponding emotions.

5 Relating facial expressions to events in videos

In this section we focus on the main question of the paper i.e. whether facial expressions can be used to extract information about events in videos.

Baselines: individual frame prediction and simple voting.

As a baseline, a state-of-the-art emotion recognition CNN is applied to each frame in the face track. The faces in a face track are individually classified by the CNN and the results are pooled to predict whether the event is positive or negative. Positive emotions (happiness) vote for the first case, negative emotions (sadness, fear, anger, disgust) for the second, and neutral/surprise emotions are ignored. The label with the largest number of votes in the track wins.
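The voting rule can be sketched as follows. The emotion names follow the grouping given in the text; the function name and the handling of ties (returning 0) are assumptions, since the text does not say how ties or tracks without any vote are resolved.

```python
from collections import Counter

POSITIVE = {"happiness"}
NEGATIVE = {"sadness", "fear", "anger", "disgust"}


def vote_on_track(frame_emotions):
    """Predict +1 (positive event) or -1 (negative event) from per-frame emotions.

    Neutral and surprise frames cast no vote. Ties, including tracks with no
    votes at all, return 0; how the authors break ties is not stated.
    """
    votes = Counter()
    for emotion in frame_emotions:
        if emotion in POSITIVE:
            votes[1] += 1
        elif emotion in NEGATIVE:
            votes[-1] += 1
    if votes[1] == votes[-1]:
        return 0
    return 1 if votes[1] > votes[-1] else -1


track_prediction = vote_on_track(["happiness", "neutral", "happiness", "fear"])
```

Note that this rule hard-codes which emotions count as evidence for a good or bad event, which is exactly the assumption the learned pooling architecture removes.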

Pooling architectures.

There are two significant shortcomings in the baseline. First, it assumes a particular map between emotions in existing datasets and positive and negative events in FaceValue. Second, it integrates information across frames using an ad-hoc voting procedure which may be suboptimal. In order to address these shortcomings we learn on FaceValue a new model that explicitly pools information across frames in a track. A pre-trained network $\Phi$ is split in two parts, $\Phi = \Phi_2 \circ \Phi_1$. Then, the first part $\Phi_1$ is run independently on each frame $f$ of a face track $F$, the results are pooled by either average or max pooling across time and the result is fed to $\Phi_2$ for binary classification: $\hat{y} = \Phi_2\big(\operatorname{pool}_{f \in F} \Phi_1(f)\big)$. The resulting architecture is fine-tuned on the FaceValue training set.

In practice, we found that the best results were obtained by using the emotion recognition networks such as VGG-VD-16 trained on the FER data (Sect. 4). All layers up to fc7, producing 4,096 dimensional feature vectors, are retained in $\Phi_1$. The best pooling function was found to be averaging followed by normalization of the 4,096 dimensional features. The second part $\Phi_2$ is the last layer, which is fully connected (in practice, this layer is a linear predictor). CNNs are trained using hinge loss, which generally performs better than softmax for binary classification.
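The forward pass of the pooling head can be sketched with NumPy as below. This is an illustration under stated assumptions rather than the authors' MatConvNet implementation: the per-frame features stand in for fc7 activations, the normalisation is assumed to be L2 (the text only says "normalization"), and the function names are hypothetical.

```python
import numpy as np


def track_score(frame_features, weights, bias=0.0):
    """Average-pool per-frame features over time, normalise, apply a linear predictor.

    frame_features: array of shape (num_frames, dim), e.g. fc7 activations.
    weights:        array of shape (dim,); bias is a scalar.
    """
    pooled = np.mean(frame_features, axis=0)
    norm = np.linalg.norm(pooled)
    if norm > 0:
        pooled = pooled / norm  # L2 normalisation (assumed)
    return float(pooled @ weights + bias)


def hinge_loss(score, label):
    """Hinge loss for a label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)


# Toy two-frame track with 2-D features instead of 4,096-D ones.
features = np.array([[3.0, 0.0], [3.0, 8.0]])
score = track_score(features, weights=np.array([1.0, 1.0]))
```

Pooling before the classifier gives one prediction per track regardless of its length, so tracks with different numbers of detected faces can share the same linear predictor.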

Table 2: ROC-AUC on FaceValue

| Model | Pre-training | Method | Val. | Test |
|---|---|---|---|---|
| VGG-M | VGGFace+FER | voting | 0.656 | 0.592 |
| VGG-VD | VGGFace+FER | voting | 0.653 | 0.618 |
| VGG-M | VGGFace | pooling arch. | 0.764 | 0.691 |
| VGG-VD | VGGFace | pooling arch. | 0.726 | 0.671 |
| VGG-M | VGGFace+FER | pooling arch. | 0.794 | 0.722 |
| VGG-VD | VGGFace+FER | pooling arch. | 0.741 | 0.675 |

Table 3: FER expressions from FaceValue.


Table 4: Comparison of human vs machine performance across benchmarks

| Dataset | Metric | Human | Human Committee | Machine |
|---|---|---|---|---|
| FER (public test) | Accuracy | 0.57 | 0.66 | 0.72 |
| SFEW 2.0 (val) | Accuracy | 0.53 | 0.63 | 0.56 [Yu and Zhang, 2015] |
| FaceValue (val) | ROC-AUC | 0.71 | 0.78 | 0.79 |

Results.

Table 2 reports the performance of different model variants on FaceValue. As on SFEW 2.0 (Figure 3), pre-training on VGG-Face+FER is preferable to pre-training on VGG-Face only. This is required for the voting classifier, but is also beneficial when fine-tuning a pre-trained pooling architecture, which handily outperforms voting. VGG-M is in this case better than VGG-VD (0.722 vs 0.675 test ROC-AUC), probably because VGG-VD is overfitted to the pre-training data. Finally, temporal average pooling is always better than max pooling.
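Since the FaceValue results are reported as ROC-AUC, a small reference implementation may help in reading them. The sketch below uses the pairwise-ranking definition (the probability that a randomly chosen positive track scores higher than a randomly chosen negative one); it is quadratic in the number of examples and is meant for illustration only, not as the evaluation code used in the paper.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the probability that a positive example outranks a negative one.

    labels are +1 / -1; tied scores count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y != 1]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


example_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1])
```

A value of 0.5 corresponds to random ranking, so the gap between 0.5 and the reported scores is the relevant measure of how much information the faces carry about the event.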

Learning nameable facial expressions from events in videos.

So far, we have shown that it is possible to predict events in videos by looking at facial expressions. Here we consider the other direction and ask whether nameable facial expressions can be learned by looking at people in TV programs reacting to events. To answer this question we applied the VGG-M pooling architecture to the FER images after pre-training it on VGG-Faces (a verification task) and fine-tuning it on FaceValue. In this manner, this CNN is never trained with manually-labelled emotions. Table 3 shows the distribution of FER nameable expressions for faces associated with “good” and “bad” FaceValue events by this model. There is a marked difference in the resulting distributions, with a significant peak for happiness for predicted “good” events and surprise and negative emotions for “bad” ones. This suggests that it is indeed possible to learn nameable expressions from their weak association with events in video, without the explicit and dedicated supervision that is commonly used.

Comparison with human baselines.

Table 4 compares the performance of humans and of the best models on the three datasets FER, SFEW 2.0, and FaceValue. Remarkably, in all cases networks outperform individual humans by a substantial margin (e.g. +15% on FER and +8% on FaceValue). While this result is perhaps surprising, we believe the reason is that, in such ambiguous tasks, machines learn to respond as humans would on average whereas the performance of individual annotators, as reflected in Table 4, can be low due to poor inter-annotator agreement. To verify this hypothesis, we combined multiple human annotators in a committee and found that this gap either closes or disappears. In particular, on FaceValue the performance of the committee is just a hair’s breadth lower than that of the machine (78% vs 79%).
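One simple way to form such a committee is a majority vote over the labels that different annotators gave to the same item. The sketch below assumes this rule; the text does not specify how the annotators were combined, and the tie-breaking behaviour and function names are likewise hypothetical.

```python
from collections import Counter


def committee_label(annotations):
    """Return the most common label among a list of annotator labels.

    Ties are broken in favour of the label that appears first; the
    tie-breaking rule used in the paper is not stated.
    """
    counts = Counter(annotations)
    best = max(counts.values())
    for label in annotations:
        if counts[label] == best:
            return label


def committee_accuracy(per_item_annotations, ground_truth):
    """Accuracy of the majority vote over all items."""
    predictions = [committee_label(a) for a in per_item_annotations]
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


majority = committee_label(["happy", "sad", "happy", "neutral"])
accuracy = committee_accuracy(
    [["a", "a", "b"], ["b", "b", "a"], ["a", "b", "b"]],
    ["a", "b", "a"],
)
```

Aggregating votes cancels out part of the disagreement between individuals, which is consistent with the committee scores in Table 4 being higher than the individual ones.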

6 Summary

In this paper we have investigated the problem of relating facial expressions with objectively-measurable events that affect humans in videos. We have shown that gameshows are a particularly useful data source for this type of analysis due to their simple structure, easily detectable events and emotional impact on the participants and have constructed a corresponding dataset FaceValue.

In order to analyze emotions in FaceValue, we have trained state-of-the-art neural networks for facial expression recognition on existing datasets, showing that, if pre-trained on face verification, single models are competitive with or better than the multi-network committees commonly used in the literature. Then, we have shown that such networks can understand the relationship between certain events in TV programs and facial expressions better than individual human annotators, and as well as a committee of several human annotators. We have also shown that the predictions of networks trained to predict such events from facial expressions correlate very well with nameable expressions in standard datasets.

Acknowledgements

The authors gratefully acknowledge the support of the EPSRC EP/L015897/1 (AIMS CDT) and the ERC 677195-IDIU. We also wish to thank Zhiding Yu for kindly sharing his preprocessed SFEW dataset.

References

  • [Attardo et al.(2003)Attardo, Eisterhold, Hay, and Poggi] Salvatore Attardo, Jodi Eisterhold, Jennifer Hay, and Isabella Poggi. Multimodal markers of irony and sarcasm. Humor, 16(2):243–260, 2003.
  • [Besel and Yuille(2010)] Lana DS Besel and John C Yuille. Individual differences in empathy: The role of facial expression recognition. Personality and Individual Differences, 49(2):107–112, 2010.
  • [Chatfield et al.(2014)Chatfield, Simonyan, Vedaldi, and Zisserman] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. 2014.
  • [Chen et al.(2014)Chen, Ren, Wei, Cao, and Sun] Dong Chen, Shaoqing Ren, Yichen Wei, Xudong Cao, and Jian Sun. Joint cascade face detection and alignment. In European Conference on Computer Vision, pages 109–122. Springer, 2014.
  • [de la Torre et al.(2015)de la Torre, Chu, Xiong, Vicente, Ding, and Cohn] Fernando de la Torre, Wen-Sheng Chu, Xuehan Xiong, Francisco Vicente, Xiaoyu Ding, and Jeffrey Cohn. Intraface. In Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, volume 1, pages 1–8. IEEE, 2015.
  • [Dhall et al.(2011a)Dhall, Goecke, Lucey, and Gedeon] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon. Acted Facial Expressions in the Wild Database. Technical report, Australian National University, 2011a.
  • [Dhall et al.(2011b)Dhall, Goecke, Lucey, and Gedeon] Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark. In Proc. ICCV Workshop, 2011b.
  • [Dhall et al.(2015)Dhall, Ramana Murthy, Goecke, Joshi, and Gedeon] Abhinav Dhall, O.V. Ramana Murthy, Roland Goecke, Jyoti Joshi, and Tom Gedeon. Video and image based emotion recognition challenges in the wild: Emotiw 2015. In Proc. ACM Int. Conf. on Multimodal Interaction, 2015.
  • [Dhall et al.(2012)] Abhinav Dhall et al. Collecting large, richly annotated facial-expression databases from movies. 2012.
  • [Ekman and Friesen(1969a)] Paul Ekman and Wallace V Friesen. Nonverbal leakage and clues to deception. Psychiatry, 32(1):88–106, 1969a.
  • [Ekman and Friesen(1969b)] Paul Ekman and Wallace V Friesen. The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1(1):49–98, 1969b.
  • [El Kaliouby et al.(2012)El Kaliouby, Dreisch, England, and Kodra] Rana El Kaliouby, Andrew Edwin Dreisch, Avril England, and Evan Kodra. Affect based concept testing, December 27 2012. US Patent App. 13/728,303.
  • [Goodfellow et al.(2015)Goodfellow, Erhan, Carrier, Courville, Mirza, Hamner, Cukierski, Tang, Thaler, Lee, Zhou, Ramaiah, Feng, Li, Wang, Athanasakis, Shawe-Taylor, Milakov, Park, Ionescu, Popescu, Grozea, Bergstra, Xie, Romaszko, Xu, Chuang, and Bengio] Ian J. Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, Yingbo Zhou, Chetan Ramaiah, Fangxiang Feng, Ruifan Li, Xiaojie Wang, Dimitris Athanasakis, John Shawe-Taylor, Maxim Milakov, John Park, Radu Ionescu, Marius Popescu, Cristian Grozea, James Bergstra, Jingjing Xie, Lukasz Romaszko, Bing Xu, Zhang Chuang, and Yoshua Bengio. Challenges in representation learning: A report on three machine learning contests. Neural Networks, 64:59–63, 2015.
  • [Gross et al.(2010)Gross, Matthews, Cohn, Kanade, and Baker] Ralph Gross, Iain Matthews, Jeffrey Cohn, Takeo Kanade, and Simon Baker. Multi-pie. Image and Vision Computing, 28(5):807–813, 2010.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, 2016.
  • [Huang et al.(2007)Huang, Ramesh, Berg, and Learned-Miller] Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
  • [Ioffe and Szegedy(2015)] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, 2015.
  • [Jain and Learned-Miller(2010)] Vidit Jain and Erik G Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. UMass Amherst Technical Report, 2010.
  • [Kim et al.(2016)Kim, Roh, Dong, and Lee] Bo-Kyeong Kim, Jihyeon Roh, Suh-Yeon Dong, and Soo-Young Lee. Hierarchical committee of deep convolutional neural networks for robust facial expression recognition. Journal on Multimodal User Interfaces, pages 1–17, 2016.
  • [King(2009)] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
  • [Koestinger et al.(2011)Koestinger, Wohlhart, Roth, and Bischof] Martin Koestinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In ICCV Workshop on Benchmarking Facial Image Analysis Technologies, 2011.
  • [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [Levi and Hassner(2015)] Gil Levi and Tal Hassner. Emotion recognition in the wild via convolutional neural networks and mapped binary patterns. In Proc. ACM Int. Conf. on Multimodal Interaction, 2015.
  • [Lucey et al.(2010)Lucey, Cohn, Kanade, Saragih, Ambadar, and Matthews] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 94–101. IEEE, 2010.
  • [Lyons et al.(1998)Lyons, Akamatsu, Kamachi, and Gyoba] Michael Lyons, Shota Akamatsu, Miyuki Kamachi, and Jiro Gyoba. Coding facial expressions with gabor wavelets. In Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, pages 200–205. IEEE, 1998.
  • [Mahendran and Vedaldi(2016)] Aravindh Mahendran and Andrea Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. 2016.
  • [McDuff et al.(2013)McDuff, Kaliouby, Senechal, Amr, Cohn, and Picard] Daniel McDuff, Rana Kaliouby, Thibaud Senechal, May Amr, Jeffrey Cohn, and Rosalind Picard. Affectiva-mit facial expression dataset (am-fed): Naturalistic and spontaneous facial expressions collected. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 881–888, 2013.
  • [Parkhi et al.(2015a)Parkhi, Vedaldi, and Zisserman] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015a.
  • [Parkhi et al.(2015b)Parkhi, Vedaldi, and Zisserman] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In British Machine Vision Conference, 2015b.
  • [Post et al.(2008)Post, Van den Assem, Baltussen, and Thaler] Thierry Post, Martijn J Van den Assem, Guido Baltussen, and Richard H Thaler. Deal or no deal? decision making under risk in a large-payoff game show. The American economic review, 98(1):38–71, 2008.
  • [Rep. Capuano and Rep. Jones(Introduced in US House of Representatives, 02/27/2015)] Michael E. Rep. Capuano and Walter B. Jr. Rep. Jones. We Are Watching You Act, H.R.1164, Introduced in US House of Representatives, 02/27/2015.
  • [Russakovsky et al.(2014)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Fei-Fei] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge, 2014.
  • [Schroff et al.(2015)Schroff, Kalenichenko, and Philbin] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
  • [Sebe et al.(2007)Sebe, Lew, Sun, Cohen, Gevers, and Huang] Nicu Sebe, Michael S Lew, Yafei Sun, Ira Cohen, Theo Gevers, and Thomas S Huang. Authentic facial expression analysis. Image and Vision Computing, 25(12):1856–1863, 2007.
  • [Simonyan and Zisserman(2015)] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. 2015.
  • [Vedaldi and Lenc(2015)] Andrea Vedaldi and Karel Lenc. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the 23rd ACM international conference on Multimedia, pages 689–692. ACM, 2015.
  • [Yu and Zhang(2015)] Zhiding Yu and Cha Zhang. Image based static facial expression recognition with multiple deep network learning. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 435–442. ACM, 2015.
  • [Zhang and Zhang(2014)] Cha Zhang and Zhengyou Zhang. Improving multiview face detection with multi-task deep convolutional neural networks. In IEEE Winter Conference on Applications of Computer Vision, pages 1036–1041. IEEE, 2014.
  • [Zhu and Ramanan(2012)] Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.