Unlike computer vision systems, which require explicit supervision, humans can learn facial expressions by observing people in their environment. In this paper, we look at how similar capabilities could be developed in machine vision. As a starting point, we consider the problem of relating facial expressions to objectively measurable events occurring in videos. In particular, we consider a gameshow in which contestants play to win significant sums of money. We extract events affecting the game and corresponding facial expressions objectively and automatically from the videos, obtaining large quantities of labelled data for our study. We also develop, using benchmarks such as FER and SFEW 2.0, state-of-the-art deep neural networks for facial expression recognition, showing that pre-training on face verification data can be highly beneficial for this task. Then, we extend these models to use facial expressions to predict events in videos and learn nameable expressions from them. The dataset and emotion recognition models are available at http://www.robots.ox.ac.uk/~vgg/data/facevalue
Humans make extensive use of facial expressions in order to communicate. Facial expressions are complementary to other channels such as speech and gestures, and often convey information that cannot be recovered from the other two alone. Thus, understanding facial expressions is often necessary to properly understand images and videos of people.
The general approach to facial expression recognition is to label a dataset of faces with either nameable expressions (e.g. happiness, sadness, disgust, anger, etc.) or facial action units (movements of facial muscles such as tightening the lips or raising an upper eyelid) and then learn a corresponding classifier, for example by using a deep neural network. In contrast, humans need not be explicitly told what facial expressions mean, but can learn this by associating facial expressions with how people react to particular events or situations. (Footnote 1: Generating certain facial expressions is an innate ability; however, recognizing facial expressions is a learned skill.)
In order to investigate whether algorithms can also learn facial expressions by establishing similar associations, in this paper we look at the problem of relating facial expressions to objectively-quantifiable contextual events in videos. The main difficulty of this task is that there is only a weak correlation between an event occurring in a video and a person showing a particular facial expression. However, learning facial expressions in this manner has three important benefits. The first one is that it grounds the problem on objectively-measurable quantities, whereas labelling emotions or even facial action units is often ambiguous. The second benefit is that contextual information can often be labelled in videos fully or partially automatically, obviating the cost of collecting large quantities of human-annotated data for data-hungry machine learning algorithms. Finally, the third advantage is that the ultimate goal of face recognition in applications is not so much to describe a face, but to infer from it information about a situation or event, which is tackled directly by our study.
Concretely, our first contribution (Sect. 2; Fig. 1) is to develop a novel dataset, FaceValue, of faces extracted from videos together with objectively-measurable contextual events. The dataset is based on the “Deal or No Deal” TV program, a popular game where contestants can win or lose significant sums of money. Using a semi-automatic procedure, we extract significant events in the game along with the player (and public) reaction. We use this data to predict from facial expressions whether events are “good” or “bad” for the contestant. To the best of our knowledge, this is the first example of leveraging gameshows in facial expression understanding and the first study aiming to relate facial expressions to people’s activities.
Our second contribution is to carefully assess the difficulty of this problem by establishing a human baseline and by extending the latter to existing expression recognition datasets for comparison (Sect. 3). We also develop a number of state-of-the-art expression recognition models (Sect. 4) and show that excellent performance can be obtained by transferring deep neural networks from face verification to expression recognition. Our final contribution is to extend such systems to the problem of recognising FaceValue events from facial expressions (Sect. 5). We develop simple but effective pooling strategies to handle face tracks, integrating them in deep neural network architectures. With these, we show that it is not only possible to predict events from facial expressions, but also to learn nameable expressions by looking at people spontaneously reacting to events in TV programs.
|Dataset|Size|Label source|Type|Labels|
|FER|35,887 Faces|Internet search|Mixed|6+1 emotions|
|AFEW 5.0|1,426 Clips|Subtitles|Acted|6+1 emotions|
|SFEW 2.0|1,635 Faces|Subtitles|Acted|6+1 emotions|
|AM-FED|168,359 Faces|Human experts|Spontaneous|FACS|
|FaceValue (ours)|192,030 Faces|Metadata extraction|Spontaneous|Event Outcome|
Facial expressions are a non-verbal mode of communication complementary to speech and gestures [Ekman and Friesen, 1969b; Attardo et al., 2003]. They can be produced unintentionally [Ekman and Friesen, 1969a], revealing hidden states of the actor in pain or deception detection [Besel and Yuille, 2010]. Facial expressions are commercially valuable, attracting increasing investment from advertising agencies that seek to understand and manipulate the consumer response to a product [El Kaliouby et al., 2012] and corresponding regulatory attention [Rep. Capuano and Rep. Jones, introduced in US House of Representatives, 02/27/2015].
Face-related tasks such as face detection, verification and recognition have long been researched in computer vision with the creation of several labelled datasets: FDDB [Jain and Learned-Miller, 2010], AFW [Zhu and Ramanan, 2012] and AFLW [Koestinger et al., 2011] for face detection; and LFW [Huang et al., 2007] and VGG-Face [Parkhi et al., 2015a] for face recognition and verification. Face detectors and identity recognizers can now rival the performance of humans [Schroff et al., 2015]. Facial expression recognition has also received significant attention in computer vision, but it presents a number of additional subtleties and difficulties which are not found in face detection or recognition. The main challenge is the consistent labelling of facial expressions, which is difficult due to the subjective nature of the task. A number of coding systems have been developed in an attempt to label facial expressions objectively, usually at the level of atomic facial movements, but even human experts are not infallible in generating such annotations. Furthermore, getting these experts to annotate a dataset is expensive and difficult to scale [McDuff et al., 2013]. Another issue is the “authenticity” of facial expressions, arising from the fact that several datasets are acted [Sebe et al., 2007], either specifically for data collection [Lyons et al., 1998; Lucey et al., 2010; Gross et al., 2010] or indirectly as data is extracted from movies [Dhall et al., 2015].
Our FaceValue dataset sidesteps these problems by recording spontaneous reactions to objectively-occurring events in videos.
Examples of datasets which contain challenging variations in pose, lighting conditions and subjects are given in Table 1. Of these, two in particular have received significant research interest as popular benchmarks for facial expression recognition. The Static Facial Expression in the Wild 2.0 (SFEW-2.0) data [Dhall et al., 2011b] (used in the EmotiW challenges [Dhall et al., 2015]) consists of images from movies which collectively contain 1,635 faces labelled with seven emotions (this dataset was constructed by selectively extracting individual frames from AFEW-5.0 [Dhall et al., 2012]). The Facial Expression Recognition 2013 (FER-2013) dataset [Goodfellow et al., 2015], which formed the basis of a large Kaggle competition, contains 35k images labelled with the same seven emotions. These datasets were used to develop several state-of-the-art emotion recognition systems. Among the top-performing ones, the authors of [Yu and Zhang, 2015] and [Kim et al., 2016] propose ensembles of deep networks trained on the FER and SFEW-2.0 data. There are also several commercial implementations of expression recognition, such as CMU’s IntraFace [de la Torre et al., 2015] and the Affectiva face software.
In this section we describe the FaceValue dataset (Fig. 1) and how it was collected.
The “Deal or No Deal” TV game show was selected as the basis for our data for a number of reasons. First, it contains a very significant amount of data. The show has been running nearly daily in the UK for the past eleven years, totalling 2,929 episodes. Each episode focuses on a different player and lasts for about forty minutes. Furthermore, the same or very similar shows are or were aired in dozens of other countries. Second, the game is based on simple rules and a sequence of discrete events that are in most cases easily identifiable as positive or negative for the player, and hence can be expected to induce a corresponding emotion and facial expression. Furthermore, these events are easily detectable by parsing textual overlays in the show or other simple patterns. Third, since there is a single player, it is easy to identify the person that is directly affected by the events in the video and the camera tends to focus on his/her face. (Footnote 2: Outside of computer vision, the interesting decision-making dynamics of contestants in a high-stakes environment during the “Deal or No Deal” game show have attracted research by economists [Post et al., 2008].)
An example of the in-game footage and data extraction pipeline is shown in Fig. 1. The rules of the game are easily explained. There are $n_0$ possible cash prizes $P_0 = \{p_1, \dots, p_{n_0}\}$, ranging from 1p up to £250,000. Initially the player is assigned a prize $p_a \in P_0$ but does not know its value. Then, at each round $t$ of the game the player randomly extracts (realised as opening a box, see Fig. 1 top-left) one of the prizes $p \neq p_a$ from $P_t$ and reveals it, resulting in a smaller set of possible prizes $P_{t+1} = P_t \setminus \{p\}$. Through this process of elimination the player obtains information about his/her prize $p_a$. Occasionally the player is offered the opportunity to leave the game with a prize $d_t$ (“deal”) determined by the game’s host or to continue playing (“no deal”) and eventually leave with $p_a$.
The expected value of the win at time $t$ is $E_t = \frac{1}{|P_t|} \sum_{q \in P_t} q$. When a prize $p$ is removed from $P_t$, the player perceives this as a “good” event if $E_{t+1} > E_t$, which requires $p < E_t$, and a “bad” event otherwise. In practice we conservatively require a stricter margin than $p < E_t$ for a good event. Interestingly, the game is continued even after the player has taken a “deal”; in this case the roles of “good” and “bad” events are reversed as the player hopes that the accepted deal is higher than the prize he/she gave up.
The data in FaceValue is defined as follows. Faces are detected right after a new prize is revealed for about seven seconds. These faces are collected in a “face track” $F_t$. Furthermore, the face track is assigned the binary label:
$$y_t = \delta_t \cdot \begin{cases} +1, & \text{if the event at time } t \text{ is good}, \\ -1, & \text{otherwise}, \end{cases}$$
where $\delta_t$ is $+1$ if the deal was not taken so far, and $-1$ otherwise. Note that there are several levels of indirection between $y_t$ and a particular expression being shown in $F_t$. For example, a player may not perceive a good or bad event according to this simple model, or could be responding to a stroke of bad luck with an ironic smile. The labels themselves, however, are completely objective.
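The labelling rule described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the expected win is taken to be the plain mean of the prizes still in play, and the `margin` parameter is a hypothetical stand-in for the conservative threshold, whose value is not given here. The function names are not from the paper.

```python
from statistics import mean


def expected_value(prizes):
    """Mean of the prizes still in play (assumed equally likely to be the player's)."""
    return mean(prizes)


def event_label(prizes, revealed, deal_taken=False, margin=1.0):
    """Return +1 for a "good" event and -1 for a "bad" one.

    prizes     : prizes in play before the box is opened (includes `revealed`)
    revealed   : the prize that is eliminated in this round
    deal_taken : if True, the roles of good and bad events are reversed
    margin     : hypothetical factor; margin < 1 makes the "good" test stricter
    """
    if revealed not in prizes:
        raise ValueError("the revealed prize must be one of the prizes in play")
    # Removing a prize below the current mean raises the mean of what remains.
    good = revealed < margin * expected_value(prizes)
    sign = 1 if good else -1
    return -sign if deal_taken else sign


prizes = [0.01, 1, 100, 1000, 250000]
print(event_label(prizes, 1000))                   # low prize eliminated
print(event_label(prizes, 250000))                 # top prize eliminated
print(event_label(prizes, 1000, deal_taken=True))  # roles reversed after a deal
```

With `margin=1.0` the test reduces to comparing the revealed prize against the current mean, which is exactly the condition under which the mean of the remaining prizes increases.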
Data is extracted from 102 episodes of the show, resulting in 192,030 frames distributed over 2,118 labelled face tracks. Shows are divided into training, validation and test sets, which also means that mostly different identities are contained in the different subsets.
One advantage of studying facial expressions from contextual events is that these are often easy to detect automatically. In our case, we take advantage of two facts. First, when a prize is removed from the set of remaining prizes, this is shown in the game as a box being opened (Fig. 1 top-left). This scene, which occurs systematically, is easy to detect and is used to mark the start of an event. Next, the camera moves onto the contestant (Fig. 1 top-middle) to capture his/her reaction. Faces are extracted from the seven seconds that immediately follow the event using the face detector of [King, 2009] and are stored as part of the face track. Occasionally the camera may capture the reaction of a member of the public; while it would be easy to distinguish different identities (e.g. by using the VGG-Faces model of Sect. 4), we prefer not to as the public is sympathetic with the contestant and tends to react in a similar manner, improving the diversity of the collected data. Finally, the value of the prize being removed can be extracted either from the opened box using a text spotting system or, more easily, by looking at which overlay is removed (Fig. 1 top-right). After automatic extraction, the data was fully checked manually for errors to ensure its quality.
As FaceValue defines a new task in facial expression interpretation, in this section we establish a human baseline as a point of comparison with computer vision algorithm performance. In order to compare FaceValue to existing facial expression recognition problems we establish similar baselines for two standard expression recognition datasets, FER and SFEW 2.0, introduced below.
The FER-2013 data [Goodfellow et al., 2015] contains 48×48 pixel images obtained by querying Google image search for 184 emotion-related keywords. The dataset contains 35,887 images divided into 4,953 “anger”, 547 “disgust”, 5,121 “fear”, 8,989 “happiness”, 6,077 “sadness”, 4,002 “surprise” and 6,198 “neutral”, further split into training (28,709), public test (3,589) and private test (3,589) sets. Goodfellow et al. [Goodfellow et al., 2015] note that this data is likely to contain label errors. However, their own human study obtained an average prediction accuracy of 65%, which is comparable to the performance obtained by expert annotators on a smaller but manually-curated subset of 1,500 acted images.
The SFEW-2.0 data [Dhall et al., 2011b] contains selected frames from different videos of the Acted Facial Expressions in the Wild (AFEW) dataset [Dhall et al., 2011a] assigned to either: 225 “angry”, 75 “disgust”, 124 “fear”, 256 “happy”, 228 “neutral”, 234 “sad” and 150 “surprise”. The training, validation and test splits are provided as part of the EmotiW challenge [Dhall et al., 2015] and are adopted here. The AFEW data was collected by searching movie closed captions for emotion-related keywords and then manually curating the results, generating a smaller number of labelled instances than FER.
For each dataset we consider a pool of annotators, most of whom are not computer vision experts, and ask them to predict the label associated with each face. In order to motivate annotators to be as accurate as possible, we pose the annotation process as a challenge. The goal is to guess the ground-truth label of an image, and a score displaying the annotator’s prediction accuracy is constantly updated. Ultimately, annotators’ performances are entered in a leaderboard. We found that this simple idea significantly improved the annotators’ performance.
The dataset instances selected for the annotation tasks were constructed as follows. From FER, a random sample of 500 faces was extracted from the Public Test set. From SFEW 2.0, the full Validation set (383 samples) was used (faces were extracted from each image as described in section 4). From FaceValue, a random sample of 250 face tracks was extracted from the validation set, each of which was transformed into an animated GIF to allow annotators to see the face motion. Performance on each dataset was evaluated by partitioning into five folds, each of which was annotated by a separate pool. Every face instance across the three datasets received at least four annotations.
On FER, our annotators achieved lower performance than results previously reported in [Goodfellow et al., 2015] (58.2% overall accuracy vs 65%). However, we also noted a significant variance between annotators, which means that at least some of them were able to match or exceed the 65% mark. The unevenness of the annotators shows how difficult or ambiguous this task can be even for motivated humans. The annotators found SFEW-2.0 a more challenging task, obtaining a lower average accuracy overall. One possible reason for this difference is the manner in which the datasets were constructed. FER faces were retrieved using Internet search queries which likely returned fairly representative examples of each expression; in contrast SFEW images were extracted from movies. On FaceValue, annotator accuracy was measured in the same way and, since the classification task is binary, the ROC-AUC was also computed for each annotator to facilitate a comparison with algorithmic approaches. The relatively low scores of humans on each dataset illustrate the particularly challenging nature of the task. This difficulty is underlined by the low levels of inter-annotator agreement (measured using Fleiss’ kappa) on the three datasets (FER, SFEW 2.0 and FaceValue) of 0.574, 0.424 and 0.491 respectively.
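For readers unfamiliar with the agreement measure, the following is a minimal sketch of the standard Fleiss' kappa computation. It assumes every item receives the same number of ratings, which is a simplification: the study above only states that each instance received at least four annotations. The function name and input layout are illustrative choices, not the authors' code.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of annotators who assigned item i to category j.
    Every row must sum to the same number of annotators (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Proportion of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement per item: fraction of annotator pairs that agree.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]

    p_observed = sum(p_item) / n_items
    p_chance = sum(p * p for p in p_cat)
    # Undefined (division by zero) if every rating falls in a single category.
    return (p_observed - p_chance) / (1 - p_chance)


# Four annotators, two categories ("good" / "bad"), four face tracks.
print(fleiss_kappa([[4, 0], [0, 4], [4, 0], [0, 4]]))  # perfect agreement
print(fleiss_kappa([[2, 2], [2, 2]]))                  # evenly split votes
```

A value of 1 indicates perfect agreement, while values near or below 0 indicate agreement no better than chance, which puts the reported figures of roughly 0.4 to 0.6 in context.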
In this section we develop state-of-the-art models for facial expression recognition in the two popular emotion recognition benchmarks of Sect. 3, namely FER and SFEW 2.0. Deep networks are currently the state-of-the-art models for emotion recognition, topping two of the last three editions of the Emotion recognition in the Wild (EmotiW) contest [Levi and Hassner, 2015]. While the standard approach is to learn large ensembles of deep networks [Kim et al., 2016; Yu and Zhang, 2015], here we show that a single network can in fact be competitive with or better than such ensembles if trained effectively. In order to do so we expand the available training data by pre-training models on other face recognition tasks, and in particular face identity verification, using the recent VGG-Faces dataset [Parkhi et al., 2015b].
We base our models on four standard CNN architectures: AlexNet [Krizhevsky et al., 2012], VGG-M [Chatfield et al., 2014], VGG-VD-16 [Simonyan and Zisserman, 2015] and ResNet-50 [He et al., 2016]. AlexNet is used as a reference baseline and is pre-trained on the ImageNet ILSVRC data [Russakovsky et al., 2014]. VGG-VD-16 is pre-trained on a recent dataset for face verification called VGG-Faces [Parkhi et al., 2015b]. This model achieves near state-of-the-art verification performance on the LFW [Huang et al., 2007] benchmark; however, it is also extremely expensive. Thus, we also train a smaller network, based on the VGG-M configuration. All models are trained with batch normalization [Ioffe and Szegedy, 2015] and are implemented in the MatConvNet framework [Vedaldi and Lenc, 2015].
Statistics such as image resolution and the usage of colour in the target datasets, and FER in particular, differ substantially from LFW and VGG-Faces. Nevertheless, we found that simply rescaling the smaller FER images to the higher VGG-Faces resolution and duplicating the grayscale intensities across the three colour channels produced excellent results. We also experimented with the opposite approach of pre-training on VGG-Faces images with reduced resolution and colour information removed; while this resulted in very competitive and more efficient networks, the full resolution models were still a little more accurate and are used in the rest of the work.
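The rescale-and-replicate step can be illustrated with a small dependency-free sketch. It makes two assumptions that are not stated in the text: the target resolution is 224×224 (the usual VGG input size), and nearest-neighbour interpolation is used purely for simplicity, where a real pipeline would more likely use bilinear or bicubic resizing. The function name is hypothetical.

```python
def gray_to_rgb_upscaled(image, size=224):
    """Upscale a grayscale image and copy its intensities into three channels.

    image : list of rows, each a list of grayscale intensities
    size  : side of the square output (224 is an assumed target resolution)
    Returns a size x size grid of (r, g, b) tuples with r == g == b.
    """
    height, width = len(image), len(image[0])
    output = []
    for y in range(size):
        source_row = image[y * height // size]  # nearest source row
        row = []
        for x in range(size):
            value = source_row[x * width // size]  # nearest source column
            row.append((value, value, value))      # duplicate gray into RGB
        output.append(row)
    return output


tiny = [[0, 255],
        [128, 64]]
upscaled = gray_to_rgb_upscaled(tiny, size=4)
print(len(upscaled), len(upscaled[0]))  # 4 4
print(upscaled[0][0], upscaled[3][0])   # (0, 0, 0) (128, 128, 128)
```

Duplicating the intensity across channels lets a network pre-trained on colour images consume grayscale input without any change to its first layer.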
After pre-training, each model is trained on the FER or SFEW 2.0 training set with a fine tuning ratio of 0.1. This is obtained by retaining all but the last layer, which performs $C$-way classification, where $C$ is the number of possible facial expression classes.
FER:
|Model|Pretraining|Test (Public)|Test (Private)|
|[Kim et al., 2016] (best single CNN)|-|-|70.58%|
|[Kim et al., 2016] (ensemble)|-|-|72.72%|

SFEW 2.0:
|Model|Pretraining|Validation|Test|
|[Yu and Zhang, 2015] (best single CNN)|FER combined|52.29%|58.06%|
|[Kim et al., 2016] (best single CNN)|FER + TFD|52.50%|57.3%|
|[Yu and Zhang, 2015] (ensemble)|FER combined|55.96%|61.29%|
|[Kim et al., 2016] (ensemble)|FER + TFD|52.80%|61.6%|
The FER results table compares the different architectures with the state of the art on FER. When reporting ensemble models, both the best single CNN and the full ensemble are listed. The best previous result on FER is 72.72% accuracy, obtained using the hierarchical committee of deep CNNs described in [Kim et al., 2016], combining more than 36 different models. By comparison, VGG-VD-16 pre-trained on VGG-Faces achieves a slightly superior performance at 72.89%. VGG-M achieves nearly the same performance at a substantially reduced computational cost. We also note the importance of choosing a face-related pre-training set, as pre-training on ImageNet loses 3-4% of performance.
Table 3 reports the results on the SFEW-2.0 dataset instead. Since the dataset itself consists of labelled scene images, we use the faces extracted by the accurate face detection pipeline described in [Yu and Zhang, 2015], which applies an ensemble of face detectors [Zhang and Zhang, 2014; Chen et al., 2014; Zhu and Ramanan, 2012]. As SFEW is much smaller than FER, pre-training is in this case much more important. The four models pre-trained with ImageNet only performed worst. Pre-training on VGG-Faces produced substantially better results (+10%), and pre-training on VGG-Faces and FER-Train produced better still (+18%). The best single model, VGG-VD-16, achieves better performance than existing single and ensemble networks (+2.5%) on the validation set, and better performance than all but the ensembles of [Yu and Zhang, 2015; Kim et al., 2016] on the test set (-2%).
While CNNs perform well, it is often difficult to understand what they are learning given their black-box nature. Here we use the technique of [Mahendran and Vedaldi, 2016] to visualize the best FER/SFEW model. This technique seeks to find an image which, under certain regularity assumptions, maximizes the CNN confidence that the image represents a given emotion. Results are reported in Fig. 4 for the VGG-VD-16 model trained on the FER dataset. Notably, the reconstructed pictures are mosaics of parts representative of the corresponding emotions.
In this section we focus on the main question of the paper i.e. whether facial expressions can be used to extract information about events in videos.
As a baseline, a state-of-the-art emotion recognition CNN is applied to each frame in the face track. The faces in a face track are individually classified by the CNN and the results are pooled to predict whether the event is positive or negative. Positive emotions (happiness) vote for the first case, negative emotions (sadness, fear, anger, disgust) for the second, and neutral/surprise emotions are ignored. The label with the largest number of votes in the track wins.
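The voting baseline is simple enough to sketch directly. The emotion names below follow the grouping given above; the per-frame predictions are assumed to come from some emotion classifier that is not shown, and returning 0 on a tie (or when no frame casts a vote) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter

POSITIVE = {"happiness"}
NEGATIVE = {"sadness", "fear", "anger", "disgust"}
# "neutral" and "surprise" cast no vote.


def vote_event(frame_emotions):
    """Majority vote over per-frame emotion predictions for one face track.

    Returns +1 (positive event), -1 (negative event) or 0 when the vote is
    tied or no frame voted (the tie rule is an assumption of this sketch).
    """
    votes = Counter()
    for emotion in frame_emotions:
        if emotion in POSITIVE:
            votes[+1] += 1
        elif emotion in NEGATIVE:
            votes[-1] += 1
    if votes[+1] == votes[-1]:
        return 0
    return +1 if votes[+1] > votes[-1] else -1


track = ["happiness", "neutral", "happiness", "sadness", "surprise"]
print(vote_event(track))  # two positive votes against one negative vote
```

Because happiness is the only positive class while four classes vote negative, the outcome depends heavily on how often the frame classifier predicts happiness, which is one reason a learned pooling model may be preferable.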
There are two significant shortcomings in the baseline. First, it assumes a particular map between emotions in existing datasets and positive and negative events in FaceValue. Second, it integrates information across frames using an ad-hoc voting procedure which may be suboptimal. In order to address these shortcomings we learn on FaceValue a new model that explicitly pools information across frames in a track. A pre-trained network $\phi$ is split in two parts, $\phi = \phi_2 \circ \phi_1$. Then, the first part $\phi_1$ is run independently on each frame $f_i$ of the track, the results are pooled by either average or max pooling across time and the result is fed to $\phi_2$ for binary classification: $\hat{y} = \phi_2(\operatorname{pool}_i \phi_1(f_i))$. The resulting architecture is fine-tuned on the FaceValue training set.
In practice, we found that the best results were obtained by using emotion recognition networks such as VGG-VD-16 trained on the FER data (Sect. 4). All layers up to fc7, producing 4,096 dimensional feature vectors, are retained in the first part of the network. The best pooling function was found to be averaging followed by normalization of the 4,096 dimensional features. The last layer is fully connected (in practice, this layer is a linear predictor). CNNs are trained using the hinge loss, which generally performs better than softmax for binary classification.
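The pooling head described above can be sketched without any deep learning framework. The per-frame feature vectors are assumed to be given (in the paper they would be 4,096 dimensional fc7 activations), the normalization is assumed to be L2 normalization, which the text does not specify, and all function names are illustrative.

```python
import math


def average_pool(features):
    """Average a list of per-frame feature vectors across time."""
    n_frames = len(features)
    return [sum(column) / n_frames for column in zip(*features)]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean norm (assumed form of normalization)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def track_score(features, weights, bias=0.0):
    """Linear predictor applied to the pooled, normalized track descriptor."""
    pooled = l2_normalize(average_pool(features))
    return sum(w * x for w, x in zip(weights, pooled)) + bias


def hinge_loss(score, label):
    """Hinge loss for a label in {+1, -1}."""
    return max(0.0, 1.0 - label * score)


# Two frames with 2-dimensional features (stand-ins for fc7 activations).
frames = [[3.0, 0.0], [3.0, 8.0]]
score = track_score(frames, weights=[1.0, 1.0])
print(score)                  # pooled [3, 4] -> normalized [0.6, 0.8]
print(hinge_loss(score, +1))  # no loss once the margin of 1 is exceeded
print(hinge_loss(score, -1))  # penalized when the sign is wrong
```

Swapping `average_pool` for an element-wise maximum over frames would give the max pooling variant that the text compares against.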
|Dataset|Metric|Human (individual)|Human (committee)|CNN|
|FER (public test)|Accuracy|0.57|0.66|0.72|
|SFEW 2.0 (val)|Accuracy|0.53|0.63|0.56 [Yu and Zhang, 2015]|
The FaceValue results table reports the performance of different model variants. As for SFEW 2.0 (Table 3), pre-training on VGG-Face+FER is preferable to pre-training on VGG-Face only. This is required for the voting classifier, but is also beneficial when fine-tuning a pre-trained pooling architecture, which handily outperforms voting. VGG-M is in this case better than VGG-VD, probably because VGG-VD is overfitted to the pre-training data. Finally, temporal average pooling is always better than max pooling.
So far, we have shown that it is possible to predict events in videos by looking at facial expressions. Here we consider the other direction and ask whether nameable facial expressions can be learned by looking at people in TV programs reacting to events. To answer this question we applied the VGG-M pooling architecture to the FER images after pre-training it on VGG-Faces (a verification task) and fine-tuning it on FaceValue. In this manner, this CNN is never trained with manually-labelled emotions. Fig. 3 shows the distribution of FER nameable expressions for faces associated with “good” and “bad” FaceValue events by this model. There is a marked difference in the resulting distributions, with a significant peak for happiness for predicted “good” events and for surprise and negative emotions for “bad” ones. This suggests that it is indeed possible to learn nameable expressions from their weak association with events in video, without the explicit and dedicated supervision that is commonly used.
Table 4 compares the performance of humans and of the best models on the three datasets FER, SFEW 2.0, and FaceValue. Remarkably, in all cases networks outperform individual humans by a substantial margin (e.g. +15% on FER and +8% on FaceValue). While this result is perhaps surprising, we believe the reason is that, in such ambiguous tasks, machines learn to respond as humans would on average, whereas the performance of individual annotators, as reflected in Table 4, can be low due to poor inter-annotator agreement. To verify this hypothesis, we combined multiple human annotators in a committee and found that this gap either closes or disappears. In particular, on FaceValue the performance of the committee is just a hair’s breadth lower than that of the machine (78% vs 79%).
In this paper we have investigated the problem of relating facial expressions with objectively-measurable events that affect humans in videos. We have shown that gameshows are a particularly useful data source for this type of analysis due to their simple structure, easily detectable events and emotional impact on the participants and have constructed a corresponding dataset FaceValue.
In order to analyze emotions in FaceValue, we have trained state-of-the-art neural networks for facial expression recognition on existing datasets, showing that, if pre-trained on face verification, single models are competitive with or better than the multi-network committees commonly used in the literature. Then, we have shown that such networks can understand the relationship between certain events in TV programs and facial expressions better than individual human annotators, and as well as a committee of several human annotators. We have also shown that the predictions of networks trained to predict such events from facial expressions correlate very well with nameable expressions in standard datasets.
The authors gratefully acknowledge the support of the EPSRC EP/L015897/1 (AIMS CDT) and the ERC 677195-IDIU. We also wish to thank Zhiding Yu for kindly sharing his preprocessed SFEW dataset.
Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, 2016.
[Kim et al., 2016] Kim, Roh, Dong, and Lee. Hierarchical committee of deep convolutional neural networks for robust facial expression recognition. Journal on Multimodal User Interfaces, pages 1–17, 2016.
[Zhu and Ramanan, 2012] Zhu and Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.