EEV Dataset: Predicting Expressions Evoked by Diverse Videos

01/15/2020 · by Jennifer J. Sun, et al.

When we watch videos, the visual and auditory information we experience can evoke a range of affective responses. The ability to automatically predict evoked affect from videos can help recommendation systems and social machines better interact with their users. Here, we introduce the Evoked Expressions in Videos (EEV) dataset, a large-scale dataset for studying viewer responses to videos based on their facial expressions. The dataset consists of a total of 4.8 million annotations of viewer facial reactions to 18,541 videos. We use a publicly available video corpus to obtain a diverse set of video content. The training split is fully machine-annotated, while the validation and test splits have both human and machine annotations. We verify the performance of our machine annotations against human raters, obtaining an average precision of 73.3%. We establish baseline performance on the EEV dataset using an existing multimodal recurrent model. Our results show that affective information can be learned from EEV, but with a MAP of 20.32%, there is still considerable room for improvement. This gap motivates the need for new approaches to understanding affective content. Our transfer learning experiments show an improvement in performance on the LIRIS-ACCEDE video dataset when pre-trained on EEV. We hope that the size and diversity of the EEV dataset will encourage further explorations in video understanding and affective computing.

1 Introduction

Videos can be described by their semantic content and their affective content. The semantic content is concerned with “what is in the video?” while the affective content focuses on “what does the video make people feel?” [44]. An example is illustrated in Figure 1. Our work focuses on understanding the affective content evoked by the visual and audio information in videos. The ability to automatically process affective content is important to close the gap in video understanding between humans and machines. This ability can help recommendation systems and social machines better interact with their users.

Figure 1: Examples of semantic and affective content in videos. Our work will focus on developing a dataset for a better understanding of affective content. Video credit in figure to [12].

Recent studies have shown promising results in describing videos automatically using their semantic content; however, there have been relatively few studies on affective content. This imbalance could partially be due to the lack of large-scale affective content datasets. For studying semantic content, datasets such as Sports1M [26], Kinetics [8], YouTube8M [1], and Moments in Time [38] have enabled promising results in a wide range of videos. In contrast, datasets studying affective video content are much smaller [30, 5, 45] and typically focus on one video type, such as films. These limitations can be attributed to the challenge of collecting affective labels. It is difficult to find reliably affect-evoking stimuli [21], and affective labels are more subjective given they depend on viewer background and context [44].

To build a large-scale dataset for understanding affective content, we introduce a scalable method for automatic annotation of evoked viewer facial expressions in videos. We further verify a subset of the labels with human annotations. Our dataset uses publicly available videos with associated viewer reactions to generate facial expression labels. The result is the Evoked Expressions in Videos (EEV) dataset, with 18,541 videos densely annotated each second for a total of 4.8 million annotations. The diversity of our dataset is shown in Figure 2. To the best of our knowledge, the EEV dataset is currently the largest dataset for studying affective content from evoked viewer facial expressions.

The size and diversity of our dataset enable us to study unique questions relating to affective content. In particular: how well can we predict evoked viewer facial expressions directly from video content? Are some evoked facial expressions easier to predict? How do expression distributions vary across themes? We explore these areas by analyzing the characteristics of the EEV dataset and establishing baseline expression prediction benchmarks.

Figure 2: Distribution of the top 20 themes in EEV, annotated using an automatic system described in section 4.1. The themes are not mutually exclusive.

Our contributions are:

  • The Evoked Expressions in Videos (EEV) dataset, a large-scale dataset annotated at 1 Hz for studying evoked facial expressions from diverse, online videos. The annotated facial expressions are: amusement, concentration, contentment, interest, and surprise. We are planning to release this dataset.

  • A scalable method for annotating evoked viewer facial expressions in online videos.

  • A baseline for model performance on the EEV dataset based on [47].

2 Related Work

2.1 Affective Video Content Analysis

Our work is closely related to the field of affective video content analysis, which aims to predict evoked viewer reactions from videos [3, 51]. Hanjalic & Xu [20] proposed one of the earliest efforts in the field, mapping movies to continuously evoked valence and arousal. Often, both viewer and video information are incorporated to predict video affect [43, 45, 30, 6]: physiological signals can be measured from a viewer as they watch the video and combined with visual features for prediction. Other studies [5, 23, 53, 56] use the direct approach: they aim to predict viewer responses directly from video content. Given that we do not use any additional viewer information, our work most closely relates to the direct approach.

Models in this area are often multimodal, given that both visual and auditory features contribute to a video’s affective content [50]. A review by Wang & Ji [51] found that modeling frameworks often consist of video feature extraction, modality fusion, and classification or regression (depending on the model of affect). Recent models generally follow this framework using neural networks [39, 31, 47, 2, 4, 25]. In particular, the LIRIS-ACCEDE dataset [4, 5], part of the MediaEval benchmark [13], provides a way to compare model performance for affective content analysis. In 2018, the top performing models [31, 47] applied RNNs to model the temporal aspects of video with a multimodal approach. We establish the performance baseline on the EEV dataset based on the top performing models on the MediaEval benchmark.

Facial response to media. The EEV dataset is designed to match video content with viewer facial expressions, and our work is related to the subset of the literature that focuses on facial responses. Videos have often been used to evoke viewer facial expressions in studies across psychology and affective computing [11, 3, 21, 35, 42] (e.g., choosing a funny video so viewers laugh). Our corpus of videos can be used to identify content that evokes distinct facial expressions and thus may be useful for these studies.

Facial expressions have been used as predictors in studies on self-reported emotions [42, 36], facial landmark locations [15] and viewer video preferences [54, 33, 34]. These studies have also used automated systems to annotate facial expressions in videos. However, rather than using facial expressions as the predictors, we predict evoked viewer facial expressions directly from online videos.

2.2 Facial Expression Recognition

The expressions in the EEV dataset are annotated by machines and humans. The automated annotations use a facial expression recognition model. Traditionally, automated methods predict facial action units from the facial action coding system (FACS) [17, 32]. However, it remains difficult to obtain accurate FACS annotations in natural contexts, such as user-uploaded videos, which have uncontrolled viewpoint, lighting, occlusion, and demographics. We instead apply a semantic space approach to classifying facial expressions, recently introduced by Cowen & Keltner [10]. This approach has revealed that, in natural images, people reliably recognize a much wider array of facial expressions than has traditionally been studied using FACS. The information conveyed by facial expressions can be represented efficiently in terms of categories such as “amusement” and “surprise”. Based on these findings, we developed automated methods that capture the information conveyed by natural facial expressions.

Figure 3: Size of datasets with affective labels for affective video content analysis. The EEV dataset is highlighted in green.

2.3 Affective Video Datasets

Datasets for studying affective content are small relative to other video benchmark datasets [1, 26, 8]. Existing datasets such as DEAP [30], VideoEmotion [23], and FilmStim [40] are labelled with one annotation per video clip, and the largest (VideoEmotion) contains 1,101 videos. While these datasets are useful for other applications, they are too small to test complex models and cannot be used to understand temporal changes in affect. Some datasets annotate evoked viewer affect over time, such as LIRIS-ACCEDE [4, 5], COGNIMUSE [56], and HUMAINE [16]; these have frame-level annotations based on viewer self-reports. The largest, LIRIS-ACCEDE [4, 5], consists of 160 films with annotations every second. A subset of LIRIS-ACCEDE (66 films) is continuously annotated with self-reported valence and arousal and is used in the MediaEval benchmark [13]. These datasets often focus on one category of video (films or music videos).

Compared to existing datasets, the EEV dataset is significantly larger, as illustrated in Figure 3 (EEV-H indicates the human-annotated subset of EEV). Although we use facial expressions instead of self-report, both methods for measuring affect have been found to share significant variance [27]. Kassam [27] also found that analyzing facial expressions can provide unique insights into affective experiences. Another characteristic of the EEV dataset is that it contains diverse video themes, shown in Figure 2, enabling affective content to be studied across categories.

Subjectivity. We recognize that there is a need for personalization in understanding affective content, because viewer responses depend on the experience of the subject [3, 51, 44]. However, as we observe and as argued by [44], affective response is not arbitrary, and agreement can often be found across viewers. The challenge identified by [44] is to recognize common affective triggers in videos. By providing the largest video dataset of evoked viewer facial expressions, EEV enables this to be explored further in the future.

3 Data Collection

Figure 4: An overview of the data collection process for EEV. Video credit in figure to [12].

The EEV dataset leverages existing facial expression recognition models alongside human annotations in order to study affective content at a large scale. The source of the EEV dataset is publicly available reaction videos, which depict one or more viewers reacting to another video. The video that the viewers are watching is typically another publicly available video, which we call the content video. The facial expressions in the reaction video are used to generate expression labels for the content video. An outline is illustrated in Figure 4.

Video selection. We first compile a list of public reaction videos on a publicly available video corpus by performing a crawl based on keywords in the video title. We then identify videos containing one or more viewers along with the video they were watching. These videos are processed as reaction video and content video pairs, as shown in Figure 4.

We aim to build a dataset of diverse uploader expressions and videos by sampling from a wide range of reaction videos. However, there are biases in which content videos are selected and in how viewers express themselves while producing a reaction video. Our dataset may not fully reflect the true distribution of a general audience watching videos, since it is composed of the expressions of viewers who choose to upload reaction videos.

Facial expression selection. Our facial expression labels are from the work of Cowen & Keltner [11, 10]. By analyzing the self-reports of viewers of 2,185 emotionally evocative videos, Cowen & Keltner [11] suggest that reported user experiences are better captured by emotion categories than by dimensional labels. This work defines 27 emotion categories, which were found to be linked by smooth gradients. A follow-up work [10] examined 28 expression categories that are expressed by the face and body, using human annotations. The facial expression categories in EEV are based on this work. In our approach, we focus on a subset of the 28 expressions in [10] typically associated with user-uploaded reaction videos: amusement, concentration, contentment, interest, and surprise. Given more data sources, we can expand the EEV dataset to include more expressions in the future.

3.1 Facial Expression Annotation

The training set of EEV is fully machine annotated, while the validation and test sets are annotated by both machines and humans. The automatic annotations are derived from a facial expression recognition method similar to Vemulapalli & Agarwala [49]. We evaluate the automatic method against human annotations on the Berkeley faces dataset [10] and on EEV reaction videos. For human annotation, annotators are presented with 3-second clips of reaction videos and select all facial expressions that are present on faces for the entire clip.

Automatic annotations. The facial expression recognition model is based on Vemulapalli & Agarwala [49]. This method uses facial movements over 3 seconds at 6 Hz to predict facial expression labels at each video frame.

We first compute face-based features from input video frames using the NN2 FaceNet architecture [41]. Specifically, we extract the inception (5a) block, a 7x7 feature map with 1,024 channels, and feed it into a 7x7 average pooling layer to create a 1,024-dimensional feature vector representing a single face at a single time point in a video. We feed the face features computed over a given video segment into two long short-term memory (LSTM) layers, each with 64 recurrent cells, to capture temporal information. The output of the LSTM is then fed through a mixture of experts model (2 mixtures in addition to a dummy expert). The network is trained on 274k ratings of 187k face clips from a public video corpus, manually annotated by human raters who selected all facial expression categories that applied to each face.
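To make this structure concrete, the following is a minimal TensorFlow sketch (not the production model) of the pooling and temporal stages: the inception (5a) feature map is average-pooled into a 1,024-D face vector, and a 3-second window of such vectors at 6 Hz (18 steps, an assumption derived from the stated rates) is passed through the two 64-cell LSTMs. The mixture-of-experts output head is omitted.

```python
import tensorflow as tf

def pool_face_embedding(inception_5a):
    """inception_5a: (batch, 7, 7, 1024) feature map from the FaceNet NN2
    backbone; 7x7 average pooling yields one 1024-D vector per face."""
    return tf.reduce_mean(inception_5a, axis=[1, 2])  # (batch, 1024)

def build_temporal_encoder(seq_len=18, feat_dim=1024):
    """Two 64-cell LSTMs over a 3 s window of face features sampled at 6 Hz.
    The final 64-D state would feed the mixture-of-experts head (2 mixtures
    plus a dummy expert), which is not shown in this sketch."""
    faces = tf.keras.Input(shape=(seq_len, feat_dim))
    x = tf.keras.layers.LSTM(64, return_sequences=True)(faces)
    x = tf.keras.layers.LSTM(64)(x)
    return tf.keras.Model(faces, x)
```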

We applied the automatic annotation model to reaction videos to generate annotations at 6 Hz. We then apply a sliding window of 3 seconds, with a stride of 1 second, over the 6 Hz annotations, using majority voting for each expression within the window. This smooths the data temporally, minimizes noise, and reduces our annotations from 6 Hz to 1 Hz.
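A minimal sketch of this smoothing step is shown below, assuming binary per-frame expression labels and interpreting majority voting as an expression being present in more than half of the frames in the window:

```python
import numpy as np

def smooth_annotations(labels_6hz, window_s=3, stride_s=1, rate_hz=6):
    """labels_6hz: (num_frames, num_expressions) binary array at 6 Hz.
    Returns 1 Hz labels: one majority vote per 3 s window with a 1 s stride."""
    win, step = window_s * rate_hz, stride_s * rate_hz
    out = []
    for start in range(0, labels_6hz.shape[0] - win + 1, step):
        window = labels_6hz[start:start + win]
        out.append((window.mean(axis=0) > 0.5).astype(np.int32))
    return np.array(out)  # (num_seconds, num_expressions)

# Example: 10 s of 6 Hz annotations for the five expressions -> 8 rows at 1 Hz.
labels_1hz = smooth_annotations(np.random.randint(0, 2, size=(60, 5)))
```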

Evaluation on Berkeley faces [10]. To evaluate the accuracy of the annotation model on natural expressions, we compared its outputs to human judgments of faces in a held-out test set from the dataset introduced by Cowen & Keltner [10]. The results are in Figure 4(a), and the average prediction correlation is 0.70. Although our model is not perfect, its annotations agree with the average human rating more closely than single human raters do on natural expressions. Importantly, the prediction correlations were similar for faces of different ethnicities, genders, and age groups (Figure 4(b)).

(a) Comparing model predictions to human raters.
(b) Correlation over different demographics.
Figure 5: Total prediction correlation of the facial expression recognition model on the dataset in [10], across the five expression classes.
Expression Precision Recall
Amusement 0.770 0.289
Concentration 0.840 0.134
Contentment 0.552 0.182
Interest 0.769 0.475
Surprise 0.733 0.017
Table 1: Precision and recall of automatic annotation model of facial expressions compared to human annotators on EEV reaction videos from the validation subset.

Evaluation on EEV reaction videos. We further evaluate our automatic expression model on the human-annotated EEV validation subset. This set consists of 39,776 annotations on 192 reaction videos.

For this evaluation, we use predictions at the operating point of our model to compute precision and recall against the human-annotated ground truth, as shown in Table 1. We found the average precision and recall to be 73.3% and 21.9% respectively. Our precision and recall are comparable to other large-scale automatically annotated datasets such as YouTube8M [1]. Since EEV has human annotations for the validation and test sets but a noisier training set, it is a good dataset to test approaches that model noise in training data [52, 7, 55]. An additional factor that causes differences between machine and human annotations is that machine annotations measure facial muscle movements, while human annotations describe how people conceptualize those movements in light of other available cues (body posture, speech, visible context, etc.) [10]. To fully understand the effects of each, we report the performance of our baseline model on both human- and machine-annotated subsets.
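For illustration, per-class precision and recall against the human ground truth can be computed as in the sketch below (binary multi-label arrays are assumed; this is not the exact evaluation code):

```python
import numpy as np

def precision_recall(machine, human):
    """machine, human: (num_annotations, num_classes) binary arrays.
    Returns per-class precision and recall of the machine annotations."""
    tp = np.sum((machine == 1) & (human == 1), axis=0)
    fp = np.sum((machine == 1) & (human == 0), axis=0)
    fn = np.sum((machine == 0) & (human == 1), axis=0)
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    return precision, recall
```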

4 Dataset Characteristics

The EEV dataset consists of 18,541 videos over 1,351 hours annotated at 1 Hz, for a total of 4,863,653 automatic annotations. There are 39,776 human annotations over 192 videos in the validation set, with 3 repeats from different annotators. For our human annotations, we focused on optimizing the tradeoff between annotation budget and time. An additional 200 videos are annotated for the EEV test set, which we plan to release. The annotations cover five expression classes from [10]: amusement, concentration, contentment, interest, and surprise. The dataset is multi-label, and each annotation may consist of zero, one, or multiple expression classes.

Expression distribution. The distribution of the expressions is shown in Table 2. The class imbalance in EEV is likely due to two factors: the natural imbalance of expressions in reaction videos and the low recall of our model. Videos likely do not evoke all expressions at the same rate. Furthermore, some expressions, such as surprise, last for a comparatively short duration compared to concentration or interest. Since our dataset is annotated at the frame level, this results in fewer annotations of surprise.

Expression    Percent   Expression      Percent
Amusement     6.52 %    Concentration   15.67 %
Contentment   8.83 %    Interest        45.23 %
Surprise      0.08 %    -               -
Table 2: Facial expression distribution across the entire EEV dataset.

Dataset split. We split the EEV dataset at the video level with a roughly 60:15:25 split into 11,127 training videos, 2,784 validation videos and 4,639 test videos. The video level split ensures that all annotations from one video are only in one split. We verified that there are similar distributions of expressions in each dataset split.
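A minimal sketch of such a video-level split is shown below; the exact sampling procedure used for EEV is not described here, but the sketch illustrates the constraint that every annotation from a given video falls into a single split:

```python
import random

def video_level_split(video_ids, ratios=(0.60, 0.15, 0.25), seed=0):
    """Assign whole videos (and hence all of their annotations) to splits."""
    ids = sorted(set(video_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(ratios[0] * len(ids))
    n_val = int(ratios[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }
```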

Average statistics. There are 2.66 reaction videos associated with each content video on average. For the content videos, the average video length is 5 minutes, and a video can range from 1 to 30 minutes. Finally, the average number of expressions per annotation (label cardinality) is 0.830.

Figure 6: The co-occurrence matrix shows how often facial expressions are labelled together for each annotation in EEV (best viewed in color).

Expression co-occurrences. The EEV dataset is multi-label, and there are 804,616 seconds for which more than one expression occurs. We examine the expression co-occurrence matrix in Figure 6. The matrix is normalized by the diagonal, such that each row is divided by its diagonal entry. We can see that amusement tends to co-occur with contentment. Surprise also co-occurs with interest and concentration.
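The normalization described above can be written as a short sketch (binary multi-label annotations assumed):

```python
import numpy as np

def cooccurrence_matrix(labels):
    """labels: (num_annotations, num_expressions) binary array.
    Entry (i, j) counts annotations where expressions i and j are both
    present; each row is then divided by its diagonal entry, as in Figure 6."""
    counts = labels.T @ labels              # raw co-occurrence counts
    diag = np.maximum(np.diag(counts), 1)   # guard against empty classes
    return counts / diag[:, None]
```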

Challenges. The EEV dataset is challenging because viewer expressions can depend on viewer background, external context, and other information not present in the video’s visual and audio data. This is a challenge for directly predicting viewer response from general, “in-the-wild” videos. Despite this, we show in section 6.2 that our baseline model can learn useful information from the EEV dataset using only video content.

4.1 Video Themes

Theme annotation. We characterize the EEV dataset in terms of video themes (key topics that can be used to describe the video). The video themes are obtained from the video annotation system described in [19]. These annotations correspond to the Knowledge Graph entities [29] of the video, which are computed based on video content and metadata [19]. We summarize each video into a set of video themes using the Knowledge Graph entities, similar to the approach used by YouTube8M [1]. This is so that we can better understand the video composition of the EEV dataset. The distribution of the themes in EEV is shown in Figure 2.

Figure 7: Distribution of expressions across different video themes in EEV.

Expression distribution in themes. We choose a subset of themes that we expect to have distinct expression distributions and present the groundtruth distributions in Figure 7. We perform row normalization so that the plot will be invariant to the total number of videos for each theme. We can see that comedy has a greater proportion of amusement labels compared to other themes. Surprise is relatively rare across all themes. Horror has the greatest proportion of concentration and not as much amusement or contentment.

We hope these unique characteristics of the EEV dataset can encourage further studies in video understanding and affective computing.

5 Expression Prediction Model

Our expression prediction model is based on Sun et al. [47]. Our goal is to produce performance baselines for EEV using an existing model benchmarked on the LIRIS-ACCEDE test set. The LIRIS-ACCEDE dataset [4, 5], part of the MediaEval benchmark [13], provides a way to compare model performance in affective content analysis. The model that achieved top performance on the MediaEval benchmark task in 2018 [47] uses a combination of pre-computed features across multiple modalities, gated recurrent units (GRUs) [9], and a mixture of experts (MOE). We use a similar architecture in our experiments. This architecture has also been studied by [37] and performed well on the YouTube8M dataset [1]. Figure 8 presents an overview.

5.1 Feature Extraction

Videos present information to viewers through multiple modalities, so a multimodal approach is needed to understand viewer responses [50]. We leverage information from the image, face, and audio modalities by extracting frame-level features for each second of the videos.

Image and face. For the image features, we read the video frames at 1 FPS and feed them into the Inception V3 architecture [48] trained on ImageNet [14]. We extract the ReLU activation of the last hidden layer, resulting in a 2048-D feature vector for the image input. For the face features, we use the same extracted frames but focus on the two largest faces in the image. The face features are extracted using an Inception-based model trained on faces [41]. This process results in another 2,048-D feature vector based on the faces. We pad the face feature vector with zeros if fewer than two faces are detected in the image.

Audio. The audio feature extraction is based on the VGG-style model provided by AudioSet [18] trained on a preliminary version of YouTube8M [1]. In particular, we extracted audio at 16kHz mono and followed the method from AudioSet to compute the log mel-spectrogram. The log mel spectrogram is then fed into the VGG-style model, which outputs 128-D embeddings.
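As a rough illustration of the three feature branches, the sketch below approximates the image branch with the publicly available Keras InceptionV3 (global average pooling gives a 2048-D vector per frame); the face and audio extractors described above are internal models, so they appear only as stand-ins of the right shape:

```python
import numpy as np
import tensorflow as tf

# Image branch: InceptionV3 trained on ImageNet, pooled to 2048-D per frame.
image_model = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")

def image_features(frames_1fps):
    """frames_1fps: float array (num_seconds, 299, 299, 3), values in [0, 255]."""
    x = tf.keras.applications.inception_v3.preprocess_input(frames_1fps)
    return image_model.predict(x)                      # (num_seconds, 2048)

# Face (2048-D, two largest faces, FaceNet-style model) and audio (128-D
# VGG-style AudioSet embeddings) come from separate pretrained models;
# these placeholders only fix the shapes used by the baseline.
def face_features(num_seconds):
    return np.zeros((num_seconds, 2048), dtype=np.float32)

def audio_features(num_seconds):
    return np.zeros((num_seconds, 128), dtype=np.float32)
```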

5.2 Model Architecture

Temporal model. Each feature extracted above (image, face, audio) is fed into its own subnetwork consisting of GRUs. We use GRUs in order to take into account the temporal characteristics of video because viewer response often depends on previous scenes in the video. The features are extracted at 1 FPS and fed into their respective GRU with a sequence length of 60. For timestamps occurring before 60 seconds, we padded the input sequences by repeating the features from the first frame. The outputs of the final state from each GRU (image: 512-D, audio: 128-D, face: 256-D) are then concatenated. The fused 896-D vector is used to produce the multi-label predictions corresponding to the final timestamp in the input sequence.
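A minimal sketch of this temporal stage, assuming the feature dimensions given above, is shown below; the sequence-padding helper repeats the first frame when fewer than 60 seconds of context are available:

```python
import numpy as np
import tensorflow as tf

SEQ_LEN = 60  # seconds of context fed to each GRU

def build_temporal_model():
    """One GRU per modality; final states are concatenated into 896-D."""
    image_in = tf.keras.Input(shape=(SEQ_LEN, 2048), name="image")
    face_in = tf.keras.Input(shape=(SEQ_LEN, 2048), name="face")
    audio_in = tf.keras.Input(shape=(SEQ_LEN, 128), name="audio")
    image_h = tf.keras.layers.GRU(512)(image_in)   # 512-D final state
    face_h = tf.keras.layers.GRU(256)(face_in)     # 256-D final state
    audio_h = tf.keras.layers.GRU(128)(audio_in)   # 128-D final state
    fused = tf.keras.layers.Concatenate()([image_h, face_h, audio_h])  # 896-D
    return tf.keras.Model([image_in, face_in, audio_in], fused)

def pad_sequence(features, t):
    """Features for second t: the preceding SEQ_LEN seconds, padded by
    repeating the first frame when t occurs before 60 s."""
    window = features[max(0, t - SEQ_LEN + 1):t + 1]
    pad = np.repeat(window[:1], SEQ_LEN - len(window), axis=0)
    return np.concatenate([pad, window], axis=0)
```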

Classification model. We use context gating [37] to weight the input features before feeding the representation into an MOE model [24]. The output then goes through another context gating transformation to produce the final predictions. Context gating introduces gating vectors based on sigmoid activations that capture dependencies between input features (input gates) and between output labels (output gates). The MOE in our model combines a set of predictors weighted by a softmax; the predictors use a sigmoid as the non-linear activation. The final output context gate produces the predictions for each class.
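The sketch below illustrates input context gating, the mixture of experts, and output gating on top of the fused 896-D vector; the layer sizes and exact ordering are assumptions, and the dummy expert is handled implicitly by dropping its (zero) prediction:

```python
import tensorflow as tf

NUM_CLASSES = 5   # amusement, concentration, contentment, interest, surprise
NUM_MIXTURES = 2  # in addition to a dummy expert that always predicts 0

def context_gate(x):
    """Elementwise sigmoid gates over the input, as in Miech et al. [37]."""
    return x * tf.keras.layers.Dense(x.shape[-1], activation="sigmoid")(x)

def classification_head(fused):
    """fused: (batch, 896) tensor from the GRUs -> per-class probabilities."""
    x = context_gate(fused)                                       # input gates
    gate = tf.nn.softmax(tf.reshape(
        tf.keras.layers.Dense(NUM_CLASSES * (NUM_MIXTURES + 1))(x),
        [-1, NUM_CLASSES, NUM_MIXTURES + 1]), axis=-1)
    expert = tf.sigmoid(tf.reshape(
        tf.keras.layers.Dense(NUM_CLASSES * NUM_MIXTURES)(x),
        [-1, NUM_CLASSES, NUM_MIXTURES]))
    # The dummy expert predicts 0, so only the real mixtures contribute.
    probs = tf.reduce_sum(gate[..., :NUM_MIXTURES] * expert, axis=-1)
    return context_gate(probs)                                    # output gates

# Example: a batch of 4 fused vectors -> (4, 5) expression probabilities.
predictions = classification_head(tf.zeros([4, 896]))
```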

Figure 8: Baseline model architecture, based on [47].

Implementation Details. Our models are trained using the Adam optimizer [28] with a mini-batch size of 128 and a fixed learning rate. We used gradient clipping when training the network to mitigate potential exploding gradient problems. For the GRUs, we applied dropout [46] in each layer. For the context gating implementation, we applied batch normalization [22] before the non-linear layer. The models are implemented in TensorFlow and trained on TPUs.
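The sketch below shows how these choices could look in TensorFlow; the learning rate, dropout rate, and clipping threshold are placeholders (the exact values did not survive extraction of the paper), and the batch-normalized gate refines the plain context gate sketched earlier:

```python
import tensorflow as tf

# Placeholder hyperparameters (assumptions, not the paper's exact values).
LEARNING_RATE = 1e-4
DROPOUT_RATE = 0.3

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE, clipnorm=1.0)

# Dropout applied inside a GRU layer:
image_gru = tf.keras.layers.GRU(512, dropout=DROPOUT_RATE)

def context_gate_with_bn(x):
    """Context gate with batch normalization before the sigmoid non-linearity."""
    h = tf.keras.layers.Dense(x.shape[-1])(x)
    h = tf.keras.layers.BatchNormalization()(h)
    return x * tf.keras.activations.sigmoid(h)
```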

6 Experiments

To establish baseline performance, we apply the architecture in Figure 8 to the EEV dataset. Due to the challenges outlined in section 4, we want to ensure that models are able to learn information from the different modalities of the data and perform better than random. We also test transfer learning from EEV to the LIRIS-ACCEDE dataset.

6.1 Datasets

EEV dataset. For the EEV dataset, we train on the training split and present results on the validation set. We approach EEV as a multi-label classification problem and use mean average precision (MAP) [1] to measure model performance. The average precision (AP) approximates the area under the precision-recall curve and is averaged over the classes to obtain MAP. The loss function we use to train on this dataset is the sum of the cross-entropy loss over all classes.
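For reference, MAP over the five classes and the summed cross-entropy loss can be computed as in the sketch below (probability predictions and binary labels assumed):

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP for one class: precision averaged at the rank of each positive,
    approximating the area under the precision-recall curve."""
    order = np.argsort(-y_score)
    y_sorted = y_true[order]
    precision = np.cumsum(y_sorted) / np.arange(1, len(y_sorted) + 1)
    return precision[y_sorted == 1].mean() if y_sorted.sum() else 0.0

def mean_average_precision(y_true, y_score):
    """MAP over the expression classes; inputs have shape (N, num_classes)."""
    return float(np.mean([average_precision(y_true[:, c], y_score[:, c])
                          for c in range(y_true.shape[1])]))

def summed_cross_entropy(y_true, y_prob, eps=1e-7):
    """Training loss: binary cross entropy summed over all classes."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    bce = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return bce.sum(axis=-1).mean()
```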

LIRIS-ACCEDE dataset. This dataset is chosen because it is the largest dataset that focuses on a task similar to the EEV dataset: predicting the impact of movies on viewers. The LIRIS-ACCEDE dataset [4] is annotated each second with self-reported valence and arousal values, each of which is a continuous value. This is a regression problem, and we use the same metrics as the dataset competition [13] to measure performance: mean squared error (MSE) and Pearson’s correlation coefficient (PCC). We report the average MSE and PCC over frames.

We do not have access to the labels of the LIRIS-ACCEDE test set, so we report metrics on the validation set. The LIRIS-ACCEDE dataset is divided into three parts at download time. We use the first part (14 movies) as our validation set and the rest (40 movies) for training. The loss function we use to train on this dataset is the L2 loss.
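The evaluation metrics for this dataset reduce to a few lines (per-frame predictions and targets assumed):

```python
import numpy as np

def regression_metrics(pred, target):
    """Frame-level MSE and Pearson's correlation for valence or arousal."""
    mse = float(np.mean((pred - target) ** 2))
    pcc = float(np.corrcoef(pred, target)[0, 1])
    return mse, pcc
```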

Input Features Annotation Amuse. AP Concen. AP Content. AP Interest AP Surprise AP MAP
N/A Machine 6.92 16.48 8.95 44.68 0.08 15.42
Image+Face+Audio Machine 11.61 22.68 14.05 52.82 0.44 20.32
Image+Audio Machine 10.41 22.07 13.52 52.32 0.17 19.70
Image Machine 10.75 20.29 12.09 52.63 0.03 19.16
N/A Human 18.53 47.51 20.63 46.22 2.36 27.05
Image+Face+Audio Human 19.49 48.72 20.86 47.29 2.70 27.81
Image+Audio Human 20.51 49.65 20.15 46.04 2.38 27.75
Image Human 20.97 48.94 21.09 46.22 2.47 27.94
Table 3: Results of the baseline model on the entire EEV validation set. Rows with N/A input features show the performance of a random classifier; the remaining rows show the performance of the baseline model.

6.2 Results on EEV

We summarize the results on the EEV dataset in Table 3 for both human and machine annotations. The baseline model based on [47] performs better than a random classifier for every expression. For the machine annotations, using all modalities improves the MAP by about 5% over random. This result demonstrates that information for predicting affective responses can be learned from the EEV dataset, despite the challenges outlined in section 4. Our observations align with the statement from [44] that affective response is not arbitrary and that there is video content that evokes consistent viewer responses.

On human annotations, we see that the model is generally more accurate than random, but its performance is closer to random than on machine annotations. This is likely because the model is trained on automatic annotations, and there is a difference between the two, as shown in Table 1. Additionally, the automatic annotations measure facial muscle movements, while the human annotations conceptualize those movements in light of other cues [10]. However, it is encouraging that the baseline model still performs better than random on human annotations. This gap in performance on human annotations with current affective computing models suggests that EEV can be useful for testing future approaches that model noisy or missing data.

For performance across expressions, the baseline model improves AP over random by 4% to 8% for every expression except surprise on automatic annotations. The relatively poor performance on surprise is likely due to it being a rare class. This result further highlights the gap between current and ideal performance; approaches that can better handle imbalanced data could be applied to improve it. For the human-annotated subset, amusement and concentration show a greater improvement over random than the other expressions. From Table 1, we see that these are also the expressions with the highest precision.

Feature ablation. We use the same architecture described in Figure 8 for our ablation study and remove subsets of the input features. Our results are in Table 3. Since image+audio features are a popular combination for video classification, we first remove the face features during training and validation. On the machine annotations, the MAP decreases by about 0.6%, with the biggest decrease in class AP on amusement. We also remove the audio features to see how well we can perform using image features alone. This decreases the MAP by about 0.5%, with the biggest decreases in class AP on concentration and contentment. A similar decrease in performance is not observed on the human-annotated subset: there is no clear trend with feature ablation, likely because the baseline model performs close to random on human annotations, so the relative differences between feature sets are less clear.

The results in Table 3 demonstrate that EEV is a challenging dataset, and we have outlined future directions to explore for novel models to improve baseline performance.

6.3 Results on LIRIS-ACCEDE

Model V-MSE A-MSE V-PCC A-PCC
LIRIS only 0.091 0.104 0.216 0.223
EEV transfer 0.063 0.071 0.261 0.242
Table 4: Results of the baseline model on the LIRIS-ACCEDE validation set, using all input features. The mean squared error (MSE) and Pearson’s correlation coefficient (PCC) metrics prefixed with “V” correspond to valence, while those prefixed with “A” correspond to arousal. We show results without (first row) and with (second row) transfer learning from EEV.

The results on the LIRIS-ACCEDE dataset are in Table 4. We investigate the impact of transfer learning using the EEV dataset. We experiment with training from scratch and fine-tuning from a checkpoint pre-trained on EEV.

Transfer Learning. The transfer learning model has lower MSE and higher PCC than the model trained on LIRIS only. The improvements in the valence metrics are larger than those in arousal. From Figure 9, we see that the transfer learning model better captures the average valence and arousal, resulting in lower MSE. However, there is still a gap in learning sudden changes in valence and arousal, as both models miss the spikes in self-reported affect. Better modeling of short-term events could help, since we currently use a sequence length of 60 s in our GRU. We note that the predictions are able to follow the longer-term trend of valence and arousal (positive correlation).

Figure 9: Result of predictions on MEDIAEVAL18_02 from LIRIS dataset, a drama on relationships. The V-MSE, A-MSE, V-PCC, A-PCC for LIRIS only is: 0.08, 0.03, 0.31, 0.45, and for transfer learning is: 0.06, 0.02, 0.23, 0.52.

Overall, the model pre-trained on the EEV dataset achieves better performance on LIRIS-ACCEDE. This shows the potential of the EEV dataset for transfer learning.

7 Conclusion

We introduce the EEV dataset, a large-scale video dataset for studying evoked viewer facial expressions. The EEV dataset is larger and more diverse than previous affective video datasets, which we achieved by combining machine and human annotations, and we verified the labels on natural facial expressions. While the EEV training set is fully automatically annotated, the validation and test sets have both human and machine annotations; this property makes EEV a good dataset to test models for noisy and incomplete data. Baseline performance on the EEV dataset shows that while affective information can be learned from video content, this remains a largely unsolved problem. New approaches in affective computing can be developed and tested on EEV, helping to lessen the gap between affective and semantic content datasets. The EEV dataset has the potential to be used for studying affective stimuli in videos, modeling incomplete data, and investigating facial expression changes over time. We hope that the EEV dataset will be useful for developing novel models for affective computing and video analysis.

8 Acknowledgements

We are grateful to the Computational Vision Lab at Caltech for making this collaboration possible. We would like to thank Marco Andreetto and Brendan Jou from Google Research for their support as well.

References

  • [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan (2016) YouTube-8m: A large-scale video classification benchmark. CoRR abs/1609.08675. External Links: Link, 1609.08675 Cited by: §1, §2.3, §3.1, §4.1, §5.1, §5, §6.1.
  • [2] E. Acar, F. Hopfgartner, and S. Albayrak (2017) A comprehensive study on mid-level representation and ensemble learning for emotional analysis of video material. Multimedia Tools and Applications 76 (9), pp. 11809–11837. Cited by: §2.1.
  • [3] Y. Baveye, C. Chamaret, E. Dellandréa, and L. Chen (2018) Affective video content analysis: a multidisciplinary insight. IEEE Transactions on Affective Computing 9 (4), pp. 396–409. Cited by: §2.1, §2.1, §2.3.
  • [4] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen (2015) Deep learning vs. kernel methods: performance for emotion prediction in videos. In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 77–83. Cited by: §2.1, §2.3, §5, §6.1.
  • [5] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen (2015) Liris-accede: a video database for affective content analysis. IEEE Transactions on Affective Computing 6 (1), pp. 43–55. Cited by: §1, §2.1, §2.1, §2.3, §5.
  • [6] H. Becker, J. Fleureau, P. Guillotel, F. Wendling, I. Merlet, and L. Albera (2017) Emotion recognition based on high-resolution eeg recordings and reconstructed brain sources. IEEE Transactions on Affective Computing. Cited by: §2.1.
  • [7] W. Bi and J. T. Kwok (2014) Multilabel classification with label correlations and missing labels. In Twenty-Eighth AAAI Conference on Artificial Intelligence. Cited by: §3.1.
  • [8] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §1, §2.3.
  • [9] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §5.
  • [10] A. S. Cowen and D. Keltner (2019) What the face displays: mapping 28 emotions conveyed by naturalistic expression.. American Psychologist. Cited by: §2.2, Figure 5, §3.1, §3.1, §3.1, §3, §4, §6.2.
  • [11] A. S. Cowen and D. Keltner (2017) Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proceedings of the National Academy of Sciences 114 (38), pp. E7900–E7909. External Links: Document, ISSN 0027-8424, Link, https://www.pnas.org/content/114/38/E7900.full.pdf Cited by: §2.1, §3.
  • [12] (2015) Crazy and funny cats - creative commons.. Note: https://www.youtube.com/watch?v=yINjxluMrUA Cited by: Figure 1, Figure 4.
  • [13] E. Dellandréa, L. Chen, Y. Baveye, M. V. Sjöberg, C. Chamaret, et al. (2018) The mediaeval 2018 emotional impact of movies task. In MediaEval 2018 Multimedia Benchmark Workshop Working Notes Proceedings of the MediaEval 2018 Workshop, Cited by: §2.1, §2.3, §5, §6.1.
  • [14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §5.1.
  • [15] Z. Deng, R. Navarathna, P. Carr, S. Mandt, Y. Yue, I. Matthews, and G. Mori (2017) Factorized variational autoencoders for modeling audience reactions to movies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2577–2586. Cited by: §2.1.
  • [16] E. Douglas-Cowie, R. Cowie, I. Sneddon, C. Cox, O. Lowry, M. Mcrorie, J. Martin, L. Devillers, S. Abrilian, A. Batliner, et al. (2007) The humaine database: addressing the collection and annotation of naturalistic and induced emotional data. In International conference on affective computing and intelligent interaction, pp. 488–500. Cited by: §2.3.
  • [17] P. Ekman and W. V. Friesen (1978) Manual for the facial action coding system. Consulting Psychologists Press. Cited by: §2.2.
  • [18] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. Cited by: §5.1.
  • [19] (2013) Google i/o 2013 - semantic video annotations in the youtube topics api: theory and applications.. Note: https://www.youtube.com/watch?v=wf_77z1H-vQ Cited by: §4.1.
  • [20] A. Hanjalic and L. Xu (2005) Affective video content representation and modeling. IEEE transactions on multimedia 7 (1), pp. 143–154. Cited by: §2.1.
  • [21] M. Horvat, S. Popović, and K. Cosić (2013) Multimedia stimuli databases usage patterns: a survey report. In 2013 36th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 993–997. Cited by: §1, §2.1.
  • [22] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §5.2.
  • [23] Y. Jiang, B. Xu, and X. Xue (2014) Predicting emotions in user-generated videos. In Twenty-Eighth AAAI Conference on Artificial Intelligence, Cited by: §2.1, §2.3.
  • [24] M. I. Jordan and R. A. Jacobs (1994) Hierarchical mixtures of experts and the em algorithm. Neural computation 6 (2), pp. 181–214. Cited by: §5.2.
  • [25] S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, Ç. Gülçehre, R. Memisevic, P. Vincent, A. Courville, Y. Bengio, R. C. Ferrari, et al. (2013) Combining modality specific deep neural networks for emotion recognition in video. In Proceedings of the 15th ACM on International conference on multimodal interaction, pp. 543–550. Cited by: §2.1.
  • [26] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §1, §2.3.
  • [27] K. Kassam (2011) Assessment of emotional experience through facial expression. Vol. 71 (eng). External Links: ISSN 0419-4217, ISBN 9781124079752, Link Cited by: §2.3.
  • [28] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.2.
  • [29] (2015) Knowledge graph search api.. Note: https://developers.google.com/knowledge-graph/ Cited by: §4.1.
  • [30] S. Koelstra, C. Muhl, M. Soleymani, J. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras (2012) Deap: a database for emotion analysis; using physiological signals. IEEE transactions on affective computing 3 (1), pp. 18–31. Cited by: §1, §2.1, §2.3.
  • [31] Y. Ma, X. Liang, and M. Xu (2018) THUHCSI in mediaeval 2018 emotional impact of movies task. Cited by: §2.1.
  • [32] B. Martinez, M. F. Valstar, B. Jiang, and M. Pantic (2017) Automatic analysis of facial actions: a survey. IEEE Transactions on Affective Computing. Cited by: §2.2.
  • [33] D. McDuff, R. El Kaliouby, J. F. Cohn, and R. W. Picard (2015) Predicting ad liking and purchase intent: large-scale analysis of facial responses to ads. IEEE Transactions on Affective Computing 6 (3), pp. 223–235. Cited by: §2.1.
  • [34] D. McDuff, R. El Kaliouby, D. Demirdjian, and R. Picard (2013) Predicting online media effectiveness based on smile responses gathered over the internet. In 2013 10th IEEE international conference and workshops on automatic face and gesture recognition (FG), pp. 1–7. Cited by: §2.1.
  • [35] D. McDuff, R. Kaliouby, T. Senechal, M. Amr, J. Cohn, and R. Picard (2013) Affectiva-mit facial expression dataset (am-fed): naturalistic and spontaneous facial expressions collected. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 881–888. Cited by: §2.1.
  • [36] D. McDuff and M. Soleymani (2017) Large-scale affective content analysis: combining media content features and facial reactions. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 339–345. Cited by: §2.1.
  • [37] A. Miech, I. Laptev, and J. Sivic (2017) Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905. Cited by: §5.2, §5.
  • [38] M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal, Y. Yan, L. Brown, Q. Fan, D. Gutfreund, C. Vondrick, et al. (2019) Moments in time dataset: one million videos for event understanding. IEEE transactions on pattern analysis and machine intelligence. Cited by: §1.
  • [39] S. Poria, E. Cambria, A. Hussain, and G. Huang (2015) Towards an intelligent framework for multimodal affective data analysis. Neural Networks 63, pp. 104–116. Cited by: §2.1.
  • [40] A. Schaefer, F. Nils, X. Sanchez, and P. Philippot (2010) Assessing the effectiveness of a large database of emotion-eliciting films: a new tool for emotion researchers. Cognition and Emotion 24 (7), pp. 1153–1172. Cited by: §2.3.
  • [41] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: a unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §3.1, §5.1.
  • [42] M. Soleymani, S. Asghari-Esfeden, Y. Fu, and M. Pantic (2016) Analysis of eeg signals and facial expressions for continuous emotion detection. IEEE Transactions on Affective Computing 7 (1), pp. 17–28. Cited by: §2.1, §2.1.
  • [43] M. Soleymani, J. J. Kierkels, G. Chanel, and T. Pun (2009) A bayesian framework for video affective representation. In 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, pp. 1–7. Cited by: §2.1.
  • [44] M. Soleymani, M. Larson, T. Pun, and A. Hanjalic (2014) Corpus development for affective video indexing. IEEE Transactions on Multimedia 16 (4), pp. 1075–1089. Cited by: §1, §1, §2.3, §6.2.
  • [45] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic (2012) A multimodal database for affect recognition and implicit tagging. IEEE Transactions on Affective Computing 3 (1), pp. 42–55. Cited by: §1, §2.1.
  • [46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §5.2.
  • [47] J. J. Sun, T. Liu, and G. Prasad (2018) GLA in mediaeval 2018 emotional impact of movies task. Cited by: 3rd item, §2.1, Figure 8, §5, §6.2.
  • [48] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §5.1.
  • [49] R. Vemulapalli and A. Agarwala (2019-06) A compact embedding for facial expression similarity. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1, §3.1.
  • [50] H. L. Wang and L. Cheong (2006) Affective understanding in film. IEEE Transactions on circuits and systems for video technology 16 (6), pp. 689–704. Cited by: §2.1, §5.1.
  • [51] S. Wang and Q. Ji (2015) Video affective content analysis: a survey of state-of-the-art methods. IEEE Transactions on Affective Computing 6 (4), pp. 410–430. Cited by: §2.1, §2.1, §2.3.
  • [52] H. Yu, P. Jain, P. Kar, and I. Dhillon (2014) Large-scale multi-label learning with missing labels. In International conference on machine learning, pp. 593–601. Cited by: §3.1.
  • [53] S. Zhang, Q. Tian, Q. Huang, W. Gao, and S. Li (2009) Utilizing affective analysis for efficient movie browsing. In 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 1853–1856. Cited by: §2.1.
  • [54] S. Zhao, H. Yao, and X. Sun (2013) Video classification and recommendation based on affective analysis of viewers. Neurocomputing 119, pp. 101–110. Cited by: §2.1.
  • [55] P. Zhu, Q. Xu, Q. Hu, C. Zhang, and H. Zhao (2018) Multi-label feature selection with missing labels. Pattern Recognition 74, pp. 488–502. Cited by: §3.1.
  • [56] A. Zlatintsi, P. Koutras, G. Evangelopoulos, N. Malandrakis, N. Efthymiou, K. Pastra, A. Potamianos, and P. Maragos (2017) COGNIMUSE: a multimodal video database annotated with saliency, events, semantics and emotion with application to summarization. EURASIP Journal on Image and Video Processing 2017 (1), pp. 54. Cited by: §2.1, §2.3.