The rapid development of mobile devices has led to an explosive growth of user-generated images and videos, which creates a demand for computational understanding of visual media content. In addition to recognition of objective content, such as objects and scenes, an important dimension of video understanding is emotional or affective content. Such content can strongly resonate with viewers and plays a crucial role in the video-watching experience. Some successes have been achieved with deep-learning architectures trained for text sentiment analysis, at both the sentence and document level, or for image sentiment analysis. However, the ability to understand emotions from video remains, to a large extent, an unsolved problem.
The understanding of emotion information in videos has many real-world applications. Video recommendation services, such as those employed by YouTube and Netflix, can benefit from matching user interests with the emotions of video content and prediction of interestingness [35, 24, 23], leading to improved user satisfaction. Better understanding of video emotions may enable the placement of advertisements that are consistent with the main video’s mood and help avoid social inappropriateness such as placing a funny advertisement alongside a funeral video. Video summarization can also benefit from understanding emotions, since a good summary should probably include strong emotional content from the original video.
This paper tackles three inter-related problems in video emotion understanding. We start with emotion recognition in both supervised and zero-shot conditions. Zero-shot video emotion recognition aims to recognize emotion classes that are not seen during training. This task is motivated by recent cognitive theories [3, 44, 7, 4] that suggest human emotional experiences extend beyond the traditional “basic emotion” categories (e.g. Ekman’s six emotions). Emotion processes and other cognitive processes in our brain cooperate closely to create rich and diverse emotional and affective experiences [43, 52, 27], such as ecstasy, nostalgia, or suspense. Therefore, when operating in the real world, recognition systems trained with a small set of emotion labels will inevitably encounter emotional and affective expressions not in their training sets. In zero-shot learning, we rely on heterogeneous sources of background knowledge to alleviate reliance on training data and improve generalization to previously unseen labels. Since our goal is to recognize both the emotional and affective impact of video from a computer vision perspective, we use “emotion” as a shorthand to refer to both emotion and affect, the boundary between which can be blurry (see, for example, the argument around surprise).
We then define a novel task called video emotion attribution, which aims to identify the contribution of each frame in a video to its overall emotion. Not every frame contributes emotional information. This realization further enables the third task – emotion-oriented video summarization. In the emotion-oriented video summarization, we summarize a video and at the same time preserve emotional content as much as possible.
Our knowledge transfer framework performs two types of knowledge transfer. In supervised emotion recognition, we learn an effective video encoding from a large-scale auxiliary emotional image dataset. This auxiliary Image Transfer Encoding (ITE) process can generate a video representation more conducive to emotion recognition than alternative methods. In zero-shot emotion recognition, where the emotion to be recognized is not in the training set, we learn embeddings for emotional words from a large-scale text corpus, so that an unknown emotion word can be semantically related to labeled or known emotions and subsequently recognized. The knowledge transfer technique further facilitates two novel applications: emotion attribution and emotion-oriented summarization. Video sub-shots more related to the overall emotion are identified and emphasized in the automatically generated summary.
Contributions: We introduce a framework for transferring knowledge from heterogeneous sources (image and text) for the understanding of emotions in videos. For the first time, we demonstrate zero-shot emotion recognition by transferring knowledge learned from text sources to the video domain. We also propose the first definitions and solutions for the problems of video emotion attribution and emotion-oriented summarization. We show that our emotion-oriented summaries are better than those of alternative methods that do not consider emotion. Finally, we introduce and will make available to the community two new emotion-centric video datasets: the Plutchik and Ekman emotion datasets.
2 Related work
2.1 Psychological Theories of Emotion
It is a widely held view in psychology that emotion contains a number of static categories, each of which is associated with stereotypical facial expressions, physiological measurements, behaviors, and external causes [15, 17]. The most well known model is probably Ekman’s six pan-cultural basic emotions, including happiness, sadness, disgust, anger, fear, and surprise [16, 17]. However, the exact categories can vary from one model to another. Plutchik  added anticipation and trust to the list. Ortony, Clore and Collins’s  model of emotion defined up to 22 emotions, including categories like hope, shame, and gratitude.
Nevertheless, more recent empirical findings and theories from the psychological construction approach [3, 44] suggest emotional experiences are much more varied than previously assumed. It is argued that the categories are modal or stereotypical emotions, and large fuzzy areas exist in the emotion landscape. Instead of associating a fixed set of facial expressions with each emotion, emotion recognition is influenced by visual and language context [7, 4]. The facial expression of smile, for example, can indicate happiness, embarrassment, or being subordinate in different contexts . See Fig. 1 for a real-world example.
Other theories [43, 52, 27] highlight the dynamics of emotion and the interactions between emotional processes and other cognitive processes. Together, the complex dynamics and interactions lead to a rich set of emotional and affective experiences, and correspondingly rich natural language descriptions, such as ecstasy, nostalgia, or suspense. In order to cope with diverse emotional descriptions that may be practically difficult (or at least very costly) to label, in this paper we investigate emotion recognition in a zero-shot setting (in addition to a traditional supervised setting). Our emotion recognizer is tested against emotional classes that do not appear in the training set. The zero-shot recognition task is designed to test the system’s ability to make use of knowledge learned from heterogeneous sources in order to adapt to unseen tags.
2.2 Automatic Emotion Analysis
In this section, we briefly review three relevant areas of research: recognition of facial expressions from images and videos, recognition of emotional impact of images on viewers, and recognition of emotional impact from videos.
Recognition of facial expressions from images and videos. In the light of findings that humans rely on contextual information to recognize emotion [7, 4], in this paper, we aim to recognize the overall emotional impact of a video from all of its information. This contrasts with the identification of facial expressions and the associated emotion in static images and video, which has been a subject extensively studied. Two recent reviews of the topic can be found in  and . Several competitions, such as the Facial Expression Recognition and Analysis Challenge , the Audio/Visual Emotion Challenge , and the Emotion Recognition In The Wild Challenge , have been held. Notably, Liu et al. construct a mid-level representation called expressionlet from spatio-temporal manifold. Cruz et al. proposed a dynamic downsampling of facial expressions in video at a rate proportional to the rate of temporal changes of visual information. When apex labels are provided, videos are downsampled around the apex of an expression.
A few recent works focused on predicting emotions, affects, and emotion-related cognitive states outside the basic emotion categories. Kaliouby and Robinson recognized emotion-related mental states, such as agreeing or feeling unsure. Bosch et al. detected learning-related affects including boredom, confusion, delight, engagement, and frustration; other works recognized smirks and fatigue.
Recognizing the emotional impact of still images on viewers. Machajdik and Hanbury classified images into 8 affective categories: amusement, awe, contentment, excitement, anger, disgust, fear, and sadness. In addition to color, texture, and statistics about faces and skin area present in the image, they also made use of composition features such as the rule of thirds and depth of field. Lu et al. studied shape features along the dimensions of rounded-angular and simple-complex, and their effects in arousing viewers’ emotions. You et al. designed a deep convolutional neural network (CNN) for visual sentiment analysis. After training on the entire training set, images on which the CNN performs poorly are stochastically removed. The remaining images are used to fine-tune the network. A few works [8, 74] also employed off-the-shelf CNN features.
Recognizing emotional impact from videos. For a more comprehensive review, we refer the reader to the latest survey. A large number of early works studied emotion in movies (e.g. [37, 68, 30]). Wang and Cheong used an SVM with diverse audio-visual features to classify 2040 scenes in 36 Hollywood movies into 7 emotions. Jou et al. worked on animated GIF files. Irie et al. used Latent Dirichlet Allocation to extract audio-visual topics as mid-level features, which are combined with a Hidden-Markov-like dynamic model.
SentiBank  contains a set of adjective-noun pairs, such as “beautiful flowers” and “sad eyes”, and images exemplifying each pair. One linear SVM detector was trained for each pair. The best-performing detectors provide a mid-level representation for emotion recognition. Chen et al.  replaced the SVM detectors with deep convolutional neural networks. Jiang et al.  explored a large set of features and confirmed the effectiveness of mid-level representations like SentiBank. In this work, we transfer knowledge learned from the same set of Flickr images for the purpose of video emotion analysis.
The implicit approach for recognizing the emotional impact of a video is to recognize emotions exhibited by viewers of that video. This clever trick delegates the complex task of video understanding to human viewers, thereby simplifying the problem. McDuff et al.  analyzed facial expressions exhibited by viewers of video advertisements recorded with webcams. Histogram of Oriented Gradient (HOG) features were extracted based on key points on the faces. Purchase intent is predicted based on the entire emotion trajectory over time. Kapoor et al. used video, skin conductance, and pressure sensors on the chair and the mouse to predict frustration when a user interacted with an intelligent tutoring system. However, the success of this approach depends on the availability of a large number of human participants.
Nevertheless, all previous works are limited in that they aim to predict only emotion or sentiment classes present in the training set. In this work, we utilize knowledge from other domains, such as images and text, in order to identify emotion classes unseen in the training set. In addition, we investigate related practical applications such as video emotion attribution and emotion-oriented summarization.
2.3 Multi-Instance Learning
The knowledge transfer approach adopted in this work is related to multi-instance learning (MIL), which has been extensively studied in the machine learning community and utilized in other domains. We therefore briefly review related techniques in the following. MIL refers to recognition problems where each label is associated with a bag of instances, such as a bag of video frames. It has been used in many problems, such as drug activity prediction, speech recognition, image retrieval and classification. The problem investigated in this work is intrinsically a multi-instance learning case as each video consists of many frame instances with possibly different emotions.
There are basically two branches of MIL algorithms. In the first branch, many works attempted to make single-instance supervised learning algorithms directly applicable to multi-instance feature bags. This branch includes most of the early works on MIL, such as miSVM, MIBoosting, Citation-kNN, and MI-Kernel, among others. These algorithms achieved satisfactory accuracies in several applications, but most of them can only handle small or moderate-sized data. In other words, they are computationally expensive and cannot be applied to large-scale video data.
The second branch of works is more recent; here researchers tried to solve MIL problems by adapting multi-instance bags to specific points in the original instance space. Popular algorithms include constructive clustering based ensemble (CCE), multi-instance learning based on the Fisher Vector representation (Mi-FV), and multi-instance learning via embedded instance selection. Inspired by these works, we encode the video frame bags into single-instance representations of the bag-level information. Our approach differs from this category of MIL algorithms in two respects: (1) our emotion recognition task is a multi-class multi-instance problem, while most of the previous MIL algorithms aimed at binary classification; (2) we perform the encoding process by using auxiliary data such as images, and demonstrate that transferring such knowledge is important for video emotion analysis.
2.4 Video Summarization
Video summarization has been studied for more than two decades. A complete review is beyond the scope of this paper and we refer readers to .
There are two main types of video summaries: key-frames [45, 29, 11, 19] and video skims [55, 70, 72, 20]. Video summarization has been explored for various types of content, including professional videos (e.g., movies or news reports) [55, 70, 72], surveillance videos [20, 19], and, to a lesser extent, user-generated videos. To extract the video summary, most approaches rely either on low-level information, such as visual saliency and motion cues, or on mid-level information, e.g. object trajectories, tag localization and semantic recognition. Facial expressions have been considered by Dhall and Roland, who extracted video summaries by considering smiling/happy facial expressions. However, none of these approaches have considered video summarization based on more general video emotion content. Such video emotion is an important cue for finding the most “interesting” or “important” video highlights. For example, a good summary of a birthday party, or a graduation ceremony, should capture the most emotional moments in the event. Not considering the valuable emotion dimension in the video summarization task risks losing these precious moments in the summary.
The overview of the proposed framework is illustrated in Figure 2. We start by introducing the problem formulation and common notation, and then discuss the auxiliary image transfer encoding for supervised recognition, auxiliary text-based transfer encoding for zero-shot recognition, and video emotion attribution and summarization.
3.1 Problem Setup
Suppose we have a training video dataset
$$\mathcal{D}_{tr} = \{(V_i, \mathbf{X}_i, \boldsymbol{\phi}_i, z_i)\}_{i=1}^{n_{tr}},$$
where the video $V_i$ has $n_i$ frames with the features $\mathbf{X}_i = \{\mathbf{x}_{i,j}\}_{j=1}^{n_i}$, where the subscripts denote the $j$-th frame of video $i$.
The feature $\mathbf{x}_{i,j}$ is extracted by a state-of-the-art deep Convolutional Neural Network (CNN) architecture, which was recently shown to greatly outperform more traditional hand-crafted low-level features, such as HOG and SIFT, on several benchmark datasets including MNIST and ImageNet in the machine learning and computer vision communities. Specifically, we retrain AlexNet with ImageNet classes and use the seventh layer (“fc7”) representation as $\mathbf{x}_{i,j}$, computed on the input frame $f_{i,j}$.
We use $\boldsymbol{\phi}_i$ to denote the encoded video-level feature of video $V_i$ obtained using auxiliary image transfer encoding, which is introduced in Section 3.2; $z_i \in \mathcal{Z}_{tr}$ is the class label of video $V_i$ from the set of training labels $\mathcal{Z}_{tr}$, and $n_{tr}$ is the total number of training videos. The testing data set is similarly defined as
$$\mathcal{D}_{te} = \{(V_i, \mathbf{X}_i, \boldsymbol{\phi}_i, z_i)\}_{i=1}^{n_{te}},$$
where $n_{te}$ is the total number of testing videos and the testing labels come from the set $\mathcal{Z}_{te}$. Under the zero-shot learning setting, the labels in the training and testing sets have different domains: $\mathcal{Z}_{tr} \neq \mathcal{Z}_{te}$ and $\mathcal{Z}_{tr} \cap \mathcal{Z}_{te} = \emptyset$, denoting that the testing set contains classes previously unseen.
To enable knowledge transfer, we introduce a large-scale auxiliary image set and a text sentiment dataset. We denote the auxiliary image sentiment dataset as $\mathcal{A} = \{\mathbf{y}_i\}_{i=1}^{n_a}$, where $\mathbf{y}_i$ is the deep CNN feature of an image, extracted with the same trained model as above.
The textual data are represented as a sequence of words $(w_1, \dots, w_T)$ with $w_t \in \mathcal{V}$, where the vocabulary $\mathcal{V}$ is the set of unique words. We learn an $m$-dimensional embedding $\mathbf{u}_w$ for each $w \in \mathcal{V}$.
3.2 Auxiliary Image Transfer Encoding (ITE) for Supervised Emotion Recognition
In multi-instance learning, the natural way of encoding multi-instance bags into single-instance representations is to cluster the instances of all the bags to several groups. Examples of this include CCE  and Mi-FV . Such clustering enables the re-representations of each bag as the new Bag-of-Words (BoW) features. However, this comes with three challenges. First, the feature set is learned for the tasks of image classification, rather than for video emotion recognition. Second, the large and complex space of video visual elements (such as objects, events, and people) and their possible long-term temporal progression and interaction makes the domain intrinsically more complex than the previous image sentiment and MIL works. Third, the emotion is often expressed on certain limited (sparse) keyframes or video clips, and many video frames may not be directly related to the expressed emotion.
To address these challenges, we utilize emotion information from a large-scale emotional image dataset to encode the video content using a BoW representation. This can be intuitively explained from the perspective of entropy. A dictionary built from the auxiliary emotion-related images can efficiently encode a video frame with emotion information as a sparse vector which concentrates on a few dimensions. In comparison, a frame without emotion information will likely be encoded less efficiently, producing a denser vector with small values in many dimensions. As a result, a non-emotional frame will have higher entropy than the emotional frame, and hence less impact on the resulted BoW representation.
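The entropy intuition above can be made concrete with a toy calculation. The sketch below is illustrative only: the two 8-dimensional encodings are made-up values, and `normalized_entropy` is a hypothetical helper rather than part of the proposed method.

```python
import math


def normalized_entropy(vec):
    """Shannon entropy (in bits) of a non-negative vector after L1 normalization."""
    total = sum(vec)
    probs = [v / total for v in vec if v > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical encodings over 8 dictionary entries (assumed values).
# An "emotional" frame concentrates its mass on a few entries...
emotional_frame = [0.0, 0.9, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0]
# ...while a "non-emotional" frame spreads small values over many entries.
neutral_frame = [0.125] * 8

sparse_entropy = normalized_entropy(emotional_frame)
dense_entropy = normalized_entropy(neutral_frame)
print(sparse_entropy < dense_entropy)
```

Under these assumed values the concentrated encoding has much lower entropy than the uniform one, which is the sense in which a non-emotional frame is said to carry less weight in the final representation.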
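More formally, we first find $D$ clusters from the auxiliary images and encode the video frames based on the cluster centers. Specifically, we perform a spherical k-means clustering on the auxiliary image features $\mathbf{y}_i$, which is to solve:
$$\min_{\{r_{i,d}\},\{\mathbf{c}_d\}} \sum_{i}\sum_{d=1}^{D} r_{i,d}\,\bigl(1 - \cos(\mathbf{y}_i, \mathbf{c}_d)\bigr). \tag{1}$$
The goal is to find $D$ spherical cluster centers $\mathbf{c}_1, \dots, \mathbf{c}_D$. The cosine similarity is defined as
$$\cos(\mathbf{y}, \mathbf{c}) = \frac{\mathbf{y}^{\top}\mathbf{c}}{\|\mathbf{y}\|\,\|\mathbf{c}\|}. \tag{2}$$
The variable $r_{i,d}$ assigns an image $\mathbf{y}_i$ to the closest cluster center $\mathbf{c}_d$, and is defined as
$$r_{i,d} = \begin{cases} 1 & \text{if } d = \arg\max_{d'} \cos(\mathbf{y}_i, \mathbf{c}_{d'}), \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$
To encode one video bag into its corresponding single-instance representation, a BoW scheme is used to translate the frame feature set $\mathbf{X}_i$ into a $D$-dimensional vector $\boldsymbol{\phi}_i$. Specifically, to encode the feature vector of a video $V_i$, we fix the cluster centers found by the k-means and identify the $K$ nearest cluster centers for each frame feature $\mathbf{x}_{i,j}$. We thus get the assignment of each frame to each cluster, $a_{i,j,d}$:
$$a_{i,j,d} = \begin{cases} 1 & \text{if } \mathbf{c}_d \in \mathcal{N}_K(\mathbf{x}_{i,j}), \\ 0 & \text{otherwise,} \end{cases} \tag{4}$$
where $\mathcal{N}_K(\mathbf{x}_{i,j})$ denotes the spherical $K$ nearest neighbours to $\mathbf{x}_{i,j}$ from the cluster centers $\{\mathbf{c}_d\}_{d=1}^{D}$. (Footnote 1: Generally, we require that $K > 1$, since videos can express much more ‘versatile’ emotions than those of images, e.g. Figure 3.) We then accumulate the effects of each frame on the $d$-th dimension to compute the feature vector $\boldsymbol{\phi}_i$:
$$\phi_{i,d} = \sum_{j=1}^{n_i} a_{i,j,d}, \quad d = 1, \dots, D. \tag{5}$$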
Our encoding scheme in Eq (5) is different from the standard BoW  and soft-weighting BoW encoding . First, the traditional BoW encodes local descriptors, such as SIFT and STIP, which requires a dictionary orders of magnitude greater than our frame set. Thus directly using standard BoW  to our problem will make the generated video-level features too sparse to be discriminative. Second, the soft-weighting encoding assigns one visual feature point to multiple clusters using typically exponential or Gaussian kernel, to downweight the contribution to the clusters that are not closest to the feature. However, in our problem one video frame can express multiple emotions simultaneously. For example, Figure 3 shows an example of both grief and happiness. Thus in Eq (5), one feature instance can equally contribute to different encoding bins. In other words, we make use of a uniform kernel instead.
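The clustering and encoding steps of this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`spherical_kmeans`, `ite_encode`), the naive initialization from the first points, the fixed iteration count, and the toy 2-D "features" are all hypothetical, and the uniform accumulation follows the uniform-kernel description in the text.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def spherical_kmeans(images, num_clusters, iters=10):
    """Cluster unit-normalized features by cosine similarity."""
    points = [normalize(p) for p in images]
    centers = points[:num_clusters]  # assumption: naive init from the first points
    for _ in range(iters):
        # Assign each point to the center with the highest cosine similarity.
        assign = [max(range(num_clusters), key=lambda d: cosine(p, centers[d]))
                  for p in points]
        new_centers = []
        for d in range(num_clusters):
            members = [p for p, a in zip(points, assign) if a == d]
            if members:
                # New center: re-normalized sum of the assigned points.
                new_centers.append(normalize([sum(col) for col in zip(*members)]))
            else:
                new_centers.append(centers[d])  # keep an empty cluster's center
        centers = new_centers
    return centers


def ite_encode(frames, centers, k_nearest):
    """Each frame adds an equal vote to its k nearest centers (uniform kernel)."""
    hist = [0.0] * len(centers)
    for frame in frames:
        ranked = sorted(range(len(centers)),
                        key=lambda d: cosine(frame, centers[d]), reverse=True)
        for d in ranked[:k_nearest]:
            hist[d] += 1.0
    return hist


# Toy data (assumed): four auxiliary image features and three video frames.
aux_images = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
centers = spherical_kmeans(aux_images, num_clusters=2)
video_frames = [[1.0, 0.2], [0.8, 0.1], [0.2, 1.0]]
encoding = ite_encode(video_frames, centers, k_nearest=1)
print(encoding)
```

The key design point illustrated here is that the dictionary is built from the auxiliary images only, while the video frames are merely projected onto it.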
The encoding scheme from the frame-level deep features to the video-level emotional representation helps the standard video emotion recognition tasks. Given a test video $V_*$ with encoded video-level feature $\boldsymbol{\phi}_*$, its class label can be estimated as
$$\hat{z}_* = f(\boldsymbol{\phi}_*), \tag{6}$$
where $f$ is the predictor trained from the video-level feature set $\{\boldsymbol{\phi}_i\}_{i=1}^{n_{tr}}$ of the training video set $\mathcal{D}_{tr}$. In this paper, we use the support vector machine (SVM) classifier with chi-square kernel as the predictor. The procedure of the ITE based supervised emotion recognition is summarized in Algorithm 1.
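As a rough illustration of the kernel used by the predictor, the sketch below computes a chi-square kernel between non-negative video-level histograms. The exact kernel variant is not specified in the text; the exponential form with a `gamma` parameter used here is an assumption, and the function names and feature values are hypothetical.

```python
import math


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel for non-negative histograms (assumed form)."""
    dist = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # skip bins that are empty in both histograms
            dist += (a - b) ** 2 / (a + b)
    return math.exp(-gamma * dist)


def gram_matrix(features, gamma=1.0):
    """Pairwise kernel values between all encoded videos."""
    return [[chi2_kernel(p, q, gamma) for q in features] for p in features]


# Toy video-level encodings (assumed values).
video_features = [[2.0, 1.0, 0.0], [0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
gram = gram_matrix(video_features)
print(gram[0][1])
```

A Gram matrix of this kind can be handed to any SVM implementation that accepts precomputed kernels.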
3.3 Zero-Shot Emotion Recognition
Canonical emotion theories such as Ekman  often provide detailed textual definitions for a fixed number of prototypical emotions. However, recent research [3, 44] questioned the validity of basic emotional categories and highlighted differences within each category. This raises an interesting question: if we face a more diverse list of emotions than those in the training set, can we identify these emotions purely based on their textual description? This is the zero-shot recognition problem. To address it, we need to employ an auxiliary set of textual information to help encode the emotion classes that have never been visually seen before. Here, we use and to indicate the emotion label words of auxiliary and testing dataset.
The encoding algorithm finds a distributed representation for each word in the vocabulary, by training models from large-scale textual corpora containing sentiment data. Specifically, we find a low-dimensional vector representation by using each word in a corpus to predict nearby words, maximizing the following log probability:
$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \neq 0} \log p(w_{t+j} \mid w_t), \tag{7}$$
where $c$ is the context window for prediction; $w_t$ and $w_{t+j}$ are the “input” and “output” words respectively, and $\mathbf{u}_w$ is the continuous word representation to learn, which parameterizes $p$. Optimizing Eq (7) is an effective way to learn continuous word representations, but the total computational cost was intractable until recent developments such as word2vec and GloVe.
This acquired vector representation serves as the intermediary between an emotion word and its corresponding video emotion class. Mapping video-level features into this semantic space requires a regressor $g$ from the video feature space to the word embedding space:
$$g: \boldsymbol{\phi}_i \mapsto \mathbf{u}_{z_i}. \tag{8}$$
In this work, we train a support vector regressor with a linear kernel for each dimension of the word vector $\mathbf{u}$. Similar support vector classifiers have also been used in the attribute learning works [42, 22].
Note that the regression models in Eq (8) potentially have a generalization problem, which is largely caused by the different visual distributions of the disjoint training and testing classes in zero-shot learning settings. For example, videos of joy usually have positive frames, whilst a sad video would have negative ones. To ameliorate such generalization problems, we take inspiration from and apply Transductive 1-Step Self-Training (T1S) to adjust the word vector of new emotion classes. Specifically, for a class $z \in \mathcal{Z}_{te}$ that is previously unseen, and the corresponding word vector $\mathbf{u}_z$, we compute a smoothed version $\hat{\mathbf{u}}_z$:
$$\hat{\mathbf{u}}_z = \frac{1}{K'+1}\Bigl(\mathbf{u}_z + \sum_{\mathbf{v} \in \mathcal{N}_{K'}(\mathbf{u}_z)} \mathbf{v}\Bigr), \tag{9}$$
where $\mathcal{N}_{K'}(\mathbf{u}_z)$ denotes the set of spherical $K'$ nearest neighbors to $\mathbf{u}_z$ among the predicted testing word vectors in the semantic space. Eq (9) aims at transductively ameliorating such visual differences by averaging $\mathbf{u}_z$ with its nearest neighbor testing instances. Here, to prevent the semantic drift of self-training, we only do self-training for one step. By using Eq (9), we obtain the updated word vectors $\{\hat{\mathbf{u}}_z\}_{z \in \mathcal{Z}_{te}}$ of the testing classes.
Thus, given a test video $V_*$ in the testing set with video-level feature $\boldsymbol{\phi}_*$, its class label can be estimated as
$$\hat{z}_* = \arg\max_{z \in \mathcal{Z}_{te}} \cos\bigl(g(\boldsymbol{\phi}_*), \hat{\mathbf{u}}_z\bigr). \tag{10}$$
Compared with the zero-shot learning algorithm in , we skip the intermediate level of latent attributes and directly apply the 1-step self-training in the semantic word vector space. In addition, we use cosine similarity as the metric rather than the Euclidean distance. The process is summarized in Algorithm 2.
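The zero-shot procedure described above (one step of transductive smoothing followed by nearest-prototype classification under cosine similarity) can be sketched as follows. This is only an illustration under assumptions: the regressor output is taken as given, the prototype itself is included in the average, and the function names, the neighbourhood size `k`, and the 2-D "word vectors" are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def t1s_smooth(prototype, predicted_vecs, k):
    """One self-training step: average a class word vector with its
    k nearest predicted test vectors (assumption: prototype is included)."""
    nearest = sorted(predicted_vecs,
                     key=lambda v: cosine(prototype, v), reverse=True)[:k]
    group = [prototype] + nearest
    return [sum(col) / len(group) for col in zip(*group)]


def zero_shot_predict(predicted_vecs, prototypes, k):
    """Label each test video with the unseen class whose smoothed
    word vector is most similar to the video's predicted word vector."""
    smoothed = {name: t1s_smooth(vec, predicted_vecs, k)
                for name, vec in prototypes.items()}
    return [max(smoothed, key=lambda name: cosine(v, smoothed[name]))
            for v in predicted_vecs]


# Toy example (assumed values): two unseen classes, four projected test videos.
prototypes = {"nostalgia": [1.0, 0.0], "suspense": [0.0, 1.0]}
predicted = [[0.9, 0.2], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]]
labels = zero_shot_predict(predicted, prototypes, k=2)
print(labels)
```

Running the smoothing only once mirrors the stated motivation of limiting semantic drift.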
3.4 Attribution and Summarization
We define the video attribution problem as one of attributing the emotion of a video to its frames/clips. The video emotion attribution problem is inspired by, and yet different from, text-based sentiment attribution . The difference is that sentiment attribution only considers positive or negative attitudes while we consider more diverse emotions for emotion attribution.
Emotion attribution can help us find video highlights, which are the interesting or important events that happen in the video. Generally, the concepts of “interesting” and “important” may vary across target video domains and applications, such as the scoring of a goal in soccer videos, applause and cheering in talk-show videos, and exciting speech in presentation videos. Nevertheless, most of these “interesting” or “important” video events convey very strong video emotions and thus highlight the core parts of the whole video.
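Formally, for one video sequence $V_i$, we want to find the video frames that highly contribute to the overall expressed video emotion. Using Eq (2), we can encode a frame with feature $\mathbf{x}_{i,j}$ as a similarity score vector over the $D$ cluster centers $\mathbf{c}_d$:
$$\mathbf{s}_{i,j} = \bigl[a_{i,j,1}\cos(\mathbf{x}_{i,j}, \mathbf{c}_1), \dots, a_{i,j,D}\cos(\mathbf{x}_{i,j}, \mathbf{c}_D)\bigr]^{\top}, \tag{11}$$
where $a_{i,j,d}$ is defined in Eq (4). The vector $\mathbf{s}_{i,j}$ uses the auxiliary image dataset to evaluate the emotions in the frame.
Equipped with the frame-level emotion score vector, the video emotion attribution can be formulated as measuring the similarity between the video-level emotion vector and the frame-level vectors. Specifically, the attribution score of the $j$-th video frame is computed as the cosine similarity between $\boldsymbol{\phi}_i$, the feature vector of the entire video, and $\mathbf{s}_{i,j}$. To find the frame that contributes the most to the overall emotion of $V_i$, we simply take the $\arg\max$ over all frames:
$$j^{*} = \arg\max_{j} \cos(\boldsymbol{\phi}_i, \mathbf{s}_{i,j}). \tag{12}$$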
Similarly, emotion attribution may be performed on a list of pre-partitioned video clips. Given a list of clips , we can compute the similarity score vector for each clip as
and then perform a similar operation.
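Frame-level attribution as described in this section can be sketched as below. The sketch assumes that a frame's score vector keeps the cosine similarities to its `k_nearest` cluster centers and zeroes the rest; that masking, the function names, and all numeric values are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def frame_score_vector(frame, centers, k_nearest):
    """Similarity of a frame to each center, kept only for its k nearest centers."""
    sims = [cosine(frame, c) for c in centers]
    ranked = sorted(range(len(centers)), key=lambda d: sims[d], reverse=True)
    keep = set(ranked[:k_nearest])
    return [sims[d] if d in keep else 0.0 for d in range(len(centers))]


def attribute(frames, centers, video_vec, k_nearest):
    """Return the index of the most emotion-contributing frame and all scores."""
    scores = [cosine(video_vec, frame_score_vector(f, centers, k_nearest))
              for f in frames]
    best = max(range(len(frames)), key=lambda j: scores[j])
    return best, scores


# Toy example (assumed values): two cluster centers, a video-level encoding
# dominated by the first center, and three frames.
centers = [[1.0, 0.0], [0.0, 1.0]]
video_vec = [3.0, 1.0]
frames = [[0.1, 1.0], [1.0, 0.1], [0.5, 1.0]]
best, scores = attribute(frames, centers, video_vec, k_nearest=1)
print(best)
```

The same scoring can be applied to clips by first combining the score vectors of the frames in each clip; how they are combined is left open here.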
Emotion attribution can facilitate interesting applications, such as video summarization and retrieval. To further enable the emotion-based video summarization, we can employ the emotion attribution technique described above and select a set of the most representative frames (or, the corresponding short clips containing the frames). Suppose is the set of key frames of the video, defined as
where is a weight parameter, and the second term rewards key frames that are the most similar to other frames in the same video, which means that the selected set of key frames are representatives of the entire video.
Compared with previous work [66, 29, 19, 20, 72], Eq (14) considers both video highlights (through the first term, for emotion attribution) and information coverage (through the second term, for eliminating redundancy and selecting information-centric frames/clips). Thus our method can produce a condensed, succinct and emotion-rich summary which can facilitate the browsing, retrieval and storage of the original video content. In particular, our summary results are more emotionally interpretable due to the emotion attribution.
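The trade-off between emotion attribution and coverage can be illustrated with a simple scoring scheme. The sketch below is not the paper's objective: it assumes a per-frame score of the form "attribution plus a weighted mean similarity to the other frames" and picks the top-scoring frames; the function name `summarize`, the weight `lam`, and the toy values are hypothetical.

```python
def summarize(attribution, similarity, num_keyframes, lam=0.5):
    """Pick key frames by attribution score plus weighted representativeness.

    attribution[j]   : emotion attribution score of frame j
    similarity[j][t] : similarity between frames j and t
    lam              : weight of the coverage term (assumed form)
    """
    n = len(attribution)
    scores = []
    for j in range(n):
        others = [similarity[j][t] for t in range(n) if t != j]
        coverage = sum(others) / len(others) if others else 0.0
        scores.append(attribution[j] + lam * coverage)
    ranked = sorted(range(n), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:num_keyframes])  # key frames in temporal order


# Toy example (assumed values): frame 0 is emotional but atypical,
# frames 1-3 resemble each other, frame 2 is both emotional and typical.
attribution = [0.9, 0.2, 0.8, 0.1]
similarity = [
    [1.0, 0.1, 0.2, 0.1],
    [0.1, 1.0, 0.9, 0.9],
    [0.2, 0.9, 1.0, 0.8],
    [0.1, 0.9, 0.8, 1.0],
]
emotion_heavy = summarize(attribution, similarity, num_keyframes=2, lam=0.5)
coverage_heavy = summarize(attribution, similarity, num_keyframes=2, lam=2.0)
print(emotion_heavy, coverage_heavy)
```

With a small weight the emotional frames dominate the summary; with a large weight, frames that are representative of the rest of the video are preferred.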
4.1 Datasets and Settings
We adopt three video emotion datasets for evaluation. Among them, two (the Plutchik and the Ekman datasets) are introduced by us and will be made available to the community once this paper is accepted.
The YouTube emotion dataset. This dataset contains 1,101 videos annotated with 8 basic emotions from Plutchik’s Wheel of Emotions. To better evaluate the zero-shot emotion recognition tasks, we re-annotate the videos into 24 emotions according to Plutchik’s definitions by adding variations to each of the basic emotions. For example, we split the basic joy class into ecstasy, joy, and serenity along the arousal dimension. We use the shorthand YouTube-8 and YouTube-24 for the original and re-annotated datasets respectively. YouTube-24 is used as a more difficult task for evaluating the zero-shot recognition.
The Plutchik dataset. This dataset is derived from the recently proposed VideoStory dataset . We use the keywords of the Plutchik’s Wheel of Emotions  to query the textual descriptions of the VideoStory dataset. Since most of the textual descriptions come from the video captions, the emotions of the returned videos are accurately described by their corresponding emotional keywords. We manually pruned noisy videos from the returned set, which leads to a set of videos belonging to emotion classes.
The Ekman dataset. As discussed in the related work, the studies of Ekman found that emotions recognized with high agreement across cultures can be labelled as 6 basic emotion types. These 6 emotions are a subset of the 8 classes defined by Plutchik. Based on this, we collect the Ekman emotion dataset, which covers these 6 emotion classes with a minimum of 221 videos per class. The videos are collected from social video-sharing websites, such as YouTube and Flickr. The dataset is labelled through crowdsourcing by 10 different annotators, with each video labelled by at least 3 of them. The final annotation for each video is produced by a majority vote among the labels assigned by the annotators.
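The majority-vote aggregation of annotator labels can be sketched as follows. The text does not say how ties are resolved; returning `None` for a tie is an assumption of this sketch, and `majority_label` is a hypothetical helper.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label, or None when the top count is tied
    (assumption: tied videos would need further review)."""
    counts = Counter(labels)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else None


# Toy annotations for two videos, three annotators each (assumed values).
clear_vote = majority_label(["happiness", "happiness", "surprise"])
tied_vote = majority_label(["fear", "anger", "disgust"])
print(clear_vote, tied_vote)
```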
Auxiliary emotional image and text datasets. From the Flickr image dataset , we select as the auxiliary image data a subset of images of Adjective-Noun Pairs (ANPs) that have top ranks with respect to the emotions (see Table 2 in ). These images are clustered into clusters whose centers are used in Eq (2). As shown in , the large-scale text data can greatly benefit the trained language model. We train the Skip-gram model (Eq 7) on a large-scale text corpus, which includes around 7 billion words from the UMBC WebBase (3 billion words), the latest Wikipedia articles (3 billion words) and some other documents (1 billion words). The trained model has around 4 million unique words and phrases. Most of the documents are about scientific articles and professional reports which have very strict definitions, descriptions and usage of the emotion and sentiment related words.
Experimental settings. Each video is uniformly sampled at fixed frame increments for feature extraction to reduce the computational cost. The dimension of the real-valued semantic vectors (Eq (7)) is set to . Our AlexNet CNN model is trained by ourselves using ImageNet classes with the Caffe toolkit, and we use the 4096-dimensional activations of the 7th fully-connected layer after the Rectified Linear Units as features. The number of nearest neighbors in Eq (4) is empirically set to a fixed fraction of the image clusters, which balances the computational cost with a good representation in Eq (5).
4.2 Video Emotion Recognition
4.2.1 Supervised Recognition
We first report results on the supervised emotion recognition task, and compare our ITE encoding method with the following alternative baselines.
MaxP . The instance-level classifiers are trained using the labels inherited from their corresponding bags. These classifiers can be used to predict instance labels of testing videos. The final bag labels are produced by majority vote of instance labels. This method is a variant of the Key Instance Detection (KID)  in multi-class multi-instance setting.
AvgP . We average the frame-level image features of one video as video-level feature descriptions for classification. For the video, its average pooling feature is computed as . The average pooling is the standard approach of aggregating frame-level features into video-level descriptions as mentioned in .
Mi-FV . MIL bags of training videos are mapped into a new bag-level Fisher Vector representation. Mi-FV is able to handle large-scale MIL data efficiently.
CCE . The instances of all training bags are clustered into groups, and each bag is re-represented by binary features, each of which indicates whether the bag has instances falling into the corresponding group. This is essentially a simplified version of our ITE method encoded by training instances only.
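As a rough illustration of the two simplest encodings above, the following sketch assumes toy two-dimensional frame features and hand-picked cluster centers (both hypothetical). The names `average_pool` and `cce_encode` are illustrative only, and the CCE-style variant shown uses a single clustering, with 1 marking a group that contains at least one frame of the bag.

```python
from math import dist

def average_pool(frames):
    """AvgP-style encoding: mean of the frame-level feature vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def cce_encode(frames, centers):
    """CCE-style encoding (assumed form): bit k is 1 if the nearest
    center of at least one frame is center k, and 0 otherwise."""
    code = [0] * len(centers)
    for frame in frames:
        nearest = min(range(len(centers)), key=lambda k: dist(frame, centers[k]))
        code[nearest] = 1
    return code

# Toy bag of three frames and three cluster centers (hypothetical values).
video = [[0.0, 1.0], [0.2, 0.8], [4.0, 4.0]]
centers = [[0.0, 1.0], [4.0, 4.0], [9.0, 9.0]]

video_feature = average_pool(video)   # one vector per video
bag_code = cce_encode(video, centers) # one binary vector per video
```

The pooled vector keeps the real-valued deep features, whereas the binary code only records which groups are occupied, so it discards more of the feature information.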
As for the SVM predictor in Eq (6), the linear kernel is used for Mi-FV and MaxP due to the large number of samples/dimensions, and the Chi-square kernel is used for the others (the RBF kernel was also evaluated but performed slightly worse than the Chi-square kernel). A binary two-class SVM model is trained for each emotion class separately.
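The exact form of the Chi-square kernel is not spelled out here. A minimal sketch, assuming the commonly used exponential Chi-square kernel on non-negative features (such as post-ReLU activations) and a hypothetical bandwidth `gamma`, could look as follows.

```python
from math import exp

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel (assumed form):
    k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    distance = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # skip dimensions where both entries are zero
            distance += (a - b) ** 2 / (a + b)
    return exp(-gamma * distance)

def gram_matrix(samples, gamma=1.0):
    """Pairwise kernel values between all video-level feature vectors."""
    return [[chi2_kernel(a, b, gamma) for b in samples] for a in samples]

# Toy non-negative video-level features (hypothetical values).
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [1.0, 0.0, 2.0]]
K = gram_matrix(X)
```

A Gram matrix of this kind can be passed to any SVM implementation that accepts precomputed kernels, with one binary classifier trained per emotion class.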
The experimental results are shown in Figure 4. They demonstrate that our ITE method outperforms the four alternatives by large margins on all three datasets, which validates the effectiveness of our method in generating a better video-level feature representation based on the auxiliary images. In particular, the improvement of ITE over CCE and Mi-FV suggests that the knowledge transferred from the auxiliary emotional image dataset is probably more critical than that contained in the training video frames. This supports our argument that most of the frames of these videos have no direct relation to the emotions expressed by the videos, and underscores the importance of knowledge transfer.
One should notice that CCE has the worst performance. CCE re-encodes the multi-instances into binary representations by ensemble clustering. Such representations may have better performance than the hand-crafted features used in , but they cannot beat the recently proposed deep features, which have been shown to be able to extract higher level information . In other words, the re-encoding process of CCE loses discriminative information gained from the deep features, and is therefore unsuited for the task.
In addition, Mi-FV and MaxP have similar performance: MaxP is slightly better on Plutchik and Ekman, and Mi-FV is slightly better on YouTube. However, the results of Mi-FV and MaxP are much worse than those of AvgP. These differences can be explained by the different choices of kernels: we verified that AvgP with a linear SVM classifier performs similarly (with a variance of) to MaxP and Mi-FV. Nevertheless, due to the high dimensionality of Fisher Vectors and the large number of training instances in MaxP, nonlinear kernels would introduce a prohibitive computational cost. Thus, in subsequent experiments, we use AvgP as the major alternative baseline to ITE, since the other alternatives do not demonstrate competitive advantages.
4.2.2 Zero-Shot Recognition
Experiments on zero-shot emotion recognition are conducted on two datasets: Plutchik and YouTube. We use anger, joy, surprise, and terror as testing classes in the Plutchik dataset ( testing instances in total). We validate our methods on both versions of the YouTube dataset: YouTube-8 and YouTube-24. For YouTube-8, we use fear and sadness as the testing classes. For YouTube-24, we randomly split the 24 classes into training and testing classes and repeat the experiment for 5 rounds. In this zero-shot setting, no testing classes are seen during training.
First and foremost, we highlight that zero-shot emotion learning is not possible without the heterogeneous knowledge transferred from text. No previous work has reported successful experiments on zero-shot emotion learning, mainly due to the difficulty of clearly defining the semantic word/attribute vectors of emotion classes. In this paper, a language model trained on large-scale text data is employed to transfer the knowledge of emotion words to the video domain via the semantic word vectors. This transfer step enables zero-shot emotion recognition.
We compare our T1S algorithm with Direct Attribute Prediction (DAP) [42, 41]. For DAP, at test time each dimension of the word vector of each test sample is predicted, and the test class labels are inferred from these predictions. DAP can be seen as directly using Eq (10) without the word vector smoothing of Eq (9). Four variants are compared, obtained by combining (a) different video-level feature representations (AvgP or ITE) and (b) different zero-shot learning algorithms (T1S or DAP). Figure 5 shows the results. Our ITE+T1S approach is the best performing method, outperforming the second best by 3.6, 4.8, and 1.2 absolute percentage points respectively. The results confirm the effectiveness of the knowledge transfer scheme, which improves upon the random baseline by 8.1, 6.3, and 15.9 absolute percentage points. The second best model for Plutchik and YouTube-24 is AvgP+T1S, whose performance is close to that of the best method. This suggests that the T1S technique contributes the biggest performance gain when the training classes bear some similarity to the unseen testing classes. However, when the training classes are very different from the testing classes, the ITE encoding scheme plays an important role. Overall, the experiments show that the combination of ITE+T1S is effective under different zero-shot learning conditions. Given the inherent difficulties of the zero-shot learning task, we consider the results to be very promising.
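To make the inference step concrete, the sketch below shows a generic nearest-prototype assignment in word-vector space. It is not the T1S smoothing of Eq (9) nor an exact implementation of Eq (10); the three-dimensional word vectors, the predicted vector and the function names are all hypothetical, and cosine similarity is an assumed choice of similarity measure.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_label(predicted_vec, class_prototypes):
    """Assign the unseen class whose word vector is most similar
    to the word vector predicted for the test video."""
    return max(class_prototypes,
               key=lambda name: cosine(predicted_vec, class_prototypes[name]))

# Toy word vectors for two unseen classes (hypothetical values).
prototypes = {"fear": [0.9, 0.1, 0.0], "sadness": [0.1, 0.9, 0.2]}
predicted = [0.2, 0.7, 0.1]  # word vector regressed from a test video

label = zero_shot_label(predicted, prototypes)
chance = 1.0 / len(prototypes)  # uniform random-guess accuracy
```

With two unseen classes, uniform guessing gives an accuracy of 0.5, which is the kind of random baseline against which zero-shot gains are measured.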
4.2.3 Key Implementation Choices
In this subsection, we discuss the settings of two important experimental options in our framework.
Number of clusters for auxiliary emotional images. We vary the number of clusters (i.e., ) of the auxiliary images in Eq (2). These experiments are conducted on the Ekman dataset. The results are plotted in Figure 6 and further validate the effectiveness of our ITE method: as the number of clusters increases from , our ITE results gradually improve. This is because a larger number of clusters helps capture additional discriminative information.
Fine-tuning of the AlexNet CNN model. We also experiment with using the auxiliary images to fine-tune the weights of the AlexNet CNN model. After the fine-tuning step, the classification results on all three datasets show only small changes () rather than significant improvements. We postulate that the videos in our datasets have more diversified content and relatively lower quality than the Flickr images. In other words, the distributions of the images and the videos differ, which would explain why fine-tuning with images is not very helpful.
4.3 Video Emotion Attribution
As discussed earlier, another advantage of our encoding scheme is that we can identify the video clips that have a high impact on the overall video emotion. (According to our pilot study, emotions are sparsely expressed in the videos: on average, around of all video frames are related to emotion on our three datasets.)
As the first work on video emotion attribution, we define a user-study protocol to evaluate the performance of different algorithms on this task. Ten subjects, unaware of the project goals, were invited to take part. Given all emotion keywords of the corresponding dataset and a clip computed from the video, participants are asked to guess the name of the emotion expressed in the clip. These clips are selected by the different methods discussed below. Since the ground-truth video emotion labels are known, we are able to compute the fraction of participants who assigned the correct emotion label to each clip.
We randomly select videos from each of the three datasets. For each video, we compute the 2-second video clip that contains the highest attribution towards video emotion, using Eq (12).
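The clip selection and the user-study metric can be sketched as follows, assuming that a per-frame attribution score is already available (Eq (12) itself is not reproduced) and that the sampled frame rate is known. The function names, the toy scores and the rate of 2 sampled frames per second are hypothetical.

```python
def best_clip(scores, fps, clip_seconds=2.0):
    """Return (start, end) frame indices of the fixed-length window
    with the largest total attribution score (end is exclusive)."""
    window = max(1, int(round(fps * clip_seconds)))
    if window >= len(scores):
        return 0, len(scores)
    window_sum = sum(scores[:window])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(scores) - window + 1):
        # Slide the window by one frame: add the new frame, drop the old one.
        window_sum += scores[start + window - 1] - scores[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    return best_start, best_start + window

def correct_fraction(guesses, truth):
    """Fraction of participants whose guess matches the ground-truth label."""
    return sum(guess == truth for guess in guesses) / len(guesses)

# Toy per-frame attribution scores at 2 sampled frames per second.
frame_scores = [0.1, 0.2, 0.1, 0.9, 0.8, 0.7, 0.6, 0.1, 0.0, 0.1]
start, end = best_clip(frame_scores, fps=2)

# Ten hypothetical participant guesses for a clip labelled "joy".
guesses = ["joy"] * 7 + ["surprise"] * 2 + ["anger"]
fraction = correct_fraction(guesses, "joy")
```

Averaging `correct_fraction` over the selected clips of a method gives a single accuracy figure per method and dataset.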
We compare several different methods:
Chance. The prior probability of guessing the emotion without viewing any portion of the video.
Random sampling. We randomly sample other clips ( seconds each) from the same video that do not overlap with the first clip.
Face_present. The “face_present” features  are used to rank all the video frames; frames in which larger and more faces are detected are ranked higher. Clips are then generated from the top-ranked frames.
The results are shown in Figure 7. Our method achieves the best accuracy and outperforms the “face_present” baseline by 16–26 absolute percentage points. It is true that facial expressions can convey strong emotions. Nevertheless, a large portion of the videos in our datasets are user-generated videos, which are more general than the face videos used in traditional facial emotion recognition tasks, and such videos often do not contain human faces at all. This result indicates that our method can consistently identify video clips that convey emotions recognizably similar to the emotion conveyed by the original video. That is, the identified clips contribute to the overall video emotion.
A qualitative result of emotion attribution is shown in Figure 8, where the video is uniformly sampled every frames. The bar chart shows scores of different frames, where the key frames are shown above the bars. The figure demonstrates that clips with stronger emotional contents are given higher scores of attribution, validating the effectiveness of our method.
4.4 Emotion-Oriented Video Summarization
Finally, we evaluate our framework on emotion-oriented video summarization. We compare with four baselines: (1) Uniform sampling. It uniformly samples several clips from the video. (2) K-means sampling. It simply clusters the clips and selects the clip closest to each cluster centroid. (3) Story-driven summarization . This approach was developed to summarize very long egocentric videos. We slightly modify the implementation to make the length of the summary controllable for our task. (4) Real-time summarization . Wang et al.  aim at efficient summarization of videos based on semantic content recognition results. For all the methods, the length of the summary is fixed to 6 seconds if the original video is longer than 1 minute. For short videos, the length is fixed at of the original video duration.
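The length rule and the uniform-sampling baseline can be sketched as below. The fraction applied to short videos is left as a required argument rather than given a default, and the 2-second clip length and the function names are assumptions made only for illustration.

```python
def summary_length(duration_s, short_fraction=None):
    """Summary length in seconds: 6 s for videos longer than 1 minute,
    otherwise a fixed fraction of the original duration."""
    if duration_s > 60.0:
        return 6.0
    if short_fraction is None:
        raise ValueError("short_fraction is required for videos of at most 1 minute")
    return short_fraction * duration_s

def uniform_clip_starts(duration_s, n_clips, clip_s):
    """Start times (seconds) of n_clips clips spread evenly over the video."""
    stride = (duration_s - clip_s) / max(1, n_clips - 1)
    return [i * stride for i in range(n_clips)]

# A 2-minute video summarized with 2-second clips (assumed clip length).
duration = 120.0
length = summary_length(duration)
n_clips = int(length // 2.0)
starts = uniform_clip_starts(duration, n_clips, clip_s=2.0)
```

Fixing the summary length in this way means that all methods are compared on summaries of equal duration, so differences in the user ratings reflect which clips were chosen rather than how many.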
We conduct a user study to evaluate the different summarization methods. Ten subjects unfamiliar with the project participated in the study. We show the summaries produced by all the methods (without the audio information) to each participant. Participants are asked to rate each result on a five-point scale for each of the following evaluation metrics: (1) Accuracy: the summary accurately describes the “dominating” high-level semantics of the original video; (2) Coverage: the summary covers as much visual content as possible using as few frames as possible; (3) Quality: the overall subjective quality of the summary; (4) Emotion: the summary conveys the same main emotion as the original video.
The results are shown in Figure 9. The average score is shown in the “Overall” column. Our method (“Emotion-VS”) performs better than the other methods on the accuracy and the emotion metrics. On the emotion metric, we beat the best baseline by a margin of 0.87. Although we do slightly worse on the coverage metric (-0.13 compared to the best baseline), the drop in quality is minimal (-0.04 compared to the best baseline). The results suggest that the selection of emotional key frames and clips not only captures the emotion of the original video, but also improves the overall accuracy of the summary, since an accurate summary should capture emotional content. Our emotion-oriented summarization method significantly increases the amount of emotional content captured by the summary without material loss on the other quality measures.
A qualitative evaluation is shown in Figure 10. At the top, the figure shows a video of an art therapist (the woman in green). Unlike the other methods, our summarization not only captures the therapy procedure, but also focuses on the sadness of the therapist, which is the central emotion conveyed in this video. At the bottom, we illustrate a user-generated video in which a father surprises his daughter during a baseball game by dressing as the catcher and revealing himself. All baseline methods focus more on the baseball game itself, which is the part least related to the emotion of this video. In contrast, our method clearly captures the father revealing himself, the surprised daughter, and the subsequent emotional hug.
This paper provides the first study of knowledge transfer from heterogeneous sources for the task of video emotion understanding, which includes supervised and zero-shot emotion recognition, emotion attribution and emotion-oriented summarization. For effective knowledge transfer, we learn encoding schemes from a large-scale emotional image data set and a large, 7-billion-word text corpus. This transfer facilitates the creation of a representation conducive to the tasks of video emotion understanding. In zero-shot emotion recognition, an unknown emotional word is related to known emotion classes through the use of a distributed representation in order to identify emotions unseen during training. Our experiments on three challenging datasets clearly demonstrate the benefit of utilizing external knowledge. Our framework also enables novel applications such as emotion attribution and emotion-oriented video summarization. A user study shows that our summaries accurately capture emotional content consistent with the overall emotion of the original video. Future work will address the joint application of emotion-oriented summarization and story-driven summarization, which should allow us to create complete and emotionally compelling stories.
The authors would like to thank Chong-Wah Ngo for his constructive advice.
-  J. Amores. Multiple instance classification: Review, taxonomy and comparative study. Artif. Intell., 201(4):81–105, 2013.
-  S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS, pages 561–568, 2003.
-  L. F. Barrett. Are emotions natural kinds? Perspectives on Psychological Science, 1(1):28–58, 2006.
-  L. F. Barrett, K. A. Lindquist, and M. Gendron. Language as context for the perception of emotion. Trends in cognitive sciences, 11:327–332, 2007.
-  D. Borth, R. Ji, T. Chen, T. M. Breuel, and S.-F. Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In ACM MM, 2013.
-  N. Bosch, S. D’Mello, R. Baker, J. Ocumpaugh, V. Shute, M. Ventura, L. Wang, and W. Zhao. Automatic detection of learning-centered affective states in the wild. In the 2015 International Conference on Intelligent User Interfaces, 2015.
-  J. M. Carroll and J. A. Russell. Do facial expressions signal specific emotions? Judging emotion from the face in context. Journal of Personality and Social Psychology, 70(2):205–218, 1996.
-  T. Chen, D. Borth, T. Darrell, and S.-F. Chang. Deepsentibank: Visual sentiment concept classification with deep convolutional neural networks. CoRR, 2014.
-  Y. Chen, J. Bi, and J. Z. Wang. Miles: Multiple-instance learning via embedded instance selection. IEEE TPAMI, 28(12):1931–1947, 2006.
-  A. C. Cruz, B. Bhanu, and N. S. Vision and attention theory based sampling for continuous facial emotion recognition. IEEE TAC, 5(4):418–431, 2014.
-  D. DeMenthon, V. Kobla, and D. Doermann. Video summarization by curve simplification. In ACM MM, 1998.
-  A. Dhall, R. Goecke, J. Joshi, M. Wagner, and T. Gedeon. Emotion recognition in the wild challenge 2013. In ACM ICMI, 2013.
-  A. Dhall and R. Goecke. Group expression intensity estimation in videos via gaussian processes. In ICPR, 2012.
-  T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artif. Intell., 89(1–2):31–71, 1997.
-  R. J. Dolan. Emotion, cognition, and behavior. Science, 298:1191–1194, 2002.
-  P. Ekman. Universals and cultural differences in facial expressions of emotion. In Nebraska Symposium on Motivation, 1972.
-  P. Ekman. An argument for basic emotions. Cognition & emotion, 6(3-4):169–200, 1992.
-  R. el Kaliouby and P. Robinson. Real-time inference of complex mental states from facial expressions and head gestures. In CVPR, 2004.
-  Y. Fu. Multi-view metric learning for multi-view video summarization. CoRR, 2014.
-  Y. Fu, Y. Guo, Y. Zhu, F. Liu, C. Song, and Z.-H. Zhou. Multi-view video summarization. IEEE TMM, 12(7):717–729, 2010.
-  Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Learning multi-modal latent attributes. IEEE TPAMI, 2013.
-  Y. Fu, T. M. Hospedales, T. Xiang, and S. Gong. Transductive multi-view zero-shot learning. IEEE TPAMI, to appear.
-  Y. Fu, T. M. Hospedales, T. Xiang, S. Gong, and Y. Yao. Interestingness prediction by robust learning to rank. In ECCV, 2014.
-  Y. Fu, T. M. Hospedales, T. Xiang, J. Xiong, S. Gong, Y. Wang, and Y. Yao. Robust subjective visual property prediction from crowdsourced pairwise labels. IEEE TPAMI, to appear.
-  T. Gärtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In Proc. 19th International Conf. on Machine Learning, pages 179–186. Morgan Kaufmann, 2002.
-  X. Glorot, A. Bordes, and Y. Bengio. Domain adaptation for large-scale sentiment classification: A deep learning approach. In ICML, 2011.
-  J. J. Gross. Emotion regulation: Affective, cognitive, and social consequences. Psychophysiology, 39(3):281–291, 2002.
-  A. Habibian, T. Mensink, and C. G. M. Snoek. Videostory: A new multimedia embedding for few-example recognition and translation of events. In ACM MM, 2014.
-  A. Hanjalic and H. Zhang. An integrated scheme for automated video abstraction based on unsupervised cluster-validity analysis. IEEE TCSVT, 1999.
-  G. Irie, T. Satou, A. Kojima, T. Yamasaki, and K. Aizawa. Affective audio-visual words and latent topic driving model for realizing movie affective scene classification. IEEE TMM, 12(6):523–535, Oct 2010.
-  Q. Ji, P. Lan, and C. Looney. A probabilistic framework for modeling and real-time monitoring human fatigue. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 36(5):862–875, Sept 2006.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, 2014.
-  Y.-G. Jiang, B. Xu, and X. Xue. Predicting emotions in user-generated videos. In AAAI, 2014.
-  Y.-G. Jiang, J. Yang, C.-W. Ngo, and A. G. Hauptmann. Representations of keypoint-based semantic concept detection: A comprehensive study. IEEE TMM, 12(1):42–53, 2010.
-  Y.-G. Jiang, Y. Wang, R. Feng, X. Xue, Y. Zheng, and H. Yang. Understanding and predicting interestingness of videos. In AAAI, 2013.
-  B. Jou, S. Bhattacharya, and S.-F. Chang. Predicting viewer perceived emotions in animated gifs. In ACM MM, 2014.
-  H.-B. Kang. Affective content detection using HMMs. In ACM MM, 2003.
-  A. Kapoor, W. Burleson, and R. W. Picard. Automatic prediction of frustration. Int. J. Hum.-Comput. Stud., 65(8):724–736, 2007.
-  D. Kotzias, M. Denil, P. Blunsom, and N. de Freitas. Deep multi-instance transfer learning. CoRR, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
-  C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE TPAMI, 36(3):453–465, 2013.
-  B. Li. A dynamic and dual-process theory of humor. In The 3rd Annual Conference on Advances in Cognitive Systems, 2015.
-  K. A. Lindquist, T. D. Wager, H. Kober, E. Bliss-Moreau, and L. F. Barrett. The brain basis of emotion: a meta-analytic review. Behavioral and Brain Sciences, 35(3):121–143, 2012.
-  D. Liu, G. Hua, and T. Chen. A hierarchical visual model for video object summarization. IEEE TPAMI, 32(12):2178–2190, 2009.
-  G. Liu, J. Wu, and Z. Zhou. Key instance detection in multi-instance learning. In S. C. H. Hoi and W. L. Buntine, editors, ACML, 2012.
-  M. Liu, S. Shan, R. Wang, and X. Chen. Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In IEEE CVPR, 2014.
-  X. Lu, P. Suryanarayan, R. B. Adams, J. Li, M. G. Newman, and J. Z. Wang. On shape and the computability of emotions. In ACM MM, 2012.
-  Z. Lu and K. Grauman. Story-driven summarization for egocentric video. In CVPR, 2013.
-  Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li. A user attention model for video summarization. In ACM MM, 2002.
-  J. Machajdik and A. Hanbury. Affective image classification using features inspired by psychology and art theory. In ACM MM, 2010.
-  S. Marsella and J. Gratch. EMA: A process model of appraisal dynamics. Journal of Cognitive Systems Research, 10(1):70–90, 2009.
-  D. McDuff, R. E. Kaliouby, J. F. Cohn, and R. Picard. Predicting ad liking and purchase intent: Large-scale analysis of facial responses to ads. IEEE TAC, 2015.
-  T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
-  C.-W. Ngo, Y.-F. Ma, and H.-J. Zhang. Video summarization and scene detection by graph modeling. IEEE TCSVT, 15(2):296–305, 2005.
-  A. Ortony, G. Clore, and A. Collins. The Cognitive Structure of Emotions. Cambridge University Press, 1988.
-  A. Ortony and T. J. Turner. What’s basic about basic emotions? Psychological Review, 97:315–331, 1990.
-  M. Pantic. Machine analysis of facial behaviour: Naturalistic and dynamic behaviour. Philosophical Transactions of the Royal Society B: Biological Sciences, 364(1535):3505–3513, 2009.
-  R. Plutchik. Emotion: Theory, research, and experience. In Theories of Emotion, volume 1. Academic Press, 1980.
-  B. Schuller, M. F. Valstar, F. Eyben, G. McKeown, R. Cowie, and M. Pantic. Avec 2011—The first international audio/visual emotion challenge. In ICACII, 2011.
-  B. Schuller and G. Rigoll. Recognising interest in conversational speech: comparing bag of frames and supra-segmental features. In INTERSPEECH, 2009.
-  T. Senechal, J. Turcot, and R. E. Kaliouby. Smile or smirk? automatic detection of spontaneous asymmetric smiles to understand viewer experience. In IEEE FG, 2013.
-  K. Sikka, A. Dhall, and M. Bartlett. Weakly supervised pain localization using multiple instance learning. In IEEE FG, 2013.
-  J. Sivic and A. Zisserman. Video google: a text retrieval approach to object matching in videos. In ICCV, pages 1470–1477, 2003.
-  Y. Tian, T. Kanade, and J. Cohn. Facial expression recognition. In S. Z. Li and A. K. Jain, editors, Handbook of Face Recognition, pages 487–519. Springer London, 2011.
-  B. T. Truong and S. Venkatesh. Video abstraction: A systematic review and classification. ACM TOMM, 3(1):79–82, 2007.
-  M. Valstar, B. Jiang, M. Mehu, M. Pantic, and K. Scherer. The first facial expression recognition and analysis challenge. In IEEE FG, 2011.
-  H.-L. Wang and L.-F. Cheong. Affective understanding in film. IEEE TCSVT, 16(6):689–704, 2006.
-  J. Wang and J.-D. Zucker. Solving the multiple-instance problem: A lazy learning approach. In ICML, pages 1119–1126, 2000.
-  M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, and T.-S. Chua. Event driven web video summarization by tag localization and key-shot identification. IEEE TMM, 14(4):975–985, 2012.
-  S. Wang and Q. Ji. Video affective content analysis: a survey of state of the art methods. IEEE TAC, PP(99):1–1, 2015.
-  X. Wang, Y. Jiang, Z. Chai, Z. Gu, X. Du, and D. Wang. Real-time summarization of user-generated videos based on semantic recognition. In ACM MM, 2014.
-  X.-S. Wei, J. Wu, and Z.-H. Zhou. Scalable multi-instance learning. In ICDM, 2014.
-  C. Xu, S. Cetintas, K.-C. Lee, and L.-J. Li. Visual sentiment prediction with deep convolutional neural networks. CoRR, 2014.
-  X. Xu and E. Frank. Logistic regression and boosting for labeled bags of instances. In 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2004.
-  Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative CNN video representation for event detection. CoRR, 2014.
-  Q. You, J. Luo, H. Jin, and J. Yang. Robust image sentiment analysis using progressively trained and domain transferred deep networks. In AAAI, 2015.
-  Z.-H. Zhou and M.-L. Zhang. Solving multi-instance problems with classifier ensemble based on constructive clustering. Knowledge and Information Systems, 11(2):155–170, 2007.
-  X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In CVPR, 2012.