A Multi-task Neural Approach for Emotion Attribution, Classification and Summarization

12/21/2018 ∙ by Guoyun Tu, et al. ∙ FUDAN University 24

Emotional content is a crucial ingredient in user-generated videos. However, the sparsity of emotional expression in user-generated videos makes emotion analysis difficult. In this paper, we propose a new neural approach---the Bi-stream Emotion Attribution-Classification Network (BEAC-Net)---to solve three related emotion analysis tasks: emotion recognition, emotion attribution, and emotion-oriented summarization, in an integrated framework. BEAC-Net has two major constituents, an attribution network and a classification network. The attribution network extracts the main emotional segment that classification should focus on, which mitigates the sparsity problem. The classification network utilizes both the extracted segment and the original video in a bi-stream architecture. We contribute a new dataset for the emotion attribution task with human-annotated ground-truth labels for emotional segments. Experiments on two video datasets demonstrate superior performance of the proposed framework and the complementary nature of the dual classification streams.




I Introduction

The explosive growth of user-generated videos has created a great demand for computational understanding of visual media data and attracted significant research attention in the multimedia community. Computational understanding of emotions expressed by user-generated video content enables many real-world applications. For instance, video recommendation services can benefit from matching users’ interests with the emotions of video content; understanding the emotion of the target video and that of the advertisement facilitates strategic reasoning on advertising placement [1]. Significant efforts and successes have been made on the problem of understanding video content, such as the recognition of activities [2, 3] and participants [4]. Despite the pervasiveness of emotional expressions in user-generated video, computational recognition and understanding of these emotions remains largely an open problem.

In this paper, we focus on the recognition of the overall emotion perceived by the audience. The emotion may be expressed by facial expressions, event sequences (e.g., a wedding ceremony), nonverbal language, or even just abstract shapes and colors. This differs from work that focuses on one particular channel, such as the human face [5, 6, 7, 8], abstract paintings [9], or music [10]. Although it is possible for the perceived emotion to differ from the intended expression of the video, as when a joke falls flat, in the datasets we find such cases to be uncommon.

We identify three major challenges faced by video emotion understanding. First, the emotions are often expressed only by a small subset of frames in the video, while the other parts of the video provide background and context; the computational technique must be sensitive to sparse emotional content. Second, one video often contains several emotions, albeit one emotion may play the dominant role. Therefore, it is important to distinguish the video segments that contribute the most to the video’s overall emotion, a problem known as video emotion attribution [11]. Third, user-generated videos are highly variable in production quality, and contain diverse objects, scenes, and events. This is in contrast to commercial video such as news reports, where the production quality (e.g., illumination and camera positions) is consistent and the content restricted to certain domains.

Observing these challenges, we argue that it is crucial to extract feature representations that are sensitive to emotions and invariant under conditions irrelevant to emotions. In previous work, this was achieved by combining low-level and middle-level features [12], or by using an auxiliary image sentiment dataset to encode video frames [11, 13]. The effectiveness of these features has been demonstrated on three emotion-related vision tasks: emotion recognition, emotion attribution, and emotion-oriented video summarization. However, a major drawback of previous work is that the three tasks were tackled separately and could not inform each other.

Extending our earlier work [14], in this paper we propose a multi-task neural architecture, the Bi-stream Emotion Attribution-Classification Network (BEAC-Net), which tackles emotion attribution and classification at the same time, thereby allowing the related tasks to reinforce each other. BEAC-Net is composed of an attribution network (A-Net) and a classification network (C-Net). The attribution network learns to select a segment from the entire video that captures the main emotion. The classification network processes the segment selected by the A-Net as well as the entire video in a bi-stream architecture in order to recognize the overall emotion. In this setup, both the content information and the emotional information are retained to achieve high accuracy with a small number of convolutional layers. Empirical evaluation on the Ekman-6 and the Emotion6 Video datasets demonstrates clear benefits of the joint approach and the complementary nature of the two streams.

The contributions of this work can be summarized as follows: (1) We propose BEAC-Net, an end-to-end trainable neural architecture that tackles emotion attribution and classification simultaneously with significant performance improvements. (2) We propose an efficient dynamic programming method for video summarization based on the output of the A-Net. (3) To establish a good benchmark for emotion attribution, we re-annotate the Ekman-6 dataset with the most emotion-salient segments, which serve as ground truth for the emotion attribution task.

II Background and Related Work

Fig. 1:

An overview of the BEAC-Net neural architecture. We extract features from every video frame from the “fc7” layer of the DeepSentiBank convolutional neural network model. The attribution network extracts one video segment that expresses the dominant emotion, which is fed to the emotion stream of the classification network. The whole video is downsampled and fed to the content stream of the classification network.

II-A Psychological Theories and Implications

Extensive research has been performed on the problem of recognizing emotions from visual information. Most work follows psychological theories that lay out a fixed number of emotion categories, such as Ekman’s six pan-cultural basic emotions [15, 16] and Plutchik’s wheel of emotion [17]. These emotions are considered to be “basic” because they are associated with prototypical and widely recognized facial expressions, verbal and non-verbal language, and have distinct appraisals, antecedent events, and physiological responses. The emotions constantly affect our expression and perception via appraisal-behavior cycles throughout our daily activities, including video production and consumption.

Recent psychological theories [18, 19, 20] suggest that the range of emotions is far more varied than prescribed by previous theories, owing to the complex interaction between emotions and other cognitive processes and to the temporal succession of emotions. For example, Nummenmaa et al. [21] identified clusters of bodily sensations shared by basic and non-basic emotions. This may have inspired computational work like DeepSentiBank [22] and zero-shot emotion recognition [11], which assumed more categories than those in basic emotion theories.

Dimensional theories of emotion [23, 24, 25] characterize emotions as points in a multi-dimensional space. This direction is theoretically appealing as it allows richer emotion descriptions than the basic categories. Early work almost exclusively uses the two dimensions of valence and arousal [23], whereas more recent theories have proposed three [25] or four dimensions [24]. To date, most computational approaches that adopt the dimensional view [26, 27, 28] employ valence and arousal. Notably, [29] proposes a three-dimensional model for movie recommendation, where the dimensions include passionate vs. reflective, fast paced vs. slow paced, and high vs. low energy.

While we recognize the validity of these recent developments, in this paper we adopt the six basic emotion categories for practical reasons, as these categories provide a time-tested scheme for classification and data annotation.

Ii-B Multimodal Emotion Recognition

Various researchers have explored features for visual emotion recognition, such as features inspired by psychology and art theory [30] and shape features [31]. A classifier such as a support vector machine (SVM) or K-nearest neighbors (KNN) is then trained to distinguish the video’s emotions. Wang et al. [32] adapted a variant of SVM with various audio-visual features to divide 2040 frames of 36 Hollywood movies into 7 emotions. Jou et al. [33] focused on animated GIF files, which are similar to short video clips. Yazdani et al. [10] used KNN to classify music video clips.

Since facial expressions are important channels of emotion, many researchers have focused on recognizing emotions from faces. Joho et al. [5] paid close attention to viewers’ facial signals for detection. Zhao et al. [6] extracted viewers’ facial activities frame by frame and drew an emotional curve to segment each video into sections. Zhen et al. [7] created features by localizing facial muscular regions. Liu et al. [8] constructed the expressionlet, a mid-level representation for dynamic facial expression recognition.

Deep neural networks have also been used for visual sentiment analysis [34, 35]. A massive-scale visual sentiment dataset was proposed in SentiBank [35] and DeepSentiBank [22]. SentiBank is composed of 1,533 adjective-noun pairs, such as “happy dog” and “beautiful sky”. Subsequently, the authors applied deep convolutional neural networks (CNNs) to images of strong sentiment and achieved better performance than the earlier models.

The emotional content of videos can be recognized from visual features, audio features, and their combination. A number of works have attempted to recognize emotions and affects from speech [36, 37, 38]. [39] jointly uses speech and facial expressions. [40] extracts mid-level audio-visual features. [41] employs the visual, auditory, and textual modalities for video retrieval. [42] provides a comprehensive technique that exploits audio, facial expressions, spatial-temporal information, and mouth movements. Sparse coding [43, 44] has also proved effective for emotion recognition. For a recent survey, we refer interested readers to [45].

Most existing work on emotion understanding from video focuses on classification. As emotional content is sparsely expressed in user-generated video, the task of identifying emotional segments in the video [46, 13, 11] can assist the classification task. Noting the synergy between the two tasks, in this paper we propose a multi-task neural network that tackles both simultaneously.

II-C Emotion-oriented Video Summarization

Video summarization has been studied for more than two decades [47] and a detailed review is beyond the scope of this paper. In broad strokes, we can categorize summarization techniques into two major approaches: keyframe extraction and video skims. A large variety of video features have been exploited, including visual saliency [48], motion cues [49], mid-level features [50, 51], and semantic recognition [52].

Recently, we [11] introduced the task of emotion-oriented summarization, which aims to summarize a video according to its emotional content. Inspired by the task of semantic attribution in text analysis, the task of emotion attribution [11] is defined as attributing the video’s overall emotion to its individual segments. However, [11] still processed the emotion recognition, summarization, and attribution tasks separately. Intrinsically, emotion recognition can benefit remarkably from emotion attribution and emotion-oriented summarization, and the results of emotion attribution can provide more information for the emotion-oriented summary. Thus, our framework is designed to solve these three tasks simultaneously and mutually by introducing spatial transformer networks.

In our previous work [14], we focused only on the segment of high emotional value and neglected the other frames, which may carry content information. In this paper, the emotional segment and the entire video content are combined with different emphases.

II-D Spatial-temporal Neural Networks

The technique proposed in this paper is partially inspired by the spatial transformer network (ST-net) [53], which was first proposed for image (or feature map) classification. ST-net provides the capability for spatial transformation, which helps various tasks such as co-localization [54] and spatial attention [55]. It is fully differentiable and can transform an image or a feature map as a drop-in module with little computational overhead.

ST-net can be split into three parts: 1) Localization Network: an arbitrary function that generates the transformation parameters from the input. 2) Parameterized Sampling Grid: applying the generated parameters to build a regular grid that maps the source feature map to the target coordinates. 3) Differentiable Image Sampling: a (sub-)differentiable sampling mechanism that allows the loss gradients to flow back through the whole network.
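
To make these three parts concrete in the temporal setting used later in this paper, here is a minimal numpy sketch (our own illustrative code, not the original ST-net implementation) of a 1D sampling grid with (sub-)differentiable linear interpolation; the function name and toy sizes are our assumptions:

```python
import numpy as np

def temporal_sampler(features, alpha, beta, out_len):
    """1D analogue of ST-net sampling: crop the segment described by
    (alpha, beta) in normalized [-1, 1] coordinates and resize it to
    out_len frames with linear interpolation."""
    T = features.shape[0]
    # Parameterized sampling grid: out_len target points in [-1, 1],
    # mapped into source coordinates by the affine transform alpha * t + beta.
    target = np.linspace(-1.0, 1.0, out_len)
    source = alpha * target + beta                    # still in [-1, 1]
    # Convert normalized coordinates to frame indices in [0, T-1].
    idx = (source + 1.0) * (T - 1) / 2.0
    lo = np.clip(np.floor(idx).astype(int), 0, T - 1)
    hi = np.clip(lo + 1, 0, T - 1)
    w = idx - lo
    # (Sub-)differentiable linear interpolation between neighboring frames.
    return (1 - w)[:, None] * features[lo] + w[:, None] * features[hi]

# Toy check: 10 frames of 4-d features; alpha=0.5, beta=0 picks the middle half.
feats = np.arange(40, dtype=float).reshape(10, 4)
seg = temporal_sampler(feats, alpha=0.5, beta=0.0, out_len=5)
```

In a full autodiff framework the same interpolation lets gradients flow back into alpha and beta, which is what makes the localization network trainable.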

There are various variants of and improvements to ST-net. Singh et al. [56] adapted it for an end-to-end facial learning framework and proposed a loss function to address the problem that ST-net’s output patch may extend beyond the input boundaries. Lin et al. [57] improved upon ST-net by theoretically connecting it to the inverse compositional Lucas-Kanade algorithm, which exhibits better performance than the original ST-net in various tasks.

The attribution network in BEAC-Net can be seen as performing spatial transformation on the temporal dimension. It enables the network to identify video segments that carry emotional content, which alleviates the sparsity of emotion content in video.

BEAC-Net contains a two-stream architecture that extracts features not only from the video segment identified by the attribution network, but also from the entire video as its context. This is different from the two-stream architecture introduced by [58], which contains one convolutional stream that processes the pixels of the frames and another that processes optical flow features. [59] further generalizes this approach to 3D convolutions. By leveraging local motion information from optical flow, these approaches are effective at activity recognition. Optical flow features are not used in this paper, though we hypothesize they could lead to further improvements.

III The Emotion Attribution-Classification Network

The BEAC-Net architecture is an end-to-end multi-task network that naturally handles the emotion attribution task and the emotion classification task. In this section, we describe its two constituents: the emotion attribution network (A-Net) and the emotion classification network (C-Net). The former extracts a segment from the video that contains emotional content, whereas the latter classifies the video into an emotion by using the extracted segment together with its context. Each input video is represented by features extracted using the DeepSentiBank network [22]. Fig. 1 provides an illustration of the network architecture.

III-A Feature Extraction

We extract video features using the deep convolutional network provided by [22], which classifies images into adjective-noun pairs (ANPs). Each ANP is a concept consisting of an adjective followed by a noun, such as “creepy house” and “dark places”. The network was trained to classify images into ANPs. The network in [22] contains five convolutional layers and three fully connected layers. We take the 4096-dimensional activation from the second fully connected layer, labeled “fc7”.

The classification into ANPs can be considered as the joint recognition of objects and the emotions associated with them. Indeed, [22] reports that initializing with weights trained solely for object recognition on ImageNet substantially boosts performance. As a result, we believe the features extracted by this network retain both object and affective information from the images.

Formally, let us denote the whole dataset as D = {(X_i, y_i, g_i)}, where X_i denotes the i-th video, y_i denotes its emotion label, and g_i denotes the supervision on emotion attribution, which is explained in the next section. The frames of X_i are denoted as x_1, ..., x_T. Let φ(·) be the feature extraction function. The video features are represented as φ(x_1), ..., φ(x_T).
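
A minimal sketch of this feature pipeline (illustrative only: `extract_fc7` below is a hypothetical stand-in using a random projection plus ReLU, not the actual DeepSentiBank network, and the toy frame size is our assumption):

```python
import numpy as np

def extract_fc7(frame, W):
    """Stand-in for phi(.): maps one frame to a 4096-d nonnegative vector,
    mimicking the shape and ReLU nonlinearity of an fc7 activation."""
    return np.maximum(W @ frame.ravel(), 0.0)

def video_features(frames, W):
    # Stack per-frame features into a T x 4096 matrix, the representation
    # consumed by the rest of BEAC-Net.
    return np.stack([extract_fc7(f, W) for f in frames])

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 3 * 8 * 8)) * 0.01   # toy 8x8 RGB frames
video = rng.standard_normal((5, 3, 8, 8))           # 5 frames
X = video_features(video, W)                        # shape (5, 4096)
```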

III-B The Attribution Network

The emotion attribution task is to identify the video frames responsible for a particular emotion. The attribution network learns to select one continuous segment of frames that contains the main emotion in the original video. The network predicts two parameters that are sufficient for selecting any continuous video segment and that, at the same time, stay in a fixed range between -1 and 1. This formulation simplifies training and inference.

Formally, the indices of frames are in the range [1, T]. We let the indices be continuous due to the possibility of interpolation, and normalize them to [-1, 1] via τ(t) = 2(t - 1)/(T - 1) - 1. For any given starting frame t_s and ending frame t_e (t_s < t_e), we define a function f that maps t_s to -1 and t_e to 1 and is parameterized by (α, β), which are defined by

α = (τ(t_e) - τ(t_s)) / 2,   β = (τ(t_s) + τ(t_e)) / 2.   (1)

Obviously, α ∈ (0, 1] and β ∈ (-1, 1). We can define the function f as:

f(t) = (τ(t) - β) / α.

Therefore, the attribution network only needs to predict (α, β). Two fully connected layers project the video features to the pair (α, β). The network utilizes the following square loss function for regression, which computes the difference between the output (α, β) of the network and the externally supplied supervision g = (α*, β*):

L_att = (α - α*)² + (β - β*)².

In order to provide a solution to the emotion attribution task, we only need to perform the inverse operation of Eq. 1 and recover the start time and end time from the regression output (α, β):

τ(t_s) = β - α,   τ(t_e) = β + α.

When selecting frames, t_s and t_e are rounded to the nearest integer.
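
Under one consistent choice of this parameterization (normalizing frame indices to [-1, 1]; the paper's exact normalization may differ, so treat this as an illustrative sketch), the mapping between a segment (t_s, t_e) and the regression targets (α, β), and its inverse, can be written as:

```python
def segment_to_params(t_s, t_e, T):
    """Map a frame segment [t_s, t_e] (1-indexed) to (alpha, beta):
    alpha is the half-width and beta the center of the segment in
    normalized [-1, 1] time."""
    tau = lambda t: 2.0 * (t - 1) / (T - 1) - 1.0
    alpha = (tau(t_e) - tau(t_s)) / 2.0
    beta = (tau(t_s) + tau(t_e)) / 2.0
    return alpha, beta

def params_to_segment(alpha, beta, T):
    """Inverse operation: recover (t_s, t_e) from (alpha, beta),
    rounding to the nearest frame index."""
    inv = lambda tau: 1 + (T - 1) * (tau + 1) / 2.0
    return round(inv(beta - alpha)), round(inv(beta + alpha))

a, b = segment_to_params(26, 75, 100)   # middle half of a 100-frame video
```

Both parameters stay in [-1, 1] regardless of the segment, which is what makes them convenient regression targets.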

III-C The Classification Network

The emotion classification task, as the name implies, categorizes the video as one of the six emotions. We propose a novel two-stream neural architecture that employs the emotion segment selected by the attribution network in combination with the original video. This architecture allows us to focus on the dominant emotion while also considering the context in which it appears. It may also be seen as a variant of a multi-resolution network, where we apply coarse resolution to the entire video and fine resolution to the emotional segment. We call the two streams the emotion stream and the content stream, respectively.

The two streams have symmetrical architectures, each containing one convolution layer before converging into two fully connected layers. In the content stream, the 100 frames of the original video are compressed to 20 frames, resulting in an input matrix of size 20 × 4096, which is identical to the input of the emotion stream. As we use the DeepSentiBank model to extract features for individual frames, the convolution layer can be seen as learning features over the temporal dimension. The two streams share the same parameters for the convolutional layer, which accelerates training remarkably. Meanwhile, the interaction of the two streams is the critical contributor to high accuracy. The final output of the network comes from a softmax layer.
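
The bi-stream forward pass can be sketched as follows (a toy numpy version with random weights and a reduced feature dimension; apart from the 8 shared temporal filters, 20-frame inputs, and 6 output classes, all sizes are our assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bi_stream_forward(emotion_x, content_x, K, W1, W2):
    """Toy forward pass: both 20-frame streams pass through the SAME
    temporal convolution K (shared parameters), are concatenated, and go
    through two fully connected layers followed by a softmax."""
    def stream(X):                              # X: (20, d) frame features
        k = K.shape[1]                          # temporal filter width
        # 8 filters slid along the temporal axis; each response sums over
        # the feature dimension (a simple temporal convolution).
        return np.array([[(K[f, :, None] * X[t:t + k]).sum()
                          for t in range(X.shape[0] - k + 1)]
                         for f in range(K.shape[0])]).ravel()
    h = np.concatenate([stream(emotion_x), stream(content_x)])
    h = np.maximum(W1 @ h, 0.0)                 # FC + ReLU
    return softmax(W2 @ h)                      # emotion probabilities

rng = np.random.default_rng(0)
d = 16                                          # stand-in for 4096-d fc7 features
K = rng.standard_normal((8, 3)) * 0.1           # 8 shared temporal filters
W1 = rng.standard_normal((32, 2 * 8 * 18)) * 0.1
W2 = rng.standard_normal((6, 32)) * 0.1         # 6 emotion classes
p = bi_stream_forward(rng.standard_normal((20, d)),
                      rng.standard_normal((20, d)), K, W1, W2)
```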

The classification network adopts the standard cross-entropy loss. For a C-class classification, the loss function is written as

L_cls = -Σ_{c=1}^{C} y_c log p_c,

where y is a ground-truth one-hot vector and p is the output of the softmax classification.

III-D Joint Training

In order to stabilize optimization, for every data point i, we introduce an indicator function δ_i = 1[tIoU_i > η], where tIoU_i denotes the temporal intersection over union (tIoU) between the A-Net’s prediction and the ground truth g_i, and η is a predefined threshold. That is, the indicator function returns 1 if and only if the attribution network is sufficiently accurate for data point i.

We combine the standard cross-entropy classification loss and the attribution regression loss to create the final loss function

L_i = δ_i · L_cls + (1 - δ_i) · L_att.
In plain words, although the A-Net and C-Net are trained jointly, gradients from the classification loss are backpropagated only when the attribution network is accurate enough. Otherwise, only the gradients from the attribution loss are backpropagated and the parameters of the C-Net remain unchanged. For each data point, the network focuses on training either the A-Net or the C-Net, but not both. We find this stabilizes training and improves performance.
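
A minimal sketch of this gating scheme (our illustrative code; segments are (start, end) pairs and `eta` plays the role of the threshold):

```python
def t_iou(pred, gt):
    """Temporal intersection-over-union between two segments (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def joint_loss(cls_loss, att_loss, pred_seg, gt_seg, eta=0.6):
    """Per-sample gating: keep the classification loss only when the
    A-Net's segment is accurate enough (tIoU > eta), else keep the
    attribution loss, so each sample trains one sub-network at a time."""
    delta = 1.0 if t_iou(pred_seg, gt_seg) > eta else 0.0
    return delta * cls_loss + (1.0 - delta) * att_loss
```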

III-E Emotion-Oriented Summarization

In this section, based on the output of the emotion attribution network, we formulate the emotion-oriented summarization problem as a constrained optimization. The summarization aims to maintain continuity between selected video frames while selecting as few frames as possible and focusing on the emotional content. This problem can be efficiently solved by the MINMAX dynamic programming method [60].

The emotion-oriented summarization problem can be formally stated as follows. From a video containing T frames x_1, ..., x_T, we want to select a subset of m frames x_{s_1}, ..., x_{s_m}. Here m is not a predefined constant but is determined by the optimization. We want to minimize the sum of the individual frames’ costs

Σ_{j=1}^{m} c(s_j),

subject to the following constraints.

  • s_1 = 1 and s_m = T. Always select the first and the last frames.

  • s_{j+1} - s_j ≤ K. The frames are not too spaced out. The constant K is the maximum index difference for adjacent summary frames.

  • d(φ(x_u), φ(x_{u+1})) ≤ σ for all s_j ≤ u < s_{j+1}, where d is the Euclidean distance. In words, there is no large feature-space discontinuity (greater than σ) between x_{s_j} and x_{s_{j+1}} in the video.

In other words, we minimize the total cost by selecting fewer frames, but we must also make sure removing a frame does not create a large gap in feature space. This avoids discontinuities that would disrupt the viewing experience.

Based on the emotional segment [t_s, t_e] identified by the A-Net, we encourage the inclusion of emotional frames in the summary by assigning a lower cost c(t) to frames inside [t_s, t_e] than to frames outside it.
We present a solution to the problem using the MINMAX dynamic programming technique [60]. We measure the discontinuity between adjacent summary frames x_{s_j} and x_{s_{j+1}} as

D(s_j, s_{j+1}) = max_{s_j ≤ u < s_{j+1}} d(φ(x_u), φ(x_{u+1})),

where d(·,·) denotes the Euclidean distance between the features φ(x_u) and φ(x_{u+1}). The rate of this sequence segment is

R(s_j, s_{j+1}) = c(s_{j+1}) if D(s_j, s_{j+1}) ≤ σ, and ∞ otherwise.

This requires the discontinuity in every segment to be smaller than the maximum allowable amount σ. If the sequence segment has admissible discontinuity, the cost of the segment is represented by the cost of the summary frame.

Using the dynamic programming technique, we define the quantity G_k(t) as the minimum cost of a summary in which k frames have been selected so far and t is the most recently selected frame. The recurrence equation is given by

G_{k+1}(t) = min_{t-K ≤ v < t} [ G_k(v) + R(v, t) ],

for all t. To obtain the optimal solution, we find the minimum value min_k G_k(T), because we must include the last frame of the original video. The whole sequence is found by tracing the intermediate minima back to the first frame. It is easy to see that the time complexity of the algorithm is O(MTK), where M is the maximum number of frames that the video summary can have.
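
The recurrence can be sketched as follows (our illustrative implementation; variable names, the 0-indexed frames, and the uniform cost used in testing are assumptions):

```python
import numpy as np

def summarize(feats, cost, K, sigma, M):
    """MINMAX-style dynamic program: pick a subset of frames (first and
    last included) minimizing total cost, with adjacent picks at most K
    apart and no consecutive-frame feature jump larger than sigma in any
    spanned stretch. G[k][t] = min cost of picking k frames ending at t."""
    T = len(feats)
    gap = np.linalg.norm(np.diff(feats, axis=0), axis=1)   # d(x_u, x_{u+1})

    def rate(v, t):     # segment rate: inf if discontinuity inadmissible
        return cost[t] if gap[v:t].max() <= sigma else np.inf

    G = np.full((M + 1, T), np.inf)
    back = np.zeros((M + 1, T), dtype=int)
    G[1, 0] = cost[0]                                      # always pick frame 0
    for k in range(1, M):
        for t in range(1, T):
            for v in range(max(0, t - K), t):
                r = rate(v, t)
                if G[k, v] + r < G[k + 1, t]:
                    G[k + 1, t] = G[k, v] + r
                    back[k + 1, t] = v
    k = int(np.argmin(G[:, T - 1]))            # must include the last frame
    sel, t = [], T - 1
    while k >= 1:                              # trace back to the first frame
        sel.append(t)
        t, k = back[k, t], k - 1
    return sel[::-1]
```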

IV Experiments

IV-A Dataset and Preprocessing

We conduct experiments on two video emotion datasets. Since our paper addresses the problem of emotions expressed by videos and one of our tasks is emotion classification, the widely researched Ekman emotion theory, which posits six prototypical emotion expressions (anger, surprise, fear, joy, disgust and sadness), provides the most suitable classification standard. Therefore, we choose datasets based on Ekman’s six basic emotions.

The Emotion6 Video dataset. The Emotion6 dataset [61] contains 1980 images that are labeled with a distribution over the 6 basic emotions (anger, surprise, fear, joy, disgust and sadness) and a neutral category. The images do not contain facial expressions or text directly associated with emotions. We consider the emotion category with the highest probability as the dominant emotion.

For the purpose of video understanding, we create Emotion6 Video, a synthetic dataset of emotional videos using images from Emotion6. We collected an auxiliary set of neutral images from the first few seconds and the last few seconds of YouTube videos, as these frames are unlikely to contain emotions. After the frames were collected, we manually examined them and selected a subset that contains no emotions.

In order to create a video with a particular dominant emotion, we select images from Emotion6 that have the dominant emotion or from the neutral set. This allows us to create ground-truth emotion labels and duration annotations for the emotional segment. We created 600 videos for each class for a total of 3,600 videos.

The Ekman-6 dataset. The Ekman-6 dataset [11] covers the 6 basic emotions: anger, surprise, fear, joy, disgust and sadness. The total number of videos is 1637, with 282 videos in the largest category, “surprise”, and 177 videos in the smallest category, “disgust”. In this paper, we use the 1496 videos whose sizes are greater than 45 MiB. To further assess the tasks of attribution and emotion-oriented summarization, the dataset is annotated with the most emotion-salient segments. For every video, three annotators selected no more than 3 key segments that contribute the most to the overall emotion of the video. The longest overlap between two annotators was considered the ground truth. We will release these labels upon acceptance.

Preprocessing. We use the same split for both datasets, with 70% of the data used as the training set, 15% as validation, and 15% for testing. As a preprocessing step, we uniformly sample frames from each video in the Emotion6 Video dataset. Because videos in the Ekman-6 dataset are slightly longer, we also uniformly sample frames from each of its videos. Black frames are appended if a video contains fewer frames than required; fewer than 1% of the videos comprise less than 100 frames. We also create two variations of the Ekman-6 dataset. The two-class condition focuses on the two largest emotion categories, anger and surprise. The second condition employs all videos in the dataset.
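
The sampling and padding step can be sketched as follows (illustrative; `sample_frames` is our hypothetical helper name):

```python
import numpy as np

def sample_frames(video, T):
    """Uniformly sample T frames from a video (array of frames); append
    black (all-zero) frames when the video is shorter than T."""
    n = len(video)
    if n < T:
        pad = np.zeros((T - n,) + video.shape[1:], dtype=video.dtype)
        return np.concatenate([video, pad])
    idx = np.linspace(0, n - 1, T).round().astype(int)
    return video[idx]
```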

IV-B Hyperparameter Settings

We train the models for 200 epochs. Dropout is employed on all fully-connected layers. A softmax layer is employed before the cross-entropy loss. The network is optimized using Adam [62]. For each dataset, the model is trained 5 times in order to reduce variance, and the averaged performance is reported. The source code is available for download.


The convolutional layer in the classification network has 8 convolutional filters. Intuitively, the features extracted by DeepSentiBank can be regarded as generic emotion-related features; the convolution layers in the C-Net are utilized to understand the contextual information surrounding the emotional segments. These are followed by two fully connected layers. The threshold η in the loss function is set to 0.6.

Dataset SVM ITE C-Stream Unsup. E-Stream E-stream C+UnsupE Temporal Attention BEAC-Net
Emotion6 Video
Ekman-6 (two classes)
Ekman-6 (all classes)
TABLE I: Emotion recognition results.

IV-C Competing Baselines

Our model is compared against the following baseline models.

Image Transfer Encoding (ITE). Our model is compared against the state-of-the-art method, Image Transfer Encoding (ITE) [11], which uses an emotion-centric dictionary extracted from auxiliary images to encode videos. The encoding scheme has been shown to perform well in emotion recognition, attribution, and emotion-oriented summarization.

We replicated the setting of ITE as described in [11]: we cluster a set of auxiliary images into 2000 clusters using K-means on features extracted from AlexNet [63]. For each frame, we select the k clusters whose centers are closest to the frame, and the video feature vector is computed as the sum of the individual frames’ similarities to these clusters. Formally, let c_1, ..., c_2000 denote the cluster centers. The representation for the video is a 2000-dimensional vector v, computed as a summation over all frames:

v_j = Σ_t 1[c_j ∈ N_k(φ(x_t))] · s(φ(x_t), c_j),

where the indicator function equals 1 if and only if the cluster center c_j is among the k nearest clusters N_k(φ(x_t)) of frame x_t, and s(·,·) is the similarity measure. A linear SVM model is trained on v to solve the emotion recognition task. The attribution task is solved by selecting a sequence of frames whose similarities to the video-level representation are greater than a set threshold, while allowing no more than 10 frames below the threshold. The frames with maximal similarities are adopted as the summarization result.
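
A sketch of this encoding step (illustrative; the similarity measure `1/(1 + distance)` is our assumption, as the exact measure used by ITE is not restated here):

```python
import numpy as np

def ite_encode(frame_feats, centers, k):
    """ITE-style video encoding sketch: each frame contributes similarity
    mass only to its k nearest cluster centers; contributions are summed
    over frames into one video-level vector (one slot per cluster)."""
    v = np.zeros(len(centers))
    for f in frame_feats:
        dist = np.linalg.norm(centers - f, axis=1)
        nearest = np.argsort(dist)[:k]        # indices of k nearest centers
        v[nearest] += 1.0 / (1.0 + dist[nearest])
    return v
```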

Support Vector Machine (SVM). The features are used to train a linear SVM classifier on each frame of the video. The final classification label is obtained by majority vote over frames. The attribution results are obtained by selecting the longest segment classified as the same emotion, and the summarization results are obtained by selecting the frames with the highest emotion scores.

The Content Stream Only (C-Stream). For the task of video emotion classification, we perform an ablation study by removing the attribution network and the associated emotion stream from the classification network. What remains is a single-stream, conventional convolutional neural network. We report the result for emotion classification only, as this network is not capable of emotion attribution.

Supervised Emotion Stream (E-Stream). As a second ablated network, we remove the content stream from the classification network. The attribution network and the associated emotion stream are kept intact. The attribution loss is also kept as part of the loss function.

Unsupervised Emotion Stream (Unsup. E-Stream). This is a third ablated network. Similar to the E-Stream version, we remove the content stream from the classification network. In addition, we also remove the attribution loss from the loss function. The A-Net and the emotion stream are kept intact. That is, we use only the emotion stream for classification, but do not supply supervision to the attribution network.

C-Stream and Unsupervised E-Stream (C+UnsupE). This is a fourth ablated network. We use both the C-Stream and the E-Stream, but remove the attribution loss. This is equivalent to the full BEAC-Net sans the supervision signal for emotion attribution.

Temporal Attention. Given the popularity of the attention mechanism [64] in neural networks, we create a baseline using a typical attention formulation over the temporal dimension. We modify the A-Net by adding two fully connected layers with 128 hidden units and the ReLU activation function, followed by a softmax operation. The output is an attention weight a_t for every frame t in the video, such that a_t ≥ 0 and Σ_t a_t = 1. Recall that the features of frame t are denoted as φ(x_t). The final representation for the entire video is computed as the convex combination Σ_t a_t φ(x_t) and fed to the E-stream. The E-stream and C-Stream remain unchanged from the full BEAC-Net.
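
This baseline can be sketched as follows (illustrative numpy code with toy sizes and random weights; only the two FC layers with 128 hidden units, the ReLU, and the softmax follow the description above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def temporal_attention(X, W1, W2):
    """Temporal-attention baseline: two FC layers (ReLU in between) score
    each frame, a softmax turns scores into weights a_t (nonnegative,
    summing to 1), and the video representation is sum_t a_t * x_t."""
    scores = np.array([W2 @ np.maximum(W1 @ x, 0.0) for x in X])
    a = softmax(scores)                   # attention weights over frames
    return a, (a[:, None] * X).sum(0)     # weights and pooled representation

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 32))         # 20 frames of toy 32-d features
W1 = rng.standard_normal((128, 32)) * 0.1
W2 = rng.standard_normal(128) * 0.1
a, rep = temporal_attention(X, W1, W2)
```

Note the contrast with the A-Net: every frame receives a nonzero weight here, whereas the A-Net makes a hard, contiguous cut.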

IV-D Results and Discussion

Emotion recognition. We perform emotion recognition on the Emotion6 Video dataset and the Ekman-6 dataset, where Ekman-6 has two experimental conditions. Table I reports the classification accuracy; the first column of Table I reports chance-level performance.

We observe that BEAC-Net achieves the best performance among all baseline models, including all ablated versions. Compared to the previous state-of-the-art method ITE, BEAC-Net improves classification performance by 22.2%, 6.3% and 6.1%, respectively.

The three experimental conditions establish an easy-to-hard spectrum. The artificial Emotion6 Video dataset is the simplest, for which a simple SVM can achieve 80% accuracy. The full Ekman-6 with all 6 emotions is the most difficult. It is worth noting that the effectiveness of the bi-stream architecture is the most obvious on the most difficult full Ekman-6 dataset, leaving a 2.2% gap between BEAC-Net and the second best technique. E-Stream is almost the same as BEAC-Net on the simplest conditions, but the gap widens as the task gets more difficult.

The ablation study reveals the complementarity of all constituents of BEAC-Net. The C-Stream convolutional network underperforms BEAC-Net by 18.4%, 12.1%, and 2.2%. The E-Stream with attribution supervision underperforms by 0.2%, 1.2%, and 4.4%. Interestingly, the E-Stream beats the C-Stream on Emotion6 and two-class Ekman-6, but underperforms on the full Ekman-6 dataset. These results indicate that the two streams indeed complement each other under different conditions and their co-existence is crucial for accurate emotion recognition. The comparison between the unsupervised E-Stream and the E-Stream, as well as that between C+UnsupE and BEAC-Net, shows the benefits of the attribution supervisory signal. On average, the improvements on the three conditions are 14.1%, 1.7%, and 2.7%.

The comparison between Temporal Attention and C+UnsupE is particularly interesting due to their similarity. They differ in only two ways. First, C+UnsupE uses hard cutoffs, whereas the temporal attention baseline assigns a non-zero weight to every frame. Second, the A-Net selects a continuous video chunk, whereas temporal attention may pay attention to arbitrary frames. Therefore, this comparison can help us understand whether the proposed A-Net is better than classical attention.

The results confirm the superiority of A-Net over temporal attention. Temporal attention performs better on the synthetic dataset, Emotion6, by 2.1%. However, C+UnsupE performs substantially better on the other two experimental conditions by margins of 9.3% and 8.2%, respectively. Since Ekman-6 is a natural dataset, we consider the performances on Ekman-6 to be more realistic and more representative. This result indicates that excluding many frames in the video is beneficial, corroborating our claim that A-Net’s hard cutoff is effective in the handling of sparse emotional data.

In order to gain a better understanding of the datasets, we perform an additional experiment on transfer learning using Emotion6 and the full Ekman-6 dataset. Since the two datasets have the same emotion categories, we can train BEAC-Net on one dataset and test it on the other. The results, as shown in Table II, are below chance, suggesting the trained network weights are not transferable. We attribute this to the domain differences between the synthetic Emotion6 Video dataset and the natural Ekman-6 dataset.

Training Set → Test Set | Chance | Accuracy
Emotion6 Video → Ekman-6 | |
Ekman-6 → Emotion6 Video | |
TABLE II: Transfer learning results.
Fig. 2: Emotion attribution results. We report the mAP scores for each dataset. The horizontal axis indicates different tIoU thresholds.
Fig. 3: Attribution results over the same range of tIoU thresholds. Only the results of BEAC-Net are reported.
Fig. 4: User study results for emotion-oriented video summarization. "Average" scores are the average of the other four scores.

Emotion Attribution. We report the results on emotion attribution. Here the comparison baselines include ITE, SVM and Unsupervised E-Stream. For SVM, the longest majority-voted video emotion segment is considered to be the extracted emotional segment.

We use mean average precision (mAP) to evaluate the performance of emotion attribution [65]. To calculate the mAP, we must determine whether each prediction is a true positive. Specifically, the overlap between the predicted video segment and the ground-truth segment is first calculated. This overlap is quantified by the temporal intersection over union (tIoU) score of the predicted and ground-truth segments. The prediction is marked as positive if the overlap is greater than a threshold. The three experimental conditions, Emotion6 Video, two-class Ekman-6, and all-class Ekman-6, are the same as previously. Fig. 2 shows the results, where the horizontal axis indicates different tIoU thresholds.
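The tIoU test described above can be written compactly; the segment representation as (start, end) pairs and the function names are generic, not taken from the paper's evaluation code.

```python
def temporal_iou(pred, gt):
    """Temporal intersection-over-union of two segments given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold):
    """A prediction is marked positive when its tIoU exceeds the threshold."""
    return temporal_iou(pred, gt) > threshold
```

For example, a predicted segment (0, 10) against ground truth (5, 15) overlaps by 5 out of a union of 15, giving a tIoU of 1/3: a true positive at threshold 0.3 but not at 0.5.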

Once again, we observe strong performance from BEAC-Net, which achieves the best performance in almost all conditions. This validates that the A-Net can help identify the video segments that contribute the most to the overall emotion of a video. The unsupervised E-Stream performs worse than BEAC-Net, but remains a close second in the Ekman-6 experiments.

On the Emotion6 Video dataset, BEAC-Net outperforms the other methods except at the last two tIoU thresholds, where the SVM method performs very well and stably. We hypothesize that this is because every frame in the Emotion6 Video dataset, which is created from an image dataset, has a definite label, while the SVM method is fundamentally based on classifying individual images. On the Ekman-6 dataset, the supervision has been labeled for video segments instead of individual frames. Thus, not every frame in the emotional segment necessarily expresses the emotion. This is likely a reason why the frame-based SVM performs poorly in those conditions.

On the two-class Ekman-6 condition, BEAC-Net comfortably beats the rest, except for the very first tIoU setting. On the full Ekman-6 condition, BEAC-Net still outperforms the baselines, but the performance gap is smaller. This is consistent with our observation that the full Ekman-6 is the most difficult dataset.

We also observe a remarkable disparity in the performance of our network across the two datasets of varying complexity. Fig. 3 demonstrates this result. We observe that once the mAP drops below 0.9, it becomes extremely sensitive to the tIoU threshold in all three curves. This is also a suitable range in which to compare the performance of different approaches. Thus, for datasets of varying complexity, it is necessary to adjust the tIoU threshold when comparing performance; it is hard to set one constant tIoU threshold that fits all situations.

Error Analysis.

To further understand the proposed technique, we quantitatively analyze the relation between the presence of human faces and the emotion classification accuracy of BEAC-Net. First, we detect the presence of faces in the test set of the full Ekman-6 dataset, which consists of 218 videos, using a highly accurate face detection algorithm (https://github.com/ageitgey/face_recognition), which achieves 99.28% accuracy on the Labeled Faces in the Wild dataset [66]. Next, we compute the proportion of frames that contain faces for each video. The results are shown in Table III. BEAC-Net performs better when a higher percentage of frames do not contain faces.
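As a sketch, given per-frame detector outputs, the per-video statistic and the grouping behind Table III can be computed as follows. The detector itself is not shown, and the bucket edges here are illustrative only; the paper's exact bins are those listed in Table III.

```python
from collections import defaultdict

def face_frame_proportion(frame_has_face):
    """Fraction of frames WITHOUT a detected face, given a per-frame
    boolean sequence produced by any face detector."""
    n = len(frame_has_face)
    return sum(not f for f in frame_has_face) / n if n else 0.0

def accuracy_by_bucket(no_face_props, correct, edges=(0.2, 0.4, 0.6, 0.8)):
    """Group videos by their no-face proportion and report classification
    accuracy per bucket (edges are assumed, not from the paper)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p, ok in zip(no_face_props, correct):
        b = sum(p >= e for e in edges)  # bucket index 0 .. len(edges)
        totals[b] += 1
        hits[b] += int(ok)
    return {b: hits[b] / totals[b] for b in totals}
```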

The results suggest that BEAC-Net does not focus solely on facial expressions. This is consistent with the task being investigated, which is about detecting the overall emotion of videos, rather than only faces or people. However, we do recognize that faces are a major source of emotional expressions; about 39% of test videos contain 40% or more frames with at least one human face. Adding a dedicated module for facial emotion recognition could further improve performance and would be a future research direction.

Frames without Faces | Data Proportion | Classification Accuracy
TABLE III: Classification accuracy for different proportions of frames without a human face in the Ekman-6 dataset.
Fig. 5: Demonstration of typical failure cases. We manually extract four key frames from each original video and from the attribution result of BEAC-Net.

We also manually examined a few failure cases. Fig. 5 shows the key frames from three videos where BEAC-Net failed on both the emotion attribution and classification tasks. The first case is predicted as disgust, but the ground-truth label is joy. Upon close inspection, the video shows an actress fooling people by pretending to vomit. People around her soon realize the joke and begin to laugh. This video is difficult because it involves humor, whose recognition likely involves a rapid succession of several emotional and cognitive processes [19]. In general, we believe emotional dynamics will pose a difficult challenge for computational recognition because of the limited theoretical understanding of this topic as well as the lack of relevant datasets. In the second case, the emotion is mainly expressed by a sequence of severe car crashes, a semantic inference that BEAC-Net was unable to make. In the third case, BEAC-Net did not recognize the strong emotions exhibited by the facial expressions.

Fig. 6: The qualitative results of emotion-oriented video summarization. We compare several baselines.

Emotion-oriented Summarization. To quantitatively evaluate the video summaries, we carry out a user study in which ten human participants viewed and rated summaries of twelve videos. To test the methods under different conditions, we compare video summaries containing 3 and 6 frames, respectively. The 12 videos were randomly assigned to the 3-frame and the 6-frame conditions. As the video summarization task differs from the other two tasks, we depart from the previous experiments and compare against the three baseline techniques below.

Uniform: uniformly sample the frames/clips from the videos;

SVM: the video is summarized using the SVM prediction scores; in practice, the frames with the top-3 or top-6 scores for each label are selected.

ITE: we use the summarization method based on ITE, as described in [11].
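The Uniform and SVM baselines above reduce to simple index selection; a minimal sketch, where the function names are ours and `scores` stands in for per-frame SVM decision values:

```python
import numpy as np

def uniform_summary(n_frames, k):
    """Uniform baseline: k evenly spaced frame indices across the video."""
    return np.linspace(0, n_frames - 1, k).round().astype(int).tolist()

def top_score_summary(scores, k):
    """Score-based baseline: the k highest-scoring frames (e.g. by SVM
    decision values for the predicted label), kept in temporal order."""
    idx = np.argsort(scores)[-k:]
    return sorted(idx.tolist())
```

For a 10-frame video and k = 3, the uniform baseline picks the first, middle, and last frames, while the score-based baseline ignores position entirely.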

We recruited ten volunteers for the user study; they were kept blind to the summarization techniques. We showed summary videos from all methods to the participants and asked them to rate the quality of each summary on a five-point Likert scale along the following four criteria [11]:

Accuracy: does the summary accurately describe the main content of the original video?

Coverage: how much of the visual content is covered in the summary?

Quality: what is the overall subjective quality of the summary?

Emotion: how much of the original video's emotion does the summary convey?

Fig. 4 shows the results from the user study, where the average column reports the average rating across the four questions. On four out of the five measures (including the average), our method outperforms all baseline methods. The largest performance gap, 0.83, appears on the emotion criterion, suggesting our summaries cover emotional content substantially better than other methods. On the quality criterion, we perform slightly worse than the uniform method, but the gap is a mere 0.07.

The qualitative results, shown in Figure 6, exemplify two advantages of our summarization method: (1) Coverage of sparsely positioned emotional expressions. Figure 6 b) contains an illuminating example. The original video contains an interview of two young lovers recalling their love stories. The vast majority of the video shows the couple sitting on a couch being interviewed, as shown in the uniform row; emotional expressions are sparse and widely dispersed. Our model accurately chose multiple memory flashbacks for the summary, while the other methods give priority to the interview shots. (2) Capturing the main emotion segment. Benefiting from the results of the attribution framework, our summarization method focuses on clips that contain the main emotion of the video. Figure 6 a) shows a video with mainly angry content, and the video summary created by our method shows the fighting scenes. Figure 6 c) shows a video with sadness; our summary captures not only the crying but also the cause of the sadness, a photo of a murdered child.

V Conclusions

Computational understanding of emotions in user-generated video content is a challenging task because of the sparsity of emotional content, the presence of multiple emotions, and the variable quality of user-generated video. We suggest that the ability to locate emotional content plays an important role in accurate video understanding.

Toward this end, we present a multi-task neural network with a novel bi-stream architecture, the Bi-stream Emotion Attribution-Classification Network (BEAC-Net), which is end-to-end trainable and solves emotion attribution and recognition simultaneously. The attribution network locates the emotional content, which is processed in parallel with the original video by the two streams. Empirical evidence shows that the bi-stream architecture provides significant benefits for emotion recognition and that the proposed emotion attribution network outperforms traditional temporal attention. The results corroborate our hypothesis that the proposed technique improves the handling of sparse emotional content. In addition, we propose a video summarization technique based on the attribution provided by BEAC-Net, which outperformed existing baselines in a user study.

Emotions play an important role in the human cognitive system and in day-to-day activities. An accurate understanding of human emotions could enable many interesting applications, such as story generation based on visual information [67]. We believe this work represents a significant step toward improved understanding of emotional content in video.


  • [1] H. K. Karthik Yadati and M. Kankanhalli, “CAVVA: Computational affective video-in-video advertising,” IEEE Transactions on Multimedia, vol. 16, no. 1, 2014.
  • [2] N. Ikizler-Cinbis and S. Sclaroff, “Web-based classifiers for human action recognition,” IEEE Transactions on Multimedia, vol. 14, pp. 1031–1045, Aug 2012.
  • [3] W. Xu, Z. Miao, X. P. Zhang, and Y. Tian, “A hierarchical spatio-temporal model for human activity recognition,” IEEE Transactions on Multimedia, vol. 19, pp. 1494–1509, July 2017.
  • [4] K. Somandepalli, N. Kumar, T. Guha, and S. S. Narayanan, “Unsupervised discovery of character dictionaries in animation movies,” IEEE Transactions on Multimedia, vol. PP, no. 99, pp. 1–1, 2017.
  • [5] H. Joho, J. M. Jose, R. Valenti, and N. Sebe, “Exploiting facial expressions for affective video summarisation,” in Proc. ACM conference on Image and Video Retrieval, 2009.
  • [6] S. Zhao, H. Yao, X. Sun, P. Xu, X. Liu, and R. Ji, “Video indexing and recommendation based on affective analysis of viewers,” in ACM MM, 201.
  • [7] Q. Zhen, D. Huang, Y. Wang, and L. Chen, “Muscular movement model-based automatic 3d/4d facial expression recognition,” IEEE Transactions on Multimedia, vol. 18, pp. 1438–1450, July 2016.
  • [8] M. Liu, S. Shan, R. Wang, and X. Chen, “Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition,” in CVPR, 2014.
  • [9] X. Alameda-Pineda, E. Ricci, Y. Yan, and N. Sebe, “Recognizing emotions from abstract paintings using non-linear matrix completion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5240–5248, 2016.
  • [10] A. Yazdani, K. Kappeler, and T. Ebrahimi, “Affective content analysis of music video clips,” in Proc. 1st ACM workshop Music information retrieval with user-centered and multimodal strategies, 2011.
  • [11] B. Xu, Y. Fu, Y.-G. Jiang, B. Li, and L. Sigal, “Heterogeneous knowledge transfer in video emotion recognition, attribution and summarization,” IEEE TAC, 2017.
  • [12] Y. Jiang, B. Xu, and X. Xue, “Predicting emotions in user-generated videos,” in AAAI, 2014.
  • [13] B. Xu, Y. Fu, Y.-G. Jiang, B. Li, and L. Sigal, “Video emotion recognition with transferred deep feature encodings,” in ICMR, 2016.
  • [14] J. Gao, Y. Fu, Y.-G. Jiang, and X. Xue, “Frame-transformer emotion classification network,” in ACM ICMR, 2017.
  • [15] P. Ekman, “Universals and cultural differences in facial expressions of emotion,” Nebraska Symposium on Motivation, vol. 19, pp. 207–284, 1972.
  • [16] P. Ekman, “Basic emotions,” in Handbook of Cognition and Emotion, 1999.
  • [17] R. Plutchik and H. Kellerman, Emotion: Theory, research and experience. Vol. 1, Theories of emotion. Academic Press, 1980.
  • [18] L. F. Barrett, “Are emotions natural kinds?,” Perspectives on Psychological Science, vol. 1, no. 1, pp. 28–58, 2006.
  • [19] B. Li, “A dynamic and dual-process theory of humor,” in The 3rd Annual Conference on Advances in Cognitive Systems, 2015.
  • [20] J. J. Gross, “Emotion regulation: Affective, cognitive, and social consequences,” Psychophysiology, vol. 39, no. 3, p. 281–291, 2002.
  • [21] L. Nummenmaa, E. Glerean, R. Hari, and J. K. Hietanen, “Bodily maps of emotions,” Proceedings of the National Academy of Sciences of the United States of America, vol. 111, no. 2, pp. 646–651, 2013.
  • [22] T. Chen, D. Borth, T. Darrell, and S.-F. Chang, “DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks,” CoRR, 2014.
  • [23] J. A. Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, no. 6, pp. 1161–1178, 1980.
  • [24] J. R. Fontaine, K. R. Scherer, E. B. Roesch, and P. C. Ellsworth, “The world of emotions is not two-dimensional,” Psychological Science, vol. 18, no. 12, pp. 1050–1057, 2007.
  • [25] H. Lövheim, “A new three-dimensional model for emotions and monoamine neurotransmitters,” Medical Hypotheses, vol. 78, no. 2, pp. 341–348, 2012.
  • [26] S. Chen, Q. Jin, J. Zhao, and S. Wang, “Multimodal multi-task learning for dimensional and continuous emotion recognition,” pp. 19–26, 2017.
  • [27] J. Huang, Y. Li, J. Tao, Z. Lian, Z. Wen, M. Yang, and J. Yi, “Continuous multimodal emotion prediction based on long short term memory recurrent neural network,” pp. 11–18, 2017.
  • [28] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, “Liris-accede: A video database for affective content analysis,” IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 43–55, 2015.
  • [29] S. Benini, L. Canini, and R. Leonardi, “A connotative space for supporting movie affective recommendation,” IEEE Transactions on Multimedia, vol. 13, no. 6, pp. 1356–1370, 2011.
  • [30] J. Machajdik and A. Hanbury, “Affective image classification using features inspired by psychology and art theory,” in ACM MM, 2010.
  • [31] X. Lu, P. Suryanarayan, R. B. Adams, J. Li, M. G. Newman, and J. Z. Wang, “On shape and the computability of emotions,” in ACM MM, 2012.
  • [32] H.-L. Wang and L.-F. Cheong, “Affective understanding in film,” IEEE TCSVT, 2006.
  • [33] B. Jou, S. Bhattacharya, and S.-F. Chang, “Predicting viewer perceived emotions in animated gifs,” in ACM MM, 2014.
  • [34] Q. You, J. Luo, H. Jin, and J. Yang, “Robust image sentiment analysis using progressively trained and domain transferred deep networks,” in AAAI, 2015.
  • [35] D. Borth, R. Ji, T. Chen, T. M. Breuel, and S.-F. Chang., “Large-scale visual sentiment ontology and detectors using adjective noun pairs,” in ACM MM, 2013.
  • [36] B. Schuller, G. Rigoll, and M. Lang, “Hidden markov model-based speech emotion recognition,” in Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. II–1, IEEE, 2003.
  • [37] Q. Mao, M. Dong, Z. Huang, and Y. Zhan, “Learning salient features for speech emotion recognition using convolutional neural networks,” IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2203–2213, 2014.
  • [38] S. Zhang, S. Zhang, T. Huang, and W. Gao, “Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching,” IEEE Transactions on Multimedia, vol. 20, no. 6, pp. 1576–1590, 2018.
  • [39] Z. Zeng, J. Tu, M. Liu, T. S. Huang, B. Pianfetti, D. Roth, and S. Levinson, “Audio-visual affect recognition,” IEEE Transactions on multimedia, vol. 9, no. 2, pp. 424–428, 2007.
  • [40] E. Acar, F. Hopfgartner, and S. Albayrak, “A comprehensive study on mid-level representation and ensemble learning for emotional analysis of video material,” Multimedia Tools and Applications, pp. 1–29, 2016.
  • [41] L. Pang, S. Zhu, and C.-W. Ngo, “Deep multimodal learning for affective analysis and retrieval,” IEEE Transactions on Multimedia, vol. 17, no. 11, 2015.
  • [42] S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, Ç. Gülçehre, R. Memisevic, P. Vincent, A. Courville, Y. Bengio, R. C. Ferrari, et al., “Combining modality specific deep neural networks for emotion recognition in video,” in Proceedings of the 15th ACM on International conference on multimodal interaction, pp. 543–550, ACM, 2013.
  • [43] W. Hu, X. Ding, B. Li, J. Wang, Y. Gao, F. Wang, and S. Maybank, “Multi-perspective cost-sensitive context-aware multi-instance sparse coding and its application to sensitive video recognition,” IEEE Transactions on Multimedia, vol. 18, no. 1, 2016.
  • [44] Y. Song, L.-P. Morency, and R. Davis, “Learning a sparse codebook of facial and body microexpressions for emotion recognition,” in Proceedings of the 15th ACM on International conference on multimodal interaction, 2013.
  • [45] S. Wang and Q. Ji, “Video affective content analysis: a survey of state of the art methods,” IEEE TAC, vol. PP, no. 99, pp. 1–1, 2015.
  • [46] S. Arifin and P. Y. K. Cheung, “Affective level video segmentation by utilizing the pleasure-arousal-dominance information,” IEEE Transactions on Multimedia, vol. 10, no. 7, 2008.
  • [47] B. T. Truong and S. Venkatesh, “Video abstraction: A systematic review and classification,” ACM TOMM, vol. 3, no. 1, pp. 79–82, 2007.
  • [48] Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li, “A user attention model for video summarization,” in ACM MM, 2002.
  • [49] J.-L. Lai and Y. Yi, “Key frame extraction based on visual attention model,” Journal of Visual Communication and Image Representation, vol. 23, no. 1, pp. 114–125, 2012.
  • [50] M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, and T.-S. Chua, “Event driven web video summarization by tag localization and key-shot identification,” IEEE Transactions on Multimedia, vol. 14, no. 4, pp. 975–985, 2012.
  • [51] F. Wang and C. W. Ngo, “Summarizing rushes videos by motion, object, and event understanding,” IEEE Transactions on Multimedia, vol. 14, pp. 76–87, Feb 2012.
  • [52] X. Wang, Y. Jiang, Z. Chai, Z. Gu, X. Du, and D. Wang, “Real-time summarization of user-generated videos based on semantic recognition,” in ACM MM, 2014.
  • [53] M. Jaderberg, K. Simonyan, A. Zisserman, et al., “Spatial transformer networks,” in Advances in Neural Information Processing Systems, pp. 2017–2025, 2015.
  • [54] K. K. Singh and Y. J. Lee, “End-to-end localization and ranking for relative attributes,” in European Conference on Computer Vision, pp. 753–769, Springer, 2016.
  • [55] K. Xu, J. L. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” ICML, vol. 37, pp. 2048–2057, 2015.
  • [56] K. K. Singh and Y. J. Lee, “End-to-end localization and ranking for relative attributes,” in ECCV, 2016.
  • [57] C.-H. Lin and S. Lucey, “Inverse compositional spatial transformer networks,” in CVPR, 2017.
  • [58] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, pp. 568–576, 2014.
  • [59] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in CVPR, pp. 4724–4733, IEEE, 2017.
  • [60] Z. Li, G. M. Schuster, and A. K. Katsaggelos, “Minmax optimal video summarization,” IEEE TCSVT, 2005.
  • [61] K.-C. Peng, T. Chen, A. Sadovnik, and A. Gallagher, “A mixed bag of emotions: Model, predict, and transfer emotion distributions,” in CVPR, 2015.
  • [62] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [63] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [64] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in CVPR, 2018.
  • [65] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970, 2015.
  • [66] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” Tech. Rep. 07-49, University of Massachusetts, Amherst, October 2007.
  • [67] R. Cardona-Rivera and B. Li, “Plotshot: Generating discourse-constrained stories around photos,” in Proceedings of the 12th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 2016.