Visual-Textual Emotion Analysis with Deep Coupled Video and Danmu Neural Networks

11/19/2018 ∙ by Chenchen Li, et al. ∙ Huazhong University of Science & Technology, Shanghai Jiao Tong University, Peking University

User emotion analysis toward videos aims to automatically recognize the general emotional status of viewers from the multimedia content embedded in the online video stream. Existing works fall into two categories: 1) visual-based methods, which focus on visual content and extract a specific set of features from videos. However, it is generally hard to learn a mapping function from low-level video pixels to high-level emotion space due to great intra-class variance. 2) textual-based methods, which focus on user-generated comments associated with videos. The word representations learned by traditional linguistic approaches typically lack emotion information, and global comments usually reflect viewers' high-level understanding rather than instantaneous emotions. To address these limitations, in this paper we propose to jointly utilize video content and user-generated texts for emotion analysis. In particular, we exploit a new type of user-generated text, i.e., "danmu", which are real-time comments floating on the video and contain rich information conveying viewers' emotional opinions. To enhance the emotion discriminativeness of words in textual feature extraction, we propose Emotional Word Embedding (EWE) to learn text representations by jointly considering their semantics and emotions. Afterwards, we propose a novel visual-textual emotion analysis model with Deep Coupled Video and Danmu Neural networks (DCVDN), in which visual and textual features are synchronously extracted and fused to form a comprehensive representation by deep-canonically-correlated-autoencoder-based multi-view learning. Through extensive experiments on a self-crawled real-world video-danmu dataset, we show that DCVDN significantly outperforms the state-of-the-art baselines.




1 Introduction

On some online video platforms, such as Bilibili and Youku, overlaying moving subtitles on the video playback stream has become a featured function, through which users can share feelings and express attitudes towards the content of videos while they are watching. Given an online video clip as well as its associated textual comments, visual-textual emotion analysis aims to automatically recognize the general emotional status of viewers towards the video with the help of visual information and embedded comments. A precise visual-textual emotion analysis method will promote in-depth understanding of viewers’ experience, and benefit a broad range of applications such as opinion mining [38], affective computing [13], and trailer production [16].

Existing methods for emotion analysis of online videos can be divided into two categories according to the types of input data. The first class of methods is visual-based, i.e., they take the visual content in videos as input, and perform emotion analysis based on the visual information. Typically, in visual-based methods, a specific set of low-level features is extracted from video frames to reveal their underlying emotion [24, 4, 35, 5]. However, visual-based methods exhibit the following limitations: 1) It is generally hard to learn a mapping function solely from low-level video/image pixels to high-level emotion space due to the great intra-class variance [35, 31]. 2) It is only feasible to directly apply visual-based methods to images and short videos, as the number of video features grows rapidly with video length; for longer videos, visual features need to be sampled periodically. 3) In visual-based methods, the selected visual features are more relevant to the emotion of the video content than to the emotion of viewers, which inevitably dampens their performance in user emotion analysis scenarios.

As opposed to visual-based methods, the second class of methods is textual-based: these methods take user-generated textual comments as input and extract linguistic or semantic information as features for emotion analysis [20, 37]. Based on their methodologies, existing textual-based methods can be further classified into lexicon-based methods and embedding-based methods. Traditional lexicon-based approaches do not consider the syntactic and semantic information of words, and are hence unable to achieve satisfactory performance in practice. Recently, word2vec, as a typical example of embedding-based methods, provides an effective way of modeling the semantic context of words. However, word2vec can only extract semantic proximity of words from texts, while the contextual emotional information is ignored. As a result, words with different emotions, such as happy and anger, are mapped to close vectors [28]. Moreover, it is worth noting that most textual-based methods rely on global comments (comments posted in a section below the video), which, unfortunately, can only reflect viewers’ high-level understanding of the content rather than the development of their emotions while watching.

To address the aforementioned limitations, in this paper we analyze viewers’ emotions towards online videos by utilizing a new type of textual data, known as “danmu”. Unlike traditional global comments gathered in a comment section below the video, danmus are real-time comments floating over the video picture and moving along with video playback. Viewers can watch the video while sending comments and reading other viewers’ comments simultaneously. An example of a danmu screenshot is illustrated in Figure 1. Since viewers can express their emotions without any delay, danmus act as real-time commentary subtitles and play an important role in conveying emotional opinions from the commentator to other viewers. Compared with global comments, danmus have two distinguishing characteristics: 1) Danmus are highly correlated with specific moments in video clips. 2) Danmus are generally not distributed uniformly over the whole video. In fact, the distribution pattern of danmus reflects the development of the viewers’ emotion, e.g., emotion bursts, which can greatly facilitate emotion analysis tasks.

Given danmu as the new source of data, we propose a novel visual-textual emotion analysis model, named Deep Coupled Video and Danmu Neural networks (DCVDN). DCVDN takes both video frames and associated danmus as input data and aims to construct a joint emotion-oriented representation for viewers’ emotion analysis. Specifically, for each video clip, we first perform clustering on all of its danmus according to their burst pattern. Each set of clustered danmus are aggregated into one danmu document as nearby danmus express viewers’ attitudes towards similar video content at a specific moment. In DCVDN, to overcome the limitation of emotion-unaware textual-based methods, we propose a novel textual representation learning method, called Emotional Word Embedding (EWE), to learn textual features from danmu documents. The key idea of EWE is to encode emotional information along with the semantics into each word for joint word representation learning, which is proved to be able to effectively preserve the original emotion information in texts during learning process. In addition, we also extract video features from video frames synchronized with the burst points of danmu. As viewer’s emotion can be reflected as a joint expression of both video content and danmu texts, in this work, we intend to explore the learning of highly non-linear relationships that exist among the visual and textual features. In DCVDN, a joint emotion-oriented representation is developed over the space of video and danmu, by utilizing a Deep Canonically Correlated Auto-Encoder (DCCAE) to achieve multi-view learning for emotion analysis.

It is also worth noting that each video has only one label in this work. The emotion of a video may be a mixture of several emotions at different levels, so the output of our proposed model is a probability distribution over the seven emotion classes. However, our goal is to predict the main emotion of each video.

To evaluate our proposed DCVDN, we collect video clips and their associated danmus from Bilibili, one of the most popular online video websites in China. Our video-danmu dataset consists of 4,056 video clips and 371,177 danmus, in which each example is associated with one of seven emotion classes: happy, love, anger, sad, fear, disgust, and surprise. We compare our DCVDN with 14 state-of-the-art baselines by conducting extensive experiments on the video-danmu dataset, and the results demonstrate that DCVDN achieves substantial gains over other visual-based or textual-based methods. Specifically, DCVDN outperforms both the visual-based and the textual-based baselines on Accuracy and on Precision.

Figure 1: Illustration of danmus associating with a video clip on Bilibili website.
Figure 2: The framework of DCVDN: the video and associated danmus are clustered based on danmus’ burst pattern, video segments and danmu documents are synchronized in time, visual and textual features are extracted respectively by CNN and EWE, and finally joint representations are learned by DCCAE for classification.

2 Related Work

Among textual-based methods for emotion analysis, lexicons [20, 11] have been widely used due to their simplicity. However, lexicon-based methods cannot exploit the relationship between words. Recently, distributed representations of words have emerged and successfully proliferated in language models and NLP tasks [18, 37]; they encode both syntactic and semantic information of words into low-dimensional vectors to quantify and categorize semantic similarities between words. Most word embedding models represent each word using a single vector, making them indiscriminative under different emotion circumstances. Aware of this limitation, some multi-prototype vector space models have been proposed [17, 30, 23]. [17] uses latent topic models to discriminate word representations by jointly considering words and their contexts. [23] uses a mix of unsupervised and supervised techniques to learn word vectors capturing semantic term–document information as well as rich sentiment content. In contrast to existing works, our EWE first uses Latent Dirichlet Allocation (LDA) [2] to infer emotion labels and then incorporates them along with word context in representation learning to differentiate words under different emotional and semantic contexts. There are also various topic models for sentiment analysis [15, 14]. [15] proposes a probabilistic modeling framework based on LDA to detect sentiment and topic simultaneously from text. [14] observes that sentiments depend on local context, relaxes the sentiment independence assumption, and models the sentiments of words as a Markov chain.

There is also a large body of work on visual sentiment analysis. For example, [24, 20] use low-level image properties, including pixel-level color histograms and Scale-Invariant Feature Transform (SIFT), as the features to predict the emotion of images. [4, 38] employ middle-level features, such as visual entities and attributes, as the features for emotion analysis. [35, 36] utilize Convolutional Neural Networks (CNNs) to extract high-level features through a series of nonlinear transforms, which have been shown to surpass models with low-level and mid-level features [36]. [34] argues that local areas are highly relevant to a human’s emotional response to the whole image, and proposes a model that uses an attention mechanism to jointly discover relevant local regions and build a sentiment classifier on top of these regions. [8] presents a deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text.

To combine visual and textual information, recent years have witnessed some preliminary efforts on multimodal models. For example, [37, 20] employ both text and images for sentiment analysis. [20] employs a Deep Boltzmann Machine (DBM) to fuse features from audio-visual and textual modalities, while [37] employs cross-modality consistent regression. Moreover, Deep Neural Network (DNN) based approaches [20] are generally used for multi-view representation fusion. Prior works have shown the benefits of multi-view methods for emotion analysis [37]. Going one step further, we employ DCCAE [33], which combines autoencoders and canonical correlation for unsupervised representation learning by jointly optimizing the reconstruction errors minus the canonical correlation between the features extracted in multiple views. An autoencoder is a useful tool for representation learning, in which the objective is to learn, via unsupervised learning, a compact representation that best reconstructs the inputs [19].

[12] introduces an encoder-decoder pipeline that learns (a) a multimodal joint embedding space with images and text and (b) a language model for decoding distributed representations from that space. Canonical correlation analysis (CCA) [10] can maximize the mutual information between different modalities and has been justified in many previous works [9, 22, 27]. [22] uses CCA to learn the correlations between visual features and textual features for image retrieval. [27] uses a variant of CCA to learn a mapping between textual words and visual words. Although multi-view methods have been studied extensively, there exist only a few works on emotion analysis [37, 32, 7]. For example, [32] proposes a Cross-media Bag-of-words Model (CBM) for microblog sentiment analysis, which represents the text and image of a Weibo tweet as a unified bag-of-words representation.

3 Visual-Textual Emotion Analysis

In this section, we discuss the proposed DCVDN in detail. We first provide a model overview and then introduce video and danmu preprocessing, EWE, DCCAE and classification in the subsequent subsections, respectively. Each video has exactly one label, so we are solving a single-label classification problem.

3.1 DCVDN Overview

Figure 2 depicts the framework of DCVDN, which consists of three modules: preprocessing and feature extraction, multi-view representations learning, and classification.

The first module preprocesses the inputs and extracts visual and textual features. We observe that, for each video clip, danmus are likely to burst around some key video frames. The distribution of danmus usually reflects the emotion evolution of the viewers, and nearby danmus are more likely to express emotions towards the same video content. Therefore, for each video clip, we cluster all of its associated danmus according to their burst pattern. Using the clustering results, we aggregate the danmus in each cluster into one document, since it is more effective to analyze longer documents than shorter ones, which are typically ambiguous in both semantics and emotion. Afterwards, we aim to learn emotion-aware word embeddings and document embeddings for each word and each danmu document, respectively. To this end, we propose EWE, which combines semantic and emotional information of each word to enhance emotion discriminativeness. For videos, we select the frames corresponding to the burst points of danmus and focus on feature extraction from those selected frames, which are more important and relevant than others in invoking viewers’ emotion bursts. We apply a pre-trained Convolutional Neural Network (CNN) (e.g., VGGNet [26]) to extract features of the video frames, as CNNs have been shown to achieve state-of-the-art performance on sentiment analysis [36]. These danmu document embeddings and CNN features are then fed into the following DCCAE for further joint representation learning.

The second module performs multi-view representation learning for information fusion between video and danmu. Danmu documents correlate directly and strongly with viewers’ emotion, while video frames provide robust background information with appropriate guidance. In DCVDN, for each pair of danmu document and corresponding video frame, we employ DCCAE to learn a multi-view representation in an unsupervised way. The set of obtained multi-view representations is fed into the following classification module as the input features. From an implementation point of view, performing unsupervised joint representation learning ahead of supervised classification avoids complicated end-to-end model training, which effectively facilitates the convergence of the training process in practice.

The last module refers to the classification task. It is clear that for each video clip, the multi-view representations output from the second module are still in time series, each corresponding to a clustered time period in the video. Hence, Long Short-Term Memory (LSTM) is adopted to address the time dependency across those features. The output of LSTM is treated as the ultimate emotion-aware embedding for each video clip, and eventually fed into softmax function to obtain the target emotion prediction.

3.2 Preprocessing and Feature Extraction

In this subsection, we discuss the preprocessing and feature extraction methods for danmus and videos in detail. The whole process is also shown in Algorithm 1.

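Input: Videos $\mathcal{V} = \{v_1, \dots, v_N\}$, Danmus $\{D_1, \dots, D_N\}$.
Output: visual features $X$, textual features $Y$.
for $i = 1$ to $N$ do
       $C_{i1}, \dots, C_{iK}$ = K-means($D_i$);
       for $k = 1$ to $K$ do
             $d_{ik}$ = Aggregate($C_{ik}$);
word emotion labels, emotion distributions = eLDA(all danmu documents)
emotional word embeddings = EWE(all danmu documents, word emotion labels)
for $i = 1$ to $N$ do
       for $k = 1$ to $K$ do
             $x_{ik}$ = CNN(key frame of $v_i$ at burst point $\mu_{ik}$);
             $y_{ik}$ = DocumentEmbedding($d_{ik}$);
$X$ = $\{x_{ik}\}$;
$Y$ = $\{y_{ik}\}$;
Algorithm 1 Preprocessing and Feature Extraction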

3.2.1 Preprocessing and Feature Extraction on Danmu

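As mentioned above, danmu is a kind of timely user-generated comment with a non-uniform distribution over the entire video. The distribution of danmus reflects user engagement with the video content, and the video content at burst points of danmus is typically more attractive to viewers than other parts. Aware of this phenomenon, we apply the K-means algorithm to segment danmus into a set of clusters according to their burst pattern and aggregate all danmus in the same cluster into a danmu document. Formally, consider a dataset with a total of $N$ videos, denoted by $\mathcal{V} = \{v_1, \dots, v_N\}$. Each $v_i$ is associated with a collection of danmus, denoted by $D_i = \{(s_{ij}, t_{ij})\}_{j=1}^{M_i}$, where $s_{ij}$ represents the text of the $j$-th danmu for $v_i$, $t_{ij}$ represents the emergence moment of $s_{ij}$ relative to the beginning of $v_i$ ($t_{ij} \le t_{ij'}$ if $j < j'$), and $M_i$ is the total number of danmus in $v_i$. For each $v_i$, we aim to find a $K$-partition $\{C_{i1}, \dots, C_{iK}\}$ satisfying

$$\min_{C_{i1}, \dots, C_{iK}} \sum_{k=1}^{K} \sum_{(s_{ij}, t_{ij}) \in C_{ik}} \left(t_{ij} - \mu_{ik}\right)^2,$$

where $\bigcup_{k=1}^{K} C_{ik} = D_i$ and, for each cluster $C_{ik}$, we have

$$\mu_{ik} = \frac{1}{|C_{ik}|} \sum_{(s_{ij}, t_{ij}) \in C_{ik}} t_{ij},$$

which is the centroid of cluster $C_{ik}$ and is also treated as the burst point of cluster $C_{ik}$. Once clusters are formed, we obtain the danmu document set for $v_i$ by aggregating all danmus in the same cluster, i.e., $\mathcal{T}_i = \{d_{i1}, \dots, d_{iK}\}$, where each $d_{ik}$ corresponds to a danmu document, i.e., $d_{ik} = \{s_{ij} : (s_{ij}, t_{ij}) \in C_{ik}\}$, and the document set $\mathcal{T}$ includes all danmu documents associated with $\mathcal{V}$, i.e., $\mathcal{T} = \bigcup_{i=1}^{N} \mathcal{T}_i$.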
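The clustering step can be illustrated with a minimal sketch, assuming danmus are given as (text, time) pairs and using a plain Lloyd-style K-means on the timestamps. The function names `kmeans_1d` and `build_danmu_documents`, the random initialization, and the toy danmus are illustrative assumptions, not the authors' implementation.

```python
import random

def kmeans_1d(times, k, iters=50, seed=0):
    """Lloyd's algorithm on scalar timestamps; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = sorted(rng.sample(times, k))  # hypothetical init: k random timestamps
    labels = [0] * len(times)
    for _ in range(iters):
        # Assignment step: each timestamp goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: (t - centroids[c]) ** 2) for t in times]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for c in range(k):
            members = [t for t, lab in zip(times, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def build_danmu_documents(danmus, k):
    """danmus: list of (text, time). Returns sorted (burst_point, document) pairs."""
    times = [t for _, t in danmus]
    centroids, labels = kmeans_1d(times, k)
    docs = [[] for _ in range(k)]
    for (text, _), lab in zip(danmus, labels):
        docs[lab].append(text)
    return sorted((centroids[c], " ".join(docs[c])) for c in range(k))

danmus = [("lol", 1.0), ("haha", 1.5), ("so funny", 2.0),
          ("nooo", 30.0), ("so sad", 31.0), ("crying", 32.0)]
documents = build_danmu_documents(danmus, k=2)
```

Each returned pair holds the cluster centroid (the burst point) and the aggregated danmu document for that cluster.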

To extract textual features from danmu with enhanced emotion discriminativeness, we propose emotion-based Latent Dirichlet Allocation (eLDA), built on LDA [2], together with EWE, to first learn the emotional embedding of each word and then derive the emotional document embedding for each danmu document $d_{ik}$. We discuss the details in the next subsection.

3.2.2 Preprocessing and Feature Extraction on Video

For videos, we exploit the clustering information in danmus to select frames for visual feature extraction. Specifically, we extract the video frames corresponding to the burst points of danmu clusters in each video clip, as they are more attractive to the viewers. In this way, the video frames and danmu documents are synchronized in time. Formally, for each video $v_i$, based on the danmu cluster partition $\{C_{i1}, \dots, C_{iK}\}$, we select the key frame $I_{ik}$ at the time moment of burst point $\mu_{ik}$ to represent the basic visual content of cluster $C_{ik}$. As a result, we get a set of frames for $v_i$, i.e., $\{I_{i1}, \dots, I_{iK}\}$, in one-to-one correspondence with the danmu clusters $\{C_{i1}, \dots, C_{iK}\}$.
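As a small sketch of the synchronization step, a burst point expressed in seconds can be mapped to a frame index once the frame rate is known. The helper name, the frame rate, and the clamping to the last frame are assumptions made for illustration; the source does not describe how frames are decoded.

```python
def burst_point_to_frame_index(burst_time_s, fps, num_frames):
    """Map a burst point (seconds from video start) to the nearest valid frame index."""
    index = int(round(burst_time_s * fps))
    # Clamp so that burst points near the end still map to an existing frame.
    return max(0, min(num_frames - 1, index))

burst_points = [2.0, 31.0]  # e.g. centroids of danmu timestamp clusters
frame_indices = [
    burst_point_to_frame_index(t, fps=25.0, num_frames=750) for t in burst_points
]
```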

Previous work [36] has shown that visual features extracted by CNNs can achieve satisfactory performance for emotion analysis. Therefore, in this work, we employ a pre-trained CNN, i.e., the fc-7 layer of VGG-Net [26], for visual feature extraction from each video frame $I_{ik}$. In short, danmu texts explicitly deliver viewers’ opinions, while video frames provide supportive background information about emotion-relevant content.

3.3 Danmu Document Embedding Learning

In this subsection, we discuss the emotion-oriented embedding learning for word and danmu documents. We first introduce eLDA method to estimate the emotion label of each word, and then discuss the details of proposed EWE model, which aims to combine emotion and semantic information to learn a comprehensive and effective word embedding to facilitate viewers’ emotion analysis.

3.3.1 eLDA

LDA [2] is an unsupervised model and is commonly used to infer the topic label for words and documents. Inspired by LDA for topic analysis, in this work we exploit it to infer emotion labels by considering each danmu document as a random mixture over latent emotions, where each emotion is characterized by a distribution over words. In particular, each danmu document $d$ is represented as a multinomial distribution $\theta_d$ over a set of emotions, drawn from a Dirichlet distribution $\mathrm{Dir}(\alpha)$, and each emotion $e$ is represented as a multinomial distribution $\phi_e$ over the vocabulary, drawn from $\mathrm{Dir}(\beta)$. The generative process is formally defined as follows:

  • For each danmu document $d$, choose a multinomial distribution $\theta_d$ over the emotions from $\mathrm{Dir}(\alpha)$;

  • For each emotion $e$, choose a multinomial distribution $\phi_e$ over the words from $\mathrm{Dir}(\beta)$;

  • For each word position $n$ in document $d$,

    • Choose an emotion $e_{d,n}$ from $\mathrm{Multinomial}(\theta_d)$;

    • Choose a word $w_{d,n}$ from $\mathrm{Multinomial}(\phi_{e_{d,n}})$.
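The generative process above can be simulated with a short sketch using only the standard library. It assumes symmetric Dirichlet priors (sampled via normalized Gamma draws) and a toy vocabulary; all names and hyperparameter values are hypothetical, and this covers only the forward sampling, not the inference that eLDA performs.

```python
import random

def sample_dirichlet(alpha, size, rng):
    """Symmetric Dirichlet sample obtained by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(num_docs, doc_len, vocab, num_emotions, alpha, beta, seed=0):
    rng = random.Random(seed)
    # One word distribution phi_e per emotion, drawn from Dir(beta).
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_emotions)]
    corpus = []
    for _ in range(num_docs):
        # Emotion mixture theta_d of this document, drawn from Dir(alpha).
        theta = sample_dirichlet(alpha, num_emotions, rng)
        doc = []
        for _ in range(doc_len):
            e = sample_categorical(theta, rng)   # emotion for this position
            w = sample_categorical(phi[e], rng)  # word given that emotion
            doc.append((vocab[w], e))
        corpus.append(doc)
    return corpus

vocab = ["haha", "lol", "sad", "cry", "scary", "wow"]
corpus = generate_corpus(num_docs=3, doc_len=5, vocab=vocab,
                         num_emotions=7, alpha=1.0, beta=0.5)
```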

From an implementation perspective, in order to infer emotions effectively, we need prior knowledge of the emotional ground truth of some words. When determining the emotion of a word, if the word exists in our emotion lexicon, we use its corresponding emotion in the lexicon; otherwise, we choose the emotion according to the probabilities inferred by the model. Considering that danmu culture (sometimes called manga and anime culture) is mainly popular among youngsters, the actual emotions of words are somewhat different from the common sense encoded in existing lexicons. Therefore, it is desirable to build a new lexicon specifically for the manga and anime culture. We spent great effort constructing such a lexicon, which consists of words popular on the Internet and emoticons. Emoticons are textual portrayals of a user’s moods or facial expressions in the form of icons; for example, some emoticons represent happiness and others stand for crying. We select these words and emoticons according to their occurrence frequency in our dataset.

It is worth pointing out that we cluster the emotion distributions into a certain number of classes and treat the clustering result as the final emotion label for each word, rather than directly using the emotion with the maximal probability as adopted in TWE [17]. Specifically, suppose we obtain the emotion distribution $\pi_{ij}$ of each word $w_{ij}$ (the $j$-th word in the danmu aggregation of $v_i$) after the inference of the eLDA model. Then we use the K-means algorithm to cluster these emotion distributions, which aims to find a $K_e$-partition $\{G_1, \dots, G_{K_e}\}$ satisfying

$$\min_{G_1, \dots, G_{K_e}} \sum_{k=1}^{K_e} \sum_{\pi_{ij} \in G_k} \left\| \pi_{ij} - c_k \right\|^2,$$

where $\bigcup_{k=1}^{K_e} G_k$ covers all emotion distributions and $c_k$ is the centroid of cluster $G_k$. The new emotion label of $w_{ij}$ is $k$ if $\pi_{ij} \in G_k$. The reason for the clustering is that the number of emotion labels is generally small (7 for emotion classification tasks), and we cannot fully explore the information hidden in the distributions with so few labels. To avoid this dilemma, we re-cluster the distributions into more classes in order to make the new labels more discriminative. The new labels are later used to learn EWE.
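A minimal sketch of this re-clustering, assuming each word occurrence already carries an emotion distribution from eLDA. The toy data uses 2 emotions re-clustered into 3 labels purely for readability; the deterministic initialization from the first `k` points and all names are assumptions for illustration.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100):
    """Plain K-means on vectors; returns (centroids, labels)."""
    centroids = [list(p) for p in points[:k]]  # hypothetical init: first k points
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

# Toy emotion distributions over 2 emotions, one per word occurrence.
words = ["haha", "hmm", "sob", "lol", "meh", "cry"]
distributions = [[0.90, 0.10], [0.50, 0.50], [0.10, 0.90],
                 [0.85, 0.15], [0.55, 0.45], [0.05, 0.95]]
_, cluster_ids = kmeans(distributions, k=3)
new_labels = dict(zip(words, cluster_ids))
```

The resulting cluster index plays the role of the new, finer-grained emotion label of each word.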

3.3.2 Emotional Word and Document Embeddings

Figure 3: Skip-Gram and EWE models. Gray and blue circles represent word and emotion embeddings, respectively.

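Word embedding, which represents each word using a vector, is widely used to capture semantic information of words. The Skip-Gram model [18] is a well-known framework for word embedding, which learns word representations that are useful for predicting context words in a sliding window given the target word. Formally, given a word sequence $\{w_1, \dots, w_T\}$, the objective of Skip-Gram is to maximize the average log probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t),$$

where $c$ is the context window size for the target word, which can be a function of the centered word $w_t$. The probability $p(w_{t+j} \mid w_t)$ is defined as a softmax function as follows:

$$p(w_{t+j} \mid w_t) = \frac{\exp\left(\mathbf{w}_{t+j}^\top \mathbf{w}_t\right)}{\sum_{w \in W} \exp\left(\mathbf{w}^\top \mathbf{w}_t\right)},$$

where $\mathbf{w}_t$ and $\mathbf{w}_{t+j}$ are the word vectors of target word $w_t$ and context word $w_{t+j}$, respectively, and $W$ is the word vocabulary.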
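The two ingredients of Skip-Gram, the (target, context) pairs produced by the sliding window and the softmax probability over the vocabulary, can be sketched as follows. The toy vectors are made up, and a single vector table is shared between target and context words to mirror the formula above; practical implementations often keep two tables and approximate the softmax.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def skipgram_pairs(sequence, window):
    """All (target, context) pairs within the given window size."""
    pairs = []
    for t, target in enumerate(sequence):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(sequence):
                pairs.append((target, sequence[t + j]))
    return pairs

def context_probability(target, context, vectors):
    """p(context | target): softmax of dot products over the whole vocabulary."""
    scores = {w: dot(vec, vectors[target]) for w, vec in vectors.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {w: math.exp(s - top) for w, s in scores.items()}
    return exp_scores[context] / sum(exp_scores.values())

vectors = {"happy": [1.0, 0.0], "glad": [0.9, 0.1], "angry": [-1.0, 0.2]}
pairs = skipgram_pairs(["a", "b", "c"], window=1)
p_glad = context_probability("happy", "glad", vectors)
p_angry = context_probability("happy", "angry", vectors)
```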

Note that the Skip-Gram model for word embedding focuses on the semantic context and assumes that each word always preserves a single vector, which is sometimes indiscriminative under different emotion circumstances. These facts motivate us to propose a joint emotion and semantics learning model, named EWE. The basic idea of EWE is to preserve the emotion information of words when measuring the interaction between target word $w_t$ and context word $w_{t+j}$. In this way, a word with different associated emotions corresponds to different embeddings, which effectively enhances the emotion discriminativeness of each word.

Specifically, rather than solely using the target word $w_t$ to predict context words as in Skip-Gram, inspired by [17], EWE jointly utilizes the $e_t$’s as well, i.e., the emotions of the words in the danmu documents. EWE aims to learn the representations for words and emotions separately and simultaneously. In particular, it regards each emotion as a pseudo word and considers this pseudo word to occur at all positions where the word is assigned this emotion. EWE uses both the target word $w_t$ and its associated emotion $e_t$ to predict context words, as shown in Figure 3. For each target word $w_t$ with its emotion $e_t$, the objective of EWE is to maximize the following average log probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \left[ \log p(w_{t+j} \mid w_t) + \log p(w_{t+j} \mid e_t) \right],$$

where $p(w_{t+j} \mid e_t)$ is similar to $p(w_{t+j} \mid w_t)$:

$$p(w_{t+j} \mid e_t) = \frac{\exp\left(\mathbf{w}_{t+j}^\top \mathbf{e}_t\right)}{\sum_{w \in W} \exp\left(\mathbf{w}^\top \mathbf{e}_t\right)},$$

and $\mathbf{w}$ and $\mathbf{e}$ are the representation vectors of words and emotions, respectively. When we minimize the log loss $-\log p(w_{t+j} \mid e_t)$, we treat the emotion $e_t$ as if it were the target word $w_t$; the rest of the learning process for emotion embeddings is the same as for word embeddings. The process is shown in Algorithm 2.

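Input: document $[w_1, \dots, w_T]$, window size $c$, embedding size $m$, learning rate $\eta$,
word vocabulary $W = [W_1, \dots, W_{|W|}]$,
emotion set $E = [E_1, \dots, E_{|E|}]$
Initialize randomly a $|W| \times m$ matrix of word vectors $\mathbf{w}$
Initialize randomly a $|E| \times m$ matrix of emotion vectors $\mathbf{e}$
for $t = 1$ to $T$ do
       loss = 0
       Forward Propagation:
       for $j = -c$ to $c$ do
             if $t + j < 1$ or $t + j > T$ then
                   skip this position  # outside the document
             if $j \ne 0$ then
                   loss = loss + $(-\log p(w_{t+j} \mid w_t))$ + $(-\log p(w_{t+j} \mid e_t))$
       Backward Propagation:
       $\mathbf{w}$ = $\mathbf{w}$ - $\eta \, \partial \text{loss} / \partial \mathbf{w}$
       $\mathbf{e}$ = $\mathbf{e}$ - $\eta \, \partial \text{loss} / \partial \mathbf{e}$
Algorithm 2 EWE

$T$ is the length of the document, $|W|$ is the size of the vocabulary and $|E|$ is the number of emotions, which results from the clustering of the eLDA output. $c$ is the size of the context window, $m$ is the user-defined size of the representations and $\eta$ is the learning rate. $\mathbf{w}$ and $\mathbf{e}$ are the representation vectors of words and emotions, respectively.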
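To make the EWE objective concrete, the sketch below evaluates the average log probability for a toy document in which every word carries an emotion label. It only computes the objective (no gradient updates), uses a full softmax, and assumes a separate table of context vectors; these choices and all names are illustrative assumptions rather than the authors' training code.

```python
import math

def log_prob(context_word, predictor_vec, context_vecs):
    """log p(context | predictor) under a softmax over the vocabulary."""
    scores = {w: sum(a * b for a, b in zip(vec, predictor_vec))
              for w, vec in context_vecs.items()}
    top = max(scores.values())
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return scores[context_word] - log_norm

def ewe_objective(words, emotions, window, word_vecs, emotion_vecs, context_vecs):
    """Average over positions of log p(ctx | word) + log p(ctx | emotion)."""
    total = 0.0
    for t, (word, emotion) in enumerate(zip(words, emotions)):
        for j in range(-window, window + 1):
            if j == 0 or not 0 <= t + j < len(words):
                continue
            ctx = words[t + j]
            # The emotion acts as a pseudo word at the same position as the target.
            total += log_prob(ctx, word_vecs[word], context_vecs)
            total += log_prob(ctx, emotion_vecs[emotion], context_vecs)
    return total / len(words)

words = ["so", "funny", "haha"]
emotions = [0, 1, 1]  # hypothetical emotion labels from the re-clustering step
zeros = {w: [0.0, 0.0] for w in words}
objective = ewe_objective(words, emotions, window=1,
                          word_vecs=zeros,
                          emotion_vecs={0: [0.0, 0.0], 1: [0.0, 0.0]},
                          context_vecs=zeros)
```

With all-zero vectors every context word is equally likely, so each log probability equals $-\log 3$ for this three-word vocabulary, which gives a simple sanity check before any training.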

The emotional word embedding of word $w$ in emotion $e$ is obtained by concatenating the embeddings of $w$ and $e$, i.e., $\mathbf{w}^e = \mathbf{w} \oplus \mathbf{e}$, where $\oplus$ is the concatenation operation and the dimension of $\mathbf{w}^e$ is double that of $\mathbf{w}$ and $\mathbf{e}$. Correspondingly, the document embedding in EWE aggregates the emotional word embeddings of the words in a danmu document. The document embedding of $d$ is defined as $\mathbf{d} = \sum_{w \in d} \gamma_w \mathbf{w}^e$, where $\gamma_w$ can be the term frequency-inverse document frequency of word $w$ in $d$.
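A minimal sketch of the concatenation and the weighted aggregation, assuming documents are lists of (word, emotion) tokens. The weighting uses one common smoothed TF-IDF variant, $\mathrm{tf} \cdot (\ln\frac{1+n}{1+\mathrm{df}} + 1)$, which is an assumption; the source only says the weight can be TF-IDF.

```python
import math

def emotional_embedding(word, emotion, word_vecs, emotion_vecs):
    """Concatenate the word vector and the emotion vector."""
    return word_vecs[word] + emotion_vecs[emotion]

def smoothed_idf(word, corpus):
    n_docs = len(corpus)
    df = sum(1 for doc in corpus if any(w == word for w, _ in doc))
    return math.log((1 + n_docs) / (1 + df)) + 1.0

def document_embedding(doc, corpus, word_vecs, emotion_vecs):
    """TF-IDF-weighted sum of emotional word embeddings of all tokens in doc."""
    dim = len(emotional_embedding(*doc[0], word_vecs, emotion_vecs))
    result = [0.0] * dim
    for word, emotion in doc:
        # Each token contributes idf / |doc|, so repeated words accumulate tf * idf.
        weight = smoothed_idf(word, corpus) / len(doc)
        vec = emotional_embedding(word, emotion, word_vecs, emotion_vecs)
        result = [r + weight * v for r, v in zip(result, vec)]
    return result

word_vecs = {"haha": [1.0, 0.0], "sad": [0.0, 1.0]}
emotion_vecs = {0: [0.5, 0.5], 1: [-0.5, 0.5]}
corpus = [[("haha", 0), ("haha", 0)], [("sad", 1), ("haha", 0)]]
doc_vec = document_embedding(corpus[0], corpus, word_vecs, emotion_vecs)
```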

3.4 Deep Multi-View Representation Learning

In this subsection, we introduce the multi-view representation learning method in DCVDN, which aims to simultaneously utilize video and danmu information to learn a joint representation based on the extracted visual and textual features. Inspired by canonical correlation analysis (CCA) and reconstruction-based objectives, we employ deep canonically correlated autoencoders to fuse the latent features from the video and danmu views. In particular, DCCAE consists of two autoencoders and optimizes the combination of the canonical correlation between the learned textual and visual representations and the reconstruction errors of the autoencoders. The structure of DCCAE is shown in the module “Deep Multi-view Learning: DCCAE” in Figure 2, and its optimization objective is as follows:

$$\min_{W_f, W_g, W_p, W_q, U, V} \; -\frac{1}{N}\,\mathrm{tr}\left(U^\top f(X)\, g(Y)^\top V\right) + \frac{\lambda}{N} \sum_{i=1}^{N} \left\| x_i - p(f(x_i)) \right\|^2 + \frac{\lambda}{N} \sum_{i=1}^{N} \left\| y_i - q(g(y_i)) \right\|^2 \qquad (7)$$

$$\text{s.t.} \quad U^\top \left( \frac{1}{N} f(X) f(X)^\top \right) U = V^\top \left( \frac{1}{N} g(Y) g(Y)^\top \right) V = I,$$

where $\lambda$ is the trade-off parameter, $N$ is the sample size, $X = [x_1, \dots, x_N]$ and $Y = [y_1, \dots, y_N]$ are the feature matrices of the visual and textual views, with each $x_i$ and $y_i$ referring to the visual and textual features extracted from a video frame and the corresponding danmu document, respectively. Moreover, $f$, $g$, $p$ and $q$ denote mapping functions implemented in the autoencoders. The encoder-decoder pairs ($f$, $p$) and ($g$, $q$) constitute two autoencoders, one for each view. The corresponding parameters of the encoding functions $f$ and $g$ and the decoding functions $p$ and $q$ are denoted by $W_f$, $W_g$, $W_p$ and $W_q$, respectively. $U = [u_1, \dots, u_L]$ and $V = [v_1, \dots, v_L]$ are the CCA directions that project the outputs of $f$ and $g$, where $L$ is the dimensionality of the input features to the autoencoders.

Mathematically, the first term of Eq. (7) is the objective of CCA, while the second and third terms are the losses of the autoencoders, which can be understood as adding autoencoders as regularization terms to the CCA objective. The constraint ensures that the CCA objective is invariant to the scale of $U$ and $V$. CCA aims to maximize the mutual information between videos and danmus, while the autoencoders aim to minimize the reconstruction errors of the two views. In this way, DCCAE seeks an optimal trade-off between the information captured from the reconstruction of each view and the information captured from learning the mutual information of the two views. Therefore, it can achieve better representations. The output features of the two views are $U^\top f(X)$ and $V^\top g(Y)$, respectively, which serve as the inputs to two separate LSTMs for the later classification.
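The trade-off between the correlation term and the reconstruction terms can be illustrated by evaluating the objective for fixed mappings. The sketch assumes the standard DCCAE form from [33], i.e., the negative trace of the cross-view correlation of projected encodings plus $\lambda$-weighted reconstruction errors, and it does not enforce the whitening constraint or perform any optimization. The encoders, decoders and projection matrices passed in are hypothetical placeholders.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def squared_error(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def project(columns, vec):
    """Compute M^T vec, where M is given as a list of its column vectors."""
    return [dot(col, vec) for col in columns]

def dccae_objective(X, Y, f, g, p, q, U, V, lam):
    """-1/N tr(U^T f(X) g(Y)^T V) + lam/N * (reconstruction errors of both views)."""
    n = len(X)
    correlation = 0.0
    reconstruction = 0.0
    for x, y in zip(X, Y):
        hx, hy = f(x), g(y)
        # The trace term decomposes into a sum of per-sample dot products.
        correlation += dot(project(U, hx), project(V, hy))
        reconstruction += squared_error(x, p(hx)) + squared_error(y, q(hy))
    return -correlation / n + lam * reconstruction / n

identity = lambda v: list(v)
zero_decoder = lambda v: [0.0] * len(v)
X = [[1.0, 0.0], [0.0, 1.0]]   # toy visual features
Y = [[1.0, 0.0], [0.0, 1.0]]   # toy textual features
eye = [[1.0, 0.0], [0.0, 1.0]]  # toy CCA directions
perfect = dccae_objective(X, Y, identity, identity, identity, identity, eye, eye, lam=0.1)
lossy = dccae_objective(X, Y, identity, identity, zero_decoder, identity, eye, eye, lam=0.1)
```

With perfect reconstruction only the correlation term remains; a decoder that discards the visual input raises the objective by $\lambda$ times the average reconstruction error.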

3.5 Classification

In this module, the DCCAE outputs are used for classification. As mentioned above, the output representations from DCCAE for the two modalities are still time series. To address the time dependency across the representations, we feed the two modalities into two separate LSTMs and take the final outputs of the two LSTMs, $h_v$ from the video part and $h_t$ from the text part. Then we simply concatenate the two parts into one, $h = h_v \oplus h_t$. The obtained representation is eventually fed into a fully-connected network with a softmax function to obtain the target emotion prediction. The classification network is depicted as the rightmost module in Figure 2.
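The final step, concatenation followed by a fully-connected layer and a softmax over the seven emotion classes, can be sketched as below. The LSTM outputs are taken as given, a single linear layer stands in for the fully-connected network, and the weights are hand-picked toy values; all of these are assumptions for illustration.

```python
import math

EMOTIONS = ["happy", "love", "anger", "sad", "fear", "disgust", "surprise"]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(h_video, h_text, weights, bias):
    """Concatenate both LSTM outputs, apply a linear layer, then softmax."""
    h = h_video + h_text  # list concatenation
    logits = [dot_row + b for dot_row, b in
              ((sum(w * x for w, x in zip(row, h)), b) for row, b in zip(weights, bias))]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs

h_video = [1.0, 0.0]  # hypothetical final output of the video LSTM
h_text = [0.0, 1.0]   # hypothetical final output of the danmu LSTM
weights = [[0.0] * 4 for _ in EMOTIONS]
weights[EMOTIONS.index("sad")] = [0.0, 0.0, 0.0, 2.0]
bias = [0.0] * len(EMOTIONS)
label, probs = classify(h_video, h_text, weights, bias)
```

The probability vector corresponds to the distribution over the seven classes mentioned in the introduction, and the arg-max gives the predicted main emotion.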

4 Experiments

In this section, we carry out extensive experiments to evaluate the performance of DCVDN. We first introduce the datasets used for the experiments and then compare our model with 14 other baselines, which rely on visual, textual and joint features for emotion analysis and video classification, respectively.

4.1 Datasets

Number of Videos    Avg. Length    Length Range
4,056               82.89s         1.44s - 514.13s

Number of videos per emotion class:
Happy    Love    Anger    Sad    Fear    Hate    Surprise
620      877     290      631    647     669     322
Table 1: The basic statistics of the Video-Danmu dataset.

Number of Words (Total: 1592, Avg. Occurrence Freq.: 1550.39)
Happy    Love    Anger    Sad    Fear    Hate    Surprise
90       784     25       132    146     348     67

Number of Emoticons (Total: 1670, Avg. Occurrence Freq.: 7.90)
Anger    Disgust    Fear    Shame    Guilt    Joy    Sadness
417      522        41      192      148      265    85
Table 2: Number of words and emoticons of different emotion classes in self-built emotion lexicon.

4.1.1 Datasets For Video-Danmu

Considering the lack of existing danmu-related datasets, we put great effort into constructing a new dataset, called Video-Danmu (we are happy to share this dataset with the public after the paper is published). This dataset includes videos and their associated danmus directly crawled from the Bilibili website, which is one of the most popular websites providing danmu services in China. There are 4,056 videos in the dataset, with lengths ranging from 1.44 to 514.13 seconds and averaging 82.89 seconds. We labelled the videos into 7 emotion classes, i.e., happy, love, anger, sad, fear, disgust and surprise, with the help of a group of student helpers in our university. Table 1 shows the basic statistics of the dataset. The number of videos falling in each emotion category is relatively balanced, ranging from 290 to 877. Table 2 lists the number of words and emoticons belonging to each emotion class in the self-built emotion lexicon. Emoticons are textual expressions that depict, for example, happiness or crying, and they usually directly express the emotion of viewers. The average occurrence frequency of the lexicon words in our dataset is about 1,550 (Table 2), which strongly validates their popularity in practice. The average occurrence frequency of the emoticons in the dataset is about 8.

4.1.2 Datasets For Textual Analysis

In order to show that our EWE can be applied to other text-based emotion applications as well, we also use two additional text datasets for comparison.

  • Incident reports dataset (ISEAR) [21]: ISEAR contains incident reports obtained from an international survey on emotional reactions. A number of psychologists and non-psychologists were asked to report situations in which they had experienced each of 7 major emotions (joy, fear, anger, sadness, disgust, shame, and guilt).

  • Multi-Domain Sentiment Dataset [3]: This dataset contains product reviews from Amazon in four domains (books, DVDs, electronics, and kitchen & housewares). It consists of both positive and negative reviews.

4.2 Baselines

We compare the proposed DCVDN with 14 baselines, which fall into four categories, i.e., visual-based, textual-based, multi-view learning and video classification methods. We also compare the proposed EWE with other textual emotion analysis baselines.

Visual-based baselines:

  • GCH/LCH: use low-level features (64-bin global color histogram features (GCH) and 64-bin local color histogram (LCH)) as defined in [24].

  • CaffeNet: An ImageNet pre-trained AlexNet [6] followed by fine-tuning.

  • PCNN: Progressive CNN [35].

Note that in all the above approaches, the image features of the selected frames are fed into an LSTM for final classification.
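To make the low-level GCH baseline concrete, the following is a minimal sketch of a 64-bin global color histogram. It assumes that each RGB channel is quantized into 4 levels (4 × 4 × 4 = 64 bins); the exact quantization used in [24] may differ, and the function name and toy frame are hypothetical. An LCH variant would apply the same routine to each block of a partitioned image.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0..255


def global_color_histogram(pixels: List[Pixel], levels: int = 4) -> List[float]:
    """Return a normalized histogram with levels**3 bins (64 when levels=4).

    Assumption: uniform quantization of each channel into `levels` ranges.
    """
    bins = [0] * (levels ** 3)
    step = 256 // levels  # width of one quantization range per channel
    for r, g, b in pixels:
        idx = (r // step) * levels * levels + (g // step) * levels + (b // step)
        bins[idx] += 1
    total = len(pixels)
    if total == 0:
        return [0.0] * len(bins)
    return [count / total for count in bins]


# Toy "frame": three pure-red pixels and one black pixel.
frame = [(255, 0, 0)] * 3 + [(0, 0, 0)]
hist = global_color_histogram(frame)
print(len(hist), hist[48], hist[0])
```

In this toy frame, the red pixels fall into bin 48 and the black pixel into bin 0, so the resulting feature vector has two non-zero entries that sum to one.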

Textual-based baselines:

  • Lexicon method: We count the number of words belonging to each emotional class in each document. Then we choose the emotion class with the largest count as the result.

  • eLDA: Aggregate all danmus of a video into one document and infer the emotion distribution of the document. Choose the emotion class with the largest probability as the result.

  • Word embedding: Learn word representations by Skip-Gram model [18].

  • Topical word embedding (TWE): Learn word representations by TWE model [17], which jointly utilizes the target word and its topic to predict context words.

  • SSWE [28]: Sentiment-Specific word embedding for sentiment classification.
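The lexicon baseline above can be sketched in a few lines. This is an illustrative sketch only: the toy lexicon, the tokenized document, and the tie-breaking rule (the first emotion class encountered wins) are assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Optional


def lexicon_predict(tokens: List[str], lexicon: Dict[str, str]) -> Optional[str]:
    """Return the emotion class whose lexicon words occur most often.

    Returns None when the document contains no lexicon word.
    """
    counts = Counter(lexicon[t] for t in tokens if t in lexicon)
    if not counts:
        return None
    # most_common keeps first-seen order among equal counts (assumed tie-break).
    return counts.most_common(1)[0][0]


# Hypothetical lexicon mapping words to emotion classes.
toy_lexicon = {"haha": "happy", "lol": "happy", "scary": "fear", "cry": "sad"}
doc = ["haha", "this", "is", "scary", "lol"]
prediction = lexicon_predict(doc, toy_lexicon)
print(prediction)
```

Because the decision depends only on raw counts, a few frequent but misleading words can dominate the prediction, which is consistent with the weak results reported for this baseline.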

Multi-view learning baselines:

  • Simple-Con: Concatenate the features from different views.

  • DistAE: A joint learning method [33] whose objective combines the reconstruction errors of the autoencoders of the different views with the average discrepancy across the learned projections of the views.

Video classification baselines:

  • Conv3D [29]: 3D Convolutional neural networks.

  • Temporal [25]: Optical flows of the frames in the video, widely used for action recognition.

  • Temporal + Spatial [25]: Use CNN to extract the spatial features, and average the temporal features and spatial features.

4.3 Parameter Settings

For the set of danmus in each video, we divide it into clusters and aggregate each cluster into one document. The number of clusters is a user-defined parameter, and we recommend the value that performs well in our experiments. In EWE, the word vectors and the emotion vectors have the same dimension, which in turn determines the dimension of the emotional word and document embeddings. The visual features are extracted from the fc-7 layer of VGG-Net. In the multi-view learning module, the two autoencoders in DCCAE have the same depth; the middle layer has its own size, while the size of the other layers is equal to that of the inputs. In the classification module, we use a LayerNorm LSTM [1] with a fixed length, forget bias, and hidden layer size. The subsequent fully-connected network has 2 layers. We focus on Accuracy and Precision as the evaluation metrics in our experiments.
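As a reference for how the two metrics can be computed, here is a minimal sketch. It assumes the standard definitions: per-class Precision is the fraction of videos predicted as a class that truly belong to it, and Accuracy is the fraction of all videos classified correctly. Reporting 0.0 for a class that is never predicted is a convention chosen here, and the label lists are toy data.

```python
from typing import Dict, List


def precision_per_class(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Precision of each class: correct predictions / all predictions of that class."""
    result = {}
    for c in sorted(set(y_true) | set(y_pred)):
        truths_of_predicted = [t for t, p in zip(y_true, y_pred) if p == c]
        if truths_of_predicted:
            correct = sum(1 for t in truths_of_predicted if t == c)
            result[c] = correct / len(truths_of_predicted)
        else:
            result[c] = 0.0  # convention: class never predicted
    return result


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels, not taken from the Video-Danmu dataset.
y_true = ["happy", "happy", "sad", "fear", "sad"]
y_pred = ["happy", "sad", "sad", "fear", "happy"]
prec = precision_per_class(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(prec, acc)
```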

4.4 Case Studies

In this subsection, we first present one example to show that the key frames are strongly related to the burst points of danmus, then we present three prediction examples to validate the superiority of DCVDN over other baselines.

4.4.1 An Example of Key Frames and Danmus

We provide an example of the relationship between the burst points of danmus and the selected key frames in Fig. 4, to show that our clustering approach can select more important frames with the help of danmus. In the most famous clip of this video, an ancient minister has just insulted his opponent. “I’ve never seen anyone so brazen!” his opponent said angrily, and then died. The upper frame sequence is obtained by uniform sampling from the video, and the lower frame sequence is obtained by our clustering method based on the danmu burst pattern. The middle chart shows how the number of danmus appearing in each second changes over time. Our method successfully finds the key frames, which reflect the content background more comprehensively. It is also evident from the middle chart that the number of danmus keeps changing over time and that the changes are strongly related to the audience's interest. Our clustering method selects, in nearly every time interval, the frame with the most danmus. Moreover, our method selects the frame with the most important words, “I’ve never seen anyone so brazen!”, and, as shown in the chart, this frame corresponds to the moment with the most danmus. By contrast, uniform selection misses this frame, which shows that it is not always effective in practice.

Figure 4: An example of the relation between the burst points of danmus and the selected key frames. The upper frame sequence is obtained by uniform sampling, and the lower frame sequence is obtained by our clustering method. The middle chart shows the change in the number of danmus appearing in each second.
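The contrast between uniform selection and burst-driven selection can be illustrated with a small sketch. This is a deliberate simplification and not the clustering method itself: it assumes equal-length time intervals and simply picks the second with the most danmus in each interval, mirroring the observation above. The function names and the per-second counts are hypothetical.

```python
from typing import List


def uniform_key_seconds(duration: int, k: int) -> List[int]:
    """Pick the midpoint second of each of k equal intervals."""
    return [int((i + 0.5) * duration / k) for i in range(k)]


def burst_key_seconds(counts: List[int], k: int) -> List[int]:
    """Pick, in each of k equal intervals, the second with the most danmus."""
    n = len(counts)
    picks = []
    for i in range(k):
        start, end = i * n // k, (i + 1) * n // k
        if start == end:
            continue  # interval is empty when k > n
        picks.append(max(range(start, end), key=lambda s: counts[s]))
    return picks


# Hypothetical number of danmus in each second of a 12-second clip.
counts = [1, 0, 2, 9, 1, 0, 0, 1, 3, 1, 0, 7]
uniform = uniform_key_seconds(len(counts), 3)
burst = burst_key_seconds(counts, 3)
print(uniform, burst)
```

In this toy clip the uniform picks land on seconds 2, 6 and 10 and miss both bursts (seconds 3 and 11), whereas the burst-driven picks include them.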

4.4.2 Three Examples of Prediction Results

(a) A wacky video with ordinary visual content and scary BGM. The label is “Disgust”. VGG predicts “Love”, EWE predicts “Fear”, and DCVDN predicts correctly.
(b) Two audience members propose to tennis stars and get different responses. The label is “Happy”. VGG and DCVDN predict correctly, but EWE predicts “Fear”.
(c) A combination of several movie clips with sad BGM. The label is “Sad”. EWE and DCVDN predict correctly, but VGG predicts “Happy”.
Figure 5: Three prediction examples to illustrate performance comparison between VGG, EWE and DCVDN.

In this subsection, we provide three prediction examples to compare the performance of VGG (visual method only), EWE (textual method only) and DCVDN (our method, which combines visual and textual information), as shown in Fig. 5.

Fig. 5 (a) is a wacky music video with ordinary visual content but scary background music (BGM). The ground truth of this video is “Disgust”, and DCVDN gives the right answer. VGG predicts “Love”, because the clip looks like a music video and most music videos evoke love. EWE predicts “Fear”, because viewers sometimes express their disgust through fear-related words in texts, such as “Help me! I died as soon as he sang. I choose to die”.

Fig. 5 (b) is a video about two tennis stars, Nadal and Steffi. At the beginning of the video, one audience member asks Nadal, “Will you marry me?”, and Nadal shyly refuses her. Then another audience member asks Steffi, “Will you marry me?”, and Steffi asks back, “How much do you have?”. The rest of the audience in the stands laughs loudly. The ground truth of example (b) is “Happy”, since the proposals to the stars are very funny. VGG and DCVDN both give the right answer, probably because the video is about sports stars, although “Disgust” has the second-highest probability in the VGG result, perhaps because of the poor quality of the video. EWE predicts “Fear”, probably because the word “shy” appears frequently in the danmu texts and “Shy” is a subclass of “Fear” in our dataset.

Fig. 5 (c) is a music video that combines clips from several movies, such as Harry Potter and the Lord of the Rings. These clips, set to sorrowful BGM, are mostly about parting or death, and the ground truth is “Sad”. Both EWE and DCVDN give the right answer, while VGG predicts “Happy” with high probability and “Sad” with low probability. These results are reasonable because the visual content mainly shows movie stars, who look happy most of the time. In such cases, danmus provide more information about the true feelings of the audience, beyond what the visual content conveys.

4.5 Evaluations

4.5.1 EWE on the Emotional Analysis

We first compare our EWE model with textual-based baselines on our own dataset and two public datasets. Table 3 shows the comparison results on the texts in our own dataset. EWE outperforms all other textual-based baselines under investigation on Accuracy (0.683, versus 0.214 to 0.669 for the baselines). The lexicon-based method and eLDA perform poorly, which indicates that the relation between the number of emotional words and the emotion label of a video is not that strong. The embedding-based methods perform much better, since they can effectively capture upper-level features in the emotion space through highly non-linear transformations. EWE achieves the best Accuracy, which we attribute to the emotional word embeddings being more informative and providing more hints for emotion analysis. Table 4 shows the comparison results on the ISEAR dataset. EWE again achieves the highest Accuracy (0.396, versus 0.148 to 0.357 for the baselines). EWE may perform even better if the training examples in the dataset were more evenly distributed across classes. Table 5 shows the comparison results on the Multi-Domain Sentiment dataset, which contains only positive and negative labels. Besides the baselines investigated on the previous two datasets, we also include SSWE [28] as a baseline for this sentiment classification task. EWE is again the best (0.651, versus 0.505 to 0.642), although its margin is smaller than on the previous two datasets. The lexicon method and eLDA perform poorly here, which may indicate that the quality of the sentiment dictionary is not good. This could in turn dampen the performance of EWE and may explain why its advantage is less prominent than on the previous two datasets.

4.5.2 DCVDN-V on the Video-Danmu Dataset

Precision Lexicon eLDA WE TWE EWE
happy 0.106 0.493 0.568 0.644 0.636
love 0.757 0.051 0.737 0.749 0.777
anger 0.0 0.0 1.0 1.0 1.0
sad 0.067 0.547 0.803 0.826 0.811
fear 0.049 0.223 0.384 0.624 0.504
disgust 0.094 0.659 0.554 0.647 0.630
surprise 0.12 0.062 0.299 0.688 0.403
Accuracy 0.214 0.321 0.624 0.669 0.683
Table 3: Comparison of Precision and Accuracy between EWE and textual baselines on the Video-Danmu dataset.
Precision Lexicon eLDA WE TWE EWE
Anger 0.131 0.274 0.258 0.234 0.309
Disgust 0.139 0.308 0.339 0.313 0.346
Joy 0.154 0.186 0.428 0.451 0.472
Shame 0.148 0.165 0.244 0.309 0.244
Fear 0.150 0.445 0.414 0.391 0.244
Sadness 0.161 0.359 0.437 0.515 0.488
Guilt 0.152 0.0 0.312 0.304 0.488
Accuracy 0.148 0.272 0.354 0.357 0.396
Table 4: Comparison of Precision and Accuracy between EWE and textual baselines on ISEAR.
Precision positive negative Accuracy
Lexicon 0.510 0.516 0.512
eLDA 0.505 0.505 0.505
SSWE 0.560 0.624 0.613
WE 0.569 0.680 0.639
TWE 0.560 0.678 0.642
EWE 0.580 0.680 0.651
Table 5: Comparison of Precision and Accuracy between EWE and textual baselines on Multi-Domain Sentiment Dataset.

We compare the visual part of our model, DCVDN-V, with other visual-based baselines and video classification methods on the Video-Danmu dataset. DCVDN-V is the reduced version of DCVDN that considers only the visual input and uses VGG-Net and an autoencoder for feature extraction. Table 6 shows the comparison results between the visual-based baselines and DCVDN-V. As before, Precision is computed for each emotion class and Accuracy is the overall average across all emotion classes. DCVDN-V outperforms the other visual-based baselines on Accuracy (0.532, versus 0.394 to 0.455). Moreover, the deep-learning-based methods generally improve on the low-level-feature-based approaches. It is also worth pointing out that, compared with the textual-based methods, the Precision of “Happy” and “Love” predicted by the visual-based methods is low relative to the other classes. The reason may be that the visual characteristics of “Happy” and “Love” videos are quite similar to each other, so that the features exhibit great intra-class variance. This phenomenon supports the claim that it is hard to learn a clear mapping function solely from visual features to high-level emotions. Therefore, when enhanced by the interactive information from user-generated texts, joint features may achieve remarkable improvements over purely visual-based methods. As our dataset is based on videos, we also compare DCVDN-V with other video classification baselines, with the results shown in Table 7. The performance of the different video classification methods is very close, while DCVDN-V outperforms them clearly on Accuracy (0.532, versus 0.423 to 0.436). This demonstrates that our method learns more information related to emotion analysis than the video classification methods do. Furthermore, the advantage of DCVDN-V in these experiments suggests that the video features are learned well with the help of the mutual information from the text part.

Precision GCH LCH PCNN CaffeNet DCVDN-V
happy 0.283 0.170 0.271 0.174 0.323
love 0.361 0.355 0.355 0.405 0.463
anger 0.833 0.573 0.938 0.963 1.0
sad 0.364 0.384 0.440 0.541 0.616
fear 0.359 0.438 0.438 0.481 0.609
disgust 0.448 0.452 0.411 0.518 0.642
surprise 0.219 0.343 0.346 0.272 0.338
Accuracy 0.410 0.394 0.423 0.455 0.532
Table 6: Comparison of Precision and Accuracy between DCVDN-V and visual baselines.
Precision Conv3D Temporal T + S DCVDN-V
happy 0.275 0.313 0.308 0.323
love 0.366 0.446 0.338 0.463
anger 0.941 0.940 0.968 1.0
sad 0.472 0.489 0.667 0.616
fear 0.430 0.407 0.387 0.609
disgust 0.417 0.433 0.524 0.642
surprise 0.439 0.345 0.232 0.338
Accuracy 0.436 0.423 0.427 0.532
Table 7: Comparison of Precision and Accuracy between DCVDN-V and video classification baselines.

4.5.3 DCVDN on the Video-Danmu Dataset

Table 8 shows the comparison results between DCVDN and the multi-view learning baselines. (In Table 7, “T + S” in the header of the fourth column stands for “Temporal + Spatial”.) The proposed DCVDN with DCCAE surpasses the other multi-view learning methods on Accuracy (0.731, versus 0.713 to 0.720). DistAE is sometimes even worse than Simple-Con. This is because DistAE aims to minimize the distance between the visual and textual views, yet the two views are not the same even though they are correlated. By contrast, DCCAE has the flexibility to explore the relationship between the views in depth, which facilitates joint representation learning. Overall, the results in Table 8 support the claim that CCA is able to maximize the mutual information between videos and danmus.

Precision Simple-Con DistAE DCVDN
happy 0.729 0.622 0.732
love 0.816 0.782 0.754
anger 1.0 1.0 1.0
sad 0.795 0.805 0.814
fear 0.632 0.571 0.716
disgust 0.601 0.627 0.628
surprise 0.442 0.652 0.450
Accuracy 0.720 0.713 0.731
Table 8: Comparison of Precision and Accuracy between DCVDN and multi-view learning baselines.
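The difference between a distance-based objective (as in DistAE) and a correlation-based objective (as in DCCAE) can be illustrated with a toy example. The sketch below uses one-dimensional projections and plain Pearson correlation as a stand-in for canonical correlation; the actual models operate on multi-dimensional learned projections, and the feature values here are made up.

```python
import math
from typing import List


def mean_squared_distance(x: List[float], y: List[float]) -> float:
    """Average squared difference between paired projections of two views."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation between paired projections of two views."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical 1-D projections: the textual view is a scaled and shifted
# copy of the visual view, so the views are correlated but not equal.
visual = [0.0, 1.0, 2.0, 3.0]
textual = [2.0 * v + 1.0 for v in visual]

dist = mean_squared_distance(visual, textual)
corr = pearson(visual, textual)
print(dist, corr)
```

Here the two views are perfectly correlated even though their distance is far from zero, so a distance penalty would push apart representations that a correlation objective already treats as fully aligned.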

4.5.4 Impact of the Size of Dataset

In this subsection, we examine whether our dataset is large enough to learn a good model. We measure the accuracy of our model when it uses different ratios of the full dataset. The accuracies obtained at these ratios show that our dataset is large enough to demonstrate the performance of our models.

5 Conclusions

In this paper, we study user emotion analysis toward online videos by jointly utilizing video frames and danmu texts. To encode emotion into the learned word embeddings, we propose EWE to learn text representations by jointly considering their semantics and emotions. We then propose a novel visual-textual emotion analysis approach with deep coupled video and danmu neural networks, in which visual and textual features are synchronously extracted and fused to form a comprehensive representation by deep-canonically-correlated-autoencoder-based multi-view learning. To evaluate the performance of EWE and DCVDN, we conduct extensive experiments on public datasets and a self-crawled video-danmu dataset. The experimental results validate the superiority of EWE and the overall DCVDN over other state-of-the-art baselines.


References

  • [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.
  • [3] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th annual meeting of the association of computational linguistics, pages 440–447, 2007.
  • [4] Damian Borth, Tao Chen, Rongrong Ji, and Shih-Fu Chang. Sentibank: Large-scale ontology and classifiers for detecting sentiment and emotions in visual content. In Proceedings of the 21st ACM International Conference on Multimedia, MM ’13, pages 459–460, New York, NY, USA, 2013. ACM.
  • [5] Víctor Campos, Brendan Jou, and X. Giró-i Nieto. From pixels to sentiment: Fine-tuning cnns for visual sentiment prediction. Image and Vision Computing, 2017.
  • [6] Victor Campos, Amaia Salvador, Xavier Giro-i Nieto, and Brendan Jou. Diving deep into sentiment: Understanding fine-tuned cnns for visual sentiment prediction. In Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, ASM ’15, pages 57–62, New York, NY, USA, 2015. ACM.
  • [7] Donglin Cao, Rongrong Ji, Dazhen Lin, and Shaozi Li. A cross-media public sentiment analysis system for microblog. Multimedia Syst., 22(4):479–486, July 2016.
  • [8] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2121–2129. Curran Associates, Inc., 2013.
  • [9] David R Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16(12):2639–2664, 2004.
  • [10] Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
  • [11] Xia Hu, Jiliang Tang, Huiji Gao, and Huan Liu. Unsupervised sentiment analysis with emotional signals. In Proceedings of the 22nd International Conference on World Wide Web, WWW ’13, pages 607–618, New York, NY, USA, 2013. ACM.
  • [12] Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
  • [13] E. Ko and E. Y. Kim. Recognizing the sentiments of web images using hand-designed features. In 2015 IEEE 14th International Conference on Cognitive Informatics Cognitive Computing (ICCI*CC), pages 156–161, July 2015.
  • [14] Fangtao Li, Minlie Huang, and Xiaoyan Zhu. Sentiment analysis with global topics and local dependency. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, pages 1371–1376. AAAI Press, 2010.
  • [15] Chenghua Lin and Yulan He. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, pages 375–384, New York, NY, USA, 2009. ACM.
  • [16] Sidi Liu, Jinglei Lv, Yimin Hou, Ting Shoemaker, Qinglin Dong, Kaiming Li, and Tianming Liu. What makes a good movie trailer?: Interpretation from simultaneous eeg and eyetracker recording. In Proceedings of the 2016 ACM on Multimedia Conference, MM ’16, pages 82–86, New York, NY, USA, 2016. ACM.
  • [17] Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Topical word embeddings. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 2418–2424. AAAI Press, 2015.
  • [18] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
  • [19] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In Lise Getoor and Tobias Scheffer, editors, ICML, pages 689–696. Omnipress, 2011.
  • [20] L. Pang, S. Zhu, and C. W. Ngo. Deep multimodal learning for affective analysis and retrieval. IEEE Transactions on Multimedia, 17(11):2008–2020, Nov 2015.
  • [21] W Gerrod Parrott. Emotions in social psychology: Essential readings. Psychology Press, 2001.
  • [22] Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert RG Lanckriet, Roger Levy, and Nuno Vasconcelos. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM international conference on Multimedia, pages 251–260. ACM, 2010.
  • [23] Joseph Reisinger and Raymond J. Mooney. Multi-prototype vector-space models of word meaning. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, pages 109–117, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
  • [24] Stefan Siersdorfer, Enrico Minack, Fan Deng, and Jonathon Hare. Analyzing and predicting sentiment of images on the social web. In Proceedings of the 18th ACM International Conference on Multimedia, MM ’10, pages 715–718, New York, NY, USA, 2010. ACM.
  • [25] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
  • [26] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [27] Richard Socher and Li Fei-Fei. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 966–973. IEEE, 2010.
  • [28] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning sentiment-specific word embedding for twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1555–1565, Baltimore, Maryland, June 2014. Association for Computational Linguistics.
  • [29] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 4489–4497. IEEE, 2015.
  • [30] Hongwei Wang, Jia Wang, Miao Zhao, Jiannong Cao, and Minyi Guo. Joint-topic-semantic-aware social recommendation for online voting. In CIKM. ACM, 2017.
  • [31] Jingwen Wang, Jianlong Fu, Yong Xu, and Tao Mei. Beyond object recognition: Visual sentiment analysis with deep coupled adjective and noun neural networks. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pages 3484–3490. AAAI Press, 2016.
  • [32] Min Wang, Donglin Cao, Lingxiao Li, Shaozi Li, and Rongrong Ji. Microblog sentiment analysis based on cross-media bag-of-words model. In Proceedings of International Conference on Internet Multimedia Computing and Service, ICIMCS ’14, pages 76:76–76:80, New York, NY, USA, 2014. ACM.
  • [33] Weiran Wang, Raman Arora, Karen Livescu, and Jeff Bilmes. On deep multi-view representation learning. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pages 1083–1092, 2015.
  • [34] Quanzeng You, Hailin Jin, and Jiebo Luo. Visual sentiment analysis by attending on local image regions. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • [35] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Robust image sentiment analysis using progressively trained and domain transferred deep networks. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pages 381–388. AAAI Press, 2015.
  • [36] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Building a large scale dataset for image emotion recognition: The fine print and the benchmark. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 308–314. AAAI Press, 2016.
  • [37] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, WSDM ’16, pages 13–22, New York, NY, USA, 2016. ACM.
  • [38] Jianbo Yuan, Sean Mcdonough, Quanzeng You, and Jiebo Luo. Sentribute: Image sentiment analysis from a mid-level perspective. In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining, WISDOM ’13, pages 10:1–10:8, New York, NY, USA, 2013. ACM.