Sentiment analysis or affective computing systems are designed to analyze human emotional states, and may benefit the development of human-computer interaction. The basic tasks include recognition of human sentiment using information from multiple modalities like facial expressions, body movement and gestures, speech and physiological signals. The labels for human sentiment are often either discrete categorical labels of six universal emotions (Disgust, Fear, Happiness, Surprise, Sadness, and Anger) [ekman1971constants], or continuous-valued annotations in the arousal and valence spaces [thompson2011methods]. Previous research, therefore, has normally modeled the problem as either a classification[zhou2017action] or a regression[zhou2017pose] task, using deep models like the CNN[khorrami2015deep], or traditional approaches like the SVM or Regression Tree[shan2009facial].
Further improvements in the performance and reliability of affective systems will rely on long-term contextual information modeling and cross-modality analysis. Since emotions normally change gradually within the same context, analyzing the long-term dependency of emotions will stabilize the overall predictions. Meanwhile, humans perceive others' emotional states by combining information across multiple modalities simultaneously. Combining different modalities will yield better emotion recognition with more human-like computational models [morency2011towards]. These two aspects are explicitly emphasized in the 2018 IJCNN challenge "One-Minute Gradual-Emotion Recognition (OMG-Emotion)" [barros2018omg]. In this challenge, long monologue videos with gradual emotional changes are selected from YouTube and carefully annotated at the utterance level with both arousal/valence values and emotional categories. All the video clips contain visual, audio and transcript information. The performance of three unimodal recognition systems is provided as the baseline.
In developing our multimodal system for sentiment analysis to address this challenge, we have been inspired by many previous works, such as those combining visual and audio features [tzirakis2017end], as well as speech content [morency2011towards, poria2016fusing, zadeh2017tensor]. Physiological signals have also been incorporated into emotion recognition systems [ranganathan2016multimodal]. Methods of combining cues from each modality can be categorized into early or late fusion. For early fusion, features from different modalities are projected into the same joint feature space before being fed into the classifier [rosas2013multimodal, poria2015towards]. For late fusion, classifications are made on each modality and their decisions or predictions are later merged together, e.g. by taking the mean or another linear combination [cai2015convolutional, glodek2013kalman]. Some works [kessous2010multimodal, poria2015deep] even implemented a hybrid fusion strategy to exploit the advantages of both late and early fusion.
In this paper, we investigated the use of a number of feature extraction, classification and fusion methods. Our final trimodal method aggregates visual, audio and text features for a single-shot utterance-level sentiment regression using early fusion. To verify the effectiveness of multimodal fusion, we compared it with three unimodal methods. Our proposed multimodal approach outperformed the unimodal ones as well as the baseline methods, achieving validation set concordance correlation coefficients (CCC) of 0.400 on the arousal task, and 0.353 on the valence task.
II-A Dataset and Metrics
The OMG-Emotion Behavior Dataset [barros2018omg] is a long-term multimodal corpus for sentiment analysis. It is constructed by selecting videos with emotional behaviors from YouTube using keywords like "monologues", "auditions", etc. Most videos in the OMG dataset have a standard resolution of 1280x720, and the main language is English. Utterances are then extracted from the segments of each video with high speech probability. The dataset is split into training, validation and testing sets. There are 231 videos in the training set, 60 videos in the validation set, and 204 videos in the testing set. The corresponding numbers of utterances are 2440, 617 and 2229 respectively.
Each utterance is annotated with arousal and valence values in the dimensional space, as well as one of seven discrete emotion labels. Arousal is a continuous score ranging from 0 (calm) to 1 (excited), while valence is a continuous score ranging from -1 (negative) to +1 (positive).
The following two metrics are used to evaluate arousal/valence estimation on this dataset: the mean squared error (MSE) and the concordance correlation coefficient (CCC). The CCC is defined as:
$$\mathrm{CCC} = \frac{2\rho\,\sigma_x\sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$$
where $\rho$ is the correlation coefficient between the predictions $x$ and the ground truth $y$, $\mu_x$ and $\mu_y$ denote their means, and $\sigma_x^2$ and $\sigma_y^2$ are the corresponding variances.
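As a rough illustration of the two metrics, the sketch below computes MSE and CCC for a pair of equal-length sequences. It uses the standard form of the CCC (twice the covariance, divided by the sum of the two variances and the squared difference of the means) with population statistics; the function names `mse` and `ccc` are illustrative and not taken from the challenge toolkit.

```python
def _mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def mse(pred, gold):
    """Mean squared error between predictions and ground truth."""
    return _mean([(p - g) ** 2 for p, g in zip(pred, gold)])


def ccc(pred, gold):
    """Concordance correlation coefficient, using population statistics.

    Undefined (division by zero) when both sequences are constant and equal.
    """
    mu_p, mu_g = _mean(pred), _mean(gold)
    var_p = _mean([(p - mu_p) ** 2 for p in pred])
    var_g = _mean([(g - mu_g) ** 2 for g in gold])
    cov = _mean([(p - mu_p) * (g - mu_g) for p, g in zip(pred, gold)])
    # 2 * rho * sigma_p * sigma_g equals 2 * cov, so rho is never formed explicitly.
    return 2.0 * cov / (var_p + var_g + (mu_p - mu_g) ** 2)
```

Unlike the plain correlation coefficient, the CCC penalizes a constant offset: predictions that track the ground truth perfectly but are shifted by a fixed amount score below 1.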
II-B System Architecture
Figure 1 shows the architecture of our proposed model. Our deep neural network model consists of three parts: (1) the subnetworks for each single modality; (2) the early fusion layer which concatenates three unimodal representations together; and (3) the final decision layer that estimates the sentiment.
II-B1 Visual Subnetwork
Visual features consist of OpenFace [baltruvsaitis2016openface] estimators on the whole frames, and the VGG face representation [parkhi2015deep] on facial regions. For the OpenFace features, we use the OpenFace toolkit to extract the estimated 68 facial landmarks in both 2D and 3D world coordinates, the eye gaze direction vector in 3D, head pose, rigid head shape, and Facial Action Unit intensities [ekman1978facial] indicating the facial muscle movements. Detailed feature descriptions can be found in [Tadas2018openface]. These visual descriptors are regarded as strong indicators of human emotions and sentiments [ranganathan2016multimodal, soleymani2012multimodal]. For the VGG face representation, the facial region in each frame is cropped and aligned using a 3D Constrained Local Model described in [baltruvsaitis20123d]. We zero out the background according to the face contour indicated by the facial landmarks. Then, the cropped faces are resized to 224×224×3 and fed into a VGG Face model pretrained on a large face dataset. We take the 4096-dimensional feature vectors from the fc6 layer, and concatenate them with the visual features extracted by OpenFace. The total dimension of the concatenated features is 4805.
For temporal modeling, the concatenated visual features from a single utterance are fed into an LSTM layer with 64 hidden units, followed by a dense layer with 256 hidden neurons. Specifically, 20 frames are uniformly sampled from each utterance and fed into the network for training and testing. For utterances with fewer than 20 frames, we duplicate the last frame to fill the gap.
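The frame selection step can be sketched as follows. The text only states that 20 frames are uniformly sampled and that the last frame is duplicated for short utterances; the exact index formula below (evenly spaced indices that include the first and last frame) and the function name `sample_frame_indices` are assumptions made for illustration.

```python
def sample_frame_indices(num_frames, num_samples=20):
    """Return num_samples frame indices for an utterance with num_frames frames.

    Long utterances: evenly spaced indices from the first to the last frame.
    Short utterances: every frame once, then the last frame repeated as padding.
    """
    if num_frames <= 0:
        raise ValueError("utterance has no frames")
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    if num_frames < num_samples:
        padding = [num_frames - 1] * (num_samples - num_frames)
        return list(range(num_frames)) + padding
    # Integer arithmetic keeps the first index at 0 and the last at num_frames - 1.
    return [(i * (num_frames - 1)) // (num_samples - 1) for i in range(num_samples)]
```

For example, an utterance with 5 frames yields indices 0 to 4 followed by fifteen copies of index 4, so every utterance produces a fixed-length sequence for the LSTM.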
II-B2 Audio Subnetwork
Audio features are extracted using the openSMILE toolkit [eyben2010opensmile], and we use the same feature set as suggested in the INTERSPEECH 2010 paralinguistics challenge [schuller2010interspeech]. The set contains Mel Frequency Cepstral Coefficients (MFCCs), ΔMFCC, loudness, pitch, jitter, etc. [emobase2010]. These features describe the prosodic patterns of different speakers and are consistent signs of their affective states. For each utterance sample, we extract 1582-dimensional features from the audio signal. These audio features are then fed into a fully connected layer with 256 units.
II-B3 Text Subnetwork
We use two opinion lexicons to analyze the patterns in language context. The first one is Bing Liu’s opinion Lexicon[ding2008holistic] with 2006 positive words and 4783 negative words. The second one is MPQA Subjectivity Lexicon[wilson2005recognizing] with 2718 positive words and 4913 negative words. For each utterance, we compute the frequency of positive and negative words according to the two lexicons, as well as the total word number in the whole utterance. For utterances without transcript, we replicate the transcript of the closest utterance in time. We also extract the word frequencies over the entire video, and assign them as features for all utterances in the same video. The total dimension of word feature is finally 10, including utterance-level and video-level word frequency from two lexicons and the total word counts. These text features are also fed into a fully connected layer with 256 units.
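The 10-dimensional text feature can be sketched as below. This is a minimal illustration, not the original implementation: whitespace tokenization, raw counts rather than normalized frequencies, and the tiny `TOY_LEXICONS` (stand-ins for the two real opinion lexicons) are all assumptions.

```python
# Hypothetical miniature lexicons: (positive words, negative words) per lexicon.
TOY_LEXICONS = [
    ({"good", "great"}, {"bad"}),
    ({"good"}, {"bad", "awful"}),
]


def lexicon_counts(tokens, lexicons):
    """Return [pos_1, neg_1, pos_2, neg_2, n_tokens] for one list of tokens."""
    feats = []
    for positive, negative in lexicons:
        feats.append(sum(tok in positive for tok in tokens))
        feats.append(sum(tok in negative for tok in tokens))
    feats.append(len(tokens))
    return feats


def text_features(utterance, video_utterances, lexicons):
    """Concatenate utterance-level and video-level counts (5 + 5 = 10 values)."""
    tokens = utterance.lower().split()
    video_tokens = [tok for u in video_utterances for tok in u.lower().split()]
    return lexicon_counts(tokens, lexicons) + lexicon_counts(video_tokens, lexicons)


video = ["This is good", "that was awful bad"]
features = text_features(video[0], video, TOY_LEXICONS)
```

The video-level half of the vector is identical for every utterance in the same video, which gives each utterance some context about the overall tone of the monologue.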
II-B4 Fusion and Decision Layers
We combine cues from the three modalities using an early fusion strategy. The aggregated feature vector is fully connected to a two-layer neural network with 1024 hidden units and a single output neuron, activated by a sigmoid (for the arousal task) or hyperbolic tangent function (for the valence task). We first use MSE as the loss function for joint training, and then apply the CCC loss for further refinement.
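A forward pass through the fusion and decision layers might look like the sketch below. It is written in plain Python for readability and uses tiny dimensions; in the described system the three inputs would be the 256-dimensional subnetwork outputs and the hidden layer would have 1024 units. The ReLU hidden activation, the weight initialization, and the names `init_params` and `early_fusion_predict` are assumptions, and dropout is omitted.

```python
import math
import random


def dense(inputs, weights, biases):
    """Affine layer: one output value per row of the weight matrix."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_params(input_dim, hidden_dim, seed=0):
    """Small random weights and zero biases (illustrative initialization)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w_hidden": matrix(hidden_dim, input_dim), "b_hidden": [0.0] * hidden_dim,
        "w_out": matrix(1, hidden_dim), "b_out": [0.0],
    }


def early_fusion_predict(visual, audio, text, params, task):
    """Concatenate the unimodal representations and map them to one score."""
    fused = list(visual) + list(audio) + list(text)  # early fusion by concatenation
    hidden = [max(0.0, h)  # ReLU is an assumption; the text does not name it
              for h in dense(fused, params["w_hidden"], params["b_hidden"])]
    (logit,) = dense(hidden, params["w_out"], params["b_out"])
    if task == "arousal":
        return 1.0 / (1.0 + math.exp(-logit))  # sigmoid keeps the output in (0, 1)
    if task == "valence":
        return math.tanh(logit)  # tanh keeps the output in (-1, 1)
    raise ValueError("task must be 'arousal' or 'valence'")
```

The choice of output activation mirrors the annotation ranges: arousal lies in [0, 1] and valence in [-1, 1], so neither task needs its predictions clipped afterwards.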
In comparison, we also design a late fusion strategy. In this case, we add a decision layer in each subnetwork and combine the 3 predictions using a linear regression trained by MSE.
|Method|Arousal CCC|Arousal MSE|Valence CCC|Valence MSE|
|Early Fusion (Fine Tuned)|0.400|0.058|0.353|0.136|
We trained and evaluated the multimodal network on the OMG dataset. The model was trained for at most 300 epochs. To prevent overfitting, we applied an early-stopping policy with a patience of 20 epochs, meaning that training stops once the validation loss has not decreased for 20 consecutive epochs, and we applied dropout with a ratio of 0.5 to each fully connected layer. The learning rate was set separately for the arousal task and for the valence task.
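The stopping rule can be sketched independently of any training framework. The function below takes a hypothetical list of per-epoch validation losses and returns the epoch at which training would end; the name `early_stopping_epoch` and the strict "less than" improvement test are assumptions.

```python
def early_stopping_epoch(val_losses, patience=20, max_epochs=300):
    """Return the 1-based epoch at which training stops.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs, or when `max_epochs` is reached.
    """
    best = float("inf")
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_losses), max_epochs)
```

In practice the weights from the best epoch, rather than the final one, would typically be kept for evaluation, although the text does not state which checkpoint was used.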
III-A Unimodal Approach
We first evaluated the performance of models trained on a single modality. For each unimodal model, the same decision layer introduced in Section II-B4 was deployed.
For the visual unimodal model, we investigated the effectiveness of the VGG-face and OpenFace features separately in an ablation test. The comparison results are shown in Table I. Our results demonstrated that VGG-face features outperformed OpenFace features under the same model architecture. Better performance on both the arousal and valence tasks was achieved when the two feature sets were fused.
For the audio network, we focused on studying the importance of temporal modeling within an utterance. We implemented another LSTM-based network for the audio modality. Specifically, we divided each audio file into frames of 0.5 seconds, and extracted openSMILE features for each frame. Those features were then fed into a 64-cell LSTM layer followed by the decision layer. We compared this LSTM-based model with our audio unimodal model described in Section II-B2. The results in Table I show that the model without the LSTM performs better than the audio model with the LSTM; the LSTM layer does not benefit the estimation.
For the text modality, we compared the proposed lexicon-based word-frequency approach with a model using pretrained word embeddings and an LSTM layer, a common approach in natural language processing (NLP). We implemented the latter, denoted Text (LSTM), using 100-dimensional GloVe word vectors pretrained on English Wikipedia [pennington2014glove] and a 64-cell LSTM layer. The results are shown in Table I. Surprisingly, the simple lexicon features performed better. We attribute this to the frequent transcription errors introduced by the automatic speech recognition tool used for this dataset. The opinion lexicon features are largely insensitive to these errors, because they count only the words that appear in the opinion lexicons.
III-B Multimodal Approach
We trained the trimodal network using the concatenated multimodal features. We compared the early and late fusion strategies in Table II. The results demonstrated that learning benefits more from the early fused representation. The performance is further improved by fine-tuning the system using the CCC loss. Table III compares the performance of our unimodal and multimodal systems with the baseline results. The trimodal model performs better than any of the unimodal models.
In this paper, we propose a multimodal system that utilizes visual, audio and text features to perform continuous affect prediction at the utterance level. An early feature fusion strategy is deployed, and the CCC loss is directly applied for network fine-tuning to boost the estimation performance. On the OMG dataset, both our unimodal and multimodal models outperform the baseline methods significantly. Our results show that cross-modal information greatly benefits the estimation of long-term affective states.