The organizers collected and annotated the One-Minute Gradual-Emotional Behavior dataset (OMG-Emotion dataset) specifically for the challenge. The dataset is composed of YouTube videos chosen through keywords based on long-term emotional behaviors such as "monologues", "auditions", "dialogues" and "emotional scenes". An annotator has to watch a whole video in sequence so that they take the contextual information into consideration before annotating the arousal and valence of each utterance of the video. The dataset provided by the organizers contains a train split of 231 videos composed of 2442 utterances and a validation split of 60 videos composed of 617 utterances. For each utterance, the gold arousal and valence levels are given.
Because context is taken into account during annotation, we propose a context-dependent architecture [Poria et al.2017] where the arousal and valence of an utterance are predicted according to the surrounding context. Our model consists of three successive stages:
A context-independent unimodal stage to extract linguistic, visual and acoustic features per utterance
A context-dependent unimodal stage to extract linguistic, visual and acoustic features per video
A context-dependent multimodal stage to make a final prediction per video
2.1 Context-independent Unimodal stage
Firstly, the unimodal features are extracted from each utterance separately. We use the mean square error as loss function:
$$\mathcal{L}_{mse} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$
where $n$ is the number of utterances predicted, $\hat{y}$ the prediction vector for arousal or valence and $y$ the ground truth vector.
Below, we explain the linguistic, visual and acoustic feature extraction methods.
2.1.1 Convolutional Neural Networks for Sentences
For each utterance, a transcription is given as a written sentence. We train a simple CNN with one convolutional layer [Kim2014] on top of word vectors obtained from an unsupervised neural language model [Mikolov et al.2013]. More precisely, we represent an utterance (here, a sentence) as a sequence of concatenated $d$-dimensional word2vec vectors. Each sentence is wrapped to a window of 50 words, which serves as the input to the CNN. Our model has one convolutional layer with three kernels of size 3, 4 and 2, with 30, 30 and 60 feature maps respectively. We then apply a max-over-time pooling operation over each feature map to capture its most important feature, the one with the highest value. Each kernel and max-pooling operation is interleaved with a ReLU activation function. Finally, a fully connected layer predicts both the arousal and valence of the utterance. We extract the 120-dimensional features of an utterance before this fully connected layer.
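The feature extraction described above can be sketched in a few lines of NumPy. This is only an illustrative forward pass with random weights, not the trained model: the embedding size of 300, the absence of bias terms and the helper name `conv_relu_maxpool` are assumptions made for the example. It shows how kernels of width 3, 4 and 2 with 30, 30 and 60 feature maps yield a 120-dimensional utterance vector.

```python
import numpy as np

def conv_relu_maxpool(sentence, kernels):
    """sentence: (T, d) word vectors; kernels: (n_maps, width, d).
    Returns one max-over-time feature per feature map, shape (n_maps,)."""
    n_maps, width, _ = kernels.shape
    T = sentence.shape[0]
    # Every window of `width` consecutive words: (n_windows, width, d).
    windows = np.stack([sentence[i:i + width] for i in range(T - width + 1)])
    # Dot product of each window with each kernel: (n_windows, n_maps).
    feature_maps = np.einsum("twk,mwk->tm", windows, kernels)
    feature_maps = np.maximum(feature_maps, 0.0)   # ReLU
    return feature_maps.max(axis=0)                # max-over-time pooling

rng = np.random.default_rng(0)
d = 300                                   # assumed embedding size
sentence = rng.normal(size=(50, d))       # a 50-word window
# (kernel width, number of feature maps), as given in the text.
kernel_specs = [(3, 30), (4, 30), (2, 60)]
kernels = [rng.normal(size=(n, w, d)) * 0.01 for w, n in kernel_specs]

features = np.concatenate([conv_relu_maxpool(sentence, K) for K in kernels])
print(features.shape)  # (120,)
```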
2.1.2 3D-CNN for visual input
In this section, we explain how we extract features from each utterance's video with a 3D-CNN [Ji et al.2013]. A video is a sequence of frames. The 3D convolution is achieved by convolving a 3D kernel with the cube formed by stacking multiple successive video frames together. By this construction, the feature maps in the convolution layer are connected to multiple frames in the previous layer and are therefore able to capture temporal information. In our experiments, we sample 32 equally spaced frames per video, so that every video in the dataset has the same number of frames. Our CNN consists of 2 convolutional layers of 32 filters each. Each layer is followed by two max-pooling layers. Afterwards, two fully connected layers map the CNN outputs to a predicted arousal and valence level. We extract the 128-dimensional features of an utterance before the final fully connected layer.
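As an illustration of the frame sampling step, the sketch below picks 32 equally spaced frames from a clip of arbitrary length. The function names and the toy frame size are hypothetical. The rounding strategy is one plausible choice and may differ from the one used in the experiments.

```python
import numpy as np

def sample_frame_indices(n_frames, n_samples=32):
    """Indices of `n_samples` equally spaced frames in a clip of `n_frames`."""
    return np.linspace(0, n_frames - 1, num=n_samples).round().astype(int)

def sample_clip(frames, n_samples=32):
    """frames: (n_frames, H, W, C) array; returns (n_samples, H, W, C)."""
    return frames[sample_frame_indices(len(frames), n_samples)]

# Toy clip of 250 tiny frames, for illustration only.
clip = np.zeros((250, 8, 8, 3), dtype=np.uint8)
sampled = sample_clip(clip)
print(sampled.shape)  # (32, 8, 8, 3)
```

Clips shorter than 32 frames would simply repeat some frames under this scheme.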
2.1.3 OpenSmile for audio input
For every utterance's video, we sample a Waveform Audio file at 16 kHz and use OpenSmile [Eyben et al.2010] to extract 6373 features with the IS13-ComParE configuration file. To reduce this number, we only select the $k$-best features based on univariate statistical regression tests where the arousal and valence levels are the targets. We pick the same $k$ for both the arousal and valence tests and merge the selected feature indexes together. We end up with 121 unique features per utterance.
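A minimal sketch of this selection step is given below. It ranks features by their squared Pearson correlation with the target, which gives the same ordering as a univariate linear regression F-test. The toy data, the value `k = 5` and the helper name `k_best_indices` are assumptions for the example, since the value of $k$ used in the experiments is not given here.

```python
import numpy as np

def k_best_indices(X, y, k):
    """Indices of the k features with the highest squared Pearson
    correlation with the target y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features get a correlation of 0 instead of a division by 0.
    r = (Xc * yc[:, None]).sum(axis=0) / np.where(denom == 0, 1.0, denom)
    return np.argsort(-(r ** 2))[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # toy stand-in for the 6373 features
arousal = X[:, 3] + 0.1 * rng.normal(size=200)
valence = X[:, 7] + 0.1 * rng.normal(size=200)

k = 5                                   # hypothetical value of k
selected = np.union1d(k_best_indices(X, arousal, k),
                      k_best_indices(X, valence, k))
X_reduced = X[:, selected]              # merged, de-duplicated feature set
print(X_reduced.shape[1] <= 2 * k)      # True
```

Because the two index sets may overlap, the merged set has between $k$ and $2k$ features, which is why the final count is not a round number.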
2.2 Context-dependent Unimodal stage
In this section, we stack the utterances video-wise for each modality. Let us consider a modality $m$ with utterance feature size $d_m$. A video $V_j$ is the sequence of utterance vectors $(u_1, u_2, \dots, u_{T_j})$ where $u_i \in \mathbb{R}^{d_m}$ and $T_j$ is the number of utterances in $V_j$. We now have a set of modality videos $\{V_1, V_2, \dots, V_M\}$ where $M$ is the number of videos in the dataset.
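The stacking can be pictured as grouping the per-utterance feature vectors of one modality by video. The sketch below assumes that each utterance comes with a video identifier and that utterances are already in temporal order. The identifiers and the function name are hypothetical.

```python
import numpy as np

def stack_by_video(features, video_ids):
    """Group utterance vectors (n_utterances, d) into one (T_j, d) matrix
    per video, preserving the utterance order."""
    videos = {}
    for vec, vid in zip(features, video_ids):
        videos.setdefault(vid, []).append(vec)
    return {vid: np.stack(vecs) for vid, vecs in videos.items()}

features = np.arange(5 * 4, dtype=float).reshape(5, 4)  # 5 utterances, d = 4
video_ids = ["a", "a", "b", "a", "b"]                   # hypothetical ids
videos = stack_by_video(features, video_ids)
print({vid: m.shape for vid, m in videos.items()})  # {'a': (3, 4), 'b': (2, 4)}
```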
In previous similar work [Poria et al.2017], the video matrix was the input of a bi-directional LSTM network to capture previous and following context. We argue that, especially if the video has many utterances, the context might be incomplete or inaccurate for a specific utterance. We tackle this problem by using self-attention (sometimes called intra-attention). This attention mechanism relates different positions of a single sequence in order to compute a representation of the sequence and has been successfully used in a variety of tasks [Parikh et al.2016, Lin et al.2017, Vaswani et al.2017]. More specifically, we use the "transformer" encoder with multi-head self-attention to compute our context-dependent unimodal video features.
2.2.1 Transformer encoder
The encoder is composed of a stack of N identical blocks. Each block has two layers. The first layer is a multi-head self-attention mechanism, and the second is a fully connected feed-forward network. Each layer employs a residual connection and is followed by a normalization layer. The output of each layer can be written as
$$\mathrm{LayerNorm}(x + \mathrm{layer}(x))$$
where layer(x) is the function implemented by the layer itself (multi-head attention or feed forward).
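The residual connection followed by normalization can be sketched as follows. This is a simplified illustration: the learned gain and bias of layer normalization are omitted, and the feed-forward sizes and random weights are assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position (last axis) to zero mean and unit variance.
    The learned gain and bias are omitted in this sketch."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, layer):
    """Residual connection around `layer`, followed by normalization."""
    return layer_norm(x + layer(x))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network with a ReLU in between."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))               # 6 utterances, toy feature size 16
W1, b1 = rng.normal(size=(16, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 16)), np.zeros(16)
out = residual_block(x, lambda t: feed_forward(t, W1, b1, W2, b2))
print(out.shape)  # (6, 16)
```

The residual form requires the layer to preserve the feature size, which is why the feed-forward network maps back to 16 dimensions here.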
2.2.2 Multi-Head attention
Let $d_k$ be the dimension of the queries and keys and $d_v$ the dimension of the values. The attention function computes the dot products of the query with all keys, divides each by $\sqrt{d_k}$, and applies a softmax function to obtain the weights on the values:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The transformer authors [Vaswani et al.2017] found it beneficial to linearly project the queries, keys and values $h$ times with different learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions. The output of the multi-head attention is the concatenation of the $h$ resulting values.
We pick .
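The attention computation can be illustrated with a small NumPy sketch of scaled dot-product attention and multi-head self-attention over the utterances of one video. All sizes below (6 utterances, feature size 16, 4 heads) are toy assumptions and not the values used in the experiments. The final output projection `W_o` follows [Vaswani et al.2017] and is not described in the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention; returns the output and the weights."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

def multi_head_self_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) projection matrices, one per head."""
    outputs = [attention(X @ W_q, X @ W_k, X @ W_v)[0]
               for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
T, d_model, h = 6, 16, 4                      # toy sizes, all assumptions
d_k = d_v = d_model // h
X = rng.normal(size=(T, d_model))             # one video of T utterances
heads = [tuple(rng.normal(size=(d_model, d)) for d in (d_k, d_k, d_v))
         for _ in range(h)]
W_o = rng.normal(size=(h * d_v, d_model))

out = multi_head_self_attention(X, heads, W_o)
_, weights = attention(X @ heads[0][0], X @ heads[0][1], X @ heads[0][2])
print(out.shape, weights.shape)  # (6, 16) (6, 6)
```

Each row of `weights` sums to one and tells how much every utterance of the video contributes to the representation of a given utterance, whatever their distance in the sequence.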
2.2.3 Dense output layer
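The output of each utterance's transformer goes through a last fully connected layer to predict both the arousal and valence levels. Because we make our prediction per video, we propose to include the concordance correlation coefficient in our loss function. We define $\mathcal{L}_{ccc} = 1 - \rho_c$ where
$$\rho_c = \frac{2\rho\sigma_{\hat{y}}\sigma_{y}}{\sigma_{\hat{y}}^2 + \sigma_{y}^2 + (\mu_{\hat{y}} - \mu_{y})^2}$$
is the concordance correlation coefficient between the predictions $\hat{y}$ and the ground truth $y$, with $\rho$ their Pearson correlation and $\mu$, $\sigma$ their means and standard deviations. We now want to minimize
$$\mathcal{L} = \mathcal{L}_{mse} + \mathcal{L}_{ccc}$$
for both the arousal and valence values, where $\mathcal{L}_{mse}$ is the mean square error of Section 2.1. In addition to leading to better results, we found that it gives the model more stability between evaluations and better reproducibility between runs.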
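A loss that includes the concordance correlation coefficient can be sketched as below. The unweighted sum of the mean square error and one minus the coefficient is an assumption made for the illustration. The function names and the toy values are hypothetical.

```python
import numpy as np

def ccc(pred, gold):
    """Concordance correlation coefficient between two 1-D arrays."""
    mu_p, mu_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()
    # The covariance equals rho * sigma_pred * sigma_gold.
    cov = ((pred - mu_p) * (gold - mu_g)).mean()
    return 2 * cov / (var_p + var_g + (mu_p - mu_g) ** 2)

def mse(pred, gold):
    return ((pred - gold) ** 2).mean()

def combined_loss(pred, gold):
    # Assumption: both terms are summed without weighting.
    return mse(pred, gold) + (1.0 - ccc(pred, gold))

gold = np.array([0.1, 0.4, 0.35, 0.8])
pred = np.array([0.2, 0.3, 0.5, 0.7])
print(combined_loss(pred, gold) > combined_loss(gold, gold))  # True
```

Unlike the mean square error alone, the coefficient penalizes predictions that follow the right trend but sit at the wrong mean or scale, which matches the metric used for evaluation.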
3 Context-dependent Multimodal stage
This section is similar to the previous one, except that we now have only one set of videos, where each video is composed of multimodal utterances. In our experiments, we tried two types of fusion.
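We simply concatenate the modalities utterance-wise. The utterance can be rewritten $u = u^{t} \oplus u^{a} \oplus u^{v}$, where $u^{t}$, $u^{a}$ and $u^{v}$ are its linguistic, acoustic and visual features and $\oplus$ denotes concatenation.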
Multimodal Compact Bilinear Pooling
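We would like each feature of each modality to interact with every feature of the other modalities. To do so, we would learn a model $W$ (here linear), i.e. $z = W[u^{t} \otimes u^{a} \otimes u^{v}]$, where $u^{t}$, $u^{a}$ and $u^{v}$ are the linguistic, acoustic and visual features of an utterance, $\otimes$ is the outer-product operation and $[\cdot]$ denotes linearizing the result into a vector. In our experiments, our modality feature sizes are 120, 128 and 121. If we want $z \in \mathbb{R}^{512}$, $W$ would have 951 million parameters ($120 \times 128 \times 121 \times 512 = 951{,}582{,}720$). A multimodal compact bilinear pooling model [Fukui et al.2016] can be learned by relying on the Count Sketch projection function [Charikar et al.2002] to project the outer product to a lower dimensional space, which reduces the number of parameters in $W$.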
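The idea behind compact bilinear pooling can be sketched as follows: each modality vector is projected with a Count Sketch, and the sketch of the outer product is obtained as the circular convolution of the individual sketches, computed as an element-wise product in the Fourier domain. This is a minimal NumPy illustration with random hashes and random inputs. The output size of 512 is inferred from the parameter count quoted above, and the function names are hypothetical.

```python
import numpy as np

def make_sketch(n, d, rng):
    """Random hash h: [n] -> [d] and random signs s: [n] -> {-1, +1}."""
    return rng.integers(0, d, size=n), rng.choice([-1.0, 1.0], size=n)

def count_sketch(x, h, s, d):
    """Count Sketch projection: out[h[i]] += s[i] * x[i]."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)
    return out

def compact_bilinear(vectors, sketches, d):
    """Sketch of the outer product of `vectors`, computed as the circular
    convolution of their Count Sketches via the FFT."""
    spectrum = np.ones(d // 2 + 1, dtype=complex)
    for x, (h, s) in zip(vectors, sketches):
        spectrum *= np.fft.rfft(count_sketch(x, h, s, d))
    return np.fft.irfft(spectrum, n=d)

rng = np.random.default_rng(0)
sizes, d = (120, 128, 121), 512        # modality sizes and assumed output size
full_params = sizes[0] * sizes[1] * sizes[2] * d
print(full_params)                     # 951582720

vectors = [rng.normal(size=n) for n in sizes]
sketches = [make_sketch(n, d, rng) for n in sizes]
fused = compact_bilinear(vectors, sketches, d)
print(fused.shape)                     # (512,)
```

The hashes and signs are drawn once and kept fixed, so the explicit outer product of size 1,858,560 is never formed and only the 512-dimensional fused vector is passed to the following layers.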
We report our preliminary results in terms of the concordance correlation coefficient metric.
|Model|CCC|
|Monomodal feature extraction||
|Text - CNN|0.165|
|Audio - OpenSmile|0.150|
|Video - 3DCNN|0.186|
|T + A + V|0.274|
|T + A + V + CBP|0.301|
- [Charikar et al.2002] Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, pages 693–703. Springer.
- [Eyben et al.2010] Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia, pages 1459–1462. ACM.
- [Fukui et al.2016] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847.
- [Ji et al.2013] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2013. 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231.
- [Kim2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- [Lin et al.2017] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
- [Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- [Parikh et al.2016] Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.
- [Poria et al.2017] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. 2017. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 873–883.
- [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.