Abstract: UMONS submission for the OMG-Emotion Challenge

by Delbrouck Jean-Benoit, et al.

This paper describes the UMONS solution for the OMG-Emotion Challenge. We explore a context-dependent architecture where the arousal and valence of an utterance are predicted according to its surrounding context (i.e. the preceding and following utterances of the video). We report an improvement when taking into account context for both unimodal and multimodal predictions.




1 Data

The organizers specially collected and annotated the One-Minute Gradual-Emotional Behavior dataset (OMG-Emotion dataset) for the challenge. The dataset is composed of YouTube videos chosen through keywords related to long-term emotional behaviors such as "monologues", "auditions", "dialogues" and "emotional scenes". Annotators watch a whole video in sequence so that they take the contextual information into consideration before annotating the arousal and valence of each utterance. The dataset provided by the organizers contains a train split of 231 videos composed of 2442 utterances and a validation split of 60 videos composed of 617 utterances. For each utterance, the gold arousal and valence levels are given.

2 Architecture

Because context is taken into account during annotation, we propose a context-dependent architecture [Poria et al.2017] where the arousal and valence of an utterance are predicted according to its surrounding context. Our model consists of three successive stages:

  • A context-independent unimodal stage to extract linguistic, visual and acoustic features per utterance

  • A context-dependent unimodal stage to extract linguistic, visual and acoustic features per video

  • A context-dependent multimodal stage to make a final prediction per video

2.1 Context-independent Unimodal stage

Firstly, the unimodal features are extracted from each utterance separately. We use the mean square error as loss function:

$$\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

where $N$ is the number of utterances predicted, $\hat{y}$ the prediction vector for arousal or valence and $y$ the ground truth vector.
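As a plain-numpy sanity check of the loss (the example values are ours), the mean square error over a batch of utterance predictions can be computed as:

```python
import numpy as np

y_true = np.array([0.2, -0.1, 0.5])   # gold arousal (or valence) per utterance
y_pred = np.array([0.1,  0.0, 0.4])   # model predictions for the same utterances

# mean square error over the N predicted utterances
mse = np.mean((y_true - y_pred) ** 2)  # here 0.01
```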

Below, we explain the linguistic, visual and acoustic feature extraction methods.

2.1.1 Convolutional Neural Networks for Sentences

For each utterance, a transcription is given as a written sentence. We train a simple CNN with one layer of convolution [Kim2014] on top of word vectors obtained from an unsupervised neural language model [Mikolov et al.2013]. More precisely, we represent an utterance (here, a sentence) as a sequence of concatenated $d$-dimensional word2vec vectors. Each sentence is wrapped to a window of 50 words, which serves as the input to the CNN. Our model has one convolutional layer with three kernels of size 3, 4 and 2 with 30, 30 and 60 feature maps respectively. We then apply a max-over-time pooling operation over each feature map and capture the most important feature, the one with the highest value, for each feature map. Each kernel and max-pooling operation is interleaved with a ReLU activation function. Finally, a fully connected layer maps the resulting 120-dimensional vector to the predicted arousal and valence of the utterance. We extract the 120-dimensional features of an utterance before this final operation.
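A minimal numpy sketch of this pipeline, with randomly initialized weights and an assumed word2vec dimension of 300 (the paper does not state it), illustrates how the three kernels and max-over-time pooling yield the 120-dimensional utterance feature:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 300, 50                 # word2vec size (our assumption) and 50-word window

def conv_relu(X, W):
    """1D convolution over the word axis followed by ReLU.
    X: (L, d) sentence matrix, W: (k, d, f) kernel with f feature maps."""
    k = W.shape[0]
    out = np.stack([np.tensordot(X[i:i + k], W, axes=([0, 1], [0, 1]))
                    for i in range(X.shape[0] - k + 1)])
    return np.maximum(out, 0.0)

X = rng.normal(size=(L, d))    # one sentence as stacked word vectors
pooled = []
for k, f in [(3, 30), (4, 30), (2, 60)]:          # kernel sizes / feature maps
    W = 0.1 * rng.normal(size=(k, d, f))
    pooled.append(conv_relu(X, W).max(axis=0))    # max-over-time pooling
utterance_feat = np.concatenate(pooled)           # 30 + 30 + 60 = 120-d feature

W_out = 0.1 * rng.normal(size=(120, 2))
arousal_valence = utterance_feat @ W_out          # final fully connected layer
```

The 120-dimensional `utterance_feat` is what the later context-dependent stage consumes.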

2.1.2 3D-CNN for visual input

In this section, we explain how we extract features from each utterance's video with a 3D-CNN [Ji et al.2013]. A video is a sequence of frames. The 3D convolution is achieved by convolving a 3D kernel over the cube formed by stacking multiple successive video frames together. By this construction, each feature map in the convolution layer is connected to multiple frames in the previous layer and is therefore able to capture temporal information. In our experiments, we sample 32 equally inter-spaced frames per video, so that each video in the dataset has the same temporal length. Our CNN consists of 2 convolutional layers of 32 filters each, each followed by a max-pooling layer. Afterwards, two fully connected layers map the CNN outputs to a predicted arousal and valence level. We extract the 128-dimensional features of an utterance before this final operation.
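A numpy sketch of the two conv/pool stages follows; the frame size (32×32), grayscale input, 3×3×3 kernels and 2×2×2 pooling are our own assumptions, since those values are not stated above:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
T, H, W = 32, 32, 32                   # 32 sampled frames; spatial size is assumed
video = rng.normal(size=(T, H, W, 1))  # single (grayscale) channel for the sketch

def conv3d_relu(x, kernels):
    """Valid 3D convolution over (time, height, width), then ReLU.
    x: (T, H, W, C), kernels: (F, kt, kh, kw, C)."""
    kt, kh, kw = kernels.shape[1:4]
    win = sliding_window_view(x, (kt, kh, kw), axis=(0, 1, 2))
    out = np.tensordot(win, kernels, axes=([3, 4, 5, 6], [4, 1, 2, 3]))
    return np.maximum(out, 0.0)

def max_pool3d(x, p=2):
    """Non-overlapping p×p×p max pooling over the first three axes."""
    t, h, w, c = x.shape
    x = x[: t - t % p, : h - h % p, : w - w % p]
    return x.reshape(t // p, p, h // p, p, w // p, p, c).max(axis=(1, 3, 5))

k1 = 0.1 * rng.normal(size=(32, 3, 3, 3, 1))    # first layer: 32 filters
k2 = 0.1 * rng.normal(size=(32, 3, 3, 3, 32))   # second layer: 32 filters
feat = max_pool3d(conv3d_relu(max_pool3d(conv3d_relu(video, k1)), k2)).ravel()

W1 = 0.01 * rng.normal(size=(feat.size, 128))
W2 = 0.1 * rng.normal(size=(128, 2))
hidden = np.maximum(feat @ W1, 0.0)   # the 128-d utterance representation
arousal_valence = hidden @ W2
```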

2.1.3 OpenSmile for audio input

For every utterance's video, we sample a Waveform Audio file at 16 kHz and use OpenSmile [Eyben et al.2010] to extract 6373 features from the IS13-ComParE configuration file. To reduce this number, we only select the $k$-best features based on univariate statistical regression tests where the arousal and valence levels are the targets. We run the test for both arousal and valence and merge the selected feature indexes together, ending up with 121 unique features per utterance.
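The selection-and-merge step can be sketched in numpy as below; the data is random, $k=100$ is our own placeholder, and the score is the squared correlation with the target (the statistic underlying a univariate regression F-test):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 6373, 100          # utterances, OpenSmile features, assumed k
X = rng.normal(size=(n, d))       # stand-in for the IS13-ComParE feature matrix
arousal = rng.normal(size=n)
valence = rng.normal(size=n)

def k_best(X, y, k):
    """Rank features by squared correlation with the target, keep the top k."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
    return set(np.argsort(r ** 2)[-k:])

# merge the index sets selected for each target
selected = k_best(X, arousal, k) | k_best(X, valence, k)
X_small = X[:, sorted(selected)]
```

The union contains between $k$ and $2k$ indexes, depending on how much the two selections overlap.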

2.2 Context-dependent Unimodal stage

In this section, we stack the utterances video-wise for each modality. Let us consider a modality $m$ with utterance feature size $d_m$: a video $V_j$ is the sequence of utterance vectors $[u_1, u_2, \ldots, u_L]$ where $u_i \in \mathbb{R}^{d_m}$ and $L$ is the number of utterances in $V_j$. We now have, for each modality, a set of videos $\{V_1, \ldots, V_M\}$ where $M$ is the number of videos in the dataset.

In previous similar work [Poria et al.2017], the video matrix was the input of a bi-directional LSTM network that captures the preceding and following context. We argue that, especially if the video has many utterances, this context might be incomplete or inaccurate for a specific utterance. We tackle the problem by using self-attention (sometimes called intra-attention). This attention mechanism relates different positions of a single sequence in order to compute a representation of the sequence, and has been successfully used in a variety of tasks [Parikh et al.2016, Lin et al.2017, Vaswani et al.2017]. More specifically, we use the "transformer" encoder with multi-head self-attention to compute our context-dependent unimodal video features.

Figure 1: Overview of the Context-dependent Unimodal stage. Each utterance’s arousal and valence level are predicted through a whole video

2.2.1 Transformer encoder

The encoder is composed of a stack of $N$ identical blocks. Each block has two layers. The first layer is a multi-head self-attention mechanism, and the second is a fully connected feed-forward network. Each layer is followed by a normalization layer and employs a residual connection. The output of each layer can thus be written as

$$\mathrm{output} = \mathrm{LayerNorm}(x + \mathrm{layer}(x))$$

where $\mathrm{layer}(x)$ is the function implemented by the layer itself (multi-head attention or feed-forward).
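The residual-plus-normalization wrapper is a few lines of numpy; here we plug in a toy feed-forward sublayer (the sizes are our own):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def sublayer(x, layer):
    """Residual connection followed by layer normalization, as in the encoder block."""
    return layer_norm(x + layer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 64))                       # 5 utterances, 64-d features
W1, W2 = rng.normal(size=(64, 256)), rng.normal(size=(256, 64))
ff = lambda h: np.maximum(h @ W1, 0.0) @ W2        # position-wise feed-forward layer
out = sublayer(x, ff)
```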

2.2.2 Multi-Head attention

Let $d_k$ be the queries and keys dimension and $d_v$ the values dimension. The attention function computes the dot products of the query with all keys, divides each by $\sqrt{d_k}$, and applies a softmax function to obtain the weights on the values:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The authors found it beneficial to linearly project the queries, keys and values $h$ times with different learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions respectively. The output of the multi-head attention is the concatenation of the $h$ resulting values.

We pick .
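A compact numpy sketch of multi-head self-attention over one video's utterance sequence (the sizes $h=4$ and $d_{model}=64$ are illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_model, h = 5, 64, 4          # utterances per video, feature size, heads (assumed)
d_k = d_v = d_model // h

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

x = rng.normal(size=(L, d_model))  # the sequence attends to itself (self-attention)
heads = []
for _ in range(h):                 # one learned projection triple per head
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(x @ Wq, x @ Wk, x @ Wv))
# concatenate the h head outputs and project back to the model dimension
out = np.concatenate(heads, axis=-1) @ rng.normal(size=(h * d_v, d_model))
```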

2.2.3 Dense output layer

The output of each utterance's transformer goes through a last fully connected layer to predict both the arousal and valence level. Because we make our prediction per video, we propose to include the concordance correlation coefficient (CCC) in our loss function, defined as

$$\rho_c = \frac{2\rho\sigma_{\hat{y}}\sigma_{y}}{\sigma_{\hat{y}}^2 + \sigma_{y}^2 + (\mu_{\hat{y}} - \mu_{y})^2}$$

where $\mu$ and $\sigma$ denote the mean and standard deviation of the predictions $\hat{y}$ and the ground truth $y$, and $\rho$ is the Pearson correlation coefficient between them. We now want to minimize $1 - \rho_c$ for both the arousal and valence values. In addition to leading to better results, we found this loss to give the model more stability between evaluations and better reproducibility between runs.
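The CCC-based loss is straightforward to compute in numpy (using $2\rho\sigma_{\hat{y}}\sigma_{y} = 2\,\mathrm{cov}(\hat{y}, y)$); a perfect prediction gives a loss of 0:

```python
import numpy as np

def ccc_loss(y_pred, y_true):
    """1 - concordance correlation coefficient over one video's utterances.
    Note 2*rho*sigma_pred*sigma_true equals twice the covariance."""
    mu_p, mu_t = y_pred.mean(), y_true.mean()
    var_p, var_t = y_pred.var(), y_true.var()
    cov = ((y_pred - mu_p) * (y_true - mu_t)).mean()
    ccc = 2 * cov / (var_p + var_t + (mu_p - mu_t) ** 2)
    return 1.0 - ccc

gold = np.array([0.1, -0.2, 0.4, 0.0])   # toy per-utterance arousal values
# ccc_loss(gold, gold) is 0; any mean shift or decorrelation increases the loss
```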

3 Context-dependent Multimodal stage

This section is similar to the previous one, except that we now have only one set of videos where each video is composed of multimodal utterances. In our experiments, we tried two types of fusion.

  1. Concatenation
    We simply concatenate the modality features utterance-wise: the utterance can be rewritten as $u_i = [u_i^t ; u_i^a ; u_i^v]$, where $[\cdot \,;\, \cdot]$ denotes concatenation.

  2. Multimodal Compact Bilinear Pooling
    We would like each feature of each modality to interact with every other. We could learn a (here linear) model $W$, i.e. $z = W \, \mathrm{vec}(u^t \otimes u^a \otimes u^v)$, where $\otimes$ is the outer-product operation and $\mathrm{vec}(\cdot)$ denotes linearizing the resulting tensor into a vector. In our experiments, our modality feature sizes are 120, 128 and 121. If we wanted an output of size 512, $W$ would have about 951 million parameters ($120 \times 128 \times 121 \times 512$). A multimodal compact bilinear pooling model [Fukui et al.2016] can instead be learned by relying on the Count Sketch projection function [Charikar et al.2002] to project the outer product to a lower-dimensional space, which reduces the number of parameters in $W$.
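To illustrate the Count Sketch trick for two of the modalities (the sketch dimension $d = 1024$ is our own choice), the sketch of an outer product can be computed without ever forming the outer product, as the circular convolution of the individual sketches via the FFT:

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, d = 120, 128, 1024        # two modality sizes; sketch dimension is assumed

# Count Sketch: each input index gets a random output bucket h and a random sign s
h1, s1 = rng.integers(0, d, d1), rng.choice([-1, 1], d1)
h2, s2 = rng.integers(0, d, d2), rng.choice([-1, 1], d2)

def count_sketch(x, h, s, d):
    """Project x to d dimensions by hashing indexes into signed buckets."""
    out = np.zeros(d)
    np.add.at(out, h, s * x)
    return out

x, y = rng.normal(size=d1), rng.normal(size=d2)   # two modality feature vectors

# Compact bilinear pooling: the sketch of the outer product x (outer) y equals the
# circular convolution of the two sketches, computed here in the frequency domain
phi = np.fft.irfft(np.fft.rfft(count_sketch(x, h1, s1, d)) *
                   np.fft.rfft(count_sketch(y, h2, s2, d)), n=d)
```

`phi` is a 1024-dimensional stand-in for the 120×128 = 15360-dimensional pairwise outer product; fusing the third modality repeats the same operation.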

4 Results

We report our preliminary results in terms of the concordance correlation coefficient (CCC) metric.

Model                          Mean CCC
Monomodal feature extraction
  Text - CNN                   0.165
  Audio - OpenSmile            0.150
  Video - 3DCNN                0.186
Contextual monomodal
  Text                         0.220
  Audio                        0.223
  Video                        0.227
Contextual multimodal
  T + A + V                    0.274
  T + A + V + CBP              0.301