In this paper, we comprehensively describe the methodology of our submissions to the One-Minute Gradual-Emotion Behavior Challenge (OMG-Emotion). Section II introduces the representations of video and audio that we use as input to the deep networks. The model architectures are described in Section III, followed by the results in Section IV and the conclusion in Section V. Source code for this paper is available at https://github.com/pengsongyou/OMG-ADSC.
In our two submissions, our models use either visual input only or both visual and acoustic input. This section details how we preprocess these two modalities from the provided OMG-Emotion dataset.
Since the audio files are not provided separately, we first convert all snippets to WAV files, each of which is single-channel and sampled at 16 kHz. Similar to prior work, spectrograms are then calculated every 10 ms with a sliding Hamming window of width 25 ms and a 512-point FFT. We assume that 3 seconds of the audio signal contain emotion information, so 3 seconds of audio are taken to compute the short-time Fourier transform (STFT) spectrum. Since each frequency bin in the spectrum is a complex number, we obtain an STFT map of depth 2, where the two channels hold the real and imaginary parts of the acquired STFT values.
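The preprocessing above can be sketched as follows. This is a minimal NumPy illustration under the stated parameters (16 kHz, 25 ms Hamming window, 10 ms hop, 512-point FFT, 3 s clips); the function name and framing details are ours, not from the released code.

```python
import numpy as np

def stft_map(signal, sr=16000, win_ms=25, hop_ms=10, n_fft=512):
    """Turn a mono waveform into a 2-channel (real/imag) STFT map."""
    win = sr * win_ms // 1000      # 400 samples per window
    hop = sr * hop_ms // 1000      # 160 samples between windows
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, n=n_fft, axis=1)   # (n_frames, 257), complex
    # stack real and imaginary parts as two depth channels
    return np.stack([spec.real, spec.imag], axis=0)
```

For a 3-second clip (48,000 samples) this yields a map with 2 channels, 298 time frames, and 257 frequency bins.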
With the preprocessed video and audio data, we design and apply the following deep networks to the arousal-valence regression problem. The overall architecture is shown in Figure 1.
We use a rather straightforward network for the audio stream. The STFT map is input to the base network VGG-16 pretrained on ImageNet. Since the depth dimension of an STFT map is 2 rather than 3, we modify the first layer of VGG-16 accordingly. The output feature is then fed into two fully-connected (FC) layers with dropout in between.
If we train the network solely on audio, another FC layer and a Tanh function are applied to obtain the final arousal and valence values.
Because the length of the snippets in the dataset varies widely, the number of extracted frames may differ greatly among snippets. To fully utilize the temporal information within a feasible GPU memory budget, we first sparsely sample frames from a snippet: inspired by prior work on segment-based sampling, we divide a snippet into consecutive segments, from each of which a single frame is randomly sampled.
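This segment-based sparse sampling can be sketched as below; the function name is ours, and 16 segments follows the mini-batch setting reported in the implementation details.

```python
import random

def sample_frames(n_frames, n_segments=16):
    """Split a snippet's frame indices into n_segments consecutive chunks
    and draw one random frame index from each chunk (a sketch)."""
    bounds = [round(i * n_frames / n_segments) for i in range(n_segments + 1)]
    # max(lo + 1, hi) guards against empty chunks in very short snippets
    return [random.randrange(lo, max(lo + 1, hi))
            for lo, hi in zip(bounds, bounds[1:])]
```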
For each of the selected frames, an intermediate feature (dim = 512) is obtained from SphereFace. These features are then fed into a bidirectional LSTM. Finally, a temporal average pooling layer followed by an FC layer and Tanh is employed to obtain the two emotion scores.
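The temporal head of the video network can be sketched as follows, assuming 512-dim per-frame SphereFace features; the LSTM hidden size is an assumption not stated in the paper.

```python
import torch
import torch.nn as nn

class VNetHead(nn.Module):
    """Video temporal head (sketch): per-frame SphereFace features ->
    bidirectional LSTM -> temporal average pooling -> FC + Tanh."""
    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 2)

    def forward(self, feats):            # feats: (batch, n_segments, 512)
        seq, _ = self.lstm(feats)        # (batch, n_segments, 2 * hidden)
        pooled = seq.mean(dim=1)         # temporal average pooling
        return torch.tanh(self.fc(pooled))  # arousal, valence in [-1, 1]
```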
To train ANet and VNet jointly, we design the following scheme. We start from the VNet and ANet trained separately beforehand. With VNet kept the same, we sample several STFT maps from every snippet and average their outputs at the penultimate FC layer of ANet. We then simply concatenate the ANet and VNet features and feed them into another FC layer followed by Tanh. Figure 1 illustrates the architecture of the joint training.
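The fusion step described above can be sketched as a small module; the feature dimensions here are placeholders, since the paper does not state them.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Joint head (sketch): average the penultimate ANet features over the
    sampled STFT maps, concatenate with the VNet feature, then FC + Tanh."""
    def __init__(self, a_dim=256, v_dim=256):
        super().__init__()
        self.fc = nn.Linear(a_dim + v_dim, 2)

    def forward(self, a_feats, v_feat):
        # a_feats: (batch, n_maps, a_dim)  penultimate ANet features
        # v_feat:  (batch, v_dim)          VNet feature
        a_mean = a_feats.mean(dim=1)       # average over sampled STFT maps
        fused = torch.cat([a_mean, v_feat], dim=1)
        return torch.tanh(self.fc(fused))
```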
The architectures are implemented in PyTorch. In joint training, the learning rate is decreased by a factor of 10 every 7 epochs. For each mini-batch, the number of sampled STFT maps and the number of sampled frames per snippet are set to 4 and 16, respectively. The batch size is 6. We also use gradient clipping when the gradient norm exceeds 20. With one NVIDIA GTX TITAN X, one epoch takes around 7 minutes.
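These training details map onto standard PyTorch utilities, as in the minimal loop below; the model, initial learning rate, and loss here are placeholders, since only the schedule, clipping threshold, and batch size are stated in the paper.

```python
import torch

model = torch.nn.Linear(8, 2)                       # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=1e-3)  # placeholder initial LR
# decay the learning rate by 10x every 7 epochs
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=7, gamma=0.1)

for epoch in range(14):
    out = model(torch.randn(6, 8))                  # batch size 6
    loss = out.pow(2).mean()                        # placeholder loss
    opt.zero_grad()
    loss.backward()
    # clip gradients when their norm exceeds 20
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=20)
    opt.step()
    sched.step()
```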
One important difference between joint training and training VNet and ANet separately is the loss function: MSE loss is employed for separate training, while a CCC-based loss is used for joint training.
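A common form of the CCC-based loss is 1 minus the Concordance Correlation Coefficient; a minimal PyTorch version is sketched below (the function name is ours, and the exact formulation in our code may differ).

```python
import torch

def ccc_loss(pred, gold):
    """1 - CCC, where CCC = 2*cov(x, y) / (var(x) + var(y) + (mean_x - mean_y)^2)."""
    pm, gm = pred.mean(), gold.mean()
    pv = pred.var(unbiased=False)
    gv = gold.var(unbiased=False)
    cov = ((pred - pm) * (gold - gm)).mean()
    ccc = 2 * cov / (pv + gv + (pm - gm) ** 2)
    return 1 - ccc   # 0 for perfect agreement, up to 2 for perfect disagreement
```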
In this section, we briefly report the performance of our models against the provided baseline methods in terms of CCC.
Table I compares our ANet with a baseline pretrained on RAVDESS and another based on OpenSMILE features. It shows that even without pretraining on any audio dataset, our ANet still outperforms the baselines in both arousal and valence scores.
[Table II, excerpt] Video Baseline: 0.12 | 0.23 | 0.35
Finally, Table III illustrates the effectiveness of jointly training the networks for both the audio and video streams; further performance improvement is achieved with this training scheme.
This paper describes a novel architecture for arousal-valence estimation on the OMG-Emotion Dataset. We have shown the advantage of our deep network over the baseline methods.
W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, volume 1, 2017.