
Multimodal End-to-End Group Emotion Recognition using Cross-Modal Attention

by   Lev Evtodienko, et al.

Classifying group-level emotions is a challenging task due to the complexity of video, in which not only visual but also audio information should be taken into consideration. Existing works on multimodal emotion recognition use a bulky two-stage approach, where pretrained neural networks serve as feature extractors and the extracted features are then fused. However, this approach does not consider the attributes of multimodal data, and the feature extractors cannot be fine-tuned for the specific task, which can be disadvantageous for overall model accuracy. To this end, our contribution is twofold: (i) we train the model end-to-end, which allows the early layers of the neural network to adapt while taking into account the later fusion layers of the two modalities; (ii) all layers of our model are fine-tuned for the downstream task of emotion recognition, so there is no need to train neural networks from scratch. Our model achieves a best validation accuracy of 60.37%, outperforming the baseline and remaining competitive with existing works that use audio and video modalities.



1 Introduction

Emotion recognition is a difficult and important task. Understanding emotions in groups of people is vital not only for every individual in a group, but also for people of different backgrounds and cultures. Moreover, knowledge about the common emotion of a group can be of interest for business, learning, healthcare, surveillance and robotics.

Early affective computing focused on individuals, while in recent years more research has addressed groups of people in raw, uncontrolled data (i.e. "in the wild") [7]. Results obtained on this type of data transfer more easily to real-world settings such as surveillance cameras or videos from the internet.

The dataset used for training was presented in the EmotiW 2020 Audio-Video Group Emotion Recognition grand challenge [8]; the exact track is Video-level Group AFfect (VGAF). The task of this competition is to classify group emotion into three categories: negative, neutral and positive. The biggest challenges of this dataset are varying lighting conditions, languages, video quality, frame rates, occlusions and overlapping people. One approach to handling such data is to use only one modality [22], [26]. Another option is two-stage models, whose stages are feature extraction and modality fusion respectively.

Although unimodal models show decent results, when one of them is given poor input information the problem cannot be compensated by another modality, which hurts performance. As for two-stage models, fixed feature extractors cannot be fine-tuned for a specific task using information from the low-level fusion layers.

To address these issues we propose a model with the following features:

  • The model is trained fully end-to-end, compensating for the missing information about modality interaction. Moreover, if one modality does not carry much useful information, the other can mitigate this problem.

  • We do not freeze any layers of our model, which allows it to be fully optimized for the given task and helps it achieve solid results while effectively reusing models from different domains.

Figure 1: Proposed end-to-end architecture for group emotion recognition.

2 Related work

Automatic human emotion recognition has been a topic of active research for nearly a decade [23], [31], [4]. Early works on multimodal learning proposed several techniques, such as early fusion [20] and late fusion [37]. While the ideas behind these techniques are simple, they show decent results and are still widely adopted for multimodal tasks. Recently, researchers have been working not only on fusion techniques, but also on multimodal architectures in which pairs of different modalities are fed to the network, for instance Tensor Fusion Networks [36], LXMERT [30], ClipBERT [18], VATT [1] and ViLT [15]. Owing to the increase in computational power in recent years, much work has applied multimodal learning to different areas, such as question answering [29], image captioning [17], emotion recognition [5] and affect recognition [2].

Previous results on the VGAF [8] dataset were obtained using two-stage models: first, features were extracted with fixed models, and then late fusion was used to combine the extracted features [14], [19], [35], [21], [27], [28], [24]. The best result on this dataset was achieved by the winners of the Audio-Visual Group Emotion Recognition challenge of EmotiW 2020 [19], whose team used 5 different modalities and reached 74.28% on the validation set.

3 Dataset

The VGAF dataset [8] contains 2661 video clips for training and 766 for validation. The dataset was collected from YouTube using tags such as "fight", "holiday" or "protest", whose videos characterize different emotions. Each video was cropped into a 5-second clip. The data contains 3 classes – Positive, Neutral and Negative – corresponding to the 3 emotion categories. The challenges of the dataset are the different context of every video and the varying resolutions, frame rates and multilingual audio, which are a serious obstacle for the vast majority of available models. Moreover, since there are no labels indicating which language is spoken in a video, it is impossible to collect an additional text transcription (modality) using automatic tools.

4 Methodology

4.1 Encoders

We describe the problem as follows. Let X = {(v_i, a_i)}_{i=1}^{I} be a sample of data, where I is the number of multimodal samples, v_i is the sequence of RGB video frames and a_i is the raw audio of a given sample.

First, we extract 8 equally spaced frames. The vision encoder accepts the sequence of extracted frames v ∈ R^{C×T×H×W}, where C, T, H, and W are the number of channels, temporal size, height, and width, respectively. For the vision encoder we use an approach inspired by ClipBERT [18]: we apply mean-pooling (we denote this operation M) to aggregate the temporal information of the frame sequence, which is an inexpensive way to make use of temporal information. We use a pretrained 2D ResNet101d [11] for feature extraction; the backbone architectures we experimented with are listed in Table 1. We decided against 3D CNNs, since they greatly increase training time without an advantage in accuracy [33], [25]. To pass the extracted feature maps on to the attention layers, we flatten the embedding over its last two dimensions and feed the resulting vector to a projection layer (denoted PL in Fig. 1), which projects the embeddings of the different modalities into a common space. We also considered a larger projection module, with a GeLU [12] activation function and an additional projection, instead of the single projection layer, but it led to faster overfitting and an approximately 3% drop in accuracy.

Backbone Accuracy
ResNet101d [11] 60.37%
ResNet50d [11] 58.44%
RexNet100 [9] 57.98%
ResNet50 [10] 56.83%
Table 1: Various vision encoder backbones and overall model validation accuracy.
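As an illustration, the mean-pooling (M), flattening and projection (PL) steps of the vision encoder can be sketched in NumPy. All shapes and weights below are toy, randomly initialized stand-ins, not those of the actual ResNet101d features or the learned projection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-frame CNN feature maps; in the model these come from a
# pretrained 2D ResNet101d. T = 8 sampled frames; C, H, W are toy sizes.
T, C, H, W = 8, 16, 7, 7
frame_features = rng.standard_normal((T, C, H, W))

# M: mean-pool over the temporal axis to aggregate the frame sequence.
pooled = frame_features.mean(axis=0)          # (C, H, W)

# Flatten the last two (spatial) dimensions.
flat = pooled.reshape(C, H * W)               # (C, H*W)

# PL: a linear projection into the common multimodal space
# (random weights here; learned in the actual model).
common_dim = 32
W_proj = rng.standard_normal((H * W, common_dim))
video_embedding = flat @ W_proj               # (C, common_dim)
```

The mean over T is what keeps the temporal aggregation cheap: a single pass over the frame features, with no recurrent or 3D-convolutional machinery.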

We extract the raw audio data at a sampling rate of 16000 Hz and pass it to HuBERT [13]. We chose this model because it was trained in a self-supervised manner, which provides more robust representations, and because it can handle multilingual data. To pass the embeddings on to the self-attention layer, a projection layer (shown in Fig. 1) is used.
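The audio path can be sketched in the same spirit. The linear "encoder" below is only a stand-in that mimics HuBERT's framing: the 320-sample hop at 16 kHz (one feature frame per 20 ms) and the 768-dimensional feature size are assumptions drawn from HuBERT's convolutional front-end, and the weights are random:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 5-second clip sampled at 16 kHz gives 80 000 raw samples.
sample_rate, clip_seconds = 16_000, 5
waveform = rng.standard_normal(sample_rate * clip_seconds)   # (80000,)

# Toy "encoder" with HuBERT-like framing: one feature vector per 320 samples.
hop, feat_dim = 320, 768
n_frames = len(waveform) // hop                              # sequence length S
frames = waveform[: n_frames * hop].reshape(n_frames, hop)
W_enc = rng.standard_normal((hop, feat_dim))
audio_features = frames @ W_enc                              # (S, feat_dim)

# Projection layer into the common multimodal space (learned in the model).
common_dim = 32
W_proj = rng.standard_normal((feat_dim, common_dim))
audio_embedding = audio_features @ W_proj                    # (S, common_dim)
```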

We define the embedding-extraction stage of the audio and video encoders as follows:

E_a = PL(HuBERT(a)), E_a ∈ R^{S×N}

for audio embeddings, where S is the sequence length and N is the number of features; the video embeddings E_v = PL(flatten(M(ResNet(v)))) are obtained analogously.

4.2 Attention layers

In this section we review the attention layers used in our model: self-attention and cross-modal attention. Their main purpose in our model is to align the multimodal information contained in the embeddings produced by the encoders.

Self-attention was initially introduced in [3] and can be described as a layer that extracts information from a set of context vectors into a query vector. Formally, it can be written as

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V,

where K and V are the context vectors, Q is the query vector and d_k is the key dimension. For our model we used multi-head attention (MHA), which was introduced in [32] and can be defined as

MHA(Q, K, V) = Concat(head_1, ..., head_h) W^O,   head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

where W_i^Q, W_i^K and W_i^V are the learnable parameters for query, key and value respectively. Unlike BERT [6], in our model self-attention is applied not to text data but to the audio and visual embeddings; in the context of our model, the inputs to the self-attention layers are E_a and E_v.

Cross-modal attention has a similar definition, but instead of computing the dot product within a single modality it makes use of the multimodal nature of video: there are two cross-modal attention blocks, one taking K = V = E_a and Q = E_v as input, the other taking K = V = E_v and Q = E_a. Such attention enables one modality to receive information from the other and helps align them.
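A minimal NumPy sketch of these attention operations follows. Here E_a and E_v stand for the projected audio and video embeddings; the sequence lengths and dimensions are toy values, and the learned per-head projections W_i^Q, W_i^K, W_i^V, W^O of full multi-head attention are omitted for brevity, so this illustrates only the attention mechanics:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(Q, K, V, n_heads=4):
    """Split the feature dim into heads, attend per head, concatenate.
    Learned projection matrices are omitted in this sketch."""
    hd = Q.shape[-1] // n_heads
    heads = [attention(Q[:, i*hd:(i+1)*hd],
                       K[:, i*hd:(i+1)*hd],
                       V[:, i*hd:(i+1)*hd]) for i in range(n_heads)]
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
d = 32                                   # common embedding dimension
E_a = rng.standard_normal((250, d))      # audio embeddings (S_a, d)
E_v = rng.standard_normal((16, d))       # video embeddings (S_v, d)

# Self-attention: Q, K and V all come from the same modality.
self_a = attention(E_a, E_a, E_a)        # (250, d)

# Cross-modal attention: one modality queries the other.
a_attends_v = attention(E_a, E_v, E_v)   # Q = E_a, K = V = E_v -> (250, d)
v_attends_a = attention(E_v, E_a, E_a)   # Q = E_v, K = V = E_a -> (16, d)
```

Note that the output always keeps the query's sequence length, which is why the two cross-modal blocks produce differently shaped outputs.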

Following the architectural decisions of [32], we added skip connections, which sum the activations entering an attention block with the activations leaving it. To prevent the exploding-gradient problem, Layer Normalization was applied before the penultimate projection layers.
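The skip connection followed by Layer Normalization can be sketched as below; the sublayer here is a toy linear map rather than a real attention block, and the learnable scale/shift parameters of Layer Normalization are omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean, unit variance (scale/shift omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))        # activations entering the block

# Stand-in for an attention sublayer.
W = rng.standard_normal((32, 32)) * 0.1
sublayer_out = x @ W

# Skip connection: add the block's input to its output, then normalize
# so activations stay well-scaled before the next projection layer.
y = layer_norm(x + sublayer_out)
```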

5 Training

We train our model using the Adam optimizer [16], with learning rate and weight decay . As the encoder parts of our model were already pretrained and only fine-tuned during training, the layers of this part of the model were trained with a lower learning rate, scaled by a factor of .

One of the biggest challenges in training multimodal neural networks is that they are prone to severe overfitting, and the usual regularization techniques are often ineffective for them [34]. To mitigate this problem we use label smoothing with , which makes the network less "confident" about the class it predicts.
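Label smoothing replaces the hard one-hot targets with softened ones. A small sketch of the target construction follows; since the smoothing coefficient is not specified above, eps = 0.1 is an assumed placeholder value:

```python
import numpy as np

def smooth_labels(y, n_classes, eps=0.1):
    """Put (1 - eps) on the true class and spread eps uniformly over
    all classes, so no class ever receives probability exactly 1."""
    one_hot = np.eye(n_classes)[y]
    return (1.0 - eps) * one_hot + eps / n_classes

def cross_entropy(log_probs, targets):
    """Mean cross-entropy against (possibly smoothed) target distributions."""
    return -(targets * log_probs).sum(axis=-1).mean()

# Three emotion classes: 0 = Negative, 1 = Neutral, 2 = Positive.
y = np.array([0, 2, 1])
targets = smooth_labels(y, n_classes=3, eps=0.1)
```

With eps = 0.1 and 3 classes, the true class gets probability 0.9 + 0.1/3 and each other class 0.1/3, which penalizes overconfident predictions.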

Models Modality Accuracy
K-injection network [35] A+V 63.58%
ResNet50 + BiLSTM [24] A+V 61.83%
Hubert+ResNet101d (ours) A+V 60.37%
Inception + LSTM [8] (baseline) A+V 52.09%
Table 2: Results on validation data for two modalities on the VGAF dataset. A, V, and F stand for audio, video, and face respectively.
Figure 2: Confusion matrix for predictions of our model.

6 Results

We use accuracy as the evaluation metric for our model. It achieves an overall accuracy of 60.37% and outperforms the baseline by 8.28 percentage points. Table 2 compares the results of all teams that used audio and video modalities for their final predictions; our model is competitive with the other works. The best model uses ResNet101d and HuBERT [13] as the video and audio encoders. The most challenging class for our model is "Positive", which can be explained by the challenging nature of the dataset and the inherent difficulty of interpreting emotions. Moreover, some videos in the dataset share a similar context but carry different emotions and labels. Misclassifying "Neutral" videos as "Negative" largely stems from the large number of protest videos in the dataset, some of which are peaceful while others are aggressive.
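For reference, the accuracy metric and the confusion matrix of Figure 2 can be computed as follows; the labels below are toy values for illustration only:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy labels: 0 = Negative, 1 = Neutral, 2 = Positive.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 0, 2, 0])

cm = confusion_matrix(y_true, y_pred)
accuracy = np.trace(cm) / cm.sum()   # diagonal = correct predictions
```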

7 Conclusion

Group video emotion recognition is a challenging task, especially for "in the wild" data. In this paper we presented a model for the VGAF dataset from the Audio-Visual Group Emotion Recognition challenge of EmotiW 2020. Our model combines two ideas: it is trained end-to-end and all of its layers are optimized during training. This helps it achieve a noticeable result of 60.37% validation accuracy, which outperforms the baseline significantly and is practically on par with existing bimodal audio-visual models.


  • [1] H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y. Cui, and B. Gong (2021) VATT: transformers for multimodal self-supervised learning from raw video, audio and text. CoRR abs/2104.11178. External Links: Link, 2104.11178 Cited by: §2.
  • [2] A. Bagher Zadeh, P. P. Liang, S. Poria, E. Cambria, and L. Morency (2018-07) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2236–2246. External Links: Link, Document Cited by: §2.
  • [3] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §4.2.
  • [4] C. Busso, M. Bulut, C. Lee, E. (. Kazemzadeh, E. M. Provost, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan (2008) IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42, pp. 335–359. Cited by: §2.
  • [5] W. Dai, S. Cahyawijaya, Z. Liu, and P. Fung (2021-06) Multimodal end-to-end sparse model for emotion recognition. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 5305–5316. External Links: Link, Document Cited by: §2.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. External Links: Link, 1810.04805 Cited by: §4.2.
  • [7] A. Dhall, R. Goecke, S. Ghosh, J. Joshi, J. Hoey, and T. Gedeon (2017) From individual to group-level emotion recognition: emotiw 5.0. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI ’17, New York, NY, USA, pp. 524–528. External Links: ISBN 9781450355438, Link, Document Cited by: §1.
  • [8] A. Dhall, G. Sharma, R. Goecke, and T. Gedeon (2020) EmotiW 2020: driver gaze, group emotion, student engagement and physiological signal based challenges. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI ’20, New York, NY, USA, pp. 784–789. External Links: ISBN 9781450375818, Link, Document Cited by: §1, §2, §3, Table 2.
  • [9] D. Han, S. Yun, B. Heo, and Y. Yoo (2020) ReXNet: diminishing representational bottleneck on convolutional neural network. External Links: 2007.00992 Cited by: Table 1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: Table 1.
  • [11] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li (2018) Bag of tricks for image classification with convolutional neural networks. External Links: 1812.01187 Cited by: §4.1, Table 1.
  • [12] D. Hendrycks and K. Gimpel (2016) Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR abs/1606.08415. External Links: Link, 1606.08415 Cited by: §4.1.
  • [13] W. Hsu, B. Bolte, Y. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed (2021) HuBERT: self-supervised speech representation learning by masked prediction of hidden units. CoRR abs/2106.07447. External Links: Link, 2106.07447 Cited by: §4.1, §6.
  • [14] B. T. Jin, L. Abdelrahman, C. K. Chen, and A. Khanzada (2020) Fusical: multimodal fusion for video sentiment. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI ’20, New York, NY, USA, pp. 798–806. External Links: ISBN 9781450375818, Link, Document Cited by: §1, §2.
  • [15] W. Kim, B. Son, and I. Kim (2021) ViLT: vision-and-language transformer without convolution or region supervision. External Links: 2102.03334 Cited by: §2.
  • [16] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §5.
  • [17] I. Laina, C. Rupprecht, and N. Navab (2019) Towards unsupervised image captioning with shared multimodal embeddings. CoRR abs/1908.09317. External Links: Link, 1908.09317 Cited by: §2.
  • [18] J. Lei, L. Li, L. Zhou, Z. Gan, T. L. Berg, M. Bansal, and J. Liu (2021) Less is more: clipbert for video-and-language learning via sparse sampling. CoRR abs/2102.06183. External Links: Link, 2102.06183 Cited by: §2, §4.1.
  • [19] C. Liu, W. Jiang, M. Wang, and T. Tang (2020) Group level audio-video emotion recognition using hybrid networks. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI ’20, New York, NY, USA, pp. 807–812. External Links: ISBN 9781450375818, Link, Document Cited by: §2.
  • [20] L. Morency, R. Mihalcea, and P. Doshi (2011) Towards multimodal sentiment analysis: harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI ’11, New York, NY, USA, pp. 169–176. External Links: ISBN 9781450306416, Link, Document Cited by: §2.
  • [21] S. Ottl, S. Amiriparian, M. Gerczuk, V. Karas, and B. Schuller (2020) Group-level speech emotion recognition utilising deep spectrum features. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI ’20, New York, NY, USA, pp. 821–826. External Links: ISBN 9781450375818, Link, Document Cited by: §2.
  • [22] S. Ottl, S. Amiriparian, M. Gerczuk, V. Karas, and B. Schuller (2020) Group-level speech emotion recognition utilising deep spectrum features. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI ’20, New York, NY, USA, pp. 821–826. External Links: ISBN 9781450375818, Link Cited by: §1.
  • [23] R. W. Picard (2010) Affective computing: from laughter to ieee. IEEE Transactions on Affective Computing 1 (1), pp. 11–17. External Links: Document Cited by: §2.
  • [24] J. R. Pinto, T. Gonçalves, C. Pinto, L. Sanhudo, J. Fonseca, F. Gonçalves, P. Carvalho, and J. S. Cardoso (2020) Audiovisual classification of group emotion valence using activity recognition networks. In 2020 IEEE 4th International Conference on Image Processing, Applications and Systems (IPAS), Vol. , pp. 114–119. External Links: Document Cited by: §2, Table 2.
  • [25] Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3d residual networks. CoRR abs/1711.10305. External Links: Link, 1711.10305 Cited by: §4.1.
  • [26] A. V. Savchenko (2021) Facial expression and attributes recognition based on multi-task learning of lightweight neural networks. arXiv preprint arXiv:2103.17107. Cited by: §1.
  • [27] S. Srivastava, S. A. S. Lakshminarayan, S. Hinduja, S. R. Jannat, H. Elhamdadi, and S. Canavan (2020) Recognizing emotion in the wild using multimodal data. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI ’20, New York, NY, USA, pp. 849–857. External Links: ISBN 9781450375818, Link, Document Cited by: §2.
  • [28] M. Sun, J. Li, H. Feng, W. Gou, H. Shen, J. Tang, Y. Yang, and J. Ye (2020) Multi-modal fusion using spatio-temporal and static features for group emotion recognition. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI ’20, New York, NY, USA, pp. 835–840. External Links: ISBN 9781450375818, Link, Document Cited by: §2.
  • [29] A. Talmor, O. Yoran, A. Catav, D. Lahav, Y. Wang, A. Asai, G. Ilharco, H. Hajishirzi, and J. Berant (2021) MultiModalQA: complex question answering over text, tables and images. CoRR abs/2104.06039. External Links: Link, 2104.06039 Cited by: §2.
  • [30] H. Tan and M. Bansal (2019) LXMERT: learning cross-modality encoder representations from transformers. CoRR abs/1908.07490. External Links: Link, 1908.07490 Cited by: §2.
  • [31] P. Tzirakis, G. Trigeorgis, M. A. Nicolaou, B. W. Schuller, and S. Zafeiriou (2017) End-to-end multimodal emotion recognition using deep neural networks. CoRR abs/1704.08619. External Links: Link, 1704.08619 Cited by: §2.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762. External Links: Link, 1706.03762 Cited by: §4.2, §4.2.
  • [33] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool (2016) Temporal segment networks: towards good practices for deep action recognition. CoRR abs/1608.00859. External Links: Link, 1608.00859 Cited by: §4.1.
  • [34] W. Wang, D. Tran, and M. Feiszli (2019) What makes training multi-modal networks hard?. CoRR abs/1905.12681. External Links: Link, 1905.12681 Cited by: §5.
  • [35] Y. Wang, J. Wu, P. Heracleous, S. Wada, R. Kimura, and S. Kurihara (2020) Implicit knowledge injectable cross attention audiovisual model for group emotion recognition. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI ’20, New York, NY, USA, pp. 827–834. External Links: ISBN 9781450375818, Link, Document Cited by: §2, Table 2.
  • [36] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L. Morency (2017) Tensor fusion network for multimodal sentiment analysis. CoRR abs/1707.07250. External Links: Link, 1707.07250 Cited by: §2.
  • [37] A. Zadeh, R. Zellers, E. Pincus, and L. Morency (2016) MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. CoRR abs/1606.06259. External Links: Link, 1606.06259 Cited by: §2.