Emotion recognition is a difficult and important task. Understanding the emotions of a group is valuable not only for each individual in the group, but also for people from different backgrounds and cultures. Moreover, knowledge of a group's overall emotion is of interest for business, education, healthcare, surveillance and robotics.
Early affective computing focused on individuals, while in recent years more research has targeted groups of people using raw, uncontrolled data (i.e. "in the wild"). Results obtained on this type of data transfer more readily to real-world settings such as surveillance cameras or internet videos.
The dataset used for training was presented in the EmotiW 2020 Audio-Video Group Emotion Recognition grand challenge, specifically the Video-level Group AFfect (VGAF) track. The task of this competition is to classify group emotion into three categories: negative, neutral and positive. The biggest challenges of this dataset are varying lighting conditions, languages, video quality, frame rates, occlusions and overlapping people. One approach to handling such data is to use only one modality. Another option is a two-stage model, where the stages are feature extraction and modality fusion, respectively.
Although unimodal models show decent results, when the chosen modality carries poor input information no other modality can compensate for it, which hurts performance. As for two-stage models, fixed feature extractors cannot be fine-tuned for a specific task using information from the low-level fusion layers.
To address these issues, we propose a model with the following features:
- The model is trained fully end-to-end, which compensates for the missing information about modality interactions. Moreover, if one modality carries little useful information, the other can mitigate this problem.
- We do not freeze any layers of our model, which allows it to be fully optimized for the given task and helps it achieve solid results by effectively combining pretrained models from different domains.
2 Related work
Automatic human emotion recognition has been a topic of active research for nearly a decade. Early works on multimodal learning proposed several techniques, such as early fusion and late fusion. While the ideas behind these techniques are simple, they show decent results and are still widely adopted for multimodal tasks. Recently, researchers have been working not only on fusion techniques, but also on multimodal architectures where pairs of different modalities are fed into a single network; examples include Tensor Fusion Networks, LXMERT, ClipBERT, VATT and ViLT. Owing to the increase in computational power in recent years, much work has applied multimodal learning in different areas, such as question answering, emotion recognition and affect recognition.
Previous results on the VGAF dataset were obtained with two-stage models: features were first extracted using fixed models, and late fusion was then applied to the extracted features. The best result on this dataset was achieved by the winners of the Audio-Visual Group Emotion Recognition challenge of EmotiW 2020, whose team used 5 different modalities and reached 74.28% on the validation set.
The VGAF dataset contains 2661 video clips for training and 766 for validation. The data was acquired from YouTube using tags such as "fight", "holiday" or "protest", whose videos exhibit different emotions. Each video was cropped to a clip of 5 seconds. The data contains 3 classes, Positive, Neutral and Negative, corresponding to the 3 emotion categories. The dataset's challenges include a different context in every video, varying resolutions and frame rates, and multilingual audio, which is a serious obstacle for the vast majority of available models. Moreover, since the spoken language of each video is not labeled, it is impossible to collect an additional text transcription (modality) using automatic tools.
We describe the problem as follows. Let X = {(v_i, a_i)}, i = 1, ..., I, be the data, where I is the number of multimodal samples, v_i is the sequence of RGB video frames and a_i is the raw audio of sample i.
First, we extract 8 equally spaced frames. The vision encoder accepts the sequence of extracted frames of shape (C, T, H, W), where C, T, H and W are the number of channels, the temporal size, the height and the width, respectively. For the vision encoder we use an approach inspired by ClipBERT: we apply mean-pooling (we denote this operation as M) to aggregate the temporal information of the frame sequence, which is an inexpensive way to make use of temporal information. We use a pretrained 2D ResNet101d for feature extraction. During our experiments we tried several backbone architectures, described in Table 1. We decided not to use 3D CNNs, because they greatly increase training time without an advantage in accuracy.
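The temporal mean-pooling step M can be sketched in a few lines (an illustrative NumPy sketch, not the authors' code; the shapes are assumptions based on 8 sampled frames and a typical ResNet feature map):

```python
import numpy as np

# Per-frame feature maps from the 2D backbone: (T, C, H, W),
# e.g. T=8 sampled frames, each a C-channel spatial feature map.
frame_features = np.random.rand(8, 2048, 7, 7)

# Mean-pool over the temporal axis (the operation denoted M):
# aggregates the 8 frame-level maps into one clip-level map.
clip_features = frame_features.mean(axis=0)  # shape (2048, 7, 7)
```

Because the backbone runs on each frame independently and pooling is a single mean, this is far cheaper than a 3D convolution over the whole clip.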
To pass the extracted feature maps further into the attention layers, we flatten the embedding over its last two dimensions and pass the resulting vector to a projection layer (denoted PL, shown in Fig. 1), which projects the embeddings of the different modalities into a common space. During this research we also considered, in place of the plain projection layer, a larger module with a GeLU activation function and an additional projection, but such a module led to faster overfitting and an approximately 3% drop in accuracy.
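The flatten-and-project step can be sketched as follows (a minimal NumPy sketch; the common dimension d and the weight initialization are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Clip-level feature map from the vision backbone: (C, H, W).
features = rng.standard_normal((2048, 7, 7))

# Flatten the last two (spatial) dimensions into a token axis.
tokens = features.reshape(2048, -1).T  # (49, 2048) spatial tokens

# Projection layer PL: one linear map into the shared
# d-dimensional space used by both modalities.
d = 512
W = rng.standard_normal((2048, d)) * 0.02
b = np.zeros(d)
projected = tokens @ W + b             # (49, 512)
```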
We extract raw audio data at a sampling rate of 16000 Hz and pass it to HuBERT. We chose this model because it was trained in a self-supervised manner, which provides more robust representations and can handle multilingual data. To pass the embeddings further into the self-attention layer, a projection layer (shown in Fig. 1) is used.
We define the embedding extraction of the video and audio encoder stages as follows: e_v = PL(M(Enc_v(v))) for video embeddings and e_a = PL(Enc_a(a)) for audio embeddings, where each embedding has shape (S, N), S is the sequence length and N is the number of features.
4.2 Attention layers
In this section we review the attention layers used in our model: self-attention and cross-attention. Their main purpose is to align the multimodal information in the embeddings produced by the encoders.
Self-attention was initially described as a layer for extracting information from a set of context vectors into a query vector. Formally, it can be written as

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V,

where K and V are the context vectors, Q is the query vector and d_k is the key dimension. For our model we use multi-head attention (MHA), which can be defined as

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

and W_i^Q, W_i^K, W_i^V are the learnable projections of the query, key and value respectively, with W^O the learnable output projection. Unlike BERT, in our model self-attention is applied not to text data, but to the audio and visual embeddings: the input to self-attention is the embedding of a single modality.
Cross-modal attention has a similar definition, but instead of computing the dot product between the same vector for query and key, it exploits the multimodal nature of video: there are two cross-modal attention layers, one taking the video embedding as K = V and the audio embedding as Q, and another taking the audio embedding as K = V and the video embedding as Q. Such attention enables one modality to receive information from the other and helps align them.
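Both layers reduce to the same scaled dot-product operation with different inputs (a single-head NumPy sketch under illustrative shapes; not the authors' implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (S_q, S_k) similarity matrix
    return softmax(scores, axis=-1) @ V  # (S_q, d)

rng = np.random.default_rng(0)
video = rng.standard_normal((49, 512))   # projected video embedding (S_v, N)
audio = rng.standard_normal((100, 512))  # projected audio embedding (S_a, N)

self_att = attention(video, video, video)  # Q = K = V: self-attention
a2v = attention(audio, video, video)       # audio queries attend to video
v2a = attention(video, audio, audio)       # video queries attend to audio
```

Note that each cross-attention output keeps the sequence length of its query modality while mixing in information from the other modality.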
We train our model using the Adam optimizer with weight decay. As the encoder parts of our model are already pretrained and are only fine-tuned during the training procedure, the layers of this part of the model are trained with a lower learning rate, obtained by multiplying the base learning rate by a constant factor.
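The per-group learning-rate scheme can be sketched with a plain gradient step (SGD shown for brevity instead of Adam; the values are illustrative, not the paper's settings):

```python
import numpy as np

# Two parameter groups: pretrained encoder weights get a learning
# rate scaled down by a constant factor; new fusion layers use the
# base rate.
base_lr, encoder_factor = 1e-3, 0.1

groups = [
    {"params": [np.ones(4)], "lr": base_lr * encoder_factor},  # encoders
    {"params": [np.ones(4)], "lr": base_lr},                   # fusion head
]

# One plain gradient step per group; the same grouping idea applies
# to Adam via per-group hyperparameters.
for g in groups:
    for p in g["params"]:
        grad = np.ones_like(p)  # stand-in gradient
        p -= g["lr"] * grad
```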
One of the biggest challenges in training multimodal neural networks is that they are prone to severe overfitting, and the usual regularization techniques are often ineffective for them. To mitigate this problem we use label smoothing, which makes the neural network less "confident" about the class it predicts.
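Label smoothing simply moves a fraction eps of the probability mass from the true class to a uniform distribution over all K classes (a minimal NumPy sketch; eps = 0.1 is illustrative, not the paper's value):

```python
import numpy as np

def smooth_labels(one_hot, eps):
    """Replace a one-hot target with (1 - eps) * one_hot + eps / K."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

# Three classes (Negative, Neutral, Positive).
target = np.array([0.0, 0.0, 1.0])
smoothed = smooth_labels(target, eps=0.1)
# smoothed is approximately [0.0333, 0.0333, 0.9333]
```

Training against the smoothed target penalizes overconfident predictions, since the optimum is no longer a degenerate one-hot output.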
We use accuracy as the evaluation metric. Our model achieves an overall accuracy of 60.37% and outperforms the baseline by a margin of 8.5%. Table 2 compares the results of all teams that used the audio and video modalities for their final predictions; it can be seen that our model is competitive with these works. The best model uses ResNet101d and HuBERT as the video and audio encoders. The most challenging class for our model is "Positive", which can be explained by the challenging nature of the dataset and by how hard emotions are to interpret in themselves. Moreover, some videos in the dataset share a similar context but carry different emotions and labels. Classifying "Neutral" videos as "Negative" is another failure mode, which stems from the large number of protests in the dataset, some of which are peaceful while others are aggressive.
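For completeness, the reported metric is plain top-1 accuracy over the three classes (a trivial sketch with made-up predictions, not the paper's data):

```python
import numpy as np

# Class indices: 0 = Negative, 1 = Neutral, 2 = Positive.
y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])   # made-up model predictions

accuracy = (y_true == y_pred).mean() * 100  # 4 of 5 correct -> 80.0
```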
Group video emotion recognition is a challenging task, especially for "in the wild" data. In this paper we presented a model for the VGAF dataset from the Audio-Visual Group Emotion Recognition challenge of EmotiW 2020. Two novel approaches are used in our model: it is trained end-to-end and all of its layers are optimized during training. This helps us achieve a noticeable result of 60.37% validation accuracy, which outperforms the baseline significantly and is practically on par with existing bimodal audio-visual models.
- VATT: transformers for multimodal self-supervised learning from raw video, audio and text. CoRR abs/2104.11178. Cited by: §2.
- (2018) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2236–2246. Cited by: §2.
- (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §4.2.
- (2008) IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42, pp. 335–359. Cited by: §2.
- (2021) Multimodal end-to-end sparse model for emotion recognition. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 5305–5316. Cited by: §2.
- (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805. Cited by: §4.2.
- (2017) From individual to group-level emotion recognition: EmotiW 5.0. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, ICMI '17, New York, NY, USA, pp. 524–528. Cited by: §1.
- (2020) EmotiW 2020: driver gaze, group emotion, student engagement and physiological signal based challenges. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI '20, New York, NY, USA, pp. 784–789. Cited by: §1, §2, §3, Table 2.
- ReXNet: diminishing representational bottleneck on convolutional neural network. Cited by: Table 1.
- (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. Cited by: Table 1.
- (2018) Bag of tricks for image classification with convolutional neural networks. Cited by: §4.1, Table 1.
- (2016) Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR abs/1606.08415. Cited by: §4.1.
- (2021) HuBERT: self-supervised speech representation learning by masked prediction of hidden units. CoRR abs/2106.07447. Cited by: §4.1, §6.
- (2020) Fusical: multimodal fusion for video sentiment. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI '20, New York, NY, USA, pp. 798–806. Cited by: §1, §2.
- (2021) ViLT: vision-and-language transformer without convolution or region supervision. Cited by: §2.
- (2015) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §5.
- (2019) Towards unsupervised image captioning with shared multimodal embeddings. CoRR abs/1908.09317. Cited by: §2.
- (2021) Less is more: ClipBERT for video-and-language learning via sparse sampling. CoRR abs/2102.06183. Cited by: §2, §4.1.
- (2020) Group level audio-video emotion recognition using hybrid networks. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI '20, New York, NY, USA, pp. 807–812. Cited by: §2.
- (2011) Towards multimodal sentiment analysis: harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interfaces, ICMI '11, New York, NY, USA, pp. 169–176. Cited by: §2.
- (2020) Group-level speech emotion recognition utilising deep spectrum features. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI '20, New York, NY, USA, pp. 821–826. Cited by: §1, §2.
- (2010) Affective computing: from laughter to IEEE. IEEE Transactions on Affective Computing 1 (1), pp. 11–17. Cited by: §2.
- (2020) Audiovisual classification of group emotion valence using activity recognition networks. In 2020 IEEE 4th International Conference on Image Processing, Applications and Systems (IPAS), pp. 114–119. Cited by: §2, Table 2.
- (2017) Learning spatio-temporal representation with pseudo-3d residual networks. CoRR abs/1711.10305. Cited by: §4.1.
- (2021) Facial expression and attributes recognition based on multi-task learning of lightweight neural networks. arXiv preprint arXiv:2103.17107. Cited by: §1.
- (2020) Recognizing emotion in the wild using multimodal data. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI '20, New York, NY, USA, pp. 849–857. Cited by: §2.
- (2020) Multi-modal fusion using spatio-temporal and static features for group emotion recognition. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI '20, New York, NY, USA, pp. 835–840. Cited by: §2.
- (2021) MultiModalQA: complex question answering over text, tables and images. CoRR abs/2104.06039. Cited by: §2.
- (2019) LXMERT: learning cross-modality encoder representations from transformers. CoRR abs/1908.07490. Cited by: §2.
- (2017) End-to-end multimodal emotion recognition using deep neural networks. CoRR abs/1704.08619. Cited by: §2.
- (2017) Attention is all you need. CoRR abs/1706.03762. Cited by: §4.2.
- (2016) Temporal segment networks: towards good practices for deep action recognition. CoRR abs/1608.00859. Cited by: §4.1.
- (2019) What makes training multi-modal networks hard?. CoRR abs/1905.12681. Cited by: §5.
- (2020) Implicit knowledge injectable cross attention audiovisual model for group emotion recognition. In Proceedings of the 2020 International Conference on Multimodal Interaction, ICMI '20, New York, NY, USA, pp. 827–834. Cited by: §2, Table 2.
- (2017) Tensor fusion network for multimodal sentiment analysis. CoRR abs/1707.07250. Cited by: §2.
- (2016) MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. CoRR abs/1606.06259. Cited by: §2.