1 Introduction

Automatic human affect recognition from visual cues is an important area of computer vision that has attracted increasing interest over the last two decades due to its many applications. Indeed, social robotics, psychiatric care, and edutainment are all areas that can benefit from automatic recognition of emotion.
Most past approaches to the problem have focused on facial expressions in order to determine the emotional state of the person of interest [7, 18, 22]. This is reasonable, since facial expressions have been studied extensively in the psychology and emotion literature. For example, the Facial Action Coding System (FACS) identifies the units of facial movement based on facial muscle groups. Combinations of these so-called action units (AUs) have also been linked with emotional states through extensions of the basic FACS, such as EMFACS (Emotion FACS). On the other hand, there is no similarly established coding system for body expressions, although some have been proposed.
Beyond facial expressions, recent works have sought alternative modalities and streams of information for detecting emotion. One is bodily expression: many studies have highlighted that the emotional state is conveyed through the body as well, that for certain emotions it is the main modality [5, 15, 26], and that it can be used to correctly disambiguate the corresponding facial expression. At the same time, in applications where emotion needs to be identified, the human body is more frequently available than the face, since the face can be occluded, hidden, or far in the distance. Another auxiliary stream of information, besides the face and the body, is the context and surrounding environment of the person [16, 21]: the place, as well as nearby objects and other humans, can all influence a person's emotions.
We should also note that emotion recognition is inherently a multi-label problem: the subject might be feeling two or more emotions simultaneously. This is especially true when considering an extended set of emotions. The emotions in such extended sets do not all have the same "semantic" distance from one another; for example, anger is closer to annoyance than to happiness. Considering that previous works have shown the superiority of methods that learn a joint embedding space containing both word embeddings and visual representations [6, 12, 24], we believe that attaching a semantic meaning to the extracted visual features is a natural way forward.
In this paper, based on the above, we describe our team's method for the First International Workshop on Bodily Expressed Emotion Understanding (BEEU) challenge. Our method combines Temporal Segment Networks (TSNs) focused on the body with an additional stream for the context in each video, and also uses an extra visual-semantic embedding loss based on GloVE (Global Vectors) word embedding representations. Our experiments on the validation set verify the improved performance of our method compared to traditional TSNs, while our Emotion Recognition Score on the test set was 0.26235.
2 Related Work
While most past approaches to visual affect detection have focused on facial expressions, recent approaches have started taking into account the body language of the person in question, as well as their surrounding context/environment.
Dael et al. analyzed and classified bodily emotional expressions using the body action and posture (BAP) coding system. The 3D pose of children was also utilized by Marinoiu et al. to detect emotions in continuous dimensions, while elsewhere 2D pose was fused with facial expressions for child emotion recognition. Luo et al. introduced a large-scale video dataset (BoLD) annotated with categorical and continuous emotions, which is the one used in the BEEU challenge.
Regarding the context modality, Kosti et al. introduced a large-scale dataset for emotion recognition in different contexts (e.g., other people, places, or objects), EMOTIC, along with a convolutional neural network (CNN) based two-stream architecture focusing on the body and context of the subjects. The CAER video dataset for context-based emotion recognition was presented together with a two-stream architecture that employed adaptive fusion to merge the two streams. Mittal et al. designed a deep architecture with several branches, each focusing on a different interpretation of the surrounding context (e.g., environment and interaction context), significantly improving predictions on the EMOTIC dataset.
Finally, some recent works have focused on extracting visual representations from images that preserve the semantic relations found in embeddings built from words. The DeViSE embedding model extracted semantically meaningful visual representations by introducing a similarity loss between the feature vector extracted from a CNN and the word embedding from a skip-gram text model. Using a similar method, Wei et al. built joint text and visual embeddings as emotion representations from web images, and Ye and Li built semantic embeddings for a multi-label classification problem.
3 Dataset

The dataset used in the challenge is the BoLD (Body Language Dataset) corpus, consisting of 9,876 video clips of humans expressing emotion, primarily through body movements. Each clip can contain multiple characters, yielding a total of 13,239 annotations, split into training, validation, and test sets. The dataset has been annotated via crowdsourcing, employing two widely accepted categorizations of emotion. The first is a categorical annotation with a total of 26 labels, originally compiled by collecting and processing an extensive affective vocabulary. The second annotation concerns the continuous emotional dimensions of the VAD (Valence - Arousal - Dominance) Emotional State Model. The methods in the challenge are evaluated using the following Emotion Recognition Score (ERS):
ERS = (1/2) mR² + (1/4) (mAP + mRA),   (1)

where mR² is the mean coefficient of determination (R²) score over the three continuous emotion dimensions (VAD), and mAP and mRA are the mean average precision and the mean area under the receiver operating characteristic curve (ROC AUC) of the multilabel categorical predictions, respectively.
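As a sanity check, the score can be computed as in the following sketch (the function name is ours; the equal split between the regression term and the averaged classification terms matches the definition above):

```python
def emotion_recognition_score(m_r2, m_ap, m_ra):
    """Combine the mean R^2 of the VAD regression with the mean AP and
    mean ROC AUC of the multilabel classification, weighted equally."""
    regression_term = m_r2
    classification_term = (m_ap + m_ra) / 2.0
    return 0.5 * regression_term + 0.5 * classification_term

# Example with the validation metrics reported later for RGB-bc + Flow-b
ers = emotion_recognition_score(m_r2=0.0917, m_ap=0.1656, m_ra=0.6266)
```

With these inputs the function reproduces the reported validation ERS of 0.2439.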
4 Model Architecture
Our model is based on the TSN architecture, which has been widely used in action recognition and can be seen in Fig. 1. During training, different segments are selected from the input video, and consecutive frames are then sampled from each segment. This sparse sampling deals with the fact that consecutive frames usually contain redundant information. Traditionally, two modalities are used: the spatial (RGB) modality and optical flow. TSNs have already been shown to achieve good results on the BoLD dataset in its introductory paper.
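The segment-based sampling can be sketched as follows; this is a simplified illustration of the scheme, not the authors' exact implementation:

```python
import random

def sample_tsn_frames(num_frames, num_segments=3, frames_per_segment=1):
    """Divide the video into equal-length segments and draw one short
    snippet of consecutive frames from a random position in each segment."""
    seg_len = num_frames // num_segments
    indices = []
    for seg in range(num_segments):
        start = seg * seg_len
        # random offset within the segment, leaving room for the snippet
        offset = random.randint(0, max(0, seg_len - frames_per_segment))
        indices.extend(start + offset + f for f in range(frames_per_segment))
    return indices

# e.g. a 90-frame clip, 3 segments, 5 consecutive flow frames per segment
flow_indices = sample_tsn_frames(90, num_segments=3, frames_per_segment=5)
```

Because each snippet is drawn from a different segment, distant, non-redundant parts of the video are covered at every training step.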
In our approach, we modify the original version of TSNs mainly in two directions:
We introduce an additional stream based on the context/environment surrounding the annotated human. For the RGB modality, we input the context to the network by masking out the instance body (setting all of its pixels to 0). We call this stream RGB-c, and the body streams RGB-b and Flow-b. During training, the RGB-b and RGB-c streams are combined at the feature level (RGB-bc) and trained jointly, while the Flow-b TSN is trained independently.
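The context-stream input can be illustrated as below; a sketch under the assumption that the person's region is given as an axis-aligned bounding box (the actual annotations may use a different body representation):

```python
import numpy as np

def make_context_input(frame, body_box):
    """Build the RGB-c input: copy the frame and set all pixels of the
    annotated person's region to 0, so only the context remains visible."""
    x1, y1, x2, y2 = body_box  # assumed (left, top, right, bottom) pixels
    context = frame.copy()
    context[y1:y2, x1:x2, :] = 0
    return context

frame = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
context = make_context_input(frame, body_box=(50, 40, 150, 200))
```

The body streams receive the original (cropped) person, while this masked frame feeds the context branch.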
Our second extension is the introduction of an embedding loss on the feature vector extracted by the convolutional neural network (ConvNet). This exploits the fact that some emotions are semantically closer to others, which is also revealed by the correlation matrix of the dataset labels, where some labels occur more frequently in combination with others (e.g., Happiness and Pleasure, or Annoyance and Anger). Motivated by this, we try to attach a semantic meaning to the feature vector extracted by the backbone image network.
To represent each emotion label semantically, we use its 300-dimensional GloVE word embedding, where it is apparent that the distances between embeddings are indicative of their "semantic" distance. We then use a fully connected layer to map the feature extracted from the image to this 300-dimensional space and introduce the following mean-squared-based loss:
L_emb = ‖ W f(x) − (1/|P(x)|) Σ_{l∈P(x)} w_l ‖²,   (2)

where f(x) is the feature vector extracted by applying the ConvNet to the image x, W is a linear transformation from the space of the feature vector to the word embedding space, w_l is the word embedding of label l, and P(x) is the set of all positive labels for image x. That is, we try to reduce the Euclidean distance between the projected image feature and the arithmetic mean of the GloVE embeddings of the positive labels of the image/video.
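A sketch of this loss in NumPy; the shapes are assumptions for illustration (a 2048-dimensional backbone feature and 300-dimensional GloVE vectors):

```python
import numpy as np

def embedding_loss(feature, W, positive_word_vecs):
    """Squared Euclidean distance between the projected visual feature
    W f(x) and the mean GloVE embedding of the positive labels P(x)."""
    target = positive_word_vecs.mean(axis=0)   # (1/|P(x)|) * sum of w_l
    projected = W @ feature                    # fully connected layer: 2048 -> 300
    return float(np.sum((projected - target) ** 2))

rng = np.random.default_rng(0)
f = rng.standard_normal(2048)                # ConvNet feature of one frame
W = rng.standard_normal((300, 2048)) * 0.01  # learned linear projection
w_pos = rng.standard_normal((3, 300))        # GloVE vectors of the positive labels
loss = embedding_loss(f, W, w_pos)
```

In training, W would be a learnable layer optimized jointly with the backbone; here it is random for demonstration.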
Finally, after extracting the feature vector of each sampled image, we use two fully connected layers: one to classify into the 26 categorical labels and one to regress the 3 continuous emotion dimensions. The two TSNs are trained using the following loss:

L = L_bce + L_mse + L_reg + L_emb.   (3)
Specifically, since the dataset does not explicitly provide multilabel targets but rather crowdsourced scores between 0 and 1, we include two different losses for the classification part: L_bce, the binary cross-entropy between the predicted scores and the multilabel targets (obtained by thresholding the multilabel scores at 0.5), and L_mse, the mean squared error between the predicted scores and the multilabel scores. We found empirically that the inclusion of L_mse slightly boosts performance. For the regression part, L_reg is the mean squared error between the regressed values and the continuous emotion annotations. Finally, L_emb is the embedding loss as in (2).
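The overall objective can be sketched as below. The equal weighting of the four terms is our assumption for illustration; the actual term weights are not restated here:

```python
import numpy as np

def total_loss(cat_logits, cat_scores, reg_pred, vad_target, l_emb, eps=1e-7):
    """Sum of the four loss terms described above (equal weights assumed)."""
    p = 1.0 / (1.0 + np.exp(-cat_logits))          # predicted categorical scores
    hard = (cat_scores >= 0.5).astype(float)       # threshold crowdsourced scores
    l_bce = -np.mean(hard * np.log(p + eps) + (1.0 - hard) * np.log(1.0 - p + eps))
    l_mse = np.mean((p - cat_scores) ** 2)         # regress the raw label scores
    l_reg = np.mean((reg_pred - vad_target) ** 2)  # continuous VAD regression
    return l_bce + l_mse + l_reg + l_emb

# Toy example: 2 categorical labels, 3 continuous (VAD) dimensions
example = total_loss(np.array([8.0, -8.0]), np.array([0.9, 0.1]),
                     np.array([0.2, 0.5, 0.4]), np.array([0.3, 0.5, 0.4]),
                     l_emb=0.0)
```

Using both the thresholded targets (L_bce) and the raw scores (L_mse) lets the network exploit the graded agreement of the crowdsourced annotations.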
5 Experimental Results
The backbone networks used are a residual network (ResNet) with 101 layers for the body ConvNets and a ResNet with 50 layers for the context ConvNet. We use the default TSN hyperparameters: 3 segments, 1 frame per segment for the RGB streams, and 5 frames per segment for the optical flow stream. The consensus function used for segment fusion is averaging. For each network, we select the epoch with the best validation ERS. We also found experimentally that the partialBN (partial Batch Normalization) technique of the original TSNs gives a nontrivial boost to the performance of the network.
First, in Table 1 we present two ablation experiments regarding the addition of the embedding loss. We can see that adding the embedding loss slightly increases the performance of the RGB-b stream and gives a larger boost to the performance of the Flow-b stream.
Table 1: Ablation of the visual-semantic embedding loss on the BoLD validation set.

| Model | mAP | mRA | mR² | ERS |
|---|---|---|---|---|
| RGB-b + Flow-b (w/o embedding loss) | 0.1623 | 0.6307 | 0.078 | 0.2375 |
| RGB-b + Flow-b (w/ embedding loss) | 0.1637 | 0.6327 | 0.0874 | 0.2428 |
Then, in Table 2 we present our experimental results on the BoLD validation set when including the RGB context stream. The results show that including the context along with the body in the RGB modality boosts the validation ERS of the architecture. We also experimented with including the context in the Flow network, but this resulted in worse performance. Our final submission for the test set was the model with the best validation score (0.2439, RGB-bc + Flow-b), using 25 segments instead of 3. The results of the different metrics on the test set can also be seen in Table 2; the final ERS is 0.26235, improving upon the previous best result of 0.2530.
Table 2: Results of the RGB-bc + Flow-b model on the BoLD validation and test sets.

| Set | Model | mAP | mRA | mR² | ERS |
|---|---|---|---|---|---|
| val | RGB-bc + Flow-b | 0.1656 | 0.6266 | 0.0917 | 0.2439 |
| test | RGB-bc + Flow-b | 0.1796 | 0.6416 | 0.1141 | 0.26235 |
6 Conclusion

In this paper we presented our method submitted to the BEEU challenge, which won first place. Our method extended the TSN framework with a visual-semantic embedding loss utilizing GloVE word embeddings, and with an additional context stream for the RGB modality. We verified the superiority of our extensions over the baseline on the validation set of the challenge, and our best system achieved an Emotion Recognition Score of 0.26235 on the BoLD test set, surpassing the previous best result of 0.2530.
Acknowledgements

This research was carried out/funded in the context of the project "Intelligent Child-Robot Interaction System for designing and implementing edutainment scenarios with emphasis on visual information" (MIS 5049533) under the call for proposals "Researchers' support with an emphasis on young researchers - 2nd Cycle". The project is co-financed by Greece and the European Union (European Social Fund - ESF) through the Operational Programme Human Resources Development, Education and Lifelong Learning 2014-2020.
References

-  (2012) Body cues, not facial expressions, discriminate between intense positive and negative emotions. Science 338 (6111), pp. 1225–1229. Cited by: §1.
-  (2018) Emotion modelling for social robotics applications: a review. Journal of Bionic Engineering 15 (2), pp. 185–203. Cited by: §1.
-  (2012) Emotion expression in body action and posture. Emotion 12 (5), pp. 1085. Cited by: §2.
-  (2012) The body action and posture coding system (BAP): development and reliability. J. Nonverbal Behavior 36 (2), pp. 97–121. Cited by: §1, §2.
-  (2009) Why bodies? twelve reasons for including bodily expressions in affective neuroscience. Philosophical Transactions of the Royal Society of London B: Biological Sciences 364 (1535), pp. 3475–3484. Cited by: §1, §2.
-  (2016) Word2visualvec: image and video to sentence matching by visual feature prediction. arXiv preprint arXiv:1604.06838. Cited by: §1.
-  (2014) Compound facial expressions of emotion. Proceedings of the National Academy of Sciences 111 (15), pp. E1454–E1462. Cited by: §1.
-  (1997) Universal facial expressions of emotion. In Segerstrale U., Molnar P. (eds.), Nonverbal Communication: Where Nature Meets Culture, pp. 27–46. Cited by: §1.
-  (1997) What the face reveals: basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, USA. Cited by: §1.
-  (2019) Fusing body posture with facial expressions for joint recognition of affect in child–robot interaction. IEEE Robotics and Automation Letters 4 (4), pp. 4011–4018. Cited by: §1, §2.
-  (1983) EMFACS-7: emotional facial action coding system. Unpublished manuscript, University of California at San Francisco 2 (36), pp. 1. Cited by: §1.
-  (2013) Devise: a deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129. Cited by: §1, §2.
-  (2016) Improving facial emotion recognition in schizophrenia: a controlled study comparing specific and attentional focused cognitive remediation. Frontiers in psychiatry 7, pp. 105. Cited by: §1.
-  (2006) A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior. In Proc. ICPR, Vol. 1, pp. 1148–1153. Cited by: §2.
-  (2013) Affective body expression perception and recognition: a survey. IEEE Trans. on Affective Computing 4 (1), pp. 15–33. Cited by: §1, §2.
-  (2017) Emotion recognition in context. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1960–1968. Cited by: §1, §2, §3.
-  (2019) Context-aware emotion recognition networks. In Proc. IEEE International Conference on Computer Vision, pp. 10143–10152. Cited by: §2.
-  (2010) The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression. In Proc. IEEE computer society conference on computer vision and pattern recognition-workshops, pp. 94–101. Cited by: §1.
-  (2020) ARBEE: Towards automated recognition of bodily expression of emotion in the wild. International Journal of Computer Vision 128 (1), pp. 1–25 (en). Cited by: §1, §2, §3, §4, §4, §5.
-  (2018) 3D human sensing, action and emotion recognition in robot assisted therapy of children with autism. In Proc. CVPR, pp. 2158–2167. Cited by: §2.
-  (2020) EmotiCon: context-aware multimodal emotion recognition using Frege's principle. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14234–14243. Cited by: §1, §2, §4.
-  (2017) AffectNet: a database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10 (1), pp. 18–31. Cited by: §1.
-  (2014) GloVE: global vectors for word representation. In Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §1, §4.
-  (2017) Multiple instance visual-semantic embedding.. In Proc. BMVC, Cited by: §1.
-  (1977) Evidence for a three-factor theory of emotions. Journal of Research in Personality 11 (3), pp. 273–294 (en). Cited by: §3.
-  (2004) Show your pride: evidence for a discrete emotion expression. Psychological Science 15 (3), pp. 194–197. Cited by: §1.
-  (2016) Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision, pp. 20–36. Cited by: §1, §4, §5.
-  (2020) Learning visual emotion representations from web data. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13106–13115. Cited by: §2.
-  (2020) Multilabel deep visual-semantic embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (6), pp. 1530–1536 (en). Cited by: §2.