Human affect recognition from visual cues is an important area of computer vision because of the large number of applications in which it can be employed [14, 39]. However, most past studies have focused on automatic visual recognition of emotion from facial expressions [28, 32], with few works including emotional body expressions in the emotion recognition loop.
Research into the expression of emotion suggests that emotion is equally conveyed through the body and the face in most cases [46, 20], while both the static body posture and the dynamics [3, 14] contribute to its perception. Furthermore, there are emotions such as pride [44] which are more discernible through body rather than face observation. Another consistent finding across multiple studies is that considering both the body and the face concurrently increases emotion recognition rates. Aviezer et al. [4] also point out that the body can be a deciding factor in determining intense positive or negative emotions.
One of the primary areas that can benefit from visual affect recognition is social robotics, whose applications are rapidly expanding and include Robot Assisted Therapy for adults and children, Activities of Daily Living, and Education. A critical capability of social robots is empathy: the capacity to correctly interpret the social cues of humans that are manifestations of their affective state. Empathic agents are able to change their behavior and actions according to the perceived affective states and, as a result, establish rapport, trust, and healthy long-term interactions. Especially in the field of education, empathic robot behaviors that are congruent with the child’s feelings increase trust and have a positive impact on the child-robot relationship, whilst incongruent behavior has a significantly negative effect.
An important factor in many social robot applications, and especially in child-robot interaction (CRI), is that the flow of interaction is unpredictable and constantly fluctuating. Although interaction with adults can usually be restricted and controlled, the spontaneous nature of children makes such restrictions impractical and poses a true challenge. A direct implication is that robots can no longer rely only on facial expressions to recognize emotion, but must also take advantage of body expressions, which can stay visible and detectable even when the face is unobservable.
Motivated by the above, in this paper we study automatic affect recognition through body and facial cues, under the umbrella of child-robot interaction. By leveraging the latest advancements in visual recognition of human pose in RGB images, we design an end-to-end system of automatic affect recognition that hierarchically fuses body and facial features using multiple labels. We show that even without appearance information, a small number of joints can be used to satisfactorily assess human emotion and increase the performance of automatic affect recognition systems in CRI scenarios. In summary, our contributions are as follows:
- We propose a method based on Deep Neural Networks (DNN) that fuses body posture information, represented as a skeleton, with facial expressions for automatic recognition of emotion. The branches of the network can be trained both separately and jointly, and yield significant performance gains over the facial expression baselines.
- We present and analyze a database containing both acted and spontaneous affective expressions of children participating in a CRI scenario, and discuss the challenges of building an automatic emotion recognition system for children. The database contains emotional expressions in both face and posture, and allows us to observe as well as automatically recognize patterns of bodily emotional expression across children of various ages.
- We use hierarchical multi-label annotations that describe not only the emotion of the person in a frame as a whole, but also the separate body and facial expressions. These annotations allow us to train our hierarchical multi-label method either jointly or separately, which yields computational models for the different modalities of expression as well as for their fusion.
2 Related work
The overwhelming majority of previous works in emotion recognition from visual cues have focused on using only faces as cues. Recent surveys [38, 29, 31] highlight the need to take bodily expression into account as an additional stimulus for automatic emotion recognition systems, as well as the lack of large-scale annotated data.
Gunes and Piccardi [24, 25] focused on combining handcrafted facial and body features for recognizing 12 different affective states in a subset of the FABO dataset [23], which contains upper body affective recordings of 23 subjects. Barros et al. [8] used Sobel filters combined with convolutional layers in the same database while Sun et al. [43]
Bänziger et al. [7] introduced the GEMEP (GEneva Multimodal Emotion Portrayal) corpus, the CoreSet of which includes 10 actors performing 12 emotional expressions. Dael et al. [18] introduced a body action and posture coding system (BAP), similar to the Facial Action Coding System (FACS) used for coding human facial expressions, and subsequently used it in [19] for classifying and analyzing body emotional expressions found in the GEMEP corpus.
Castellano et al. [16] recorded a database of 10 participants performing 8 emotions, using the same framework as the GEMEP dataset. They then fused audio, facial, and body movement features using different Bayesian classifiers to automatically recognize the depicted emotions. A two-branch face-body late-fusion scheme has also been presented, combining handcrafted features from 3D body joints with action unit detection using facial landmarks.
Regarding the application of affect recognition in CRI, the necessity of empathy as a primary capability of social robots for establishing positive long-term human-robot interaction has been the subject of several studies [11, 34]. Castellano et al. [17] presented a system that learned to perceive affective expressions of children playing chess with an iCat robot by processing their facial expressions. This knowledge was then used to modify the behavior of the robot, which resulted in a more engaging and friendly interaction. Under the same application, Sanghvi et al. [41] used body motion and affective posture to detect the engagement of children in the task. Marinoiu et al. [36] used 3D human pose to identify actions in child-robot-therapist interaction for children with autism. In the same study, the 3D pose was also successfully used to estimate the affective state of the child in the continuous arousal and valence dimensions.
3 Whole Body Emotion Recognition
In this section we first present an analysis of bodily expression of emotion. We then present our method for automatic recognition of affect.
3.1 Bodily expression of emotion
While the face is the primary medium through which humans express their emotions (i.e., an affect display), in real-life scenarios we often find ourselves decoding the emotions of our interlocutor or of people in our surroundings by observing their body language, especially when the face of the subject in question is occluded, hidden, or far away.
Figure 2 shows some primary examples where body language is useful for correctly decoding emotions. In Figure 1(a) we can deduce that the child expresses a negative emotion. It is also important to note that the direction of the hands shows us the source of the negative feelings, which is something that facial expressions do not reveal. Another example is presented in Figure 1(b) (Figure from ), where it can be seen that without the whole body we cannot identify whether the emotion of the person is positive or negative, due to its intensity. In Figure 1(c) we can identify sadness from the head pose, while in 1(d) the body acts as a supportive modality; we can also deduce anger just from the facial expression.
A problem that arises when dealing with spontaneous (i.e., not acted) or in-the-wild data is that different individuals express themselves through different modalities, depending on which cue they prefer using (body, face, voice). This is problematic for supervised learning algorithms, e.g., in samples where an emotion label corresponds to the facial expression only and not to the body, which means that the subject in question preferred to use only his/her face while the body remained neutral. In such data, one way to alleviate this issue is to include hierarchical labels, which first denote the ground truth labels of the different modalities. Examples of hierarchical multi-labels are shown in Figure 1. Each sample carries three labels: the emotion the human is feeling in the image (which we call the “whole” body label), the emotion that is conveyed through the face (the whole body label if the subject uses the face, and neutral otherwise), and the emotion that is conveyed through the body (the whole body label if the subject uses the body, and neutral otherwise).
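The labelling scheme described above can be sketched in a few lines of Python. This is only an illustration, not the annotation tooling used for the database: the function name `hierarchical_labels`, the dictionary keys, and the use of the string "neutral" for a modality that does not carry the emotion are assumptions made for the example.

```python
NEUTRAL = "neutral"  # assumed label for a modality that does not express the emotion


def hierarchical_labels(whole_label, uses_face, uses_body):
    """Derive per-modality labels from the whole body label.

    whole_label: emotion assigned to the sample as a whole.
    uses_face / uses_body: whether the annotator marked that modality
    as expressing the emotion.
    """
    return {
        "whole": whole_label,
        "face": whole_label if uses_face else NEUTRAL,
        "body": whole_label if uses_body else NEUTRAL,
    }


# A child who expresses fear with the body only, keeping a neutral face.
example = hierarchical_labels("fear", uses_face=False, uses_body=True)
print(example)
```

A sample labelled this way can supervise the body branch with "fear" while telling the face branch that nothing emotional is visible in the face.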
Based on the aforementioned analysis, Figure 3 presents our DNN architecture for automatic multi-cue affect recognition using hierarchical multi-label training (HMT). We assume that we have both the whole body label and the hierarchical labels for the face and for the body.
The model consists of two initial pathways through which information flows from the input video to the output (which is the recognized emotional state).
Facial Expression Recognition Branch
The facial expression recognition branch of the network is responsible for recognizing emotions by decoding facial expressions. For each frame of a video sequence, we first apply a head detection and alignment algorithm to obtain the cropped face image, which is then fed into a Residual Network [26] CNN architecture to obtain a 2048-long feature vector describing the frame. We then apply temporal max pooling over all frames of the sequence to obtain the representation of the whole sequence of facial frames.
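As a rough illustration of the pooling step, temporal max pooling keeps, for every feature dimension, the largest value observed over the frames of the sequence. The sketch below uses plain Python lists and a hypothetical helper name (`temporal_max_pool`); the toy vectors have 4 dimensions instead of 2048.

```python
def temporal_max_pool(frame_features):
    """Element-wise maximum over the time axis.

    frame_features: list of T per-frame feature vectors, each of length D.
    Returns one vector of length D describing the whole sequence.
    """
    if not frame_features:
        raise ValueError("sequence must contain at least one frame")
    return [max(values) for values in zip(*frame_features)]


# Toy sequence: 3 frames with 4-dimensional features.
frames = [
    [0.1, 0.9, 0.0, 0.3],
    [0.4, 0.2, 0.5, 0.3],
    [0.2, 0.1, 0.7, 0.8],
]
pooled = temporal_max_pool(frames)
print(pooled)  # [0.4, 0.9, 0.7, 0.8]
```

Because only the maximum per dimension survives, the pooled vector does not depend on the order or number of frames, which is convenient for sequences of varying length.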
We apply a fully connected (FC) layer to this representation in order to obtain the scores for the facial emotion.
We can then calculate the loss of this branch as the cross entropy between the face labels and the probabilities of the face scores obtained via a softmax function, where the sum in the cross entropy runs over the emotion classes.
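The softmax and cross entropy computation for a single sample can be sketched as follows. The function names and the toy scores are assumptions for illustration; the label is given as a class index, which is equivalent to a one-hot label vector.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, label_index):
    """Cross entropy between a one-hot label and softmax(scores)."""
    probs = softmax(scores)
    return -math.log(probs[label_index])


face_scores = [2.0, 0.5, -1.0]  # hypothetical scores for 3 emotion classes
loss = cross_entropy(face_scores, label_index=0)
print(round(loss, 4))
```

The loss is small when the score of the labelled class dominates and grows as probability mass moves to other classes.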
Body Expression Recognition Branch
In the second branch, for each frame of the input video, we apply a 2D pose detection method in order to obtain the skeleton, i.e., the 2D coordinates of each detected joint. The 2D pose is then flattened and fed as a vector into a DNN in order to obtain a per-frame representation. We then apply global temporal average pooling (GTAP) over the entire input sequence.
The scores for the body emotion are obtained by passing the pose representation of the video through an FC layer. The loss in this branch is the cross entropy loss (Eq. 3) between the body labels and the probabilities obtained from the body scores.
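The flattening and pooling steps of the body branch can be sketched as below. The helper names are hypothetical, and the per-frame DNN that sits between flattening and pooling in the model is deliberately left out so that the pooling itself stays visible.

```python
def flatten_pose(joints):
    """Turn a list of (x, y) joints into one flat vector [x1, y1, x2, y2, ...]."""
    return [coord for joint in joints for coord in joint]


def global_temporal_average_pool(frame_features):
    """Element-wise mean over the time axis of T equal-length vectors."""
    if not frame_features:
        raise ValueError("sequence must contain at least one frame")
    n_frames = len(frame_features)
    return [sum(values) / n_frames for values in zip(*frame_features)]


# Toy sequence: 2 frames with 2 joints each. In the model, a DNN would map
# each flattened pose to a learned representation before pooling.
sequence = [
    [(0.0, 1.0), (2.0, 3.0)],
    [(2.0, 3.0), (4.0, 5.0)],
]
flat = [flatten_pose(pose) for pose in sequence]
pooled = global_temporal_average_pool(flat)
print(pooled)  # [1.0, 2.0, 3.0, 4.0]
```

Unlike max pooling, averaging lets every frame contribute equally to the sequence representation.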
Whole Body Expression Recognition Branch
In order to obtain whole body emotion recognition scores, we concatenate the pooled face and body representations and feed them through another FC layer. We then use the whole body emotion labels to compute the whole body cross entropy loss between the whole body labels and the probabilities obtained from the whole body scores.
Finally, we employ a fusion scheme as follows: we concatenate the face, body, and whole body scores and use a final FC layer in order to obtain the fused scores. This gives a final loss, which is the cross entropy between the whole body labels and the probabilities obtained from the fused scores.
During training, the loss that is backpropagated through the network combines the losses of the face, body, whole body, and fusion branches.
The final prediction of the network for the affect of the human in the video is obtained from the fusion score vector.
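A minimal sketch of how the four losses and the score fusion fit together is given below. Two points are assumptions rather than facts taken from the text: the four losses are combined as an unweighted sum, and the final FC layer is replaced by a hand-written stand-in vector. All function names are hypothetical.

```python
import math


def cross_entropy(scores, label_index):
    """Softmax cross entropy for one sample, computed via log-sum-exp."""
    shift = max(scores)
    log_sum = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_sum - scores[label_index]


def fusion_input(face, body, whole):
    """Concatenate the three score vectors; a final FC layer would follow."""
    return list(face) + list(body) + list(whole)


def total_loss(face, body, whole, fused, y_face, y_body, y_whole):
    """Assumed combination: unweighted sum of the four branch losses."""
    return (
        cross_entropy(face, y_face)
        + cross_entropy(body, y_body)
        + cross_entropy(whole, y_whole)
        + cross_entropy(fused, y_whole)  # fusion is supervised by the whole body label
    )


def predict(fused):
    """Final prediction: index of the largest fused score."""
    return max(range(len(fused)), key=lambda i: fused[i])


# Toy example with 3 classes per branch (hypothetical scores).
face, body, whole = [2.0, 0.1, 0.3], [0.2, 0.1, 0.0], [1.5, 0.2, 0.1]
fused_in = fusion_input(face, body, whole)  # length 9, input to the final FC
fused = [1.8, 0.1, 0.2]                     # stand-in for the final FC output
loss = total_loss(face, body, whole, fused, y_face=0, y_body=0, y_whole=0)
prediction = predict(fused)
```

Dropping one of the four terms from `total_loss` corresponds to training a reduced variant of the network with fewer supervised outputs.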
4 The BabyRobot Emotion Database
In order to evaluate our method, we have collected a database which includes multimodal recordings of children interacting with two different robots (Zeno [2], Furhat [1]), in a laboratory setting that has been decorated to resemble a children’s room (Figure 4).
We call the database the BabyRobot Emotion Database (BRED). BRED includes two different kinds of recordings: Acted Recordings, during which children were asked to express one of six emotions, and Spontaneous Recordings, during which children were playing a game called “Express the feeling” with the Zeno and Furhat robots.
The game was touchscreen-based and, throughout its duration, children selected face-down cards, each of which represented a different emotion. After seeing the cards, the children were asked to try to express the emotion, and then one of the robots followed up with a facial gesture that expressed the emotion as well. A total of 30 children aged between 6 and 12 took part in the recordings.
It is important to note that we did not give any guidelines or any information to the children on how to express their emotions.
The emotions included in the database are Anger, Happiness, Fear, Sadness, Disgust, and Surprise, the basic emotions included in Ekman and Friesen’s initial studies. These are also the emotions most often found across different databases of emotional depictions.
Hierarchical Database Annotations
In total, the initial recordings included samples of emotional expressions from the Acted session and samples from the Spontaneous session ( children emotions for both sessions). The annotation procedure included three different phases. The first phase included annotation by different annotators, who filtered out recordings where the children did not perform any emotion (due to shyness, lack of attention, or other reasons), and identified the temporal segments during which the expression of emotions takes place (starting with the onset of the emotion and ending just before the offset). In the second phase, 2 expert annotators validated the annotations of the previous phase. In the third and final phase, one of these two annotators annotated the videos hierarchically, by indicating for each video whether the child was using the face, the body, or both to express the emotion.
From Table 2 we can see that fear and anger are emotional expressions where children utilized their body more than their face. Especially in fear, almost all children used their body to express the emotion. On the other hand, to indicate happiness, surprise and disgust, all children used facial expressions.
Table 3 also contains some of the annotators’ observations regarding the bodily expression of emotion in BRED, as well as examples from the database. All images include facial landmarks (although we do not use them in any way in our method) in order to protect privacy.
The database is very challenging as it features many intra-class variations, multiple poses, and in many cases similar body expressions for different classes.
| # valid sequences | 211 |
| avg. sequence length | 72 frames (30 fps) |
| # different subjects | 30 |

| Total (#) | Facial (#) | Body (#) |

| mainly facial, rare jumping and/or open raised hands, body erect, upright head | crying (with hands in front of face), motionless, head looking down, contracted chest | expanded chest, hand movement without specific patterns, either positive or negative surprise |
| quick eye gaze, weak facial expressions, arms crossed in front of body, head sink | mainly with facial expression (tongue out), movement away from/hands against robot | clenched fists, arms crossed, squared shoulders |
5 Experiments
In this section we present our experimental procedure and results. We first perform an exploratory analysis of the different branches and pathways of the HMT architecture of Section 3 on the GEMEP (GEneva Multimodal Emotion Portrayals) database [7]. As far as we are aware, this is the only publicly available database that includes annotated whole body expressions of emotions and for which videos are available. We believe that databases of upper body depictions such as FABO [23], where the subjects are sitting, restrict body posture expression and force the subjects to focus mostly on using their hands.
Our main experimental evaluation is then done on the BabyRobot Emotion Database where we experiment with many different methods and variations of the HMT network.
5.1 Network training
Network Setup and Initialization
In order to avoid overfitting due to the small number of sequences in both GEMEP and BRED, especially in the facial branch, which includes a large number of parameters, we pretrain that branch on the AffectNet database [37]. AffectNet is a large-scale database which contains more than 1 million images collected from the internet, with face images annotated with one of the following labels: Neutral, Happiness, Anger, Sadness, Disgust, Contempt, Fear, Surprise, None, Uncertain and Non-face. Only part of the database is manually annotated, and a portion of the manually annotated images falls into one of the emotion categories (neutral plus 7 emotions). The database also includes a validation set of 500 images for each class, while the test set is not yet released. We initialize the network with weights learned from ImageNet training, and then train for 20 epochs using a batch size of 128 and the Adam optimizer. The branch achieves its best accuracy on the AffectNet validation set at the 13th epoch. The posture branch of the network is initialized as in .
For detecting, cropping, and aligning the face in each frame, we use the OpenFace 2 [6] toolkit. We then use the facial branch to extract a 2048-length feature vector for each frame, which is used during training. This means that during training the parameters of the feature extraction layers of the facial branch remain fixed. Similarly, we extract the 2D pose of the subjects in the database using OpenPose [15] along with the 2D hand keypoints [42]. In order to filter out badly detected keypoints, we set to 0 all keypoints whose confidence score falls below a database-specific threshold. We use a lower threshold for BRED than for GEMEP because its images and pose detections are more challenging. The input vector for the Body Expression Recognition Branch consists of 25 2D keypoints of the skeleton and 21 2D keypoints for each hand.
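The keypoint filtering and the resulting input size can be illustrated with a short sketch. The threshold value used in the example (0.3) and the helper name `filter_keypoints` are hypothetical, and the sketch assumes that only the (x, y) coordinates, not the confidence scores, enter the input vector.

```python
N_BODY, N_HAND = 25, 21  # keypoint counts stated in the text


def filter_keypoints(keypoints, threshold):
    """Zero out keypoints whose detection confidence is below threshold.

    keypoints: list of (x, y, confidence) triples.
    Returns a flat [x1, y1, x2, y2, ...] vector without the confidences.
    """
    flat = []
    for x, y, conf in keypoints:
        if conf < threshold:
            x, y = 0.0, 0.0
        flat.extend([x, y])
    return flat


# Body skeleton plus two hands, two coordinates per keypoint.
input_size = 2 * (N_BODY + 2 * N_HAND)
print(input_size)  # 134

# Hypothetical detections: the second keypoint is unreliable and gets zeroed.
detections = [(1.0, 2.0, 0.9), (3.0, 4.0, 0.1)]
print(filter_keypoints(detections, threshold=0.3))  # [1.0, 2.0, 0.0, 0.0]
```

Under these assumptions the body branch receives a 134-dimensional vector per frame.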
Exploratory Results on the GEMEP Database
The GEMEP database includes videos of 10 actors performing 17 different emotions: admiration, amusement, anger, anxiety, contempt, despair, disgust, fear, interest, irritation, joy, pleasure, pride, relief, sadness, surprise and tenderness. In this work we use the CoreSet of the database which includes the first 12 of the aforementioned emotions.
We use 10-fold leave-one-subject-out cross-validation and repeat the process for 10 iterations, averaging the scores in the end.
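Leave-one-subject-out cross-validation holds out all samples of one subject per fold, so that with 10 actors there are 10 folds. The following sketch shows the split logic on toy data; the function name and the subject identifiers are made up for the example.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Toy data: 6 samples from 3 hypothetical actors.
subjects = ["a1", "a2", "a1", "a3", "a2", "a3"]
folds = list(leave_one_subject_out(subjects))
for held_out, train, test in folds:
    print(held_out, train, test)
```

Since no subject appears in both the training and the test indices of a fold, the reported score reflects generalization to unseen people rather than to unseen clips of known people.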
For all different evaluation setups, we train for 200 epochs, reducing the learning rate by a factor of 10 at 150 epochs.
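The step schedule described above can be written as a one-line rule. The base learning rate is not stated in the text, so the value below is purely hypothetical, as is the convention that epochs are counted from 0 and the reduction takes effect at epoch index 150.

```python
def learning_rate(epoch, base_lr, milestone=150, factor=0.1):
    """Step schedule: scale base_lr by `factor` from `milestone` onwards."""
    return base_lr * factor if epoch >= milestone else base_lr


BASE_LR = 1e-3  # hypothetical value, not given in the text
schedule = [learning_rate(epoch, BASE_LR) for epoch in range(200)]
print(schedule[149], schedule[150])
```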
We report Top 1 accuracy for several experimental setups in Table 4. For the body expression recognition branch we compare three different implementations: the implementation with global temporal average pooling (GTAP) using a hidden FC layer of 256 neurons followed by a ReLU activation, a temporal convolutional network (TCN) with 8 temporal convolutional residual blocks, 128 channels and kernel size 2, and a bidirectional long short-term memory network (LSTM) [27] with 100 hidden units and two layers, preceded by an FC layer of 128 neurons and a ReLU activation. For both TCN and LSTM we average the outputs over all time steps. We see that GTAP achieves the highest accuracy although it is a much simpler method. We believe that, due to the small amount of data, the methods focus only on certain representative postures that occur during the expression of emotions, and ignore sequential information. The face branch achieves a higher accuracy score than the body branch, which is an expected result. Our main observation is that the whole body emotion recognition branch achieves a significant improvement over the face branch baseline.
In the Table we also include experiments on a frame level, where we take only the middle frame of each video sequence and skip the temporal pooling structures in each branch. We see that the face and body branches in this case achieve similar results with the whole body emotion recognition branch giving a large performance boost again.
| Body br. (TCN) | 0.31 |
| Body br. (LSTM) | 0.28 |
| Body br. (GTAP) | 0.34 |
| Whole Body br. | 0.51 |
| Human Baseline | 0.47 |
| Whole Body br. | 0.33 |
Emotion-specific details can be seen in the confusion matrices of Fig. 7. We show the confusion matrices for the separately trained body, face and whole body branches. We can see that in cases such as the pride emotion, the face branch cannot correctly distinguish it, as opposed to the postural branch, which achieves its highest accuracy for pride. This result is also in line with Tracy and Robins [44], who argue that pride includes a set of body configurations which are more distinguishable than the corresponding facial expression. In other emotions such as joy and anger, the combination of face and posture results in a higher accuracy. There are also emotions, such as anxiety or pleasure, for which the body branch fails to learn any patterns. In these cases, the whole body branch achieves a lower accuracy than the face branch.
| Method | Branch | Whole body label (6 classes) | | Face label (7 classes) | | Body label (7 classes) | |
| Joint-1L | | 0.64 (0.66) | 0.65 (0.67) | - | - | - | - |
| SEP | Body br. | 0.29 (0.29) | 0.35 (0.33) | - | - | 0.36 (0.53) | 0.38 (0.56) |
| SEP | Face br. | 0.57 (0.60) | 0.61 (0.63) | 0.50 (0.59) | 0.52 (0.61) | - | - |
| SEP | Sum Fusion | 0.63 (0.66) | 0.65 (0.67) | - | - | - | - |
| HMT-3a | Body br. | 0.32 (0.33) | 0.36 (0.35) | - | - | 0.35 (0.48) | 0.39 (0.47) |
| HMT-3a | Face br. | 0.54 (0.57) | 0.59 (0.63) | 0.48 (0.57) | 0.52 (0.61) | - | - |
| HMT-3a | Fusion | 0.64 (0.67) | 0.66 (0.68) | - | - | - | - |
| HMT-3b | Body br. | 0.32 (0.32) | 0.36 (0.34) | - | - | 0.36 (0.50) | 0.39 (0.49) |
| HMT-3b | Face br. | 0.53 (0.56) | 0.59 (0.63) | 0.51 (0.60) | 0.54 (0.63) | - | - |
| HMT-3b | Whole body br. | 0.64 (0.66) | 0.66 (0.68) | - | - | - | - |
| HMT-4 | Body br. | 0.32 (0.32) | 0.36 (0.34) | - | - | 0.34 (0.47) | 0.38 (0.46) |
| HMT-4 | Face br. | 0.53 (0.56) | 0.58 (0.62) | 0.49 (0.58) | 0.53 (0.62) | - | - |
| HMT-4 | Fusion | 0.69 (0.71) | 0.71 (0.72) | - | - | - | - |
Results on the BabyRobot Emotion Database
For BRED we follow exactly the same procedure as with the GEMEP database: training for 200 epochs with a learning rate step at 150 epochs, and 10-fold cross-validation for 10 iterations. For the 10-fold cross-validation, we make sure that no subject (30 in total) appears in both the training and the test set of the same split. Because the database is highly unbalanced, especially for the body labels, we report results on balanced and unbalanced F1-score and accuracy. Due to this imbalance we also use a balanced cross entropy loss, since the number of instances labeled as neutral is much larger than the number of emotion instances.
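One common way to balance a cross entropy loss is to weight each class inversely to its frequency. The text does not specify the weighting that was used, so the scheme below (total samples divided by number of classes times class count) is an assumption, as are the function names and the toy label distribution.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


def weighted_cross_entropy(probs, label, weights):
    """Cross entropy of one sample, scaled by the weight of its true class."""
    return -weights[label] * math.log(probs[label])


# Toy body labels dominated by "neutral".
labels = ["neutral"] * 8 + ["fear"] * 2
weights = balanced_class_weights(labels)
print(weights)  # {'neutral': 0.625, 'fear': 2.5}

probs = {"neutral": 0.5, "fear": 0.5}  # hypothetical predicted probabilities
print(weighted_cross_entropy(probs, "fear", weights))
```

With such weights, a mistake on a rare emotion sample costs more than a mistake on a frequent neutral sample, which discourages the trivial solution of always predicting neutral.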
We also note that for BRED, the face and body annotations include 7 classes (all emotions plus neutral), while the whole body annotation includes 6 classes (all emotions).
We report our results in Table 5. The whole body label columns report the metrics on the whole body labels, while the face and body label columns report results on the hierarchical face and body labels, respectively. For calculating the metrics of the face and body branches against the whole body labels, we ignore the scores of the “neutral” label. Numbers outside parentheses report balanced scores and numbers inside parentheses report unbalanced scores.
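The difference between balanced and unbalanced scores can be made concrete with accuracy. The sketch below assumes that "balanced" means averaging per-class recall so that every class counts equally; the exact definition used for the reported numbers is not spelled out in the text, and the function names and toy predictions are illustrative.

```python
def accuracy(y_true, y_pred):
    """Unbalanced accuracy: fraction of correctly classified samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: mean of the per-class recalls."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(y_pred[i] == c for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# A classifier that always predicts the majority class.
y_true = ["happy"] * 4 + ["fear"]
y_pred = ["happy"] * 5
print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The gap between the two numbers shows why both are worth reporting on an unbalanced database: the unbalanced score rewards favoring frequent classes, while the balanced score does not.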
The Table contains results of different methods. SEP denotes independent training of the body and face branches using their corresponding labels. Joint-1L denotes training of the whole body emotion branch using only the whole body loss. HMT-3a denotes joint training of the hierarchical multi-label training network when we omit the whole body emotion recognition branch, i.e., with the face, body, and fusion losses. HMT-3b denotes joint training with the face, body, and whole body losses, omitting the final score fusion. Finally, HMT-4 denotes joint training with all four losses of the HMT network. In the methods that include the fusion branch, we obtain the final prediction from the fusion scores. In the case of HMT-3b, where we omit the final fusion, we obtain the final whole body label prediction from the whole body branch.
Our initial observation is the fact that the combination of body posture and facial expression results in a significant improvement over the facial expression baselines, for all different methods.
Secondly, we see that HMT-4 achieves the highest scores for all metrics, across all methods, as far as the whole body emotion label is concerned while HMT-3a and HMT-3b have a similar performance which is also comparable with separate training of the body and face branches and fusing them with sum fusion.
Recall that the face and body labels have one more class (neutral) than the whole body labels, which is why the scores of the face branch appear lower in the face label columns. This is not the case for the body branch, because the body labels and the whole body labels differ in many more samples (99) than the face labels and the whole body labels do.
In Fig. 7 we also show the confusion matrices for the body, face and fusion predictions, when compared against the whole body labels. We can see that, in general, because children relied more on facial expressions than on bodily expressions, including only the body branch in a system would result in low performance. We also see that the face branch achieves a low recognition rate for fear and anger. However, fusing the two using the HMT network results in a model that can reliably recognize all emotions.
In Fig. 5 we present several results (both positive and negative) of our method.
6 Conclusions
In this work we proposed a method for automatic recognition of affect that combines whole body posture and facial expression cues. The proposed method can be trained both end-to-end and branch by branch, and leverages multiple hierarchical labels in order to give us computational models that can be used jointly and individually.
We performed an extensive evaluation of the proposed method on the BabyRobot Emotion Database, which features whole body emotional expressions of children during a CRI scenario. CRI presents a challenging application which requires leveraging body posture for emotion recognition, and cannot rely only on facial expressions.
Our results show that fusing body and facial expression cues significantly improves on emotion recognition baselines that are based only on facial expressions, and that even 2D posture can be used with promising results for recognition of emotion from body posture. We also show that hierarchical multi-label training can be used to increase the performance of the system.
We believe our research shows promising results towards establishing body posture as a necessary direction for multiple applications of computer vision, and highlights the need for large scale whole body emotional expression databases.
References
- [1] Furhat Robotics. http://furhatrobotics.com.
- [2] Robokind. Advanced Social Robots. http://robokind.com/.
- [3] A. P. Atkinson, W. H. Dittrich, A. J. Gemmell, and A. W. Young. Emotion perception from dynamic and static body expressions in point-light and full-light displays. Perception, 33(6):717–746, 2004.
- [4] H. Aviezer, Y. Trope, and A. Todorov. Body cues, not facial expressions, discriminate between intense positive and negative emotions. Science, 338(6111):1225–1229, 2012.
- [5] S. Bai, J. Z. Kolter, and V. Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
- [6] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L. Morency. Openface 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face Gesture Recognition (FG 2018), pages 59–66, May 2018.
- [7] T. Bänziger, M. Mortillaro, and K. R. Scherer. Introducing the geneva multimodal expression corpus for experimental research on emotion perception. Emotion, 12(5):1161, 2012.
- [8] P. Barros, D. Jirak, C. Weber, and S. Wermter. Multimodal emotional state recognition using sequence-dependent deep hierarchical features. Neural Networks, 72:140–151, 2015.
- [9] T. Belpaeme, P. Baxter, J. De Greeff, J. Kennedy, R. Read, R. Looije, M. Neerincx, I. Baroni, and M. C. Zelati. Child-robot interaction: Perspectives and challenges. In International Conference on Social Robotics, pages 452–459. Springer, 2013.
- [10] T. Belpaeme, J. Kennedy, A. Ramachandran, B. Scassellati, and F. Tanaka. Social robots for education: A review. Science Robotics, 3(21):eaat5954, 2018.
- [11] T. W. Bickmore and R. W. Picard. Establishing and maintaining long-term human-computer relationships. ACM Transactions on Computer-Human Interaction (TOCHI), 12(2):293–327, 2005.
- [12] E. Broadbent, R. Stafford, and B. MacDonald. Acceptance of healthcare robots for the older population: Review and future directions. International journal of social robotics, 1(4):319, 2009.
- [13] R. Calvo, S. D’Mello, J. Gratch, A. Kappas, M. Lhommet, and S. C. Marsella. Expressing emotion through posture and gesture, 07 2014.
- [14] R. A. Calvo, S. D’Mello, J. Gratch, and A. Kappas. The Oxford handbook of affective computing. Oxford Library of Psychology, 2015.
- [15] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
- [16] G. Castellano, L. Kessous, and G. Caridakis. Emotion recognition through multiple modalities: face, body gesture, speech. In Affect and emotion in human-computer interaction, pages 92–103. Springer, 2008.
- [17] G. Castellano, I. Leite, A. Pereira, C. Martinho, A. Paiva, and P. W. McOwan. Multimodal affect modeling and recognition for empathic robot companions. International Journal of Humanoid Robotics, 10(01):1350010, 2013.
- [18] N. Dael, M. Mortillaro, and K. R. Scherer. The body action and posture coding system (bap): Development and reliability. Journal of Nonverbal Behavior, 36(2):97–121, 2012.
- [19] N. Dael, M. Mortillaro, and K. R. Scherer. Emotion expression in body action and posture. Emotion, 12(5):1085, 2012.
- [20] B. De Gelder. Why bodies? twelve reasons for including bodily expressions in affective neuroscience. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 364(1535):3475–3484, 2009.
- [21] P. Ekman and W. V. Friesen. Head and body cues in the judgment of emotion: A reformulation. Perceptual and motor skills, 24(3 PT 1):711–724, 1967.
- [22] P. Ekman and W. V. Friesen. Constants across cultures in the face and emotion. Journal of personality and social psychology, 17(2):124, 1971.
- [23] H. Gunes and M. Piccardi. A bimodal face and body gesture database for automatic analysis of human nonverbal affective behavior. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 1, pages 1148–1153. IEEE, 2006.
- [24] H. Gunes and M. Piccardi. Bi-modal emotion recognition from expressive face and body gestures. Journal of Network and Computer Applications, 30(4):1334–1345, 2007.
- [25] H. Gunes and M. Piccardi. Automatic temporal segment detection and affect recognition from face and body display. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(1):64–84, 2009.
- [26] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [27] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [28] H. Jung, S. Lee, J. Yim, S. Park, and J. Kim. Joint fine-tuning in deep neural networks for facial expression recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2983–2991, 2015.
- [29] M. Karg, A. Samadani, R. Gorbet, K. Kühnlenz, J. Hoey, and D. Kulić. Body movements for affective expression: A survey of automatic recognition and generation. IEEE Transactions on Affective Computing, 4(4):341–359, Oct 2013.
- [30] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [31] A. Kleinsmith and N. Bianchi-Berthouze. Affective body expression perception and recognition: A survey. IEEE Transactions on Affective Computing, 4(1):15–33, Jan 2013.
- [32] C.-M. Kuo, S.-H. Lai, and M. Sarkis. A compact deep learning model for robust facial expression recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
- [33] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012.
- [34] I. Leite, G. Castellano, A. Pereira, C. Martinho, and A. Paiva. Empathic robots for long-term interaction. International Journal of Social Robotics, 6(3):329–341, 2014.
- [35] S. Li and W. Deng. Deep facial expression recognition: A survey. arXiv preprint arXiv:1804.08348, 2018.
- [36] E. Marinoiu, M. Zanfir, V. Olaru, and C. Sminchisescu. 3d human sensing, action and emotion recognition in robot assisted therapy of children with autism. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2158–2167, 2018.
- [37] A. Mollahosseini, B. Hasani, and M. H. Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, pages 1–1, 2018.
- [38] F. Noroozi, C. A. Corneanu, D. Kamińska, T. Sapiński, S. Escalera, and G. Anbarjafari. Survey on emotional body gesture recognition. arXiv preprint arXiv:1801.07481, 2018.
- [39] R. W. Picard et al. Affective computing. 1995.
- [40] A. Psaltis, K. Kaza, K. Stefanidis, S. Thermos, K. C. Apostolakis, K. Dimitropoulos, and P. Daras. Multimodal affective state recognition in serious games applications. In Imaging Systems and Techniques (IST), 2016 IEEE International Conference on, pages 435–439. IEEE, 2016.
- [41] J. Sanghvi, G. Castellano, I. Leite, A. Pereira, P. W. McOwan, and A. Paiva. Automatic analysis of affective postures and body motion to detect engagement with a game companion. In Proceedings of the 6th international conference on Human-robot interaction, pages 305–312. ACM, 2011.
- [42] T. Simon, H. Joo, I. Matthews, and Y. Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017.
- [43] B. Sun, S. Cao, J. He, and L. Yu. Affect recognition from facial movements and body gestures by hierarchical deep spatio-temporal features and fusion strategy. Neural Networks, 105:36–51, 2018.
- [44] J. L. Tracy and R. W. Robins. Show your pride: Evidence for a discrete emotion expression. Psychological Science, 15(3):194–197, 2004.
- [45] J. Van den Stock, R. Righart, and B. De Gelder. Body expressions influence recognition of emotions in the face and voice. Emotion, 7(3):487, 2007.
- [46] H. G. Wallbott. Bodily expression of emotion. European journal of social psychology, 28(6):879–896, 1998.