1 Introduction
Human emotion recognition using intelligent systems is an important socio-behavioral task that arises in various applications, including behavior prediction [13], surveillance [2], robotics [4], and affective computing [52, 3]. Current research in perceiving human emotion predominantly uses facial cues [17], speech [24], or physiological signals such as heartbeats and respiration rates [55]. These techniques have been used to identify and classify broad emotions including happiness, sadness, anger, disgust, fear, and other combinations [15].
Understanding the perceived emotions of individuals using nonverbal cues, such as facial expressions or body movement, is regarded as an important and challenging problem in both AI and psychology, especially when self-reported emotions are unreliable or misleading [37]. Most prior work has focused on facial expressions, due to the availability of large datasets [16]. However, facial emotions can be unreliable in contexts such as referential expressions [14] or the presence or absence of an audience [20]. Therefore, we need better techniques that can utilize other nonverbal cues.
In this paper, we mainly focus on using movement features corresponding to gaits in a walking video for emotion perception. A gait is defined as an ordered temporal sequence of body joint transformations (predominantly translations and rotations) during the course of a single walk cycle. Simply stated, a person’s gait is the way the person walks. Prior work in psychology literature has reported that participants were able to identify sadness, anger, happiness, and pride by observing affective features corresponding to arm swinging, long strides, erect posture, collapsed upper body, etc. [36, 34, 35, 30].
There is considerable recent work on pose or gait extraction from a walking video using deep convolutional network architectures and intricately designed loss functions [10, 21]. Gaits have also been used for a variety of applications including action recognition [46, 49] and person identification [54]. However, the use of gaits for automatic emotion perception has been fairly limited, primarily due to a lack of gait data or videos annotated with emotions [7]. It is difficult to generate a large dataset with many thousands of annotated real-world gait videos to train a network.
Main Results: We present a learning-based approach to classify perceived emotions of an individual walking in a video. Our formulation consists of a novel classifier and a generative network as well as an annotated gait video dataset. The main contributions include:

A novel end-to-end Spatial Temporal Graph Convolution-Based Network (STEP), which implicitly extracts a person’s gait from a walking video to predict their emotion. STEP combines deeply learned features with affective features to form hybrid features.

A Conditional Variational Autoencoder (CVAE) called STEPGen, which is trained on a sparse real-world annotated gait set and can easily generate thousands of annotated synthetic gaits. We enforce the temporal constraints (e.g., gait drift and gait collapse) inherent in gaits directly in the loss function of the CVAE, along with a novel push-pull regularization loss term. Our formulation helps to avoid overfitting by generating more realistic gaits. These synthetic gaits improve the accuracy of STEP by in our benchmarks.

We present a new dataset of human gaits annotated with emotion labels (EGait). It consists of real-world and synthetic gait videos annotated with different emotion labels.
We have evaluated the performance of STEP on EGait. The gaits in this dataset were extracted from videos of humans walking in both indoor and outdoor settings and labeled with one of four emotions: angry, sad, happy, or neutral. In practice, STEP results in classification accuracy of on EGait. We have compared it with prior methods and observe:

An accuracy increase of over the prior learning-based method [38]. This method uses LSTMs to model its input, but for an action recognition task.

An absolute accuracy improvement of over prior gait-based emotion recognition methods reported in the psychology literature that use affective features.
2 Related Work
We provide a brief overview of prior work in emotion perception and generative models for gait-like datasets.
Emotion Perception. Face and speech data have been widely used to perceive human emotions. Prior methods that use faces as input commonly track action units on the face such as points on the eyebrow, cheeks and lips [16], or track eye movements [39] and facial expressions [33]. Speech-based emotion perception methods use either spectral features or prosodic features like loudness of voice, difference in tones and changes in pitch [24]. With the rising popularity of deep learning, there is considerable work on developing learned features for emotion detection from large-scale databases of faces [51, 53] and speech signals [12]. Recent methods have also looked at the cross-modality of combined face and speech data to perform emotion recognition [1]. In addition to faces and speech, physiological signals such as heartbeats and respiration rates [55] have also been used to increase the accuracy of emotion perception. Our approach for emotion perception from walking videos and gaits is complementary to these methods and can be combined with them.
Different methods have also been proposed to perceive emotions from gaits. Karg et al. [26] use PCA-based classifiers, and Crenn et al. [8] use SVMs on affective features. Venture et al. [45] use autocorrelation matrices between joint angles to perform similarity-based classification. Daoudi et al. [11] represent joint movements as symmetric positive definite matrices and perform nearest neighbor classification.
Gaits have also been widely used in the related problem of action recognition [25, 18, 19, 32, 49]. In our approach, we take motivation from prior works on both emotion perception and action recognition from gaits.
Gait Generation. Collecting and compiling a large dataset of annotated gait videos is a challenging task. As a result, it is important to develop generative algorithms for gaits conditioned on emotion labels. Current learning-based generation models are primarily based on Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). MoCoGAN [43] uses a GAN-based model, the latent space of which is divided into motion space (for generating temporal features) and content space (for generating spatial features). It can generate tiny videos of facial expressions corresponding to various emotions. vid2vid [47] is a state-of-the-art GAN-based network that uses a combined spatial temporal adversarial objective to generate high-resolution videos, including videos of human poses and gaits when trained on relevant real data. Other generative methods for gaits learn the initial poses and the intermediate transformations between frames in separate networks, and then combine the generated samples from both networks to develop realistic gaits [50, 6]. In this work, we model gaits as skeletal graphs and use spatial-temporal graph convolutions [49] inside a VAE to generate synthetic gaits.
3 Background
In this section, we give a brief overview of Spatial Temporal Graph Convolutional Networks (STGCNs) and Conditional Variational Autoencoders (CVAE).
3.1 GCN and STGCN
The Graph Convolutional Network (GCN) was first introduced in [5] to apply convolutional filters to arbitrarily structured graph data. Consider a graph $G = \{V, E\}$ with $N = |V|$ nodes. Also consider a feature matrix $X \in \mathbb{R}^{N \times F}$, where row $x_i \in \mathbb{R}^{F}$ corresponds to a feature for vertex $i$. The propagation rule of a GCN is given as
(1) $Z^{(l+1)} = \sigma\left(A Z^{(l)} W^{(l)}\right)$
where $Z^{(l)}$ and $Z^{(l+1)}$ are the inputs to the $l$-th and the $(l+1)$-th layers of the network, respectively. $Z^{(0)} = X$, $W^{(l)}$ is the weight matrix between the $l$-th and the $(l+1)$-th layers, $A$ is the adjacency matrix associated with the graph $G$, and $\sigma(\cdot)$ is a nonlinear activation function (e.g., ReLU). Thus, a GCN takes in a feature matrix $X$ as an input and generates another feature matrix $Z^{(L)}$ as the output, $L$ being the number of layers in the network. In practice, each weight matrix $W$ in a GCN represents a convolutional kernel. Multiple such kernels can be applied to the input of a particular layer to get a feature tensor as output, similar to a conventional Convolutional Neural Network (CNN). For example, if $K$ kernels, each of dimension $F \times F'$, are applied to the input $X$, then the output of the first layer will be an $N \times F' \times K$ feature tensor.
Yan et al. [49] extended GCNs to develop the spatial temporal GCN (STGCN), which can be used for action recognition from human skeletal graphs. The graph in their case is the skeletal model of a human extracted from videos. Since they extract poses from each frame of a video, their input is a temporal sequence of such skeletal models. “Spatial” refers to the spatial edges in the skeletal model, which are the limbs connecting the body joints. “Temporal” refers to temporal edges which connect the positions of each joint across different time steps. Such a representation enables the gait video to be expressed as a single graph with a fixed adjacency matrix, which can thus be passed through a GCN. The feature per vertex in their case is the 3D position of the joint represented by that vertex. In our work, we use the same representation for gaits, described later in Section 4.1.
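As a rough illustration of the GCN propagation rule described in this section, the following NumPy sketch applies two layers, each computing an activation of the product of the adjacency matrix, the layer input, and a weight matrix. The toy 3-node graph, the self-loops in the adjacency matrix, and the layer widths are all assumptions made for the example; they are not taken from the paper.

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One GCN propagation step: ReLU(A @ Z @ W)."""
    return np.maximum(adjacency @ features @ weights, 0.0)

# Hypothetical toy graph: a 3-node chain 0 - 1 - 2, with self-loops (an assumption).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 3))     # N = 3 nodes, F = 3 features (e.g., a 3D position)
W0 = rng.normal(size=(3, 8))    # first-layer weights: 3 -> 8 features (hypothetical width)
W1 = rng.normal(size=(8, 4))    # second-layer weights: 8 -> 4 features (hypothetical width)

Z1 = gcn_layer(A, X, W0)        # shape (3, 8)
Z2 = gcn_layer(A, Z1, W1)       # shape (3, 4)
```

Each row of the output mixes the features of a node with those of its neighbors, which is how a stack of such layers can pick up local structure such as the joint movement of adjacent body parts.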
3.2 Conditional Variational Autoencoder
The variational autoencoder [28] is an encoder-decoder architecture that is used for data generation based on Bayesian inference. The encoder transforms the training data into a latent lower-dimensional distribution space. The decoder draws random samples from that distribution and generates synthetic data that are as similar to the training data as possible.
A conditional VAE [41], instead of generating from a single distribution space learned by the encoder, learns separate distributions for the separate classes in the training data. Thus, given a class, the decoder produces random samples from the conditional distribution of that class, and generates synthetic data of that class from those samples. Furthermore, if we assume that the decoder generates Gaussian variables for every class, then the negative log likelihood for each class is given by the MSE loss
(2) $\mathcal{L}_o = \lVert x - f_c(z) \rVert^2$
where $f_c(\cdot)$ denotes the decoder function for class $c$, $x$ represents the training data, and $z$ the latent random variable. We incorporate a novel push-pull regularization loss on top of this standard CVAE loss, as described in Section 4.2.
4 STEP and STEPGen
Our objective is to perform emotion perception from gaits. Based on prior work [30, 26, 8], we assume that emotional cues are largely determined by localized variances in gaits, such as swinging speed of the arm (movement of 3 adjacent joints: shoulder, elbow and hand), stride length and speed (movement of 3 adjacent joints: hip, knee and foot), relative position of the spine joint w.r.t. the adjacent root and neck joints, and so on. Convolutional kernels are known to capture such local variances and encode them into meaningful feature representations for learning-based algorithms [31]. Additionally, since we treat gaits as a periodic motion that consists of a sequence of localized joint movements in 3D, we use GCNs for our generation and classification networks to capture these local variances efficiently. In particular, we use Spatial Temporal GCNs (STGCNs) developed by [49] to build both our generation and classification networks. We now elaborate our entire approach in detail.
4.1 Extracting Gaits from Videos
Naturally collected human gait videos contain a wide variety of extraneous information such as attire, items carried (e.g., bags or cases), background clutter, etc. We use a state-of-the-art pose estimation method [10] to extract clean, 3D skeletal representations of the gaits from videos. Moreover, gaits in our dataset are collected from varying viewpoints and scales. To ensure that the generative network does not end up generating an extrinsic mean of the input gaits, we perform view normalization. Specifically, we transform all gaits to a common point of view in the world coordinates using the Umeyama method [44]. Thus, a gait in our case is a temporal sequence of view normalized skeletal graphs extracted per frame from a video. We now provide a formal definition for gait.
Definition 4.1.
A gait is represented as a graph , where denotes the set of vertices and denotes the set of edges, such that

, represents the 3D position of the th joint in the skeleton at time step and is the total number of joints in the skeleton.

is the set of all nodes that are adjacent to as per the skeletal graph at time step ,

denotes the set of positions of the th joint across all time steps ,

, , , .
A key prerequisite for using GCNs is to define the adjacency between the nodes in the graph [5, 29, 49]. Note that as per Definition 4.1, given fixed and , any pair of gaits and can have different sets of vertices, and respectively, but necessarily have the same edge set and hence the same adjacency matrix . This useful property of the definition allows us to maintain a unique notion of adjacency for all the gaits in a dataset, and thus develop STGCN-based networks for the dataset.
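The fixed-adjacency property can be made concrete with a small sketch. The code below builds one adjacency matrix over all (frame, joint) nodes from spatial edges (bones within a frame) and temporal edges (the same joint in neighboring frames). The 5-joint skeleton, the node indexing, and the choice to link only consecutive frames are assumptions for illustration; the actual skeleton and temporal connectivity used in the paper may differ.

```python
import numpy as np

def spatial_temporal_adjacency(num_joints, num_frames, bones):
    """Adjacency matrix over num_frames * num_joints nodes.

    Node (t, j) is stored at index t * num_joints + j.
    """
    n = num_joints * num_frames
    adj = np.zeros((n, n), dtype=np.int8)
    for t in range(num_frames):
        base = t * num_joints
        # Spatial edges: bones connecting joints within frame t.
        for a, b in bones:
            adj[base + a, base + b] = 1
            adj[base + b, base + a] = 1
        # Temporal edges: the same joint in frames t and t + 1.
        if t + 1 < num_frames:
            for j in range(num_joints):
                adj[base + j, base + num_joints + j] = 1
                adj[base + num_joints + j, base + j] = 1
    return adj

# Hypothetical 5-joint skeleton: root(0)-spine(1)-neck(2), neck-left hand(3), neck-right hand(4).
BONES = [(0, 1), (1, 2), (2, 3), (2, 4)]
adj = spatial_temporal_adjacency(num_joints=5, num_frames=4, bones=BONES)

rng = np.random.default_rng(0)
gait_a = rng.normal(size=(4, 5, 3))  # (frames, joints, xyz)
gait_b = rng.normal(size=(4, 5, 3))  # different joint positions, same graph structure
```

Both `gait_a` and `gait_b` can be paired with the same `adj`, since the matrix depends only on the number of joints, the number of frames, and the bone list, not on the joint positions.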
4.2 STEPGen: The Generation Network
We show our generative network in Figure 2. Our network architecture is based on the Conditional Variational Autoencoder (CVAE) [41].
In the encoder, each dimensional input gait, preprocessed from a video (as per Section 4.1), is appended with the corresponding label, and passed through a set of 3 STGCN layers (yellow boxes). = is the feature dimension of each node in the gait, representing the 3D position of the corresponding joint. The first STGCN layer has kernels and the next two have kernels each. The output from the last STGCN layer is average pooled along both the temporal and joint dimensions (blue box). Thus, the output of the pooling layer is a tensor. This tensor is passed through two convolutional layers in parallel (red boxes). The outputs of the two convolutional layers are dimensional vectors, which are the mean and the log-variance of the latent space respectively (purple boxes). All STGCN layers are followed by the ReLU nonlinearity, and all the layers are followed by a BatchNorm layer (not shown separately in Figure 2).
In the decoder, we generate random samples from the dimensional latent space and append them with the same label provided with the input. As commonly performed in VAEs, we use the reparametrization trick [28] to make the overall network differentiable. The random sample is passed through a deconvolutional layer (red box), and the output feature is repeated (“unpooled”) along both the temporal and the joint dimension (green box) to produce a dimensional tensor. This tensor is then passed through 3 spatial temporal graph deconvolutional layers (STGDCNs) (yellow boxes). The first STGDCN layer has kernels, the second one has channels, and the last one has = channels. Hence, we finally get a dimensional tensor at the output, which is a synthetic gait for the provided label. As in the encoder part, all STGDCN layers are followed by a ReLU nonlinearity, and all layers are followed by a BatchNorm layer (not shown separately in Figure 2).
Once the network is trained, we can generate new synthetic gaits by drawing random samples from the dimensional latent distribution space parametrized by the learned and .
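The sampling step in the decoder can be sketched as follows: a latent sample is drawn with the reparametrization trick from a mean and a log-variance, and the one-hot emotion label is appended before decoding. The latent size of 32, the zero-valued stand-ins for the encoder outputs, and the label index are hypothetical values chosen for the example, and the decoder itself is omitted.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) and sigma = exp(0.5 * log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def one_hot(label, num_classes):
    """One-hot encoding of an integer class label."""
    vec = np.zeros(num_classes)
    vec[label] = 1.0
    return vec

rng = np.random.default_rng(0)
LATENT_DIM = 32    # hypothetical latent size
NUM_CLASSES = 4    # angry, sad, happy, neutral

mu = np.zeros(LATENT_DIM)        # stand-in for the learned mean
log_var = np.zeros(LATENT_DIM)   # stand-in for the learned log-variance

z = reparameterize(mu, log_var, rng)
decoder_input = np.concatenate([z, one_hot(2, NUM_CLASSES)])  # label index 2 is arbitrary
```

Because the randomness enters only through `eps`, gradients can flow through `mu` and `log_var` during training, which is the reason the trick is used.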
The original CVAE loss is given by:
(3) 
where , where each is assumed to be a row vector consisting of the 3D position of the joint at frame . The subscripts and stand for real and synthetic data respectively.
Each gait corresponds to a temporal sequence. Therefore, for any gait representation, it is essential to incorporate such temporal information. This is even more important as temporal changes in a gait provide significant cues for emotion perception [30, 26, 8]. However, the baseline CVAE architecture does not take into account the temporal nature of the gaits. We therefore modify the original reconstruction loss of the CVAE by adding regularization terms that enforce the desired temporal constraints (Equation 8).
We propose a novel “push-pull” regularization scheme. We first make sure that sufficient movement occurs in a generated gait across the frames so that the joint configurations at different time frames do not collapse into a single configuration. This is the “push” scheme. Simultaneously, we make sure that the generated gaits do not drift too far from the real gaits over time due to excessive movement. This is the “pull” scheme.

Push: We require the synthetic data to resemble the joint velocities and accelerations of the real data as closely as possible. The velocity of a node at a frame can be approximated as the difference between the positions of the node at frames and , i.e.,
(4)
Similarly, the acceleration of a node at a frame can be approximated as the difference between the velocities of the node at frames and , i.e.,
(5)
We use the following loss for gait collapse:
(6)
where and .

Pull: When the synthetic gait nodes are forced to have nonzero velocity and acceleration between the frames, the difference between the synthetic node positions and the corresponding real node positions tends to increase as the number of frames increases. This is commonly known as the drift error. In order to constrain this error, we use the notion of anchor frames. At the anchor frames, we impose an additional penalty on the loss between the real and synthetic gaits. To be effective, the anchor frames need to be both numerous and as far apart as possible. Based on this trade-off, we choose 3 anchor frames in the temporal sequence: the first frame, the middle frame and the last frame of the gait. We use the following loss function for gait drift:
(7)
where denotes the set of anchor frames.
Finally, our modified reconstruction loss of the CVAE is given by
(8) 
where and are the regularization weights. Note that this modified loss function still satisfies the evidence lower bound (ELBO) [28] if we assume that the decoder generates variables from a mixture of Gaussian distributions for every class, with the original loss, the push loss and the pull loss representing the 3 Gaussian distributions in the mixture.
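To make the push-pull idea concrete, here is a minimal NumPy sketch under stated assumptions. Velocities and accelerations are taken as differences between consecutive frames, the push term is a squared error on those quantities, and the pull term is a squared position error at the first, middle and last frames. The exact functional forms, the frame indexing, and the weights `w_push` and `w_pull` are assumptions for illustration and may not match Equations 4 to 8.

```python
import numpy as np

def finite_differences(gait):
    """Velocity and acceleration as frame-to-frame differences.

    gait has shape (frames, joints, 3).
    """
    velocity = gait[1:] - gait[:-1]
    acceleration = velocity[1:] - velocity[:-1]
    return velocity, acceleration

def push_loss(real, synth):
    """Penalize mismatched joint velocities and accelerations (gait collapse)."""
    v_r, a_r = finite_differences(real)
    v_s, a_s = finite_differences(synth)
    return np.sum((v_r - v_s) ** 2) + np.sum((a_r - a_s) ** 2)

def pull_loss(real, synth):
    """Extra position penalty at the first, middle and last frames (gait drift)."""
    frames = real.shape[0]
    anchors = [0, frames // 2, frames - 1]
    return np.sum((real[anchors] - synth[anchors]) ** 2)

def reconstruction_loss(real, synth, w_push=1.0, w_pull=1.0):
    """Squared-error term plus weighted push and pull terms (weights are hypothetical)."""
    base = np.sum((real - synth) ** 2)
    return base + w_push * push_loss(real, synth) + w_pull * pull_loss(real, synth)

# Toy gait: 9 frames, 2 joints, moving 0.1 units per frame along z.
real = np.zeros((9, 2, 3))
real[:, :, 2] = 0.1 * np.arange(9)[:, None]

collapsed = np.repeat(real[:1], 9, axis=0)  # static gait: every frame equals frame 0
shifted = real + 0.5                        # correct motion, constant positional offset

collapse_penalty = push_loss(real, collapsed)  # positive: the synthetic gait does not move
offset_push = push_loss(real, shifted)         # ~0: velocities and accelerations match
offset_pull = pull_loss(real, shifted)         # positive: positions are off at the anchors
```

The two toy cases show why both terms are useful in this sketch: a static gait is caught by the push term, while a gait that moves correctly but sits away from the real one is caught only by the pull term.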
4.3 STEP: The Classification Network
We show our classifier network in Figure 3. In the base network, each input gait is passed through a set of STGCN layers (yellow boxes). The first STGCN layer has kernels and the next two have kernels each. The output from the last STGCN layer is average pooled (blue box) in both the temporal and joint dimensions and passed through a convolutional layer (red box). The output of the convolutional layer is passed through a fully connected layer of dimension (corresponding to the emotion labels that we have), followed by a softmax operation to generate the class labels. All the STGCN layers are followed by the ReLU nonlinearity and all layers except the fully connected layer are followed by a BatchNorm layer (not shown separately in Figure 3). We refer to this version of the network as the BaselineSTEP.
Prior work in gait analysis has shown that affective features for gaits provide important information for emotion perception [30, 26, 8]. Affective features comprise two types of features:

Posture features. These include angle and distance between the joints, area of different parts of the body (e.g., area of the triangle formed by the neck, the right hand and the left hand), and the bounding volume of the body.

Movement features. These include the velocity and acceleration of individual joints in the gait.
We exploit the affective feature formulation [30, 9] in our final network. We append the dimensional affective feature (purple box) to the final layer feature vector learned by our BaselineSTEP network, thus generating hybrid feature vectors. These hybrid feature vectors are passed through two fully connected layers of dimensions and respectively, followed by a softmax operation to generate the final class labels. We call this combined network STEP.
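The hybrid feature construction can be sketched as below. The code computes two example posture features (distance between the hands and the area of the neck and hands triangle) and simple movement features (mean speed and mean acceleration magnitude per joint), then appends them to a learned feature vector. The 3-joint layout, the particular features chosen, and the 8-dimensional learned vector are hypothetical; the full affective feature set in [30, 9] is larger.

```python
import numpy as np

def triangle_area(p, q, r):
    """Area of the triangle spanned by three 3D points."""
    return 0.5 * np.linalg.norm(np.cross(q - p, r - p))

def posture_features(frame, neck, left_hand, right_hand):
    """Two example posture features for one frame of shape (joints, 3)."""
    hand_distance = np.linalg.norm(frame[left_hand] - frame[right_hand])
    upper_area = triangle_area(frame[neck], frame[right_hand], frame[left_hand])
    return np.array([hand_distance, upper_area])

def movement_features(gait):
    """Mean speed and mean acceleration magnitude per joint; gait is (frames, joints, 3)."""
    velocity = np.diff(gait, axis=0)
    acceleration = np.diff(velocity, axis=0)
    speed = np.linalg.norm(velocity, axis=-1).mean(axis=0)
    accel = np.linalg.norm(acceleration, axis=-1).mean(axis=0)
    return np.concatenate([speed, accel])

def hybrid_feature(learned, gait, neck=0, left_hand=1, right_hand=2):
    """Concatenate a learned feature vector with time-averaged affective features."""
    posture = np.mean(
        [posture_features(f, neck, left_hand, right_hand) for f in gait], axis=0
    )
    return np.concatenate([learned, posture, movement_features(gait)])

# Toy gait: neck at the origin, hands on the x and y axes, body moving 0.1 per frame along z.
base = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
gait = np.stack([base + np.array([0.0, 0.0, 0.1 * t]) for t in range(4)])

learned = np.zeros(8)                    # stand-in for the network-learned feature
hybrid = hybrid_feature(learned, gait)   # 8 learned + 2 posture + 6 movement values
```

In a full pipeline, a vector like `hybrid` would be the input to the fully connected layers that produce the class scores.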
5 Experiments and Results
We list all the parameters and hardware used in training both our generation and classification networks in Section 5.1. In Section 5.2, we give details of our new dataset. In Section 5.3, we list the standard metrics used to compare generative models and classification networks, and in Section 5.4, we list the state-of-the-art methods against which we compare our algorithms. In Section 5.5, we present the evaluation results. Finally, in Section 5.6, we analyze the robustness of our system and show that both STEP and STEPGen do not overfit on the EGait dataset.
5.1 Training Parameters
For training STEPGen, we use a batch size of and train for epochs. We use the Adam optimizer [27] with an initial learning rate of , which decreases to th of its current value after , and epochs. We also use a momentum of and and weight decay of .
For training STEP, we use a split of for training, validation and testing sets. We use a batch size of and train for epochs using the Adam optimizer [27] with an initial learning rate of . The learning rate decreases to th of its current value after , and epochs. We also use a momentum of and and weight decay of . All our results were generated on an NVIDIA GeForce GTX 1080 Ti GPU.
5.2 Dataset: EmotionGait
EmotionGait (EGait) consists of real gaits and synthetic gaits for each of the 4 emotion classes generated by STEPGen, for a total of gaits. We collected of the real gaits ourselves. We asked participants to walk while thinking of the four different emotions (angry, neutral, happy and sad). The total distance of walking for each participant was meters. The videos were labeled by domain experts. The remaining gaits are taken as is from the Edinburgh Locomotion MOCAP Database [22]. However, since these gaits did not have any associated labels, we had them labeled with the 4 emotions by the same domain experts.
5.3 Evaluation Metrics
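Generation: For generative models, we compute the Fréchet Inception Distance (FID) score [23] that measures how close the generated samples are to the real inputs while maintaining diversity among the generated samples. The FID score is computed using the following formula:
(9) $\mathrm{FID} = \lVert \mu_x - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_g - 2\left(\Sigma_x \Sigma_g\right)^{1/2}\right)$
where $(\mu_x, \Sigma_x)$ and $(\mu_g, \Sigma_g)$ are the means and covariances of the feature distributions of the real and the generated samples, respectively.
Classification: For classifier models, we report the classification accuracy given by $\mathrm{Accuracy} = (TP + TN)/TD$, where $TP$, $TN$, and $TD$ are the number of true positives, true negatives, and total data, respectively.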
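Both metrics are straightforward to compute once features are available. The sketch below evaluates the Fréchet distance between Gaussians fitted to two feature sets, along with the accuracy ratio. It assumes the features have already been extracted by some network; which feature extractor is used for gaits is not stated here, and the random features and counts are purely illustrative.

```python
import numpy as np

def frechet_distance(real_feats, gen_feats):
    """Fréchet distance between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the eigenvalues
    # of cov_r @ cov_g, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt

def accuracy(true_positives, true_negatives, total):
    """Fraction of correctly classified samples."""
    return (true_positives + true_negatives) / total

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))                   # illustrative features
shifted_feats = real_feats + np.array([1.0, 0.0, 0.0, 0.0])

same = frechet_distance(real_feats, real_feats)          # ~0 for identical sets
moved = frechet_distance(real_feats, shifted_feats)      # ~1: squared mean shift only
acc = accuracy(40, 45, 100)                              # hypothetical counts
```

Shifting every sample by the same vector leaves the covariance unchanged, so the distance reduces to the squared norm of the mean shift, which makes the second case a convenient sanity check.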
5.4 Evaluation Methods
Generation: We compare our generative network with both GAN-based and VAE-based generative networks, as listed below.

vid2vid (GAN-based) [47]: This is the state-of-the-art video generation method. It can take human motion videos as input and generate high-resolution videos of the same motion.

Baseline CVAE (VAE-based): We use a CVAE with the same network architecture as STEPGen, but with only the original CVAE loss given in Equation 3 as the reconstruction loss.
Classification: We compare our classifier network with both prior methods for emotion recognition from gaits, and prior methods for action recognition from gaits, as listed below.
We also perform the following ablation experiments with our classifier network:

BaselineSTEP: It predicts emotions based only on the network-learned features from gaits. This network is trained on the real gaits in EGait.

STEP+Aug: This is the same implementation as STEP, but trained on both the real and the synthetic gaits in EGait.
5.5 Results on EGait
Generation: All the generative networks are trained on the real data in EGait. We report an FID score of , while the FID score of BaselineCVAE is . Lower FID indicates higher fidelity to the real data. However, we also note that vid2vid [47] completely memorizes the dataset and thus gives an FID score of . This is undesirable for our task since we require the generative network to be able to produce diverse data that can be augmented to the training set of the classifier network.
Additionally, to show that our novel “push-pull” regularization loss function (Equation 8) generates gaits with joint movements, we measure the decay of the value of the loss function for the baseline CVAE and STEPGen with time (Figure 5). We add the and terms from Equation 8 (without optimizing them) to the baseline CVAE loss function (Equation 3). We observe that STEPGen converges extremely quickly to a smaller loss value in around 28 epochs. On the other hand, the baseline CVAE produces oscillations and fails to converge as it does not optimize and .
We also perform qualitative tests of the gaits generated by all the methods. vid2vid [47] uses GANs to produce high-quality videos. However, in our experiments, vid2vid memorizes the dataset and does not produce diverse samples. The baseline CVAE produces static gaits that do not move in time. Finally, our gaits are both diverse (different from the input) and realistic (they successfully mimic walking motion). We show all these results in our demo video, available at: https://gamma.umd.edu/researchdirections/affectivecomputing/step.
Table 1: Classification accuracies (%) of all the methods on EGait, shown in increasing order. We choose methods from both psychology and computer vision literature. BaseSTEP and STEP+Aug are variations of STEP.
Method | Accuracy (%)
Venture et al. [45] | 30.83
Karg et al. [26] | 39.58
Daoudi et al. [11] | 42.52
Wang et al. [48] | 53.73
Crenn et al. [8] | 66.22
STGCN [49] | 65.62
LSTM [38] | 75.10
BaseSTEP | 78.24
STEP | 83.15
STEP + Aug | 89.41
Classification: In Table 1, we report the mean classification accuracies of all the methods using the formula in Section 5.3. We observe that most of the prior methods for emotion recognition from gaits have less than accuracy on EGait. Only Crenn et al. [8], where the authors manually compute the same features we use in our novel “push-pull” regularization loss function (i.e., distances between joints across time), has greater than accuracy. The two prior gait-based action recognition methods we compare with have 65.62% and 75.10% accuracy, respectively. By comparison, our BaselineSTEP has an accuracy of 78.24%. Combining network-learned and affective features in STEP gives an accuracy of 83.15%. Finally, augmenting synthetic gaits generated by STEPGen in STEP+Aug gives an accuracy of 89.41%.
To verify that our classification accuracy is statistically significant and not due to random chance, we perform two statistical tests:

Hypothesis Testing: Classification, as a task, depends largely on the test sample to be classified. To ensure that the classification accuracy of STEP is not achieved due to random positive examples, we determine the statistical likelihood of our results. Note that we do not test on STEP+Aug, as the accuracy of STEP+Aug also depends on the augmentation size. We generate a population of size accuracy values of STEP with mean . We set , i.e., the reported mean accuracy of STEP, as the null hypothesis, . To accept our null hypothesis, we require the p-value to be greater than . We compute the p-value of this population as . Therefore, we fail to reject the null hypothesis, thus corroborating our classification accuracy statistically.

Confidence Intervals: This metric determines the likelihood of a value residing in an interval. For a result to be meaningful and statistically significant, we require a tight interval with high probability. With a likelihood, we report a confidence interval of with a standard deviation of . Simply put, our classification accuracy will lie between and with a probability of .
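The two statistical checks can be illustrated with a short standard-library sketch. It computes a two-sided p-value for the null hypothesis that the mean accuracy equals a given value, and a confidence interval for the mean. The normal approximation, the 95% level, and the accuracy values are all assumptions for illustration; they are not the paper's numbers, and for a small population a t-distribution would be the more careful choice.

```python
import math
from statistics import NormalDist, mean, stdev

def confidence_interval(samples, level=0.95):
    """Normal-approximation confidence interval for the mean of samples."""
    m = mean(samples)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z_crit * stdev(samples) / math.sqrt(len(samples))
    return m - half_width, m + half_width

def two_sided_p_value(samples, null_mean):
    """Two-sided z-test p-value for H0: the population mean equals null_mean."""
    standard_error = stdev(samples) / math.sqrt(len(samples))
    z = (mean(samples) - null_mean) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Hypothetical accuracy values from repeated evaluation runs (not results from the paper).
runs = [0.71, 0.69, 0.72, 0.70, 0.68, 0.73, 0.70, 0.71]

low, high = confidence_interval(runs)
p = two_sided_p_value(runs, null_mean=0.70)

# A p-value above the chosen significance threshold means the null hypothesis
# (mean accuracy equals null_mean) is not rejected.
null_not_rejected = p > 0.05
```

A narrow interval around the mean together with a large p-value is the pattern the text describes as corroborating the reported accuracy.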
Finally, we show the discriminatory capability of our classifier through a confusion matrix in Figure 6.
5.6 Overfitting Analysis
Effect of Generated Data on Classification: We show in Figure 4 that the synthetic data generated by STEPGen increases the classification accuracy of STEP+Aug. This, in turn, shows that STEPGen does not memorize the training dataset, but can produce useful diverse samples. Nevertheless, we see that to achieve every percent improvement in the accuracy of STEP+Aug, we need to generate an exponentially larger number of synthetic samples as training saturation sets in.
Saliency Maps: We show that STEP does not memorize the training dataset, but learns meaningful features, using saliency maps obtained via guided backpropagation on the learned network [40, 42]. Saliency maps determine how the loss function output changes with respect to a small change in the input. In our case, the input consists of 3D joint positions over time; therefore, the corresponding saliency map highlights the joints that most influence the output. Intuitively, we expect the saliency map for a positively classified example to capture the joint movements that are most important for predicting the perceived emotion from a psychological point of view [8]. We show the saliency map given by our trained network for both a positively classified and a negatively classified example for the label ‘happy’ in Figure 7. The saliency map only shows the magnitude of the gradient along the axis (in and out of the plane of the paper), which is the direction of walking in both the examples. Black represents zero magnitude, and bright red represents a high magnitude. In the positive example, we see that the network detects simultaneous movement in the right leg and the left hand, followed by a transition period, followed by simultaneous movement of the left leg and the right hand. This is the expected behavior, as the movement of hands and the stride length and speed are important cues for emotion perception [8]. Note that other movements, such as that of the spine, lie along the other axes, and hence are not captured in the shown saliency map. By contrast, there is no intuitive pattern to the detected movements in the saliency map for the negative example. For the sake of completeness, we provide the saliency maps along the other axes in the demo video.
6 Limitations and Future Work
Our generative model is currently limited to generating gait sequences of a single person. The accuracy of the classification algorithm is also governed by the quality of the video and the pose extraction algorithm. There are many avenues for future work as well. We would like to extend the approach to deal with multi-person or crowd videos. Given the complexity of generating annotated real-world videos, we need better generators to improve the accuracy of the classification algorithm. Lastly, it would be useful to combine gait-based emotion classification with other modalities corresponding to facial expressions or speech to further improve the accuracy.
References
 [1] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman. Emotion recognition in speech using cross-modal transfer in the wild. arXiv:1808.05561, 2018.
 [2] J. Arunnehru and M. K. Geetha. Automatic human emotion recognition in surveillance video. In ITSPMS, pages 321–342. Springer, 2017.
 [3] M. Atcheson, V. Sethu, and J. Epps. Gaussian process regression for continuous emotion recognition with global temporal invariance. In IJCAIW, pages 34–44, 2017.
 [4] A. Bauer et al. The autonomous city explorer: Towards natural human-robot interaction in urban environments. IJSR, 1(2):127–140, 2009.
 [5] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. arXiv:1312.6203, 2013.
 [6] H. Cai, C. Bai, Y.-W. Tai, and C.-K. Tang. Deep video generation, prediction and completion of human action sequences. In ECCV, pages 366–382, 2018.
 [7] M. Chiu, J. Shu, and P. Hui. Emotion recognition through gait on mobile devices. In PerCom Workshops, pages 800–805. IEEE, 2018.
 [8] A. Crenn, R. A. Khan, A. Meyer, and S. Bouakaz. Body expression recognition from animated 3d skeleton. In IC3D, pages 1–7. IEEE, 2016.
 [9] A. Crenn, R. A. Khan, A. Meyer, and S. Bouakaz. Body expression recognition from animated 3d skeleton. In IC3D, pages 1–7. IEEE, 2016.
 [10] R. Dabral, A. Mundhada, U. Kusupati, S. Afaque, A. Sharma, and A. Jain. Learning 3d human pose from structure and motion. Computer Vision – ECCV 2018, 2018.
 [11] M. Daoudi, S. Berretti, P. Pala, Y. Delevoye, and A. Del Bimbo. Emotion recognition by body movement representation on the manifold of symmetric positive definite matrices. In ICIAP, pages 550–560. Springer, 2017.
 [12] J. Deng, X. Xu, Z. Zhang, S. Frühholz, and B. Schuller. Semi-supervised autoencoders for speech emotion recognition. IEEE/ACM ASLP, 26(1):31–43, 2018.
 [13] S. A. Denham, E. Workman, P. M. Cole, C. Weissbrod, K. T. Kendziora, and C. Zahn-Waxler. Prediction of externalizing behavior problems from early to middle childhood. Development and Psychopathology, 12(1):23–45, 2000.
 [14] P. Ekman. Facial expression and emotion. American psychologist, 48(4):384, 1993.
 [15] P. Ekman and W. V. Friesen. Head and body cues in the judgment of emotion: A reformulation. Perceptual and motor skills, 1967.
 [16] C. Fabian Benitez-Quiroz, R. Srinivasan, and A. M. Martinez. Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In CVPR, June 2016.
 [17] Y. Fan, X. Lu, D. Li, and Y. Liu. Video-based emotion recognition using cnn-rnn and c3d hybrid networks. In ICMI, pages 445–450. ACM, 2016.
 [18] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, pages 3468–3476, 2016.
 [19] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, pages 1933–1941, 2016.
 [20] J.-M. Fernández-Dols and M.-A. Ruiz-Belda. Expression of emotion versus expressions of emotions. In Everyday conceptions of emotion, pages 505–522. Springer, 1995.
 [21] R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran. Detect-and-track: Efficient pose estimation in videos. CoRR, abs/1712.09184, 2017.
 [22] I. Habibie, D. Holden, J. Schwarz, J. Yearsley, and T. Komura. A recurrent variational autoencoder for human motion synthesis. In BMVC, 2017.
 [23] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two timescale update rule converge to a local nash equilibrium. In NIPS, pages 6626–6637, 2017.
 [24] A. Jacob and P. Mythili. Prosodic feature based speech emotion recognition at segmental and supra segmental levels. In SPICES, pages 1–5. IEEE, 2015.
 [25] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. PAMI, 35(1):221–231, 2013.
 [26] M. Karg, K. Kuhnlenz, and M. Buss. Recognition of affect based on gait patterns. Cybernetics, 40(4):1050–1061, 2010.
 [27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
 [28] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv:1312.6114, 2013.
 [29] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv:1609.02907, 2016.
 [30] A. Kleinsmith and N. Bianchi-Berthouze. Affective body expression perception and recognition: A survey. IEEE Transactions on Affective Computing, 4(1):15–33, 2013.
 [31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 [32] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. Temporal convolutional networks for action segmentation and detection. In CVPR, pages 156–165, 2017.

 [33] A. Majumder, L. Behera, and V. K. Subramanian. Emotion recognition from geometric facial features using self-organizing map. Pattern Recognition, 47(3):1282–1293, 2014.
 [34] H. K. Meeren, C. C. van Heijnsbergen, and B. de Gelder. Rapid perceptual integration of facial expression and emotional body language. Proceedings of NAS, 102(45):16518–16523, 2005.
 [35] J. Michalak, N. F. Troje, J. Fischer, P. Vollmar, T. Heidenreich, and D. Schulte. Embodiment of sadness and depression—gait patterns associated with dysphoric mood. Psychosomatic Medicine, 71(5):580–587, 2009.
 [36] J. M. Montepare, S. B. Goldstein, and A. Clausen. The identification of emotions from gait information. Journal of Nonverbal Behavior, 11(1):33–42, 1987.
 [37] K. S. Quigley, K. A. Lindquist, and L. F. Barrett. Inducing and measuring emotion and affect: Tips, tricks, and secrets. Cambridge University Press, 2014.
 [38] T. Randhavane, A. Bera, K. Kapsaskis, U. Bhattacharya, K. Gray, and D. Manocha. Identifying emotions from walking using affective and deep features. arXiv:1906.11884, 2019.
 [39] M. Schurgin, J. Nelson, S. Iida, H. Ohira, J. Chiao, and S. Franconeri. Eye movements during emotion recognition in faces. Journal of vision, 14(13):14–14, 2014.
 [40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
 [41] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In NIPS, pages 3483–3491, 2015.
 [42] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv:1412.6806, 2014.
 [43] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In CVPR, pages 1526–1535, 2018.
 [44] S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. TPAMI, pages 376–380, 1991.
 [45] G. Venture, H. Kadone, T. Zhang, J. Grèzes, A. Berthoz, and H. Hicheur. Recognizing emotions conveyed by human gait. IJSR, 6(4):621–632, 2014.
 [46] L. Wang, T. Tan, W. Hu, and H. Ning. Automatic gait recognition based on statistical shape analysis. TIP, 12(9):1120–1131, 2003.
 [47] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In NeurIPS, 2018.
 [48] W. Wang, V. Enescu, and H. Sahli. Adaptive real-time emotion recognition from body movements. TiiS, 5(4):18, 2016.
 [49] S. Yan, Y. Xiong, and D. Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.
 [50] C. Yang, Z. Wang, X. Zhu, C. Huang, J. Shi, and D. Lin. Pose guided human video generation. In ECCV, pages 201–216, 2018.
 [51] H. Yang, U. Ciftci, and L. Yin. Facial expression recognition by de-expression residue learning. In CVPR, pages 2168–2177, 2018.

 [52] H. Yates, B. Chamberlain, G. Norman, and W. H. Hsu. Arousal detection for biometric data in built environments using machine learning. In IJCAIW, pages 58–72, 2017.
 [53] F. Zhang, T. Zhang, Q. Mao, and C. Xu. Joint pose and expression modeling for facial expression recognition. In CVPR, pages 3359–3368, 2018.
 [54] Z. Zhang and N. F. Troje. View-independent person identification from human gait. Neurocomputing, 69(13):250–256, 2005.
 [55] M. Zhao, F. Adib, and D. Katabi. Emotion recognition using wireless signals. In ICMCN, pages 95–108. ACM, 2016.