1. Introduction
Emotions are intrinsic characteristics of most living species, and are particularly overt in human behaviour (Darwin and Prodger, 1998; Panksepp, 2004; Izard, 2013). Intelligent systems must employ means to incorporate emotions for more natural interaction (Picard, 2000). This push for "emotional intelligence" has evolved into the field of affective computing, which by definition encompasses the creation of and interaction with machines that can sense, recognize, respond to, and influence emotions (Picard and Klein, 2002). Several models of emotion have been developed over the years, and these are considered the backbone of affective computing (Gratch et al., 2009; Marsella et al., 2010; Tracy and Randles, 2011; Hamann, 2012). Among these models, a popular choice is the Categorical Model, which describes six basic discrete emotions, namely happiness, anger, disgust, sadness, fear, and surprise (Ekman and Friesen, 1971). However, this model fails to capture relations between the discrete emotions, and there is a lack of consistency in the choice of these fundamental emotions (Ekman and Cordaro, 2011). As a result, Russell and Mehrabian (1977) developed the Dimensional Model, which suggests that each emotional state can be defined in terms of Valence (the pleasure of an emotion), Arousal (the energy of an emotion) and Dominance (the controlling nature of an emotion). The Dominance dimension is commonly ignored, since the valence-arousal (VA) dimensional model was shown to possess adequate reliability, convergent validity, and discriminant validity (Russell et al., 1989). This led to the conceptualization of the Circumplex Model, which represents affective states as a circle in a 2D bipolar VA space (Russell, 1980). The VA variables are typically considered independent (Feldman Barrett and Russell, 1998).
The existence of different models of emotion results in a range of possible annotation strategies for affective data (Fabian Benitez-Quiroz et al., 2016; Nicolaou et al., 2010, 2011; Lucey et al., 2010; Dhall et al., 2011). This poses two challenges: (i) building deep learning models on affective data, and (ii) drawing collective insights from multiple datasets with potentially different annotation formats (De Bruyne et al., 2019). In this paper, we present a novel algorithm for mapping annotations of the Categorical Model to those of the Dimensional Model through annotation transfer across affective facial image datasets.

The subsequent task following annotation mapping is to obtain meaningful data representations. With the increased use of deep neural networks and generative models, there have been significant advances in emotion modelling and affective computing (Han et al., 2019; Rouast et al., 2019; Jolly et al., 2019). Variational Autoencoders (VAEs) (Kingma and Welling, 2013) are known to yield disentangled latent representations and to generate new data samples (Hu et al., 2018; Shukla et al., 2019; Higgins et al., 2017). They have been used extensively in affective computing to represent text, audio, image and electroencephalography (EEG) data (Wu et al., 2019; Latif et al., 2017). Applying VAEs to affective facial images to obtain disentangled image representations can (i) provide high-quality feature representations for downstream tasks (Bengio et al., 2013; Peters et al., 2017), and (ii) serve applications like facial editing and data augmentation (Lindt et al., 2019). In our study, we obtain interpretable features by aligning the latent space of a VAE with the VA space. This enables improved affect classification and regression, as demonstrated on two benchmark affective image datasets using a series of evaluation tasks.

Our major contributions are as follows:

- an annotation transfer algorithm for label transfer between the Categorical and Dimensional models of emotion
- a regularised VAE model, "LeVAsa" (Latent Encodings for Valence-Arousal Structure Alignment), that yields an interpretable latent space with an implicit structure aligned with the VA space
The rest of the paper is organized as follows. Section 2 presents our annotation transfer algorithm, the VAE model architectures and the datasets used in our experiments. Section 3 outlines the evaluation tasks conducted along with the obtained results. Section 4 concludes the paper and motivates future work.
2. Methods
In this section, we present our annotation transfer algorithm and describe our VAE model architectures. Our code and models are publicly available at https://github.com/vishaal27/LeVAsa.
2.1. Annotation Transfer Algorithm
For the task of annotation transfer between the Categorical and Dimensional emotion models, we use an external reference dataset containing both discrete categorical emotion labels and continuous valence and arousal values, each bounded by fixed lower and upper limits. Each reference data sample thus has an emotion label, a valence value and an arousal value. The reference dataset serves as the standard from which continuous or discrete VA values can be sampled for data points in a working dataset with only emotion labels, or, conversely, from which the most likely emotion labels can be obtained for data points in a working dataset with only VA tuples (Figure 2). Algorithm 1 details the procedure.
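As a minimal illustration of this idea, the sketch below fits a per-emotion Gaussian in VA space from a reference set and then either samples VA values for an emotion label or assigns the most likely emotion to a VA tuple. All function and variable names here are illustrative, not the paper's actual implementation, and a Mahalanobis nearest-mean rule stands in for a full likelihood computation.

```python
import numpy as np

def fit_va_gaussians(emotions, valence, arousal):
    """Fit a (mean, covariance) pair per emotion label from reference annotations."""
    params = {}
    emotions = np.array(emotions)
    valence, arousal = np.array(valence), np.array(arousal)
    for e in set(emotions):
        mask = emotions == e
        pts = np.stack([valence[mask], arousal[mask]], axis=1)
        # Small diagonal term keeps the covariance invertible
        params[e] = (pts.mean(axis=0), np.cov(pts.T) + 1e-6 * np.eye(2))
    return params

def sample_va(params, emotion, rng):
    """Sample a (valence, arousal) pair for a data point with only an emotion label."""
    mean, cov = params[emotion]
    return rng.multivariate_normal(mean, cov)

def most_likely_emotion(params, v, a):
    """Assign the most likely emotion to a data point with only a VA tuple."""
    def mahalanobis_sq(e):
        mean, cov = params[e]
        diff = np.array([v, a]) - mean
        return diff @ np.linalg.inv(cov) @ diff
    return min(params, key=mahalanobis_sq)
```

In this form the fitted Gaussians play the role of the per-emotion ellipses shown in Figure 2.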
2.2. VAE model architectures
Given a raw distribution of affective face images, we train a generative model with an interpretable, implicitly structured latent space. We employ variational autoencoder based models because of their simple training protocols and structured inductive priors. We compare two models, a Vanilla VAE and LeVAsa; the latent space of both models is constructed to comprise three chunks. Figure 1 depicts our model architectures.
For the Vanilla VAE, no explicit alignment is imposed on the latent space, whereas for LeVAsa we take inspiration from recent work (Jha et al., 2018; Bhagat et al., 2020) and model the latent space as three chunks:

- $z_v$ – a subspace of valence attributes that learn to encode the valence features of image samples
- $z_a$ – a subspace of arousal attributes that learn to encode the arousal features of image samples
- $z_g$ – a subspace of other miscellaneous generative attributes that are required for high-fidelity reconstruction of the input data distribution.
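The three-way split above can be sketched as a simple slicing of the encoder output; the chunk dimensions below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Hypothetical chunk sizes for a 64-dimensional latent vector
DIM_V, DIM_A, DIM_G = 8, 8, 48  # valence, arousal, miscellaneous generative

def split_latent(z):
    """Split a latent vector into valence, arousal and generative chunks."""
    z_v = z[:DIM_V]
    z_a = z[DIM_V:DIM_V + DIM_A]
    z_g = z[DIM_V + DIM_A:]
    return z_v, z_a, z_g

z = np.arange(DIM_V + DIM_A + DIM_G, dtype=float)
z_v, z_a, z_g = split_latent(z)
```

The valence and arousal chunks are then the only parts of the code that receive VA supervision; the generative chunk remains free for reconstruction.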
Given a dataset of affective face images $X$, our VAE backbone consists of an encoder $q_\phi(z \mid x)$ and a decoder $p_\theta(x \mid z)$.
We train the Vanilla VAE with a simple reconstruction loss along with a modified Kullback-Leibler (KL) loss (Eq. 1), inducing a standard normal prior on all three latent chunks $z_v$, $z_a$ and $z_g$:

(1)   $\mathcal{L}_{VAE} = \mathcal{L}_{recon} + \lambda_1 \, D_{KL}\left(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\right)$

where $\mathcal{L}_{recon} = \| x - \hat{x} \|^2$ is the reconstruction loss.
We employ the same backbone Vanilla VAE architecture for the LeVAsa model with two major modifications:

Projection Heads: We use two non-linear projection heads, $f_v$ and $f_a$, which map the encoded valence and arousal representations $z_v$ and $z_a$ to the valence and arousal label spaces (giving label representations $f_v(z_v)$ and $f_a(z_a)$), where the VA-regularization loss is applied.

VA-regularization loss: To impose an explicit alignment of the valence and arousal chunks with the VA ground-truth factors, we introduce a VA-regularization loss:

(2)   $\mathcal{L}_{VA\text{-}reg} = d\left(f_v(z_v), v\right) + d\left(f_a(z_a), a\right)$

where $f_v$ and $f_a$ are the projection heads on the valence and arousal chunks $z_v$ and $z_a$, $v$ and $a$ are the ground-truth valence and arousal values, and $d$ takes the form of MSE for continuous and BCE for discrete annotation types.
The overall optimization objective for the LeVAsa model is:

(3)   $\mathcal{L} = \mathcal{L}_{recon} + \lambda_1 \, \mathcal{L}_{KL} + \lambda_2 \, \mathcal{L}_{VA\text{-}reg}$

where $\lambda_1$ and $\lambda_2$ are hyperparameters weighting the KL and VA-regularization terms.
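A minimal NumPy sketch of this objective for the continuous-annotation case is given below. It assumes a diagonal Gaussian posterior (so the KL term has its standard closed form) and MSE for both reconstruction and VA-regularization; weights and names are illustrative.

```python
import numpy as np

def kl_standard_normal(mu, logvar):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def mse(pred, target):
    return np.mean((pred - target) ** 2)

def levasa_loss(x, x_recon, mu, logvar, v_pred, a_pred, v_true, a_true,
                lam1=1.0, lam2=1.0):
    """Sketch of the overall objective: reconstruction + KL + VA-regularization."""
    recon = mse(x_recon, x)
    kl = kl_standard_normal(mu, logvar)
    va_reg = mse(v_pred, v_true) + mse(a_pred, a_true)  # MSE for continuous labels
    return recon + lam1 * kl + lam2 * va_reg
```

For discrete annotations, the two MSE terms inside `va_reg` would be replaced by binary cross-entropy, per Eq. 2.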
2.3. Datasets
We use the following datasets in our experiments.
Annotation Transfer: AffectNet

AffectNet (Mollahosseini et al., 2017) is the largest facial expression dataset, with over 420,000 annotated images, containing both continuous VA annotations in [-1, 1] and discrete emotion labels in {Neutral, Anger, Happiness, Sadness, Surprise, Fear, Disgust, Contempt, None, Uncertain, Non-face}. The dataset also incorporates wide diversity in gender, age and ethnicity, making it an ideal choice for the reference dataset in the annotation transfer algorithm (Algorithm 1). The generated ellipses are shown in Figure 2.
Model Training: IMFDB, AFEW

IMFDB (Setty et al., 2013) contains around 34,000 annotated zoomed-in facial images of 100 Indian actors, with only emotion labels {Neutral, Anger, Happiness, Sadness, Surprise, Fear, Disgust} and no VA supervision. Continuous and discrete VA supervision for IMFDB is obtained via annotation transfer from AffectNet, which is particularly well suited given the similar nature of the images in the two datasets.

AFEW (Dhall et al., 2011), on the other hand, contains around 24,000 annotated images from videos of real-world scenes featuring approximately 600 actors, with only discrete VA values in {-10, -9, …, 9, 10} and no discrete emotion labels.
The different natures of the IMFDB and AFEW datasets allow us to analyse and compare model performance based on several factors, including image type (zoomed-in faces vs video scenes) and annotation type (discrete vs continuous VA supervision).
3. Experiments
We perform our analyses and evaluations through a series of qualitative and quantitative experiments. This enables comparisons based on three aspects: (i) architecture (Vanilla VAE vs LeVAsa), (ii) dataset (IMFDB vs AFEW), and (iii) nature of annotations (Continuous VA vs Discrete VA). Altogether, we train five models: (i) Vanilla VAE on IMFDB, (ii) LeVAsa on IMFDB with continuous VA annotations, (iii) LeVAsa on IMFDB with discrete VA annotations, (iv) Vanilla VAE on AFEW, and (v) LeVAsa on AFEW with discrete VA annotations.
3.1. Latent-Circumplex Alignment
We measure the alignment of LeVAsa's latent space with the VA ground truths using normalized Euclidean and Manhattan distance metrics for continuous annotations, and a cross-entropy measure for discrete annotations. This helps quantify the degree of latent-circumplex alignment. For the Vanilla VAE, we determine the valence and arousal chunks heuristically, by choosing the two latent chunks that align best with the corresponding valence and arousal ground truths. Further, we reduce the dimensionality of the valence and arousal latent chunks and plot them alongside the ground truth to replicate the circumplex representation.

LeVAsa outperforms the Vanilla VAE for both continuous and discrete annotations (Table 1), clearly exhibiting its superior latent-circumplex alignment. For discrete annotations, the difference between the cross-entropy measures of the Vanilla VAE and LeVAsa is greater on AFEW than on IMFDB, which could be attributed to the different image types in the two datasets. The circumplex plots (Figure 3) for LeVAsa reveal reduced variance and increased alignment with the true labels, validating the quantitative results in Table 1.

Table 1. Latent-circumplex alignment: MSE and MAE for continuous annotations on IMFDB (top), and cross-entropy measures for discrete annotations (bottom); lower is better.

IMFDB        | Valence     | Arousal     | Combined
             | MSE  | MAE  | MSE  | MAE  | MSE  | MAE
Vanilla VAE  | 1.83 | 0.29 | 1.49 | 0.26 | 3.31 | 0.55
LeVAsa       | 0.14 | 0.14 | 0.06 | 0.09 | 0.2  | 0.23

Model        | IMFDB | AFEW
Vanilla VAE  | 8.9   | 8.9
LeVAsa       | 6.63  | 2.54
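The continuous-annotation alignment metrics can be sketched as follows; the exact normalization used in the paper is not specified here, so dividing by the number of samples (and its square root for the Euclidean case) is an assumption.

```python
import numpy as np

def normalized_euclidean(pred, truth):
    """Euclidean distance between predicted and ground-truth VA values,
    normalized by the square root of the number of samples."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.linalg.norm(pred - truth) / np.sqrt(len(pred))

def normalized_manhattan(pred, truth):
    """Manhattan distance normalized by the number of samples (i.e. MAE)."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return np.sum(np.abs(pred - truth)) / len(pred)
```

Under these definitions the normalized Manhattan distance coincides with the MAE reported in Table 1.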
To gain further insights, we assess the regressive power of the valence and arousal latent chunks through their ability to predict the corresponding VA ground truths, using Multi-Layer Perceptron (MLP) regression. This analysis applies only to continuous annotations, hence it was conducted only on the LeVAsa and Vanilla VAE models trained on the IMFDB dataset with continuous VA values.
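A tiny NumPy stand-in for this regression probe is sketched below: a single-hidden-layer MLP trained with gradient descent on half-MSE. The architecture and hyperparameters are illustrative assumptions, not those used in the paper.

```python
import numpy as np

class TinyMLPRegressor:
    """One-hidden-layer MLP regressor trained by full-batch gradient descent."""

    def __init__(self, in_dim, hidden=16, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.5, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.5, (hidden, 1))
        self.b2 = np.zeros(1)
        self.lr = lr

    def forward(self, X):
        self.h = np.tanh(X @ self.W1 + self.b1)  # hidden activations
        return self.h @ self.W2 + self.b2

    def fit(self, X, y, epochs=300):
        y = y.reshape(-1, 1)
        for _ in range(epochs):
            pred = self.forward(X)
            err = pred - y                        # gradient of 0.5 * MSE
            gW2 = self.h.T @ err / len(X)
            gb2 = err.mean(axis=0)
            dh = (err @ self.W2.T) * (1 - self.h ** 2)  # tanh' backprop
            gW1 = X.T @ dh / len(X)
            gb1 = dh.mean(axis=0)
            self.W2 -= self.lr * gW2; self.b2 -= self.lr * gb2
            self.W1 -= self.lr * gW1; self.b1 -= self.lr * gb1
        return self

    def mse(self, X, y):
        return float(np.mean((self.forward(X).ravel() - y) ** 2))
```

In the paper's setting, `X` would be a latent chunk ($z_v$ or $z_a$) and `y` the corresponding valence or arousal ground truth.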
Table 2. Valence and arousal regression performance per model: MSE, MAE, explained variance (EV) and R².
The MSE and MAE values computed for LeVAsa are lower by 2.25% and 1.42% respectively, compared to the Vanilla VAE for valence, and lower by 19.13% and 7.18% for arousal (Table 2). Furthermore, the goodness-of-fit metrics (explained variance and R²) show better performance for LeVAsa. These results further strengthen our hypothesis.
3.2. Categorical Emotion Predictive Power
We predict the discrete emotion labels using different combinations of latent representations obtained from the Vanilla VAE and LeVAsa (Table 3). Due to the lack of discrete emotion labels in the AFEW dataset, it was excluded from this analysis. We randomized the data splits across the continuous and discrete experiments to ensure an unbiased setup. Model performance is evaluated using classification accuracy, and we utilize a simple one-layer MLP so that accuracy is a direct measure of representation quality rather than of classifier complexity.
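The one-layer probe can be sketched as a single softmax layer trained with cross-entropy, as below. Interpreting "one-layer MLP" as a linear softmax classifier is an assumption, as are all names and hyperparameters.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_probe(X, y, n_classes, lr=0.5, epochs=300):
    """Train a single softmax layer on latent-chunk features X with labels y."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                  # one-hot labels
    for _ in range(epochs):
        P = softmax(X @ W + b)
        G = (P - Y) / len(X)                  # cross-entropy gradient
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def accuracy(W, b, X, y):
    return float(np.mean(np.argmax(X @ W + b, axis=1) == y))
```

Because the probe has no hidden capacity, differences in its accuracy across chunk combinations reflect the informativeness of the representations themselves.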
Table 3. Classification accuracy per latent-chunk combination (⊕ denotes vector concatenation; V = Vanilla VAE, L = LeVAsa).

Annotation Type | Chunk Combination | V    | L    | Difference = L − V (in %)
Continuous      | —                 | 0.29 | 0.36 | 7
Continuous      | —                 | 0.32 | 0.35 | 3
Continuous      | —                 | 0.32 | 0.36 | 4
Continuous      | —                 | 0.32 | 0.38 | 6
Continuous      | —                 | 0.29 | 0.33 | 4
Discrete        | —                 | 0.30 | 0.35 | 5
Discrete        | —                 | 0.27 | 0.30 | 3
Discrete        | —                 | 0.24 | 0.30 | 6
Discrete        | —                 | 0.25 | 0.33 | 8
Discrete        | —                 | 0.26 | 0.30 | 4
LeVAsa has significantly better predictive power than the Vanilla VAE. Moreover, for LeVAsa, the VA chunks alone are more informative for emotion prediction than all chunks together. The improvement in classification accuracy from employing LeVAsa in place of the Vanilla VAE can also be compared under the continuous and discrete settings: representations from the LeVAsa model trained with discrete annotations and the BCE loss (Eq. 2) prove better at classifying emotion labels, owing to the discrete nature of the emotion labels, which correlates well with the model representations.
3.3. Reconstruction Quality
VAE models are prone to posterior collapse and can produce unreliable reconstructions (He et al., 2019; Rybkin et al., 2020). Thus, along with analyses of the latent representations, we also study the quality of the reconstructed faces (Figure 4).
It is observed that the quality of the reconstructed faces is slightly compromised for LeVAsa compared to the Vanilla VAE. This can be attributed to the slightly higher variance of the learnt LeVAsa decoding distribution (Higgins et al., 2017; Alemi et al., 2018). By Shannon's rate-distortion theory (Berger, 2003), there is a trade-off between distortion (reconstruction quality) and rate (representation quality). Since we impose an explicit compression bottleneck on the latent representations, a slight loss in reconstruction quality is expected in exchange for better interpretability of the latent representations.
4. Conclusion
In this paper, we have developed an annotation-transfer algorithm for mapping between Categorical and Dimensional emotion model annotations. Using the transferred annotations, we generated interpretable image features with a VA-regularized VAE model called LeVAsa. We conducted a series of evaluation tasks to verify and validate our experiments and to compare performance based on three factors: (i) architecture (Vanilla VAE vs LeVAsa), (ii) dataset (IMFDB vs AFEW), and (iii) nature of annotations (continuous VA vs discrete VA). The results show that LeVAsa obtains robust and interpretable representations, enabling improved downstream affective task performance. In the future, we hope to (i) extend the annotation-transfer algorithm to action-unit annotations, and (ii) perform latent traversals for data augmentation and facial editing.
Acknowledgements
This work was supported by the Infosys Center for Artificial Intelligence at IIIT Delhi, India.
References

Alemi et al. (2018) Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. 2018. Fixing a broken ELBO. In International Conference on Machine Learning. 159–168.
Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1798–1828.
Berger (2003) Toby Berger. 2003. Rate-distortion theory. Wiley Encyclopedia of Telecommunications (2003).
Bhagat et al. (2020) Sarthak Bhagat, Vishaal Udandarao, and Shagun Uppal. 2020. DisCont: Self-Supervised Visual Attribute Disentanglement using Context Vectors. arXiv preprint arXiv:2006.05895 (2020).
 Darwin and Prodger (1998) Charles Darwin and Phillip Prodger. 1998. The expression of the emotions in man and animals. Oxford University Press.
 De Bruyne et al. (2019) Luna De Bruyne, Pepa Atanasova, and Isabelle Augenstein. 2019. Joint Emotion Label Space Modelling for Affect Lexica. arXiv preprint arXiv:1911.08782 (2019).
 Dhall et al. (2011) Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. 2011. Acted facial expressions in the wild database. Australian National University, Canberra, Australia, Technical Report TRCS11 2 (2011), 1–13.
 Ekman and Cordaro (2011) Paul Ekman and Daniel Cordaro. 2011. What is meant by calling emotions basic. Emotion review 3, 4 (2011), 364–370.
 Ekman and Friesen (1971) Paul Ekman and Wallace V Friesen. 1971. Constants across cultures in the face and emotion. Journal of personality and social psychology 17, 2 (1971), 124–129.

Fabian Benitez-Quiroz et al. (2016) C Fabian Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M Martinez. 2016. EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5562–5570.
Feldman Barrett and Russell (1998) Lisa Feldman Barrett and James A Russell. 1998. Independence and bipolarity in the structure of current affect. Journal of Personality and Social Psychology 74, 4 (1998), 967–984.
 Gratch et al. (2009) Jonathan Gratch, Stacy Marsella, Ning Wang, and Brooke Stankovic. 2009. Assessing the validity of appraisalbased models of emotion. In 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops. IEEE, 1–8.
 Hamann (2012) Stephan Hamann. 2012. Mapping discrete and dimensional emotions onto the brain: controversies and consensus. Trends in cognitive sciences 16, 9 (2012), 458–466.

Han et al. (2019) Jing Han, Zixing Zhang, and Björn Schuller. 2019. Adversarial training in affective computing and sentiment analysis: Recent advances and perspectives. IEEE Computational Intelligence Magazine 14, 2 (2019), 68–81.
He et al. (2019) Junxian He, Daniel Spokoyny, Graham Neubig, and Taylor Berg-Kirkpatrick. 2019. Lagging inference networks and posterior collapse in variational autoencoders. arXiv preprint arXiv:1901.05534 (2019).
Higgins et al. (2017) Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In ICLR. 1–22.
 Hu et al. (2018) Qiyang Hu, Attila Szabó, Tiziano Portenier, Paolo Favaro, and Matthias Zwicker. 2018. Disentangling factors of variation by mixing them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3399–3407.
 Izard (2013) Carroll E Izard. 2013. Human emotions. Springer Science & Business Media.
Jha et al. (2018) Ananya Harsh Jha, Saket Anand, Maneesh Singh, and VSR Veeravasarapu. 2018. Disentangling factors of variation with cycle-consistent variational autoencoders. In European Conference on Computer Vision. Springer, 829–845.
 Jolly et al. (2019) Baani Leen Kaur Jolly, Palash Aggrawal, Surabhi S Nath, Viresh Gupta, Manraj Singh Grover, and Rajiv Ratn Shah. 2019. Universal EEG Encoder for Learning Diverse Intelligent Tasks. In 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM). IEEE, 213–218.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
 Latif et al. (2017) Siddique Latif, Rajib Rana, Junaid Qadir, and Julien Epps. 2017. Variational autoencoders for learning latent representations of speech emotion: A preliminary study. arXiv preprint arXiv:1712.08708 (2017).
 Lindt et al. (2019) Alexandra Lindt, Pablo Barros, Henrique Siqueira, and Stefan Wermter. 2019. Facial expression editing with continuous emotion labels. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019). IEEE, 1–8.
Lucey et al. (2010) Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. 2010. The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 94–101.
 Marsella et al. (2010) Stacy Marsella, Jonathan Gratch, Paolo Petta, et al. 2010. Computational models of emotion. A Blueprint for Affective ComputingA sourcebook and manual 11, 1 (2010), 21–46.
 Mollahosseini et al. (2017) Ali Mollahosseini, Behzad Hasani, and Mohammad H Mahoor. 2017. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10, 1 (2017), 18–31.
 Nicolaou et al. (2010) Mihalis A Nicolaou, Hatice Gunes, and Maja Pantic. 2010. Audiovisual classification and fusion of spontaneous affective data in likelihood space. In 2010 20th International Conference on Pattern Recognition. IEEE, 3695–3699.
 Nicolaou et al. (2011) Mihalis A Nicolaou, Hatice Gunes, and Maja Pantic. 2011. Continuous prediction of spontaneous affect from multiple cues and modalities in valencearousal space. IEEE Transactions on Affective Computing 2, 2 (2011), 92–105.
 Panksepp (2004) Jaak Panksepp. 2004. Affective neuroscience: The foundations of human and animal emotions. Oxford University Press.
 Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. 2017. Elements of causal inference. The MIT Press.
 Picard (2000) Rosalind W Picard. 2000. Affective computing. The MIT Press.
 Picard and Klein (2002) Rosalind W Picard and Jonathan Klein. 2002. Computers that recognise and respond to user emotion: theoretical and practical implications. Interacting with computers 14, 2 (2002), 141–169.
 Rouast et al. (2019) Philipp V Rouast, Marc Adam, and Raymond Chiong. 2019. Deep learning for human affect recognition: insights and new developments. IEEE Transactions on Affective Computing (2019), 1–20.
 Russell (1980) James A Russell. 1980. A circumplex model of affect. Journal of personality and social psychology 39, 6 (1980), 1161–1178.
 Russell and Mehrabian (1977) James A Russell and Albert Mehrabian. 1977. Evidence for a threefactor theory of emotions. Journal of research in Personality 11, 3 (1977), 273–294.
 Russell et al. (1989) James A Russell, Anna Weiss, and Gerald A Mendelsohn. 1989. Affect grid: a singleitem scale of pleasure and arousal. Journal of personality and social psychology 57, 3 (1989), 493–502.
 Rybkin et al. (2020) Oleh Rybkin, Kostas Daniilidis, and Sergey Levine. 2020. Simple and Effective VAE Training with Calibrated Decoders. arXiv preprint arXiv:2006.13202 (2020).

Setty et al. (2013) Shankar Setty, Moula Husain, Parisa Beham, Jyothi Gudavalli, Menaka Kandasamy, Radhesyam Vaddi, Vidyagouri Hemadri, JC Karure, Raja Raju, B Rajan, et al. 2013. Indian movie face database: A benchmark for face recognition under wide variations. In 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG). IEEE, 1–5.
Shukla et al. (2019) Ankita Shukla, Sarthak Bhagat, Shagun Uppal, Saket Anand, and Pavan K. Turaga. 2019. Product of Orthogonal Spheres Parameterization for Disentangled Representation Learning. In BMVC. 1–13.
Tracy and Randles (2011) Jessica L Tracy and Daniel Randles. 2011. Four models of basic emotions: a review of Ekman and Cordaro, Izard, Levenson, and Panksepp and Watt. Emotion Review 3, 4 (2011), 397–405.
Wu et al. (2019) Chuhan Wu, Fangzhao Wu, Sixing Wu, Zhigang Yuan, Junxin Liu, and Yongfeng Huang. 2019. Semi-supervised dimensional sentiment analysis with variational autoencoder. Knowledge-Based Systems 165 (2019), 30–39.