Can vocal emotions be emulated? This question has led to long-standing debates in the speech community regarding natural versus acted emotions. For any speech-based emotion research, an important factor is the nature of the speech samples, or vocal stimuli, and whether those samples are representative of natural emotions. Natural emotions can best be defined as emotions that are spontaneous and involuntary. Acted emotions, on the other hand, are prompted and voluntary. Because acted emotions are volitional, researchers argue that the physiological and psychological responses that natural emotions induce are absent from acted emotions [8, 25]. Nevertheless, most emotion perception research uses acted emotions as convenient proxies for natural emotions. While many past studies have presented perception tests for natural versus acted emotions, with mixed conclusions [25, 18, 1], there is no systematic framework that operates at scale for studying the differences and similarities between the two classes. In this paper, we present a novel framework to study natural and acted vocal emotions in the context of their phonetic bases.
For the purposes of our study, it is useful to frame the problem in terms of our proposed non-actor, actor, and observer (NAO) framework, shown in figure 1. The communication of vocal emotions rests on an encoding process and a decoding process. We call the encoding of emotions expression and the decoding of emotions perception. The subject who decodes, and hence perceives, the emotions is an observer, whose only cue is the vocal stimulus. From an evolutionary perspective, an observer is primarily cognizant of natural emotions and their cues. The encoding of emotions, on the other hand, can differ between the natural and acted classes. Natural emotions are expressions that are, for the most part, involuntary, and that therefore carry physiological and psychological responses as concomitants. These emotions may or may not have an intended audience. Acted emotions, by contrast, are volitional and prompted, and are intended for an audience. The job of the actor is to emulate the natural emotions of non-actors so well that the observer cannot differentiate between natural and acted stimuli and is, in effect, deceived. In this sense, our NAO model is a test for natural versus acted emotions. The question is whether the observer can tell natural emotions apart from acted emotions: if so, the test passes; otherwise, it fails. If the test fails, natural emotions can be emulated, and acted emotions can serve as proxies for natural emotions. If the test passes, however, that points to a dichotomy between acted and natural emotions, and hence to low validity and value in using acted stimuli.
In our study, we investigate the phonetic bases of vocal emotions. We define phonetic bases as the choice of phonemes and their manner of delivery. These phonetic bases result from physiological and psychological changes in the speaker, such as changes in heart rate, breathing rate, muscle tension, and mood. The physiological changes manifest in the voice by changing the spectro-temporal structure of individual sounds [7, 17]. For example, [24, 12] argue that vowels and consonants produced in fear are often more precisely articulated than they are in neutral situations. The physiological changes, along with the psychological factors, also shape the choice of phonemes (i.e., the lexical content). As an example, words of an aggressive nature are more likely to be used by an individual in an aggressive mood. Our hypothesis is that these phonetic bases form the cues by which the observer recognizes emotions. Therefore, these phonetic bases should characterize the natural and acted classes of emotions, and should work as polygraphs. We use a self-attention-based neural network framework to model the speech emotion classification task for natural and acted stimuli, and use the most attended phonemes and their distributions as phonetic bases.
2 Background Work
Some work has been done to understand the differences between acted and natural emotional speech, and the results of these studies are somewhat contradictory. Some studies that analyze acted and natural emotional speech with the help of human listeners have concluded that listeners are not able to distinguish between the two categories. The problem has also been studied in the domain of false expression, where the truthfulness of the expressed emotion is examined; this work likewise concludes that humans are unlikely to differentiate between the two. On the other hand, other studies that also use human listeners conclude that a share of listeners were able to differentiate between the two kinds of emotions with only audio cues, and that even more could do so when provided with audio-visual cues. Other work concludes that acted and natural speech differ innately in voice quality. Acted speech is considered to be delivered in a more emotionally intense fashion, but it also affects vocal expression in a more general way, without the nuances of the changes caused by natural emotion. A further study concludes that the two differ in the acoustic and prosodic properties of the speech.
However, these studies only analyze the difference between acted and natural emotional speech, without a model and without the context of emotion classification. In this study, the presence of both adds reliability to the analysis and makes it applicable to languages other than English.
3 Designing the Model to capture phonetic correlates of emotion
The objective is to devise a system that captures the acoustic and lexical correlates of emotion. The design of the model needs to take into consideration both the lexical and acoustic aspects of the input, as well as the relationship between the two. The linguistic content should guide the model toward the important parts of the acoustic signal; this follows from our original hypothesis that the linguistic content also indicates the presence of a specific emotion. To create a vector representation of the linguistic part of the input, we pass it through an LSTM, which captures the contextual information of the linguistic content. This forms a context-sensitive lexical vector.
To capture the relationship between the two modalities, we use an attention-based mechanism. This enables the context-sensitive lexical vector to attend to some parts of the audio, forming importance weights. When applied back to the input audio, the weights make the output high in the parts that the lexical vector points to and low elsewhere. A feature vector is created from this weighted output and then passed to the emotion classification layer. Training maximizes the classification accuracy and, in doing so, teaches the model to create feature vectors that are discriminative across the emotion classes. This, in turn, optimizes the attention mechanism, thereby allowing the lexical vector to focus only on those parts of the audio that lead to the highest classification accuracy. This forms the basis of the model we use, as shown in figure 2.
We propose a neural model architecture for the emotion classification task, summarized in figure 2. From the speech utterance, mel-spectrograms are calculated and fed into the first part of the network (shown in pink). The mel-spectrograms are passed through three convolutional layers, each followed by an ELU activation, except for the last. The output of the last convolutional layer is three-dimensional, with one axis each for the number of channels, the frequency dimension, and the time dimension.
The speech is passed through an ASR system to obtain the transcripts (shown in blue). We used the Google API to extract the transcription. We use BERT contextualized word embeddings to represent each word in the sentence. These embeddings are passed through an LSTM layer, whose last hidden state has dimension 64. For the attention layer, the keys are the three-dimensional output of the convolutional layers. The query is the last hidden state of the LSTM passed through a linear projection. The attended output has the same dimensions as the output of the convolutional layers.
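The interaction between the lexical query and the acoustic keys can be illustrated with a minimal sketch. It assumes, purely for illustration, that the convolutional output has been flattened to one feature vector per time frame and that the score is a dot product followed by a softmax over time; the scoring function is not specified in the text, and the names `attend`, `frames`, and `query` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(frames, query):
    """Weight each acoustic frame by its similarity to the lexical query.

    frames: list of T feature vectors (assumed: conv output flattened per frame)
    query:  projected last hidden state of the LSTM, same length as one frame
    Returns (weights, attended); attended has the same shape as frames.
    """
    # Assumed scoring function: dot product between each frame and the query.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    attended = [[w * value for value in frame] for w, frame in zip(weights, frames)]
    return weights, attended

# Toy input: three frames with two features each (values are made up).
frames = [[0.1, 0.2], [2.0, 1.0], [0.3, 0.1]]
query = [1.0, 0.5]
weights, attended = attend(frames, query)
most_attended_frame = max(range(len(weights)), key=weights.__getitem__)
```

In this toy input the second frame is the most similar to the query, so it receives the largest weight; the attended output keeps the shape of the input, as in the description above.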
The hyperparameters are as follows. The model is optimized using Adam with a batch size of 16. The network is trained with the cross-entropy loss, with weights for the individual labels to account for the class imbalance in the dataset used to evaluate the task. The model is implemented in PyTorch. The ASR-based transcript of the utterance is segmented into phonemes using an HMM-based phoneme segmenter. Once the model is trained, the attended outputs are inspected. The idea is that the attended output, used in the prediction of the emotion class, indicates which frames are most important in deciding the class label. With the output of the phoneme segmenter, we can learn which phonemes are present in the high-attention regions of the audio.
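The class-weighted cross-entropy loss can be sketched in plain Python as follows. How the label weights are chosen is not stated in the text; the inverse-frequency scheme here is an assumption, and the function names are hypothetical. The weighted mean follows the convention used by the `weight` argument of `torch.nn.CrossEntropyLoss` under its default mean reduction.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * count); an assumed scheme."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return {label: total / (n_classes * count) for label, count in counts.items()}

def weighted_cross_entropy(logits_batch, labels, class_order, weights):
    """Weighted mean of the per-utterance cross-entropy losses.

    logits_batch: one list of logits per utterance, ordered as class_order
    labels:       gold emotion label per utterance
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for logits, label in zip(logits_batch, labels):
        peak = max(logits)
        # Log of the softmax normalizer, computed stably.
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        log_prob = logits[class_order.index(label)] - log_norm
        weighted_sum += weights[label] * -log_prob
        weight_total += weights[label]
    return weighted_sum / weight_total

classes = ["angry", "happy", "neutral", "sad"]
# Made-up label counts that mimic an imbalanced training set.
train_labels = ["neutral"] * 6 + ["happy"] * 3 + ["angry"] * 2 + ["sad"] * 1
weights = inverse_frequency_weights(train_labels)
loss = weighted_cross_entropy([[0.0] * 4, [0.0] * 4], ["sad", "neutral"], classes, weights)
```

Rare classes receive larger weights, so an error on a rare emotion contributes more to the loss than an error on a frequent one.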
We use datasets from two sources: acted and natural. We work with only four emotions: angry, happy, sad, and neutral.
4.3 Acted Data
The acted dataset used in the experiment is IEMOCAP. It consists of five sessions, each recorded with a pair of actors, for ten speakers in total. The conversations are divided into labeled sentences. We implement a 10-fold cross-validation training setup. In each fold, data from 9 speakers is used for training the model and data from 1 speaker is used for testing. This speaker segregation between training and testing is important, because overlapping speakers might enable the model to learn speaker-dependent emotion cues, defeating the purpose of the task at hand. The data contains angry, happy, neutral, and sad utterances.
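The speaker-independent folds described above can be sketched as a leave-one-speaker-out split. This is a minimal illustration, not the authors' code: the tuple layout, the speaker identifiers, and the function name are assumptions.

```python
def leave_one_speaker_out(utterances):
    """Yield (held_out_speaker, train, test) with no speaker overlap.

    utterances: list of (speaker_id, utterance_id, emotion) tuples
    """
    speakers = sorted({speaker for speaker, _, _ in utterances})
    for held_out in speakers:
        train = [u for u in utterances if u[0] != held_out]
        test = [u for u in utterances if u[0] == held_out]
        yield held_out, train, test

# Toy corpus: ten hypothetical speakers with two utterances each.
corpus = [
    (f"spk{s:02d}", f"utt{s:02d}_{i}", "neutral")
    for s in range(10)
    for i in range(2)
]
folds = list(leave_one_speaker_out(corpus))
```

Each of the ten folds tests on exactly one speaker and trains on the other nine, so no speaker appears on both sides of a split.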
4.4 Natural Data
The dataset we use in this experiment is called CMU-SER data. It is collected from NPR podcasts and from television programs hosted by the Internet Archive. The dataset is annotated using Amazon Mechanical Turk. The data is divided into a training set and a test set, and contains angry, happy, neutral, and sad utterances. Further details of our collected dataset are reported in a separate publication.
5 Results and Discussion
We first present the distribution of the phonemes found in the two kinds of datasets: natural and acted. Figure 3 shows the distribution for the acted dataset and figure 4 shows the distribution for the natural dataset. The distribution is calculated as follows: from the attention output, the phonemes in the frames with the highest attention are accumulated. For now, we consider only the phonemes with the highest attention value; exploring the phonemes with the second-highest attention and so on, and finding patterns within them, is an interesting problem left for future work. We normalize the frequency of each highest-attended phoneme by the total number of occurrences of that phoneme in the dataset, so as to rule out the scenario in which a phoneme is attended often simply because it occurs often.
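The calculation just described can be sketched as follows. The sketch assumes that attention weights are available per frame and that the phoneme segmenter returns frame-aligned `(phoneme, start_frame, end_frame)` spans; both the data layout and the function names are hypothetical, and in practice the computation would be run separately for each emotion class.

```python
from collections import Counter

def phoneme_at(frame, segments):
    """Return the phoneme whose [start, end) frame span contains `frame`."""
    for phoneme, start, end in segments:
        if start <= frame < end:
            return phoneme
    return None

def attended_phoneme_distribution(utterances):
    """How often each phoneme receives the top attention, per occurrence.

    utterances: list of (attention_weights, segments) pairs, where segments
    is a list of (phoneme, start_frame, end_frame) spans.
    """
    top_counts = Counter()
    total_counts = Counter()
    for weights, segments in utterances:
        total_counts.update(phoneme for phoneme, _, _ in segments)
        top_frame = max(range(len(weights)), key=weights.__getitem__)
        phoneme = phoneme_at(top_frame, segments)
        if phoneme is not None:
            top_counts[phoneme] += 1
    # Normalize by total occurrences so frequent phonemes are not favored.
    return {p: top_counts[p] / total_counts[p] for p in total_counts}

# Toy data: frame-level attention and made-up segment boundaries.
utterances = [
    ([0.1, 0.6, 0.2, 0.1], [("S", 0, 2), ("AA", 2, 4)]),
    ([0.2, 0.1, 0.1, 0.6], [("S", 0, 1), ("AA", 1, 3), ("S", 3, 4)]),
]
distribution = attended_phoneme_distribution(utterances)
```

In the toy data, /S/ occurs three times and holds the top-attention frame twice, giving a normalized value of 2/3, while /AA/ is never the most attended phoneme.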
There are two levels of distinction we want to highlight: (i) distinctions between the distributions for the acted versus natural datasets, and (ii) interesting distinctions between the phoneme distributions for each emotion.
First, the frequency of fricatives and stops is higher in natural than in acted speech. According to previous research, fricatives and stops are used more frequently in anger and sadness. This may be explained by the fact that, in natural emotion, the intensity of the emotion and the spontaneity of the speech make the speaker use fricatives and stops more. Other general findings are that the overall frequency of vowels is higher in acted than in natural speech, that the phonemes /AA/, /B/, and /IH/ are used more in acted than in natural speech, and that the overall percentage of nasal phonemes is higher in natural speech.
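Comparisons at the level of phoneme classes, such as fricatives, stops, vowels, and nasals, can be obtained by grouping per-phoneme counts by manner of articulation. The sketch below assumes ARPAbet symbols, as used for the phonemes named in this section, and treats /HH/ as an aspirate, following the terminology of this paper. The counts are made up for illustration and are not results from the study.

```python
MANNER_CLASSES = {
    "vowel": {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
              "IH", "IY", "OW", "OY", "UH", "UW"},
    "stop": {"P", "B", "T", "D", "K", "G"},
    "fricative": {"F", "V", "TH", "DH", "S", "Z", "SH", "ZH"},
    "aspirate": {"HH"},
    "nasal": {"M", "N", "NG"},
}

def manner_of(phoneme):
    """Map an ARPAbet phoneme to its manner class, or 'other' if unlisted."""
    for manner, members in MANNER_CLASSES.items():
        if phoneme in members:
            return manner
    return "other"

def class_shares(phoneme_counts):
    """Share of top-attention counts that falls into each manner class."""
    total = sum(phoneme_counts.values())
    shares = {}
    for phoneme, count in phoneme_counts.items():
        manner = manner_of(phoneme)
        shares[manner] = shares.get(manner, 0.0) + count / total
    return shares

# Made-up counts, for illustration only.
natural = class_shares({"S": 4, "T": 3, "AA": 2, "N": 1})
acted = class_shares({"S": 2, "T": 1, "AA": 6, "N": 1})
```

With class-level shares in hand, statements such as "fricatives are more frequent in natural than in acted speech" reduce to comparing two numbers per class.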
Secondly, it can be seen from the figures that the distribution is distinct for each of the four emotions. One obvious observation is that, in acted speech, the vowel /OY/ is strongly dominant in sadness but absent in the other emotions.
Fricatives such as /TH/ and /S/ and aspirate phonemes such as /HH/ are highest in anger. This follows from the physiological changes caused by anger, which include a higher breathing rate, accounting for the aspirate phoneme. The kinds of words used in anger are linguistically characterized by glottal stops, which accounts for the higher proportion of fricatives in anger.
It is important to note that the phonemes which have high occurrences in all the emotions are actually important phonemes. Although they do not distinguish between the emotions, they are acoustic phonetic correlates of emotion. Examples of such phonemes include all the vowels, for which although the distribution in Figure 3 is different in all emotions, the apparent difference is not significant. This can also be found in studies which conclude that vowels are very important in deciding the emotion class [22, 13].
Finally, in the context of natural versus acted classes, our analysis concludes that there are significant differences between the phonetic bases of the two classes. We therefore conclude that there is moderate to low validity and value in using acted emotions as proxies for natural emotions. This should inform the conclusions reached by researchers in this domain and make them wary of using acted emotion datasets.
We would like to note that this study examines only the English language. The results found in one language may not be applicable to another. However, the framework can easily be applied to any other language, and it will be interesting to investigate the phonetic correlates in another language and compare them with the conclusions of this study.
A limitation of this study is the lack of a single set of observers for both the acted and natural datasets. Comparing results from the same observer across the datasets would give an idea of how well the datasets deliver the intended emotion. Currently, two different neural models accomplish this task; a unified model is an area for further exploration.
In this paper, we present a study of the differences observed between natural and acted emotion with respect to their phonetic bases. We model the task as an attention-based emotion classification problem. The objective of the attention mechanism is to capture the phonemes that are most important for classification. We then calculate the distribution of these important phonemes and examine how that distribution differs between the natural and acted classes. We observe several differences, for example, a higher presence of fricatives and stops in natural emotion than in acted speech. These phonetic differences are one factor, among others, in the difference between natural and acted emotion. The differences in phonetic bases point to a dichotomy between the natural and acted classes of emotions. This study has applications in speech emotion recognition, emotional speech synthesis, and human-computer interaction. The design of the model allows us to use it not only for distinguishing acted from natural speech, but also for other problems, such as exploring the phonetic bases of voice disorders, e.g., vocal palsy.
This material is based upon work funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. DM19-0798.
- [1] (2008) How we are not equally competent for discriminating acted from spontaneous expressive speech. In Proceedings of Speech Prosody, pp. 693–696.
- [2] (2010) Prosodic correlates of acted vs. spontaneous discrimination of expressive speech: a pilot study. In Speech Prosody 2010, Fifth International Conference.
- [3] (2008) IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation 42 (4), pp. 335.
- [4] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [5] Google Speech-to-Text. https://cloud.google.com/speech-to-text/
- [6] (2014) Physiological sensing of emotion. The Oxford Handbook of Affective Computing, pp. 204.
- [7] (1999) The effects of emotions on voice quality. In Proceedings of the XIVth International Congress of Phonetic Sciences, pp. 2029–2032.
- [8] (2015) Effect of acting experience on emotion expression and recognition in voice: non-actors provide better stimuli than expected. Journal of Nonverbal Behavior 39 (3), pp. 195–214.
- [9] (2011) Authentic and play-acted vocal emotion expressions reveal acoustic differences. Frontiers in Psychology 2, pp. 180.
- [10] (2003) The CMU Sphinx-4 speech recognition system. In IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2003), Hong Kong, Vol. 1, pp. 2–5.
- [11] Detecting gender differences in perception of emotion in crowdsourced data. https://www.researchgate.net/publication/333825376_Detecting_Gender_Differences_in_Perception_of_Emotion_from_Speech_in_the_Wild
- [12] (2013) Inherent emotional quality of human speech sounds. Cognition & Emotion 27 (6), pp. 1105–1113.
- [13] (2014) Continuous emotion recognition with phonetic syllables. Speech Communication 57, pp. 155–169.
- [14] (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
- [15] Podcast directory, NPR.
- [16] (2011) Emotion expression: the evolutionary heritage in the human voice. In Interdisciplinary Anthropology, pp. 105–129.
- [17] (1977) Cue utilization in emotion attribution from auditory stimuli. Motivation and Emotion 1 (4), pp. 331–346.
- [18] (2013) Vocal markers of emotion: comparing induction and acting elicitation. Computer Speech & Language 27 (1), pp. 40–58.
- [19] Top collections at the archive.
- [20] (2009) How does real affect affect affect recognition in speech?
- [21] (2012) Amazon Mechanical Turk. Retrieved August 17, 2012.
- [22] (2011) Vowels formants analysis allows straightforward detection of high arousal emotions. In 2011 IEEE International Conference on Multimedia and Expo, pp. 1–6.
- [23] (1999) Phonosymbolism and the emotional nature of sounds: evidence of the preferential use of particular phonemes in texts of differing emotional tone. Perceptual and Motor Skills 89 (1), pp. 19–48.
- [24] (1972) Emotions and speech: some acoustical correlates. The Journal of the Acoustical Society of America 52 (4B), pp. 1238–1250.
- [25] (2006) Real vs. acted emotional speech. In Ninth International Conference on Spoken Language Processing.