
How Should Voice Assistants Deal With Users' Emotions?

There is a growing body of research in HCI on detecting users' emotions. Once it is possible to detect users' emotions reliably, the next question is how an emotion-aware interface should react to the detected emotion. As a first step, we tried to find out how humans deal with the negative emotions of an avatar. The hope behind this approach was to identify human strategies which we can then mimic in an emotion-aware voice assistant. We present a user study in which participants were confronted with an angry, sad, or frightened avatar. Their task was to make the avatar happy by talking to it. We recorded the voice signal and analyzed it. The results show that users predominantly reacted with neutral emotion. However, we also found gender differences, which open a range of questions.



1. Introduction

Voice Assistants (VAs), which are embedded in smartphones (e.g., Siri) or smart home devices (e.g., Alexa), have a growing number of users. The advantage of VAs, compared to classical interaction with keyboard, mouse, and display, is that they do not demand visual attention and leave the hands free for other tasks (Sayago et al., 2019). Despite the ’artificial intelligence’ in VAs, users consider current voice assistants stupid, as they are not fully-fledged conversational partners. One reason for this is a missing awareness of the users’ emotions.

With the advancements in neural networks and machine learning, it is possible to recognize users’ emotions in speech (Akçay and Oğuz, 2020; Fahad et al., 2021). However, even if it is possible to reliably detect emotion, it is not clear how a VA should react. This is the research question of our study.

Some emotions, such as sadness, anger, and fear, are classified as negative, and an often-heard idea is to develop VAs that turn such an emotion into a positive one, i.e., make the user happy. The question is how a VA should behave and how the VA’s voice should sound in an emotional response to achieve this goal. To get closer to an answer, we decided to study how humans would try to achieve this. For our study we set up a website with animated emojis talking in negative emotions. We asked the participants to cheer the emojis up by talking, and we recorded their voices. We analyzed the recorded voice samples with Vokaturi, an emotion detector based on neural networks, and also with classical methods such as RMS and ZCR. The predominant result from Vokaturi was ’neutral’, and there were not many differences in the voices for the different moods of the emoji. However, we found surprising differences in strategy by the participants’ gender. Our study therefore did not deliver conclusive answers, but created questions for further research, especially on the gender issue.
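As a side note on the classical features mentioned above, RMS (root mean square) energy and ZCR (zero-crossing rate) can be computed directly from the raw signal. The following is a minimal sketch, assuming the voice sample is available as a one-dimensional NumPy array of floats; the function names and the toy signal are our own illustration, not part of Vokaturi or any specific toolkit.

```python
import numpy as np

def rms(signal):
    """Root mean square energy of a 1-D signal."""
    return float(np.sqrt(np.mean(np.square(signal, dtype=float))))

def zcr(signal):
    """Zero-crossing rate: fraction of adjacent sample pairs with a sign change."""
    signs = np.sign(signal)
    return float(np.mean(signs[:-1] != signs[1:]))

# Toy signal: alternates sign on every sample.
samples = np.array([0.5, -0.5, 0.5, -0.5])
print(rms(samples))  # 0.5
print(zcr(samples))  # 1.0
```

RMS roughly tracks loudness, while ZCR is a crude correlate of noisiness and spectral content; both are cheap, model-free baselines next to a learned detector.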

2. Related Work

There is a growing body of research on how to detect emotion in speech. One possibility is speech emotion recognition (SER), which analyzes how something was said. A good overview of this approach is provided by Schuller (2018). Another possibility is the semantic analysis of natural language (Maulud et al., 2021), which analyzes what was said.

Research on how to react to emotion is mainly done by psychologists, but it typically investigates the communication between humans. Li et al. (2021) worked on emotional reactions by focusing on the feelings of the user and presented the ’EmoElicitor’ model to elicit particular emotions in users. Clos et al. (2017) tried to predict the emotional reaction of readers of social network posts. Other researchers estimated emotion from text using the ’EmoLex’ lexicon and concluded that their approach outperforms the baselines they compared against (Mohammad and Turney, 2010). Thornton and Tamir (2017) studied whether humans can predict the future emotion of others from the currently observed emotion through a learned mental model.

In the field of HCI, some researchers proposed that self-reported and concurrently expressed emotion can help computers to efficiently sense, recognize, and respond to human communication of emotions (Picard, 2000; Picard and Klein, 2002; Picard, 2003). Pascual-Leone and Greenberg (2007) used a sequential model of emotional processing and an accompanying measure to conduct an emotional transformation study involving 310 clinical and 130 sub-clinical cases (Pascual-Leone, 2018).

3. Study

Figure 1. The web page offered emojis in three moods - angry, sad, fearful. The emoji talked to the participants and waited for an answer, which we recorded. This was repeated with different emoji utterances five times for each emotion.

To understand how VAs should react to users’ emotions, we used the approach of role-swapping; in other words, the roles of the VA and the user were switched in our study. We designed a website and invited participants with the web link. Besides a privacy statement, instructions, and a microphone test, the website presented animated avatars (see Figure 1) in three different emotional states – sad, angry, and fearful. The participants’ task was to change the avatars’ emotion into happiness by speaking to them. For every emotion, the conversation consisted of five avatar utterances and corresponding user answers. After each answer the avatar became gradually happier, independently of the user’s actual answer. We recorded the first two seconds of every answer. After the recordings for each emotion we asked ”How sad was the emoji’s voice?” and ”How sad was the emoji’s face?” (and the analogous questions for the other emotions) to validate the emotional input. We also asked for suggestions on how to cheer up people in this mood, to collect the users’ intentions.

4. Results and Analysis

The study was conducted over three weeks and involved 52 participants. Six participants did not complete the study; therefore, 46 valid data records remained for evaluation. The participants, 22 female and 24 male, had an average age of 30.5 years; the youngest participant was 11 years old and the oldest was 69. The average age of the female participants (30.4) was nearly equal to that of the male participants (30.5).

The evaluation of the users’ perception of the stimulus showed that the avatar’s mood was perceived as intended. Figure 2 shows the users’ average emotion as reported by Vokaturi. The high standard deviations indicate a high dispersion of individuals’ emotions. Vokaturi reports five values for five basic emotions; typically four values are small and only one value is high. If we eliminate values below 15% as noise, we get Figure 3, which shows that most answers were given with neutral emotion.
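The noise-elimination step can be illustrated as follows; the emotion labels and score values here are hypothetical examples of a detector’s per-emotion output, not actual Vokaturi results.

```python
# Hypothetical per-emotion scores, as a detector such as Vokaturi might report them.
scores = {"neutral": 0.62, "happy": 0.09, "sad": 0.21, "angry": 0.05, "fear": 0.03}

NOISE_FLOOR = 0.15  # values below 15% are treated as noise and discarded

filtered = {emotion: p for emotion, p in scores.items() if p >= NOISE_FLOOR}
print(filtered)  # {'neutral': 0.62, 'sad': 0.21}
```

After thresholding, only the dominant emotion (and occasionally a second one) survives, which is what the per-sample aggregation in Figure 3 reflects.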

Figure 2. Emotion in the participants’ voice (analysed with Vokaturi), averaged over the avatar’s moods.
Figure 3. Vokaturi reports at least small values for all emotions, which we consider as noise. The figure above takes only values above 0.15 into account.
Figure 4. Emotion in the participants’ voice (analysed with Vokaturi) for the three avatar moods.
Figure 5. Mean RMS values of the voice samples for the different avatar moods.
Figure 6. Emotion in the participants’ voice by gender (analysed with Vokaturi). There is a clear gender difference.

Figure 4 shows the users’ average emotion for the three negative avatar moods. There is not much difference in the emotional response to the different avatar moods. However, we found a difference in the RMS (root mean square) values (see Figure 5). Figure 6 shows the users’ emotion by gender, averaged over the three avatar moods. It shows a clear difference by gender. We did not calculate the significance of this result, as it is not clear how reliable Vokaturi’s results are. Vokaturi’s documentation states: ”The accuracy on the five built-in emotions is 76.1%”.

5. Discussion and Future Work

Emotion in speech signals is independent of language and culture (Pell et al., 2009), and there was no reason to assume a difference for gender. The fact that we found a gender difference needs an explanation. Although it is unlikely, it could be by chance.

Another explanation could be a gender bias in the emotion detector. Vokaturi’s training databases are the Berlin Database of Emotional Speech, or Emo-DB (Burkhardt et al., 2005), which contains five female and five male speakers, and SAVEE, which contains voice samples of four males. This raises the question whether training databases for emotion detection have to be gender-balanced. Vogt and André (2006) showed that gender differentiation improves emotion recognition.

A further possibility is that there is a gender difference in the reaction to emotion. The consequence would be that a female voice assistant should react to emotion differently than a male voice assistant, and it may be the case that male users need a different treatment to cheer them up than female users. Although there are hints that this could be the case (Seo, 2022), we are not very comfortable with this idea, as it means that emotion-aware voice assistants would also need gender-awareness, and this would manifest gender differences in technology. The general question is whether there is a difference in conversations from man to man, woman to woman, or man to woman, and if there are differences, whether we want to implement them in voice assistants.

There are critical voices on gender issues for voice assistants in a UNESCO report and in the media. There is also an initiative for a genderless voice to end gender bias in AI. A follow-up study could offer a male and a female stimulus. Alternatively, it would be possible to offer a gender-neutral human voice or even the voice of a comic character. This raises a more general question: Should research on future voice interfaces investigate all interfaces which are possible, only those for which there is a market, or only interfaces which are in accordance with current morals?


  • M. B. Akçay and K. Oğuz (2020) Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Communication 116, pp. 56–76. Cited by: §1.
  • F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss (2005) A database of German emotional speech. Vol. 5, pp. 1517–1520. External Links: Document Cited by: §5.
  • J. Clos, A. Bandhakavi, N. Wiratunga, and G. Cabanac (2017) Predicting emotional reaction in social networks. In European Conference on Information Retrieval, pp. 527–533. Cited by: §2.
  • M. S. Fahad, A. Ranjan, J. Yadav, and A. Deepak (2021) A survey of speech emotion recognition in natural environment. Digital Signal Processing 110, pp. 102951. Cited by: §1.
  • S. Li, S. Feng, D. Wang, K. Song, Y. Zhang, and W. Wang (2021) EmoElicitor: an open domain response generation model with user emotional reaction awareness. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pp. 3637–3643. Cited by: §2.
  • D. H. Maulud, S. R. Zeebaree, K. Jacksi, M. A. M. Sadeeq, and K. H. Sharif (2021) State of art for semantic analysis of natural language processing. Qubahan Academic Journal 1 (2), pp. 21–28. Cited by: §2.
  • S. Mohammad and P. Turney (2010) Emotions evoked by common words and phrases: using Mechanical Turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 workshop on computational approaches to analysis and generation of emotion in text, pp. 26–34. Cited by: §2.
  • A. Pascual-Leone and L. S. Greenberg (2007) Emotional processing in experiential therapy: why ”the only way out is through”. Journal of Consulting and Clinical Psychology 75 (6), pp. 875. Cited by: §2.
  • A. Pascual-Leone (2018) How clients “change emotion with emotion”: a programme of research on emotional processing. Psychotherapy Research 28 (2), pp. 165–182. Cited by: §2.
  • M. D. Pell, S. Paulmann, C. Dara, A. Alasseri, and S. A. Kotz (2009) Factors in the recognition of vocally expressed emotions: a comparison of four languages. Journal of Phonetics 37 (4), pp. 417–435. External Links: ISSN 0095-4470, Document, Link Cited by: §5.
  • R. W. Picard and J. Klein (2002) Computers that recognise and respond to user emotion: theoretical and practical implications. Interacting with computers 14 (2), pp. 141–169. Cited by: §2.
  • R. W. Picard (2000) Toward computers that recognize and respond to user emotion. IBM systems journal 39 (3.4), pp. 705–719. Cited by: §2.
  • R. Picard (2003) Computers that recognize and respond to user emotion. In International Conference on User Modeling, pp. 2–2. Cited by: §2.
  • S. Sayago, B. B. Neves, and B. R. Cowan (2019) Voice assistants and older people: some open issues. In Proceedings of the 1st International Conference on Conversational User Interfaces, pp. 1–3. Cited by: §1.
  • B. W. Schuller (2018) Speech emotion recognition: two decades in a nutshell, benchmarks, and ongoing trends. Communications of the ACM 61 (5), pp. 90–99. Cited by: §2.
  • S. Seo (2022) When female (male) robot is talking to me: effect of service robots’ gender and anthropomorphism on customer satisfaction. International Journal of Hospitality Management 102, pp. 103166. External Links: ISSN 0278-4319, Document, Link Cited by: §5.
  • M. A. Thornton and D. I. Tamir (2017) Mental models accurately predict emotion transitions. Proceedings of the National Academy of Sciences 114 (23), pp. 5982–5987. Cited by: §2.
  • T. Vogt and E. André (2006) Improving automatic emotion recognition from speech via gender differentiation. In LREC. Cited by: §5.