Mugeetion: Musical Interface Using Facial Gesture and Emotion

09/14/2018 ∙ by Eunjeong Stella Koh, et al. ∙ University of California, San Diego

People feel emotions when listening to music. However, emotions are not tangible objects that can be exploited in the music composition process, as they are difficult to capture and quantify in algorithms. We present a novel musical interface, Mugeetion, designed to capture instances of emotional states from users' facial gestures and relay that data to associated musical features. Mugeetion translates qualitative data about emotional states into quantitative data that can be used in the sound generation process. We also presented and tested this work in the sound installation exhibition Hearing Seascape, using the audience's facial expressions. Audience members heard changes in the background sound based on their emotional state. This work contributes to multiple research areas, such as gesture tracking systems, emotion-sound modeling, and the connection between sound and facial gesture.




1 Introduction

Electronic music researchers use various components as inputs for their music generation process [1, 2, 3, 4]. Music and emotion are strongly linked, and listeners can feel different emotions directly or indirectly through music. Engaging emotion as a component of a musical interface has great potential for composing creative music and expressing messages in an effective way [5]. However, there are several difficulties in using emotion for sonification [6, 7]. First, emotion is qualitative and thus hard to utilize for sound generation applications, which rely on quantitative inputs. Second, emotion is represented on a continuous spectrum. Measurement of affect requires a complex and multi-faceted approach. In this paper, we use a facial gesture tracking system to define emotional states based on facial gesture information.

Facial gestures express various information related to emotion, cognition, and inspiration [8, 9]. Further, facial gestures are more straightforward indicators of emotion than other bodily gestures. Several studies have examined the connection between facial gestures and sound itself [10, 11]. In this paper, we propose an interactive audio interface that sonifies emotion. The idea is to use facial gesture data to detect emotion and categorize it into several emotional states for sonification. We implement two approaches for this prototype: (1) music style transition based on the user's emotion and (2) an auditory interface based on the connection between facial components and musical metadata. We also installed our system in a digital exhibition, the Hearing Seascape installation, for facial interaction with an audience. During the exhibition, Mugeetion detected the audience's facial expressions in real-time, and audience members simultaneously heard the sounds that were mapped to their specific facial gestures.

2 Backgrounds and Related Work

There is a rich history of creating novel sound interfaces using gesture-based motion tracking for live performance and improvisation [12, 13, 14]. A motion tracking system can allow a musician to generate their own creative music in real-time [15, 16]. Previous studies demonstrated new audio interfaces for sonification through body gesture. A number of systems have captured gestures and used gesture data for the sonification process, either stepwise or in real-time [14, 17]. Previous studies linking faces and sound follow two approaches: using sound as an input for tracking facial gestures, or using facial components as an input for sound generation. In the first group, some studies used auditory input to focus on the visualization of facial gesture [18, 19, 20]. For example, Kapuscinski [21] conducted listening tests of Chopin pieces and recorded the participants' facial expressions. In the second group, other experiments have focused on sound generation using facial parameters as an input [22]. These studies use FaceOSC software to apply facial gesture data to the sound generation process. McDonald [11] created FaceOSC to track facial gestures and send them directly to Max as input. There are several interesting experiments linking facial gestures and sound on YouTube [22]. However, these experiments are targeted more toward application than toward music cognition research. Few computer music researchers delve into the relationship between emotion and sound itself.

Music psychologists have studied the relationship between emotion and sound and have tried to model the connection. However, music cognition research has not contributed to music sonification research. In this paper, we propose a musical interface that combines a facial gesture tracking system with the Facial Action Coding System (FACS) [8] in order to capture emotional states. FACS allows a concrete data representation of a facial gesture and its corresponding emotional state. Thus, facial gestures can be reference points for observing emotion and translating emotion into sound. In this context, we apply the tracked facial data to the sonification process. We discuss its musical implementation in the following sections.

3 Methods

3.1 Understanding Facial Gesture

Figure 1: System structure: connection between facial gesture, compound facial expressions of emotion, and sound

Figure 1 gives an overview of the proposed system, which connects facial expression to sound generation (footnote 1: the images printed in Figures 1 and 2 are © Jeffrey Cohn and come from the Cohn-Kanade (CK and CK+) database). We generate the musical style based on facial expressions. As shown in Figure 1, our system includes three sequential steps: (1) capturing facial gesture using FaceOSC, (2) connecting to compound facial expressions of emotion, and (3) synthesizing musical features based on the emotional state. FaceOSC software is used to help the Mugeetion system understand the user's facial gestures and generate sound based on the user's emotional state. The emotion detection module uses the software for real-time facial gesture tracking and transmits raw-level facial data over the Open Sound Control (OSC) protocol. If the detector finds multiple potential faces within the frame, the closest face gets recognition priority, so that a single face is analyzed at a time. For analyzing facial expression, we use FACS and the Action Unit classification (footnote 2: Description of Facial Action Coding System and Action Units). We chose the Action Unit (AU) combinations of three basic emotions: (A) happy, (B) neutral, and (C) sad. Each emotional state is defined by a combination of several individual AUs. For example, the facial expression of happy includes AU 6 (cheek raiser), AU 12 (lip corner puller), and AU 25 (lips part) (See Figure 1).

We practiced our sonification method with face images from the Cohn-Kanade AU-Coded Facial Expression Database [9, 23]. Training with multiple images made the system work well with different faces. We selected representative facial images for linking with our sound generation process. We used 20 images for each emotional state: happy, neutral, and sad (60 images total). We measured these data to create a data range for each emotional state and defined the differences between the emotions. We manually annotated the range of facial gestures for mapping each muscle activation to AU components (See Table 1; footnote 3: the units in this table follow the FaceOSC data measurement). Figure 2 shows the data range for AU components with our training images. For example, the average AU6 scale of the happy face is 2.6605, AU12 is 18.2263, and AU25 is 2.3777. After learning the range of AUs, the system can classify facial gestures into pre-defined states from an individual photo or from real-time face input through the connected webcam. Along with categorizing, the system attempts to translate the musical style based on the input of emotional states (See Figure


position    details    data range (min/max)

mouth       width      6.0244/19.2747
mouth       height     0.8893/3.0010
eyebrow     left       6.7666/8.0714
eyebrow     right      6.6787/7.9785
eye         left       2.4329/3.4357
eye         right      2.3950/3.3144
jaw         -          18.9888/22.9718
nostrils    -          5.6477/8.8061

Table 1: Facial data configuration from FaceOSC
Figure 2: Data transition process: from facial gesture to emotion
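The classification step described in Section 3.1 can be illustrated with a nearest-centroid rule over the three AU measurements named in the text. This is a sketch under stated assumptions, not the authors' classifier: only the happy averages come from the paper, the neutral and sad centroids are hypothetical placeholders, and plain Euclidean distance is an arbitrary choice that lets the larger AU12 values dominate.

```python
from math import dist

FEATURES = ("AU6", "AU12", "AU25")

CENTROIDS = {
    # Average values for the happy face, as reported in the text.
    "happy": {"AU6": 2.6605, "AU12": 18.2263, "AU25": 2.3777},
    # Hypothetical placeholders: the paper does not report these averages.
    "neutral": {"AU6": 3.0, "AU12": 13.0, "AU25": 1.2},
    "sad": {"AU6": 3.2, "AU12": 10.0, "AU25": 1.0},
}


def classify(sample, centroids=CENTROIDS):
    """Return the emotional state whose centroid is closest to the sample."""
    point = [sample[name] for name in FEATURES]
    return min(
        centroids,
        key=lambda state: dist(point, [centroids[state][name] for name in FEATURES]),
    )


state = classify({"AU6": 2.7, "AU12": 17.9, "AU25": 2.2})
```

With real measurements, the centroids would be replaced by the averages computed from the 20 training images per state.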

3.2 Sonification with Action Units

In this section, we focus on sonification with AUs in detail. We generate musical output based on the connection between AUs and emotional state. We then apply a mapping between emotion and sound features: for example, the energetic happy face is mapped to increasing pitch and loudness, and the dynamics of the sad face are mapped to white noise and distortion parameters. We also connect specific AUs to MIDI notes for sonification. The MIDI packets are mapped to controls of different parameters, resulting in different musical sounds depending on how the emotional state moves. For example, when a user moves their mouth, the mouth height data is taken as input and normalized to the 0-127 range to generate MIDI notes or dynamics. These 0-127 values then correspond to MIDI note numbers. A few studies have explored this method before [10, 17, 19], and we explore the linkage between other sound features and the emotion conveyed in the AUs.
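The normalization described above can be sketched as a clamped linear rescaling. The mouth height limits are the min/max values from Table 1; the function name and the choice to clamp out-of-range readings (rather than reject them) are assumptions made for this illustration.

```python
# Mouth height range (min, max) taken from Table 1, in FaceOSC units.
MOUTH_HEIGHT_RANGE = (0.8893, 3.0010)


def to_midi(value: float, low: float, high: float) -> int:
    """Linearly rescale value from [low, high] to an integer in 0..127."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clamped = min(max(value, low), high)  # keep readings inside the range
    return round((clamped - low) / (high - low) * 127)


note = to_midi(2.0, *MOUTH_HEIGHT_RANGE)
```

The same helper applies to any other row of Table 1 by passing that row's range, and the result can be used either as a note number or as a velocity.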

3.3 Connecting between Emotion and Sound

In this section, we explain how the system has been implemented for connecting emotion to sound. The system can interpolate the sound results from facial gesture inputs. In this approach, we generate the sound based on pre-recorded sound. We can play different sounds based on the user’s happy, neutral, or sad emotional state. Our Mugeetion interface automatically plays the specific song related to the user’s emotional state. We list our sound files below, which have been played to a number of subjects interacting with the system.

Mozart - Piano Sonata No 16 in C major
Mozart - Eine Kleine Nachtmusik K 525 Allegro
Mozart - Piano Sonata No 11 in A major K 331
Mozart - Symphony No 25 in G Minor K 183 1st Movement
Mozart - Requiem in D minor

The selection of this list is based on a study of the Mozart effect [24]. For the sound files, we use a dataset.

4 Prototypes

4.1 Demo

For the prototype of our system, we explored adding more musical variation, such as pitch height, loudness, distortion, or tempo change, as parameters to be controlled. We show the possibility of real-time sound generation. Our preliminary demo video is uploaded on YouTube. In order to allow users to easily interact with the sound generation process, we built a Max application, which uses facial data and FACS for sound creation.

4.2 Sound Installation Work with Mugeetion

Our sonification method, Mugeetion, has also been used in the sound installation exhibition Hearing Seascape (See Figure 3) at the Qualcomm Institute at UC San Diego in February 2018 (footnote 6: photo by Alex Matthews, © 2018 Regents of the University of California). This exhibition was part of a collaborative effort with the Scripps Institution of Oceanography at UC San Diego to interpret their coral reef image data in a musical way. To convey the importance of engaging with the soundscapes of coral reefs, we suggested that Mugeetion would be effective in fulfilling the goal of the project. Our prototype of the exhibition can be found on YouTube. The main goals of this project were to display different aspects of sound and innovative graphic design to create an enjoyable environment for the audience, and to create an inviting soundscape with a synergy among voices, images, synthesized sounds, and human emotion.

Figure 3: Left: Hearing Seascape exhibition, Right: Interaction with Mugeetion during the exhibition (Neutral state)

4.2.1 Characteristics of the sounds in the Hearing Seascape

There were two sound components in the sound installation. First, regarding sound input, we made recordings of singing and speaking in bowls of water. We recorded various sounds, such as giggling, tongue clicking, singing, spoken dialogue, and low and high pitches, both in the air and in the water. This specific sonification process was related to the goal of the project. The sound of voices underwater showed a variation of pitch and vagueness of speech. This is representative of the confusion and misunderstanding that surrounds coral reef research [25]: there is so much yet to be discovered and understood about these creatures. Second, using Mugeetion, our method detected the audience's facial expressions in real-time, and the detected emotional state was used to display the coral reef images and synthesize the soundscape. The audience could hear the sound that was simultaneously mapped to their specific facial gestures. For instance, when audience members expressed strong emotions with their facial gestures, these dynamics were connected to sound components to increase intensity, tempo, and pitch height. The interaction through Mugeetion invited the audience to participate in the exhibition.
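The last mapping, where stronger expressions raise intensity, tempo, and pitch height together, could be sketched as a linear interpolation driven by one expression-strength value. Everything numeric here is hypothetical: the paper does not give the parameter ranges used in the installation, and the 0-to-1 strength scale, the class name, and the function names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundParams:
    gain: float                   # linear amplitude, 0..1
    tempo_bpm: float              # beats per minute
    pitch_shift_semitones: float  # upward transposition


def lerp(low: float, high: float, t: float) -> float:
    """Linear interpolation between low and high for t in [0, 1]."""
    return low + (high - low) * t


def sound_params(strength: float) -> SoundParams:
    """Map expression strength (clamped to 0..1) to sound parameters.

    The ranges below are hypothetical placeholders, not installation values.
    """
    t = min(max(strength, 0.0), 1.0)
    return SoundParams(
        gain=lerp(0.3, 1.0, t),
        tempo_bpm=lerp(60.0, 120.0, t),
        pitch_shift_semitones=lerp(0.0, 12.0, t),
    )


calm = sound_params(0.0)
strong = sound_params(1.0)
```

All three parameters rise monotonically with the strength value, which matches the qualitative behavior described for the exhibition.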

5 Discussion and Future Work

Mugeetion makes several contributions beyond previous work. Rather than simply detecting facial gesture data, it also automatically extracts emotional states and produces transitions in the sound output. Mugeetion provides users with a sound generation model based on the components of emotion and musical metadata. We focus on how sound can be changed based on users' emotional movement. In the presented soundscape installation, the interaction between emotion and sound occurred based on users' emotional states. We explore how audience participation in artwork can be used in interactive systems and how it changes the sound generation output. In future work, we will collect continuous audience feedback during the exhibition in order to evaluate the sound generation output. For example, audience members would be asked how satisfied they were with how well the sound output reflected their emotional states.

Furthermore, the system would be able to store a collection of data, which creators can use to improve their sonification process. Every AU per second and audio files would be automatically saved. The system would collect and store a repository of the memory units that users can look back on in order to re-utilize their composition process.
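As a sketch of what the per-second AU record could look like, the snippet below writes one CSV row per second. The storage format is not specified in the paper, so the CSV layout, the column names, and the function name are all assumptions; audio file storage is left out.

```python
import csv
import io

AU_NAMES = ("AU6", "AU12", "AU25")


def write_au_log(samples, out) -> None:
    """Write one row per second: the second index, then each AU value."""
    writer = csv.writer(out)
    writer.writerow(("second",) + AU_NAMES)
    for second, values in enumerate(samples):
        writer.writerow([second] + [f"{values[name]:.4f}" for name in AU_NAMES])


# In-memory buffer for the example; a real log would use an open file.
buffer = io.StringIO()
write_au_log([{"AU6": 2.6605, "AU12": 18.2263, "AU25": 2.3777}], buffer)
rows = buffer.getvalue().splitlines()
```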

We will further develop the system based on the following issues:

increasing training images for covering multiple faces and optimizing different emotional states

implementing an AU indicator or other emotional measure on the FaceOSC display for better interaction with users

exploring other similar emotion-interactive systems to compare the sonification results

This paper has been generously supported by Dr. Lei Liang, other members from Calit2 Center of Graphics, Visualization and Virtual Reality, and The Smith Lab in Scripps Institution of Oceanography at UC San Diego.


  • [1] I. Poupyrev, M. J. Lyons, S. Fels et al., “New interfaces for musical expression,” in CHI’01 Extended Abstracts on Human Factors in Computing Systems.   ACM, 2001, pp. 491–492.
  • [2] M. J. Lyons, “Machine Intelligence, New Interfaces, and the Art of the Soluble,” arXiv preprint arXiv:1707.08011, 2017.
  • [3] M. J. Lyons and N. Tetsutani, “Facing the music: a facial action controlled musical interface,” in CHI’01 extended abstracts on Human factors in computing systems.   ACM, 2001, pp. 309–310.
  • [4] A. Çamcı, “A cognitive approach to electronic music: theoretical and experiment-based perspectives,” in Proceedings of the International Computer Music Conference, 2012, pp. 1–4.
  • [5] F. Ventura, A. Oliveira, and A. Cardoso, “An emotion-driven interactive system,” in Portuguese Conference on Artificial Intelligence, 2009.
  • [6] M. Leman, “Embodied music cognition and music mediation technology,” 2008.
  • [7] R. M. Winters and M. M. Wanderley, “Sonification of emotion: Strategies for continuous display of arousal and valence,” in The 3rd International Conference on Music & Emotion, Jyväskylä, Finland, June 11-15, 2013.   University of Jyväskylä, Department of Music, 2013.
  • [8] S. Du, Y. Tao, and A. M. Martinez, “Compound facial expressions of emotion,” Proceedings of the National Academy of Sciences, vol. 111, no. 15, pp. E1454–E1462, 2014.
  • [9] T. Kanade, J. F. Cohn, and Y. Tian, “Comprehensive database for facial expression analysis,” in Automatic Face and Gesture Recognition, 2000. Proceedings. Fourth IEEE International Conference on.   IEEE, 2000, pp. 46–53.
  • [10] N. d’Alessandro, M. Astrinaki, and T. Dutoit, “MAGEFACE: Performative Conversion of Facial Characteristics into Speech Synthesis Parameters,” in International Conference on Intelligent Technologies for Interactive Entertainment.   Springer, 2013, pp. 179–182.
  • [11] K. McDonald, “FaceOSC,” Latest Release: 2016, available at
  • [12] M. B. Küssner, D. Tidhar, H. M. Prior, and D. Leech-Wilkinson, “Musicians are more consistent: Gestural cross-modal mappings of pitch, loudness and tempo in real-time,” Frontiers in psychology, vol. 5, 2014.
  • [13] A. R. Jensenius, “Motion-sound interaction using sonification based on motiongrams,” 2012.
  • [14] A. Churnside, C. Pike, and M. Leonard, “Musical movements—gesture based audio interfaces,” in Audio Engineering Society Convention 131.   Audio Engineering Society, 2011.
  • [15] C. Wang and M. S. Brandstein, “A hybrid real-time face tracking system,” in Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on, vol. 6.   IEEE, 1998, pp. 3737–3740.
  • [16] Y. Liu, F. Xu, J. Chai, X. Tong, L. Wang, and Q. Huo, “Video-audio driven real-time facial animation,” ACM Transactions on Graphics (TOG), vol. 34, no. 6, p. 182, 2015.
  • [17] A. Migicovsky, J. Scheinerman, and G. Essl, “MoveOSC—Smart Watches in Mobile Music Performance,” in ICMC, 2014.
  • [18] G. Kramer, Auditory display: sonification, audification and auditory interfaces.   Addison-Wesley Longman Publishing Co., Inc., 2000.
  • [19] A. Sedes, B. Courribet, and J.-B. Thiebaut, “From the visualization of sound to real-time sonification: different prototypes in the Max/MSP/Jitter environment.” in ICMC, 2004.
  • [20] T. Hawkins, “Emoter,” 2002, available at
  • [21] J. Kapuscinski, “Where is Chopin?” 2010, available at
  • [22] V. Artists, “Search ‘FaceOSC’ at YouTube,” Last checked: 2018/02/28, available at
  • [23] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews, “The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on.   IEEE, 2010, pp. 94–101.
  • [24] L. Perlovsky, A. Cabanac, M.-C. Bonniot-Cabanac, and M. Cabanac, “Mozart effect, cognitive dissonance, and the pleasure of music,” Behavioural Brain Research, vol. 244, pp. 9–14, 2013.
  • [25] J. E. Smith, R. Brainard, A. Carter, S. Grillo, C. Edwards, J. Harris, L. Lewis, D. Obura, F. Rohwer, E. Sala et al., “Re-evaluating the health of coral reef communities: baselines and evidence for human impacts across the central Pacific,” in Proc. R. Soc. B, vol. 283, no. 1822.   The Royal Society, 2016, p. 20151985.