1. Introduction and Background
A large part of human communication is non-verbal (Knapp et al., 2013) and often takes place through co-speech gestures (McNeill, 1992; Kendon, 2004). Co-speech gesture behavior in embodied agents has been shown to help with learning tasks (Bergmann and Macedonia, 2013) and lead to greater emotional response (Wu et al., 2014). Gesture generation is hence an important part of both automated character animation and human-agent interaction.
Early dominance of rule-based approaches (Cassell et al., 1994; Kopp and Wachsmuth, 2004; Ng-Thow-Hing et al., 2010; Marsella et al., 2013) has been challenged by data-driven gesture generation systems (Neff et al., 2008; Bergmann and Kopp, 2009; Yoon et al., 2019; Kucherenko et al., 2020; Yoon et al., 2020; Ahuja et al., 2020; Ferstl et al., 2020). These data-driven systems initially considered only a single speech modality (either audio or text) (Neff et al., 2008; Bergmann and Kopp, 2009; Yoon et al., 2019; Kucherenko et al., 2021a), but are now shifting to using both audio and text together (Kucherenko et al., 2020; Yoon et al., 2020; Ahuja et al., 2020).
While rule-based systems provide control over the communicative function of output gestures, they lack variability and require much manual effort to design. Data-driven systems, on the other hand, need less manual work and are highly flexible, but most existing systems provide little control over communicative function, and the generated gestures bear little relation to the speech content (Kucherenko et al., 2021b).
Table 1. Macro F1 scores for predicting gesture category (deictic, beat, iconic, discourse), gesture semantics (amount, shape, direction, size), and gesture phase (prep., pre-hold, stroke, post-hold, retr.) from speech.
This paper continues recent efforts to bridge the gap between the two paradigms (Ferstl et al., 2020; Saund et al., 2021; Yunus et al., 2021). The most similar prior work is that of Yunus et al. (2021), in which gesture timing and duration were predicted from acoustic features alone. The method proposed here differs from their approach in three ways: 1) it considers not only audio but also text as input; 2) it models not only gesture phase, but multiple gesture properties; 3) it also provides a framework for integrating these gesture properties in a data-driven gesture-generation system.
The proposed approach helps decouple different aspects of gesticulation and can leverage database information about gesture timing and content with modern, high-quality data-driven animation.
2. Proposed Model
Our unified model uses speech text and audio as input to generate gestures as a sequence of 3D poses. As depicted in Figure 1, it is composed of three neural networks:
Speech2GestExist: A temporal CNN which takes speech as input and returns a binary flag indicating if the agent should gesture (similar to (Yunus et al., 2019));
Speech2GestProp: A temporal CNN which takes speech as input and predicts a set of binary gesture properties, such as gesture type, gesture phase, etc.;
A gesture-generation network: a probabilistic motion model (e.g., one based on normalizing flows) that synthesizes a sequence of 3D poses conditioned on the speech and on the predicted gesture properties.
In this study, we experiment with the first two neural networks only. We implemented the Speech2GestProp and Speech2GestExist components using dilated CNNs. Their inputs are sequences of aligned speech text and audio frames, and they output a binary vector of gesture properties (for Speech2GestProp) or a binary flag of gesture existence (for Speech2GestExist). By sliding a window over the speech, the networks predict properties frame by frame at 5 fps. Text features were extracted using DistilBERT (Sanh et al., 2019); audio features were log-scaled mel-spectrograms.
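As a rough sketch of this frame-wise prediction, the fragment below applies a toy causal dilated 1-D convolution stack to a sequence of per-frame speech features and squashes the result through a sigmoid, yielding one property probability per frame. The helper names, window of weights, and dilation rates are illustrative assumptions, not the actual implementation.

```python
import math

def dilated_conv1d(frames, weights, dilation):
    """Causal dilated 1-D convolution over per-frame scalar features.
    One tap per kernel position; missing history is zero-padded."""
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            acc += w * (frames[idx] if idx >= 0 else 0.0)
        out.append(acc)
    return out

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_property(frames, layers):
    """Stack dilated conv layers (growing dilation) and squash to [0, 1]."""
    h = frames
    for dilation, weights in layers:
        h = dilated_conv1d(h, weights, dilation)
    return [sigmoid(v) for v in h]

# Toy speech features at 5 fps (one scalar per frame for readability).
speech = [0.0, 0.2, 0.9, 1.0, 0.8, 0.1, 0.0]
layers = [(1, [0.5, 0.5]), (2, [1.0, -0.5])]
probs = predict_property(speech, layers)
gesture_flags = [p > 0.5 for p in probs]  # thresholded, as in Speech2GestExist
```

A real Speech2GestProp would operate on multi-dimensional DistilBERT and mel-spectrogram features with learned weights, but the causal dilated receptive field follows the same pattern.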
3. Preliminary results
We evaluated our model on the SaGA direction-giving dataset (Lücking et al., 2013), designed to contain many representational gestures. The dataset contains audio/video recordings of 25 participants (all German native speakers) describing the same route to other participants, together with detailed annotations of gesture properties.
We considered the following three gesture properties: 1) Phase (preparation, pre-stroke hold, stroke, post-stroke hold, and retraction); 2) Type (deictic, beat, iconic (McNeill, 1992), and discourse); 3) Semantic information (amount, shape, direction, size, as described in (Bergmann and Kopp, 2006)).
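The three annotation inventories above can be vectorized as a multi-hot binary target for training. The snippet below shows one such encoding; the label lists follow the paper, but the encoding scheme itself is an illustrative assumption about the data pipeline.

```python
# Property inventories from the SaGA annotations used in this work.
PHASES = ["preparation", "pre-stroke hold", "stroke", "post-stroke hold", "retraction"]
TYPES = ["deictic", "beat", "iconic", "discourse"]
SEMANTICS = ["amount", "shape", "direction", "size"]
VOCAB = PHASES + TYPES + SEMANTICS  # 13 binary properties in total

def encode_properties(labels):
    """Map a set of annotated property labels to a binary vector over VOCAB."""
    unknown = set(labels) - set(VOCAB)
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    return [1 if name in labels else 0 for name in VOCAB]

# A frame annotated as the stroke of an iconic gesture depicting shape:
vec = encode_properties({"stroke", "iconic", "shape"})
```

Since several properties can be active at once (a stroke frame of an iconic gesture, say), the target is multi-hot rather than one-hot, which is also why Macro F1 is the appropriate metric.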
For each of our experiments we calculated the mean and standard deviation of the F1 score across 20-fold cross-validation. The F1 score is preferable to accuracy here since the data is highly unbalanced, so accuracy does not reflect overall performance well. For gesture category and phase we report the Macro F1 score (Yang and Liu, 1999), since those properties are not mutually exclusive.
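To illustrate why accuracy misleads on unbalanced data, the snippet below compares accuracy with Macro F1 on a toy binary task where a trivial classifier always predicts the majority class. The data split and helper functions are illustrative, not taken from the paper.

```python
def f1(y_true, y_pred, positive):
    """Per-class F1 = harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1(y_true, y_pred, c) for c in classes) / len(classes)

# 90% of frames carry no gesture; a trivial classifier predicts "no" always.
y_true = ["no"] * 90 + ["yes"] * 10
y_pred = ["no"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, ["no", "yes"])
# accuracy is 0.9, yet Macro F1 is only ~0.47: the minority class is ignored.
```

Because Macro F1 averages per-class scores without weighting by class size, a degenerate majority-class predictor cannot score well, which is exactly the behavior we want the metric to penalize.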
First we validated that gesture presence can be predicted from the speech in our dataset: the Macro F1 score we achieved for this binary classification is in line with previous work (Yunus et al., 2019).
Next, we experimented with predicting gesture properties. Table 1 contains results for predicting the gesture category, gesture semantic information, and gesture phase from speech text and audio. We can see that this is a challenging task, but we are still able to predict most of the values better than chance. This was unexpected given how complex gesture semantics tend to be and could be due to the focused scope of the direction-giving task. For a deeper study with more results and analyses, please see the follow-up work (Kucherenko et al., 2021c).
4. Discussion
In this section we discuss the feasibility of the proposed approach. Our proposal to use probabilistic models (especially normalizing flows) is inspired by a recent application of MoGlow (Henter et al., 2020) to perform gesture synthesis by Alexanderson et al. (Alexanderson et al., 2020). They showed that such models can be seamlessly conditioned on various kinematic gesture properties (such as speed, range, and hand height), suggesting that it is possible to condition gestures on semantic properties as well.
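To make the conditioning idea concrete, the sketch below implements a single invertible affine transform whose scale and shift are produced from a gesture-property vector, in the spirit of conditioned normalizing flows. Everything here is a toy: the fixed-weight conditioner stands in for a learned network, and a real MoGlow-style model stacks many such layers with an autoregressive context.

```python
import math

def conditioner(props):
    """Toy mapping from a binary property vector to scale/shift parameters.
    Fixed arithmetic stands in for a learned conditioning network."""
    s = 0.5 + 0.25 * sum(props)              # log-scale grows with active properties
    t = 0.1 * sum(i * p for i, p in enumerate(props))
    return s, t

def forward(x, props):
    """Invertible affine transform z = exp(s) * x + t, conditioned on props."""
    s, t = conditioner(props)
    return [math.exp(s) * xi + t for xi in x]

def inverse(z, props):
    """Exact inverse: conditioning makes the mapping property-dependent."""
    s, t = conditioner(props)
    return [(zi - t) / math.exp(s) for zi in z]

pose = [0.3, -0.2, 0.7]     # a toy pose frame
props = [1, 0, 1]           # e.g. "iconic" and "stroke" active
z = forward(pose, props)
recovered = inverse(z, props)
```

The key point is that the transform stays exactly invertible for any conditioning vector, so likelihood training remains tractable while the generated motion depends on the predicted gesture properties.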
We obtained good results for the gesture-property prediction part of our proposed system, as described in Section 3. Since we can predict several important properties with scores significantly above chance level, we believe that our predictions are reasonable and will be useful for more appropriate gesture synthesis.
Our two-stage approach lets the machine learning model leverage additional information (such as detailed annotation) about human gestures. It also allows direct control of gesture frequency, by adjusting the threshold on the output of Speech2GestExist needed to trigger a gesture. Finally, it helps the model learn from small datasets, since each sub-module has a more straightforward task than learning everything at once and can also be trained separately.
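The frequency-control mechanism can be sketched in a few lines: binarizing per-frame existence probabilities with an adjustable threshold directly trades off how often the agent gestures. The probabilities below are made up for illustration.

```python
def gestures_from_probs(probs, threshold):
    """Binarize per-frame gesture-existence probabilities.
    Hypothetical post-processing step; raising the threshold yields
    fewer, more confident gestures, lowering it yields more gestures."""
    return [p >= threshold for p in probs]

# Toy per-frame probabilities from a Speech2GestExist-like model (5 fps).
probs = [0.1, 0.4, 0.8, 0.9, 0.6, 0.3, 0.7, 0.2]

sparse = gestures_from_probs(probs, 0.75)   # higher bar: fewer gestures
dense = gestures_from_probs(probs, 0.35)    # lower bar: more gestures
```

No retraining is needed to change gesture frequency; the threshold is a free parameter of the deployed system.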
5. Conclusion
We presented a novel gesture-generation framework aiming to bridge the semantic gap between rule-based and data-driven models. Our method first predicts whether a gesture is appropriate at a given point in the speech and, if so, what kind of gesture is appropriate. This prediction is then used to condition the gesture-generation model. Our gesture-property prediction results are promising and indicate that the proposed approach is feasible.
Acknowledgments. The authors are grateful to Stefan Kopp for providing the SaGA dataset and for fruitful discussions about it, and to Olga Abramov for advising on its gesture-property processing. This work was partially supported by the Swedish Foundation for Strategic Research Grant No. RIT15-0107 and by the Wallenberg AI, Autonomous Systems and Software Program (WASP), funded by the Knut and Alice Wallenberg Foundation.
References
- No gestures left behind: learning relationships between spoken language and freeform gestures. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 1884–1895.
- Style-controllable speech-driven gesture synthesis using normalising flows. Computer Graphics Forum 39 (2), pp. 487–496.
- Verbal or visual? How information is distributed across speech and gesture in spatial dialog. In Proceedings of the 10th Workshop on the Semantics and Pragmatics of Dialogue, pp. 90–97.
- GNetIc – using Bayesian decision networks for iconic gesture generation. In International Workshop on Intelligent Virtual Agents, pp. 76–89.
- A virtual agent as vocabulary trainer: iconic gestures help to improve learners' memory performance. In Proceedings of the International Workshop on Intelligent Virtual Agents, pp. 139–148.
- Animated conversation: rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, pp. 413–420.
- Understanding the predictability of gesture parameters from speech and their perceptual importance. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, pp. 1–8.
- MoGlow: probabilistic and controllable motion synthesis using normalising flows. ACM Transactions on Graphics 39, pp. 236:1–236:14.
- Gesture: visible action as utterance. Cambridge University Press.
- Nonverbal communication in human interaction. Wadsworth, Cengage Learning.
- Normalizing flows: an introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Synthesizing multimodal utterances for conversational agents. Computer Animation and Virtual Worlds 15 (1), pp. 39–52.
- Moving fast and slow: analysis of representations and post-processing in speech-driven automatic gesture generation. International Journal of Human–Computer Interaction, pp. 1–17.
- Gesticulator: a framework for semantically-aware speech-driven gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction, pp. 242–250.
- A large, crowdsourced evaluation of gesture generation systems on common data: the GENEA Challenge 2020. In Proceedings of the 26th International Conference on Intelligent User Interfaces, pp. 11–21.
- Multimodal analysis of the predictability of hand-gesture properties. arXiv preprint arXiv:2108.05762.
- Data-based analysis of speech and gesture: the Bielefeld speech and gesture alignment corpus (SaGA) and its applications. Journal on Multimodal User Interfaces 7 (1), pp. 5–18.
- Virtual character performance from speech. In Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 25–35.
- Hand and mind: what gestures reveal about thought. University of Chicago Press.
- Gesture modeling and animation based on a probabilistic re-creation of speaker style. ACM Transactions on Graphics 27 (1).
- Synchronized gesture and speech production for humanoid robots. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4617–4624.
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing.
- CMCF: an architecture for realtime gesture generation by clustering gestures by motion and communicative function. In Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems, pp. 1136–1144.
- Effects of virtual human animation on emotion contagion in simulated inter-personal experiences. IEEE Transactions on Visualization and Computer Graphics 20 (4), pp. 626–635.
- A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42–49.
- Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39, pp. 222:1–222:16.
- Robots learn social skills: end-to-end learning of co-speech gesture generation for humanoid robots. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 4303–4309.
- Gesture class prediction by recurrent neural network and attention mechanism. In Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, pp. 233–235.
- Sequence-to-sequence predictive model: from prosody to communicative gestures. In Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management. Human Body, Motion and Behavior, Cham, pp. 355–374.