
Speech2Properties2Gestures: Gesture-Property Prediction as a Tool for Generating Representational Gestures from Speech

We propose a new framework for gesture generation, aiming to allow data-driven approaches to produce more semantically rich gestures. Our approach first predicts whether to gesture, followed by a prediction of the gesture properties. Those properties are then used as conditioning for a modern probabilistic gesture-generation model capable of high-quality output. This empowers the approach to generate gestures that are both diverse and representational. Follow-ups and more information can be found on the project page.



1. Introduction and Background

A large part of human communication is non-verbal (Knapp et al., 2013) and often takes place through co-speech gestures (McNeill, 1992; Kendon, 2004). Co-speech gesture behavior in embodied agents has been shown to help with learning tasks (Bergmann and Macedonia, 2013) and lead to greater emotional response (Wu et al., 2014). Gesture generation is hence an important part of both automated character animation and human-agent interaction.

Early dominance of rule-based approaches (Cassell et al., 1994; Kopp and Wachsmuth, 2004; Ng-Thow-Hing et al., 2010; Marsella et al., 2013) has been challenged by data-driven gesture generation systems (Neff et al., 2008; Bergmann and Kopp, 2009; Yoon et al., 2019; Kucherenko et al., 2020; Yoon et al., 2020; Ahuja et al., 2020; Ferstl et al., 2020). These latter systems first only considered a single speech modality (either audio or text) (Neff et al., 2008; Bergmann and Kopp, 2009; Yoon et al., 2019; Kucherenko et al., 2021a), but are now shifting to use both audio and text together (Kucherenko et al., 2020; Yoon et al., 2020; Ahuja et al., 2020).

While rule-based systems provide control over the communicative function of output gestures, they lack variability and require much manual effort to design. Data-driven systems, on the other hand, need less manual work and are very flexible, but most existing systems do not provide much control over communicative function, and the generated gestures have little relation to speech content (Kucherenko et al., 2021b).

Property (metric)              Label       Relative frequency   RandomGuess (mean ± std)   ProposedModel (mean ± std)
Gesture category (Macro F1)    deictic     29.05%               50% ± 2%                   60% ± 6%
                               beat        14.47%               50% ± 2%                   53% ± 6%
                               iconic      72.03%               50% ± 1.5%                 63% ± 5%
                               discourse   12.78%               50% ± 2%                   59% ± 7%
Gesture semantics (Macro F1)   amount      4.7%                 49% ± 1%                   63% ± 8%
                               shape       13.1%                49% ± 2%                   65% ± 6%
                               direction   13.7%                49% ± 2%                   62% ± 8%
                               size        1.9%                 50% ± 1%                   59% ± 9%
Gesture phase (F1)             pre-hold    0.6%                 1.3% ± 4%                  0.5% ± 1.3%
                               post-hold   12.2%                12% ± 4%                   23% ± 12%
                               stroke      40.9%                42% ± 4%                   47% ± 10%
                               retr.       14.8%                14% ± 5%                   25% ± 5%
                               prep.       30.8%                30% ± 3%                   45% ± 6%

Table 1. Gesture-property prediction scores for random guessing and our trained predictors using text and audio modalities together. Bold, coloured numbers indicate that the label in question can be predicted better than chance.

This paper continues recent efforts to bridge the gap between the two paradigms (Ferstl et al., 2020; Saund et al., 2021; Yunus et al., 2021). The most similar prior work is that of Yunus et al. (2021), where gesture timing and duration were predicted from acoustic features only. The method proposed here differs from their approach in three ways: 1) it considers not only audio but also text as input; 2) it models not only gesture phase, but multiple gesture properties; 3) it also provides a framework for integrating these gesture properties in a data-driven gesture-generation system.

The proposed approach helps decouple different aspects of gesticulation and can leverage database information about gesture timing and content with modern, high-quality data-driven animation.

2. Proposed Model

Our unified model uses speech text and audio as input to generate gestures as a sequence of 3D poses. As depicted in Figure 1, it is composed of three neural networks:

  1. Speech2GestExist: A temporal CNN which takes speech as input and returns a binary flag indicating if the agent should gesture (similar to Yunus et al., 2019);

  2. Speech2GestProp: A temporal CNN which takes speech as input and predicts a set of binary gesture properties, such as gesture type, gesture phase, etc.;

  3. GestureFlow: A normalizing flow (Kobyzev et al., 2020) that takes both speech and predicted gesture properties as input, and describes a probability distribution over 3D poses, from which motion sequences can be sampled.

In this study, we experiment with the first two neural networks only. We implemented the Speech2GestProp and Speech2GestExist components using dilated CNNs. Their inputs are sequences of aligned speech text and audio frames, and they return a binary vector of gesture properties (for Speech2GestProp) or a binary flag of gesture existence (for Speech2GestExist) as their output. By sliding a window over the speech and predicting poses, frame-by-frame properties are generated at 5 fps. Text features were extracted using DistilBERT (Sanh et al., 2019). Audio features were log-scaled mel-spectrograms.
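The sliding-window step can be illustrated with a minimal sketch. Only the 5 fps output rate comes from the text above; the 20 fps feature rate, the window length of 8 frames, and the `toy_predictor` stand-in for a trained network are hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence


def sliding_windows(frames: Sequence, window: int, hop: int) -> List[Sequence]:
    """Return every full window of `window` frames, advancing `hop` frames per step."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    return [frames[i:i + window] for i in range(0, len(frames) - window + 1, hop)]


def predict_sequence(frames: Sequence, feature_fps: int, output_fps: int,
                     window: int, predict_window: Callable) -> List:
    """Apply `predict_window` to windows spaced so that outputs arrive at `output_fps`."""
    if feature_fps % output_fps != 0:
        raise ValueError("feature_fps must be a multiple of output_fps")
    hop = feature_fps // output_fps  # e.g. 20 fps features -> hop of 4 for 5 fps output
    return [predict_window(w) for w in sliding_windows(frames, window, hop)]


def toy_predictor(window_frames: Sequence[float]) -> List[int]:
    """Hypothetical stand-in for a trained network: flags windows with a high mean."""
    mean = sum(window_frames) / len(window_frames)
    return [1 if mean > 20 else 0]


# Two seconds of (assumed) 20 fps scalar features, one prediction per 5 fps step.
frames = list(range(40))
predictions = predict_sequence(frames, feature_fps=20, output_fps=5,
                               window=8, predict_window=toy_predictor)
print(len(predictions), predictions)
```

In a real system `predict_window` would wrap the trained network and each frame would be a vector of concatenated text and audio features rather than a scalar.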

3. Preliminary results


We evaluated our model on the SaGA direction-giving dataset (Lücking et al., 2013), which was designed to contain many representational gestures. The dataset contains audio/video recordings of 25 participants (all German native speakers) describing the same route to other participants and includes detailed annotations of gesture properties.

We considered the following three gesture properties: 1) Phase (preparation, pre-stroke hold, stroke, post-stroke hold, and retraction); 2) Type (deictic, beat, iconic (McNeill, 1992), and discourse); 3) Semantic information (amount, shape, direction, size, as described in (Bergmann and Kopp, 2006)).
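As an illustration of how such annotations could be turned into the binary property vector that Speech2GestProp predicts, the sketch below encodes the labels active in one frame. The label names are taken from the list above, but the ordering, the single 13-dimensional layout, and the `encode_properties` helper are assumptions; the text does not specify the encoding actually used.

```python
from typing import Iterable, List

# Label inventories as listed in the text; the ordering is an assumption.
PHASES = ("preparation", "pre-stroke hold", "stroke", "post-stroke hold", "retraction")
TYPES = ("deictic", "beat", "iconic", "discourse")
SEMANTICS = ("amount", "shape", "direction", "size")
ALL_LABELS = PHASES + TYPES + SEMANTICS


def encode_properties(active: Iterable[str]) -> List[int]:
    """Map the labels active in one frame to a binary vector over ALL_LABELS."""
    active_set = set(active)
    unknown = active_set - set(ALL_LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in active_set else 0 for label in ALL_LABELS]


# A frame in the stroke phase of an iconic gesture that conveys shape.
vector = encode_properties({"stroke", "iconic", "shape"})
print(vector)
```

Because type and semantic labels can co-occur, several entries of the vector may be set in the same frame.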

Experimental Results

For each of our experiments we calculated the mean and standard deviation of the F1 score across 20-fold cross-validation. The F1 score is preferable over accuracy here since the data is highly unbalanced and accuracy does not represent overall performance well. For gesture category and semantics we report the Macro F1 score (Yang and Liu, 1999), since those properties are not mutually exclusive.
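The point about unbalanced data can be made concrete with a small sketch. It assumes that Macro F1 for a binary label is the unweighted mean of the F1 scores of the positive and the negative class; the helper names and the toy data are illustrative and not taken from the experiments.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 for one class, treating `positive` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions; also avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], classes=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_score(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy label present in 5% of frames, and a predictor that never fires.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

print(accuracy(y_true, y_pred))   # high, although the label is never detected
print(f1_score(y_true, y_pred))   # F1 of the positive class
print(macro_f1(y_true, y_pred))   # mean over both classes
```

A predictor that never detects a rare label still reaches high accuracy, whereas its F1 on that label is zero and its Macro F1 stays below 0.5.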

First we validated that gesture presence can be predicted from the speech in our dataset. We achieved a Macro F1 score for binary classification that aligns with previous work (Yunus et al., 2019).

Next, we experimented with predicting gesture properties. Table 1 contains results for predicting the gesture category, gesture semantic information, and gesture phase from speech text and audio. We can see that this is a challenging task, but we are still able to predict most of the values better than chance. This was unexpected given how complex gesture semantics tend to be and could be due to the focused scope of the direction-giving task. For a deeper study with more results and analyses, please see the follow-up work (Kucherenko et al., 2021c).
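One way to read the RandomGuess row of Table 1 is sketched below. It assumes a baseline that guesses each label independently of the input, with a probability equal to the label's relative frequency; the text does not state how the baseline was constructed, so this is only a plausibility check. The formula uses expected counts, which approximates the expected F1 rather than computing it exactly.

```python
def expected_f1(p_true: float, p_guess: float) -> float:
    """Approximate F1 of a guesser that predicts positive with probability `p_guess`,
    independently of the truth, for a label with relative frequency `p_true`."""
    tp = p_true * p_guess          # expected fraction of true positives
    fp = (1 - p_true) * p_guess    # expected fraction of false positives
    fn = p_true * (1 - p_guess)    # expected fraction of false negatives
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def expected_macro_f1(p_true: float, p_guess: float) -> float:
    """Mean of the approximate F1 for the positive and the negative class."""
    positive = expected_f1(p_true, p_guess)
    negative = expected_f1(1 - p_true, 1 - p_guess)
    return (positive + negative) / 2


stroke_frequency = 0.409   # relative frequency of "stroke" in Table 1
deictic_frequency = 0.2905  # relative frequency of "deictic" in Table 1

print(expected_f1(stroke_frequency, stroke_frequency))
print(expected_macro_f1(deictic_frequency, deictic_frequency))
```

Under this assumption the plain F1 of a frequency-matched guesser is close to the label frequency, and its Macro F1 is close to 0.5 regardless of frequency, which is consistent with the pattern of the RandomGuess values for gesture phase and for gesture category and semantics, respectively.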

4. Discussion

In this section we discuss the feasibility of the proposed approach. Our proposal to use probabilistic models (especially normalizing flows) is inspired by the recent application of MoGlow (Henter et al., 2020) to gesture synthesis by Alexanderson et al. (2020). They showed that such models can be seamlessly conditioned on various kinematic gesture properties (such as speed, range, and hand height), suggesting that it is possible to condition gestures on semantic properties as well.

We obtained good results for the gesture-property prediction part of our proposed system, as described in Section 3. Since we can predict several important properties with scores significantly above chance level, we believe that our predictions are reasonable and will be useful for more appropriate gesture synthesis.

Our two-stage approach lets the machine learning model leverage additional information (such as detailed annotation) about human gestures. It also allows direct control of gesture frequency, by adjusting the threshold on the output of Speech2GestExist needed to trigger a gesture. Finally, it helps the model learn from small datasets, since each sub-module has a more straightforward task than learning everything at once and can also be trained separately.
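The frequency-control idea can be sketched as follows, assuming that Speech2GestExist emits one probability in [0, 1] per frame before the binary flag is taken. The probabilities and helper names below are made up for illustration.

```python
from typing import List, Sequence


def gesture_flags(probabilities: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Turn per-frame gesture probabilities into binary gesture/no-gesture flags."""
    return [p >= threshold for p in probabilities]


def gesture_rate(probabilities: Sequence[float], threshold: float) -> float:
    """Fraction of frames in which a gesture would be triggered."""
    flags = gesture_flags(probabilities, threshold)
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical per-frame outputs of a gesture-existence predictor.
probs = [0.1, 0.4, 0.55, 0.7, 0.9, 0.3, 0.65, 0.2]

for threshold in (0.25, 0.5, 0.8):
    print(threshold, gesture_rate(probs, threshold))
```

Raising the threshold yields a more reserved agent and lowering it a more animated one, without retraining either predictor.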

5. Conclusion

We presented a novel gesture generation framework aiming to bridge the semantic gap between rule-based and data-driven models. Our method first predicts if a gesture is appropriate for a given point in the speech and what kind of gesture is appropriate. Once this prediction is made, it is used to condition the gesture generation model. Our gesture-property prediction results are promising and indicate that the proposed approach is feasible.

The authors are grateful to Stefan Kopp for providing the SaGA dataset and fruitful discussions about it and to Olga Abramov for advising on its gesture-property processing. This work was partially supported by the Swedish Foundation for Strategic Research Grant No. RIT15-0107 and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.


References

  • C. Ahuja, D. W. Lee, R. Ishii, and L. Morency (2020) No gestures left behind: learning relationships between spoken language and freeform gestures. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 1884–1895.
  • S. Alexanderson, G. E. Henter, T. Kucherenko, and J. Beskow (2020) Style-controllable speech-driven gesture synthesis using normalising flows. Computer Graphics Forum 39 (2), pp. 487–496.
  • K. Bergmann and S. Kopp (2006) Verbal or visual? how information is distributed across speech and gesture in spatial dialog. In Proceedings of the 10th Workshop on the Semantics and Pragmatics of Dialogue, pp. 90–97.
  • K. Bergmann and S. Kopp (2009) GNetIc–using Bayesian decision networks for iconic gesture generation. In International Workshop on Intelligent Virtual Agents, pp. 76–89.
  • K. Bergmann and M. Macedonia (2013) A virtual agent as vocabulary trainer: iconic gestures help to improve learners’ memory performance. In Proceedings of the International Workshop on Intelligent Virtual Agents, pp. 139–148.
  • J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, T. Becket, B. Douville, S. Prevost, and M. Stone (1994) Animated conversation: rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques, pp. 413–420.
  • Y. Ferstl, M. Neff, and R. McDonnell (2020) Understanding the predictability of gesture parameters from speech and their perceptual importance. In Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, pp. 1–8.
  • G. E. Henter, S. Alexanderson, and J. Beskow (2020) MoGlow: Probabilistic and controllable motion synthesis using normalising flows. ACM Transactions on Graphics 39, pp. 236:1–236:14.
  • A. Kendon (2004) Gesture: visible action as utterance. Cambridge University Press.
  • M. L. Knapp, J. A. Hall, and T. G. Horgan (2013) Nonverbal communication in human interaction. Wadsworth, Cengage Learning.
  • I. Kobyzev, S. Prince, and M. Brubaker (2020) Normalizing flows: an introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • S. Kopp and I. Wachsmuth (2004) Synthesizing multimodal utterances for conversational agents. Computer Animation and Virtual Worlds 15 (1), pp. 39–52.
  • T. Kucherenko, D. Hasegawa, N. Kaneko, G. E. Henter, and H. Kjellström (2021a) Moving fast and slow: analysis of representations and post-processing in speech-driven automatic gesture generation. International Journal of Human–Computer Interaction, pp. 1–17.
  • T. Kucherenko, P. Jonell, S. van Waveren, G. E. Henter, S. Alexanderson, I. Leite, and H. Kjellström (2020) Gesticulator: a framework for semantically-aware speech-driven gesture generation. In Proceedings of the ACM International Conference on Multimodal Interaction, pp. 242–250.
  • T. Kucherenko, P. Jonell, Y. Yoon, P. Wolfert, and G. E. Henter (2021b) A large, crowdsourced evaluation of gesture generation systems on common data: the GENEA Challenge 2020. In Proceedings of the 26th International Conference on Intelligent User Interfaces, pp. 11–21. ISBN 9781450380171.
  • T. Kucherenko, R. Nagy, M. Neff, H. Kjellström, and G. E. Henter (2021c) Multimodal analysis of the predictability of hand-gesture properties. arXiv preprint arXiv:2108.05762.
  • A. Lücking, K. Bergmann, F. Hahn, S. Kopp, and H. Rieser (2013) Data-based analysis of speech and gesture: the Bielefeld speech and gesture alignment corpus (SaGA) and its applications. Journal on Multimodal User Interfaces 7 (1), pp. 5–18.
  • S. Marsella, Y. Xu, M. Lhommet, A. Feng, S. Scherer, and A. Shapiro (2013) Virtual character performance from speech. In Proceedings of the 12th ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 25–35.
  • D. McNeill (1992) Hand and mind: what gestures reveal about thought. University of Chicago Press.
  • M. Neff, M. Kipp, I. Albrecht, and H. Seidel (2008) Gesture modeling and animation based on a probabilistic re-creation of speaker style. ACM Transactions on Graphics 27 (1).
  • V. Ng-Thow-Hing, P. Luo, and S. Okita (2010) Synchronized gesture and speech production for humanoid robots. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4617–4624.
  • V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. In Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing.
  • C. Saund, A. Bîrlădeanu, and S. Marsella (2021) CMCF: an architecture for realtime gesture generation by clustering gestures by motion and communicative function. In Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems, pp. 1136–1144.
  • Y. Wu, S. V. Babu, R. Armstrong, J. W. Bertrand, J. Luo, T. Roy, S. B. Daily, L. C. Dukes, L. F. Hodges, and T. Fasolino (2014) Effects of virtual human animation on emotion contagion in simulated inter-personal experiences. IEEE Transactions on Visualization and Computer Graphics 20 (4), pp. 626–635.
  • Y. Yang and X. Liu (1999) A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42–49.
  • Y. Yoon, B. Cha, J. Lee, M. Jang, J. Lee, J. Kim, and G. Lee (2020) Speech gesture generation from the trimodal context of text, audio, and speaker identity. ACM Transactions on Graphics 39, pp. 222:1–222:16.
  • Y. Yoon, W. Ko, M. Jang, J. Lee, J. Kim, and G. Lee (2019) Robots learn social skills: end-to-end learning of co-speech gesture generation for humanoid robots. In Proceedings of the IEEE International Conference on Robotics and Automation, pp. 4303–4309.
  • F. Yunus, C. Clavel, and C. Pelachaud (2019) Gesture class prediction by recurrent neural network and attention mechanism. In Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, pp. 233–235.
  • F. Yunus, C. Clavel, and C. Pelachaud (2021) Sequence-to-sequence predictive model: from prosody to communicative gestures. In Digital Human Modeling and Applications in Health, Safety, Ergonomics and Risk Management. Human Body, Motion and Behavior, Cham, pp. 355–374. ISBN 978-3-030-77817-0.