Gesticulator: A framework for semantically-aware speech-driven gesture generation

01/25/2020 ∙ by Taras Kucherenko, et al. ∙ KTH Royal Institute of Technology

During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current data-driven co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying “high”): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. We illustrate the model's efficacy with subjective and objective evaluations.




1 Introduction

While speaking, people spontaneously produce co-speech gestures. Some gestures are linked to acoustics, while others are closely tied to what is being said [17]. To accurately simulate human interactions where nonverbal behavior plays a key role in conveying information, realistic gestures are crucial [10]. Virtual agents have been shown to be more engaging when their verbal behavior is accompanied by appropriate nonverbal behavior [27]. However, prior work on gesture generation has mainly used a single input modality: either acoustics or semantics, which restricts the range of gestures that can be modeled. This work uses both input modalities to allow for semantically-aware speech-driven gesture generation.

With recent advances in deep learning, gesture generation for virtual agents has increasingly shifted from rule-based systems [3, 28] to data-driven approaches [30, 18]. While early work framed gesture generation as a classification task, aiming to decide among fixed gesture classes [23, 4], more recent work has viewed it as a regression task, aiming to produce continuous motion. This work focuses on continuous gesture generation using a data-driven approach. Our contributions are as follows:

Figure 1: Overview of our autoregressive model.

  1. The first data-driven model that maps speech acoustic and semantic features to continuous 3D gestures;

  2. A comparison that contrasts the effects of different architectures and important modelling choices;

  3. An evaluation of the effect of the two modalities of the speech – audio and semantics – on the resulting gestures in terms of objective measures (e.g., motion statistics) and observers’ subjective perceptions.

Video samples from our evaluations are provided anonymously at

2 Background

While there are several theories on how gestures are produced by humans [21, 5, 2], there is a consensus that speech and gestures correlate strongly [15, 20, 25, 13]. In this section, we review some concepts relevant to our work, namely gesture classification, the gesture-generation problem formulation and the temporal alignment between gestures and speech.

2.1 Co-Speech Gesture Types

Our work is informed by the gesture classification by McNeill [21], who distinguished the following gesture types:

  1. Iconic gestures represent some aspect of the scene being described in speech;

  2. Metaphoric gestures represent an abstract concept;

  3. Deictic gestures point to an object or orientation;

  4. Beat gestures are used for emphasis and usually correlate with the speech prosody (e.g., intonation and loudness).

The first three gesture types depend on the content of the speech – its semantics – while the last type only depends on the audio signal – the acoustics. Hence, systems that ignore either aspect of speech can only learn to model a subset of human co-speech gesticulation.

2.2 Gesture Generation

We frame the problem of speech-driven gesture generation as follows: given a sequence of speech features $\boldsymbol{s} = [\boldsymbol{s}_t]_{t=1:T}$, the task is to generate a corresponding pose sequence $\hat{\boldsymbol{g}} = [\hat{\boldsymbol{g}}_t]_{t=1:T}$ of gestures that an agent might perform while uttering this speech. Here, $t=1:T$ means that we consider a sequence of vectors for $t$ in 1 to $T$.

Each speech segment $\boldsymbol{s}_t$ is represented by several different features, such as acoustic features (e.g., spectrograms), semantic features (e.g., word embeddings) or a combination of the two. The ground-truth pose $\boldsymbol{g}_t$ and the predicted pose $\hat{\boldsymbol{g}}_t$ at the same time instance $t$ can be represented in 3D space as a sequence of joint rotations: $\boldsymbol{g}_t = [\alpha_{i,t}, \beta_{i,t}, \gamma_{i,t}]_{i=1:n}$, with $n$ being the number of keypoints of the human body and $\alpha$, $\beta$, and $\gamma$ representing rotations in three axes.

2.3 Gesture-Speech Alignment

Gesture-speech alignment is an active research field covering several languages, including French [7], German [1], and English [20, 25, 13]. We focus on prior work on gesture-speech alignment for the English language.

In English, gestures typically lead the corresponding speech by, on average, 0.22 s (std 0.13 s) [20]; specifically, Pouw et al. [25] aligned different gesture types with the peak pitch of the speech audio and found that the onset of beat gestures usually precedes the corresponding speech by 0.35 s (std 0.3), the onset of iconic gestures precedes speech by 0.45 s (std 0.4), and the onset of pointing gestures precedes speech by 0.38 s (std 0.4).

Informed by these works, we take the widest range among the studies, plus some margin, for the time-span of the speech used to predict the corresponding gesture, and consider 1 s of the future speech and 0.5 s of the past speech as input to our model, as described in Sec. 5.

3 Related Work

As this work contributes toward data-driven gesture generation, we confine our review to these methods.

Most prior work on data-driven gesture generation has used the audio signal as the only speech-input modality in the model. For example, Sadoughi and Busso (2017) trained a probabilistic graphical model to generate a discrete set of gestures based on the speech audio signal and discourse function. Hasegawa et al. (2018) developed a more general model capable of generating arbitrary 3D motion using a deep recurrent neural network, applying smoothing as a postprocessing step. Kucherenko et al. (2019) extended this work by applying representation learning to the human pose and reducing the need for smoothing. Recently, Ginosar et al. [9] applied a convolutional neural network to generate 2D poses from spectrogram features. However, driving either virtual avatars or humanoid robots requires 3D joint angles. Our model differs from these systems in that it leverages both the audio signal and the text transcription for gesture generation, modeling a wider range of gestures.

Several recent works mapped from text transcripts to co-speech gestures. Ishi et al. (2018) generated gestures from text input through a series of probabilistic functions: words were mapped to word concepts using WordNet [22], which were then mapped to a gesture function (e.g., iconic or beat), which in turn was mapped to clusters of 3D hand gestures. Yoon et al. (2018) learned a mapping from the utterance text to gestures using a recurrent neural network. The produced gestures were aligned with audio in a post-processing step. Although these works capture important information from text transcriptions, they may fail to reflect the strong link between gestures and speech acoustics such as intonation, prosody, and loudness [26].

Only a handful of works have used multiple modalities of the speech to predict matching gestures. The model in Neff et al. (2008) predicted gestures based on text, theme, rheme, and utterance focus. They also incorporated text-to-concept mapping. Concepts were then mapped to a set of 28 discrete gestures in a speaker-dependent manner. Chiu et al. [4] used both audio signals and text transcripts as input to predict a total of 12 gesture classes using deep learning. Our approach differs from these works, as we aim to generate a wider range of gestures: rather than predicting a discrete gesture class, our model produces arbitrary gestures as a sequence of 3D poses.

4 Training and Test Data

We develop our gesture generation model using machine learning: we learn a gesture estimator based on a dataset of human gesticulation, where we have both speech information (acoustic and semantic) and gesture data. For this work, we specifically used the Trinity Gesture Dataset [8], comprising 244 minutes of audio and motion capture recordings of a male actor speaking freely on a variety of topics. We removed lower-body data, retaining 15 upper-body joints out of the original 69. Fingers were not modelled due to poor data quality.

To obtain semantic information for the speech, we first transcribed the audio recordings using Google Cloud automatic speech recognition (ASR), followed by thorough manual review to correct recognition errors and add punctuation. We intend to share this data to facilitate further research.

4.1 Test-Segment Selection

Two 10-minute recordings from the dataset were held out from training. We selected 50 segments of 10 s for testing: 30 random segments and 20 semantic segments, in which speech and recorded gestures were semantically linked. Three human annotators marked time instants where the recorded gesture was semantically linked with the speech content. Instances where all three annotators agreed (with 5 s tolerance) were used as semantic segments in our experiments.

4.2 Audio-Text Alignment

Figure 2: Encoding text as frame-level features. First, the sentence (omitting filler words) is encoded by BERT [6]. We thereafter repeat each vector according to the duration of the corresponding word. Filler words and silence are encoded as fixed vectors, here denoted Vf and Vs.

Text transcriptions and audio typically have different sequence lengths. To overcome this, we encode words into frame-level features as illustrated in Figure 2. First, the sentence, excluding filler words, is encoded by BERT [6], which is the state-of-the-art model for many tasks in natural language processing (NLP). We encode filler words and silence, which do not contain semantic information, as special, fixed vectors $V_f$ and $V_s$, respectively. Filler words typically indicate a thinking process and can occur with a variety of gestures. Therefore, we set the text feature vector during filler words equal to the average of the feature vectors for the most common filler words in the data. Silence typically has no gesticulation [12], so the silence feature vector was made distinct from all other encodings, by setting all elements equal to −15. Finally, we use timings from the ASR system to nonuniformly upsample the text features, such that both text and audio feature sequences have the same length and timings. This is a standard text-speech alignment method in the closely-related field of speech synthesis [29].
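The upsampling step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the token tuple layout, and the rounding of word boundaries to frame indices are assumptions, and the embeddings are tiny placeholders rather than real BERT vectors.

```python
from typing import List, Optional, Tuple

FPS = 20               # frame rate used throughout the paper
SILENCE_VALUE = -15.0  # every element of the silence vector, as stated in the text

# One timed token: (start_s, end_s, embedding), with None marking a filler word.
# This tuple layout is hypothetical.
Token = Tuple[float, float, Optional[List[float]]]


def frame_level_features(tokens: List[Token], total_s: float,
                         filler_vec: List[float], fps: int = FPS) -> List[List[float]]:
    """Repeat each word vector over the frames that the word spans."""
    dim = len(filler_vec)
    n_frames = round(total_s * fps)
    # Start from silence everywhere, then overwrite the frames covered by words.
    frames = [[SILENCE_VALUE] * dim for _ in range(n_frames)]
    for start_s, end_s, embedding in tokens:
        vec = filler_vec if embedding is None else embedding
        first = round(start_s * fps)
        last = min(round(end_s * fps), n_frames)
        for f in range(first, last):
            frames[f] = list(vec)
    return frames


# A 0.5 s utterance: one word, a short pause, one filler word, trailing silence.
frames = frame_level_features(
    [(0.0, 0.2, [1.0, 2.0]), (0.3, 0.4, None)], total_s=0.5, filler_vec=[0.5, 0.5]
)
```

Because every frame receives exactly one vector, the resulting sequence has the same length as an audio feature sequence computed at the same frame rate.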

5 Speech-Driven Gesture Generation

This section describes our proposed method for generating upper-body motion from speech acoustics and semantics.

5.1 Feature Types

We base our features on the state of the art in speech audio and text processing. The motion representation also follows common practice. Throughout our experiments, we use frame-synchronised features at 20 fps.

Like previous research in gesture generation [9, 8], we represent speech audio by log-power mel-spectrogram features. For this, we extracted 64-dimensional acoustic feature vectors using a window length of 0.1 s and hop length 0.05 s (giving 20 fps).

For semantic features, we use BERT [6] pretrained on English Wikipedia: each sentence of the transcription is encoded by BERT resulting in 768 features per word, aligned with the audio as described in Sec. 4.2. We supplement these by five frame-wise scalar features that depend on the alignment between text and speech, listed in Table 1.

BERT encoding of current word
Time elapsed from the beginning of the word (in seconds)
Time left until the end of the word (in seconds)
Duration of this word (in seconds)
Relative progress through the word (in %)
Speaking rate of this word (in syllables/second)
Table 1: Text and duration features for each frame.
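As an illustration of the five scalar features in Table 1, the sketch below computes them for a single frame. The function name and argument layout are hypothetical, and deriving the speaking rate from a per-word syllable count is an assumption, since the text does not say how the rate is obtained.

```python
from typing import List


def word_timing_features(t: float, start_s: float, end_s: float,
                         n_syllables: int) -> List[float]:
    """Scalar timing features for a frame at time t inside a word.

    Returns [elapsed (s), remaining (s), duration (s), progress (%), rate (syll/s)].
    """
    duration = end_s - start_s
    elapsed = t - start_s
    remaining = end_s - t
    progress = 100.0 * elapsed / duration
    rate = n_syllables / duration
    return [elapsed, remaining, duration, progress, rate]


# A frame halfway through a two-syllable word spoken between 1.0 s and 1.5 s.
features = word_timing_features(t=1.25, start_s=1.0, end_s=1.5, n_syllables=2)
```

These values would be appended to the 768-dimensional BERT vector of the current word to form the per-frame text feature.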

To extract motion features, the motion-capture data was downsampled to 20 fps and the joint angles were converted to an exponential map representation [11] relative to a T-pose; this is common in computer animation. We verified that the resulting features did not contain any discontinuities. Thereafter, we reduced the dimensionality by applying PCA and keeping 95% of the variance of the training data, as in [30]. This resulted in 12 components.
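The rule of keeping 95% of the variance can be made concrete with a short sketch. It assumes the PCA eigenvalues (per-component variances) are already available; the function name and the example spectrum are made up for illustration and are not taken from the dataset.

```python
from typing import Sequence


def components_for_variance(eigenvalues: Sequence[float], keep: float = 0.95) -> int:
    """Smallest number of leading components whose variance share reaches `keep`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= keep:
            return k
    return len(ordered)


# Hypothetical spectrum: the first three components explain 96% of the variance.
n_components = components_for_variance([6.0, 3.0, 0.6, 0.3, 0.1])
```

Applied to the training poses in the paper, the same criterion yields the 12 components mentioned above.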

5.2 Model Architecture

Figure 3: Our model architecture. Text and audio features are encoded for each frame and the encodings concatenated. Then, several fully-connected layers are applied. The output pose is fed back into the model in an autoregressive fashion via FiLM conditioning.

Figure 3 illustrates our model architecture. First, the text and audio features of each frame are jointly encoded by a feed-forward neural network to reduce dimensionality. To provide more input context for predicting the current frame, we pass a sliding window spanning 0.5 s (10 frames) of past speech and 1 s (20 frames) of future speech features over the encoded feature vectors. These time spans are grounded in the research on gesture-speech alignment, as described in Sec. 2.3. The encodings inside the context window are concatenated into a long vector and passed through several fully-connected layers. The model is also autoregressive: we feed preceding model predictions back to the model, as can be seen in the figure, to ensure motion continuity. To condition on the information from the previous poses, we use FiLM conditioning [24], which generalizes regular concatenation. FiLM applies element-wise affine transforms $\mathrm{FiLM}(\boldsymbol{x}) = \boldsymbol{a} \odot \boldsymbol{x} + \boldsymbol{b}$ to network activations $\boldsymbol{x}$, where the scaling and offset vectors $\boldsymbol{a}$ and $\boldsymbol{b}$ are produced by a neural net taking other information (here previous poses) as input. The final layer of the model and of the conditioning network for FiLM are linear so as not to restrict the attainable output range.
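The two mechanisms in this paragraph, context-window concatenation and FiLM, can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the window is taken to cover frames t-10 up to t+19 (so the current frame counts among the 20 "future" frames, which gives the 30 frames reported in Sec. 5.4), and the scale and offset are passed in directly instead of being produced by a conditioning network.

```python
from typing import List

PAST_FRAMES = 10    # 0.5 s at 20 fps
FUTURE_FRAMES = 20  # 1 s at 20 fps


def context_window(encoded: List[List[float]], t: int,
                   past: int = PAST_FRAMES, future: int = FUTURE_FRAMES) -> List[float]:
    """Concatenate the per-frame encodings for frames t-past .. t+future-1."""
    if t - past < 0 or t + future > len(encoded):
        raise ValueError("not enough context around frame t")
    flat: List[float] = []
    for frame in encoded[t - past:t + future]:
        flat.extend(frame)
    return flat


def film(activations: List[float], scale: List[float], offset: List[float]) -> List[float]:
    """Element-wise affine transform: scale * activation + offset."""
    return [s * x + o for x, s, o in zip(activations, scale, offset)]


# 70 frames of placeholder 124-dimensional encodings (frame i is filled with i).
encoded = [[float(i)] * 124 for i in range(70)]
window = context_window(encoded, t=10)
modulated = film([1.0, 2.0], scale=[2.0, 0.5], offset=[0.0, 1.0])
```

Setting every scale to 1 reduces FiLM to adding a learned offset, which has the same effect as concatenating the conditioning input before a linear layer; this is the sense in which FiLM generalizes concatenation.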

5.3 Training Procedure

We train our model on sequences of aligned speech audio, text, and gestures from the dataset. Each training sequence contains 70 consecutive frames from a larger recording. The first 10 and the last 20 frames establish context for the sliding window, while the 40 central frames are used for training. The model is optimized end-to-end using stochastic gradient descent (SGD) with the Adam optimizer to minimize the loss function

$$\mathrm{loss}(\boldsymbol{g}, \hat{\boldsymbol{g}}) = \mathrm{MSE}(\boldsymbol{g}, \hat{\boldsymbol{g}}) + \lambda \, \mathrm{MSE}(\Delta\boldsymbol{g}, \Delta\hat{\boldsymbol{g}}),$$

where $\boldsymbol{g}$ and $\Delta\boldsymbol{g}$ are the ground-truth position and velocity, $\hat{\boldsymbol{g}}$ and $\Delta\hat{\boldsymbol{g}}$ are the same quantities for the model prediction, and MSE stands for mean squared error. The weight $\lambda$ was set empirically to 0.6. Our velocity penalty can be seen as an improvement on the penalty used by Yoon et al. (2018). Instead of penalizing the absolute value of the velocity, we enforce velocity to be close to the velocity of the ground truth.
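A minimal sketch of a loss of this form is given below. It assumes that velocity is the finite difference between consecutive frames and that the weight of 0.6 multiplies the velocity term; the function names are illustrative and the code operates on plain Python lists rather than tensors.

```python
from typing import List

Pose = List[float]


def mse(a: List[Pose], b: List[Pose]) -> float:
    """Mean squared error over all frames and all pose dimensions."""
    squared = [(x - y) ** 2 for va, vb in zip(a, b) for x, y in zip(va, vb)]
    return sum(squared) / len(squared)


def velocity(poses: List[Pose]) -> List[Pose]:
    """Frame-to-frame differences (assumed definition of velocity)."""
    return [[c - p for c, p in zip(cur, prev)] for prev, cur in zip(poses, poses[1:])]


def gesture_loss(target: List[Pose], predicted: List[Pose], weight: float = 0.6) -> float:
    """Position MSE plus a weighted MSE between target and predicted velocities."""
    return mse(target, predicted) + weight * mse(velocity(target), velocity(predicted))


target = [[0.0], [1.0], [2.0]]
predicted = [[0.0], [1.0], [4.0]]
loss = gesture_loss(target, predicted)
```

The velocity term is zero whenever the prediction moves exactly like the ground truth, even if it is offset by a constant, whereas a penalty on absolute velocity would push the output towards standing still.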

During development, we observed that information from previous poses (the autoregression) tended to overpower the information from the speech: our initial model moved independently of speech input and quickly converged to a static pose. This is a common failure mode in generative sequence models, e.g., [14]. To counteract this, we pretrain our model without autoregression for the first seven epochs (a number chosen empirically), before letting the model receive autoregressive input. This pretraining helps the network learn to extract useful features from the speech input, an ability which is not lost during further training. Additionally, while full training begins as teacher forcing (meaning that the model receives the ground-truth poses as autoregressive input instead of its own previous predictions), this is annealed over time: after one epoch, the model receives its own prediction instead of the ground-truth poses (for two consecutive frames) every 16 frames, which increases to every eight frames after another epoch, to every four frames after the next epoch, and then to every single frame after that. Hence, after five epochs of training with autoregression, our model receives its own predictions only. This greatly helps the model learn to recover from its own mistakes.
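The annealing schedule can be written as a small predicate that decides, for each frame, whether the autoregressive input is the ground truth or the model's own prediction. This is one possible reading of the description: epochs are counted from the start of autoregressive training (after pretraining), and the two self-fed frames are placed at the beginning of each period. Both choices, as well as the function name, are assumptions.

```python
def uses_own_prediction(epoch: int, frame: int) -> bool:
    """True if the model is fed its own previous prediction at this frame.

    `epoch` is counted from the start of autoregressive training (assumption).
    """
    if epoch <= 0:
        return False                      # pure teacher forcing
    period = {1: 16, 2: 8, 3: 4}.get(epoch)
    if period is None:
        return True                       # fully autoregressive from here on
    return frame % period < 2             # two consecutive frames per period


# Number of self-fed frames among the 40 training frames of one sequence.
self_fed_per_epoch = [
    sum(uses_own_prediction(epoch, f) for f in range(40)) for epoch in range(5)
]
```

Under these assumptions the share of self-fed frames grows from none, to 6, 10, and 20 out of 40, and finally to all 40 frames.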

5.4 Hyper-Parameter Settings

For the experiments in this paper, we used the hyper-parameter search tool Tune [19]. We performed random search over 600 configurations with velocity loss as the only criterion, obtaining the following hyper-parameters: a speech-encoding dimensionality of 124 at each of 30 frames, producing 3720 elements after concatenation. The three subsequent layers had 612, 256, and 12 or 45 nodes (the output dimensionality with or without PCA). Three previous poses were encoded into a 512-dimensional conditioning vector. The batch size was 64 and the learning rate was $10^{-4}$. For regularization, we applied dropout with probability 0.2 to each layer, except for the pose encoding, which had dropout 0.8 to prevent the model from attending too much to past poses.
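For reference, the reported settings can be collected in a single configuration object. The class and field names below are hypothetical; only the numeric values come from the text, and the default output size of 45 corresponds to the variant without PCA (15 joints with three rotation values each).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GestureModelConfig:  # hypothetical name, not from the authors' code
    speech_encoding_dim: int = 124
    context_frames: int = 30
    hidden_sizes: Tuple[int, int] = (612, 256)
    output_dim: int = 45            # 12 when PCA is applied to the poses
    n_previous_poses: int = 3
    conditioning_dim: int = 512
    batch_size: int = 64
    learning_rate: float = 1e-4
    dropout: float = 0.2
    pose_encoding_dropout: float = 0.8

    @property
    def concatenated_dim(self) -> int:
        """Length of the vector formed by concatenating the context window."""
        return self.speech_encoding_dim * self.context_frames


config = GestureModelConfig()
pca_config = GestureModelConfig(output_dim=12)
```

The derived property reproduces the 3720 elements quoted above (124 times 30).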

6 Objective Evaluation

| System | Description |
| --- | --- |
| Full model | The proposed method |
| No PCA | No PCA is applied to output poses |
| No Audio | Only text is used as input |
| No Text | Only audio is used as input |
| No FiLM | Concatenation instead of FiLM |
| No Velocity loss | The velocity loss is removed |
| No Autoregression | The previous poses are not used |

Table 2: The seven system variants studied in the evaluation.

Our work aims to allow semantics-aware gesture generation by combining both acoustic and semantic speech information. As previous work on data-driven continuous gesture generation has exclusively used a single modality, it is difficult to find baselines for a fair comparison with our proposed model. Therefore, we evaluate the importance of various model components by individually ablating them, training seven different system variants including the full model (see Table 2).

As a step towards common evaluation measures for the gesture generation field, we primarily use metrics proposed by previous researchers. Specifically, we evaluated the average values of root-mean-square error (RMSE), acceleration and jerk, and speed histograms of the produced motion, in line with Kucherenko et al. (2019). To obtain these statistics, the gestures were converted from joint angles to 3D positions.
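Acceleration and jerk can be estimated from a joint trajectory by repeated finite differencing. The sketch below is one plausible way to do this and is not taken from the paper: it assumes the statistics are the mean Euclidean norms of the second and third time derivatives of a single joint's 3D position, and all function names are illustrative.

```python
import math
from typing import List, Tuple

Vec = List[float]


def derivative(series: List[Vec], fps: float) -> List[Vec]:
    """Finite-difference time derivative of a sequence of vectors."""
    return [[(c - p) * fps for c, p in zip(cur, prev)]
            for prev, cur in zip(series, series[1:])]


def mean_magnitude(series: List[Vec]) -> float:
    """Average Euclidean norm over all frames."""
    return sum(math.sqrt(sum(v * v for v in vec)) for vec in series) / len(series)


def acceleration_and_jerk(positions: List[Vec], fps: float = 20.0) -> Tuple[float, float]:
    """Mean acceleration and mean jerk of one joint trajectory (needs >= 4 frames)."""
    vel = derivative(positions, fps)
    acc = derivative(vel, fps)
    jerk = derivative(acc, fps)
    return mean_magnitude(acc), mean_magnitude(jerk)


# Uniformly accelerated motion along x, sampled once per second: x = t^2.
trajectory = [[float(t * t), 0.0, 0.0] for t in range(5)]
acc, jerk = acceleration_and_jerk(trajectory, fps=1.0)
```

For this trajectory the acceleration is constant and the jerk vanishes, which is the behaviour expected of smooth motion; jittery output shows up as large jerk.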

6.1 Average Motion Statistics

Table 3 illustrates acceleration and jerk, as well as RMSE, averaged over 50 test samples for the ground truth and the different ablations of the proposed method. Ground-truth statistics are given as reference values for natural motion.

We can observe that the proposed model moves more slowly than the original motion. This is probably because our model is deterministic and hence produces gestures closer to the mean pose. Not using PCA results in higher acceleration and jerk and makes the model statistics closer to the ground truth. Our intuition for this is that PCA reduces variability in the data, which results in over-smoothed motion. Removing either audio or text input makes the produced gestures even slower. This is probably because there is a weaker input signal to drive the model, so it gesticulates closer to the mean pose. Both FiLM conditioning and the velocity penalty seem to have little effect on the motion statistics and are likely not central to the model. That autoregression is a key aspect of our system is clear from this evaluation: without autoregression, the model loses continuity and generates motion with excessive jerk. RMSE does not appear to be informative. This is expected since there are many plausible ways to gesticulate, so the minimum-expected-loss output gestures do not have to be close to our ground truth.

| System | Acceleration | Jerk | RMSE |
| --- | --- | --- | --- |
| Full model | 0.33 ± 0.05 | 0.29 ± 0.03 | 39 ± 25 |
| No PCA | 0.55 ± 0.09 | 0.46 ± 0.06 | 46 ± 29 |
| No Audio | 0.26 ± 0.05 | 0.18 ± 0.03 | 40 ± 29 |
| No Text | 0.21 ± 0.02 | 0.25 ± 0.03 | 38 ± 24 |
| No FiLM | 0.34 ± 0.05 | 0.27 ± 0.04 | 38 ± 24 |
| No Velocity loss | 0.34 ± 0.04 | 0.29 ± 0.03 | 40 ± 26 |
| No Autoregression | 1.02 ± 0.19 | 1.64 ± 0.31 | 39 ± 25 |
| Ground truth | 1.42 ± 0.42 | 1.12 ± 0.33 | 0 |

Table 3: Objective evaluation of our systems: mean and standard deviation over 50 samples.

6.2 Motion Acceleration Histograms

The values in Table 3 were averaged over all frames and all 15 3D joints. To investigate the motion statistics in more detail, we computed acceleration histograms of the generated motion and compared those against histograms derived from the ground-truth test data. We calculated the relative frequency of different acceleration values over time-frames in all 50 test sequences, split into bins of width 1 cm/s².
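A relative-frequency histogram with fixed-width bins can be computed as sketched below. The function name, the number of bins, and the handling of values beyond the last bin (counted in the total but not shown) are assumptions made for this illustration.

```python
from typing import List, Sequence


def relative_frequency_histogram(values: Sequence[float], bin_width: float = 1.0,
                                 n_bins: int = 10) -> List[float]:
    """Share of values falling into each bin [k * bin_width, (k + 1) * bin_width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int(v // bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


# Hypothetical acceleration magnitudes in cm/s^2, one per frame and joint.
histogram = relative_frequency_histogram([0.2, 0.7, 1.5, 3.9], bin_width=1.0, n_bins=4)
```

Normalizing by the number of samples makes histograms of systems with different amounts of generated motion directly comparable.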

Figure 4: Acceleration distribution histograms. (a) Comparing different architectures. (b) Comparing different input/output data.

Figure 4a illustrates the acceleration histogram for the different model architectures and loss functions we considered. We observe two things: 1) the distributions are not influenced strongly by either FiLM conditioning or by velocity loss; and 2) autoregression is important for producing motion with similar motion statistics as the human motion recordings.

Acceleration histograms for different input/output data are shown in Figure 4b. We observe that excluding the text input has the biggest impact, making the acceleration smaller. This agrees with Table 3 and probably means that without semantic information the model produces mainly beat gestures, whose characteristics differ from other gesture types. Having no audio similarly decreases the acceleration of the produced gestures, which may also indicate that we are modeling different gesture types. Removing PCA increases acceleration, making the distribution more similar to the ground truth. In other words, training our model in the PCA space leads to reduced variability, as expected from a projection that discards part of the variance in the training data. While these numerical evaluations are valuable, they say very little about people’s perceptions of the generated gestures.

7 Perceptual Studies and Discussion

To investigate human perception of the gestures we conducted several user studies. We first evaluated participants’ perception of a virtual character’s gestures as produced by the seven variants of our model described in Table 2 (Study 1), after which the subjectively most preferred system from the first study was compared to held-out ground truth motion (Study 2). The experimental procedure and evaluation measures (described below) were identical for both studies.

Figure 5: Results of first user study, comparing different ablations of our model in pairwise preference tests. Four questions, listed above each bar chart, were asked about each pair of videos. The bars show the preference towards the full model with 95% confidence intervals.

7.1 Experiment Design and Procedure

We assessed the perceived human-likeness of the virtual character’s motion and how the motion related to the character’s speech using measures adapted from recent co-speech gesture generation papers [9, 30]. Specifically, we asked the questions “In which video…”: (Q1) “…are the character’s movements most human-like?” (Q2) “…do the character’s movements most reflect what the character says?” (Q3) “…do the character’s movements most help to understand what the character says?” (Q4) “…are the character’s voice and movement more in sync?”

We conducted a crowdsourced comparison of the full model against each of the six ablations in Table 2, in light of the four questions above. Participants were recruited on Amazon Mechanical Turk (AMT) and were assigned to one of the six comparisons; they could complete the study only once, and were thus only exposed to one of the comparisons. Each participant was asked to evaluate 26 same-speech video pairs on the four subjective measures: 10 pairs randomly sampled from a pool of 28 random segments, 10 from a pool of 20 semantic segments, and 6 attention checks described below. These video pairs were then randomly shuffled. Video samples from all systems can be found at

Every participant first completed a training phase to familiarize themselves with the task and interface. This training consisted of five items not included in the analysis, with video segments not present in the study, showing gestures of different quality. Then, during the experiment, the videos in each pair were presented side by side in random order and could be replayed as many times as desired. For each pair, participants indicated which video they thought best corresponded to a given question (one of Q1 through Q4 above), or that they perceived both videos to be equal in regard to the question.

For four of the six attention checks, we picked a random video in the pair and heavily distorted either the audio (in the 2nd and 17th video pairs) or the video quality (in the 7th and 21st video pairs). Raters were asked to report any video pairs where they experienced audio or video issues, and were automatically excluded from the study upon failing any two of these four attention checks. In addition, the 13th and 24th video pairs presented the same video (from the random pool) twice. Here, an attentive rater should answer “no difference”.

7.2 Comparison of Model Variants

In the comparison of system ablations (Study 1), 123 participants (52 male, 70 female, 1 other) remained after exclusion of 477 participants who failed the attention checks, experienced technical issues, or stopped the study prematurely. The majority were from the USA ($N$ = 120). We conducted a binomial test excluding ties with Holm-Bonferroni correction of $p$-values to analyze the responses. (24 responses that participants flagged for technical issues were excluded.) Our analysis was done in a double-blind fashion such that the conditions were obfuscated during analysis and only revealed to the authors after the statistical tests had been performed. The results are shown in Figure 5.
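The statistical procedure named here can be sketched with the standard library alone. The code below is an illustration, not the authors' analysis script: it implements an exact two-sided binomial test against a 50% preference rate (ties are simply left out of the win and loss counts) and the Holm-Bonferroni step-down adjustment. The counts in the example are invented.

```python
from math import comb
from typing import List, Sequence


def binomial_test_two_sided(wins: int, losses: int) -> float:
    """Exact two-sided p-value for H0: both options are preferred equally often."""
    n = wins + losses
    extreme = abs(2 * wins - n)
    # Sum the probabilities of all outcomes at least as far from n/2 as observed.
    tail = sum(comb(n, k) for k in range(n + 1) if abs(2 * k - n) >= extreme)
    return tail / 2 ** n


def holm_bonferroni(p_values: Sequence[float]) -> List[float]:
    """Holm-Bonferroni adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


# Invented example: 8 of 10 non-tied answers prefer the full model.
p_single = binomial_test_two_sided(wins=8, losses=2)
p_adjusted = holm_bonferroni([0.01, 0.04, 0.03])
```

A comparison is then reported as significant when its adjusted value falls below the chosen threshold, such as the .05, .01, or .0001 levels used in this section.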

We can see from the evaluation of the “No Text” system that removing the semantic input drastically decreases both the perceived human-likeness of the produced gestures and how much they are linked to speech: participants preferred the full model over the one without text across all four questions asked with $p<.0001$. This confirms that semantics are important for appropriate automatic gesture generation.

The “No Audio” model is unlikely to generate beats, and might not follow an appropriate speech rhythm when used with a speaking avatar. Results in Figure 5 confirm this: we see that removing the acoustic information from the model decreased perceived output quality. In terms of human-likeness (Q1) there was a statistically significant difference only for random samples, with $p<.01$. For Q2 and Q3, the full model was also preferred with $p<.01$. For Q4, participants preferred the full model with $p<.05$.

Removing autoregression from the model negatively affected only perceived naturalness ($p<.0001$), as can be seen in Figure 5. This agrees with the findings from the objective evaluation: without autoregression the model produces jerky, unnatural-looking gestures, but the jerkiness does not influence whether gestures are semantically linked to the speech content. Removing FiLM or the velocity penalty did not produce a statistically significant difference in gesture perception, suggesting that these components are not critical for the model.

The model without PCA gave unexpected results. In videos, we see that removing PCA improved gesture variability. While there was no statistically significant difference in human-likeness, “No PCA” was significantly better ($p<.001$) on Q2 and Q3 (see Figure 5). On Q4 there was a statistically significant difference only for semantic segments, with $p<.001$, as these probably require more sophisticated gesticulation. In summary, participants preferred the system without PCA, so it was chosen as our final model for the second, final, comparison.

7.3 Comparison Against the Ground Truth

Having identified “No PCA” as the best variant among our systems, we compared that model variant to the ground-truth gestures using the same procedure as before. In this study (Study 2), 20 participants (9 male, 11 female) remained after excluding 31 participants through the same criteria as in Study 1. Of these, 18 were from the USA. There was a very substantial preference for the ground-truth motion (between 84 and 93%) across questions and segment types. All differences were statistically significant according to Holm-Bonferroni-corrected binomial tests ignoring ties.

7.4 Discussion of Generated Gestures

This work is informed by the gesture typology of McNeill [21], which categorizes gestures as iconic, metaphoric, deictic and beats. We find that our model is able to generate not just beat gestures, which correlate with audio, but also several iconic gestures (e.g., raising the hands for the word “top”) and some metaphoric gestures, both of which correlate with the semantic content of the speech. Importantly, these were produced in continuous gesticulation, instead of by predicting “canned” gesture classes as in, e.g., [4]. This indicates that our model was able to leverage both acoustic and semantic input as intended. In practice, the range of learned gestures and their types depend, of course, on the training data, creating a need for large, comprehensive datasets for the task.

Looking at the videos, we also notice that the generated gestures have very little motion in the spine and shoulders. As whole-spine movements are random and not correlated with the speech, the model learns to predict average values for the spine and shoulder positions, resulting in a somewhat stiff-looking avatar. We believe this was the main reason why participants so easily identified the ground truth in Study 2.

8 Conclusions

We have presented a new machine learning-based model for co-speech gesture generation (code will be made publicly available upon acceptance). To the best of our knowledge, this is the first data-driven model capable of generating continuous gestures linked to both the audio and the semantics of the speech. We evaluated different architecture choices both objectively and subjectively. Our findings indicate that:

  1. Using both modalities of the speech – audio and text – can improve gesture-generation models and enable them to generate both beat gestures and semantically-linked gestures, such as metaphoric gestures.

  2. Autoregressive connections, while not commonplace in contemporary gesture-generation models, can enforce continuity of the gestures, without vanishing-gradient issues and with few parameters to learn. We also described a training scheme that prevents autoregressive information from overpowering other inputs.

  3. PCA applied to the motion space (as used in [30]) can restrict the model by removing perceptually-important variation from the data, which may reduce the range of gestures produced.

Future work involves making the model stochastic and testing how gestures influence human-agent interaction. Another research direction would be to further improve the semantic coherence of the gestures, for instance by treating different gesture types separately.


The authors would like to thank Andre Pereira for helpful discussions. This work was supported by the Swedish Foundation for Strategic Research Grant No.: RIT15-0107 (EACare) and by the Wallenberg AI, Autonomous Systems, and Software Program (WASP) of the Knut and Alice Wallenberg Foundation, Sweden.


  • [1] K. Bergmann, V. Aksu, and S. Kopp (2011) The relation of speech and gestures: temporal synchrony follows semantic synchrony. In Workshop on Gesture and Speech in Interaction, Cited by: §2.3.
  • [2] T. W. Bickmore (2004) Unspoken rules of spoken interaction. Communications of the ACM. Cited by: §2.
  • [3] J. Cassell, H. H. Vilhjálmsson, and T. Bickmore (2001) BEAT: the behavior expression animation toolkit. In Conference on Computer Graphics and Interactive Techniques, Cited by: §1.
  • [4] C. Chiu, L. Morency, and S. Marsella (2015) Predicting co-verbal gestures: a deep and temporal modeling approach. In International Conference on Intelligent Virtual Agents, Cited by: §1, §7.4.
  • [5] M. Chu and S. Kita (2016) Co-thought and co-speech gestures are generated by the same action generation process. Journal of Experimental Psychology: Learning, Memory, and Cognition. Cited by: §2.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics, Cited by: Figure 2, §4.2, §5.1.
  • [7] G. Ferré (2010) Timing relationships between speech and co-verbal gestures in spontaneous French. In Language Resources and Evaluation, Workshop on Multimodal Corpora, Cited by: §2.3.
  • [8] Y. Ferstl and R. McDonnell (2018) Investigating the use of recurrent motion modelling for speech gesture generation. In International Conference on Intelligent Virtual Agents, Cited by: §4, §5.1.
  • [9] S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik (2019) Learning individual styles of conversational gesture. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §5.1, §7.1.
  • [10] S. Goldin-Meadow (1999) The role of gesture in communication and thinking. Trends in Cognitive Sciences. Cited by: §1.
  • [11] F. S. Grassia (1998) Practical parameterization of rotations using the exponential map. Journal of Graphics Tools. Cited by: §5.1.
  • [12] M. Graziano and M. Gullberg (2018) When speech stops, gesture stops: evidence from developmental and crosslinguistic comparisons. Frontiers in Psychology. Cited by: §4.2.
  • [13] M. Graziano, E. Nicoladis, and P. Marentette (2019) How referential gestures align with speech: evidence from monolingual and bilingual speakers. Language Learning. Cited by: §2.3, §2.
  • [14] G. E. Henter, S. Alexanderson, and J. Beskow (2019) MoGlow: probabilistic and controllable motion synthesis using normalising flows. arXiv preprint arXiv:1905.06598. Cited by: §5.3.
  • [15] J. M. Iverson and E. Thelen (1999) Hand, mouth and brain. The dynamic emergence of speech and gesture. Journal of Consciousness Studies. Cited by: §2.
  • [16] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: §5.3.
  • [17] S. Kopp, H. Rieser, I. Wachsmuth, K. Bergmann, and A. Lücking (2007) Speech-gesture alignment. In Conference of the International Society for Gesture Studies, Cited by: §1.
  • [18] T. Kucherenko, D. Hasegawa, G. E. Henter, N. Kaneko, and H. Kjellström (2019) Analyzing input and output representations for speech-driven gesture generation. In International Conference on Intelligent Virtual Agents, Cited by: §1.
  • [19] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica (2018) Tune: a research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118. Cited by: §5.4.
  • [20] D. P. Loehr (2012) Temporal, structural, and pragmatic synchrony between intonation and gesture. Laboratory Phonology. Cited by: §2.3, §2.3, §2.
  • [21] D. McNeill (1992) Hand and mind: what gestures reveal about thought. University of Chicago Press. Cited by: §2, §7.4.
  • [22] G. A. Miller (1995) WordNet: a lexical database for English. Communications of the ACM. Cited by: §3.
  • [23] M. Neff, M. Kipp, I. Albrecht, and H. Seidel (2008) Gesture modeling and animation based on a probabilistic re-creation of speaker style. ACM Transactions on Graphics. Cited by: §1.
  • [24] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) FiLM: visual reasoning with a general conditioning layer. In AAAI Conference on Artificial Intelligence, Cited by: §5.2.
  • [25] W. Pouw and J. A. Dixon (2019) Quantifying gesture-speech synchrony. In Gesture and Speech in Interaction, Cited by: §2.3, §2.3, §2.
  • [26] W. Pouw, S. J. Harrison, and J. A. Dixon (2019) Gesture–speech physics: the biomechanical basis for the emergence of gesture–speech synchrony. Journal of Experimental Psychology: General. Cited by: §3.
  • [27] M. Salem, K. Rohlfing, S. Kopp, and F. Joublin (2011) A friendly gesture: investigating the effect of multimodal robot behavior in human-robot interaction. In IEEE International Symposium on Robot and Human Interactive Communication, Cited by: §1.
  • [28] G. Salvi, J. Beskow, S. Al Moubayed, and B. Granström (2009) SynFace: speech-driven facial animation for virtual speech-reading support. Journal on Audio, Speech, and Music Processing. Cited by: §1.
  • [29] Z. Wu, O. Watts, and S. King (2016) Merlin: an open source neural network speech synthesis system. In ISCA Speech Synthesis Workshop, Cited by: §4.2.
  • [30] Y. Yoon, W. Ko, M. Jang, J. Lee, J. Kim, and G. Lee (2019) Robots learn social skills: end-to-end learning of co-speech gesture generation for humanoid robots. In IEEE International Conference on Robotics and Automation, Cited by: §1, §5.1, §7.1, item 3.