
Analyzing Input and Output Representations for Speech-Driven Gesture Generation

This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences. We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that MFCCs, alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.



1. Introduction

Conversational agents in the form of virtual agents or social robots are rapidly becoming widespread, and many of us will soon interact regularly with them in our day-to-day lives. Humans use non-verbal behaviors to signal their intent, emotions and attitudes in human-human interactions (Knapp et al., 2013; Matsumoto et al., 2013). It has also been shown that people read and interpret robots’ non-verbal cues similarly to non-verbal cues from other people (Breazeal et al., 2005). Interestingly, adding these non-verbal behaviors to robots has been shown to positively affect people’s perception of the robot (Salem et al., 2013). Conversational agents therefore need the ability to perceive and produce non-verbal communication.

An important part of non-verbal communication is gesticulation: gestures made with hands, arms, head pose and body pose communicate a large share of non-verbal content (McNeill, 1992). To facilitate natural human-agent interaction, it is hence important to enable robots and embodied virtual agents to accompany their speech with gestures in the way people do.

Most existing work on generating hand gestures relies on rule-based methods (Cassell et al., 2001; Ng-Thow-Hing et al., 2010; Huang and Mutlu, 2012; Ravenet et al., 2018).

These methods are rather rigid, as they can only generate gestures that are incorporated in the rules. Writing down rules for all possible gestures found in human interaction is highly labor-intensive and time-consuming. Consequently, it is difficult to fully capture the richness of human gesticulation in rule-based systems.

In this paper, we present a solution that eliminates this bottleneck by using a data-driven method that learns to generate human gestures from a dataset of human actions. More specifically, we use speech data, as it is highly correlated with hand gestures (McNeill, 1992) and has the same temporal character.

To predict gestures from speech, we apply Deep Neural Networks (DNNs), which have been widely used in human skeleton modeling for motion prediction (Martinez et al., 2017) as well as classification (Bütepage et al., 2017). We further apply representation learning on top of conventional DNNs with speech as input and gestures as output.

Representation learning is a branch of unsupervised learning aiming to learn a better representation of the data. The new representation can be “better” in various ways: it may be lower-dimensional, contain less redundancy, be more informative, etc. Typically, representation learning is applied to make a subsequent learning task easier. Inspired by previous successful applications to learning human motion dynamics, for example in prediction (Bütepage et al., 2017), classification (Liu and Taniguchi, 2014) and motion synthesis (Habibie et al., 2017; Haag and Shimodaira, 2016), this paper applies representation learning to the motion sequence, in order to extend previous approaches for neural-network-based speech-to-gesture mappings (Takeuchi et al., 2017a; Hasegawa et al., 2018).

The contributions of this paper are two-fold:

  1. We propose a novel speech-driven non-verbal behavior generation method that is generic and agent-agnostic (i.e., it can be applied to any embodiment).

  2. We evaluate the importance of representation both for the motion (by doing representation learning) and for the speech (by comparing different speech feature extractors).

Our evaluation results show that our system can learn a mapping from a human speech signal to the corresponding upper-body motion in the form of 3D joint positions. We analyze which motion representation size yields the best results for the speech-driven gesture generation. Moreover, we numerically evaluate which speech features are most useful. Finally, we perform a user study, which finds that representation learning improved the perceived naturalness of the gestures over the baseline model.

2. Representation learning for speech-motion mapping

Figure 1. Framework overview. The Deep Neural Network (DNN) green boxes are further described in Figures 2 and 3.

2.1. Problem formulation

We frame the problem of speech-driven gesture generation as follows: given a sequence of speech features $\mathbf{s} = [\mathbf{s}_t]_{t=1:T}$ extracted from segments (frames) of speech audio at regular intervals $t$, the task is to generate a corresponding gesture sequence $\hat{\mathbf{g}} = [\hat{\mathbf{g}}_t]_{t=1:T}$ that a human might perform while uttering this speech.

A speech segment would be typically represented by some features, such as Mel Frequency Cepstral Coefficients, MFCCs, (which are commonly used in speech recognition) or prosodic features including pitch (F0), energy, and their derivatives (which are commonly used in speech emotion analysis).

The ground-truth gestures $\mathbf{g}_t$ and predicted gestures $\hat{\mathbf{g}}_t$ are typically represented as 3D-coordinate sequences: $\mathbf{g}_t = [x_{i,t}, y_{i,t}, z_{i,t}]_{i=1:n}$, with $n$ being the number of keypoints of the human body (such as shoulder, elbow, etc.) that are being modelled.

The most recent systems tend to perform mappings from $\mathbf{s}$ to $\hat{\mathbf{g}}$ using a neural network (NN) learned from data. The dataset typically contains recordings of human motion (for instance from a motion capture system) and the corresponding speech signals.

2.2. Baseline speech-to-motion mapping

Our model builds on the work of Hasegawa et al. (Hasegawa et al., 2018). In this section, we describe their model, which is our baseline system.

The speech-gesture neural network (Hasegawa et al., 2018) takes a speech sequence as input and generates a sequence of gestures frame by frame. As illustrated in Figure 1, the speech is processed in overlapping chunks of $C = 30$ frames before and after the current time $t$ (as in Hasegawa et al., 2018). (The offset between frames in the figure is exaggerated for demonstration purposes.) An entire speech-feature window is fed into the network at each time step $t$: $[\mathbf{s}_{t-C}, \ldots, \mathbf{s}_{t-1}, \mathbf{s}_t, \mathbf{s}_{t+1}, \ldots, \mathbf{s}_{t+C}]$. The network is regularized by predicting not only the pose but also the velocity as output: $[\mathbf{g}_t, \Delta\mathbf{g}_t]$. While incorporating the velocity into test-time predictions did not provide a significant improvement, the inclusion of velocity as a multitask objective during training forced the network to learn motion dynamics (Takeuchi et al., 2017a).
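The windowing described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: the function names are hypothetical, zero-padding at the utterance edges is an assumption (the text does not say how edges are handled), and the velocity is taken to be a simple frame-to-frame difference.

```python
from typing import List, Sequence

Frame = List[float]


def context_windows(features: Sequence[Frame], context: int = 30) -> List[List[Frame]]:
    """For every frame t, collect the frames t-context .. t+context.

    Frames outside the utterance are zero-padded (an assumption).
    """
    dim = len(features[0])
    pad = [[0.0] * dim for _ in range(context)]
    padded = pad + [list(frame) for frame in features] + pad
    width = 2 * context + 1
    return [padded[t:t + width] for t in range(len(features))]


def pose_with_velocity(poses: Sequence[Frame]) -> List[Frame]:
    """Append a finite-difference velocity to each pose (zeros for the first frame)."""
    targets = []
    for t, pose in enumerate(poses):
        previous = poses[t - 1] if t > 0 else pose
        velocity = [p - q for p, q in zip(pose, previous)]
        targets.append(list(pose) + velocity)
    return targets


# 100 frames of dummy 26-dimensional speech features.
speech = [[float(t)] * 26 for t in range(100)]
windows = context_windows(speech)
print(len(windows), len(windows[0]), len(windows[0][0]))  # 100 61 26
```

With 30 frames of context on either side, each network input holds 61 feature vectors, and the centre of every window is the current frame.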

Figure 2. Baseline DNN for speech-to-motion mapping. The green box identifies the part used for the DNN in Figure 1.

The baseline neural network architecture is illustrated in Figure 2. First, MFCC features are computed for every speech segment. Then three fully connected layers (FC) are applied to every chunk. This part of the network can be seen as an additional, learned feature extractor. Next, a recurrent network layer with so-called Gated Recurrent Units (GRUs) (Cho et al., 2014) is applied to the resulting sequence. Finally, an additional linear, fully-connected layer is used as the output layer.

It should be noted that the baseline network we described is a minor modification of the network in (Hasegawa et al., 2018). Specifically, we use a different type of recurrent network unit, namely GRUs instead of B-LSTMs. Our experiments found that this cuts the training time in half while maintaining the same prediction performance. We also used a shorter window length for computing MFCC features, namely 0.02 s instead of 0.125 s, since MFCCs were developed to be informative about speech for these window lengths. The only other difference from (Hasegawa et al., 2018) is that we did not post-process (smooth) the output sequences.

2.3. Proposed approach

Our intent with this paper is to extend the baseline model by leveraging the power of representation learning. Our proposed approach contains three steps:

  1. We apply representation learning to learn a motion representation $\mathbf{z}$.

  2. We learn a mapping from the chosen speech features to the learned motion representation (using the same NN architecture as in the baseline model).

  3. The two learned mappings are chained together to turn speech input $\mathbf{s}$ into motion output $\hat{\mathbf{g}}$.

(a) MotionED: Representation learning for the motion
(b) SpeechE: Mapping speech to motion representations
(c) Combining the learned components: SpeechE and MotionD
Figure 3. How the proposed encoder-decoder DNN for speech-to-motion mapping is constructed. The green box denotes the part of the system used for the DNN in Figure 1.

Motion representation learning

Figure 3(a) illustrates representation learning for human motion sequences. The aim of this step is to reduce the motion dimensionality, which confers two benefits: 1) simplifying the learning problem by reducing the output space dimensionality; and 2) reducing redundancy in the training data by forcing the system to concentrate important information to fewer numbers.

To learn motion representations, we used a neural network structure called a Denoising Autoencoder (DAE) (Vincent et al., 2010) with one hidden layer ($\mathbf{z}$). This network learns to reconstruct the input frame while having a bottleneck layer in the middle, which forces the network to compute a lower-dimensional representation. The denoising autoencoder is specifically trained to reconstruct the clean, original data from input examples with additive noise. The network can be seen as a combination of two networks: MotionE, which encodes the motion $\mathbf{m}$ to the representation $\mathbf{z}$, and MotionD, which decodes the representation $\mathbf{z}$ back to the motion $\hat{\mathbf{m}}$:

$$\mathbf{z} = \mathrm{MotionE}(\mathbf{m}), \qquad \hat{\mathbf{m}} = \mathrm{MotionD}(\mathbf{z})$$

The neural network learns to reconstruct the original motion coordinates as closely as possible by minimizing the mean squared error (MSE) loss function:

$$\mathrm{MSE}(\mathbf{m}, \hat{\mathbf{m}}) = \lVert \hat{\mathbf{m}} - \mathbf{m} \rVert_2^2$$

Encoding speech to the motion representation

Figure 3(b) illustrates the principle of how we map speech to motion representation. Conceptually, the network performing this task fulfills the same role as the baseline network in Section 2.2. The main difference versus the baseline is that the output of the network is not raw motion values, but a compact, learned representation of motion. To be as comparable as possible to the baseline, we use the same network architecture to map speech to motion representations in the proposed system as the baseline used for mapping speech to motion. We call this network SpeechE.

Connecting the speech encoder and the motion decoder

Figure 3(c) illustrates how the system is used at testing time by chaining together the two previously learned mappings. First, speech input is fed to the SpeechE encoding net, which produces a sequence of motion representations. Those motion representations are then decoded into joint coordinates by the MotionD decoding net.
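The test-time chaining amounts to composing two functions, one frame at a time. The sketch below uses toy stand-ins for the trained networks; `generate_motion`, `toy_speech_encoder` and `toy_motion_decoder` are hypothetical names introduced for illustration and do not reflect the real SpeechE and MotionD architectures.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def generate_motion(speech_windows: Sequence[Vector],
                    speech_encoder: Callable[[Vector], Vector],
                    motion_decoder: Callable[[Vector], Vector]) -> List[Vector]:
    """Chain the two learned mappings: speech -> representation -> pose."""
    return [motion_decoder(speech_encoder(window)) for window in speech_windows]


# Toy stand-ins for the trained networks (hypothetical, for illustration only).
def toy_speech_encoder(window: Vector) -> Vector:
    # Two-dimensional "motion representation".
    return [sum(window) / len(window), max(window) - min(window)]


def toy_motion_decoder(z: Vector) -> Vector:
    # Three-dimensional "pose".
    return [z[0], z[1], z[0] + z[1]]


poses = generate_motion([[0.0, 2.0, 4.0], [1.0, 1.0, 1.0]],
                        toy_speech_encoder, toy_motion_decoder)
print(poses)  # [[2.0, 4.0, 6.0], [1.0, 0.0, 1.0]]
```

Because the decoder is trained separately on motion data alone, it can in principle be reused unchanged when the speech encoder is retrained on different input features.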

2.4. Implementation

The baseline neural network

Figure 2 shows the structure and layer sizes of the neural network used in the baseline system. As seen, the network inputs contained 61 × 26 elements, comprising 26-dimensional speech-derived MFCC vectors from the current frame plus 30 adjacent frames both before and after it, resulting in a total of 61 vectors in the input. (While we describe and explore other audio features in Section 3.2, the baseline model only used MFCCs, to be consistent with (Hasegawa et al., 2018).) The Fully Connected (FC) layers and the Gated Recurrent Unit (GRU) layers both had a width of 256 and used the ReLU activation function. Batch normalization and dropout with probability 0.1 of dropping activations were applied between every layer. Training minimized the mean squared error between predicted and ground-truth gesture sequences using the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001 and batch size 2048. Training was run for 120 epochs, after which no further improvement in validation set loss was observed.

To encourage replication of our results we make the code publicly available at _generation_with_autoencoder.

The denoising autoencoder neural network

We trained a DAE with input size 384 (64 joints: 192 3D-coordinates and their first derivatives) and one hidden, feedforward layer in the encoder and decoder. Different widths were investigated for the bottleneck layer (see Section 4.1), with 325 units giving the best performance on our validation data. Gaussian noise was added to each input with a standard deviation equal to 0.05 times the standard deviation of that feature dimension. Training minimized the MSE reconstruction loss using Adam with a learning rate of 0.001 and batch size 128. Training was run for 20 epochs.
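The input corruption used for the denoising autoencoder can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `add_input_noise` is hypothetical, the per-feature standard deviation is computed over the frames passed in (the text does not say over which set it was computed), and the fixed seed is only there for reproducibility.

```python
import random
import statistics
from typing import List, Sequence


def add_input_noise(frames: Sequence[Sequence[float]],
                    noise_scale: float = 0.05,
                    seed: int = 0) -> List[List[float]]:
    """Add zero-mean Gaussian noise whose standard deviation is
    noise_scale times the standard deviation of each feature dimension."""
    rng = random.Random(seed)
    feature_stds = [statistics.pstdev(column) for column in zip(*frames)]
    return [[value + rng.gauss(0.0, noise_scale * std)
             for value, std in zip(frame, feature_stds)]
            for frame in frames]


# Second feature is constant, so it has zero spread and receives no noise.
clean = [[0.0, 1.0], [10.0, 1.0], [20.0, 1.0]]
noisy = add_input_noise(clean)
```

During training the autoencoder would receive `noisy` as input and `clean` as the reconstruction target, which is what makes it a denoising rather than a plain autoencoder.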

3. Experimental setup

This section describes the data and gives technical detail regarding the experiments we conducted to evaluate the importance of input and output representations in speech-driven gesture generation.

3.1. Gesture-speech dataset

For our experiments, we used a gesture-speech dataset collected by Takeuchi et al. (Takeuchi et al., 2017b). Motion data were recorded in a motion capture studio from two Japanese individuals having a conversation in the form of an interview.

An experimenter asked questions prepared beforehand, and a performer answered them. The dataset contains MP3-encoded speech audio captured using headset microphones on each speaker, coupled with motion-capture data stored in the BioVision Hierarchy format (BVH). The BVH data describes motion as a time sequence of Euler rotations for each joint in the defined skeleton hierarchy. These Euler angles were converted to a total of 64 global joint positions in 3D. As some of the recordings had a different framerate than others, we downsampled all recordings to a common framerate of 20 frames per second (fps). Afterward, all the coordinates were translated to hip-center coordinates, meaning that the origin of the coordinate system was moved to the hip. These coordinates were used as our target output data.

The dataset contains 1,047 utterances (the original paper reports 1,049 utterances, which is a typo), of which our experiments used 957 for training, 45 for validation, and 45 for testing. The relationship between various speech-audio features and the 64 joint positions was thus learned from 171 minutes of training data at 20 fps, resulting in 206,000 training frames.

3.2. Feature extraction

The ease of learning and the limits of expressiveness for a speech-to-gesture system depend greatly on the input features used. Simple features that encapsulate the most important information are likely to work well for learning from small datasets, whereas rich and complex features might allow learning additional aspects of speech-driven gesture behavior, but may require more data to achieve good accuracy. We experimented with three different, well-established audio features as inputs to the neural network, namely:

  1. MFCCs

  2. Spectrograms

  3. Prosodic features

In terms of implementation, 26 MFCCs were extracted with a window length of 0.02 s and a hop length of 0.01 s, which amounts to 100 analysis frames per second. Our spectrogram features, meanwhile, were 64-dimensional and extracted with a window length and hop size of 0.005 s, yielding a rate of 200 fps. Frequencies that carry little speech information (below the hearing threshold of 20 Hz, or above 8000 Hz) were removed. Both the MFCC and the spectrogram sequences were downsampled to match the motion frequency of 20 fps by replacing every 5 (MFCCs) or 10 (spectrogram) frames by their average. (This averaging prevents aliasing artifacts.)
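The downsampling by block averaging can be sketched as below. The helper name `downsample_by_averaging` is hypothetical, and dropping a trailing incomplete block is an assumption; the text does not say how leftover frames were treated.

```python
from typing import List, Sequence


def downsample_by_averaging(frames: Sequence[Sequence[float]], factor: int) -> List[List[float]]:
    """Replace every `factor` consecutive frames by their element-wise mean.

    A trailing block with fewer than `factor` frames is dropped (an assumption).
    """
    n_out = len(frames) // factor
    result = []
    for i in range(n_out):
        block = frames[i * factor:(i + 1) * factor]
        result.append([sum(column) / factor for column in zip(*block)])
    return result


# One second of dummy 26-dimensional MFCCs at 100 frames per second.
mfcc_100fps = [[float(i)] * 26 for i in range(100)]
mfcc_20fps = downsample_by_averaging(mfcc_100fps, factor=5)
print(len(mfcc_20fps), mfcc_20fps[0][0])  # 20 2.0
```

A factor of 5 takes the 100 fps MFCC stream to 20 fps, and a factor of 10 does the same for the 200 fps spectrogram and prosodic streams.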

As an alternative to MFCCs and spectrum-based features, we also considered prosodic features. These differ in that prosody encompasses intonation, rhythm, and everything else about the speech apart from the specific words spoken (which carry, e.g., the semantics and syntax). Prosodic features were previously used for gesture prediction in early data-driven work by Chiu & Marsella (Chiu and Marsella, 2011). For this study, we considered pitch and energy (intensity) information. The information in these features has a lower bitrate and is not sufficient for discriminating between and responding differently to arbitrary words, but might still be informative for predicting non-verbal emphases like beat gestures and their timings.

We considered four specific prosodic features, extracted from the speech audio with a window length of 0.005 s, resulting in 200 fps. Our first two prosodic features were the energy of the speech signal and the time derivative (finite difference) of the energy series. The third and fourth features were the logarithm of the F0 (pitch) contour, which contains information about the speech intonation, and its time derivative. We extracted pitch and intensity values from audio using Praat (Boersma, 2002) and normalized pitch and intensity as in (Chiu and Marsella, 2011): the pitch values were adjusted by taking $\log(x + 1) - 4$ and setting negative values to zero, and the intensity values were adjusted by taking $\log(x) - 3$. All these features were again downsampled to the motion frequency of 20 fps using averaging.

3.3. Numerical evaluation measures

We used both objective and subjective measures to evaluate the different approaches under investigation. Among the former, two kinds of error measures were considered:

Average Position Error (APE):

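The APE is the average Euclidean distance between the predicted coordinates $\hat{\mathbf{g}}$ and the original coordinates $\mathbf{g}$:

$$\mathrm{APE}(\mathbf{g}^n, \hat{\mathbf{g}}^n) = \frac{1}{T D} \sum_{t=1}^{T} \sum_{d=1}^{D} \lVert g^n_{t,d} - \hat{g}^n_{t,d} \rVert_2$$

where $T$ is the total duration of the sequence, $D$ is the dimensionality of the motion data and $n$ is a sequence index.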

Motion Statistics:

We considered the average values and distributions of acceleration and jerk for the produced motion.

We believe the motion statistics to be the most informative for our task: in contrast to tracking, the purpose of gesture generation is not to reproduce one specific true position, but rather to produce a plausible candidate for natural motion. Plausible motions do not require measures like acceleration or jerk to closely follow the original motion, but they should follow a similar distribution. That is why we study distribution statistics, namely average acceleration and jerk.

Since there is some randomness in system training, e.g., due to random initial network weights, we evaluated every condition five times and report the mean and standard deviation of those results.
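One plausible way to compute the motion statistics is through repeated finite differences of the joint positions. The sketch below is an assumption about the details, not the authors' implementation: it treats a single keypoint, uses the mean Euclidean norm over frames as the "average" value, and scales by the 20 fps frame rate; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def finite_difference(series: Sequence[Frame], dt: float) -> List[List[float]]:
    """Frame-to-frame difference divided by the time step."""
    return [[(b - a) / dt for a, b in zip(previous, current)]
            for previous, current in zip(series, series[1:])]


def mean_magnitude(series: Sequence[Frame]) -> float:
    """Mean Euclidean norm over all frames."""
    return sum(math.sqrt(sum(x * x for x in frame)) for frame in series) / len(series)


def motion_statistics(positions: Sequence[Frame], fps: float = 20.0) -> Tuple[float, float]:
    """Return (average acceleration, average jerk) for one keypoint trajectory.

    Needs at least four frames, since jerk is a third-order difference.
    """
    dt = 1.0 / fps
    velocity = finite_difference(positions, dt)
    acceleration = finite_difference(velocity, dt)
    jerk = finite_difference(acceleration, dt)
    return mean_magnitude(acceleration), mean_magnitude(jerk)


# A keypoint moving along x with constant acceleration 2 (x = t^2).
trajectory = [[(k / 20.0) ** 2, 0.0, 0.0] for k in range(10)]
average_acceleration, average_jerk = motion_statistics(trajectory)
```

For a static pose every difference is zero, so both statistics vanish, which is why a static mean pose scores zero acceleration and jerk regardless of how far it is from the ground truth.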

4. Results and Discussion

This section presents an analysis of the performance of the gesture-prediction system. We investigate different design aspects of our system that relate to the importance of representations, namely the speech and motion representations used.

4.1. Importance of motion encoding

First of all, we evaluated how different dimensionalities for the learned motion representation affected the prediction accuracy of the full system. Figure 4 graphs the results of this evaluation. In terms of average position error (APE) (see Figure 4(a)), the optimal dimensionality is clearly 325, which is smaller than the original data dimensionality (384). Motion jerkiness (see Figure 4(b)) is also lowest for dimensionality 325, but only by a slight margin compared to the uncertainty in the estimates. Importantly, the proposed system is seen to perform much better than the baseline (Hasegawa et al., 2018) on both evaluation measures. The difference in the average jerk, in particular, is highly significant. This validates our decision to use representation learning to improve gesture generation models. While motion jerkiness can be reduced through post-processing, as in (Hasegawa et al., 2018), that does not address the underlying shortcoming of the model.

The numerical results are seen to vary noticeably between different runs, suggesting that training might converge to different local optima depending on the random initial weights.

(a) Average position error (APE). The baseline APE (blue line) is 8.3 ± 0.4.
(b) Average jerk. Baseline jerk is 2.8 ± 0.3 while ground-truth jerk is 0.54.
Figure 4. Effect of learned-representation dimensionality in the proposed model.

4.2. Input speech representation

Having established the benefits of representation learning for the output motion, we next analyze which input features perform the best for our speech-driven gesture generation system. In particular, we compare three different features – MFCCs, raw power-spectrogram values, and prosodic features (log F0 contour, energy, and their derivatives) – as described in Section 3.2.

From Table 1, we observe that MFCCs, alone or combined with prosodic features, achieve the lowest APE, but produce motion with higher acceleration and jerkiness than the spectrogram features do. Spectrogram features gave suboptimal APE, but match ground-truth acceleration and jerk better than the other features we studied.

Model/feature                              APE           Acceleration   Jerk
Static mean pose                           8.95          0              0
Prosodic                                   8.56 ± 0.2    0.90 ± 0.03    1.52 ± 0.07
Spectrogram                                8.27 ± 0.4    0.51 ± 0.07    0.85 ± 0.12
Spectr. + Pros.                            8.11 ± 0.3    0.57 ± 0.08    0.95 ± 0.12
MFCC                                       7.66 ± 0.2    0.53 ± 0.03    0.91 ± 0.05
MFCC + Pros.                               7.65 ± 0.2    0.58 ± 0.06    0.97 ± 0.11
Baseline (Hasegawa et al., 2018) (MFCC)    8.07 ± 0.1    1.50 ± 0.03    2.62 ± 0.05
Ground truth                               0             0.38           0.54
Table 1. Objective evaluation of different speech features, averaged over five re-trainings of the system.

4.3. Detailed performance analysis

The objective measures in Table 1 do not unambiguously establish which input features would be the best choice for our predictor. We therefore further analyze the statistics of the generated motion, particularly acceleration. Producing the right motion with the right acceleration distribution is crucial for generating convincing motions, as too fast or too slow motion does not look natural.

(a) Average acceleration histogram.
(b) Acceleration histogram for shoulders. Legend as in (a).
(c) Acceleration histogram for hands. Legend as in (a).
Figure 5. Acceleration distributions given different speech features. Firstly, the motion produced from our model (with any input feature) is more similar to the acceleration distribution of the ground-truth motion, compared to motion from the baseline model. Secondly, we find that MFCCs produce an acceleration distribution most similar to the ground truth, especially for the hands, as shown in (c).

To investigate the motion statistics associated with the different input features, we computed acceleration histograms of the generated motions and compared those with histograms derived from the ground truth. We calculated the relative frequency of different acceleration values over the frames in all 45 test sequences, split into bins of equal width. For ease of visualization, the histograms are not shown as bar plots, but the bin frequencies have instead been connected with lines.
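The histogram computation described above can be sketched as follows. The function name, the bin range and the handling of out-of-range values (ignored) are assumptions made for illustration; the text only specifies relative frequencies over equal-width bins.

```python
from typing import List, Sequence, Tuple


def relative_frequency_histogram(values: Sequence[float], n_bins: int,
                                 lower: float, upper: float) -> Tuple[List[float], List[float]]:
    """Return (bin centres, relative frequencies) for equal-width bins on [lower, upper].

    Values outside the range are ignored (an assumption).
    """
    width = (upper - lower) / n_bins
    counts = [0] * n_bins
    for value in values:
        if lower <= value <= upper:
            # The right edge belongs to the last bin.
            index = min(int((value - lower) / width), n_bins - 1)
            counts[index] += 1
    total = sum(counts)
    centres = [lower + (i + 0.5) * width for i in range(n_bins)]
    frequencies = [count / total if total else 0.0 for count in counts]
    return centres, frequencies


accelerations = [0.1, 0.2, 0.6, 0.9]  # dummy acceleration magnitudes
centres, frequencies = relative_frequency_histogram(accelerations, n_bins=2, lower=0.0, upper=1.0)
print(centres, frequencies)  # [0.25, 0.75] [0.5, 0.5]
```

Plotting `frequencies` against `centres` as a line, once per system, gives curves that can be overlaid on the ground-truth curve for comparison.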

Figure 5(a) presents acceleration histograms across all joints for different input features. We observe that the baseline model has an acceleration distribution different from the ground truth, while all our model variants produce acceleration histograms more similar to the ground truth.

Since the results in Figure 5(a) are averaged over all joints, they do not indicate whether all the joints move naturally. To address this, we also analyze the acceleration distribution for certain specific joints. Figure 5(b) shows an acceleration histogram calculated for the shoulders only. We see that our system has acceleration distributions very close to one another for all the speech features, but that all of them are far from the actual data. A possible explanation for this could be that shoulder motion might be difficult to predict from the speech input, in which case the predicted motion is likely to stay close to the mean shoulder position. To restore the appropriate dynamic range of motion, one could apply a probabilistic method to learn a distribution of natural-looking joint coordinate trajectories, and then draw samples from this distribution at test time. This is discussed further in Section 6.1 on future work.

Figure 5(c) shows acceleration histograms for the hands. Hands convey the most important gesture information, suggesting that this plot is the most informative. Here, the MFCC-based system is much closer to the ground truth. Combining MFCC features with prosodic features resulted in similar performance as for MFCC inputs alone. This could be due to redundancy in the information exposed by MFCCs and prosodic features, or due to the current networks and optimizer not being able to exploit synergies between the two representations.

Taken together, Figures 5(a)-(c) suggest that motion generated from MFCC features gives acceleration statistics that are as similar or more similar to the ground truth than those of motion generated from other features. Moreover, using MFCCs as input features makes our proposed system consistent with the baseline paper (Hasegawa et al., 2018).

4.4. User study

The most important goal in gesture generation is to produce motion patterns that are convincing to human observers. Since improvements in objective measures do not always translate into superior subjective quality for human observers, we validated our conclusions by means of a user study comparing key systems.

Scale                   Statement (translated from Japanese)
Naturalness             Gesture was natural
                        Gesture was smooth
                        Gesture was comfortable
Time consistency        Gesture timing was matched to speech
                        Gesture speed was matched to speech
                        Gesture pace was matched to speech
Semantic consistency    Gesture was matched to speech content
                        Gesture well described speech content
                        Gesture helped me understand the content
Table 2. Statements evaluated in user study

We used a 1 × 2 factorial design with the within-subjects factor being representation learning (baseline vs. encoded). The encoded gestures were generated by the proposed method from MFCC input. We randomly selected 10 utterances from a test dataset of 45 utterances, for each of which we created two videos using the two gesture generation systems. Visual examples are provided online. After watching each video, we asked participants to rate nine statements about the naturalness, time consistency, and semantic consistency of the motion. The statements were the same as in the baseline paper (Hasegawa et al., 2018) and are listed in Table 2. Ratings used a seven-point Likert scale anchored from strongly disagree (1) to strongly agree (7). The utterance order was fixed for every participant, but the gesture conditions (baseline vs. encoded) were counter-balanced. With 10 speech segments and two gesture-generation systems, we obtained 20 videos, producing 180 ratings in total per subject, 60 for each scale in Table 2.

Figure 6. Results from the user study. We note a significant difference in naturalness, but not the other scales.

19 native speakers of Japanese (17 male, 2 female), on average 26 years old, participated in the user study. A paired-sample $t$-test was conducted to evaluate the impact of the motion encoding on the perception of the produced gestures. Figure 6 illustrates the results we obtained for the three scales being evaluated. We found a significant difference in naturalness between the baseline (M=4.16, SD=0.93) and proposed model (M=4.79, SD=0.89), $t = -3.6372$, $p < 0.002$. A 95%-confidence interval for the mean rating improvement with the proposed system is (0.27, 1.00). There were no significant differences on the other scales: for time consistency $t = 1.0192$, $p = 0.32$; for semantic consistency $t = 1.5667$, $p = 0.13$. These results indicate that the gestures generated by the proposed method (i.e., with representation learning) were perceived as more natural than the baseline.
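For readers unfamiliar with the test, the paired-sample statistic can be computed as in the sketch below. The ratings shown are invented for illustration and are not the study data, and `paired_t_statistic` is a hypothetical helper; turning the statistic into a p-value additionally requires the Student t distribution, which the Python standard library does not provide.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(first: Sequence[float], second: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples."""
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    mean_difference = statistics.mean(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return mean_difference / standard_error, n - 1


# Invented per-participant mean ratings (NOT the study data).
baseline_ratings = [4.0, 5.0, 3.0, 4.0]
proposed_ratings = [5.0, 6.0, 5.0, 5.0]
t_value, degrees_of_freedom = paired_t_statistic(baseline_ratings, proposed_ratings)
print(t_value, degrees_of_freedom)  # -5.0 3
```

A negative statistic here means the first condition was rated lower than the second, matching the sign convention of the naturalness result reported above.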

5. Related work

We review only data-driven approaches and pay special attention to methods incorporating elements of representation learning, since that is the direction of our research. For a review of non-data-driven systems, we refer the reader to Wagner et al. (Wagner et al., 2014).

5.1. Data-driven head and face movements

Facial-expression generation has been an active field of research for several decades. Many of the state-of-the-art methods are data-driven. Several recent works have applied neural networks in this domain (Haag and Shimodaira, 2016; Greenwood et al., 2017; Sadoughi and Busso, 2018, 2017a; Suwajanakorn et al., 2017). Among the cited works, Haag & Shimodaira (Haag and Shimodaira, 2016) use a bottleneck network to learn compact representations, although their bottleneck features subsequently are used to define prediction inputs rather than prediction outputs as in the work we presented. Our proposed method works on a different aspect of non-verbal behavior that co-occurs with speech, namely generating upper-body motion driven by speech.

5.2. Data-driven body motion generation

Generating body motion is an active area of research with applications to animation, computer games, and other simulations. Current state-of-the-art approaches in such body-motion generation are generally data-driven and based on deep learning (Zhang et al., 2018; Zhou et al., 2017; Pavllo et al., 2018). Zhou et al. (Zhou et al., 2017) proposed a modified training regime to make recurrent neural networks generate human motion with greater long-term stability, while Pavllo et al. (Pavllo et al., 2018) formulated separate short-term and long-term recurrent motion predictors, using quaternions to more adequately express body rotations.

Some particularly relevant works for our purposes are (Liu and Taniguchi, 2014; Holden et al., 2015, 2016; Bütepage et al., 2017). All of these leverage representation learning (various forms of autoencoders) to predict human motion, yielding accurate yet parsimonious predictors. Habibie et al. (Habibie et al., 2017) extended this general approach to include an external control signal in an application to human locomotion generation with body speed and direction as the control input. Our approach is broadly similar, but generates body motion from speech rather than position information.

5.3. Speech-driven gesture generation

Like body motion in general, gesture generation has also begun to shift towards data-driven methods. Several researchers have tried to combine data-driven approaches with rule-based systems. For example, Bergmann & Kopp (Bergmann and Kopp, 2009) learned a Bayesian decision network for generating iconic gestures. Their system is a hybrid between data-driven and rule-based models because they learn rules from data. Sadoughi et al. (Sadoughi and Busso, 2017b) used probabilistic graphical models with an additional hidden node to provide contextual information, such as a discourse function. They experimented on only three hand gestures and two head motions. We believe that regression methods that learn and predict arbitrary movements, like the one we have proposed, represent a more flexible and scalable approach than the use of discrete and pre-defined gestures.

The work of Chiu & Marsella (Chiu and Marsella, 2011) is of great relevance to the work we have presented, in that they took a regression approach and also utilized representation learning. Specifically, they used wrist height in upper-body motion to identify gesticulation in motion capture data of persons engaged in conversation. A network based on Restricted Boltzmann Machines (RBMs) was used to learn representations of arm gesture motion, and these representations were subsequently predicted based on prosodic speech-feature inputs using another network also based on RBMs.

Recently, Hasegawa et al. (Hasegawa et al., 2018) designed a speech-driven neural network capable of producing 3D motion sequences. We built our model on this work while extending it with motion-representation learning, since learned representations have improved motion prediction in other applications as surveyed in Section 5.2.

6. Conclusions and future work

This paper presented a new model for speech-driven gesture generation. Our method extends prior work on deep learning for gesture generation by applying representation learning. The motion representation is learned first, after which a network is trained to predict such representations from speech, instead of directly mapping speech to raw joint coordinates as in prior work. We also evaluated the effect of different representations for the input speech.
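As a minimal sketch of this two-step design, the Python fragment below shows how the three networks relate. The names MotionE, MotionD and SpeechE follow the paper; everything else is an assumption made for illustration. In particular, single linear layers with random weights stand in for the trained networks, and the dimensionalities are placeholders rather than values from the experiments.

```python
import random

SPEECH_DIM, REPR_DIM, POSE_DIM = 26, 8, 30  # illustrative sizes (assumptions)

def random_matrix(rows, cols, rng):
    """Placeholder for trained weights: small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(weights, x):
    """Dense layer without bias: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

rng = random.Random(0)
motion_e = random_matrix(REPR_DIM, POSE_DIM, rng)    # MotionE: pose -> representation
motion_d = random_matrix(POSE_DIM, REPR_DIM, rng)    # MotionD: representation -> pose
speech_e = random_matrix(REPR_DIM, SPEECH_DIM, rng)  # SpeechE: speech -> representation

def representation_target(pose_frame):
    """Step 2 regression target: the MotionE encoding of a ground-truth pose."""
    return linear(motion_e, pose_frame)

def generate_gesture(speech_frames):
    """Test time: SpeechE predicts representations, MotionD decodes them to poses."""
    return [linear(motion_d, linear(speech_e, frame)) for frame in speech_frames]

# Hypothetical input: five frames of speech features.
speech = [[rng.random() for _ in range(SPEECH_DIM)] for _ in range(5)]
poses = generate_gesture(speech)
print(len(poses), len(poses[0]))
```

The point of the sketch is the wiring: SpeechE is trained against the output of `representation_target` rather than against raw joint coordinates, and MotionE is not needed at test time.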

Our experiments show that representation learning improves the objective and subjective performance of the speech-to-gesture neural network. Although models with and without representation learning were rated similarly in terms of time consistency and semantic consistency, subjects rated the gestures generated by the proposed method as significantly more natural than those of the baseline.

The main limitation of our method, as with any data-driven method and particularly those based on deep learning, is that it requires substantial amounts of parallel speech-and-motion training data of sufficient quality in order to obtain good prediction performance. In the future, we might overcome this limitation by obtaining datasets directly from publicly-available video recordings using motion-estimation techniques.

6.1. Future work

We see several interesting directions for future research:

Firstly, it would be beneficial to make the model probabilistic, e.g., by using a Variational Autoencoder (VAE) as in (Kucherenko, 2018). A person is likely to gesticulate differently at different times for the same utterance. It is thus an appealing idea to make a conversational agent also generate different gestures each time it speaks the same sentence. For this we need to make the mapping probabilistic, to represent a probability distribution over plausible motions and then draw samples from that distribution. VAEs can provide us with this functionality.
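To illustrate only the sampling step of such a probabilistic mapping, the sketch below draws motion representations from a diagonal Gaussian using the standard VAE reparameterization, z = mean + std * eps. This is not part of the presented model: the function name, the mean and log-variance values, and the three-dimensional representation are all hypothetical, and in a real system the mean and log-variance would be predicted by a network from the speech input.

```python
import math
import random

def sample_representation(mean, log_var, rng):
    """Draw z = mean + std * eps per dimension, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

rng = random.Random(42)
mean = [0.2, -0.1, 0.5]        # hypothetical predicted mean for one frame
log_var = [-2.0, -2.0, -2.0]   # hypothetical predicted log-variance

# The same speech input (same mean and log-variance) gives different samples,
# which a motion decoder would turn into different gestures.
sample_a = sample_representation(mean, log_var, rng)
sample_b = sample_representation(mean, log_var, rng)
print(sample_a)
print(sample_b)
```

As the predicted variance shrinks, the samples collapse onto the mean and the mapping becomes effectively deterministic again.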

Secondly, text should be taken into account, e.g., as in (Ishii et al., 2018). Gestures that co-occur with speech depend greatly on the semantic content of the utterance. Our model generates mostly beat gestures, as we rely only on speech acoustics as input. Hence the model can benefit from incorporating the text transcription of the utterance along with the speech audio. This may enable producing a wider range of gestures (also metaphoric and deictic gestures).

Lastly, the learned model can be applied to a humanoid robot so that the robot’s speech is accompanied by appropriate co-speech gestures, for instance on the NAO robot as in (Yoon et al., 2018).


The authors would like to thank Sanne van Waveren, Iolanda Leite and Simon Alexanderson for helpful discussions.

This project is supported by the Swedish Foundation for Strategic Research Grant No.: RIT15-0107 (EACare).


  • Bergmann and Kopp (2009) Kirsten Bergmann and Stefan Kopp. 2009. GNetIc–Using Bayesian decision networks for iconic gesture generation. In International Workshop on Intelligent Virtual Agents. Springer, 76–89.
  • Boersma (2002) Paul Boersma. 2002. Praat, a system for doing phonetics by computer. Glot International 5 (2002).
  • Breazeal et al. (2005) Cynthia Breazeal, Cory D Kidd, Andrea Lockerd Thomaz, Guy Hoffman, and Matt Berlin. 2005. Effects of nonverbal communication on efficiency and robustness in human-robot teamwork. In 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2005). IEEE, 708–713.
  • Bütepage et al. (2017) Judith Bütepage, Michael J Black, Danica Kragic, and Hedvig Kjellström. 2017. Deep representation learning for human motion prediction and classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
  • Cassell et al. (2001) Justine Cassell, Hannes Högni Vilhjálmsson, and Timothy Bickmore. 2001. Beat: the behavior expression animation toolkit. In Annual Conference on Computer Graphics and Interactive Techniques.
  • Chiu and Marsella (2011) Chung-Cheng Chiu and Stacy Marsella. 2011. How to train your avatar: A data driven approach to gesture generation. In Proc. International Workshop on Intelligent Virtual Agents. 127–140.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. Syntax, Semantics and Structure in Statistical Translation (2014), 103.
  • Greenwood et al. (2017) David Greenwood, Stephen Laycock, and Iain Matthews. 2017. Predicting Head Pose from Speech with a Conditional Variational Autoencoder. In Proc. Interspeech 2017. 3991–3995.
  • Haag and Shimodaira (2016) Kathrin Haag and Hiroshi Shimodaira. 2016. Bidirectional LSTM networks employing stacked bottleneck features for expressive speech-driven head motion synthesis. In International Conference on Intelligent Virtual Agents. Springer, 198–207.
  • Habibie et al. (2017) Ikhsanul Habibie, Daniel Holden, Jonathan Schwarz, Joe Yearsley, Taku Komura, Jun Saito, Ikuo Kusajima, Xi Zhao, Myung-Geol Choi, Ruizhen Hu, et al. 2017. A Recurrent Variational Autoencoder for Human Motion Synthesis. IEEE Computer Graphics and Applications 37 (2017), 4.
  • Hasegawa et al. (2018) Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, Hiroshi Sakuta, and Kazuhiko Sumi. 2018. Evaluation of Speech-to-Gesture Generation Using Bi-Directional LSTM Network. In Proceedings of the 18th International Conference on Intelligent Virtual Agents. ACM, 79–86.
  • Holden et al. (2016) Daniel Holden, Jun Saito, and Taku Komura. 2016. A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35, 4 (2016), 138.
  • Holden et al. (2015) Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. 2015. Learning motion manifolds with convolutional autoencoders. In Proc. SIGGRAPH Asia Technical Briefs. 18:1–18:4.
  • Huang and Mutlu (2012) Chien-Ming Huang and Bilge Mutlu. 2012. Robot behavior toolkit: generating effective social behaviors for robots. In ACM/IEEE International Conference on Human Robot Interaction.
  • Ishii et al. (2018) Ryo Ishii, Taichi Katayama, Ryuichiro Higashinaka, and Junji Tomita. 2018. Generating Body Motions Using Spoken Language in Dialogue. In Proceedings of the 18th International Conference on Intelligent Virtual Agents (IVA ’18). ACM, New York, NY, USA, 87–92.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Knapp et al. (2013) Mark L Knapp, Judith A Hall, and Terrence G Horgan. 2013. Nonverbal communication in human interaction. Wadsworth, Cengage Learning.
  • Kucherenko (2018) Taras Kucherenko. 2018. Data Driven Non-Verbal Behavior Generation for Humanoid Robots. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Doctoral Consortium (ICMI ’18). 520–523.
  • Liu and Taniguchi (2014) Hailong Liu and Tadahiro Taniguchi. 2014. Feature extraction and pattern recognition for human motion by a deep sparse autoencoder. In 2014 IEEE International Conference on Computer and Information Technology (CIT). IEEE, 173–181.
  • Martinez et al. (2017) Julieta Martinez, Michael J Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 4674–4683.
  • Matsumoto et al. (2013) David Matsumoto, Mark G Frank, and Hyi Sung Hwang. 2013. Nonverbal communication: Science and applications. Sage.
  • McNeill (1992) David McNeill. 1992. Hand and mind: What gestures reveal about thought. University of Chicago press.
  • Ng-Thow-Hing et al. (2010) Victor Ng-Thow-Hing, Pengcheng Luo, and Sandra Okita. 2010. Synchronized gesture and speech production for humanoid robots. In IEEE/RSJ International Conference on Intelligent Robots and Systems.
  • Pavllo et al. (2018) Dario Pavllo, David Grangier, and Michael Auli. 2018. QuaterNet: A Quaternion-based Recurrent Model for Human Motion. In Proc. BMVC.
  • Ravenet et al. (2018) Brian Ravenet, Catherine Pelachaud, Chloé Clavel, and Stacy Marsella. 2018. Automating the production of communicative gestures in embodied characters. Frontiers in psychology 9 (2018).
  • Sadoughi and Busso (2017a) Najmeh Sadoughi and Carlos Busso. 2017a. Joint learning of speech-driven facial motion with bidirectional long-short term memory. In International Conference on Intelligent Virtual Agents. Springer, 389–402.
  • Sadoughi and Busso (2017b) Najmeh Sadoughi and Carlos Busso. 2017b. Speech-driven animation with meaningful behaviors. arXiv preprint arXiv:1708.01640 (2017).
  • Sadoughi and Busso (2018) Najmeh Sadoughi and Carlos Busso. 2018. Novel realizations of speech-driven head movements with generative adversarial networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018). 6169–6173.
  • Salem et al. (2013) Maha Salem, Friederike Eyssel, Katharina Rohlfing, Stefan Kopp, and Frank Joublin. 2013. To err is human (-like): Effects of robot gesture on perceived anthropomorphism and likability. International Journal of Social Robotics 5, 3 (2013), 313–323.
  • Suwajanakorn et al. (2017) Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics (TOG) 36, 4 (2017), 95.
  • Takeuchi et al. (2017a) Kenta Takeuchi, Dai Hasegawa, Shinichi Shirakawa, Naoshi Kaneko, Hiroshi Sakuta, and Kazuhiko Sumi. 2017a. Speech-to-Gesture Generation: A Challenge in Deep Learning Approach with Bi-Directional LSTM. In International Conference on Human Agent Interaction.
  • Takeuchi et al. (2017b) Kenta Takeuchi, Souichirou Kubota, Keisuke Suzuki, Dai Hasegawa, and Hiroshi Sakuta. 2017b. Creating a Gesture-Speech Dataset for Speech-Based Automatic Gesture Generation. In International Conference on Human-Computer Interaction. Springer, 198–202.
  • Vincent et al. (2010) Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11, Dec (2010), 3371–3408.
  • Wagner et al. (2014) Petra Wagner, Zofia Malisz, and Stefan Kopp. 2014. Gesture and speech in interaction: An overview. (2014).
  • Yoon et al. (2018) Youngwoo Yoon, Woo-Ri Ko, Minsu Jang, Jaeyeon Lee, Jaehong Kim, and Geehyuk Lee. 2018. Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots. arXiv preprint arXiv:1810.12541 (2018).
  • Zhang et al. (2018) He Zhang, Sebastian Starke, Taku Komura, and Jun Saito. 2018. Mode-adaptive neural networks for quadruped motion control. ACM Transactions on Graphics (TOG) 37, 4 (2018), 145.
  • Zhou et al. (2017) Yi Zhou, Zimo Li, Shuangjiu Xiao, Chong He, Zeng Huang, and Hao Li. 2017. Auto-conditioned recurrent networks for extended complex human motion synthesis. In Proc. ICLR.