Multimodal analysis of the predictability of hand-gesture properties

Embodied conversational agents benefit from being able to accompany their speech with gestures. Although many data-driven approaches to gesture generation have been proposed in recent years, it is still unclear whether such systems can consistently generate gestures that convey meaning. We investigate which gesture properties (phase, category, and semantics) can be predicted from speech text and/or audio using contemporary deep learning. In extensive experiments, we show that gesture properties related to gesture meaning (semantics and category) are predictable from text features (time-aligned FastText embeddings) alone, but not from prosodic audio features, while rhythm-related gesture properties (phase) on the other hand can be predicted from audio features better than from text. These results are encouraging as they indicate that it is possible to equip an embodied agent with content-wise meaningful co-speech gestures using a machine-learning model.


1. Introduction

Figure 1. An illustration of the problem we study.

A picture of an audio waveform and the example sentence "there is a tower to the left" (with tower in bold) are connected to a labelled box that says "multimodal fusion". Arrows point from this box to two other boxes: one titled "Speech2GestExist", with "probability of gesture: 83 percent" written under it, and another titled "Speech2GestProp", with the following text under it: "amount: 7 percent, shape: 75 percent, direction: 20 percent, size: 65 percent".

Verbal and nonverbal communication are important and complementary components of embodied human communication. In human communication, speech is typically accompanied by co-speech gestures or gesticulation, performed by the hands, head, and occasionally the body. Automatically generating such co-speech gestures is an important task in character animation and human-agent interaction, because a substantial fraction of our communication takes place through co-speech gestures mcneill1992hand; kendon2004gesture. Furthermore, gesticulation has also been shown to enhance interactions with embodied agents bergmann2013virtual; luo2013examination, e.g., to help with learning tasks bergmann2013virtual, and to lead to a higher sense of co-presence wu2014effects.

While early hand gesture-generation systems mainly relied on rule-based approaches cassell1994animated; kopp2004synthesizing; ng2010synchronized; marsella2013virtual, data-driven gesture generation has become an important research area in recent years yoon2018robots; kucherenko2020gesticulator; yoon2020speech; ahuja2020no; ferstl2020understanding. Both paradigms have advantages and disadvantages. Rule-based systems produce gestures with a clear communicative function, but lack diversity and require much manual effort to design. Data-driven systems, on the other hand, need less manual work and are more flexible, since they can generalise and generate new gestures on the fly. They may also scale better to large datasets. However, despite several attempts ahuja2020no; kucherenko2020gesticulator; yoon2020speech, there have in our view been no convincing demonstrations of recent data-driven approaches consistently generating gestures with a clear semantic relation to the speech content. For example, in terms of subjective gesture appropriateness for the speech, no system in the 2020 GENEA gesture-generation challenge kucherenko2020genea surpassed a bottom line that simply paired the input speech audio with mismatched excerpts of training-data motion, completely unrelated to the speech.

It would be desirable to develop approaches that combine the strengths of both paradigms, enabling systems to be built from data yet producing gestures that fulfil a communicative function together with the speech. This has led us to investigate whether the communicative attributes of gesture can be modelled directly using recent data-driven methods.

The goal of this paper is to analyse to what extent modern deep-learning approaches are able to predict important communicative properties of hand gestures from the co-occurring speech. As such, this work should not be read as a machine-learning paper, since our focus is not to propose new architectures or advance the numerical performance on some pre-existing benchmark, nor as a gesture-generation paper, since no gesture synthesis is performed. Instead, this is intended as a work on gesture analysis that studies the predictability of important gesture properties. Apart from being an interesting question in its own right, developing the ability to predict semantic aspects of gesticulation is a key element in driving future gesture-generation systems kucherenko2021speech2properties2gestures to produce more meaningful and appropriate gesticulation. This work can therefore be seen as a continuation of recent efforts ferstl2020understanding; saund2021cmcf; yunus2020sequence towards imbuing data-driven systems with greater control over communicative function.

The specific contributions of our work are:

  • We conduct extensive gesture-property prediction experiments on a direction-giving dataset with a high fraction of representational gestures, for which gesture properties have been extensively hand-annotated. Specifically, we predict 13 distinct property labels – 8 relating to communicative function – which is significantly more than any prior work.

  • We analyse which modalities of speech – audio and/or text – are useful for predicting which gesture properties.

  • We investigate how speaker-specific or general different gesture properties are, by experimenting with gesture-property prediction for both known and previously unseen speakers.

Despite the highly individual and stochastic nature of gestures, we find that numerous gesture properties can be predicted from speech, both for speakers inside and outside the training data. We also find that speech text and audio differ in their uses, where time-aligned text enables predicting gesture category and semantics, while prosodic audio features help predict gesture phase. More information, including dataset and code, will be released on our project page at: svito-zar.github.io/speech2properties2gestures .

2. Related work

Since this paper considers the predictability of different properties of human gesticulation from multimodal representations of speech, our review of related work covers two aspects: first the prediction of various gesture properties, and then the use and combination of speech modalities for gesture generation. In general, the predictability of gesture properties has not been extensively studied, and most current gesture-generation systems do not integrate explicit gesture-property prediction, but there is nonetheless some prior work on predicting various gesture properties from speech.

2.1. Gesture presence/absence prediction

ferstl2021expressgesture used a statistical method based on speech prosody peaks to predict where a gesture should be placed. They set the timing so that gesture strokes were 55% complete at the pitch peak. yunus2019gesture predicted gesture presence and timing from speech audio using a recurrent neural network (RNN). In this work, we likewise explore using a neural network for this, but we use a convolutional neural network (CNN) instead of an RNN and consider a more extensive set of gesture properties.

2.2. Gesture lexeme prediction

Many gesture synthesis approaches predict gesture lexemes, or tags, that encapsulate both gesture form and semantics. For example, a cup or conduit gesture involves a curved handshape, with the palm facing up and a forward motion of the hand from the speaker outward (gesture form), and is used to indicate an offering or conveyance (semantics). Systems of this type include cassell2001beat; lee2006nonverbal; lhommet2013gesture; marsella2013virtual; kappagantula2020automatic; chiu2015predicting. Some of these were rule-based, and predicted gesture semantics from input text based on a set of rules cassell2001beat; lee2006nonverbal; lhommet2013gesture. Other research applied statistical methods to learn probabilistic mappings from semantic concepts to gestures kipp2005gesture; ishi2018speech. Later, deep learning was applied to predict a fixed set of semantic gestures based on audio, text, and part-of-speech tags chiu2015predicting. Our work, in contrast, does not consider a codified set of lexemes and instead predicts gesture properties that capture different elements of semantics, such as gesture categories and semantic gesture features.

2.3. Gesture kinematics prediction

ferstl2020understanding considered predicting kinematic gesture properties (specifically velocity, initial acceleration, gesture size, arm swivel, and hand opening) from speech. They trained multiple recurrent neural networks to predict these gesture parameters from the speech audio signal, and found that some parameters, such as path length, were predicted more accurately than others, for example velocity. Instead of kinematics, we consider the predictability of gesture properties related to gesture semantics and phase.

2.4. Gesture phase and category prediction

Kendon kendon2011gesticulation defined the following gesture phases: preparation, hold, stroke, and retraction. All phases are optional except for the stroke, which is the expressive phase of the gesture. It has been shown that gesture stroke is strongly correlated with pitch accentuation in speech jannedy2005structuring; esteve2013prosodic. Furthermore, McNeill mcneill1992hand defined different gesture categories, or dimensions, such as deictic, iconic, and metaphoric (all related to the spoken message) gestures and beat gestures (which are more strongly related to speech prosody and rhythm).

This paper investigates how well gesture phases (as defined by Kendon), gesture semantic meaning, and gesture categories (as defined by McNeill) can be predicted from speech audio and text in a data-driven manner. The most similar prior work is due to Yunus et al. yunus2019gesture; yunus2020sequence, where a restricted set of gesture phases and categories was predicted from acoustic features only. Our study differs in that we consider additional gesture properties and also study the effect of different speech modalities as input.

2.5. Effect of the speech input modality

Many data-driven systems have only considered a single speech modality – either audio recordings or text transcriptions thereof – as input to the gesture generation, e.g., neff2008gesture; bergmann2009GNetIc; yoon2018robots; kucherenko2021moving. However, the field is now shifting towards using both audio and text together chiu2015predicting; kucherenko2020gesticulator; yoon2020speech; ahuja2020no. This shift is, among other things, motivated by recent ablation studies of end-to-end gesture-synthesis systems kucherenko2020gesticulator; yoon2020speech, which compared gesture-generation models using only one modality against models using both. These studies found that using both speech modalities (audio and text) improved the synthesised gestures. This paper delves further into the effects of the different input modalities, and addresses the question of which speech modalities are useful for predicting particular properties of human gesticulation.

3. Data

3.1. Corpus

There are two principal ways to obtain data for 3D gesture synthesis: optical motion capture lee2019talking; joo2019towards and 3D pose estimation from videos yoon2020speech; ahuja2020no. Among existing datasets, almost all are monologues, with only joo2019towards involving interactions of more than one person.

Our present work aims at modelling iconic gestures, which are rare in all the previously cited datasets. Despite their important role in enabling meaningful gesticulation, these gestures only occur occasionally during social conversations. Hence we decided to focus on a dataset that contains a large proportion of iconic gestures, the Bielefeld Speech and Gesture Alignment corpus (SaGA) lucking2013data. This is the largest and newest database we are aware of with detailed and accurate gesture-property annotations. Larger gesture databases exist, e.g., ahuja2020no, but do not have the annotations necessary for our research. We believe the SaGA dataset is sufficiently large for our purposes, since it has been previously used for generating iconic gestures bergmann2009GNetIc, albeit based on information that cannot be extracted from speech.

The SaGA dataset contains a total of 280 minutes of recordings of 25 different participants speaking and gesturing to an interlocutor. Each recording lasted around 10 minutes, with durations ranging from 4 to 19 minutes. All recordings are in German. A key goal of SaGA was to capture a large number of iconic gestures. This was accomplished through a specific data collection procedure in which participants first saw a virtual reality bus tour and then described the route, and the prominent visual landmarks placed along that route, to another person. Both the navigation task and the landmarks provided natural visual grounding upon which iconic gestures are based. All participants followed the same route, thus maximising the degree of consistency between the recordings and simplifying the task of grounding gesture prediction in language by considering a tightly restricted semantic domain. Audio and video were recorded of each interaction lucking2013data and every gesture was manually annotated according to a detailed labelling scheme. We use a subset of their annotation categories for our study, as described in Section 3.3.

Dataset partitioning

Following previous works in gesture-synthesis research kucherenko2020gesticulator; alexanderson2020style; wu2021modeling, the dataset was encoded at 20 fps. This resulted in 261,909 frames in total, out of which 127,581 frames were annotated as containing a gesture. For our research, we replaced time-frames annotated as interlocutor speech with silence, in order to concentrate on the gesturing person’s own speech. We used 22 out of 25 recordings for training and cross-validation. The remaining 3 recordings (numbers 7, 8 and 10) were held out for future research, so that future models can be evaluated without data leakage from the experiments reported here.

We used two different data partitions for cross-validation, to avoid tuning hyperparameters and evaluating on the exact same data splits. For choosing hyper-parameters, we performed classical 10-fold cross-validation. For evaluating the model, we use 20-fold cross-validation, set up such that every fold contains 5% of the data from each of the 22 subjects in the recordings we consider. Training and validation sequences never overlapped.
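As an illustration of this evaluation protocol, the sketch below shows one way such speaker-stratified folds could be constructed; the function name and the per-speaker frame bookkeeping are our own assumptions for illustration, not the released code.

```python
import numpy as np

def make_speaker_stratified_folds(frames_per_speaker, n_folds=20):
    """Split each speaker's frame indices into n_folds contiguous chunks,
    so that every fold holds roughly 1/n_folds of every speaker's data."""
    folds = [[] for _ in range(n_folds)]
    for speaker, idx in frames_per_speaker.items():
        for fold_id, chunk in enumerate(np.array_split(np.asarray(idx), n_folds)):
            folds[fold_id].append(chunk)
    return [np.concatenate(parts) for parts in folds]

# Hypothetical example: 22 speakers with differing numbers of frames.
rng = np.random.default_rng(0)
frames = {s: np.arange(rng.integers(8_000, 15_000)) for s in range(22)}
folds = make_speaker_stratified_folds(frames, n_folds=20)
print(len(folds), len(folds[0]))  # 20 folds, each with ~5% of every speaker's frames
```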

3.2. Speech modalities and their encoding

We used two different speech modalities from the dataset, each of which is described below.

Text

Each recording was transcribed in German. Transcriptions contain the written form of every word and its timing (onset and offset), but no punctuation or other sentence delimiters due to the spontaneous and continuous nature of the speech.

We experimented with two commonly used word embeddings for German: DistilBERT sanh2020distilbert, which encodes each word together with context, and FastText joulin2016fasttext, which does not take context into account. FastText outperformed DistilBERT when predicting the semantics property (where the text modality has the most impact) and was hence chosen as the text embedding for our experiments.

The FastText tokeniser produces one 300-dimensional feature vector (a.k.a. “embedding”) per word-piece token. These were converted to a single feature vector per word by computing the arithmetic average of the feature vectors of all word pieces within that word. When predicting gesture properties, each vector was supplemented with one extra number about word timing, namely the time-difference from the word onset to the prediction target frame (negative for words starting before the target point and positive for future words). Text-based gesture generation commonly uses timing information ishi2018speech; yoon2018robots, even though that information cannot be derived from text alone.
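To make the text-feature construction concrete, here is a minimal sketch of how one such per-word feature vector could be assembled. The step producing one 300-dimensional FastText vector per word piece is assumed to be available; helper names and shapes are illustrative, not the released code.

```python
import numpy as np

def word_feature(word_piece_embeddings, word_onset_s, target_time_s):
    """Average the 300-dim FastText vectors of a word's pieces and append the
    timing offset from the word onset to the prediction target frame."""
    avg = np.mean(np.stack(word_piece_embeddings), axis=0)  # (300,)
    offset = word_onset_s - target_time_s                   # negative for words before the target
    return np.concatenate([avg, [offset]])                  # (301,)

# Hypothetical example: a word split into two word pieces,
# starting 0.4 s before the frame we are predicting for.
pieces = [np.random.randn(300), np.random.randn(300)]
feature = word_feature(pieces, word_onset_s=2.1, target_time_s=2.5)
print(feature.shape)  # (301,)
```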

Audio

We extracted the audio tracks from each video and converted them to mono waveforms with a 48 kHz sampling rate. We then used Parselmouth jadoul2018introducing to compute five prosodic features as the audio feature set of our experiments: a voiced/unvoiced binary flag, log fundamental frequency (linearly interpolated in unvoiced regions), log energy, and the derivatives of the last two computed with finite differences. Such prosodic features are commonly used in speech emotion analysis as well as for gesture-property prediction, e.g., yunus2019gesture. Specifically, we transformed pitch and intensity like in chiu2011train; kucherenko2021moving: the pitch values were adjusted by taking the logarithm and setting negative values to zero, and the intensity values were adjusted by taking the logarithm. The audio features were first extracted at 200 fps and then resampled to 20 fps by averaging, to match the resolution of the gesture annotations.
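A minimal sketch of this prosodic feature extraction with Parselmouth is shown below. The 5 ms analysis step, the interpolation details, the energy scaling, and the use of np.gradient for the derivatives are our assumptions for illustration; the released features may differ in these details.

```python
import numpy as np
import parselmouth  # the praat-parselmouth package

def prosodic_features(wav_path, step=0.005):
    """Per-frame prosodic features: voiced flag, log F0 (interpolated in
    unvoiced regions), log energy, and derivatives of the last two."""
    snd = parselmouth.Sound(wav_path)
    f0 = snd.to_pitch(time_step=step).selected_array['frequency']  # 0 where unvoiced
    energy = snd.to_intensity(time_step=step).values.squeeze()     # Praat intensity (dB)

    voiced = (f0 > 0).astype(float)
    t = np.arange(len(f0))
    # Linearly interpolate log F0 across unvoiced stretches.
    log_f0 = np.interp(t, t[f0 > 0], np.log(f0[f0 > 0])) if voiced.any() else np.zeros_like(f0)
    log_en = np.log(np.maximum(energy, 1e-6))

    n = min(len(log_f0), len(log_en))  # Praat may return slightly different frame counts
    feats = np.stack([voiced[:n], log_f0[:n], log_en[:n],
                      np.gradient(log_f0[:n]), np.gradient(log_en[:n])], axis=1)
    return feats  # (n_frames, 5); resample to 20 fps by averaging afterwards

feats = prosodic_features("speaker01.wav")  # hypothetical file name
print(feats.shape)
```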

We also experimented with using spectrograms instead of prosodic features, but found no difference between the two when predicting gesture phase (where the audio modality has the most impact). Prosodic features were chosen because they are more anonymous, which enables us to release the audio features.

Figure 2. The frequency of each gesture-property label in the SaGA dataset. Note that frequencies may sum to more than 127,581 since most categories are not mutually exclusive.

A bar plot containing the number of frames with each of the 13 possible gesture-property labels. The numbers are as follows. Phase labels: retraction: 17528, prep: 38374, pre-hold: 650, stroke: 53021, post-hold: 16654. Category labels: deictic: 37806, beat: 20079, iconic: 90222, discourse: 16873. Semantic labels: amount: 6474, shape: 16489, size: 2447, direction: 17552.

3.3. Gesture properties and their encoding

The SaGA corpus contains detailed annotations of the properties of the gestures in the recordings. We made use of the following gesture properties in our experiments: R.G.Left Semantic, R.G.Right Semantic, R.G.Left Phrase, R.G.Right Phrase, R.G.Left.Phase, and R.G.Right.Phase. The Semantics property indicates which semantic information is contained in the gesture. Phrase indicates the gesture category. Phases are sub-units of gestures that indicate, for example, whether the hands are preparing to gesture or whether meaning is currently being conveyed. For details about the data collection and annotation scheme we refer the reader to lucking2010bielefeld and bergmann2006verbal. To simplify modelling, we merged the features for the left and right hand into a single feature using a per-frame logical OR. Each feature was encoded into a vector of binary values, which is one-hot for Phase since phases are mutually exclusive.

Gesture-property representations

We encoded gesture properties at a rate of twenty frames per second (20 fps). As described in Section 4.2, our system first predicts if a gesture is needed and then what kind of gesture it should be. For the latter gesture-property prediction task, we only consider time-frames where a gesture was present in the data, i.e., frames where any of the annotations we considered were present and nonzero. This amounted to 127k out of 261k total frames. We list the gesture-property labels we considered and the number of frames they were present at in Figure 2. As can be seen, most of the gesture-property labels only apply to a small fraction of the gesture-containing frames in the data.

We encoded the gesture properties as binary vectors. For this, we first created an ordering of the different labels relevant to each property. For example, for Gesture Category we ordered the different possible labels as follows: {1: ‘deictic’, 2: ‘beat’, 3: ‘iconic’, 4: ‘discourse’}. A frame with Category annotation “beat-iconic” would then be encoded by the vector [0, 1, 1, 0]. As the example shows, gesture categories are not mutually exclusive, and several labels can be present simultaneously. The same applies to gesture semantics labels. Gesture phase, on the contrary, is exclusive – only one label can be applicable at a time – and we take this mutual exclusivity of gesture phases into account during modelling and evaluation.
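A minimal sketch of this multi-hot encoding, using the label ordering from the example above (the annotation-string format and the helper name are our own simplifying assumptions):

```python
CATEGORY_LABELS = ['deictic', 'beat', 'iconic', 'discourse']

def encode_multi_hot(annotation, labels=CATEGORY_LABELS):
    """Multi-hot encoding of a (possibly compound) annotation string,
    e.g. 'beat-iconic' -> [0, 1, 1, 0]. Phase labels would instead be
    one-hot, since gesture phases are mutually exclusive."""
    present = set(annotation.split('-'))
    return [1 if label in present else 0 for label in labels]

print(encode_multi_hot('beat-iconic'))  # [0, 1, 1, 0]
```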

Note that the work in this paper does not make use of the videos captured during the SaGA corpus recordings, only transcriptions, gesture annotations, and anonymous audio features (prosody) derived from those recordings. We will release the extended and anonymized version of the SaGA dataset at our project page: svito-zar.github.io/speech2properties2gestures .

4. Experimental setup

This section describes the experimental setup for our experiments on predicting gesture properties from speech text and audio.

Figure 3.

The shared multimodal architecture of our two networks. First, the two modalities are independently encoded using dilated temporal CNNs, using zero-padding as necessary. Then, the two encodings are concatenated and fed into an MLP decoder, which returns the final output.

An architecture figure for the multimodal neural network. On the top of the figure, the inputs are shown side-by-side: text, with the example sentence "then you turn left at the tower"; and audio, depicted with a waveform. The two inputs are converted to a sequence of vectors, centred around the middle frame. The vectors are FastText word embeddings for text (per-word), and prosodic features for audio (at 20 FPS). The prediction is then generated as described in the caption. It may be the probability of gesture, or the label probabilities for one property, for the middle frame of the input.

4.1. Problem formulation

We frame the problem of gesture-property prediction as follows: given a sequence of speech features $\mathbf{s}_{1:T} = [\mathbf{s}_1, \ldots, \mathbf{s}_T]$, the task is to generate a sequence of corresponding binary gesture properties $\mathbf{g}_{1:T} = [\mathbf{g}_1, \ldots, \mathbf{g}_T]$. Here, $\mathbf{x}_{1:T}$ denotes indexing into a sequence of vectors $\mathbf{x}_t$ for integer $t$ in $1$ to $T$. Each speech segment $\mathbf{s}_t$ is represented by several different features, specifically acoustic features (e.g., prosody), semantic features (e.g., word embeddings), or both.

4.2. Gesture-property prediction model

Our gesture-property prediction model consists of two components that take speech audio and text as input: Speech2GestExist, which predicts the probability of making a gesture, and Speech2GestProp, which predicts the probabilities of different labels for a given gesture property. Such hierarchical models have been successful on other sequence-prediction tasks such as text-to-speech intonation generation, where first predicting the presence or absence of voicing, and then predicting voicing frequency, worked better than predicting the two aspects jointly at once wang2018autoregressive.

Detailed model specification

We implemented the Speech2GestProp and Speech2GestExist components using the same architecture, based on dilated convolutions yu2016multi for information aggregation along the time dimension. We chose convolutions instead of recurrence because no long-term memory is needed for this task, since what was said one minute ago is irrelevant for the present gesture. Dilated CNNs yu2016multi are a widely used neural-network architecture for sequence modelling, used in WaveNet vandenoord2016wavenet and WaveGlow prenger2019waveglow, and recently also adapted to human motion modelling hou2021causal.

The model inputs are sequences of audio frames and transcribed spoken words in a sliding window centred on the current time frame. Based on findings regarding the temporal synchrony between speech and gesture loehr2012temporal; pouwQuantifying, we consider the current, three past, and three future word-token feature vectors and the current and twenty past and twenty future audio frames (i.e., 1 s to either side). By sliding these windows over the input speech-feature sequences, we can make predictions for the selected gesture properties frame by frame for all times in the sequence. (For this paper, we only considered frames sufficiently far from sequence edges for all model inputs to be well defined, to avoid edge effects.) This setup makes use of future speech, which is standard in gesture generation and rarely considered a limitation since most applications do not depend on live speech. For example, the utterance-based TTS systems used by many social robots and virtual agents require the entire utterance text to be available before audio synthesis can begin.

As illustrated in Figure 3, speech audio and text are first encoded into intermediate representations (a text window embedding and an audio window embedding) using two separate neural networks, each of them containing several layers of dilated convolution. The two embeddings are then concatenated and passed into a simple fully-connected neural network (MLP). At the final layer, we map the values onto the unit interval [0, 1], since the output should indicate the probability that each relevant gesture property is present. For that, a sigmoid output nonlinearity is applied to the Gesture Category and Gesture Semantics outputs and a softmax output nonlinearity is applied to the Gesture Phase outputs. The softmax is used since different phase labels, unlike the other property categories, are mutually exclusive. From a probabilistic perspective, the use of a sigmoid for each binary property corresponds to the assumption that each property is statistically independent of the others, given the input features. This is a common modelling assumption for binary variables that are not mutually exclusive.
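The PyTorch sketch below illustrates this overall structure: two dilated-convolution encoders whose window embeddings are concatenated and decoded by an MLP. All layer sizes, dilation factors, pooling, and window lengths here are illustrative assumptions, since the actual values were selected per task by the hyperparameter search described below.

```python
import torch
import torch.nn as nn

class DilatedEncoder(nn.Module):
    """Temporal encoder: stacked 1-D convolutions with increasing dilation."""
    def __init__(self, in_dim, hidden=64, out_dim=32, n_layers=3, kernel=3):
        super().__init__()
        layers, dim = [], in_dim
        for i in range(n_layers):
            layers += [nn.Conv1d(dim, hidden, kernel, dilation=2 ** i,
                                 padding=(kernel - 1) * 2 ** i // 2),
                       nn.ReLU()]
            dim = hidden
        self.conv = nn.Sequential(*layers)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, x):                 # x: (batch, time, features)
        h = self.conv(x.transpose(1, 2))  # -> (batch, hidden, time)
        return self.proj(h.mean(dim=2))   # pool over time -> (batch, out_dim)

class GesturePropertyPredictor(nn.Module):
    def __init__(self, audio_dim=5, text_dim=301, n_labels=4, exclusive=False):
        super().__init__()
        self.audio_enc = DilatedEncoder(audio_dim)
        self.text_enc = DilatedEncoder(text_dim)
        self.decoder = nn.Sequential(nn.Linear(64, 64), nn.ReLU(),
                                     nn.Linear(64, n_labels))
        # Softmax for mutually exclusive phase labels, sigmoid otherwise.
        self.out = nn.Softmax(dim=-1) if exclusive else nn.Sigmoid()

    def forward(self, audio, text):       # audio: (B, 41, 5), text: (B, 7, 301)
        z = torch.cat([self.audio_enc(audio), self.text_enc(text)], dim=-1)
        return self.out(self.decoder(z))

model = GesturePropertyPredictor(n_labels=4)            # e.g. gesture category
probs = model(torch.randn(2, 41, 5), torch.randn(2, 7, 301))
print(probs.shape)                                       # (2, 4)
```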

Since any given gesture property is present in just a fraction of the time frames, any gesture-property predictor training data will be highly imbalanced. To mitigate this, we experimented with upsampling underrepresented classes to balance the data and also considered several different loss functions: not only the standard cross-entropy loss $-\log p_t$ (where $p_t$ is the model probability of the correct class at time $t$), but also the focal loss lin2017focal, developed to address the rarity of positive labels in common datasets, and a class-balancing version of the focal loss from cui2019class. The results are reported in Section 5.3. Each loss function is aggregated over sequences and minibatches by summing over the constituent frames.
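For reference, minimal sketches of the per-frame binary cross-entropy and the focal loss are shown below (the class-balanced variant of cui2019class additionally reweights each label by its inverse effective frequency); the gamma value and implementation details are illustrative assumptions.

```python
import torch

def binary_cross_entropy(p, y, eps=1e-7):
    """Standard cross-entropy: -log p_t of the correct class, summed over frames."""
    p = p.clamp(eps, 1 - eps)
    return -(y * p.log() + (1 - y) * (1 - p).log()).sum()

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Focal loss lin2017focal: down-weights easy, well-classified frames
    by the factor (1 - p_t)^gamma, emphasising rare positive labels."""
    p = p.clamp(eps, 1 - eps)
    p_t = torch.where(y == 1, p, 1 - p)  # model probability of the correct class
    return -((1 - p_t) ** gamma * p_t.log()).sum()

probs = torch.tensor([0.9, 0.2, 0.6])   # predicted probabilities for one label
labels = torch.tensor([1.0, 0.0, 1.0])  # ground-truth presence of that label
print(binary_cross_entropy(probs, labels), focal_loss(probs, labels))
```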

Hyperparameters

For each experiment and each model in Section 5, we conducted a separate hyperparameter search using random search bergstra2012random. Each random search consisted of 50 runs. For each run, we randomly sampled all the key hyperparameters over a predefined range for each value and trained the model for a fixed number of epochs, dependent on the task. We found no significant difference in the validation scores in the latter half of training; therefore no early stopping was used and the weights from the final epoch were used. During the hyperparameter search we varied: hidden dimensionality, number of layers, kernel size, dropout, and output embedding dimensionality for each encoder; hidden dimensionality, number of layers, and dropout for the decoder; and learning rate, batch size, and other optimisation parameters.
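A minimal sketch of such a random-search loop is given below; the value ranges and the (omitted) training call are illustrative assumptions, not the actual search space.

```python
import math
import random

SEARCH_SPACE = {
    'hidden_dim':     [32, 64, 128, 256],
    'n_layers':       [1, 2, 3, 4],
    'kernel_size':    [3, 5, 7],
    'dropout':        (0.0, 0.5),     # continuous, sampled uniformly
    'learning_rate':  (1e-4, 1e-2),   # continuous, sampled log-uniformly
    'batch_size':     [64, 128, 256],
}

def sample_config(space):
    """Draw one random hyperparameter configuration from the search space."""
    config = {}
    for name, options in space.items():
        if isinstance(options, tuple):
            low, high = options
            if name == 'learning_rate':
                config[name] = 10 ** random.uniform(math.log10(low), math.log10(high))
            else:
                config[name] = random.uniform(low, high)
        else:
            config[name] = random.choice(options)
    return config

# 50 runs per model; each configuration is trained for a fixed number of epochs
# and the one with the best mean Macro F1 over 10-fold cross-validation is kept.
configs = [sample_config(SEARCH_SPACE) for _ in range(50)]
print(configs[0])
```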

We selected the best hyperparameters based on the average Macro F1 score yang1999re over 10-fold cross-validation, and used these settings to compute the results reported in Section 5. Hyperparameters for all models in the paper are publicly available on FigShare: doi.org/10.6084/m9.figshare.15134076.

4.3. Baseline systems

For the majority of the properties we predict, no previous baseline systems or benchmark performance exist. Instead, our main starting point for baselining is the finding from kucherenko2020genea that no gesture-generation system beat a mismatched bottom line that paired speech with unrelated training-data motion. Inspired by this, we create and compare against a number of simple bottom-line systems that similarly have no dependence on the input speech. These include two constant-output systems (AlwaysZero and AlwaysOne), and two systems based on random output, either uniformly random (system UniformRandom) or random draws with the same distribution as the a-priori class abundances in the training data (system InformedRandom). Any system can be said to be better than chance if it surpasses all four of these bottom lines. Moreover, any time that happens, we say that the corresponding property is predictable from the given input features. (This is very different from being perfectly predictable, which arguably is an unrealistic goal for problems that involve human behaviour.)
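The four bottom lines can be implemented in a few lines each; below is a minimal sketch for per-frame binary labels (the label frequencies in the example are the category frequencies from Table 2; the function names follow the paper, the implementations are our own illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

def always_zero(n_frames, n_labels):
    return np.zeros((n_frames, n_labels), dtype=int)

def always_one(n_frames, n_labels):
    return np.ones((n_frames, n_labels), dtype=int)

def uniform_random(n_frames, n_labels):
    # Each label predicted present with probability 0.5, ignoring the speech.
    return rng.integers(0, 2, size=(n_frames, n_labels))

def informed_random(n_frames, train_label_freqs):
    # Each label sampled with its a-priori frequency in the training data.
    p = np.asarray(train_label_freqs)
    return (rng.random((n_frames, len(p))) < p).astype(int)

# Gesture-category frequencies (deictic, beat, iconic, discourse) from Table 2.
print(informed_random(5, [0.2905, 0.1447, 0.7203, 0.1278]))
```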

4.4. Evaluation metrics

It is well known that standard classification accuracy (one minus the error rate) does not capture overall system performance well when the data is highly unbalanced, since it may then be possible to achieve high accuracy by always predicting the majority class, regardless of the input features of the given instance. Instead, we use the F1 score as our main performance indicator. This measure is the harmonic mean of precision and recall, and is a popular evaluation measure for classification of unbalanced classes. More specifically, we use the Macro F1 score opitz2019macro, which is simply the arithmetic average of the F1 scores over all possible, mutually exclusive classes $c = 1, \ldots, C$: $\text{Macro } F_1 = \frac{1}{C} \sum_{c=1}^{C} F_{1,c}$.
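As a concrete reference, for a single binary label the Macro F1 score can be computed as follows, e.g., with scikit-learn; it is the average of the F1 for label presence and the F1 for label absence.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 0])  # ground-truth presence of one label
y_pred = np.array([0, 1, 1, 0, 0, 0, 0, 0])  # binarised model predictions

f1_present = f1_score(y_true, y_pred, pos_label=1)  # F1 for the label being present
f1_absent = f1_score(y_true, y_pred, pos_label=0)   # F1 for the label being absent
macro_f1 = f1_score(y_true, y_pred, average='macro')
print(f1_present, f1_absent, macro_f1)  # macro_f1 == (f1_present + f1_absent) / 2
```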

Note that since phase labels are mutually exclusive, while other gesture-property labels are not, phase is evaluated differently. For gesture categories and semantics we calculate separate Macro F1 scores for each label, since these labels are not mutually exclusive and are treated as independent. For gesture phase, on the contrary, we evaluate only the F1 score for each label, and not the Macro F1 score, which averages over all possible labels.

To get a better understanding of generalisation ability on our limited dataset, we used cross-validation. For each of our experiments, we report the mean and standard deviation of the selected performance measure across 20 cross-validation folds. These folds were set up such that every fold contained 5% of the data from each of the 22 people in the recordings we considered. This means that the cross-validation quantifies within-person generalisation performance, although we also looked at across-person generalisation by holding out one individual at a time (see Section 5.5).

5. Results and Discussion

We conducted several experiments, first comparing different performance metrics, and then evaluating 1) how well we can predict gesture presence, 2) which modalities are essential for predicting which gesture properties, 3) how well predictions generalise to new speakers, and more. In this section, we report and discuss the results of these experiments.

In each experiment, we vary one aspect while keeping everything else the same. Our default settings are:

  • using both speech modalities, instead of only audio or text;

  • evaluating generalisation within known speakers, instead of generalisation to new speakers;

  • training individual models for each gesture property, instead of training a single model for all properties simultaneously.

5.1. Comparison of evaluation metrics

In order to put the evaluation metric used into context, Table 1 reports the accuracy, precision, recall, and Macro F1 scores for predicting the presence/absence of (as an example) the gesture semantics property label “shape”.

Overall, Macro F1 is the preferable evaluation metric. Accuracy is misleading because it can be very high for primitive baselines (such as AlwaysZero) simply because one class is dominant over the other. Using only precision or recall is not sufficient, as each focuses only on either false negatives or false positives. As the F1 score for label presence is the harmonic mean between precision and recall, it tends to be closer to the lower of the two values (see, e.g., UniformRandom). However, the F1 score is not symmetric and strongly focuses on true positives. The Macro F1 score is computed as the arithmetic average between the F1 scores for label presence (F1 for 1) and for label absence (F1 for 0). This metric has the added advantage that chance performance is 50% even for imbalanced data, and it is the metric we choose to report in the rest of this section. The maximum achievable score, on the other hand, is likely significantly below 100%, since human gesticulation is highly stochastic.

                  Accuracy    Precision   Recall      F1 for 1    F1 for 0    Macro F1
AlwaysZero        87%±10%     0%±0%       0%±0%       0%±0%       92%±2%      46%±1%
AlwaysOne         13%±5%      13%±5%      100%±0%     22%±8%      0%±0%       11%±4%
UniformRandom     50%±1%      13%±5%      50%±2%      20%±6%      64%±5%      42%±3%
InformedRandom    77%±4%      12%±5%      12%±1%      12%±3%      86%±2%      49%±1%
our result        86%±4%      44%±12%     35%±9%      39%±9%      92%±3%      67%±5%
Table 1. A comparison between various evaluation metrics for gesture-property prediction for the gesture semantic property “shape”. Values are means ± standard deviations across cross-validation folds. The baselines are italicised; “our result” refers to our multimodal dilated CNN (BothModalities). Red colour highlights issues with the associated metrics.

5.2. Predicting gesture presence and timing

Figure 4. Macro F1 scores (means and standard deviations) for gesture presence prediction. Gesture presence is seen to be predictable regardless of the input modality used.

A point chart showing the Macro F1 scores for 4 baselines and 3 models as follows. AlwaysZero - 31.41%, AlwaysOne - 34.28%, NaiveRandom - 49.39%, Informed Random - 49.03%, AudioBased - 64%, TextBased - 71%, FullModel - 70%.

The first question we considered was whether or not it is possible to predict when to make a gesture (i.e., to predict the presence or absence of a gesture from the speech features in our dataset). The best model found by our hyperparameter search achieved a Macro F1 score of around 70% on this binary classification task. This is better than chance (see Figure 4) and agrees with results from previous work on another dataset in a different language (English) yunus2019gesture.

5.3. Experiments on machine-learning setup

We experimented with two different approaches for dealing with data imbalance, namely upsampling uncommon classes and special loss functions, as described in Section 4.2. Our experiments found no benefit to either upsampling or the special loss functions in terms of Macro F1 score: the results were slightly better for some features and slightly worse for others, but there was no major difference. We therefore use the conventional cross-entropy loss without any upsampling for all other experiments in this paper.

We also experimented with training a single model to predict all the gesture properties at once, versus training individual models for each gesture property. We found that using individual models gives a higher Macro F1 score, possibly because different tasks benefit from different hyperparameter choices; hence we model each gesture property individually in the rest of our experiments.

5.4. Evaluating text and audio contributions

                     gesture category [Macro F1]               gesture semantics [Macro F1]              gesture phase [F1]
label                deictic  beat     iconic   discourse  amount   shape    direction  size      pre-hold   post-hold  stroke    retraction  preparation
relative frequency   29.05%   14.47%   72.03%   12.78%     4.7%     13.1%    13.7%      1.9%      0.6%       12.2%      40.9%     14.8%       30.8%
AlwaysOne            23%±3%   14%±3%   41%±2%   12%±2%     5%±2%    11%±4%   12%±4%     2%±1%     –          –          –         –           –
AlwaysZero           41%±2%   46%±1%   23%±3%   46%±1%     49%±1%   47%±1%   46%±1%     49%±1%    –          –          –         –           –
UniformRandom        48%±1%   43%±2%   48%±1%   42%±1%     37%±2%   42%±3%   42%±3%     35%±1%    1.4%±0.8%  16%±3%     36%±16%   18%±4%      26%±7%
InformedRandom       50%±1%   50%±1%   50%±1%   50%±1%     50%±1%   50%±1%   50%±1%     50%±1%    1%±1.4%    14%±3%     46%±10%   16%±4%      32%±5%
AudioOnly            52%±1%   51%±2%   53%±3%   52%±2%     50%±1%   51%±1%   51%±2%     50%±1%    0%±0%      7%±3%      53%±4%    15%±4%      41%±3%
TextWithTiming       59%±3%   50%±2%   60%±4%   59%±5%     64%±9%   67%±5%   65%±4%     56%±8%    0%±0%      14%±4%     47%±3%    21%±4%      41%±4%
TextNoTiming         59%±3%   50%±3%   58%±3%   57%±3%     63%±7%   67%±5%   65%±5%     59%±10%   0%±0%      14%±6%     47%±3%    20%±5%      39%±4%
BothModalities       59%±3%   50%±2%   58%±3%   58%±4%     63%±8%   65%±4%   64%±5%     57%±9%    0%±0%      14%±5%     47%±3%    20%±4%      40%±3%
Table 2. Gesture-property prediction scores (means ± standard deviations) for all baselines, and for our trained predictors using text, audio, or both modalities. Baselines are italicised; bold, coloured numbers indicate that the given label is found to be predictable as defined in Sec. 4.3.
                       gesture category [Macro F1]               gesture semantics [Macro F1]              gesture phase [F1]
label                  deictic  beat     iconic   discourse  amount   shape    direction  size      pre-hold   post-hold  stroke    retraction  preparation
BetweenSpeaker         58%±3%   49%±2%   56%±5%   56%±4%     61%±5%   67%±7%   65%±6%     59%±11%   0%±0%      12%±8%     42%±7%    20%±6%      38%±6%
WithinSpeaker          59%±3%   50%±2%   58%±3%   58%±4%     63%±8%   65%±4%   64%±5%     57%±9%    0%±0%      14%±5%     47%±3%    20%±4%      40%±3%
WithinSpeaker + ID     60%±2%   53%±3%   62%±4%   61%±3%     64%±9%   67%±7%   64%±5%     54%±7%    0%±0%      12%±5%     54%±5%    22%±7%      45%±3%
Table 3. A comparison of prediction results (means ± standard deviations) for different cross-validation strategies. For detailed information see Section 5.5.

This experiment analysed the importance of the two input speech modalities – text and audio (prosodic features) – for predicting the gesture properties under study. We consider four different versions of our model: trained using only speech prosodic features (AudioOnly); trained using only text features, with explicit timing information for each word either included (TextWithTiming) or omitted (TextNoTiming); and trained on both speech audio and text features, including word timing (BothModalities).

As can be seen from the results in Table 2, the text features were informative and predicted gesture category and gesture semantics content better than chance. Audio features, in contrast, did not improve on the best bottom-line predictions for these gesture properties, and so did not allow better-than-chance prediction on their own. This provides evidence that audio features alone are insufficient for predicting the semantics of iconic gestures. It also indicates that even the category of the gesture cannot be predicted from audio features alone. Moreover, the combination of audio and text did not in general perform better than text on its own. Text thus appears to be a necessary input for predicting both gesture category and semantics, which strongly suggests that text input is needed in machine-learning gesture models in general. This finding can be explained by the fact that semantic information, on which gesture category strongly depends, is difficult to obtain directly from the audio; whether or not an iconic gesture is appropriate will generally depend on the semantic information in the speech.

A different story emerges for gesture-phase prediction, where audio was more helpful than text (even with word-timing information). In particular, gesture stroke could be predicted from audio noticeably better than by informed random sampling. This confirms previous findings that gesture stroke can be predicted from speech audio yunus2020sequence.

5.5. Generalising within and across speakers

Next, we evaluated gesture property prediction performance when generalising to novel speakers, versus the performance on speakers present in the training data.

Table 3 compares the prediction results of a hold-one-speaker-out cross-validation strategy (BetweenSpeaker) to our default cross-validation strategy, in which speakers are present in both the training and test sets (WithinSpeaker), and to a variant where speakers are present in both sets and the model additionally has access to speaker IDs encoded as one-hot vectors (WithinSpeaker + ID).

We observe minor performance changes when we evaluate on previously unseen speakers, but the changes are not substantial and performance remains better than chance. As gesture behaviour is highly idiosyncratic across individuals, one might a priori have expected a large drop in prediction performance. It is reassuring that the actual change is quite modest, suggesting that gesture-property predictors generalise relatively well to new speakers.

Conversely, we observe no notable difference between WithinSpeaker and WithinSpeaker + ID, indicating that predictions did not benefit from knowing the speaker label. One reason could be that each speaker only has about 10 minutes of data, which may not be enough to learn personalised correlations. As above, another related contributing factor could be that gesture behaviour is very stochastic overall.

We have also investigated the generalisation gap between training and validation performance. For brevity, the exact numbers are not reported, but the model overfits by double digits for gesture semantics and category, but not for gesture phase. It may be that gesture-phase prediction is ambiguous and difficult even on training data with the features we used.

5.6. Some prediction examples

Figure 5. Example output sequences from gesture label prediction. The x-axes show time (in seconds) and the y-axes label probability. True labels are red, predictions blue.

Four line plots showing, for 4 different labels, the predicted probabilities and the ground-truth for particular input windows lasting several seconds.

For "semantics - size", the ground truth is 0 throughout the 5 second long input window, and the example prediction is around 0 percent with a small peak going up to 20 percent then down to 0 again in the first second of the input.

The ground truth and the example prediction for "category - deictic" is similar, except the prediction has its peak at the end of the window.

The ground truth for "category - iconic" is always 1 over the 3 second long window, but the prediction is more erratic. In the first half of the input, the prediction starts from 60 percent, goes down to 50 percent, then quickly rises up and starts shifting between 80 and 100 percent. In the second half, it suddenly drops to zero and stays there for a second, then it instantly goes up to 70 percent and rises up to 80 percent in the last half second.

For "phase - stroke", the ground truth is 1 in the first second and then zero in the remaining 2 seconds, but the prediction, hovers between 50 to 60 percent throughout the entire window.

Figure 5 shows several example sequences of per-frame predicted probabilities for the presence of various gesture property labels on a held-out speech sequence. We can see both good and bad performance. We note that the predictions for gesture semantics and category are surprisingly confident, given that cross-entropy tends to promote cautious models that favour predicting numbers close to the a-priori class probabilities in the absence of compelling evidence to the contrary. This behaviour might be due to overfitting.

5.7. Do all speakers have similar predictability?

Figure 6. Macro F1 score for predicting the “direction” label across the 22 speakers considered. Note the high variation.

A bar plot showing 22 Macro F1 scores. The scores vary widely, between roughly 58 percent and roughly 76 percent.

There is a large difference in prediction performance for different speakers, as seen in the high standard deviation in most of the tables above. We show one example of the performance variation between speakers in Figure 6. We see that how well we predict the gesture semantic label “direction” changes substantially from speaker to speaker, indicating that not all speakers are equally predictable, although predictions are better than chance in all these cases.

5.8. On the effect of hyperparameters

Figure 7. Training evolution of the average Macro F1 score for gesture-category prediction across 50 different hyperparameter settings. The x-axis shows the number of update steps and the y-axis the average Macro F1 score. The total number of training steps was fixed to 25k.

A line plot with 50 lines that show the evolution of the macro F1 during training. It shows that the training is relatively stable for individual hyperparameter settings (i.e., the macro F1 rises slowly until it converges), but the performance heavily depends on the hyperparameter setting: the final performances are roughly evenly distributed between 46 and 58 percent. The difference between the hyperparameter settings tends to already manifest at 5 thousand out of 25 thousand training iterations.

Aside from the large variation between different speakers, we also observe substantial performance variation depending on the model hyperparameters. Figure 7 shows the Macro F1 score for predicting gesture category for 50 different hyperparameter runs, as described in Section 4.2. We can see that results vary greatly depending on the hyperparameters. The performance variation attributable to the hyperparameters is much greater than the difference between many conditions in our experiments, indicating that the model is sensitive to the hyperparameter settings. We therefore recommend that future work also perform a hyperparameter search to obtain reliable performance.

6. Conclusions

We have studied the extent to which 13 different gesture-property labels – mainly ones of relevance to communicative gestures – can be predicted from speech. Numerous experiments on a direction-giving dataset show that the gesture properties we considered, such as gesture semantics and gesture phases, can be predicted from speech with Macro F1 scores better than chance. Predicting gesture properties for speakers outside the training data was only slightly more challenging, suggesting that gesture-property prediction may generalise well.

Another central finding is that some gesture properties, such as gesture category and gesture semantics, can be predicted from the text transcript alone, while for others, specifically gesture phase, prosodic audio features are much more suitable.

Our 10% advantage over the chance baselines should be viewed in the context that human gestures are highly stochastic, and that state-of-the-art data-driven gesture synthesis does not compare favourably to random gesticulation kucherenko2020genea. Leveraging our gesture-property predictions to achieve semantically appropriate gestures even a fraction of the time could thus add important communicative value, and the prediction models we identify are well-suited for integration into modern data-driven gesture generation systems.

6.1. Future work

The present study opens up several directions for future research:

First, the study could be expanded, e.g., to perform more extensive architecture search and evaluate on a metric that (like human perception nirme2019motion) is less sensitive to small timing shifts than the measures in this article, which compare each frame individually.

Second, it would be interesting to perform a similar study on other datasets, e.g., in different languages or from situations other than direction giving. Also, while the SaGA dataset we used is the largest one we know that has been annotated at this high level of detail, larger datasets are also of interest as they become available.

Last, an important future goal is integrating gesture-property prediction into modern gesture-generation models, to enable more appropriate and meaningful gesture synthesis, as for example suggested in kucherenko2021speech2properties2gestures.

The authors are grateful to Stefan Kopp for providing the SaGA dataset and fruitful discussions about it, and to Olga Abramov for advising on the dataset gesture-property processing. This work was partially supported by the Swedish Foundation for Strategic Research Grant No. RIT15-0107 and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

References