Body language is an essential part of face-to-face conversations. People consciously or unconsciously use head motion, hand gestures, and facial expressions while speaking. We use these modalities for multiple purposes, including to emphasize ideas, parse sentences into smaller syntactic units, complement verbal information, and express our emotions. Therefore, incorporating naturalistic behaviors that fulfill these communication goals is important in the design of a conversational agent (CA). CAs play a relevant role in several fields including business enterprises, healthcare, entertainment, and education. Their use has also increased with new website and mobile applications, providing a great platform for virtual reality, visual aid for hearing impaired individuals, and virtual agents for online shopping.
Creating behaviors that are perceived as natural while conveying the underlying meaning of the message is challenging. Most studies in this field focus on either rule-based or data-driven systems. Rule-based systems create contextual rules to trigger behaviors, emphasizing the semantic and syntactic information [8, 17, 1]. However, the variation of the gestures generated using rules is bounded by the predefined dictionary of handmade gestures. Furthermore, scheduling the movement with speech is challenging [39, 9]. Data-driven approaches learn the behaviors directly from data. Several studies have used speech to create behaviors [4, 7, 5, 23, 21, 11, 20]. Speech prosody is highly correlated with facial expressions and head movements, so it is possible to generate behaviors that are temporally aligned with speech (rhythm, emphasis). However, speech-driven methods do not consider the meaning of the sentence. While the gesture may be perfectly aligned with speech, its meaning may contradict the message (e.g., nodding while saying “no”).
This study leverages the advantages of rule-based and speech-driven systems, bridging the gap between these methods by overcoming their drawbacks. We address this problem by constraining our speech-driven model with contextual information to generate behaviors with meaning. This approach relies on dynamic Bayesian networks (DBNs) capturing the temporal relationship between speech and gestures (in this study, hand and head motion). We introduce the constraints as an extra discrete variable that conditions the state configuration between speech and gestures, modeling the specific behavioral characteristics associated with the given constraint. We demonstrate the potential of the proposed approach with two evaluations, where the constraints are either discourse functions (negation, affirmation, question and suggestion) or predefined prototypical hand and head gestures. With the discourse functions, we aim to synthesize head movement trajectories that are commonly associated with a given discourse function (e.g., head roll for questions, head shakes for negations). Instead of creating hand-crafted rules, the proposed model learns the statistical patterns from data. For prototypical gestures, we aim to learn statistical models that generate predefined hand and head behaviors, and their joint representations with prosody features. This model plays the role of a behavior realizer in the SAIBA framework, and has the potential to be integrated into a rule-based system. We consider three prototypical hand gestures (To-Fro, So-What, and Regress) and two prototypical head gestures (Head Nod and Head Shake). The constraints are introduced as input to the system, changing the discrete variable that conditions the generated gestures. During synthesis, the models create novel realizations of these gestures that are timely synchronized with speech.
The proposed models are effective, producing realizations that are perceived more natural than the unconstrained models, bridging the gap between rule-based and speech-driven methods.
II Related Work
Several studies have proposed schemes to generate gestures, which can be categorized into rule-based and data-driven methods.
II-A Rule-Based Systems
Cassell et al. presented one of the early studies on a rule-based framework to synthesize behaviors. They defined several rules to generate appropriate behaviors dictated by the meaning of the message. In a later study, Cassell et al. introduced the behavior expression animation toolkit (BEAT), which uses text to create animations with appropriate and synchronized gestures. BEAT tags semantic labels in the text, which are mapped into appropriate behaviors by heuristic rules suggested after observing human-human nonverbal displays. The synchronization is decided based on the timing of the words in the text. Poggi et al. introduced GRETA, which is an embodied conversational agent (ECA), comprising several modules such as emotional mind, dialog manager, plan enrichment and body generator. The plan enrichment module labels text with appropriate behaviors, assigning synchronization points, which are realized by the body generator module. GRETA includes a number of predefined gestures which can be exploited to generate animations with specific communicative goals. Kopp and Wachsmuth proposed to find a prominent word or phrase to synchronize speech and gestures. The prominent words convey the communicative goal, creating anchors for the peak of the gesture. Marsella et al. proposed a framework to generate animation from speech. Their system uses an automatic speech recognition (ASR) module to get the transcriptions, which are semantically analyzed to extract communicative goals in the message. They defined a list of behaviors associated with the communicative goals, mapping the text to behaviors. Their system also analyzes emotional cues in speech, extracting the arousal level, which dictates the selection of the behaviors generated for each communicative goal.
II-B Data-Driven Systems
An alternative approach to generate behaviors is using data-driven methods that exploit the relationship between body movements and acoustic features (e.g., prosody). Vigot et al. demonstrated that there is a statistically significant correlation between prosodic features and raw body movements. Graf et al. showed that there is correlation between prosodic events and behaviors such as eyebrow and head movements. Busso et al. reported that the correlation between prosodic features and head movements across different emotions are on average more than , using canonical correlation analysis (CCA). Speech and gestures also co-occur. The study from McNeill  showed that more than 90% of the gestures occur while speaking, showing the tied connections between these modalities. These results have motivated synthesizing behaviors using speech-driven models.
Busso et al. proposed emotion dependent hidden Markov models (HMMs) to synthesize head movements with prosodic features. Mariooryad et al. investigated several dynamic Bayesian networks (DBNs) to jointly model head and eyebrow movements driven by speech, capturing the dependencies not only between speech and facial behaviors, but also between head and eyebrow motions. Some studies have argued that speech is correlated with the kinematics of the behaviors. Le et al. proposed to jointly model prosodic features and kinematic features of head motion using Gaussian mixture models (GMMs). Levine et al. presented a system to synthesize body movement using hidden conditional random fields (HCRFs), modeling the relationship between prosody and kinematic features of the joint rotations. The task was to predict kinematic parameters from speech. They use reinforcement learning to select behaviors in the database that match the inferred kinematic parameters. Bozkurt et al. designed a system for generating upper body beat gestures based on prosodic features. They clustered prosodic features into intonational phrases, and movements into gestural phrases. These units were jointly modeled using a hidden semi-Markov model (HSMM), which allowed asynchrony between the gestures and prosodic phrases by modeling the state duration of the hidden state. Chiu et al. proposed to use hierarchical factored conditional restricted Boltzmann machines (HFCRBMs), which learn how to generate the joint poses for the next frame based on the previous frame, conditioned on the prosodic features.
II-C Hybrid Approaches
Rule-based and data-driven methods have advantages and disadvantages. Rule-based methods do not capture the range of behaviors observed during human interaction, are limited by the stored behaviors, and often result in repetitive behaviors. The synchronization between behaviors and speech is challenging, since they do not learn the synchronization from natural recordings. However, they can consider the meaning of the message to derive appropriate behaviors. Speech-driven methods can capture broader variations of behaviors, learning appropriate synchronization between speech and gestures. However, they may not create appropriate behaviors that match the intended communicative goal. Using pure speech-driven methods may be enough to predict beat gestures but not iconic or metaphoric gestures which are closely related to the message. Bridging the gap between rule-based and data-driven frameworks has the potential to create behaviors that are meaningful, timely synchronized, and representative of the range of variations observed during human interaction.
Studies have attempted to combine both approaches creating hybrid frameworks. Stone et al. designed a hybrid system to generate meaningful behaviors given the text. They jointly segment audio and motion capture recordings into units expressing pre-defined communicative intents. Given an input text, they parse the input into their predefined categories, using dynamic programming to find speech and motion capture segments that have the same communicative goal. The generation with this framework is limited to the stored speech segments. Sadoughi et al. proposed to constrain a speech-driven model based on the discourse functions of the sentence to generate more meaningful head and eyebrow motion. The study was limited to only two discourse functions: affirmation and question, where the subjective evaluation of the result showed improvements for the constrained model versus the unconstrained model when the constraint was question.
II-D Contribution and Relation to our Prior Work
This paper proposes a framework to create meaningful behaviors driven by speech. This framework creates animations based not only on prosodic features, but also on constraints which are either discourse functions (i.e., semantic structure of speech), or prototypical behaviors (e.g., head nods). This study builds upon our previous work, which we summarize in this section.
Sadoughi et al. proposed a model to generate speech-driven head and eyebrow movements constrained on discourse functions. The preliminary study tested the constrained model on one session of the IEMOCAP corpus, constraining the models on two dialog acts: question, and affirmation. The subjective evaluation of the models showed that the behaviors from the constrained models are more preferable, natural and appropriate compared to the unconstrained model. In Sadoughi and Busso , we explored the idea of constraining the models using prototypical behaviors such as head nods. The models were trained with gestures directly retrieved from the corpus by providing few examples.
The models presented in our preliminary studies have several limitations. First, the variability of the generated behaviors is limited since the model optimization is susceptible to a poor initialization to reduce the mean square error, often resulting in average trajectories. Second, the structure of the proposed models requires balanced datasets per constraint, which is an unnecessary restriction. This study presents an improved speech-driven model that overcomes these problems by changing the structure of the model and the training strategy, which systematically reduces the confusion between the constraints during training. The contributions of this paper are (1) designing a constrained speech-driven model to generate more meaningful behaviors (Sec. VII-A), (2) an initialization technique which increases the range of motion associated with the behaviors (Sec. VI-C), and (3) a novel training approach to effectively learn distinct patterns associated with different constraints (Sec. VII-B).
This study aims to improve nonverbal displays of CAs using speech-driven models that are constrained by either the underlying discourse function in the message or prototypical behaviors specified by rule-based systems. In an attempt to create meaningful gestures, Marsella et al. defined several functions based on the content of the speech. These discourse related functions create a mapping between content and behaviors. Likewise, Poggi et al. designed a toolkit with several embedded functions to generate behaviors. The inputs to these mappings are the communicative goals of the utterance, which we call discourse functions. These discourse functions are associated with relevant gestures that contribute to understanding the underlying message of the speech.
Figure 1 gives the overview of our system, which takes as input speech and the underlying discourse function or intended gesture, producing meaningful behaviors that are timely synchronized with speech, convey the right message, and display the range of behaviors observed during human interactions. Our framework can take the role of behavior realizer proposed under the SAIBA framework, bridging the gap between rule-based and data-driven approaches.
We aim to answer the following questions in the context of CAs: “do discourse functions affect behaviors?” If so, “is there a principled framework to capture the characteristic behaviors associated with discourse functions?” Knowing the target gesture, “can we effectively create the gesture which is synchronized with and modulated by speech?” We address these questions by exploring four discourse functions: negation, affirmation, question and suggestion. The analysis in Section V reveals that the behaviors observed during these discourse functions are in fact different. We propose a principled speech-driven approach to capture the characteristic behaviors for each discourse function. We also propose to constrain the models with prototypical behaviors. While the framework is general, we evaluate three hand gestures and two head gestures.
This section provides a brief description of the corpus used in this study, focusing on the annotation process and the method used to retrieve target gestures.
IV-A The MSP-AVATAR Database
We collected the MSP-AVATAR corpus  to study the role of discourse functions and gestures. This corpus was collected to provide data to synthesize more meaningful and naturalistic behaviors.
The MSP-AVATAR corpus contains recordings of dyadic interactions based on improvisations of daily scenarios. It encompasses the recordings from six actors interacting in four dyadic interactions. The scenarios are carefully designed such that they include the use of eight discourse functions: contrast, confirmation/negation, question, uncertainty, suggestion, giving orders, warning, and informing. There are also scenarios prompting the actors to use iconic gestures (e.g., large, small) and deictic gestures for pronouns (e.g., “me”, “you”). The discourse functions in this corpus are carefully chosen based on previous studies [24, 28], which are likely to elicit characteristic behaviors.
The corpus consists of audio, video and motion capture recordings, collected at the Motion Capture laboratory of the University of Texas at Dallas. In each dyadic session, one of the actors wore 43 facial markers, and a suit to which we attached 28 markers (Fig. 2). Therefore, we have motion capture data for four different subjects. The facial markers include most of the feature points (FPs) in the MPEG-4 standard. For the upper body, we follow the position of the markers in the Vicon skeleton template (VST). For each of the actors, we used a Lavalier microphone (SHURE MX150) connected to a portable digital recorder (TASCAM DR-100MKII). The microphone recorded the speech at a resolution of 16 bits and a sampling rate of 44.1 kHz. We used two Sony handycams HDR-XR100, which recorded at 1,920 × 1,080 resolution. The cameras were positioned to record the frontal view of each actor, without interfering with the Vicon system. In total, we have 74 sessions with a duration of two hours and fifty-eight minutes.
IV-B Annotation of Discourse Functions
We manually annotated the 74 sessions, identifying sentences associated with discourse functions. Some of the discourse classes are harder to reliably annotate, so we only consider four classes: asking questions (question), showing agreement (affirmation), showing disagreement (negation), and making suggestions (suggestion). The annotation was conducted with Amazon Mechanical Turk (AMT), using the OCTAB interface designed by Park et al. This toolkit is suitable for segmental annotations of the videos, where annotators can mark the beginning and end of segments in the videos where they noticed the requested discourse function. To improve the quality of the annotations, our approach identifies good evaluators using a screening phase. We ask the evaluators to annotate the discourse function question, which is one of the easiest tasks. We also annotated this discourse function in our laboratory. We manually compared the annotations provided by each evaluator with our annotations, selecting the ones who provided reasonable answers. Then, we invited the selected evaluators to complete the rest of the assignments for the other three discourse functions. We recruited three annotators per assignment.
We use the method proposed by Zhou et al. to aggregate the annotations coming from different annotators. This method solves a crowdsourcing model which estimates the hidden variables relevant to the difficulty of the tasks and the hidden variables relevant to the reliability of the annotators by using the minimax conditional entropy principle. The approach not only provides the hard labels after fusion, but also gives a confidence level in the assigned label (i.e., the soft assignment). To use this method, we consider our task as a binary classification where each video frame either belongs to the target discourse function or not (30fps). We derive a soft assignment for each of the four discourse functions using the three evaluations per frame. We only consider the frames whose soft assignments are more than 0.9 for one of the discourse functions, increasing the reliability of the labeled segments. Notice that the annotated frames are not mutually exclusive among the discourse functions. If we separately considered the co-occurrences between two or more discourse functions (e.g., suggestion and question) as extra constraints, we would need enough realizations of these combinations. Unfortunately, the total durations of the co-occurrences of labels between two or more discourse functions vary from 0.5s to 310s, which is not enough. For simplicity, we enforce mutually exclusive segments by removing the overlaps, keeping as many segments as possible. This approach results in total durations of 734.4s for affirmation, 1,118.7s for negation, 1,149.1s for question, 1,582.5s for suggestion, and 6,111.7s for other.
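The thresholding step described above can be sketched as follows. This is only an illustration under stated assumptions: the per-frame soft assignments are taken as given (the minimax conditional entropy aggregation is not implemented here), and the function name `confident_segments` is hypothetical.

```python
def confident_segments(soft_scores, threshold=0.9, fps=30):
    """Group consecutive frames whose soft assignment exceeds the threshold.

    soft_scores: per-frame confidence for ONE discourse function (assumed input).
    Returns a list of (start_seconds, end_seconds) spans.
    """
    segments = []
    start = None
    for i, score in enumerate(soft_scores):
        if score > threshold and start is None:
            start = i                      # a confident run begins
        elif score <= threshold and start is not None:
            segments.append((start / fps, i / fps))
            start = None                   # the run ended at frame i
    if start is not None:                  # run reaches the end of the sequence
        segments.append((start / fps, len(soft_scores) / fps))
    return segments


# Toy scores for seven frames at 30 fps.
scores = [0.2, 0.95, 0.97, 0.99, 0.5, 0.92, 0.93]
segments = confident_segments(scores)
```

Removing overlaps between discourse functions would be a separate step applied to the spans of all four classes.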
IV-C Motion & Audio Features
The data-driven models take speech features as input, generating the most likely behaviors. This study considers head and hand gestures. We use the upper body joint rotations derived after solving the skeleton of the motion capture recordings in Blade. We consider the pitch, yaw, and roll rotations for the head (i.e., 3 degrees of freedom (DOF)), arms (3 DOF × 2) and forearms (2 DOF × 2), normalizing these features using z-normalization per subject. The sampling rate for the motion capture data is 120fps.
We extract prosodic features from speech, including the fundamental frequency, and estimate their first and second order derivatives, resulting in a 6D feature vector. These features are extracted using 40ms windows every 16.67ms with 23.3ms overlap (i.e., 60fps). We interpolate the unvoiced segments in the fundamental frequency to avoid discontinuities. The feature vector is up-sampled to match the sampling rate of the motion capture data (i.e., 120fps).
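A minimal sketch of how such a 6D vector could be assembled and up-sampled is given below. It assumes the two base contours are the fundamental frequency and intensity (the second contour is an assumption, as are the central-difference derivative and the linear interpolation); all function names are hypothetical.

```python
def delta(values):
    """Central-difference derivative per frame (one-sided at the edges)."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        out.append((values[hi] - values[lo]) / (hi - lo) if hi > lo else 0.0)
    return out


def upsample_2x(values):
    """Linear interpolation from 60 fps to 120 fps by inserting midpoints."""
    out = []
    for a, b in zip(values, values[1:]):
        out.extend([a, (a + b) / 2.0])
    out.append(values[-1])
    return out


def prosody_vector(f0, intensity):
    """Stack each contour with its first and second derivatives (6 streams)."""
    streams = []
    for base in (f0, intensity):
        d1 = delta(base)
        d2 = delta(d1)
        streams.extend([base, d1, d2])
    upsampled = [upsample_2x(s) for s in streams]
    return [list(frame) for frame in zip(*upsampled)]


# Three frames at 60 fps become five frames at 120 fps.
frames = prosody_vector([100, 110, 120], [60, 62, 64])
```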
IV-D Prototypical Behaviors
This study demonstrates that it is possible to create prototypical behaviors using data-driven models. While the framework is general, we only consider three prototypical behaviors for hand and two prototypical behaviors for head movements. Figure 3 illustrates the target hand gestures. The behaviors are defined as follows:
Head Nod: One or more pitch rotations of head.
Head Shake: One or more yaw rotations of head.
To-Fro: Movement of both hands from side to side.
So-What: Movement of both hands in an arc in an outward manner.
Regress: Movements of hands in circles towards the body.
The data-driven models require enough examples of these gestures to be trained effectively. We use the supervised framework introduced by Sadoughi and Busso to automatically retrieve these instances from the dataset. The key idea is to annotate a few examples of the target behavior, and retrieve the rest of the segments until we have enough data to train the models. The approach simultaneously solves the segmentation and detection of the target gestures. The first step is downsampling the motion capture sequences using clusters. This is a nonuniform downsampling approach that discards segments without variations while keeping changes in the trajectories. Then, we use a multi-scale sliding window framework that considers windows of different sizes, accounting for variation in the duration of the gestures. The next step is to determine whether the selected segments include the target gesture, which is done in two stages. In the first stage, we screen the segments using a one-class support vector machine (SVM), which reduces the potential segments, removing everything that departs from the trajectories of the training examples. The second stage uses the dynamic time alignment kernel (DTAK) algorithm to evaluate the candidate segments in more detail. For DTAK, we use the implementation provided by Zhou et al.
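The multi-scale sliding window step can be illustrated with a short sketch. The window sizes and hop below are placeholder values, not the settings used in the study, and the function name is hypothetical; the screening with the one-class SVM and DTAK is not shown.

```python
def multiscale_windows(n_frames, window_sizes, hop):
    """Enumerate candidate (start, end) frame spans for several window sizes."""
    candidates = []
    for size in window_sizes:
        # Slide a window of this size over the sequence with a fixed hop.
        for start in range(0, n_frames - size + 1, hop):
            candidates.append((start, start + size))
    return candidates


# Ten frames, two window sizes, hop of two frames.
candidates = multiscale_windows(10, [4, 8], 2)
```

Each candidate span would then be passed to the two-stage detector.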
In Sadoughi and Busso, we set the detection threshold by maximizing the F-score. However, for this study it is more important that the selected segments are indeed from the target gestures (i.e., the recall rate is less important). Therefore, in this study we set the detection thresholds per subject by maximizing the precision on the development set.
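As an illustration of precision-driven threshold selection, the sketch below scans candidate thresholds over detection scores from a development set. The score and label lists, the tie-breaking rule (keep the lowest threshold that reaches the best precision), and the `min_detections` parameter are assumptions made for the example.

```python
def precision_at(threshold, scores, labels):
    """Precision of the detections whose score is at or above the threshold."""
    detected = [lab for s, lab in zip(scores, labels) if s >= threshold]
    if not detected:
        return None
    return sum(detected) / len(detected)


def pick_threshold(scores, labels, min_detections=1):
    """Return (threshold, precision) maximizing precision on development data."""
    best = None
    for t in sorted(set(scores)):
        n_detected = sum(1 for s in scores if s >= t)
        if n_detected < min_detections:
            continue
        p = precision_at(t, scores, labels)
        # Strict '>' keeps the lowest threshold among ties, retaining more segments.
        if best is None or p > best[1]:
            best = (t, p)
    return best


dev_scores = [0.2, 0.4, 0.6, 0.8, 0.9]
dev_labels = [0, 1, 0, 1, 1]   # 1 = segment truly contains the target gesture
best = pick_threshold(dev_scores, dev_labels)
```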
We manually annotated three sessions per subject to evaluate the behavior retrieval framework. Table I gives the number of examples annotated per target behavior in these 12 sessions (column #). These 12 sessions are partitioned into development (two sessions per speaker) and test (one session per speaker) sets using three-fold cross-validation. The development set is exclusively used to set the detection thresholds. Table II shows the accuracy of the behavior retrieval framework. The precision rates for head gestures are higher than 96%. For hand gestures, the precision rates are higher for so-what (80%) and regress (90.5%). The precision rate is lower for to-fro, since the behaviors are more complex.
Since our algorithm independently solves the detection of gestures, it is possible to have overlaps between two or more target gestures. We observe that the durations of these overlaps are 552.4s for head gestures, and from 30.2s to 91.3s for hand gestures. Similar to the annotation of discourse functions, we remove the overlapping segments, resulting in mutually exclusive segments for hand and head gestures. For head gestures, we identify 1029.9s for shake and 2056.3s for nod. The remaining frames are labeled as other (7484.1s). For hand gestures, we identify 201.6s for so-what, 448.6s for to-fro, and 567.3s for regress. The remaining frames are labeled as other (9352.7s).
V Statistical Analysis of the Constraints
Before constraining our models on either discourse functions or target behaviors, we explore whether the hand and head behaviors vary across different categories of the constraints (e.g., differences in head motion for questions and affirmations). If the discourse functions or target behaviors had no effect on the behaviors, there would be no value in constraining our speech-driven model, so this analysis is relevant.
We use statistical tests to evaluate whether the presence of different discourse functions creates a significant difference in the behaviors. We consider pitch, yaw, and roll rotations for the head and joint rotations for the arms and forearms. The Kolmogorov-Smirnov test indicates that the distribution of the data is not normal, so we rely on the Kruskal-Wallis (KW) test for the evaluation, asserting significance at p-value < 0.05. When the KW test rejects the null hypothesis, meaning that at least two of the distributions are not the same, we use Dunn & Sidák’s approach (DSA) to perform pairwise comparisons to identify which pairs are different. Table III gives the results for the pairs whose distributions are statistically different. The behaviors associated with the four discourse functions are different across different regions. For instance, negation and affirmation show differences in head yaw and pitch rotations, but not in roll rotation. Also, there are differences between arm rotations for both hands when the sentence is either an affirmation or a suggestion.
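For readers unfamiliar with the KW test, the following sketch computes its H statistic from ranked data. It is a simplified illustration: the tie correction is omitted, the toy groups are invented, and the comparison against a chi-square critical value relies on the usual large-sample approximation. The pairwise Dunn and Sidák step is not shown.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, offset = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[offset:offset + len(g)])
        total += rank_sum ** 2 / len(g)
        offset += len(g)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


# Toy rotation values for three hypothetical categories.
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Chi-square critical value at the 0.05 level with 2 degrees of freedom.
reject_null = h_stat > 5.991
```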
We also analyze how discourse functions affect the prototypical behaviors for head and hand gestures. This analysis considers the frames assigned to each discourse function for each of the five prototypical gestures. Figure 4 displays the normalized distribution of the target behaviors per discourse function, which is estimated by normalizing by the number of frames assigned to each discourse function (the distribution sums to 1 in each subfigure). This normalization is necessary, since some discourse functions are more frequent than others. The figure displays separate results for head and hand movements. These histograms reveal some interesting relationships between discourse functions and behaviors. For example, the proportion of head shake increases for negation, whereas the proportion of head nod increases for affirmation. Some intuitive results are that the prototypical behavior so-what occurs more often during questions, regress occurs more often during suggestions, and to-fro occurs more often during questions. These histograms demonstrate that there are differences between the behaviors associated with discourse functions, which our models aim to capture.
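One way to compute such a normalized distribution is sketched below. The exact normalization used for the figure is not fully specified in the text, so this example assumes frame counts are divided by the total number of frames of each discourse function; the frame labels are invented.

```python
from collections import Counter


def gesture_distribution(frames):
    """frames: iterable of (discourse_function, gesture) labels, one per frame.

    Returns, for each discourse function, the fraction of its frames
    assigned to each gesture.
    """
    counts = {}
    for function, gesture in frames:
        counts.setdefault(function, Counter())[gesture] += 1
    return {
        function: {g: c / sum(ctr.values()) for g, c in ctr.items()}
        for function, ctr in counts.items()
    }


toy_frames = (
    [("negation", "shake")] * 3
    + [("negation", "nod")]
    + [("affirmation", "nod")] * 2
)
dist = gesture_distribution(toy_frames)
```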
VI Baseline Model
We consider speech-driven methods built with dynamic Bayesian networks (DBNs). This section introduces the original DBN proposed by Mariooryad and Busso, which is the building block of the proposed models. This DBN framework also serves as a baseline for our models.
Figure 5 illustrates the baseline model, which was referred to as jDBN3 in Mariooryad and Busso. This structure was the best model to jointly capture not only the relation between speech and facial features, but also the relation between facial features. In the diagram, the circle nodes represent the observation variables and the rectangle nodes represent the hidden variables. In our model, the node Speech represents the prosodic features and the node Motion represents either hand or head motions. The nodes Speech and Motion are continuous variables and are modeled with Gaussian distributions. The hidden discrete state variable represents the state configuration between speech features and the gesture. It serves as a discrete codebook constraining the speech and gesture space. The transition matrix between the hidden variables is ergodic, where the transition probabilities follow the Markov property of order one. The time unit of the DBN is the time frame in the data (120fps).
In this model, the nodes Speech and Motion are conditionally independent given the hidden state variable. When the speech features are entered in the model, Bayesian inference updates the marginal probabilities of the state configuration node, affecting the node Motion. This model preserves the full dependencies of the features within a modality by having full covariance matrices. This section describes the inference and synthesis method, emphasizing the improvement presented in this study, which enhances the range of movements synthesized by the model.
There are differences in the inference process for learning and synthesizing the gestures. During learning, we have access to the observations for the nodes Motion and Speech, so we use the full observation probability in Equation 1. During synthesis, we only have observations for the variable Speech, and the task is to predict the variable Motion. Therefore, we use the partial observation probability in Equation 2.
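The two observation probabilities are not reproduced in this text. Under the model as described (Gaussian Speech and Motion nodes that are conditionally independent given the hidden state), they could take the following form. This is a sketch with assumed notation, not the authors' equations: $s_t$ and $m_t$ denote the speech and motion observations at frame $t$, $j$ indexes the hidden state, and $\mu_j$, $\Sigma_j$ are the state-dependent Gaussian parameters of each modality.

```latex
% Full observation (learning): both modalities are observed.
b_j^{\mathrm{full}}(s_t, m_t) =
  \mathcal{N}\!\left(s_t;\, \mu_j^{s}, \Sigma_j^{s}\right)\,
  \mathcal{N}\!\left(m_t;\, \mu_j^{m}, \Sigma_j^{m}\right)

% Partial observation (synthesis): only speech is observed.
b_j^{\mathrm{partial}}(s_t) =
  \mathcal{N}\!\left(s_t;\, \mu_j^{s}, \Sigma_j^{s}\right)
```

The product in the first expression follows directly from the conditional independence of the two nodes given the state.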
During synthesis of the Motion variable, we find the probabilities of the states at each time frame given the partial observation sequence and the parameters of the model (Eq. 3), using the Viterbi algorithm. Equation 4 calculates the expected value for the node Motion given the speech features, using the mean of each state for the variable Motion.
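The synthesis step can be illustrated with a small, self-contained sketch. Several simplifications are assumptions made for readability: speech and motion are one-dimensional, the per-frame state probabilities are obtained with a scaled forward-backward pass (a swapped-in choice; the text names the Viterbi algorithm), and the expected motion is the posterior-weighted sum of the state means. All names and toy parameters are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def state_posteriors(speech, prior, trans, speech_means, speech_vars):
    """Per-frame state posteriors given speech only (scaled forward-backward)."""
    n_states, n_frames = len(prior), len(speech)
    b = [[gaussian_pdf(x, speech_means[j], speech_vars[j]) for j in range(n_states)]
         for x in speech]
    alpha = []
    for t in range(n_frames):
        if t == 0:
            row = [prior[j] * b[0][j] for j in range(n_states)]
        else:
            row = [b[t][j] * sum(alpha[t - 1][i] * trans[i][j] for i in range(n_states))
                   for j in range(n_states)]
        norm = sum(row)
        alpha.append([v / norm for v in row])
    beta = [[1.0] * n_states for _ in range(n_frames)]
    for t in range(n_frames - 2, -1, -1):
        row = [sum(trans[i][j] * b[t + 1][j] * beta[t + 1][j] for j in range(n_states))
               for i in range(n_states)]
        norm = sum(row)
        beta[t] = [v / norm for v in row]
    gamma = []
    for t in range(n_frames):
        row = [alpha[t][j] * beta[t][j] for j in range(n_states)]
        norm = sum(row)
        gamma.append([v / norm for v in row])
    return gamma


def expected_motion(gamma, motion_means):
    """Posterior-weighted average of the state means of the Motion node."""
    return [sum(g * m for g, m in zip(row, motion_means)) for row in gamma]


# Two toy states: low speech value <-> motion -1, high speech value <-> motion +1.
gamma = state_posteriors(
    speech=[0.1, 0.0, 5.2, 4.9],
    prior=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.1, 0.9]],
    speech_means=[0.0, 5.0],
    speech_vars=[1.0, 1.0],
)
motion = expected_motion(gamma, motion_means=[-1.0, 1.0])
```

Because the output is an average over state means, states that collapse toward the same mean yield a narrow range of motion.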
VI-C Initialization of the States using Vector Quantization
The parameters of the models are learned with conventional expectation maximization (EM). Since EM finds a local optimum, the initialization is very important. In Mariooryad and Busso, we randomly initialized the models. Since the generated behaviors correspond to the expected values given the speech features (Eq. 4), the states may converge to the average position of the behaviors, reducing the range of behaviors generated by the model. Figure 6(a) visualizes this problem. The figure shows the distribution of the original data for two head angles. The ellipses represent the 16 states after training the models with random initialization. Each ellipse is centered at its mean vector and shaped according to its covariance matrix. The figure shows that all the clusters converge to the origin, limiting the range of behaviors in the models (e.g., limited variability in the codebook for the speech-behavior space). To address this problem, we increase the representation of the initial states by using the Linde-Buzo-Gray vector quantization (LBG-VQ) technique. After splitting the data into clusters, we initialize the statistical properties of the states (mean vectors and covariance matrices) using the results of the LBG-VQ. This approach leads to sparser states, which increase the range of behaviors generated by the models. Figure 6(b) shows the final 16 states achieved by the VQ-based initialization for two angles of the head position. The figure shows that the VQ-based initialization reaches a better representation of the data.
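A minimal sketch of an LBG-style codebook, as it might be used to initialize the state means, is shown below. It is not the authors' implementation: the additive splitting perturbation `epsilon`, the fixed number of refinement iterations, and the function names are assumptions, and the codebook size is expected to be a power of two (e.g., 16). Covariance matrices would be estimated from the points assigned to each code vector, which is omitted here.

```python
def mean_vector(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def nearest(point, codebook):
    """Index of the code vector closest to the point (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in codebook]
    return dists.index(min(dists))


def lbg_vq(points, n_codes, epsilon=0.01, n_iters=20):
    """Grow a codebook by repeated splitting and k-means style refinement."""
    codebook = [mean_vector(points)]
    while len(codebook) < n_codes:
        # Split every code vector into two slightly perturbed copies.
        codebook = [[v + s * epsilon for v in c]
                    for c in codebook for s in (1.0, -1.0)]
        for _ in range(n_iters):
            clusters = [[] for _ in codebook]
            for p in points:
                clusters[nearest(p, codebook)].append(p)
            # Empty clusters keep their previous code vector.
            codebook = [mean_vector(cl) if cl else c
                        for cl, c in zip(clusters, codebook)]
    return codebook


# Two well-separated toy clusters of 2-D points.
toy_points = [[0, 0], [0, 1], [10, 10], [10, 11]]
codebook = lbg_vq(toy_points, n_codes=2)
```

Starting EM from code vectors spread over the data, instead of random values, is what keeps the states from collapsing toward the global mean.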
We smooth the trajectories generated by this network following the approach proposed by Busso et al. The method selects equidistant key-points. The values of the joint rotations at these key-points are transformed into their quaternion representation, where they are interpolated. The interpolation connects the key-points, providing smooth transitions. We implement this method using 12 key-points per second for the hand motion, and 15 key-points per second for the head motion.
VII Proposed Constrained Models
This section describes the proposed model built upon the improved version of the DBN proposed by Mariooryad et al. described in Section VI. The key goal is to introduce constraints to generate meaningful behaviors. The constraints are either discourse functions or predefined prototypical gestures. The discourse function constraints bridge the gap between rule-based and data-driven system. The prototypical gesture constraints can serve as the behavior realizer in rule-based systems, capturing the intrinsic variability of each gesture, while preserving their temporal coupling with speech.
VII-A Adding Constraints to the DBN Model
Figure 7 illustrates the constrained model proposed in this study, which we refer to as Constrained DBN (CDBN). The key addition with respect to the baseline model is the node Constraint, which is introduced as a parent of the hidden state variable. With this additional node, the state variable is directly conditioned on the given constraint, affecting the relationship between gesture and speech. Effectively, this model has transition matrices, prior probabilities, and state prior probabilities for each constraint, learning the intrinsic characteristics of the gestures conditioned on the given constraint.
This structure is different from the model proposed in our previous work, where the constraint was introduced as a child. By adding the constraint node as a parent, we obtain the following advantages: (1) we separately model the prior probabilities of the constraints and their effect on the hidden states, (2) we handle constraint categories with unbalanced training data, and (3) we model a more reasonable cause-effect relationship between the variables.
The constraint added to the baseline model is a discrete observation node, representing the presence of a given constraint for each frame. We add the label other when the constraint is not specified as an input. Equation 5 explicitly highlights that the transition probabilities from the previous state to the current state depend on both the previous state and the current constraint:

$$a^{(k)}_{ij} = P(q_t = j \mid q_{t-1} = i, c_t = k) \qquad (5)$$

where $q_t$ is the state at time $t$, $c_t$ is the constraint at time $t$, and $a^{(k)}_{ij}$ is the transition probability between the $i$-th and $j$-th states when the constraint is $k$. During synthesis, the partial observation for this model includes Speech and Constraint. Equation 6 defines the expected value for the node Motion using partial inference. In this equation, the constraint term represents the constraint sequence for the whole turn, meaning that the expected motion at a given frame depends not only on the constraint at that frame, but also on the constraints in the rest of the turn.
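The role of the constraint as a parent of the hidden state can be sketched with a small forward-backward routine in which the transition matrix used at each frame is selected by the constraint label of that frame. This is an illustrative sketch under simplifying assumptions that are not taken from the paper: speech is modeled with one diagonal Gaussian per state, motion is assumed conditionally independent of speech given the state, and the names `expected_motion`, `A`, `prior`, `mu_s`, `var_s`, and `mu_m` are hypothetical.

```python
import numpy as np

def expected_motion(speech, constraints, A, prior, mu_s, var_s, mu_m):
    """Expected motion per frame given speech and a constraint sequence.

    speech:      (T, Ds) prosodic features
    constraints: length-T sequence of integer constraint labels
    A:           (K, N, N) one transition matrix per constraint
    prior:       (K, N) initial state distribution per constraint
    mu_s, var_s: (N, Ds) per-state speech mean and diagonal variance
    mu_m:        (N, Dm) per-state motion mean
    """
    T, N = len(speech), mu_s.shape[0]
    diff = speech[:, None, :] - mu_s[None, :, :]
    log_b = -0.5 * (np.log(2 * np.pi * var_s)[None] + diff ** 2 / var_s[None]).sum(axis=2)
    B = np.exp(log_b - log_b.max(axis=1, keepdims=True))   # scaled likelihoods

    alpha = np.zeros((T, N))
    alpha[0] = prior[constraints[0]] * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        # The constraint at frame t picks the transition matrix.
        alpha[t] = (alpha[t - 1] @ A[constraints[t]]) * B[t]
        alpha[t] /= alpha[t].sum()

    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A[constraints[t + 1]] @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()

    gamma = alpha * beta                                    # state posteriors
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma @ mu_m
```

Because the backward pass propagates future frames, the posterior at each frame reflects the constraint labels of the whole turn rather than only the current one.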
VII-B Training Sparse Transition Matrices
The characteristic patterns associated with each constraint are captured by the constraint-dependent transition matrices. If these transition probabilities are similar, the behaviors generated after imposing the constraints will also be similar, and the model will fail to generate the characteristic patterns analyzed in Section V. As a result, we want to increase the differences in the transition probabilities assigned to each constraint. For this purpose, we propose a novel training approach to make the conditional transition matrices sparse. First, we create $N$ states per constraint in the hidden state node, which are separately trained using the data associated with the given constraint (e.g., data annotated with either discourse functions or prototypical gestures). These states capture the characteristic patterns for each constraint. If we have $K$ constraints, this step will generate $K \times N$ states. Using all these states is not practical since it unnecessarily increases the number of states in the hidden state node and, therefore, the number of parameters. Furthermore, many of these states are redundant. Instead, we merge similar states, creating shared states and constraint-specific states. We merge similar states using the Kullback-Leibler divergence (KLD). Since each state is a multivariate Gaussian distribution, we use Equation 7 to find similar states:

$$D_{KL}(p_i \,\|\, p_j) = \frac{1}{2}\left[\operatorname{tr}\left(\Sigma_j^{-1}\Sigma_i\right) + (\mu_j - \mu_i)^{T}\Sigma_j^{-1}(\mu_j - \mu_i) - d + \ln\frac{|\Sigma_j|}{|\Sigma_i|}\right] \qquad (7)$$

where $p_i$ and $p_j$ are the multivariate conditional Gaussian distributions for states $i$ and $j$, with covariance matrices $\Sigma_i$ and $\Sigma_j$, and mean vectors $\mu_i$ and $\mu_j$, and $d$ is the dimension of the Gaussian. First, we select all the states associated with a constraint. For each of them, we find the closest state from the states associated with other constraints, as determined by the KLD metric. If the divergence is less than a threshold (empirically set to 1), we merge the states, which become a shared state across constraints. We sequentially repeat this process for the states of each of the constraints. Finally, we create a new state which is shared between all the constraints to allow transitions between the constraints. The resulting conditional transition matrix for each constraint is sparse, allowing only transitions between the states associated with that constraint plus the additional state shared across constraints. This is the initialization phase for the model, and the parameters are refined afterward using EM. Figure 8 gives an illustration of the states for a model with 3 constraints and 5 states per constraint. States 1, 2, 3, 5 and 8 are shared across more than one constraint, and states 4, 6, 7, and 9 are exclusive.
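The merging step can be sketched as follows. The divergence is the standard closed form for two multivariate Gaussians. The way two merged states are combined (moment matching of an equal-weight mixture), the `State` container, and the function names are assumptions made for this example, and the extra state shared by all constraints is not included.

```python
from dataclasses import dataclass
import numpy as np

def gaussian_kld(mu_i, cov_i, mu_j, cov_j):
    """KL divergence D(p_i || p_j) between two multivariate Gaussians."""
    d = mu_i.shape[0]
    inv_j = np.linalg.inv(cov_j)
    diff = mu_j - mu_i
    _, logdet_i = np.linalg.slogdet(cov_i)
    _, logdet_j = np.linalg.slogdet(cov_j)
    return 0.5 * (np.trace(inv_j @ cov_i) + diff @ inv_j @ diff - d
                  + logdet_j - logdet_i)

@dataclass(eq=False)          # identity-based equality, safe with arrays
class State:
    mean: np.ndarray
    cov: np.ndarray
    constraints: set          # constraints that may use this state

def merge_similar_states(states, threshold=1.0):
    """Greedy merge of states across constraints (modifies the inputs)."""
    pool = list(states)
    all_constraints = sorted(set().union(*(s.constraints for s in states)))
    for c in all_constraints:
        for s in [s for s in pool if c in s.constraints]:
            # Candidates are states that belong only to other constraints.
            others = [o for o in pool if c not in o.constraints]
            if not others:
                continue
            dists = [gaussian_kld(s.mean, s.cov, o.mean, o.cov) for o in others]
            j = int(np.argmin(dists))
            if dists[j] < threshold:
                o = others[j]
                diff = s.mean - o.mean
                # Assumed merge rule: moment matching of the two Gaussians.
                s.cov = 0.5 * (s.cov + o.cov) + 0.25 * np.outer(diff, diff)
                s.mean = 0.5 * (s.mean + o.mean)
                s.constraints |= o.constraints
                pool.remove(o)
    return pool
```

States whose `constraints` set ends up with more than one label play the role of shared states, and a transition matrix for a given constraint would only allow transitions among the states that list that constraint.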
To illustrate the importance of these sparse transition matrices, we compare the transition matrices when (1) the states are shared across constraints, and (2) the states per constraint are defined using the proposed approach. We estimate these transition matrices for head shakes, head nods, and other, using . The average distance between the transition matrices conditioned on Head Nods (A) and Head Shakes (B) is 0.018 for option 1 (shared states) and 0.96 for option 2 (sparse matrices). This result shows that the proposed training approach is more successful at capturing the differences between constraints. Therefore, we rely on this approach for training the CDBN models.
VIII Experiments & Results
This section reports the experiments and the results of the evaluation of the proposed models. The baseline model is the improved framework proposed by Mariooryad and Busso  (Sec. VI). We compare the models using objective and subjective metrics. We separately train the models for head and hand, since the gesture constraints do not necessarily coincide. The evaluation relies on ten-fold cross-validation to maximize the use of the corpus.
VIII-A Optimization of Number of States
Before training the models, we need to determine the number of states. Optimizing this parameter is computationally expensive since the CDBN models need to be trained multiple times as we vary the number of states, repeating the approach for each fold in the cross-validation process. Therefore, we simplify the evaluation by setting one of the ten partitions as a validation set. We use the other nine partitions to train the models. This process is conducted once, using the optimal parameters for the rest of the evaluation.
We use two objective metrics. The first metric is the average canonical correlation analysis (CCA) between the original trajectory of the behaviors and the synthesized movements, computed separately for hand and head motion. CCA projects two multidimensional signals into a common space where their correlation is maximized. The value ranges between 0 and 1, where 1 implies perfect correlation and 0 implies no correlation between the variables. We estimate the CCA per turn, reporting the average results. The second metric is the average log likelihood rate (LLR) of the model ($\frac{1}{T}\log P(O \mid \lambda)$), where $T$ is the number of frames, and $P(O \mid \lambda)$ is the observation probability given the model parameters $\lambda$. We determine the number of states such that the CCA and the LLR of the model are both high.
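As an illustration of the first metric, the canonical correlations between two multidimensional sequences can be computed from orthonormal bases of the centered signals. This is a sketch under assumptions not stated in the text: it reports the first (largest) canonical correlation of each turn before averaging across turns, it assumes both signals have full column rank, and the function names are hypothetical.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between X (T, Dx) and Y (T, Dy), in descending order."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases of the column spaces of the centered signals.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def average_cca(turn_pairs):
    """Mean of the first canonical correlation over (original, synthesized) turns."""
    return float(np.mean([canonical_correlations(x, y)[0] for x, y in turn_pairs]))
```

A value close to 1 for a turn indicates that some linear projection of the synthesized trajectory closely follows a projection of the original one.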
We separately estimate the number of parameters for each model (baseline models for head and hand, CDBN models for head and hand with discourse and gesture constraints). Figure 9 shows an example of the changes observed for the LLR and CCA in the validation set for the CDBN model for head and hand motions constrained on discourse functions. This figure shows that the CCA and LLR values start to saturate when the number of states is 7 for head and 8 for hand. We obtain similar figures for the other cases, not reported in the paper, which we use to set the optimal number of states. Tables IV and V provide the number of states per constraint for all the conditions. We use these parameters for the rest of the objective and subjective evaluations.
VIII-B Objective Evaluation
For the objective evaluation, we use a ten-fold cross-validation approach. We avoid using the partition used for validation as a test set. Therefore, we only consider nine folds, where the test set in each fold is one of the remaining nine partitions. After selecting the test set, we form the training set with the other nine partitions, including the partition used for validation.
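The fold construction described above can be sketched in a few lines. The partition indices and the choice of which partition serves as the validation set are hypothetical.

```python
def evaluation_folds(n_partitions=10, validation=0):
    """Return (train_partitions, test_partition) pairs for the evaluation.

    The validation partition is never used as a test set, but it is part
    of the training set of every fold.
    """
    folds = []
    for test in range(n_partitions):
        if test == validation:
            continue                      # skip the validation partition
        train = [p for p in range(n_partitions) if p != test]
        folds.append((train, test))
    return folds
```

With ten partitions this yields nine folds, each with nine training partitions and one test partition.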
We separately evaluate the models constrained on discourse functions (CDBN-Dis) and prototypical gestures (CDBN-Ges) by comparing the generated trajectories with the ones from the baseline model (Sec. VI). We evaluate the generated movements in terms of the CCA between the original and generated motion sequences, and the CCA between the generated motion sequences and the speech sequence. We also use the KLD, which measures the amount of information lost when one distribution is used to represent another. We evaluate the KLD between the distribution of the synthesized movements and the distribution of the original movements. Ideally, this value should be as small as possible, indicating that the generated movement sequences have similar distributions to the original motion sequences.
As a reference, the CCA between the original head motion and speech is , and the CCA between the original hand motion and speech is . These high correlation values show the strong coupling between the joint movements and the prosodic features. Tables IV and V give the results for the synthesized head and hand movements, respectively. The results show that the model constrained on discourse functions achieves a higher CCA than the baseline model for the hand region (t-test). The results also indicate that the constrained models for the head region give a lower CCA between the generated motion and speech than the baseline model (t-test). While the constrained models reduce the coupling between the generated trajectory and speech, their values are still higher than the coupling between the original head movement and speech. As demonstrated by the subjective evaluation, the movements are more natural and appropriate when we constrain the models with discourse functions. The results for the KLD show improvements for all the constrained models compared with the baseline models. The distributions of the generated behaviors are closer to the distributions of the original trajectories, compared to the baseline model.
VIII-C Subjective Evaluations
This section reports the subjective evaluations of the behaviors generated with the proposed models. We use the Smartbody toolkit for rendering the movements, where the only variables that we control are the hand gestures and head motion. Everything else is kept consistent across conditions (e.g., facial expressions). We separately evaluate the models constrained on either discourse functions or prototypical gestures. The train and test partitions are the same as the ones described in Section VIII-B.
In the first part of the subjective evaluation, we constrain the models on the discourse functions (Sec. IV-B). We evaluate the perceived appropriateness and naturalness of the movements generated by the baseline model and the CDBN model constrained on the discourse functions. For each constraint, we randomly selected 10 segments labeled with the corresponding discourse function. To provide enough context, we include the speaking turn preceding the selected turn. The animation is idle when the CA is listening to the other speaker (the MSP-AVATAR corpus consists of dyadic scenarios).
We use Amazon Mechanical Turk (AMT) for the perceptual evaluations, using the interface shown in Figure 10. We display the questions after the video is played to ensure that the evaluators do not answer the questions before watching the video. We randomize the order of the videos for each evaluator. We only allow workers from the United States with an overall acceptance rate of more than 80%.
We evaluate 120 segments (4 discourse functions × 3 conditions × 10 videos) animated using (1) the original motion capture data (Original), (2) the baseline model (Baseline), and (3) the CDBN models constrained on discourse functions (CDBN). Each video was annotated by five different evaluators. We asked the subjects to rate the animations in terms of naturalness and appropriateness of the movements using a five-point Likert-like scale (Fig. 10). In total, we have 15 evaluators, where nine are females and six are males (average age is 31.1).
0.48. The Kruskal-Wallis test shows that the videos synthesized by the three conditions are different. The pairwise comparisons of the results are denoted in the figures with color-coded asterisks. The color of the asterisk indicates that the given condition is statistically higher than the condition associated with the bar with the given color (we assert significance at p-value < 0.05). The pairwise comparison of the results shows that the original motion capture recordings are perceived as more natural and appropriate than the animations synthesized by both models. However, the CDBN models are perceived with a higher level of appropriateness and naturalness than the baseline models. The difference is statistically significant for naturalness.
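For reference, the Kruskal-Wallis statistic used to compare the three conditions can be computed as in the sketch below. It is a plain re-implementation for illustration, not the analysis script of the study; the function names are hypothetical, and the p-value helper relies on the chi-square approximation with two degrees of freedom, so it only applies when exactly three groups are compared.

```python
import math

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction for a list of samples."""
    pooled = sorted((v, g) for g, vals in enumerate(groups) for v in vals)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0          # mean of ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = 12.0 / (n * (n + 1)) * sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    h -= 3 * (n + 1)
    correction = 1.0 - tie_term / (n ** 3 - n)
    return h / correction if correction > 0 else 0.0

def p_value_three_groups(h):
    """Chi-square survival function with 2 degrees of freedom (three groups)."""
    return math.exp(-h / 2.0)
```

With Likert-like ratings, many values are tied, which is why the tie correction is included.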
We also analyze the performance of the constrained model per discourse function. Figure 12 gives the results. With the exception of questions, the CDBN model improves the perception of naturalness and appropriateness over the baseline models. The differences are statistically significant for affirmation, where the values are slightly higher than the ones for the videos rendered with the original sequences. The consistency in the results reveals that the proposed models and training strategy can effectively capture the range of behaviors characteristic of the given constraint.
In the second part of the subjective evaluation, we constrain the models on the prototypical behaviors (Sec. IV-D). We synthesize 60 segments per gesture (i.e., where the constraint is the target gesture). We randomly choose these segments from the fully annotated sessions. We find the accuracy per gesture by watching the animations generated for these segments, where a success is considered when the generated behavior matches the target gesture. Table VI gives the accuracy for different head and hand gestures. The generated head gestures for Nod and Shake match the target gesture more than 80% of the time. This high accuracy demonstrates the benefits of the proposed constrained models. For hand movements, the gesture So-What has the highest accuracy (85%). The accuracy for To-Fro and Regress is not as high. We hypothesize that the lower accuracy for Regress may be due to the high variability observed in the training samples. For To-Fro, the result may be related to the lower precision of the samples retrieved for this prototypical behavior (Table II). The results from these evaluations are encouraging, although there is room for improvement.
This paper explored the idea of introducing constraints in speech-driven models to generate behaviors with meaning that are timely coupled with speech. We evaluated a unified model with two types of constraints for hand and head movements: discourse functions and prototypical gestures. We incorporated discourse functions into the speech-driven framework to capture the characteristic behaviors associated with each of the four classes considered in the study (negations, affirmations, questions and suggestions). As demonstrated by the analysis, individuals displayed characteristic patterns for these discourse functions, which our model aims to capture (e.g., generating meaningful gestures when people are asking questions). Likewise, we constrained the models with predefined prototypical gestures for head (shake, nod) and hand (so-what, to-fro, regress) gestures. This model can be used by a rule-based system as a behavior realizer. The proposed approach not only creates the appropriate behavior, but also captures the temporal coupling between speech and the synthesized movement, which is not easily achieved with only rule-based systems.
The proposed framework is built upon the DBN models proposed by Mariooryad and Busso, providing three important contributions to effectively constrain the models on the underlying discourse function or prototypical gesture. First, we introduced a variable that constrains the state configuration between speech and gestures, capturing the cause-effect relation of the gesture production. Second, we proposed a better initialization of the models using vector quantization. The proposed training approach effectively increases the range of the movements generated by the model. Third, we introduced shared and exclusive states for each of the constraints, creating sparse transition probability matrices. Some of the states are shared between constraints, while others are exclusively associated with a constraint. This approach effectively captures the differences in the behaviors across constraints.
The results from the objective and subjective evaluations demonstrated the benefits of the proposed approach. The results of the perceptual evaluation showed significant improvement for the constrained model over the unconstrained baseline model, especially for affirmation. The results for prototypical gestures also revealed the potential of the proposed work. The head gestures synthesized by the constrained model matched the target gesture with more than 80% accuracy. The hand gestures generated by the constrained model showed 85% accuracy for so-what. For to-fro and regress, the accuracies are lower.
The study opens interesting opportunities to increase the role of data-driven models in CAs. For example, the proposed approach can be combined with rules derived from natural recordings to create meaningful and naturalistic gestures. One limitation of the approach is that it requires speech. We are exploring training schemes to extend the models by driving the behaviors using synthetic speech [30, 35]. If we can solve the challenges in using synthetic speech instead of natural speech, we can increase the range of CA applications for data-driven models. With the transcription, we can also infer discourse functions using automatic algorithms. Dialog acts are semantic tags which can be retrieved from the text using supervised classifiers. These tags can then be translated into discourse functions, resulting in an autonomous meaningful behavior generator. Finally, we can address the lower performance for prototypical hand gestures by adding more data, capturing intrinsic variability between people, and by using more powerful frameworks. Advances in deep learning, in particular, offer appealing alternatives for this task [15, 19, 32].
The authors would like to thank Sunghyun Park, Philippa Shoemark, and Louis-Philippe Morency for sharing the OCTAB interface. This work was funded by National Science Foundation grants IIS:1352950 and IIS:1718944.
-  E. Bevacqua, M. Mancini, R. Niewiadomski, and C. Pelachaud. An expressive ECA showing complex emotions. In Proceedings of the Artificial Intelligence and Simulation of Behaviour (AISB 2007) Annual Convention, pages 208–216, Newcastle, UK, April 2007.
-  P. Boersma and D. Weenink. Praat, a system for doing phonetics by computer. Technical Report 132, Institute of Phonetic Sciences of the University of Amsterdam, Amsterdam, Netherlands, 1996. http://www.praat.org.
-  E. Bozkurt, S. Asta, S. Ozkul, Y. Yemez, and E. Erzin. Multimodal analysis of speech prosody and upper body gestures using hidden semi-Markov models. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), pages 3652–3656, Vancouver, BC, Canada, May 2013.
-  M. Brand. Voice puppetry. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques (SIGGRAPH 1999), pages 21–28, New York, NY, USA, 1999.
-  C. Busso, Z. Deng, M. Grimm, U. Neumann, and S. Narayanan. Rigid head motion in expressive speech animation: Analysis and synthesis. IEEE Transactions on Audio, Speech and Language Processing, 15(3):1075–1086, March 2007.
-  C. Busso and S. Narayanan. Interrelation between speech and facial gestures in emotional utterances: a single subject study. IEEE Transactions on Audio, Speech and Language Processing, 15(8):2331–2347, November 2007.
-  Y. Cao, W. Tien, P. Faloutsos, and F. Pighin. Expressive speech-driven facial animation. ACM Transactions on Graphics, 24(4):1283–1302, October 2005.
-  J. Cassell, C. Pelachaud, N. Badler, M. Steedman, B. Achorn, T. Bechet, B. Douville, S. Prevost, and M. Stone. Animated conversation: Rule-based generation of facial expression gesture and spoken intonation for multiple conversational agents. In Computer Graphics (Proc. of ACM SIGGRAPH’94), pages 413–420, Orlando, FL, USA, 1994.
-  J. Cassell, H. Vilhjálmsson, and T. Bickmore. BEAT: the behavior expression animation toolkit. In H. Prendinger and M. Ishizuka, editors, Life-Like Characters: Tools, Affective Functions, and Applications, Cognitive Technologies, pages 163–185. Springer Berlin Heidelberg, New York, NY, USA, November 2003.
-  V. Chattaraman, W.-S. Kwon, J. E. Gilbert, and Y. Li. Virtual shopping agents: Persona effects for older users. Journal of Research in Interactive Marketing, 8(2):144–162, 2014.
-  C.-C. Chiu and S. Marsella. How to train your avatar: A data driven approach to gesture generation. In H. H. Vilhjálmsson, S. Kopp, S. Marsella, and K. Thórisson, editors, Intelligent Virtual Agents, volume 6895 of Lecture Notes in Computer Science, pages 127–140. Springer Berlin Heidelberg, Reykjavik, Iceland, September 2011.
-  C.-C. Chiu, L. Morency, and S. Marsella. Predicting co-verbal gestures: a deep and temporal modeling approach. In W. Brinkman, J. Broekens, and D. Heylen, editors, International Conference on Intelligent Virtual Agents (IVA 2015), volume 9238 of Lecture Notes in Computer Science, pages 152–166. Springer, Cham, Delft, The Netherlands, August 2015.
-  M. E. Foster. Comparing rule-based and data-driven selection of facial displays. In Workshop on Embodied Language Processing, Association for Computational Linguistics, pages 1–8, Prague, Czech Republic, June 2007.
-  H. P. Graf, E. Cosatto, V. Strom, and F. J. Huang. Visual prosody: Facial movements accompanying speech. In Proc. of IEEE International Conference on Automatic Faces and Gesture Recognition, pages 396–401, Washington, D.C., USA, May 2002.
-  K. Haag and H. Shimodaira. Bidirectional LSTM networks employing stacked bottleneck features for expressive speech-driven head motion synthesis. In D. Traum, W. Swartout, P. Khooshabeh, S. Kopp, S. Scherer, and A. Leuski, editors, International Conference on Intelligent Virtual Agents (IVA 2016), volume 10011 of Lecture Notes in Computer Science, pages 198–207. Springer Berlin Heidelberg, Los Angeles, CA, USA, September 2016.
-  M. Kipp. Gesture Generation by Imitation: From Human Behavior to Computer Character Animation. PhD thesis, Universität des Saarlandes, Saarbrücken, Germany, December 2003.
-  S. Kopp, B. Krenn, S. Marsella, A. N. Marshall, C. Pelachaud, H. Pirker, K. R. Thórisson, and H. Vilhjálmsson. Towards a common framework for multimodal generation: The behavior markup language. In International Conference on Intelligent Virtual Agents (IVA 2006), pages 205–217, Marina Del Rey, CA, USA, August 2006.
-  S. Kopp and I. Wachsmuth. Synthesizing multimodal utterances for conversational agents. Computer animation & virtual worlds, 15(1):39–52, March 2004.
-  X. Lan, X. Li, Y. Ning, Z. Wu, H. Meng, J. Jia, and L. Cai. Low level descriptors based DBLSTM bottleneck feature for speech driven talking avatar. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), pages 5550–5554, Shanghai, China, March 2016.
-  B. H. Le, X. Ma, and Z. Deng. Live speech driven head-and-eye motion generators. IEEE Transactions on Visualization and Computer Graphics, 18(11):1902–1914, November 2012.
-  S. Levine, P. Krähenbühl, S. Thrun, and V. Koltun. Gesture controllers. ACM Transactions on Graphics, 29(4):124:1–124:11, July 2010.
-  Y. Linde, A. Buzo, and R. Gray. An algorithm for vector quantizer design. IEEE Transactions on Communications, 28(1):84–95, Jan 1980.
-  S. Mariooryad and C. Busso. Generating human-like behaviors using joint, speech-driven models for conversational agents. IEEE Transactions on Audio, Speech and Language Processing, 20(8):2329–2340, October 2012.
-  S. Marsella, Y. Xu, M. Lhommet, A. Feng, S. Scherer, and A. Shapiro. Virtual character performance from speech. In ACM SIGGRAPH/Eurographics Symposium on Computer Animation (SCA 2013), pages 25–35, Anaheim, CA, USA, July 2013.
-  D. McNeill. Hand and Mind: What gestures reveal about thought. The University of Chicago Press, Chicago, IL, USA, 1992.
-  I. Pandzic and R. Forchheimer. MPEG-4 Facial Animation - The standard, implementations and applications. John Wiley & Sons, November 2002.
-  S. Park, G. Mohammadi, R. Artstein, and L. P. Morency. Crowdsourcing micro-level multimedia annotations: The challenges of evaluation and interface. In ACM Multimedia 2012 workshop on Crowdsourcing for multimedia (CrowdMM), pages 29–34, Nara, Japan, October 2012.
-  I. Poggi, C. Pelachaud, F. de Rosis, V. Carofiglio, and B. de Carolis. Greta. a believable embodied conversational agent. In O. Stock and M. Zancanaro, editors, Multimodal Intelligent Information Presentation, Text, Speech and Language Technology, pages 3–25. Springer Netherlands, Dordrecht, The Netherlands, February 2005.
-  N. Sadoughi and C. Busso. Retrieving target gestures toward speech driven animation with meaningful behaviors. In International conference on Multimodal interaction (ICMI 2015), pages 115–122, Seattle, WA, USA, November 2015.
-  N. Sadoughi and C. Busso. Head motion generation with synthetic speech: a data driven approach. In Interspeech 2016, pages 52–56, San Francisco, CA, USA, September 2016.
-  N. Sadoughi and C. Busso. Head motion generation. In B. Müller, S. Wolf, G.-P. Brueggemann, Z. Deng, A. McIntosh, F. Miller, and W. Scott Selbie, editors, Handbook of Human Motion, pages 1–25. Springer International Publishing, January 2017.
-  N. Sadoughi and C. Busso. Joint learning of speech-driven facial motion with bidirectional long-short term memory. In J. Beskow, C. Peters, G. Castellano, C. O’Sullivan, I. Leite, S. Kopp, M. Mancini, and G. Varni, editors, International Conference on Intelligent Virtual Agents (IVA 2017), Lecture Notes in Computer Science. Springer Berlin Heidelberg, Stockholm, Sweden, August 2017.
-  N. Sadoughi, Y. Liu, and C. Busso. Speech-driven animation constrained by appropriate discourse functions. In International conference on multimodal interaction (ICMI 2014), pages 148–155, Istanbul, Turkey, November 2014.
-  N. Sadoughi, Y. Liu, and C. Busso. MSP-AVATAR corpus: Motion capture recordings to study the role of discourse functions in the design of intelligent virtual agents. In 1st International Workshop on Understanding Human Activities through 3D Sensors (UHA3DS 2015), pages 1–6, Ljubljana, Slovenia, May 2015.
-  N. Sadoughi, Y. Liu, and C. Busso. Meaningful head movements driven by emotional synthetic speech. Speech Communication, to appear, 2017.
-  M. Stone, D. DeCarlo, I. Oh, C. Rodriguez, A. Stere, A. Lees, and C. Bregler. Speaking with hands: Creating animated conversational characters from recordings of human performance. ACM Transactions on Graphics (TOG), 23(3):506–513, August 2004.
-  M. Thiebaux, S. Marsella, A. N. Marshall, and M. Kallmann. Smartbody: Behavior realization for embodied conversational agents. In Proceedings of the 7th international joint conference on Autonomous agents and multiagent systems-Volume 1, volume 1, pages 151–158, Estoril, Portugal, May 2008.
-  R. Voigt, R. J. Podesva, and D. Jurafsky. Speaker movement correlates with prosodic indicators of engagement. In Speech Prosody (SP 2014), pages 70–74, Dublin, Republic of Ireland, May 2014.
-  H. Welbergen, A. Nijholt, D. Reidsma, and J. Zwiers. Presenting in virtual worlds: Towards an architecture for a 3D presenter explaining 2D-presented information. In M. Maybury, O. Stock, and W. Wahlster, editors, Intelligent Technologies for Interactive Entertainment (INTETAIN 2005), volume 3814 of Lecture Notes in Computer Science, pages 203–212. Springer Berlin Heidelberg, Madonna di Campiglio, Italy, November-December 2005.
-  D. Zhou, Q. Liu, J. Platt, and C. Meek. Aggregating ordinal labels from crowds by minimax conditional entropy. In International Conference on Machine Learning (ICML 2014), pages 262–270, Beijing, China, June 2014.
-  F. Zhou, F. De la Torre, and J. K. Hodgins. Aligned cluster analysis for temporal segmentation of human motion. In IEEE International Conference on Automatic Face and Gesture Recognition (FG 2008), Amsterdam, The Netherlands, September 2008.