Prediction of head motion from speech waveforms with a canonical-correlation-constrained autoencoder

02/05/2020 ∙ by JinHong Lu, et al.

This study investigates the direct use of speech waveforms to predict head motion for speech-driven head-motion synthesis, whereas the literature commonly uses spectral features such as MFCCs as basic input features, together with additional features such as energy and F0. We claim that, rather than combining different features that all originate from waveforms, it is more effective to use the waveforms directly to predict the corresponding head motion. The challenge with the waveform-based approach is that waveforms contain a large amount of information irrelevant to predicting head motion, which hinders the training of neural networks. To overcome this problem, we propose a canonical-correlation-constrained autoencoder (CCCAE), in which hidden layers are trained not only to minimise the reconstruction error but also to maximise the canonical correlation with head motion. Compared with an MFCC-based system, the proposed system shows comparable performance in objective evaluation and better performance in subjective evaluation.




1 Introduction

Head motion such as nodding and shaking is an important nonverbal communication channel in human-human communication. In addition to head motion as a nonverbal signal, Hadar et al. [1] have shown that there is another type of head motion that is directly related to speech production. It is essential for animated talking heads to realise both types of natural head motion, as well as lip-sync, to make the avatar more human-like. Compared with lip-sync, the synthesis of head motion from audio speech is more challenging, since the link between speech and head motion is less clear and, beyond speech itself, various factors such as emotion, intention, and stance are involved.

The present study considers the link between head motion and acoustic speech features, whose original representation is the acoustic waveform, and seeks a compact and efficient representation of speech features for predicting the corresponding head motion. Kuratate et al. [2] found that fundamental frequency (F0) and head motion had a correlation of 0.83 at the sentence level. Busso et al. [3] also confirmed a strong sentence-level correlation (r=0.8) between head motion and mel-frequency cepstral coefficients (MFCCs), where the data was recorded from an actor reading scripts of short sentences. As we show in our experiments, the situation is different in natural conversations, where there is a much larger degree of variation in head motion and such strong correlations cannot be found. A similar observation is reported for a dialogue corpus by Sadoughi et al. [4], who found a global CCA of 0.1931 between the original head movements and speech (F0 and energy).

To tackle the problem of a weak link between speech and head motion, other features and their combinations have been explored. Ben-Youssef et al. [5] found that articulatory features (EMA features) estimated from speech were more useful for predicting head motion. Ding et al. [6] examined LPC, MFCC, and filter bank (FBank) features and showed that an FBank-based system outperformed an MFCC-based one. Haag et al. [7] combined MFCC and EMA features to build bottleneck features, which were then fed to a DNN-BLSTM to predict head motion.

The purpose of combining different features in the previous studies was to use richer information (e.g., prosodic features) to train models and predict head motion. Since all the acoustic features described above are derived from raw speech waveforms, it is natural to consider the original waveforms as the input to neural networks, so that the information in the original observations can be fully exploited. So far, no one has investigated the use of raw waveforms to predict head motion. This is mainly because of (1) the high dimensionality of raw waveform signals, which slows down the training of neural networks and demands substantial hardware capacity; and (2) the large amount of information irrelevant to predicting head motion, which hinders the training of neural networks.

To overcome the problems of high dimensionality and irrelevant information, we propose a canonical-correlation-constrained autoencoder (CCCAE) to extract low-dimensional features from raw waveforms, in which hidden layers are trained not only to minimise the error of encoding and decoding, but also to maximise the canonical correlation with head motion. The extracted low-dimensional features are then fed to another neural network for regression to predict head motion. We show that the features obtained with the proposed approach are more useful for head-motion prediction than those obtained with a standard autoencoder. We evaluate the new approach through comparisons with other acoustic features in terms of objective and subjective measures.

Figure 1: Overview of the proposed system comprised of three modules: (A) waveform embedding with CCCAE, (B) DNN-based head motion regression from the embedded features, (C) post filter with an autoencoder.

2 Relation to prior work

While using raw waveforms for acoustic modelling with neural networks is one of the active areas in automatic speech recognition [8, 9, 10, 11], to the best of our knowledge, no one has investigated the use of raw waveforms for speech-driven head-motion synthesis, in which a set of two data streams, speech and head motion, is dealt with rather than a single stream of speech. Chandar et al. [12] and Wang et al. [13] have proposed the framework of correlational neural networks and the deep canonically correlated autoencoder (DCCAE), respectively, to effectively model two data streams, and they applied the models to cross-language tasks and multi-view feature learning, where reasonably high correlations between the two data streams can be expected. The present study is different in that the correlation between speech and head-motion features is much weaker, and our proposed model employs only one autoencoder whereas they employ two.

3 Proposed System

Our proposed system can be separated into three modules: (1) a canonical-correlation-constrained autoencoder (CCCAE) for compressing the high-dimensional waveform input into a low-dimensional distributed embedding; (2) a regression model for predicting the head motion from the compressed embedding; (3) a post-filtering autoencoder for reconstructing smooth head motion. The overall framework of our proposed model is shown in Figure 1.

3.1 Waveform Embedding

Autoencoder frameworks for a pair of data streams have been proposed by Chandar et al. [12] and Wang et al. [13]. DCCAE [13] consists of two autoencoders and optimises a combination of the canonical correlation between the learned "bottleneck" representations and the reconstruction errors of the autoencoders. Since head motion is parameterised as a time series of three-dimensional rotation vectors in the present study, we do not need an autoencoder to reduce its dimensionality further. We thus employ a single autoencoder, in which hidden layers are trained not only to minimise the reconstruction error but also to maximise the canonical correlation with head motion. Thus, instead of projecting the two features to a common subspace, we project raw waveforms to a subspace in which the embedded features are well correlated with head motion.

We train the proposed CCCAE with an objective function that combines two terms. The first is the reconstruction error between the input raw waveform vector at each time instance and its reconstruction, obtained by projecting the input with the encoder and then reconstructing it with the decoder. The second is the canonical correlation, computed over the whole sequences, between the encoder's projection of the waveform vectors and the head motion vectors. A weighting factor balances the two terms; at one extreme of this factor, the objective reduces to that of a standard autoencoder with an MSE loss function.
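One plausible way to write such an objective, consistent with the description above, is sketched here. The symbols ($x_i$ for a waveform vector, $f$ for the encoder, $g$ for the decoder, $X$ and $Y$ for the waveform and head-motion sequences, $\alpha$ for the weighting factor) and the $(1-\alpha)$ / $\alpha$ weighting convention are assumptions made for illustration, not necessarily the exact form used in this work.

```latex
\mathcal{L}(\theta)
  = (1-\alpha)\,\frac{1}{N}\sum_{i=1}^{N}
      \bigl\lVert x_i - g\bigl(f(x_i)\bigr) \bigr\rVert^{2}
  \;-\; \alpha\,\mathrm{CCA}\bigl(f(X),\, Y\bigr)
```

Under this assumed convention, minimising $\mathcal{L}$ lowers the reconstruction error while raising the canonical correlation between the embedding and head motion, and setting $\alpha = 0$ leaves only the MSE term of a standard autoencoder.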

3.2 Head motion regression

A simple feed-forward deep neural network is applied here for the regression from the waveform embedded features to head motion. We do not consider more complex models such as CNN and LSTM, because the present study focuses on a compact and efficient representation of speech features rather than the regression of head motion; and previous studies [6, 7] showed no large differences among the models. Accordingly, we also do not consider auto-regressive models such as WaveNet [14].

As is shown in Figure 1 (B), a context window of frames, which is equivalent to 525ms effective speech content, is employed to predict head motion parameters.

3.3 Post-filter

Since the output trajectories of our neural networks are noisy or discontinuous due to the nature of speech, we applied a post-filter as post-processing to obtain smooth head motion trajectories for animation. Following [15], we built a neural-network-based de-noising autoencoder and trained it with the "clean" data.

4 Experiments and Results

4.1 Dataset

We used the University of Edinburgh Speaker Personality and Mocap Dataset [16]. This dataset contains expressive dialogues of 13 semi-professional actors, all native English speakers, in extroverted and introverted speaking styles; the dialogues are non-scripted and spontaneous. For our experiments, we selected data from one male (Subject A) and one female (Subject B). For each subject, six recordings (around 30 minutes) were used for training, two (around 10 minutes) for validation, and the remaining two (around 10 minutes) for evaluation, ignoring differences in speaking style. We trained separate models for each subject. Note that speaker-dependent training is common practice in speech-driven head motion synthesis [17, 6, 18].

Speech Features. Audio in the database was recorded with a headset microphone at 44.1 kHz with 32-bit depth and a MOTU 8pre mixer [19]. Separate recording channels were used for the two speakers, and a synchronisation signal was recorded on a third channel in the mixer. For the purpose of this work, the audio signal was downsampled to 4 kHz prior to feature extraction. Raw waveform vectors were extracted using 25 ms windows with a 10 ms shift, which resulted in 100 dimensions. A 13-dimensional MFCC feature was formed by combining one energy coefficient and 12 mel-cepstral coefficients, using SPTK. We also added their first- and second-order derivatives, resulting in 39-dimensional MFCC features. Voicing probability and energy were computed using openSMILE and smoothed with a moving average filter with a window length of 10 frames. All the features were normalised in terms of variance for each dimension.
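The waveform framing described above can be sketched as follows: at 4 kHz, a 25 ms window holds 100 samples and a 10 ms shift is 40 samples, which gives the 100-dimensional vectors. This is a minimal NumPy sketch rather than the extraction code used in this work; the function names are hypothetical, and scaling by the standard deviation without mean removal is an assumption about what "normalised in terms of variance" means.

```python
import numpy as np

def frame_waveform(signal, sample_rate=4000, win_ms=25, shift_ms=10):
    """Slice a 1-D signal into overlapping frames, one row per frame."""
    signal = np.asarray(signal, dtype=float)
    win = sample_rate * win_ms // 1000      # 100 samples at 4 kHz
    shift = sample_rate * shift_ms // 1000  # 40 samples at 4 kHz
    if len(signal) < win:
        raise ValueError("signal is shorter than one window")
    n_frames = 1 + (len(signal) - win) // shift
    # Row k holds samples [k * shift, k * shift + win).
    idx = shift * np.arange(n_frames)[:, None] + np.arange(win)[None, :]
    return signal[idx]

def normalise_variance(features, eps=1e-8):
    """Scale each dimension to unit variance (assumed: no mean removal)."""
    return features / (features.std(axis=0, keepdims=True) + eps)
```

For one second of audio at 4 kHz, `frame_waveform` returns an array of shape `(98, 100)`, i.e. roughly 100 frames per second, which matches the 100 Hz motion-capture rate.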

Head Motion Features. Movements of the head as a 3D rigid body were recorded with the NaturalPoint Optitrack [22] motion capture system at a 100 Hz sampling rate. From the marker coordinates, rotation matrices for the head motion were computed using singular value decomposition [23], which were further converted to rotation vectors of three dimensions.
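As an illustration of this step, the sketch below estimates a rigid rotation between a reference marker set and the current marker set with an SVD-based least-squares fit (the Kabsch procedure), then converts it to an axis-angle rotation vector. Whether this matches the exact procedure of [23] is an assumption, and the function names and the choice of reference pose are hypothetical.

```python
import numpy as np

def rotation_from_markers(ref, cur):
    """Least-squares rotation R mapping centred ref markers onto cur markers.

    ref, cur: arrays of shape (n_markers, 3), with n_markers >= 3.
    """
    ref_c = ref - ref.mean(axis=0)
    cur_c = cur - cur.mean(axis=0)
    H = ref_c.T @ cur_c                      # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def rotation_vector(R):
    """Convert a rotation matrix to a rotation vector (axis * angle, radians).

    Numerically unreliable for angles very close to pi; adequate for a sketch.
    """
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]]) / (2.0 * np.sin(theta))
    return theta * axis
```

Applying both functions to every motion-capture frame yields one three-dimensional rotation vector per frame, which is the head-motion parameterisation used in the rest of the paper.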

4.2 Experimental Setups

We conducted preliminary experiments to decide the depth and width of the models, which are shown in Figure 1. We tested different numbers of nodes, 15, 30, and 60, for the embedding layer of CCCAE, and decided to use 30 nodes based on the performance of the autoencoder. In training, we only used the frames where the target speaker for head-motion prediction was speaking, so that the models learnt the relationship between speech and head motion properly. In evaluation, we made use of all the input audio sequences to generate head motion parameters.

The following notation is used in the rest of the experiments.

  •  : Embedded features extracted from the standard autoencoder (i.e., the output of proposed CCCAE with )

  •  : Embedded features extracted from the proposed CCCAE with

  •  : Regression model trained with MFCC feature

  •  : Regression model trained with

  •  : Regression model trained with

 ,  , and  use the same architecture in Figure 1(B) to predict head motion, while each model takes different feature vectors as input.

Training was conducted on a GPU machine and a multi-CPU machine with TensorFlow version 1.12, using mini-batch training with Adam optimisation (learning rate 0.0002) [24]. We also employed layer-wise pre-training [25].

In the evaluation, test data of the same speaker is fed to the trained regression model, and head motion is predicted frame by frame. The frame-level outputs are then grouped into head-motion segments of 50 time frames each, which are fed to the post-filtering autoencoder. The final output for animation is generated with the overlap-add method.
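The segmentation and overlap-add step can be sketched as below. Only the segment length of 50 frames comes from the text; the hop size of 25 frames, the Hann weighting, the normalisation by the summed weights, and the function names are all assumptions made for illustration. The post-filter itself is left as a placeholder.

```python
import numpy as np

def split_segments(x, seg_len=50, hop=25):
    """Cut a (frames, dims) trajectory into overlapping fixed-length segments."""
    starts = list(range(0, len(x) - seg_len + 1, hop))
    return [x[s:s + seg_len] for s in starts], starts

def overlap_add(segments, starts, total_len, seg_len=50):
    """Recombine segments with a weighted overlap-add.

    Each segment is weighted by a strictly positive Hann window and the sum
    is divided by the accumulated weights, so unfiltered segments are
    recovered exactly wherever at least one segment covers a frame.
    """
    dims = segments[0].shape[1]
    out = np.zeros((total_len, dims))
    weight = np.zeros((total_len, 1))
    win = np.hanning(seg_len + 2)[1:-1, None]  # drop the zero end points
    for seg, s in zip(segments, starts):
        out[s:s + seg_len] += win * seg
        weight[s:s + seg_len] += win
    return out / np.maximum(weight, 1e-8)
```

In use, each segment would be passed through the trained post-filtering autoencoder before `overlap_add`; the overlap smooths the joins between independently filtered segments.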

Table 1: Local CCA between speech features and original head motion (columns: feature, subject, and CCA on the training, validation, and test sets; rows include F0+Energy, FBank, and waveform).

Table 2: Comparison of different systems in terms of performance of head motion prediction, where NMSE and local CCA are calculated between predicted head motion and ground truth (columns: system, subject, training, and test).

4.3 Objective Evaluation

To measure the similarity between two sequences of vectors, we employed normalised mean-squared error (NMSE), where MSE is normalised by the variance of ground truth, and local canonical correlation analysis (local CCA) [7]. As opposed to global CCA, which calculates canonical correlations over the whole sequence, local CCA calculates CCA scores for every sub-sequence obtained with a time window and takes the average of the resulting scores. We used local CCA rather than global CCA, because head motion trajectories are not stationary and linear correlations rarely hold over long periods. We used a time window of 300 frames or 3 seconds.
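The two measures can be sketched in NumPy as follows. Several details are assumptions rather than facts from the text: the variance in NMSE is pooled over all dimensions, the windows are non-overlapping, the score of each window is the mean of its canonical correlations, and CCA is computed with a QR-plus-SVD method that requires each window to have full column rank. The function names are hypothetical.

```python
import numpy as np

def nmse(pred, truth):
    """Mean-squared error normalised by the variance of the ground truth."""
    return float(np.mean((pred - truth) ** 2) / np.var(truth))

def canonical_correlations(X, Y):
    """Canonical correlations between two (frames, dims) sequences.

    Uses orthonormal bases from QR; assumes full column rank in X and Y.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def local_cca(X, Y, window=300):
    """Average CCA score over consecutive windows (300 frames = 3 s at 100 Hz)."""
    if len(X) != len(Y) or len(X) < window:
        raise ValueError("sequences must be equal length and span one window")
    scores = []
    for start in range(0, len(X) - window + 1, window):
        r = canonical_correlations(X[start:start + window],
                                   Y[start:start + window])
        scores.append(r.mean())
    return float(np.mean(scores))
```

Under this definition of NMSE, a prediction that stays constant at the mean of the ground truth scores exactly 1.0, so a low NMSE close to 1.0 can simply indicate very little predicted movement.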

In addition to the speech features described before, for comparison purposes, we also used F0+Energy (6 dimensions with delta and delta delta features) [4], FBank (27 dimensions of 26 filter-bank channels and log energy), and waveform (100 dimensions), which is the input to the proposed CCCAE.

4.3.1 CCA between speech features and original head motion

Before evaluating the performance of head-motion prediction from speech, we carried out a basic analysis to find correlations between speech features and head motion in terms of local CCA. Table 1 shows local CCA for each speech feature and for each subject. Note that CCA scores on training and validation sets are not shown for those features in which training is not involved. It can be found that F0+Energy gives the smallest, and MFCC and  achieve the largest CCA scores with head motion. Compared to waveform, we can see a large improvement on the test set (by 33% for Subject A and 69% for Subject B) with  , whereas there is a small improvement with  .

4.3.2 Evaluation of predicted head motion from speech

Based on the result of the basic analysis, we chose three features, MFCC,  , and  for the evaluation of head-motion prediction. Table 2 shows the comparison of different systems, where the quality of predicted head motion was evaluated in terms of NMSE and local CCA with the ground truth (original head motion). We also computed local CCA between the ground truth and randomised sequences of another subject different from Subjects A and B to estimate a chance score for the two original head-motion sequences that are totally different and unsynchronised from each other and supposed to have no correlations. The estimated chance score for Subject A and Subject B respectively is and .

Although  shows the lowest NMSE on the test set, it is just because the predicted head motion had little movement, which resulted in NMSE being close to 1.0. This is also reflected in the local CCA that,  is worse than the chance score for both subjects.  gets performance comparable to  in terms of NMSE.  gets the highest local CCA. Overall, the quality of  and  in the test dataset is higher than the chance score.

CCA captures only one aspect of similarity, i.e., linear correlations between two data streams, and it does not tell us how similar the two streams are in other respects such as dynamic range and smoothness, which we believe are also crucial factors in human perception. We thus calculated the standard deviation (SD) of each head motion trajectory and of its derivative, i.e., velocity; the results are shown in Figure 2.  has the smallest SD in all trajectories and velocities, which confirms that  has very little movement as mentioned above. The ground truth has the largest SD over trajectories.  and  show comparable performance, but are not yet close to the ground truth. In terms of velocity, the ground truth,  , and  are similar to each other. This indicates that they are likely to have the same level of smoothness.
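The SD analysis can be sketched as below. Approximating the derivative by a first-order difference scaled by the 100 Hz frame rate is an assumption, as is the function name.

```python
import numpy as np

def sd_of_trajectory_and_velocity(rot_vecs, frame_rate=100.0):
    """Per-dimension SD of a (frames, 3) trajectory and of its velocity.

    Velocity is approximated by the first difference between consecutive
    frames, scaled to units per second.
    """
    rot_vecs = np.asarray(rot_vecs, dtype=float)
    velocity = np.diff(rot_vecs, axis=0) * frame_rate
    return rot_vecs.std(axis=0), velocity.std(axis=0)
```

A small trajectory SD indicates a narrow dynamic range (little movement), while the velocity SD reflects how quickly the head moves between frames and therefore relates to smoothness.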

Figure 2: Comparison of predicted head motion trajectories with different models in terms of standard deviation of each rotation parameter () and its derivative, where values are averaged over the two subjects.

4.4 Subjective Evaluation

We conducted a perceptual test using the MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA)[26]. MUSHRA requires fewer participants and provides a better resolution in scores than the mean opinion score (MOS) test. We developed five test groups from Subject A, where each test group consisted of 3 randomly selected audio samples in the test set, and animations were created from each sample using 5 models: Ground Truth,  ,  ,  , and Anchor. Each animation lasts seconds long. The purpose of the anchor is to take the scale closer to an absolute scale, making sure that minor artefacts are not rated as having very bad quality. The anchor is created by selecting the original head motion of another speaker with different utterances. This ensures that the anchor head motion has a natural behaviour, but it does not synchronise with the audio. The evaluation is performed such that every participant is assigned one test group and the animations of each test group are shown in a random order. Then, each participant watches each head-motion animation and gives a score, between , for each animation. A group of 20 participants were involved in this evaluation and they were asked to give a score to each animation according to the naturalness of the synthesised head motion.

The result is shown in Figure 3.  scored the lowest among all, including the anchor. A possible reason is that the head motion predicted with  contained relatively little movement, which may have seemed unnatural to the participants. The anchor scored the second lowest, as expected, since participants were able to notice the lack of synchrony between the head motion and the audio. Between the  and  models, participants gave higher scores to  .

Figure 3: Box plot of the MUSHRA scores for Subject A's animations for each model; the horizontal line indicates the median.

5 Conclusions

In this paper, we have proposed an approach that uses CCCAE to create, from raw waveform data, a feature that is highly correlated with head motion. From the objective evaluations, we can conclude that (1) CCCAE enables the creation of a feature ( ) that is more correlated with head motion than  and other popular spectral features such as MFCC and FBank; (2)  achieved the lowest NMSE on the test set, although its local CCA is not the highest; and (3) the analysis based on SD shows that  and  have comparable performance. Points (2) and (3) indicate that  can be used to achieve state-of-the-art results for predicting natural head motion with the advantage of the CCCAE. (4) The MUSHRA test shows that, excluding the ground truth, participants preferred the animation generated by  over the others. Overall, our  shows better performance than  . In the future, since  is highly correlated with head motion, we would like to develop a more advanced architecture for the regression model.