Head motion such as nodding and shaking is an important nonverbal communication channel in human-human communication. In addition to head motion as a nonverbal signal, Hadar et al. [1] have shown another type of head motion that is directly related to speech production. It is essential for animated talking heads to realise both types of natural head motion, as well as lip-sync, to make the avatar more human-like. Compared with lip-sync, the synthesis of head motion from audio speech is more challenging, since the link between speech and head motion is less clear, and not only speech but also various factors such as emotion, intention, and stance are involved.
The present study considers the link between head motion and acoustic speech features, whose original representation is the acoustic waveform signal, and seeks a compact and efficient representation of speech features for predicting the corresponding head motion. Kuratate et al. [2] found that fundamental frequency (F0) and head motion had a correlation of 0.83 at sentence level. Busso et al. [3] also confirmed a strong sentence-level correlation (r = 0.8) between head motion and mel-frequency cepstral coefficients (MFCCs), using data recorded from an actor reading short scripted sentences. As we show in our experiments, natural conversation is a different scenario: head motion varies much more, and such strong correlations are not found. A similar observation is reported for a dialogue corpus by Sadoughi et al. [4], who found a global CCA of 0.1931 between the original head movements and speech (F0 and energy).
In order to tackle the problem of this weak link between speech and head motion, other features and their combinations have been explored. Ben-Youssef et al. [5] found that articulatory (EMA) features estimated from speech were more useful for predicting head motion. Ding et al. [6] examined LPC, MFCC, and filter-bank (FBank) features and showed that the FBank-based system outperformed the MFCC-based one. Haag et al. [7] combined MFCC and EMA features to build bottleneck features, which were then fed to a DNN-BLSTM to predict head motion.
The purpose of combining different features in these previous studies was to exploit richer information (e.g., prosodic features) for training models and predicting head motion. Since all the acoustic features described above are derived from raw speech waveforms, it is natural to consider the original waveforms as the input to neural networks, so as to make full use of the information in the original observations. So far, no one has investigated the use of raw waveforms to predict head motion. This is mainly because of (1) the high dimensionality of raw waveform signals, which slows down the training of neural networks and demands considerable memory and computation; and (2) the large amount of information in the waveform that is irrelevant to head-motion prediction, which hinders the training of neural networks.
To overcome the problems of high dimensionality and irrelevant information, we propose a canonical-correlation-constrained autoencoder (CCCAE) to extract low-dimensional features from raw waveforms, in which the hidden layers are trained not only to minimise the error of encoding and decoding, but also to maximise the canonical correlation with head motion. The extracted low-dimensional features are then fed to another neural network for regression to predict head motion. We show that the features obtained with the proposed approach are more useful for head-motion prediction than those from a standard autoencoder, and we evaluate the new approach through comparisons with other acoustic features in terms of objective and subjective measures.
2 Relation to prior work
While using raw waveforms for acoustic modelling with neural networks is an active area in automatic speech recognition [8, 9, 10, 11], to the best of our knowledge, no one has investigated the use of raw waveforms for speech-driven head-motion synthesis, which deals with a pair of data streams, speech and head motion, rather than a single stream of speech. Chandar et al. [12] and Wang et al. [13] have proposed the frameworks of correlational neural networks and the deep canonically correlated autoencoder (DCCAE), respectively, to model two data streams effectively, and they applied the models to cross-language tasks and multi-view feature learning, where reasonably high correlations between the two streams can be expected. The present study differs in that the correlation between speech and head-motion features is much weaker, and our proposed model employs only one autoencoder whereas theirs employ two.
3 Proposed System
Our proposed system can be separated into three modules: (1) a canonical-correlation-constrained autoencoder (CCCAE) that compresses the high-dimensional waveform input into a low-dimensional distributed embedding; (2) a regression model that predicts head motion from the compressed embedding; and (3) a post-filtering autoencoder that reconstructs smooth head motion. The overall framework of our proposed model is shown in Figure 1.
3.1 Waveform Embedding
The DCCAE [13] consists of two autoencoders and optimises a combination of the canonical correlation between the learned "bottleneck" representations and the reconstruction errors of the autoencoders. Since head motion is parameterised in the present study as a time series of three-dimensional rotation vectors, there is no need for an autoencoder to reduce its dimensionality further. We thus employ a single autoencoder, whose hidden layers are trained not only to minimise the reconstruction error, but also to maximise the canonical correlation with head motion. Thus, instead of projecting the two features into a common subspace, we project raw waveforms into a subspace in which the embedded features are well correlated with head motion.
We train the proposed CCCAE with the following objective function:

$$\mathcal{L} = \sum_{t} \bigl\lVert \boldsymbol{x}_t - D(E(\boldsymbol{x}_t)) \bigr\rVert^2 - \alpha \, \mathrm{CCA}\bigl(E(\boldsymbol{X}), \boldsymbol{H}\bigr),$$

where $\boldsymbol{x}_t$ is the raw waveform input vector to the encoder at time instance $t$, $E(\cdot)$ is the projection with the encoder, $D(\cdot)$ is the reconstruction with the decoder, $\boldsymbol{X}$ and $\boldsymbol{H}$ denote the whole sequences of waveform vectors and head-motion vectors, respectively, and $\mathrm{CCA}(\cdot,\cdot)$ is the canonical correlation function. $\alpha \geq 0$ is the weighting factor, where $\alpha = 0$ corresponds to a standard autoencoder with an MSE loss function.
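For concreteness, the objective can be sketched in NumPy as below. This is a minimal illustration, not the training code: the function names and the small regularisation constant `eps` are our assumptions, and in actual training the CCA term would need a differentiable implementation.

```python
import numpy as np

def cca_score(Z, H, eps=1e-8):
    """Sum of canonical correlations between two sequences.
    Z: (T, d) embedded speech features; H: (T, 3) head-motion vectors."""
    Z = Z - Z.mean(axis=0)
    H = H - H.mean(axis=0)
    T = Z.shape[0]
    Szz = Z.T @ Z / T + eps * np.eye(Z.shape[1])  # regularised covariances
    Shh = H.T @ H / T + eps * np.eye(H.shape[1])
    Szh = Z.T @ H / T
    # Whiten both views; the singular values of the whitened
    # cross-covariance are the canonical correlations.
    wz = np.linalg.inv(np.linalg.cholesky(Szz))
    wh = np.linalg.inv(np.linalg.cholesky(Shh))
    return np.linalg.svd(wz @ Szh @ wh.T, compute_uv=False).sum()

def cccae_objective(X, X_hat, Z, H, alpha=0.5):
    """Frame-wise reconstruction error minus alpha * CCA(E(X), H).
    With alpha = 0 this reduces to a standard autoencoder MSE loss."""
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    return recon - alpha * cca_score(Z, H)
```

Cholesky whitening is used here because it is cheap and numerically stable; any matrix square root of the inverse covariance gives the same singular values.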
3.2 Head motion regression
A simple feed-forward deep neural network is applied here for the regression from the waveform-embedded features to head motion. We do not consider more complex models such as CNNs and LSTMs, because the present study focuses on a compact and efficient representation of speech features rather than on the regression of head motion, and previous studies [6, 7] showed no large differences among such models. For the same reason, we do not consider auto-regressive models such as WaveNet [14].
As is shown in Figure 1 (B), a context window of 51 frames, which is equivalent to 525 ms of effective speech content given the 25 ms window and 10 ms shift, is employed to predict the head motion parameters.
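The context-window stacking can be sketched as follows. This is a minimal NumPy sketch; padding the sequence edges by repeating the first and last frames is our assumption, not a detail stated in the paper.

```python
import numpy as np

def stack_context(frames, left=25, right=25):
    """Stack each frame with `left` past and `right` future frames.
    frames: (T, D) per-frame feature vectors. Edges are handled by
    repeating the first/last frame. Returns (T, (left+right+1)*D).
    With 25 ms windows shifted by 10 ms, 51 frames span
    50 * 10 + 25 = 525 ms of speech."""
    padded = np.concatenate([np.repeat(frames[:1], left, axis=0),
                             frames,
                             np.repeat(frames[-1:], right, axis=0)])
    T = frames.shape[0]
    return np.stack([padded[t:t + left + right + 1].reshape(-1)
                     for t in range(T)])
```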
Since the output trajectories of our neural networks are noisy or discontinuous due to the nature of speech, we applied a post-filter as post-processing to obtain smooth head-motion trajectories for animation. We built a neural-network-based de-noising autoencoder following the architecture of [15], trained on the "clean" data.
4 Experiments and Results
4.1 Data
We used the University of Edinburgh Speaker Personality and Mocap Dataset [16]. This dataset contains expressive dialogues of 13 semi-professional actors, all native English speakers, in extroverted and introverted speaking styles; the dialogues are non-scripted and spontaneous. For our experiments, we selected data from one male (Subject A) and one female (Subject B). For each subject, six recordings (around 30 minutes) were used for training, two (around 10 minutes) for validation, and the remaining two (around 10 minutes) for evaluation, ignoring differences in speaking style. We trained separate models for each subject; note that speaker-dependent training is common practice in speech-driven head motion synthesis [17, 6, 18].
Speech Features Audio in the database was recorded with a headset microphone at 44.1 kHz with 32-bit depth using a MOTU 8pre mixer [19]. Separate recording channels were used for the two speakers, and a synchronisation signal was recorded on a third channel of the mixer. For the purpose of this work, the audio signal was downsampled to 4 kHz prior to feature extraction. Raw waveform vectors were extracted using 25 ms windows with 10 ms shifting, which resulted in 100 dimensions. A 13-dimensional MFCC feature was formed by combining 1 energy coefficient and 12 mel-cepstral coefficients using SPTK [20]; adding their first- and second-order derivatives results in 39-dimensional MFCCs. Voicing probability and energy were computed using openSMILE [21] and smoothed with a moving-average filter with a window length of 10 frames. All features were normalised in terms of variance for each dimension.
Head Motion Features Movements of the head as a 3D rigid body were recorded with the NaturalPoint Optitrack [22] motion-capture system at a 100 Hz sampling rate. From the marker coordinates, rotation matrices for the head motion were computed using singular value decomposition [23], and these were further converted to three-dimensional rotation vectors.
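The rigid-body fit behind this step is the orthogonal Procrustes (Kabsch/Söderkvist-Wedin) solution; a minimal sketch, not the authors' exact pipeline, is shown below. `head_rotation_vector` is a hypothetical helper name.

```python
import numpy as np

def head_rotation_vector(ref, obs):
    """Least-squares rigid rotation mapping reference marker coordinates
    to observed ones, returned as a 3-D rotation vector (axis * angle).
    ref, obs: (N, 3) arrays of N marker positions."""
    A = ref - ref.mean(axis=0)          # translation is removed by centring
    B = obs - obs.mean(axis=0)
    U, _, Vt = np.linalg.svd(B.T @ A)
    # Correct for a possible reflection so that det(R) = +1.
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    # Convert the rotation matrix to a rotation vector.
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(angle, 0.0):
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return angle * axis / (2.0 * np.sin(angle))
```

Because the markers are centred first, the recovered rotation is invariant to head translation, which is why three rotation dimensions suffice as the head-motion parameterisation.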
4.2 Experimental Setups
We conducted preliminary experiments to decide the depth and width of the models, which are shown in Figure 1. We tested different numbers of nodes, 15, 30, and 60, for the embedding layer of CCCAE, and decided to use 30 nodes based on the performance of the autoencoder. In training, we only used the frames where the target speaker for head-motion prediction was speaking, so that the models learnt the relationship between speech and head motion properly. In evaluation, we made use of all the input audio sequences to generate head motion parameters.
The following notation is used in the remaining experiments.
AE embedding: embedded features extracted from the standard autoencoder (i.e., the output of the proposed CCCAE with weighting factor $\alpha = 0$)
CCCAE embedding: embedded features extracted from the proposed CCCAE with $\alpha > 0$
MFCC-based system: regression model trained with MFCC features
AE-based system: regression model trained with the AE embedding
CCCAE-based system: regression model trained with the CCCAE embedding
The three systems share the architecture in Figure 1(B) to predict head motion, while each takes different feature vectors as input.
Training was conducted on a GPU machine and a multi-CPU machine with TensorFlow version 1.12, using mini-batch training with Adam optimisation [24] (learning rate 0.0002). We also employed layer-wise pre-training [25].
In the evaluation, test data of the same speaker is fed to the trained regression model, and head motion is predicted frame by frame. The frame-wise predictions are then grouped into segments of 50 time frames, which are fed to the post-filtering autoencoder. The final output for animation is generated with the overlap-add method.
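The joining step can be sketched as windowed overlap-add. This is a minimal sketch: the Hann cross-fade window is our assumption, as the paper does not specify the window used.

```python
import numpy as np

def overlap_add(segments, hop):
    """Join fixed-length segments into one trajectory by windowed
    overlap-add. segments: (n_seg, seg_len, dim); hop: frame shift
    between consecutive segments (hop < seg_len gives overlap)."""
    n_seg, seg_len, dim = segments.shape
    total = (n_seg - 1) * hop + seg_len
    win = np.hanning(seg_len)[:, None]   # cross-fade window
    out = np.zeros((total, dim))
    norm = np.zeros((total, 1))
    for i, seg in enumerate(segments):
        out[i * hop:i * hop + seg_len] += win * seg
        norm[i * hop:i * hop + seg_len] += win
    # Normalise by the accumulated window so overlapping regions average out.
    return out / np.maximum(norm, 1e-8)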
4.3 Objective Evaluation
To measure the similarity between two sequences of vectors, we employed the normalised mean-squared error (NMSE), in which the MSE is normalised by the variance of the ground truth, and local canonical correlation analysis (local CCA). As opposed to global CCA, which calculates canonical correlations over the whole sequence, local CCA calculates CCA scores for every sub-sequence obtained with a time window and averages the resulting scores. We used local CCA rather than global CCA because head-motion trajectories are not stationary, and linear correlations rarely hold over long periods. We used a time window of 300 frames, i.e., 3 seconds.
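The two measures can be sketched as follows. This is a minimal NumPy illustration; the regularisation constant and the use of the first canonical correlation per window are our assumptions.

```python
import numpy as np

def nmse(y_true, y_pred):
    """MSE normalised by the variance of the ground truth."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def first_cca(X, Y, eps=1e-8):
    """First canonical correlation between two (T, d) sequences."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    T = len(X)
    Sxx = X.T @ X / T + eps * np.eye(X.shape[1])
    Syy = Y.T @ Y / T + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y / T
    wx = np.linalg.inv(np.linalg.cholesky(Sxx))
    wy = np.linalg.inv(np.linalg.cholesky(Syy))
    return np.linalg.svd(wx @ Sxy @ wy.T, compute_uv=False)[0]

def local_cca(X, Y, win=300):
    """Average canonical correlation over consecutive sub-sequences."""
    scores = [first_cca(X[s:s + win], Y[s:s + win])
              for s in range(0, len(X) - win + 1, win)]
    return float(np.mean(scores))
```

A prediction that never moves (constant at the ground-truth mean) gives an NMSE of exactly 1.0, which is why NMSE values near 1.0 have to be interpreted with care in Section 4.3.2.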
In addition to the speech features described before, for comparison purposes we also used F0+Energy (6 dimensions, including delta and delta-delta features), FBank (27 dimensions: 26 filter-bank channels plus log energy), and the raw waveform (100 dimensions), which is the input to the proposed CCCAE.
4.3.1 CCA between speech features and original head motion
Before evaluating the performance of head-motion prediction from speech, we carried out a basic analysis of the correlations between speech features and head motion in terms of local CCA. Table 1 shows the local CCA for each speech feature and each subject. Note that CCA scores on the training and validation sets are not shown for those features whose extraction involves no training. F0+Energy gives the smallest CCA scores with head motion, while MFCC and the CCCAE embedding achieve the largest. Compared to the raw waveform, the CCCAE embedding yields a large improvement on the test set (by 33% for Subject A and 69% for Subject B), whereas the AE embedding yields only a small improvement.
4.3.2 Evaluation of predicted head motion from speech
Based on the results of this basic analysis, we chose three features for the evaluation of head-motion prediction: MFCC, the AE embedding, and the CCCAE embedding. Table 2 compares the different systems, where the quality of the predicted head motion is evaluated in terms of NMSE and local CCA against the ground truth (original head motion). To estimate a chance score, we also computed local CCA between the ground truth and randomised sequences of another subject, different from Subjects A and B, i.e., two original head-motion sequences that are totally different and unsynchronised and are supposed to have no correlation. The estimated chance scores for Subject A and Subject B are and , respectively.
Although the AE-based system shows the lowest NMSE on the test set, this is only because its predicted head motion contains little movement, which makes the NMSE close to 1.0. This is also reflected in the local CCA: the AE-based system scores worse than chance for both subjects. The CCCAE-based system achieves NMSE comparable to the AE-based one, while the MFCC-based system achieves the highest local CCA. Overall, the quality of the MFCC-based and CCCAE-based systems on the test set is higher than the chance score.
CCA captures only one aspect of similarity, i.e., linear correlations between two data streams; it does not tell us how similar the streams are in other respects, such as dynamic range and smoothness, which we believe are also crucial factors in human perception. We therefore calculated the standard deviation (SD) of each head-motion trajectory and of its derivative, i.e., velocity; the results are shown in Figure 2. The AE-based system has the smallest SD for all trajectories and velocities, which confirms that it produces very little movement, as mentioned above. The ground truth has the largest SD over trajectories. The MFCC-based and CCCAE-based systems show comparable SDs, though still not close enough to the ground truth. Compared in terms of velocity, the ground truth, the MFCC-based system, and the CCCAE-based system are similar to each other, which indicates that they are likely to have the same level of smoothness.
4.4 Subjective Evaluation
We conducted a perceptual test using MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) [26]. MUSHRA requires fewer participants and provides better score resolution than a mean opinion score (MOS) test. We created five test groups from Subject A, where each group consisted of 3 randomly selected audio samples from the test set, and animations were created from each sample under 5 conditions: the ground truth, the MFCC-based, AE-based, and CCCAE-based systems, and an anchor. Each animation lasts seconds. The purpose of the anchor is to bring the ratings closer to an absolute scale, making sure that minor artefacts are not rated as having very bad quality. The anchor was created by selecting the original head motion of another speaker with different utterances; the anchor head motion therefore behaves naturally but is not synchronised with the audio. Every participant was assigned one test group, and the animations of that group were shown in random order. Each participant watched each head-motion animation and gave it a score between 0 and 100 according to the naturalness of the synthesised head motion. A group of 20 participants took part in this evaluation.
The result is shown in Figure 3. The AE-based system scored the lowest of all, including the anchor. We suspect this is because its predicted head motion moved relatively little, which may appear contrary to normal human behaviour from the participants' perspective. The anchor scored the second lowest, as expected; participants were able to notice the lack of synchrony between the head motion and the audio. Between the MFCC-based and CCCAE-based models, participants gave higher scores to the CCCAE-based one.
5 Conclusions
In this paper, we have proposed an approach that uses a CCCAE to create, from raw waveform data, a feature that is highly correlated with head motion. From the objective evaluations, we conclude that (1) the CCCAE enables the creation of a feature (the CCCAE embedding) that is more correlated with head motion than the AE embedding and other popular spectral features such as MFCC and FBank; (2) the CCCAE-based system achieved the lowest NMSE on the test dataset, although its local CCA is not the highest; (3) the analysis based on SD shows that the MFCC-based and CCCAE-based systems are comparable. Points (2) and (3) indicate that the CCCAE-based system is capable of achieving state-of-the-art results in predicting natural head motion, with the advantage of the compact CCCAE features. (4) The MUSHRA test shows that, excluding the ground truth, participants preferred the animation generated by the CCCAE-based system over the others. Overall, the CCCAE-based system performs better than the AE-based one. In future work, since the CCCAE embedding is highly correlated with head motion, we would like to develop a more advanced architecture for the regression model.
-  U. Hadar, T.J. Steiner, E.C. Grant, and F.C. Rose, “Head movement correlates of juncture and stress at sentence level,” Language and Speech, vol. 26, no. 2, pp. 117–129, 1983.
-  Takaaki Kuratate, Kevin G. Munhall, Philip E. Rubin, Eric Vatikiotis-Bateson, and Hani Yehia, “Audio-visual synthesis of talking faces from speech production correlates,” in Eurospeech’99, 1999, vol. 3, pp. 1279–1282.
-  Carlos Busso and Shrikanth Narayanan, “Interrelation Between Speech and Facial Gestures in Emotional Utterances: A Single Subject Study,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2331–2347, November 2007.
-  Najmeh Sadoughi and Carlos Busso, “Speech-driven Animation with Meaningful Behaviors,” CoRR, vol. abs/1708.01640, 2017.
-  Atef Ben Youssef, Hiroshi Shimodaira, and David Braude, “Speech driven talking head from estimated articulatory features,” in Proc. ICASSP, 2014, pp. 4573–4577.
-  Chuang Ding, Lei Xie, and Pengcheng Zhu, “Head motion synthesis from speech using deep neural networks,” Multimedia Tools and Applications, vol. 74, no. 22, pp. 9871–9888, November 2015.
-  Kathrin Haag and Hiroshi Shimodaira, “Bidirectional LSTM Networks Employing Stacked Bottleneck Features for Expressive Speech-Driven Head Motion Synthesis,” in Intelligent Virtual Agents, 2016, pp. 198 – 207.
-  Tara N. Sainath, Ron J. Weiss, Andrew W. Senior, Kevin W. Wilson, and Oriol Vinyals, “Learning the speech front-end with raw waveform CLDNNs,” in INTERSPEECH, 2015.
-  Pegah Ghahremani, Vimal Manohar, Daniel Povey, and Sanjeev Khudanpur, “Acoustic Modelling from the Signal Domain Using CNNs,” in INTERSPEECH, 2016, pp. 3434–3438.
-  Zoltán Tüske, Ralf Schlüter, and Hermann Ney, “Acoustic Modeling of Speech Waveform Based on Multi-Resolution, Neural Network Signal Processing,” in ICASSP, 2018, pp. 4859–4863.
-  Patrick von Platen, Chao Zhang, and Philip Woodland, “Multi-Span Acoustic Modelling using Raw Waveform Signals,” ArXiv, vol. abs/1906.11047, 2019.
-  Sarath Chandar, Mitesh M. Khapra, Hugo Larochelle, and Balaraman Ravindran, “Correlational Neural Networks,” CoRR, vol. abs/1504.07225, 2015.
-  Weiran Wang, Raman Arora, Karen Livescu, and Jeff A. Bilmes, “On Deep Multi-View Representation Learning: Objectives and Optimization,” CoRR, vol. abs/1602.01024, 2016.
-  Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” CoRR, vol. abs/1609.03499, 2016.
-  JinHong Lu and Hiroshi Shimodaira, “A neural network based post-filter for speech-driven head motion synthesis,” arXiv e-prints, p. arXiv:1907.10585, July 2019.
-  Kathrin Haag and Hiroshi Shimodaira, “The University of Edinburgh Speaker Personality and MoCap Dataset,” in Facial Analysis and Animation Proceedings, Vienna, 2015.
-  Carlos Busso, Zhigang Deng, Ulrich Neumann, and Shrikanth Narayanan, “Natural head motion synthesis driven by acoustic prosodic features: Virtual Humans and Social agents,” Computer Animation and Virtual Worlds, vol. 16, pp. 283–290, July 2005.
-  Najmeh Sadoughi and Carlos Busso, “Novel Realizations of Speech-Driven Head Movements with Generative Adversarial Networks,” in ICASSP, April 2018, pp. 6169–6173.
-  “MOTU,” http://www.motu.com/.
-  “Speech Signal Processing Toolkit (SPTK),” http://sptk.sourceforge.net.
-  Florian Eyben, Martin Wöllmer, and Björn Schuller, "Opensmile: The Munich Versatile and Fast Open-source Audio Feature Extractor," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
-  “Naturalpoint Optitrack,” http://www.naturalpoint.com/optitrack.
-  Inge Soderkvist and PA Wedin, “Determining the movements of the skeleton using well-configured markers,” Journal of Biomechanics, vol. 26, pp. 1473–1477, January 1994.
-  Diederik P. Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization,” CoRR, vol. abs/1412.6980, 2014.
-  S. Takaki and J. Yamagishi, “A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis,” in ICASSP, March 2016, pp. 5535–5539.
-  “Method for the subjective assessment of intermediate quality level of coding systems,” Recommendation ITU-R BS.1534 https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1534-1-200301-S!!PDF-E.pdf.