End-to-end Learning for 3D Facial Animation from Raw Waveforms of Speech

10/02/2017 · by Hai X. Pham, et al.

We present a deep learning framework for real-time speech-driven 3D facial animation from just raw waveforms. Our deep neural network directly maps an input sequence of speech audio to a series of micro facial action unit activations and head rotations to drive a 3D blendshape face model. In particular, our deep model is able to learn the latent representations of time-varying contextual information and affective states within the speech. Hence, our model not only activates appropriate facial action units at inference to depict different utterance generating actions, in the form of lip movements, but also, without any assumption, automatically estimates emotional intensity of the speaker and reproduces her ever-changing affective states by adjusting strength of facial unit activations. For example, in a happy speech, the mouth opens wider than normal, while other facial units are relaxed; or in a surprised state, both eyebrows raise higher. Experiments on a diverse audiovisual corpus of different actors across a wide range of emotional states show interesting and promising results of our approach. Being speaker-independent, our generalized model is readily applicable to various tasks in human-machine interaction and animation.




I Introduction

Face synthesis is essential to many applications, such as computer games, animated movies, teleconferencing, and talking agents. Traditional facial capture approaches have achieved tremendous success, reconstructing a high level of realism. Yet active face capture rigs utilizing motion sensors/markers are expensive and time-consuming to use. Alternatively, passive techniques that capture facial transformations from cameras, although less accurate, have achieved very impressive performance.

One problem remains with vision-based facial capture approaches, however: part of the face may be occluded, e.g. when a person is wearing a mixed reality visor, or, in the extreme situation, the entire visual appearance may be unavailable. In such cases, other input modalities, such as audio, may be exploited to infer facial actions. Indeed, research on speech-driven face synthesis has recently regained the community's attention. Recent works [16, 21, 30, 31] employ deep neural networks to model the highly non-linear mapping from the input speech domain, either audio or phonemes, to visual facial features. In particular, the approaches of Karras et al. [16] and Pham et al. [21] also take the reconstruction of facial emotion into account to generate fully transformed 3D facial shapes. The method in [16] explicitly specifies the emotional state as an additional input besides waveforms, whereas [21] implicitly infers affective states from acoustic features and represents emotions via blendshape weights [6].

In this work, we further improve the approach of [21] in several ways, in order to recreate a better 3D talking avatar that can naturally rotate and perform micro facial actions to represent the time-varying contextual information and emotional intensity of speech in real-time. Firstly, we forgo handcrafted, high-level acoustic features such as the chromagram or mel-frequency cepstral coefficients (MFCC), which, as the authors conjectured, may lose information important for identifying some specific emotions, e.g. happiness. Instead, we directly use the Fourier-transformed spectrogram as input to our neural network. Secondly, we employ convolutional neural networks (CNN) to learn meaningful acoustic feature representations, taking advantage of the locality and shift invariance in the frequency domain of the audio signal. Thirdly, we combine these convolutional layers with a recurrent layer in an end-to-end framework, which learns the temporal transitions of facial movements, as well as spontaneous actions and varying emotional states, from speech sequences alone. Experiments on the RAVDESS audiovisual corpus [19] demonstrate promising results of our approach in real-time speech-driven 3D facial animation.

The organization of the paper is as follows. Section II summarizes other studies related to our work. Our approach is explained in detail in Section III. Experiments are described in Section IV, before Section V concludes our work.

II Related Work

"Talking head" is a research topic where an avatar is animated to imitate human talking. Various approaches have been developed to synthesize a face model driven by either speech audio [13, 37, 28] or transcripts [33, 9]. Essentially, every talking head animation technique develops a mapping from input speech to visual features, and can be formulated as either a classification or a regression task. Classification approaches usually identify phonetic units (phonemes) in speech and map them to visual units (visemes) based on specific rules; animation is then generated by morphing these key images. Regression approaches, on the other hand, can directly generate visual parameters and their trajectories from input features. Early research on talking heads used Hidden Markov Models (HMMs) with some success [34, 35], despite certain limitations of the HMM framework, such as oversmoothed trajectories.

In recent years, deep neural networks have been successfully applied to speech synthesis [24, 38] and facial animation [11, 39, 13] with superior performance. This is because deep neural networks (DNN) are able to learn the correlation of high-dimensional input data and, in the case of recurrent neural networks (RNN), long-term relations, as well as the highly non-linear mapping between input and output features. Taylor et al. [31] propose a system using a DNN to estimate active appearance model (AAM) coefficients from input phonemes, which generalizes well to different speeches and languages; the resulting face shapes can be retargeted to drive 3D face models. Suwajanakorn et al. [30] use a long short-term memory (LSTM) RNN to predict 2D lip landmarks from input acoustic features, which are used to synthesize lip movements. Fan et al. [13] use both acoustic and text features to estimate AAM coefficients of the mouth area, which are then grafted onto an actual image to produce a photo-realistic talking head. Karras et al. [16] propose a deep convolutional neural network (CNN) that jointly takes audio autocorrelation coefficients and an emotional state to output an entire 3D face shape.

In terms of the underlying face model, these approaches can be categorized into image-based [5, 9, 12, 34, 37, 13] and model-based [4, 3, 29, 36, 11, 7] approaches. Image-based approaches compose photo-realistic output by concatenating short clips, or by stitching together different regions from a sample database. However, their performance and quality are limited by the number of samples in the database; generalizing to a large corpus of speech would require a tremendous number of image samples to cover all possible facial appearances. In contrast, although lacking in photo-realism, model-based approaches enjoy the flexibility of a deformable model, which is controlled by only a small set of parameters, and more straightforward modeling. Pham et al. [21] propose a mapping from acoustic features to the blending weights of a blendshape model [6]. This face model allows an emotional representation that can be inferred from speech, without explicitly defining the emotion as input or artificially adding emotion to the face model in postprocessing. Our approach also enjoys the flexibility of the blendshape model in 3D face reconstruction from speech.

CNN-based speech modeling. Convolutional neural networks [18] have achieved great success in many vision tasks, e.g. image classification and segmentation. Their efficient filter design allows deeper networks and enables learning features directly from data while remaining robust to noise and small shifts, thus usually outperforming prior modeling techniques. In recent years, CNNs have also been employed in speech recognition tasks that directly model raw waveforms, taking advantage of the locality and translation invariance in the time [32, 20, 15] and frequency domains [10, 2, 1, 25, 26, 27]. In this work, we also employ convolutions in the time-frequency domain, and formulate an end-to-end deep neural network that directly maps input waveforms to blendshape weights.

III Deep End-to-End Learning for 3D Face Synthesis from Speech

III-A Face Representation

Our work makes use of the 3D blendshape face model from the FaceWarehouse database [6], which has been utilized successfully in visual 3D face tracking tasks [23, 22]. An arbitrary fully transformed facial shape S can be composed as:

S = R (B_0 + Σ_i e_i B_i),

where R and the e_i are rotation and expression blending parameters, respectively, B_i are personalized expression blendshape bases of a particular person, and the e_i are constrained within [0, 1]. B_0 is the neutral posed blendshape.

Similar to [21], our deep model also generates θ = (r, e), where the rotation r is represented by the three free parameters of a quaternion, and e is a vector of 46 expression blending weights (matching the 49-unit output layer in Table I: 3 rotation parameters plus 46 weights). We use the 3D face tracker in [23] to extract these parameters from training videos.
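The shape composition above can be sketched in a few lines of NumPy. This is an illustrative version only: the vertex count and random bases are toy placeholders, and recovering the quaternion's real part as w = sqrt(1 - |v|²) from the three free parameters is our assumption, not a detail stated in the text.

```python
import numpy as np

def quat_to_rot(v):
    # 3 free quaternion parameters -> rotation matrix; recovering the real
    # part as w = sqrt(1 - |v|^2) is an assumption for this sketch.
    w = np.sqrt(max(0.0, 1.0 - float(v @ v)))
    x, y, z = v
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def compose_shape(B0, B, e, r):
    # S = R (B0 + sum_i e_i B_i), with each e_i in [0, 1]
    S = B0 + np.tensordot(e, B, axes=1)       # (V, 3)
    return S @ quat_to_rot(r).T

V = 100                                       # toy vertex count
rng = np.random.default_rng(0)
B0 = rng.standard_normal((V, 3))              # neutral blendshape (toy data)
B = 0.1 * rng.standard_normal((46, V, 3))     # expression bases (toy data)
e = rng.uniform(0.0, 1.0, 46)                 # blending weights in [0, 1]
S = compose_shape(B0, B, e, np.zeros(3))      # zero vector -> identity rotation
```

With e = 0 and an identity rotation, the composition reduces to the neutral shape B_0, which is a convenient sanity check for any implementation.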

Fig. 1: A few samples from the RAVDESS database, where a 3D facial blendshape (right) is aligned to the face of the actor (left) in the corresponding frame. Red dots indicate 3D landmarks of the model projected to the image plane.

III-B Framework Architecture

Fig. 2: The proposed end-to-end speech-driven 3D facial animation framework. The input spectrogram is first convolved over the frequency axis (F-convolution), then over the time axis (T-convolution). The detailed network architecture is described in Table I.

Our end-to-end deep neural network is illustrated in Figure 2. The input to our model is the raw time-frequency spectrogram of the audio signal. Specifically, each spectrogram contains 128 frequency power bands across 32 time frames, in the form of a 2D (frequency-time) array suitable for a CNN. We apply convolutions to frequency and time separately, similarly to [16, 27], as this practice has been empirically shown to reduce overfitting; furthermore, using smaller filters requires less computation, which speeds up training and inference. The network architecture is detailed in Table I. Specifically, the input spectrogram is first convolved and pooled along the frequency axis with a downsampling factor of two, until the frequency dimension is reduced to one. Then, convolution and pooling are applied along the time axis. A dense layer is placed on top of the CNN, which feeds into a unidirectional recurrent layer. In this work, we report the model performance where the recurrent layer utilizes either LSTM [14] or gated recurrent unit (GRU) [8] cells. The output of the RNN is passed to another dense layer, whose purpose is to reduce the oversmoothing tendency of the RNN and to allow more spontaneous changes in facial unit activations. Every convolutional layer is followed by a batch normalization layer, except the last one (Conv8 in Table I).
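For reference, a single GRU step follows the standard formulation of Cho et al. / Chung et al. [8]. The sketch below uses toy random weights (the dictionary `P` and the 256-unit sizes drawn from Table I are placeholders, not trained parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, P):
    # One GRU step: update gate z, reset gate r, candidate state h_cand.
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h_prev + P["bz"])
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h_prev + P["br"])
    h_cand = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h_prev) + P["bh"])
    return (1.0 - z) * h_prev + z * h_cand

d_in, d_hid = 256, 256                       # layer sizes from Table I
rng = np.random.default_rng(0)
P = {k: 0.01 * rng.standard_normal((d_hid, d_in)) for k in ("Wz", "Wr", "Wh")}
P.update({k: 0.01 * rng.standard_normal((d_hid, d_hid)) for k in ("Uz", "Ur", "Uh")})
P.update({k: np.zeros(d_hid) for k in ("bz", "br", "bh")})
h = gru_step(rng.standard_normal(d_in), np.zeros(d_hid), P)
```

Compared with the LSTM, the GRU has no separate cell state and one fewer gate, hence fewer parameters, which is relevant to the overfitting discussion in Section IV.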

Name Filter Size Stride Hidden Layer Size Activation
Conv1 (3,1) (2,1) ReLU
Pool1 (2,1) (2,1)
Conv2 (3,1) (2,1) ReLU
Pool2 (2,1) (2,1)
Conv3 (3,1) (2,1) ReLU
Conv4 (3,1) (2,1) ReLU
Conv5 (2,1) (2,1) ReLU
Pool5 (1,2) (1,2)
Conv6 (1,3) (1,2) ReLU
Conv7 (1,3) (1,2) ReLU
Conv8 (1,4) (1,4) ReLU
Dense1 256 tanh
RNN 256
Dense2 256 tanh
Output 49
TABLE I: Configuration of our hidden neural layers.
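The strides in Table I reduce the 128×32 spectrogram to a single time-frequency cell. A quick sanity check of that arithmetic follows, under the assumption of "same"-style padding so that each stride-s layer divides its axis length by s (the paper does not state the padding scheme, so this is a sketch, not the exact computation):

```python
# Stride schedule from Table I; "same" padding assumed (ceil division per layer).
def out_len(n, stride):
    return -(-n // stride)   # ceil(n / stride)

freq, time_steps = 128, 32
# Frequency path: Conv1, Pool1, Conv2, Pool2, Conv3, Conv4, Conv5 (stride 2 each)
for layer in ("Conv1", "Pool1", "Conv2", "Pool2", "Conv3", "Conv4", "Conv5"):
    freq = out_len(freq, 2)                  # 128 -> 64 -> 32 -> ... -> 1
# Time path: Pool5, Conv6, Conv7 (stride 2), then Conv8 (stride 4)
for layer, s in (("Pool5", 2), ("Conv6", 2), ("Conv7", 2), ("Conv8", 4)):
    time_steps = out_len(time_steps, s)      # 32 -> 16 -> 8 -> 4 -> 1
print(freq, time_steps)                      # 1 1
```

Seven halvings collapse the 128 frequency bands, and three halvings plus one stride-4 layer collapse the 32 time frames, leaving a single feature vector for the dense layer.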

III-C Training Details

Audio processing. For each video frame in the corpus, we extract a 96ms audio frame sampled at 44.1kHz, comprising the acoustic data of the current frame and the previous frames. With the intended application being real-time animation, we do not allow any delay to gather future data, i.e. audio samples of subsequent frames, as they are unknown in a live streaming scenario. Instead, temporal transitions are modeled by the recurrent layer. We apply an FFT with a window size of 256 and a hop length of 128 to recover a power spectrogram of 128 frequency bands and 32 time frames.
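A rough sketch of this step is below. Only the window size, hop length, sampling rate and output dimensions come from the text; the Hann window and the choice of dropping the DC bin (an FFT of size 256 yields 129 real bins) are our assumptions to reach exactly 128 bands:

```python
import numpy as np

def power_spectrogram(wave, n_fft=256, hop=128):
    frames = []
    for start in range(0, len(wave) - n_fft + 1, hop):
        windowed = wave[start:start + n_fft] * np.hanning(n_fft)  # window choice is an assumption
        power = np.abs(np.fft.rfft(windowed)) ** 2                # 129 bins for n_fft = 256
        frames.append(power[1:])                                  # drop DC bin -> 128 bands (assumption)
    return np.array(frames).T                                     # (frequency, time)

sr = 44100
wave = np.random.default_rng(0).standard_normal(int(0.096 * sr))  # 96 ms -> 4233 samples
S = power_spectrogram(wave)                                       # shape (128, 32)
```

With a 96ms frame at 44.1kHz, a window of 256 and a hop of 128 indeed yield 32 frames, matching the input dimensions given above.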

Loss function. Our framework maps an input sequence of spectrograms X = (x_1, …, x_T) to an output sequence of shape parameter vectors Θ = (θ_1, …, θ_T), where T is the number of video frames. Thus, at any given time t, the deep model estimates θ_t = (r_t, e_t) from an input spectrogram x_t. Similar to [21], we split the output into two separate layers, r for rotation and e for expression weights. r has a tanh activation to constrain its values within [-1, 1], whereas e uses a sigmoid activation to constrain its values within [0, 1].

We train the model by minimizing the squared error:

E = Σ_{t=1}^{T} ||θ_t − θ_t*||²,

where θ_t* is the expected output, which we extract from training videos. We use the CNTK deep learning toolkit to implement our neural network models. Training hyperparameters are chosen as follows: minibatch size is 300, epoch size is 150,000, and learning rate is 0.0001. The network parameters are learned by the Adam optimizer [17] over 300 epochs.
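A minimal NumPy sketch of the output split and the training objective follows; the 3 + 46 partition of the 49 output units and the placeholder targets are illustrative assumptions consistent with the face representation above, not code from the paper:

```python
import numpy as np

def split_output(a):
    # a: raw pre-activations of the 49-unit output layer (Table I);
    # the 3 + 46 rotation/expression split is our reading of the text.
    r = np.tanh(a[:3])                    # rotation params, constrained to [-1, 1]
    e = 1.0 / (1.0 + np.exp(-a[3:]))      # expression weights, constrained to [0, 1]
    return r, e

def squared_error(r, e, r_true, e_true):
    # per-frame term of the loss E = sum_t ||theta_t - theta_t*||^2
    return np.sum((r - r_true) ** 2) + np.sum((e - e_true) ** 2)

rng = np.random.default_rng(0)
r, e = split_output(rng.standard_normal(49))
loss = squared_error(r, e, np.zeros(3), np.full(46, 0.5))  # toy targets
```

The point of the split is that the two activation ranges match the parameter constraints: quaternion components may be negative, while blending weights may not.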

IV Experiments

Mean Neutral Calm Happy Sad Angry Fearful Disgu. Surpri. Actor21 Actor22 Actor23 Actor24
LSTM [21] 1.039 1.059 1.039 1.082 1.010 1.033 1.016 1.049 1.038 1.007 0.941 1.056 1.139
CNN-static 0.741 0.725 0.715 0.760 0.728 0.746 0.746 0.781 0.746 0.723 0.697 0.719 0.817
CNN+LSTM 1.042 1.077 1.029 1.092 1.029 1.013 1.031 1.035 1.054 1.026 0.946 1.021 1.162
CNN+GRU 1.022 1.034 0.995 1.081 0.999 1.012 1.008 1.023 1.045 0.998 0.952 0.985 1.139
TABLE II: RMSE of 3D landmarks in millimeter, categorized by types of emotions of test sequences, and by actors.
Mean Neutral Calm Happy Sad Angry Fearful Disgu. Surpri. Actor21 Actor22 Actor23 Actor24
LSTM [21] 0.065 0.067 0.066 0.069 0.068 0.058 0.062 0.071 0.063 0.040 0.061 0.070 0.088
CNN-static 0.018 0.016 0.016 0.018 0.019 0.016 0.018 0.022 0.019 0.012 0.016 0.019 0.022
CNN+LSTM 0.074 0.072 0.074 0.083 0.074 0.063 0.073 0.074 0.079 0.052 0.061 0.075 0.106
CNN+GRU 0.067 0.065 0.065 0.075 0.069 0.059 0.065 0.073 0.070 0.041 0.061 0.066 0.100
TABLE III: Mean squared error of expression blending weights, categorized by types of emotions of test sequences, and by actors.
LSTM [21] CNN-static CNN+LSTM CNN+GRU
0.59 0.94 0.54 0.52
TABLE IV: Training errors of the four models after 300 epochs (listed in the same order as in Tables II and III).

IV-A Dataset

We use the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [19] for training and evaluation. The database consists of 24 professional actors speaking and singing with various emotions: neutral, calm, happy, sad, angry, fearful, disgust and surprised. We use the video sequences of the first 20 actors for training, around 250,000 frames in total, which translates to about 2 hours of audio, and evaluate the model on the data of the four remaining actors.

IV-B Experimental Settings

We train our deep neural network in two configurations, CNN+LSTM and CNN+GRU, in which the recurrent layer uses LSTM and GRU cells, respectively. As a baseline model, we drop the recurrent layer from our proposed neural network and denote the result CNN-static. This model cannot model temporal transitions smoothly; it estimates facial parameters on a frame-by-frame basis. We also compare our proposed models with the method described in [21], which uses engineered features as input.

We compare these models on two metrics: the RMSE of 3D landmark errors and the mean squared error of facial parameters, specifically, expression blending weights with respect to the ground truth recovered by the visual tracker [23]. Landmark errors are calculated as the distances from the inner landmarks (shown in Figure 1) of the reconstructed 3D face shape to those of the ground-truth 3D shape. We ignore error metrics for head rotation, since it is difficult to infer head pose correctly from speech alone; we include head pose estimation primarily to generate plausible rigid motions that augment the realism of the talking avatar. Based on our observations, however, these error metrics do not fully reflect the performance of our deep generative model.
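The two metrics can be sketched as follows; the landmark count and the toy data are placeholders, as the exact set of inner landmarks is defined by the tracker [23]:

```python
import numpy as np

def landmark_rmse(P, Q):
    # RMSE over per-landmark Euclidean distances between predicted (P)
    # and ground-truth (Q) 3D landmarks, each of shape (L, 3)
    d = np.linalg.norm(P - Q, axis=1)
    return float(np.sqrt(np.mean(d ** 2)))

def weight_mse(e_pred, e_true):
    # mean squared error over the expression blending weight vectors
    return float(np.mean((e_pred - e_true) ** 2))

rng = np.random.default_rng(0)
Q = rng.standard_normal((40, 3))               # toy inner-landmark count (assumption)
P = Q + 0.001 * rng.standard_normal((40, 3))   # near-perfect toy prediction
err_lmk = landmark_rmse(P, Q)
err_w = weight_mse(rng.uniform(0, 1, 46), rng.uniform(0, 1, 46))
```

Note that the landmark metric is computed in 3D model space (millimeters in Tables II), while the weight metric is unitless, which is why the two tables report values on very different scales.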

IV-C Evaluation

Tables II and III show the aforementioned error metrics of the four models, organized into categories corresponding to different emotions and testers. The proposed model with GRU cells slightly outperforms [21] as well as the proposed model with LSTM cells in terms of landmark errors. On the other hand, CNN+GRU has blendshape coefficient errors similar to [21], whereas CNN+LSTM has the highest errors.

Interestingly, on both metrics, CNN-static outperforms all the models with recurrent layers. As shown in Table II, the RMSE of CNN-static is about 0.7mm, consistently across the different categories and testers, and lower than the errors of the other models. CNN-static also scores lower parameter estimation errors. These results suggest that the baseline actually generalizes well to test data in terms of facial action estimation, although visualization of the 3D face reconstruction shows that the baseline model inherently generates non-smooth sequences of facial actions, as shown in the supplementary video. From these test results, combined with the training errors listed in Table IV, we hypothesize that our proposed models, CNN+LSTM and CNN+GRU, overfit the training data, and that CNN+GRU, being simpler, generalizes somewhat better than CNN+LSTM. The baseline model CNN-static, being the simplest of the four, i.e. having the fewest parameters, generalizes well and achieves the best performance on the test set in terms of both error metrics and emotional facial reconstruction from speech, as demonstrated in Figure 3. These results once again demonstrate the robustness, generalization and adaptability of the CNN architecture, and suggest that using a CNN to model facial actions from raw waveforms is the right direction, although there remain limitations and deficiencies in our current approach that need to be addressed.

Fig. 3: A few reconstruction samples. On the left are the true face appearances. From the second to the last column are the reconstruction results of Pham et al. [21], CNN-static, CNN+LSTM and CNN+GRU, respectively. We use a generic 3D face model animated with the parameters generated by each model. The reconstructions by CNN-static depict the speakers' emotions reasonably well; however, it cannot generate smooth transitions between frames.

V Conclusion and Future Work

This paper introduces a deep learning framework for speech-driven 3D facial animation from raw waveforms. Our proposed deep neural network learns a mapping from the audio signal to the temporally varying context of the speech, as well as to the emotional states of the speaker, represented implicitly by the blending weights of a 3D face model. Experiments demonstrate that our approach can reasonably estimate lip movements together with the emotional intensity of the speaker. However, there are certain limitations in our network architecture that prevent the model from reflecting the emotion in the speech perfectly. In the future, we will improve the generalization of our deep neural network and explore other generative models to increase the facial reconstruction quality.


  • [1] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu. Convolutional neural networks for speech recognition. IEEE Transaction on Audio, Speech, and Language Processing, 22(10), October 2014.
  • [2] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, and G. Penn. Applying convolutional neural networks concepts to hybrid nn-hmm model for speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2012.
  • [3] V. Blanz, C. Basso, T. Poggio, and T. Vetter. Reanimating faces in images and video. In SIGGRAPH, pages 187–194, 1999.
  • [4] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Eurographics, pages 641–650, 2003.
  • [5] C. Bregler, M. Covell, and M. Slaney. Video rewrite: driving visual speech with audio. In SIGGRAPH, pages 353–360, 1997.
  • [6] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. FaceWarehouse: A 3D Facial Expression Database for Visual Computing. IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, March 2014.
  • [7] Y. Cao, W. C. Tien, P. Faloutsos, and F. Pighin. Expressive speech-driven facial animation. ACM Transactions on Graphics, 24(4):1283–1302, 2005.
  • [8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Deep Learning and Representation Learning Workshop, 2014.
  • [9] E. Cosatto, J. Ostermann, H. P. Graf, and J. Schroeter. Lifelike talking faces for interactive services. Proc IEEE, 91(9):1406–1429, 2003.
  • [10] L. Deng, O. Abdel-Hamid, and D. Yu. A deep convolutional neural network using heterogeneous pooling for trading acoustic invariance with phonetic confusion. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
  • [11] C. Ding, L. Xie, and P. Zhu. Head motion synthesis from speech using deep neural network. Multimed Tools Appl, 74:9871–9888, 2015.
  • [12] T. Ezzat, G. Geiger, and T. Poggio. Trainable videorealistic speech animation. In SIGGRAPH, pages 388–397, 2002.
  • [13] B. Fan, L. Xie, S. Yang, L. Wang, and F. K. Soong. A deep bidirectional lstm approach for video-realistic talking head. Multimed Tools Appl, 75:5287–5309, 2016.
  • [14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput, 9(8):1735–1780, 1997.
  • [15] Y. Hoshen, R. J. Weiss, and K. W. Wilson. Speech acoustic modeling from raw multichannel waveforms. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
  • [16] T. Karras, T. Aila, S. Laine, A. Herva, and J. Lehtinen. Audio-driven facial animation by joint end-to-end learning of pose and emotion. In SIGGRAPH, 2017.
  • [17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In 3rd International Conference for Learning Representations, 2015.
  • [18] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time-series. pages 255–258, 1998.
  • [19] S. R. Livingstone, K. Peck, and F. A. Russo. Ravdess: The ryerson audio-visual database of emotional speech and song. In 22nd Annual Meeting of the Canadian Society for Brain, Behaviour and Cognitive Science (CSBBCS), 2012.
  • [20] D. Palaz, R. Collobert, and M. Magimai-Doss. Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks. In Interspeech, 2013.
  • [21] H. X. Pham, S. Cheung, and V. Pavlovic. Speech-driven 3d facial animation with implicit emotional awareness: a deep learning approach. In The 1st DALCOM workshop, CVPR, 2017.
  • [22] H. X. Pham and V. Pavlovic. Robust real-time 3d face tracking from rgbd videos under extreme pose, depth, and expression variations. In 3DV, 2016.
  • [23] H. X. Pham, V. Pavlovic, J. Cai, and T. jen Cham. Robust real-time performance-driven 3d face tracking. In ICPR, 2016.
  • [24] Y. Qian, Y. Fan, and F. K. Soong. On the training aspects of deep neural network (dnn) for parametric tts synthesis. In ICASSP, pages 3829–3833, 2014.
  • [25] T. N. Sainath, B. Kingsbury, G. Saon, H. Soltau, A. rahman Mohamed, G. Dahl, and B. Ramabhadran. Deep convolutional neural networks for large-scale speech tasks. Neural Network, 64:39–48, 2015.
  • [26] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak. Convolutional, long short-term memory, fully connected deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2015.
  • [27] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals. Learning the speech front-end with raw waveforms cldnns. In Interspeech, 2015.
  • [28] S. Sako, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Hmm-based text-to-audio-visual speech synthesis. In ICSLP, pages 25–28, 2000.
  • [29] G. Salvi, J. Beskow, S. Moubayed, and B. Granstrom. Synface: speech-driven facial animation for virtual speech-reading support. EURASIP Journal on Audio, Speech, and Music Processing, 2009.
  • [30] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Schlizerman. Synthesizing obama: learning lip sync from audio. In SIGGRAPH, 2017.
  • [31] S. Taylor, T. Kim, Y. Yue, M. Mahler, J. Krahe, A. G. Rodriguez, J. Hodgins, and I. Matthews. A deep learning approach for generalized speech animation. In SIGGRAPH, 2017.
  • [32] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou, B. Schuller, and S. Zafeiriou. Adieu features? end-to-end speech emotion recognition using a deep convolutional recurrent network. In Interspeech, 2016.
  • [33] A. Wang, M. Emmi, and P. Faloutsos. Assembling an expressive facial animation system. ACM SIGGRAPH Video Game Symposium (Sandbox), pages 21–26, 2007.
  • [34] L. Wang, X. Qian, W. Han, and F. K. Soong. Synthesizing photo-real talking head via trajectory-guided sample selection. In Interspeech, pages 446–449, 2010.
  • [35] L. Wang, X. Qian, F. K. Soong, and Q. Huo. Text driven 3d photo-realistic talking head. In Interspeech, pages 3307–3310, 2011.
  • [36] Z. Wu, S. Zhang, L. Cai, and H. Meng. Real-time synthesis of chinese visual speech and facial expressions using mpeg-4 fap features in a three-dimensional avatar. In Interspeech, pages 1802–1805, 2006.
  • [37] L. Xie and Z. Liu. Realistic mouth-synching for speech-driven talking face using articulatory modeling. IEEE Trans Multimed, 9(23):500–510, 2007.
  • [38] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In ICASSP, pages 7962–7966, 2013.
  • [39] X. Zhang, L. Wang, G. Li, F. Seide, and F. K. Soong. A new language independent, photo realistic talking head driven by voice only. In Interspeech, 2013.