Attention-based Residual Speech Portrait Model for Speech to Face Generation

Given a speaker's speech, it is interesting to see if it is possible to generate this speaker's face. One main challenge in this task is to alleviate the natural mismatch between face and speech. To this end, in this paper, we propose a novel Attention-based Residual Speech Portrait Model (AR-SPM) by introducing the idea of the residual into a hybrid encoder-decoder architecture, where face prior features are merged with the output of the speech encoder to form the final face feature. In particular, we establish a tri-item loss function, which is a weighted linear combination of the L2-norm, L1-norm and negative cosine loss, to train our model by comparing the final face feature and the true face feature. Evaluation on the AVSpeech dataset shows that our proposed model accelerates the convergence of training, outperforms the state-of-the-art in terms of the quality of the generated face, and achieves superior recognition accuracy of gender and age compared with the ground truth.




1 Introduction

Recently, audio-visual cross-modal learning has become increasingly popular [PuttingaFacetotheVoice], and one of its most interesting topics is generating the facial appearance of speakers from their speech. Research on “speech to portrait” can have a real impact on daily life, especially in community security. For example, when identifying suspects [Parkhi2015Deep], if only audio samples of the suspects are available and no images or appearance features exist, police can use “speech to portrait” technology to generate face images that resemble the suspects’ real faces.

Regarding the feasibility of generating faces from speech, previous studies have shown that a person’s voice is strongly related to his or her facial structure [Thinkingthevoice, PuttingtheFacetotheVoice]. For example, it was shown in [mermelstein1967determination, teager1990evidence] that the facial bones, the joint structures and the tissues covering them are closely related to the shape and size of the organs that produce sound. Meanwhile, genetic, biological and environmental factors, especially gender [Kotti2008Gender, inproceedings], age [PtacekAge, Ageestimation, Zazo2018Age] and ethnicity, can largely influence one’s voice and face.

Recently, some research works on speech to face generation using deep learning methods have emerged [ProfilingHumansfromtheirVoice, oh2019speech2face], among which Speech2Face [oh2019speech2face] is the state-of-the-art (SOTA). It encodes the speech with a Convolutional Neural Network (CNN) [Krizhevsky2012ImageNet] that learns facial features [merler2019diversity], and then uses this feature to generate a face [ColeSynthesizing]. This method obtains good results, but the CNN structure of the speech encoder is complicated, and conventional CNNs may suffer from the vanishing gradient problem. More importantly, due to the natural mismatch between speech and face [singh2019reconstruction], there is still room to reduce the mismatch and improve the performance.

To overcome the shortcomings of previous works and further alleviate the mismatch between speech and face, in this work we propose a novel Attention-based Residual Speech Portrait Model (AR-SPM) within an encoder-decoder architecture by introducing prior face features, which guide the speech feature to approach the original face feature as closely as possible. Besides, the Convolutional Block Attention Module (CBAM) [woo2018cbam] is incorporated into the encoder to extract the key information of the speech, and it makes the decoder focus only on the face feature, ignoring the background noise. In particular, we propose a tri-item loss function for the encoder, which is a linear combination of the L2-norm, the L1-norm and the negative cosine loss, to take into account errors in both values and directions. The decoder loss function is constructed by replacing the cosine loss with the Multi-Scale Structural Similarity Index (MS-SSIM) [ZhaoLoss]. The qualitative and quantitative evaluations on the AVSpeech dataset [ephrat2018looking] show the superior performance of our proposed method, which outperforms the SOTA results [oh2019speech2face]. An overview of our proposed AR-SPM is shown in Figure 1.

Overall, our contributions can be summarized as follows:

  • We propose a novel AR-SPM based on an encoder-decoder architecture. A residual strategy is introduced by incorporating a prior face feature to make the network capture the most representative face features and improve the learning effect of speech portrait model.

  • We improve the network structure by adding a spatial-channel attention mechanism and constructing a symmetrical face decoder network. Moreover, we innovatively establish a tri-item loss function, which contains L2-norm, L1-norm and negative cosine loss to accelerate the convergence of training.

  • Experimental results on AVSpeech dataset [ephrat2018looking] show that our model achieves the SOTA results and significantly reduces the cos similarity degree from to .

2 Related Work

In the literature, three classical methods are used to construct a mapping between speech and face. The first method uses attribute classifiers to predict a fixed set of attributes from the speaker’s voice, such as gender, age [Zazo2018Age] and nationality. It then automatically searches a local face image database for the face images that best match the prediction results. Lastly, it merges the selected face images using the method in [zhmoginov2016inverting]. However, this speech portrait method is limited by the accuracy of the attribute classifiers, since they need to be trained in a supervised manner, and this restricts the correlation between face and speech to a fixed group of attributes.

The second method realizes a speech portrait model through direct regression with a Deep Neural Network (DNN) [Yu2015Deep]. Specifically, the DNN is used to construct the mapping between speech and a fixed-resolution RGB face image [duarte2019wav2pix, Goodfellow2014Generative, ReedGenerative, XiaAuxiliary]. However, this method is hard to realize because such a model needs the speaker’s original face image as a training label. As a result, the model is easily affected by uncontrollable factors in the face image, such as facial expression, head posture, object occlusion, lighting conditions and background, which greatly degrade the regression. In addition, the model must learn autonomously and parse out many complex nonlinear transformations [oh2019speech2face].

The third method improves on the second by splitting the network into two parts. The encoder independently learns to extract face-related information from the speech, and the decoder independently learns to restore a standard-resolution face image from the face feature, as in [oh2019speech2face]. Such a model is trained in a self-supervised way, using face and speech pairs extracted from videos. The network can be described as:

$$f = \mathrm{FD}(\mathrm{SE}(s))$$

where $s$ is the speech and $f$ is the face image. SE denotes the speech encoder network and FD the face decoder network. Although this method [oh2019speech2face] is creative and remarkable, it still needs further improvement. Due to the natural mismatch between speech and face [singh2019reconstruction], there is still room to reduce the mismatch and improve the performance.

To this end, in this work, we propose a novel AR-SPM to further optimize the SE structure and propose to incorporate a prior face feature to complement the speech feature.

3 Method

In the following, we describe the details of our proposed AR-SPM (see Figure 1), which includes the prior face features, the CBAM and two new loss functions as well as the structure for SE and FD network, respectively.

Figure 1: The pipeline of our AR-SPM with the neutral prior face.

3.1 Prior face feature

By introducing the prior face feature, we exploit the idea of the residual to remove the main part shared among faces (i.e., the prior face feature), thereby highlighting the small changes depicted by the speech feature. Our AR-SPM converts the speech to a face image by a network $G$, which is defined as follows:

$$\hat{f} = G(s, p) = \mathrm{FD}\big(\mathrm{SE}(s) \oplus p\big)$$

where $\hat{f}$ is the generated face, $s$ is the spectrogram of the input speech, $p$ is the prior face feature, which is calculated before the training stage, and $\oplus$ denotes the merging of the two features. Finally, the merged face feature is upsampled into an RGB face image by FD. Two ways of obtaining the final face feature are investigated. The first is the sum of the prior face feature and the speech feature output by SE. The second feeds this sum to a fully connected layer and takes its output as the final face feature. Recall that the goal of SE is to learn speech features that mimic facial features. Motivated by [he2016deep], SE is designed to learn the residual, which is the difference between the final face feature and the face prior, instead of learning the face feature from speech directly.
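The two fusion variants described above can be illustrated with a small NumPy sketch. It is only a toy illustration under stated assumptions: the feature dimension, the function names `fuse_sum` and `fuse_fc`, and the identity weights of the fully connected layer are hypothetical and are not taken from the paper.

```python
import numpy as np

def fuse_sum(speech_feat, prior_feat):
    """Variant 1: final face feature = prior + residual predicted by SE."""
    return prior_feat + speech_feat

def fuse_fc(speech_feat, prior_feat, weight, bias):
    """Variant 2: pass the sum through a fully connected layer."""
    return (prior_feat + speech_feat) @ weight.T + bias

rng = np.random.default_rng(0)
dim = 8  # toy size; the real feature dimension is not assumed here

prior = rng.normal(size=dim)                    # prior face feature
true_face = prior + 0.1 * rng.normal(size=dim)  # a face close to the prior

# If SE predicts exactly the residual, variant 1 recovers the true feature.
residual_target = true_face - prior
final_sum = fuse_sum(residual_target, prior)

# Hypothetical identity-initialised FC layer, so variant 2 starts out equal to variant 1.
W = np.eye(dim)
b = np.zeros(dim)
final_fc = fuse_fc(residual_target, prior, W, b)
```

The point of the residual formulation is visible in `residual_target`: SE only has to model the small deviation of a speaker's face feature from the prior rather than the full feature.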

A well-defined prior face feature approximates the speaker’s face feature in many dimensions (e.g., eye contour and lip contour) and contains the main part shared among faces. We therefore add the prior face feature to the output of the SE to reduce the training difficulty and learn more representative facial features.

Two face priors are investigated in this work. The first one is the neutral face prior, which is the arithmetic mean of the face features of a large gender-balanced face image dataset:

$$p = \frac{1}{N} \sum_{i=1}^{N} \mathrm{VGGFace}(x_i)$$

where $x_i$ denotes the $i$-th face image and $N$ is the number of face images. VGGFace denotes the CNN structure in [Parkhi2015Deep], which is exploited to extract face features from face images, taking $N$ = 10, 50, 100, 500, 1000, 5000 and 10000. We experimentally find that as $N$ gradually increases, the neutral prior feature tends to converge, indicating that the calculated prior face feature becomes more representative, as shown in Figure 2 and Table 1. In this work, we finally take $N$ = 10000.

Neutral Male Female
27.861 43.908 36.924
12.155 24.302 16.318
10.179 14.937 15.256
3.815 5.059 5.797
3.176 4.254 4.505
1.179 1.967 1.913
Table 1: Distance between VGGFace face features computed with different numbers of sample face images $N$, for the neutral, male, and female cases.
Figure 2: The face images restored from the neutral prior face feature by the Face Decoder Network when $N$ = 10, 50, 100, 500, 1000, 5000, 10000.
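The convergence behaviour of the mean feature can be reproduced on synthetic data. The sketch below is a hedged illustration, not the authors' procedure: random vectors stand in for VGGFace features, the distance is assumed to be the Euclidean norm, and comparing priors at consecutive sample sizes is an assumption, since the table does not state which pairs are compared.

```python
import numpy as np

def prior_feature(features, n):
    """Mean of the first n face feature vectors (rows of `features`)."""
    return features[:n].mean(axis=0)

rng = np.random.default_rng(0)
# Synthetic stand-in for face features: a shared component plus per-face variation.
shared = rng.normal(size=64)
features = shared + rng.normal(size=(10000, 64))

sizes = [10, 50, 100, 500, 1000, 5000, 10000]
priors = [prior_feature(features, n) for n in sizes]

# Distance between priors computed with consecutive sample sizes (assumed pairing).
gaps = [float(np.linalg.norm(b - a)) for a, b in zip(priors, priors[1:])]
for (n_prev, n_next), gap in zip(zip(sizes, sizes[1:]), gaps):
    print(f"N={n_prev:>5} -> N={n_next:>5}: distance {gap:.3f}")
```

On this synthetic data the distances shrink as the sample size grows, which is the same qualitative trend the table reports.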

The second prior face feature is a gender-dependent prior, which assigns separate prior face features to male and female speakers. To achieve this, a robust classifier network is first needed to predict the gender from the speech. We establish a lightweight CNN, which contains five convolution layers, three max pooling layers and two fully connected layers, to predict the gender of speakers. It is trained and tested on the VGGFace dataset, and the gender accuracy reaches 97.01% on the test set.

Based on the classification results, we calculate the male and female prior face features using the same method as the neutral prior face feature. The gender prior features tend to converge as well when the number of face images increases (see Table 1).

It should be noted that these two prior face features are unbiased with respect to age, because the AVSpeech dataset contains speakers aged from 20 to 50 years old. Besides, in this work, we do not take into account some potential attributes such as skin color and face shape, which do not affect the speaker’s portrait.

3.2 Convolutional Block Attention Module

To give our proposed AR-SPM the ability to focus on important features and ignore irrelevant information, we utilize CBAM [woo2018cbam], a lightweight general-purpose module that can be easily integrated into an end-to-end CNN architecture.

CBAM contains a channel attention module and a spatial attention module, which are embedded into both the encoder and the decoder. In the channel attention module, each channel of a feature map acts as a specialized detector, so this module focuses on what the important features are, while the spatial attention module determines where the useful features are. The channel attention and spatial attention can be combined in a parallel or sequential manner. We experimentally find that applying channel attention first and then spatial attention achieves the best result. We also tried the channel-focused attention mechanism of [HuSqueeze], and CBAM performs better.

As shown in Figure 1, we apply CBAM in both the SE and the FD, so that the SE obtains a better feature representation and the FD focuses on the feature information of the face instead of the background noise.
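The sequential "channel attention first, then spatial attention" arrangement can be sketched in NumPy as follows. This is a simplified illustration of the CBAM idea from [woo2018cbam] under stated assumptions: biases are omitted, the weights are random, and the tensor sizes, reduction ratio and 7×7 kernel are illustrative choices rather than the settings used in this work.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W). A shared two-layer MLP is applied to the average-pooled
    and max-pooled channel descriptors; the results are summed and squashed."""
    avg = x.mean(axis=(1, 2))
    mx = x.max(axis=(1, 2))
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)
    return sigmoid(mlp(avg) + mlp(mx))  # shape (C,)

def conv2d_same(x, kernel):
    """x: (C_in, H, W), kernel: (C_in, k, k) -> (H, W); zero padding, stride 1."""
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * kernel)
    return out

def spatial_attention(x, kernel):
    """Pool across channels (average and max), then convolve to a single map."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])  # (2, H, W)
    return sigmoid(conv2d_same(pooled, kernel))         # (H, W)

def cbam(x, w1, w2, kernel):
    """Sequential arrangement: channel attention first, then spatial attention."""
    x = x * channel_attention(x, w1, w2)[:, None, None]
    return x * spatial_attention(x, kernel)[None, :, :]

rng = np.random.default_rng(0)
C, H, W, r = 16, 6, 5, 4  # toy sizes; r is the MLP reduction ratio
feat = rng.normal(size=(C, H, W))
w1 = rng.normal(size=(C // r, C)) * 0.1
w2 = rng.normal(size=(C, C // r)) * 0.1
kernel = rng.normal(size=(2, 7, 7)) * 0.1
refined = cbam(feat, w1, w2, kernel)
```

Both attention maps lie in (0, 1), so the module can only rescale activations; it leaves the shape of the feature map unchanged, which is why it can be dropped into an existing CNN.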

Layer        Kernel  Stride  Padding  Output Padding  Out Channels
Fc1          1       -       -        -               1000
Fc2          1       -       -        -               25088
ConvTrans1   5       2       2        1               512
ConvTrans2   5       1       2        0               512
ConvTrans3   5       1       2        0               512
ConvTrans4   5       1       2        0               512
ConvTrans5   5       1       2        0               512
ConvTrans6   5       2       2        1               256
ConvTrans7   5       1       2        0               256
ConvTrans8   5       1       2        0               256
ConvTrans9   5       2       2        1               64
ConvTrans10  5       1       2        0               64
ConvTrans11  5       2       2        1               32
Conv1        1       1       0        -               3
Table 2: The structure of the FD network.

3.3 Encoder-decoder structure

In this work, we redesign the SE network as shown in Table 3, where the Fc2 layer is a fusion layer for the speech feature and the prior face feature. The CBAM [woo2018cbam] is added to the CNN structure. We have also tried general deep learning structures such as ResNets [he2016deep] and DenseNets [huang2017densely] as the SE network, but our experiments show that these networks are not effective enough at extracting face information in the speech portrait model. We hypothesize that, because ResNets and DenseNets are skip-connected networks, they may pass through noise or silent fragments, so that more training steps are needed to extract effective speech feature information.

As for the FD network, we exploit the structure of transposed convolutions in [cole2017synthesizing] and design a network that is symmetrical to the VGGFace model. The structure of the FD network is shown in Table 2. To reduce the number of parameters, instead of using a single fully connected layer, we first use two fully connected layers (Fc1 and Fc2) to resize the input feature into a feature map of the same size as the last feature map of the VGGFace model. Then we use our designed transposed convolutions to upsample this feature map. Lastly, a convolution converts the upsampled feature map to the output image size.
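The spatial sizes implied by Table 2 can be traced with the standard output-size formula for a transposed convolution (dilation 1). The sketch below is a hedged check rather than the authors' code: it assumes that the 25088 outputs of Fc2 are reshaped to 512 × 7 × 7, and the helper names are hypothetical.

```python
def conv_transpose_out(size, kernel, stride, padding, output_padding):
    """Output length of a transposed convolution along one spatial axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# (kernel, stride, padding, output_padding) for ConvTrans1..ConvTrans11 in Table 2.
LAYERS = [
    (5, 2, 2, 1), (5, 1, 2, 0), (5, 1, 2, 0), (5, 1, 2, 0), (5, 1, 2, 0),
    (5, 2, 2, 1), (5, 1, 2, 0), (5, 1, 2, 0),
    (5, 2, 2, 1), (5, 1, 2, 0),
    (5, 2, 2, 1),
]

def trace_sizes(start):
    """Return the spatial size after the input and after every transposed convolution."""
    sizes = [start]
    for kernel, stride, padding, output_padding in LAYERS:
        sizes.append(conv_transpose_out(sizes[-1], kernel, stride, padding, output_padding))
    return sizes

# Assumption: Fc2's 25088 outputs are reshaped to 512 x 7 x 7 before ConvTrans1.
sizes = trace_sizes(7)
print(sizes)  # [7, 14, 14, 14, 14, 14, 28, 28, 28, 56, 56, 112]
```

Under this assumption, each stride-2 layer doubles the resolution and each stride-1 layer preserves it, so the listed layers would yield a 112 × 112 map before the final three-channel convolution.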

Figure 3: Overview of FD networks.

3.4 Proposed loss functions

In the following, we introduce our proposed loss functions of FD and SE, respectively.

3.4.1 Face Decoder loss function

Concerning the loss function of the FD network, we modify the joint loss function in [cole2017synthesizing]. As shown in Figure 3, our FD loss function is composed of an image loss, which is the error between the generated image and the original image, and a face feature loss, which is the error between the face features of the generated image and the original image. Such a joint loss function penalizes not only the pixel error between the images directly, but also the difference in abstract identity. Specifically, we adopt the mix of the MS-SSIM loss and the L1 loss [zhao2015loss] as the image loss:

$$L_{img} = \alpha \, L_{MS\text{-}SSIM} + (1 - \alpha) \, G_{\sigma} \cdot L_{1}$$

where $\alpha$ is the mixing weight and $G_{\sigma}$ denotes the Gaussian coefficients used in MS-SSIM. MS-SSIM considers the influence of resolution, so it retains high-frequency information (e.g., image edges or details) but tends to cause brightness changes and color deviations. To tackle this problem, we exploit the L1 loss to keep brightness and color unchanged. Combining the two produces a good penalty for the image generation task.

In order to pay more attention to face contour in the image rather than the value of the pixel itself, we introduce the cosine similarity loss, which is generally used to measure the difference of two embeddings in the feature space:


Finally, the total loss function of the FD is:


3.4.2 Speech Encoder loss function

A tri-item loss function is proposed for the SE network. The first item is the L2 loss between the normalized output speech feature of the SE network and the normalized real face feature of the speaker obtained by VGGFace. The second item is the L1 loss between the outputs of the FD network’s first layer, which penalizes the difference of hidden layer features. The third item is the cosine similarity loss between the outputs of the VGGFace third layer, which penalizes the difference of identity features. We have also tried the knowledge distillation loss in [oh2019speech2face], and it shows inferior performance to the cosine similarity loss. Finally, the tri-item loss function is:

$$L_{SE} = \lambda_1 L_{2} + \lambda_2 L_{1} + \lambda_3 L_{cos}$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are balancing coefficients that keep the gradients of the loss items within a similar scale.
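A weighted combination of an L2 term, an L1 term and a negative cosine term can be sketched as below. This is a minimal NumPy illustration under stated assumptions: the unit weights, the squared form of the L2 term, the summed form of the L1 term and all function names are hypothetical, and the features are passed in directly instead of being produced by the FD and VGGFace networks.

```python
import numpy as np

def unit(v, eps=1e-12):
    """Scale a vector to (approximately) unit length."""
    return v / (np.linalg.norm(v) + eps)

def l2_term(speech_feat, face_feat):
    """Squared L2 distance between the unit-normalised features."""
    d = unit(speech_feat) - unit(face_feat)
    return float(np.sum(d ** 2))

def l1_term(hidden_pred, hidden_true):
    """L1 distance between hidden-layer activations."""
    return float(np.sum(np.abs(hidden_pred - hidden_true)))

def neg_cosine_term(id_pred, id_true):
    """Negative cosine similarity: -1 when the two directions agree."""
    return -float(unit(id_pred) @ unit(id_true))

def tri_item_loss(speech_feat, face_feat, hidden_pred, hidden_true,
                  id_pred, id_true, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights here are placeholders."""
    w2, w1, wc = weights
    return (w2 * l2_term(speech_feat, face_feat)
            + w1 * l1_term(hidden_pred, hidden_true)
            + wc * neg_cosine_term(id_pred, id_true))

rng = np.random.default_rng(0)
face = rng.normal(size=32)
speech = face + 0.5 * rng.normal(size=32)

# For illustration, the same vectors stand in for all three feature pairs.
loss_perfect = tri_item_loss(face, face, face, face, face, face)
loss_noisy = tri_item_loss(speech, face, speech, face, speech, face)
```

With identical inputs the first two terms vanish and the cosine term reaches its minimum, so any mismatch in value or direction raises the loss.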

Since FD takes part in the calculation of loss function in SE, FD will be trained before SE, and the parameters of FD will be frozen during the training of SE.

4 Experiment

4.1 Datasets and Implementation Details

The AVSpeech dataset [ephrat2018looking], a large-scale dataset of speaking videos from YouTube, is used to evaluate our model. Besides, we use the VGGFace dataset to train our gender prediction network to classify the gender, and 0.15 million videos are used as the training set and 8 thousand videos as the test set. First, we crop the face images from the AVSpeech training set to a fixed size to train the FD network, whose structure and implementation details are shown in Table 2. Then, we clip the audio from each AVSpeech video to 6 s, concatenating the audio if it is shorter than 6 s and cutting it if it is longer. The audio waveform is resampled at 16 kHz and only a single channel is used. Spectrograms are computed as in the speaker-independent audio-visual model of [ephrat2018looking].
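The length normalisation of the audio can be sketched as follows. This is an illustrative NumPy version under an assumption: "concatenating" is read here as repeating the clip with itself until 6 s is reached, which the text does not spell out. The function name `fix_length` is hypothetical.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz, as stated in the text
CLIP_SECONDS = 6       # target duration in seconds

def fix_length(waveform, target_len=SAMPLE_RATE * CLIP_SECONDS):
    """Repeat a short mono waveform, or truncate a long one, to exactly target_len samples."""
    if waveform.ndim != 1 or waveform.size == 0:
        raise ValueError("expected a non-empty mono waveform")
    if len(waveform) >= target_len:
        return waveform[:target_len]
    repeats = -(-target_len // len(waveform))  # ceiling division
    return np.tile(waveform, repeats)[:target_len]

short_clip = np.arange(40_000, dtype=np.float32)   # 2.5 s of dummy samples
long_clip = np.arange(200_000, dtype=np.float32)   # 12.5 s of dummy samples
short_fixed = fix_length(short_clip)
long_fixed = fix_length(long_clip)
```

Both outputs contain 96000 samples, so every spectrogram computed from them has the same width.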

We use the spectrogram and face-image pairs to train the SE network, whose structure and implementation details are shown in Table 3. Our model is implemented in PyTorch 1.1.0 and optimized by Adam [kingma2014adam] with a learning rate of 0.001, which decays exponentially at a rate of 0.9 every 2 epochs. We train our model for 50 epochs with a batch size of 16.
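The learning rate schedule can be written out explicitly. The sketch below assumes a staircase decay, in which the rate is multiplied by 0.9 once every 2 epochs; this is the behaviour of PyTorch's `StepLR` scheduler with `step_size=2` and `gamma=0.9`, but whether the authors used exactly that scheduler is not stated.

```python
def learning_rate(epoch, base_lr=1e-3, gamma=0.9, step=2):
    """Staircase exponential decay: multiply by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Learning rate for each of the 50 training epochs (epochs are 0-indexed here).
schedule = [learning_rate(epoch) for epoch in range(50)]
for epoch in (0, 1, 2, 10, 49):
    print(f"epoch {epoch:>2}: lr = {schedule[epoch]:.6f}")
```

Under this reading the rate is decayed 24 times over the 50 epochs, ending at roughly 8% of its initial value.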

4.2 Evaluation Metrics

The face image generated by a speech portrait model may be affected by variations in face pose, expression, background or noncritical details. Therefore, pixel-level evaluation indices are not applicable. In this work, we adopt a subset of the evaluation metrics commonly employed in SOTA work.

More precisely, we do not directly compare the generated face images with the original face images, but compare the face features extracted from them, because face features are robust to expression, head posture, lighting conditions and background [cole2017synthesizing]. We also compare the speech features from the SE network with the corresponding face features from the ground-truth face image, using the L1 loss, the L2 loss and the cosine similarity. When the L1 and L2 losses tend to 0, the values of the two features are close; when the cosine similarity tends to 1, the two features point in a similar direction. We measure the similarity between the features of the face image generated by the AR-SPM and the corresponding face features from the ground-truth images using the same three metrics. Furthermore, we compare the original face features with the features of the face generated from the original face features. For convenience, we call this Face-to-Face in this work; it can be regarded as a benchmark for comparison, since the speech portrait model relies on the face feature from the original face image for guidance.
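The three feature-level metrics can be computed as in the sketch below. It is a hedged illustration: summing absolute differences for L1, using the Euclidean norm for L2, and reporting the cosine similarity as an angle in degrees are assumptions about the exact conventions, and `feature_metrics` is a hypothetical helper name.

```python
import numpy as np

def feature_metrics(a, b):
    """L1 distance, L2 distance and angle (in degrees) between two feature vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    l1 = float(np.sum(np.abs(a - b)))
    l2 = float(np.linalg.norm(a - b))
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Clip guards against rounding slightly outside [-1, 1] before arccos.
    angle = float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return {"l1": l1, "l2": l2, "cos": cos, "angle_deg": angle}

same = feature_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = feature_metrics([1.0, 0.0], [0.0, 1.0])
print(orthogonal)
```

Smaller L1 and L2 values and a smaller angle all indicate that the two features are closer, which matches the reading of lower values as better in the result tables.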

Layer Kernel Stride Padding Out Channels
Conv1 7 2 1 64
MaxPool1 3 2 0 -
Conv2 5 2 1 128
MaxPool2 3 2 0 -
Conv3 3 1 1 256
Conv4 3 1 1 512
Conv5 3 1 1 512
MaxPool3 0 -
CBAM - - - -
Fc1 1 - - 4096
AvgPool1 - - -
Fc2 1 - - 4096
Table 3: The structure of SE network.

4.3 Quantitative and Qualitative Analysis

In the following, we show the quantitative and qualitative results of the proposed FD network and the AR-SPM.

4.3.1 The Performance of Face Decoder Network

We train the FD network on face images from the AVSpeech dataset and show qualitative results of face reconstruction in Figure 4. The faces generated by the FD are similar to the ground truth. Besides, since the CBAM attention module is used in our model, the network concentrates on extracting face features and ignores the background, which results in a blurred background.

The quantitative results of FD network are shown in the second row of Table 4, showing the face feature similarity of face image generated by FD network using the original face image.

Figure 4: The qualitative results of FD network on the AVSpeech dataset. The first row shows the original face image, and the second row shows the image generated by FD.

4.3.2 The Performance of AR-SPM

To verify the performance of AR-SPM, we train the SE network on spectrogram and face-image pairs from AVSpeech dataset. Let and denote the model with neutral prior face feature and the model with gender prior face feature (given by the proposed automatic gender classifier), trained with data containing both male and female, respectively. and mean the models using the male and female face prior, and training with only male and only female data, respectively. Besides, we denote prior feature models, which use a layer to fuse the prior face feature and output speech feature as and , respectively.

Figure 5: The decline of the loss function, where the blue line denotes the loss of , and the red line denotes the loss of .
Face-to-Face - - - 97.158 1.890 14.534
162.957 3.176 29.191 179.594 3.499 30.231
126.524 2.470 15.204 144.027 2.809 25.842
143.335 2.795 27.748 148.610 2.897 25.973
155.008 3.023 28.115 171.860 3.350 29.308
145.192 2.831 27.994 150.900 2.941 26.104
128.641 2.501 19.438 156.399 3.048 21.721
130.861 2.542 19.610 152.985 2.981 21.409
Table 4: Quantitative results of the ablation study. The first column lists the different methods. The second, third and fourth columns give the L1, L2 and cosine distances between the final face features generated by the SE network and the face features of the original face image. The fifth, sixth and seventh columns give the L1, L2 and cosine distances between the face features extracted from the face image generated by our model and the face features of the original image.

An ablation study is carried out by comparing the similarity between the speech feature from SE and the corresponding face features from the original face image, measured by the L1 and L2 distances and the cosine similarity expressed as an angle in degrees. Results are shown in Table 4, where we notice that achieves the best performance among all cases, while has the worst performance. and have a slight improvement compared to the result of , showing that the gender face prior gains less than the neutral face prior. When we switch to the and , we find that they are better than . This is reasonable since they are trained on female and male data separately (i.e., they are data dependent), which makes the task easier. More importantly, a fair comparison in case of the shows that our method achieves , outperforming the SOTA [oh2019speech2face] results by a large margin. Besides, we see that the output feature of the fully connected layer brings evident benefits when combined with the neutral prior, while it leads to a slightly worse performance when combined with the gender prior.

Furthermore, we show the decline of the loss function of and by plotting it every 1000 steps of batch training in Figure 5. It shows that our converges faster than , and that it avoids the unstable training caused by abnormal data samples. Two abnormal training loss values appear at the beginning of the training stage for , while this phenomenon does not appear in the case of .

The qualitative results of the ablation study are shown in Figure 6. We randomly take 10 groups from the test set for the demonstration. We first notice that the face image generated by Face-to-Face is the most similar to the original face image, since it is the reference standard. More importantly, we find that the face image generated by the is also highly similar to the original face image, showing superior quality to . Besides, the performs better than , confirming that the neutral prior is more robust than the gender prior in this work. Recall that by using the CBAM, our model focuses only on the face and ignores the background, leading to blurred backgrounds in the generated face images.

Figure 6: Face generated by the AR-SPM. The 1st row: original face images cropped from video, the 2nd row: the Face-to-Face result, the 3rd row: face images generated from speech by , the 4th row: face images generated from speech by , and the 5th row: face images generated from speech by .

4.3.3 The performance of gender and age recognition

To further evaluate the quality of our proposed AR-SPM, we conduct a gender and age recognition experiment on the face images generated by our model. The Face++ API [faceplus] is utilized to evaluate the gender and age. Results are shown in Table 5 and Table 6, respectively. Because and obtain superior performance to and as shown in Table 4, we only present the recognition results using and . Naturally, Face-to-Face gives the upper limit of the recognition accuracy. Apart from that, we find that the achieves the highest accuracy in all cases. Besides, we can see that and are slightly worse than . It should be mentioned that the AVSpeech dataset is not age-balanced (about 35% Young, 55% Mid-age and 10% Elder). The lower accuracy for the Elder group comes from the limited number of images of older speakers in the dataset.

These results show that introducing the idea of the residual by using the neutral prior can improve the efficiency of the model training and further enhance the robustness.

Model Male Female Total
Face-to-Face 97.4% 92.7% 95.5%
95.6% 89.8% 93.2%
97.9% 89.9% 94.7%
87.7% 93.6% 90.1%
97.4% 68.0% 85.8%
Table 5: The accuracy of gender recognition using the face generated by the proposed AR-SPM.
Model Young Mid-age Elder Total
Face-to-Face 76.0% 70.5% 46.7% 70.3%
53.6% 65.5% 33.1% 57.1%
67.6% 66.1% 51.2% 65.2%
55.5% 70.4% 23.5% 63.0%
64.8% 64.2% 40.4% 62.9%
Table 6: The accuracy of age recognition using the face generated by the proposed AR-SPM. Young means under 35 years old, Mid-age means 35 to 65 years old, and Elder means more than 65 years old.

5 Conclusion

To alleviate the mismatch between speech and face in speech-based face generation, we propose a novel AR-SPM based on an end-to-end encoder-decoder structure, which utilizes an additional prior face feature to complement the speech feature in the SE network. Two prior face features (i.e., the neutral and the gender prior face features) are explored. In addition, we redesign the encoder and decoder by incorporating the CBAM into the SE and FD to capture the spatial and channel relationships and suppress noise. Results on the AVSpeech dataset show that our proposed AR-SPM accelerates the convergence of training and achieves SOTA performance. In the future, we will explore how to eliminate the influence of attributes such as hair and image background on the reconstruction of the speaker’s face, and how to apply the model in other fields, such as preliminary medical image generation or diagnosis from a speaker’s speech.