1 Introduction
Recently, audio-visual cross-modal learning has become increasingly popular [PuttingaFacetotheVoice], and one of the most interesting topics is generating the facial appearance of a speaker from his or her audio speech. Research on "speech to portrait" has a great impact on our daily life, especially in community security. For example, when identifying suspects [Parkhi2015Deep] for whom no images or appearance features are available but only audio samples, the police can use "speech to portrait" technology to generate face images that resemble the suspects' real faces.
In order to explore the feasibility of generating faces from speech, previous studies have shown that a person's voice is strongly related to his or her facial structure [Thinkingthevoice, PuttingtheFacetotheVoice]. For example, it was shown in [mermelstein1967determination, teager1990evidence] that facial bones, joint structures and the tissues covering them are closely related to the shape and size of the organs that produce sound. Meanwhile, genetic, biological and environmental factors, especially gender [Kotti2008Gender, inproceedings], age [PtacekAge, Ageestimation, Zazo2018Age] and ethnicity, largely influence both one's voice and one's face.
Recently, research on speech-to-face generation using deep learning methods has emerged [ProfilingHumansfromtheirVoice, oh2019speech2face], among which Speech2Face [oh2019speech2face] is the state-of-the-art (SOTA). It directly encodes the speech into a facial feature with a Convolutional Neural Network (CNN) [Krizhevsky2012ImageNet] that learns facial features [merler2019diversity], and then uses this feature to generate the face [ColeSynthesizing]. This method obtains good results, but the CNN structure of the speech encoder is complicated, and conventional CNNs may suffer from the gradient vanishing problem. More importantly, due to the natural mismatch between speech and face [singh2019reconstruction], there is still room to reduce the mismatch and improve the performance.

To overcome the shortcomings of previous works and further alleviate the mismatch between speech and face, in this work we propose a novel Attention-based Residual Speech Portrait Model (AR-SPM) within an encoder-decoder architecture. It introduces a prior face feature that guides the speech feature to approach the original face feature as closely as possible. Besides, the Convolutional Block Attention Module (CBAM) [woo2018cbam] is incorporated into the encoder to extract the key information of the speech and to make the decoder focus only on the facial features, ignoring the background noise. In particular, we propose a tri-item loss function for the encoder, a linear combination of the L2-norm, L1-norm and negative cosine losses, which takes into account errors in both magnitude and direction. The decoder loss function combines the Multi-Scale Structural Similarity Index (MS-SSIM) [ZhaoLoss] with L1 and cosine losses. Qualitative and quantitative evaluations on the AVSpeech dataset [ephrat2018looking] show the superior performance of our proposed method, which outperforms the SOTA results [oh2019speech2face]. An overview of our proposed AR-SPM is shown in Figure 1.
Overall, our contributions can be summarized as follows:
-
We propose a novel AR-SPM based on an encoder-decoder architecture. A residual strategy is introduced by incorporating a prior face feature, making the network capture the most representative facial features and improving the learning of the speech portrait model.
-
We improve the network structure by adding a spatial-channel attention mechanism and constructing a symmetrical face decoder network. Moreover, we establish a tri-item loss function containing L2-norm, L1-norm and negative cosine losses to accelerate the convergence of training.
-
Experimental results on the AVSpeech dataset [ephrat2018looking] show that our model achieves SOTA results and significantly reduces the angular distance (cosine similarity in degrees) between the generated and ground-truth face features compared with the previous SOTA.
2 Related Work
In the literature, three classical methods have been used to construct a mapping between speech and face. The first method uses attribute classifiers to predict fixed attributes from the speaker's voice, such as gender, age [Zazo2018Age] and nationality. It then automatically retrieves the best-matching face images from a local face image database based on the predicted attributes, and finally merges the selected face images using the method in [zhmoginov2016inverting]. However, this speech portrait approach is limited by the accuracy of the attribute classifiers, since they need to be trained in a supervised manner, and it restricts the correlation between face and speech to a fixed group of attributes.

The second method realizes a speech portrait model by direct regression with a Deep Neural Network (DNN) [Yu2015Deep]. Specifically, a DNN is used to construct the mapping between speech and a fixed-resolution RGB face image [duarte2019wav2pix, Goodfellow2014Generative, ReedGenerative, XiaAuxiliary]. However, this method is hard to realize because such a model needs the speaker's original face image as a training label. As a result, the model is easily susceptible to uncontrollable factors in the face image, such as facial expression, head posture, object occlusion, lighting conditions and background, which greatly affect the regression quality. In addition, the model must learn autonomously and parse out many complex nonlinear transformations [oh2019speech2face].
The third method improves on the second by splitting the network into two parts: the encoder independently learns to extract face-related information from the speech, and the decoder independently learns to restore a standard-resolution face image from the face feature, as in [oh2019speech2face]. Such a model is trained in a self-supervised way by utilizing pairs of faces and speech extracted from videos. The network can be described as:

$\hat{F} = FD\big(SE(s)\big) \qquad (1)$

where $s$ is the speech and $\hat{F}$ is the generated face image, SE denotes the speech encoder network and FD the face decoder network. Even though this method [oh2019speech2face] is creative and remarkable, it still needs further improvement. Due to the natural mismatch between speech and face [singh2019reconstruction], there is still room to improve the performance and reduce the mismatch.
To this end, in this work, we propose a novel AR-SPM to further optimize the SE structure and propose to incorporate a prior face feature to complement the speech feature.
3 Method
In the following, we describe the details of our proposed AR-SPM (see Figure 1), including the prior face features, the CBAM, the two new loss functions, and the structures of the SE and FD networks.

3.1 Prior face feature
By introducing the prior face feature, we exploit the idea of the residual to remove the main common part of the face (i.e., the prior face feature), thereby highlighting the small variations depicted by the speech feature. Our AR-SPM converts speech to a face image by a network defined as follows:

$\hat{F} = FD\big(SE(s) + f_{prior}\big) \qquad (2)$

where $\hat{F}$ is the generated face, $s$ denotes the spectrogram of the input speech, and $f_{prior}$ is the prior face feature, which is calculated before the training stage. The merged face feature is then upsampled into an RGB face image by the FD. Two ways are investigated to obtain the final face feature: the first is the sum of the prior face feature and the speech feature output by the SE; the second first feeds this sum to a fully connected (FC) layer and then takes its output as the final face feature. Recall that the goal of the SE is to learn speech features that mimic facial features. Motivated by [he2016deep], the SE is designed to learn the residual, i.e., the difference between the final face feature and the face prior, instead of learning the face feature from speech directly.
A well-defined prior face feature can approximate the speaker's face feature in many dimensions (e.g., eye contour and lip contour) and contains the main common part of the face. We therefore add the prior face feature in the SE to reduce the training difficulty and learn more representative facial features.
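To make the residual fusion concrete, the following is a minimal PyTorch sketch of the two fusion variants described above (direct sum, and sum followed by an FC layer). The feature dimension of 4096 and the module and variable names are our own assumptions for illustration, not the authors' exact code.

```python
import torch
import torch.nn as nn

class PriorFusion(nn.Module):
    """Fuse the speech-encoder output with a fixed prior face feature.

    The encoder only has to learn the residual between the target face
    feature and the prior, which is the idea described above.
    """
    def __init__(self, feat_dim=4096, use_fc=False):
        super().__init__()
        self.use_fc = use_fc
        # Optional fully connected fusion layer (second variant).
        self.fc = nn.Linear(feat_dim, feat_dim) if use_fc else None

    def forward(self, speech_feat, prior_feat):
        # speech_feat: (B, D) residual predicted by the speech encoder
        # prior_feat:  (D,)   precomputed prior face feature (kept fixed)
        fused = speech_feat + prior_feat.unsqueeze(0)
        if self.use_fc:
            fused = self.fc(fused)
        return fused  # final face feature passed to the face decoder

# Usage sketch: a batch of 8 speech features and a placeholder prior.
prior = torch.randn(4096)
fusion = PriorFusion(use_fc=True)
face_feat = fusion(torch.randn(8, 4096), prior)
```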
Two face priors are investigated in this work. The first is the neutral face prior, which is the arithmetic mean of the features of a large gender-balanced face image set:

$f_{prior} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{VGGFace}(F_i) \qquad (3)$

where $F_i$ denotes the $i$-th face image and $N$ is the number of face images. VGGFace denotes the CNN in [Parkhi2015Deep], which is exploited to extract face features from face images. We take $N$ = 10, 50, 100, 500, 1000, 5000 and 10000, and experimentally find that as $N$ increases the neutral prior feature tends to converge, indicating that the calculated prior face feature becomes more representative, as shown in Figure 2 and Table 1. In this work, we finally take $N$ = 10000.
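A minimal sketch of how such a neutral prior can be computed, assuming a pretrained VGGFace-style feature extractor `vggface` that maps a batch of face images to 4096-dimensional features (the extractor, the dataloader and the dimensionality are assumptions, not the authors' exact pipeline):

```python
import torch

@torch.no_grad()
def compute_prior_face_feature(face_loader, vggface, device="cpu"):
    """Arithmetic mean of VGGFace features over a (gender-balanced) image set."""
    vggface.eval().to(device)
    total, count = None, 0
    for images in face_loader:                 # images: (B, 3, H, W)
        feats = vggface(images.to(device))     # (B, 4096) face features
        total = feats.sum(0) if total is None else total + feats.sum(0)
        count += feats.size(0)
    return total / count                       # prior face feature, shape (4096,)
```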
Neutral | Male | Female
---|---|---
27.861 | 43.908 | 36.924
12.155 | 24.302 | 16.318
10.179 | 14.937 | 15.256
3.815 | 5.059 | 5.797
3.176 | 4.254 | 4.505
1.179 | 1.967 | 1.913

The second prior face feature is a gender-dependent prior, which assigns separate prior face features to male and female speakers. To achieve this, a robust classifier is first needed to predict the speaker's gender from the audio speech. We establish a lightweight CNN, containing five convolution layers, three max-pooling layers and two fully connected layers, to predict the gender of speakers. It is trained and tested on the VGGFace dataset, and the gender classification accuracy reaches 97.01% on the test set.
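A rough sketch of such a lightweight spectrogram-based gender classifier is given below. Only the layer counts (five convolutions, three max-pooling layers, two FC layers) are specified above, so the kernel sizes, channel widths and pooling placement are our assumptions.

```python
import torch
import torch.nn as nn

class GenderNet(nn.Module):
    """Lightweight CNN: 5 conv layers, 3 max-pool layers, 2 FC layers."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # pool 1
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # pool 2
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                               # pool 3
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(
            nn.Linear(128 * 4 * 4, 256), nn.ReLU(inplace=True),
            nn.Linear(256, n_classes),                     # male / female logits
        )

    def forward(self, spectrogram):                        # (B, 1, F, T)
        x = self.features(spectrogram)
        return self.classifier(x.flatten(1))
```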
Based on the classification results, we calculate the male and female prior face features using the same method as the neutral prior face feature. The gender prior features tend to converge as well when the number of face images increases (see Table 1).
It should be noted that these two prior face features are unbiased with respect to age, since the AVSpeech dataset contains candidates aged from 20 to 50 years old. Besides, in this work we do not take into account potential attributes such as skin color and face shape, which do not affect the speaker's portrait.
3.2 Convolutional Block Attention Module
To give our proposed AR-SPM the ability to focus on important features and ignore irrelevant information, we utilize the Convolutional Block Attention Module (CBAM) [woo2018cbam], a lightweight general-purpose module that can easily be integrated into an end-to-end CNN architecture.
CBAM contains a channel attention module and a spatial attention module, which are embedded into both the encoder and the decoder. In the channel attention module, each channel of a feature map acts as a specialized detector, so channel attention focuses on "what" is important, while the spatial attention module determines "where" the useful features are. The channel and spatial attention can be combined in parallel or in tandem. We experimentally find that the tandem order "channel attention first, then spatial attention" achieves the best result. Besides, we also tried a channel-only attention mechanism [HuSqueeze], and CBAM turns out to perform better.
As shown in Figure 1, we apply CBAM in both the SE and the FD, so that the SE obtains a better feature representation and the FD focuses on the facial information instead of the background noise.
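For reference, a compact implementation of CBAM in the tandem order used here (channel attention first, then spatial attention) might look as follows. The reduction ratio and kernel size follow the common defaults of [woo2018cbam] and are not taken from this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """'What' to attend to: per-channel weights from avg- and max-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):                                   # x: (B, C, H, W)
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).flatten(1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).flatten(1))
        scale = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * scale

class SpatialAttention(nn.Module):
    """'Where' to attend to: a 2-D attention map from channel-wise avg and max."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale

class CBAM(nn.Module):
    """Tandem order used in this work: channel attention first, then spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        return self.sa(self.ca(x))
```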
Layer | Kernel | Stride | Padding | Out Padding | Out Channels |
---|---|---|---|---|---|
Fc1 | 1 | - | - | - | 1000 |
Fc2 | 1 | - | - | - | 25088 |
ConvTrans1 | 5 | 2 | 2 | 1 | 512 |
ConvTrans2 | 5 | 1 | 2 | 0 | 512 |
ConvTrans3 | 5 | 1 | 2 | 0 | 512 |
ConvTrans4 | 5 | 1 | 2 | 0 | 512 |
ConvTrans5 | 5 | 1 | 2 | 0 | 512 |
ConvTrans6 | 5 | 2 | 2 | 1 | 256 |
ConvTrans7 | 5 | 1 | 2 | 0 | 256 |
ConvTrans8 | 5 | 1 | 2 | 0 | 256 |
ConvTrans9 | 5 | 2 | 2 | 1 | 64 |
ConvTrans10 | 5 | 1 | 2 | 0 | 64 |
ConvTrans11 | 5 | 2 | 2 | 1 | 32 |
Conv1 | 1 | 1 | 0 | - | 3 |
3.3 Encoder-decoder structure
In this work, we redesign the SE network as shown in Table 3, where the Fc2 layer is a fusion layer for the speech feature and the prior face feature, and CBAM [woo2018cbam] is added to the CNN structure. We also tried general deep learning architectures such as ResNets [he2016deep] and DenseNets [huang2017densely] as the SE network, but experiments showed that these kinds of networks are not effective enough at extracting face-related information for the speech portrait model. We hypothesize that, as skip-connected networks, ResNets and DenseNets may propagate noise or silent fragments, requiring more training steps to extract effective speech features.
As for the FD network, we exploit the transposed-convolution structure of [cole2017synthesizing] and design a network that is symmetrical to the VGGFace model. The structure of the FD network is shown in Table 2. To reduce the number of hyperparameters, we first use two FC layers (Fc1 and Fc2) to resize the input feature into a feature map with the same shape as the last feature map of the VGGFace model, rather than relying on larger fully connected layers. We then use our designed transposed convolutions to upsample this feature map, and finally a 1x1 convolution converts it into a 3-channel RGB face image.
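As a structural illustration of Table 2 (not the authors' exact code), the decoder can be assembled roughly as follows. We assume the Fc2 output (25088 = 512x7x7) is reshaped into a 7x7x512 map before the transposed convolutions, that the input face feature is 4096-dimensional, and that each layer is followed by ReLU.

```python
import torch
import torch.nn as nn

def deconv(in_ch, out_ch, stride, out_pad):
    """5x5 transposed convolution block with padding 2, as listed in Table 2."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=5, stride=stride,
                           padding=2, output_padding=out_pad),
        nn.ReLU(inplace=True))

class FaceDecoder(nn.Module):
    """Upsampling decoder following the layer list in Table 2 (sketch)."""
    def __init__(self, feat_dim=4096):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 1000), nn.ReLU(inplace=True),      # Fc1
            nn.Linear(1000, 512 * 7 * 7), nn.ReLU(inplace=True))   # Fc2
        self.deconvs = nn.Sequential(
            deconv(512, 512, 2, 1),                  # ConvTrans1: 7 -> 14
            deconv(512, 512, 1, 0), deconv(512, 512, 1, 0),
            deconv(512, 512, 1, 0), deconv(512, 512, 1, 0),
            deconv(512, 256, 2, 1),                  # ConvTrans6: 14 -> 28
            deconv(256, 256, 1, 0), deconv(256, 256, 1, 0),
            deconv(256, 64, 2, 1),                   # ConvTrans9: 28 -> 56
            deconv(64, 64, 1, 0),
            deconv(64, 32, 2, 1),                    # ConvTrans11: 56 -> 112
            nn.Conv2d(32, 3, kernel_size=1))         # Conv1: 1x1 conv to RGB

    def forward(self, face_feat):                    # (B, 4096)
        x = self.fc(face_feat).view(-1, 512, 7, 7)
        return self.deconvs(x)
```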
3.4 Proposed loss functions
In the following, we introduce our proposed loss functions of FD and SE, respectively.
3.4.1 Face Decoder loss function
Concerning the loss function of the FD network, we modify the joint loss function used in [cole2017synthesizing]. As shown in Figure 3, our FD loss is composed of an image loss, i.e., the error between the generated image and the original image, and a face feature loss, i.e., the error between the face features of the generated and original images. Such a joint loss not only directly penalizes pixel errors between images, but also penalizes differences in abstract identity. Specifically, we adopt the mixed MS-SSIM and $\ell_1$ loss of [zhao2015loss] as the image loss:

$\mathcal{L}_{img} = \alpha\,\mathcal{L}_{\text{MS-SSIM}} + (1-\alpha)\,G_{\sigma}\cdot\mathcal{L}_{\ell_1} \qquad (4)$

where $\alpha$ is the mixing weight and $G_{\sigma}$ denotes the parameters of the Gaussian distribution used in MS-SSIM. MS-SSIM considers the influence of resolution, so it retains high-frequency information (e.g., image edges and details) but tends to cause brightness changes and color deviations. To tackle this problem, we exploit the $\ell_1$ loss to keep brightness and color unchanged. Combining them produces a good penalty for the image generation task. In order to pay more attention to the face contour in the image rather than the value of each pixel itself, we further introduce the cosine similarity loss, which is generally used to measure the difference between two embeddings in the feature space:

$\mathcal{L}_{\cos} = 1 - \frac{\phi(\hat{F}) \cdot \phi(F)}{\lVert\phi(\hat{F})\rVert_2\,\lVert\phi(F)\rVert_2} \qquad (5)$

where $\phi(\cdot)$ denotes the face feature extracted by VGGFace, and $\hat{F}$ and $F$ are the generated and original face images. Finally, the total loss function of the FD is:

$\mathcal{L}_{FD} = \mathcal{L}_{img} + \mathcal{L}_{\cos} \qquad (6)$
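A hedged sketch of this joint decoder loss is given below. We use the third-party `pytorch_msssim` package for MS-SSIM and a generic frozen `face_encoder` (e.g., VGGFace) for the feature term; the weights `alpha` and `lam` are placeholders, and the Gaussian weighting of the L1 term used in [zhao2015loss] is omitted for brevity.

```python
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim  # third-party MS-SSIM implementation

def face_decoder_loss(gen_img, gt_img, face_encoder, alpha=0.84, lam=1.0):
    """Joint FD loss: (MS-SSIM + L1) image term plus cosine feature term.

    gen_img, gt_img: (B, 3, H, W) images scaled to [0, 1].
    face_encoder:    frozen network returning identity features, e.g. VGGFace.
    alpha, lam:      placeholder balancing weights (assumed, not from the paper).
    """
    # Image loss: MS-SSIM keeps high-frequency structure, L1 keeps color/brightness.
    # A smaller window is used so the metric also works for modest image sizes.
    msssim_term = 1.0 - ms_ssim(gen_img, gt_img, data_range=1.0, win_size=7)
    l1_term = F.l1_loss(gen_img, gt_img)
    image_loss = alpha * msssim_term + (1.0 - alpha) * l1_term

    # Feature loss: negative cosine similarity between identity embeddings.
    with torch.no_grad():
        gt_feat = face_encoder(gt_img)
    gen_feat = face_encoder(gen_img)
    feature_loss = 1.0 - F.cosine_similarity(gen_feat, gt_feat, dim=1).mean()

    return image_loss + lam * feature_loss
```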
3.4.2 Speech Encoder loss function
A tri-item loss function is proposed for the SE network. The first item is an L2 loss between the unitized speech feature (i.e., the output of the SE network) and the unitized face feature (i.e., the real speaker's face feature obtained by VGGFace). The second item is an L1 loss between the outputs of the FD network's first layer for the two features, which penalizes the difference of hidden-layer features. The third item is a cosine similarity loss between the outputs of the third layer of VGGFace, which penalizes the difference of identity features. We also tried the knowledge distillation loss in [oh2019speech2face], and it shows inferior performance to the cosine similarity loss. Finally, the tri-item loss function is:

$\mathcal{L}_{SE} = \lambda_1\,\mathcal{L}_{L_2} + \lambda_2\,\mathcal{L}_{L_1} + \lambda_3\,\mathcal{L}_{\cos} \qquad (7)$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are balancing coefficients chosen to keep the gradients of the three loss items on a similar scale.
Since the FD takes part in the calculation of the SE loss, the FD is trained before the SE, and the parameters of the FD are frozen during the training of the SE.
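The tri-item encoder loss could be sketched as below. The mapping of the three items to L2 / L1 / negative-cosine terms follows the description above, while the choice of intermediate layers (`fd_hidden`, `vgg_hidden`) and the coefficient values are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def speech_encoder_loss(v_s, v_f, fd_hidden, vgg_hidden,
                        lambda1=1.0, lambda2=1.0, lambda3=1.0):
    """Tri-item SE loss: L2 on unitized features, L1 on FD hidden features,
    negative cosine on VGGFace intermediate features.

    v_s: (B, D) speech feature output by the speech encoder.
    v_f: (B, D) ground-truth face feature from VGGFace.
    fd_hidden / vgg_hidden: callables mapping a face feature to the
        corresponding intermediate activation (e.g., the first layer of the
        frozen FD, and the third VGGFace layer applied to the decoded face).
    """
    # Item 1: L2 distance between unit-normalized features.
    item1 = F.mse_loss(F.normalize(v_s, dim=1), F.normalize(v_f, dim=1))

    # Item 2: L1 distance between hidden features of the (frozen) face decoder.
    item2 = F.l1_loss(fd_hidden(v_s), fd_hidden(v_f))

    # Item 3: negative cosine similarity between VGGFace identity features.
    item3 = 1.0 - F.cosine_similarity(vgg_hidden(v_s), vgg_hidden(v_f), dim=1).mean()

    return lambda1 * item1 + lambda2 * item2 + lambda3 * item3
```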
4 Experiment
4.1 Datasets and Implementation Details
The AVSpeech dataset [ephrat2018looking], a large-scale dataset of speaking videos collected from YouTube, is used to evaluate our model; 0.15 million videos are used as the training set and 8 thousand videos as the test set. Besides, we use the VGGFace dataset to train our gender prediction network. First, we crop the face images from the AVSpeech training set to a fixed size to train the FD network, whose structure and implementation details are shown in Table 2. Then, we clip and rescale the audio from each AVSpeech video to 6 s, concatenating the audio if it is shorter than 6 s and cutting it if it is longer. The audio waveform is resampled at 16 kHz and only a single channel is used. Spectrograms are computed following the speaker-independent audio-visual model of [ephrat2018looking].
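As an illustration of this preprocessing, a minimal sketch with torchaudio is shown below. The resampling routine and STFT parameters are our assumptions, since the exact spectrogram computation follows [ephrat2018looking].

```python
import torchaudio

def prepare_spectrogram(wav_path, target_len_s=6, sample_rate=16000):
    """Load audio, force it to 6 s mono at 16 kHz, and compute a spectrogram."""
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(dim=0, keepdim=True)                        # single channel
    if sr != sample_rate:
        wav = torchaudio.transforms.Resample(sr, sample_rate)(wav)
    target_len = target_len_s * sample_rate
    if wav.size(1) < target_len:                               # concatenate if too short
        wav = wav.repeat(1, target_len // wav.size(1) + 1)
    wav = wav[:, :target_len]                                  # cut if too long
    # STFT parameters below are placeholders, not the values of [ephrat2018looking].
    return torchaudio.transforms.Spectrogram(n_fft=512, hop_length=160)(wav)
```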
We use the spectrogram and face-image pairs to train the SE network, whose structure and implementation details are shown in Table 3. Our model is implemented in PyTorch 1.1.0 and optimized by Adam [kingma2014adam] with a learning rate of 0.001, which is exponentially decayed by a factor of 0.9 every 2 epochs. We train our model for 50 epochs with a batch size of 16.
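A minimal, self-contained sketch of the corresponding optimizer and learning-rate schedule is given below; the model, data and loss are stand-ins, not the actual AR-SPM components.

```python
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096)             # stand-in for the SE network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decay the learning rate by a factor of 0.9 every 2 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.9)

for epoch in range(50):
    for _ in range(10):                   # stand-in for batches of size 16
        speech_feat = torch.randn(16, 4096)
        target_feat = torch.randn(16, 4096)
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(speech_feat), target_feat)
        loss.backward()
        optimizer.step()
    scheduler.step()                      # step the decay once per 2 epochs counts
```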
4.2 Evaluation Metrics
The face images generated by a speech portrait model may be affected by variations in face pose, expression, background or non-critical details, so pixel-level evaluation metrics are not applicable. In this work, we adopt a subset of the evaluation metrics commonly employed in the SOTA work [oh2019speech2face]. More precisely, we do not directly compare the generated face images with the original face images, but compare the face features extracted from them, because face features are robust to expression, head posture, lighting conditions and background [cole2017synthesizing]. Besides, we compare the speech features output by the SE network with the corresponding face features of the ground-truth face images, using the $L_1$ distance, the $L_2$ distance and the cosine similarity expressed in degrees (denoted $L_1^s$, $L_2^s$ and $\cos^s$). When $L_1$ and $L_2$ tend to 0, the two features are close in value; when the cosine similarity tends to 1 (i.e., the angle tends to 0 degrees), the two features are similar in direction. We also measure the similarity between the features of the face images generated by the AR-SPM and the corresponding ground-truth face features using the same three metrics, denoted $L_1^f$, $L_2^f$ and $\cos^f$. Furthermore, we compare the original face features with the features of the faces generated from the original face features. For convenience, we call this setting Face-to-Face; it can be regarded as an upper-bound benchmark, since the speech portrait model relies on the face feature of the original face image for guidance.

Layer | Kernel | Stride | Padding | Out Channels |
---|---|---|---|---|
Conv1 | 7 | 2 | 1 | 64 |
MaxPool1 | 3 | 2 | 0 | - |
Conv2 | 5 | 2 | 1 | 128 |
MaxPool2 | 3 | 2 | 0 | - |
Conv3 | 3 | 1 | 1 | 256 |
Conv4 | 3 | 1 | 1 | 512 |
Conv5 | 3 | 1 | 1 | 512 |
MaxPool3 | | | 0 | -
CBAM | - | - | - | - |
Fc1 | 1 | - | - | 4096 |
AvgPool1 | - | - | - | -
Fc2 | 1 | - | - | 4096 |
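The feature-level metrics of Section 4.2 (L1 distance, L2 distance, and cosine similarity expressed in degrees) can be computed as in the following sketch, which assumes the angle is obtained as the arccosine of the cosine similarity:

```python
import math
import torch
import torch.nn.functional as F

def feature_metrics(feat_a, feat_b):
    """L1 distance, L2 distance, and cosine similarity in degrees between two feature vectors."""
    l1 = torch.sum(torch.abs(feat_a - feat_b)).item()
    l2 = torch.norm(feat_a - feat_b, p=2).item()
    cos = F.cosine_similarity(feat_a.unsqueeze(0), feat_b.unsqueeze(0)).item()
    degrees = math.degrees(math.acos(max(-1.0, min(1.0, cos))))
    return l1, l2, degrees

# Example: compare a speech feature with the ground-truth face feature.
a, b = torch.randn(4096), torch.randn(4096)
print(feature_metrics(a, b))
```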
4.3 Quantitative and Qualitative Analysis
In the following, we show the quantitative and qualitative results of the proposed FD network and the AR-SPM.
4.3.1 The Performance of Face Decoder Network
We train the FD network on face images from the AVSpeech dataset and show qualitative results of face reconstruction in Figure 4, where we can see that the faces generated by the FD are similar to the ground truth. Besides, since the CBAM attention module is used in our model, the network concentrates on extracting face features and ignores the background, resulting in a blurred background.
The quantitative results of the FD network are shown in the second row of Table 4 (Face-to-Face), which reports the similarity between the features of the face images generated by the FD network from the original face features and those of the original face images.

4.3.2 The Performance of AR-SPM
To verify the performance of the AR-SPM, we train the SE network on spectrogram and face-image pairs from the AVSpeech dataset. Let and denote the model with the neutral prior face feature and the model with the gender prior face feature (given by the proposed automatic gender classifier), respectively, both trained on data containing both male and female speakers. and denote the models using the male and female face priors, trained with only male and only female data, respectively. Besides, we denote the prior-feature models that use an FC layer to fuse the prior face feature with the output speech feature as and , respectively.

Model | $L_1^s$ | $L_2^s$ | $\cos^s$ (deg) | $L_1^f$ | $L_2^f$ | $\cos^f$ (deg)
---|---|---|---|---|---|---|
Face-to-Face | - | - | - | 97.158 | 1.890 | 14.534 |
 | 162.957 | 3.176 | 29.191 | 179.594 | 3.499 | 30.231
 | 126.524 | 2.470 | 15.204 | 144.027 | 2.809 | 25.842
 | 143.335 | 2.795 | 27.748 | 148.610 | 2.897 | 25.973
 | 155.008 | 3.023 | 28.115 | 171.860 | 3.350 | 29.308
 | 145.192 | 2.831 | 27.994 | 150.900 | 2.941 | 26.104
 | 128.641 | 2.501 | 19.438 | 156.399 | 3.048 | 21.721
 | 130.861 | 2.542 | 19.610 | 152.985 | 2.981 | 21.409
An ablation study is carried out by comparing the similarity between the speech feature from the SE and the corresponding face feature from the original face image, measured by the $L_1$ and $L_2$ distances and the cosine similarity expressed in degrees. Results are shown in Table 4, where we notice that achieves the best performance among all cases, while has the worst performance. and show a slight improvement over , indicating that the gender face prior gains less than the neutral face prior. When we switch to and , we find that they are better than . This is reasonable since they are trained on female and male data separately (i.e., data dependent), which makes the task easier. More importantly, a fair comparison in the case of shows that our method achieves , outperforming the SOTA results [oh2019speech2face] by a large margin. Besides, we see that the output feature of the FC layer brings evident benefits when combined with the neutral prior, while it leads to a marginally worse performance when combined with the gender prior.
Furthermore, Figure 5 plots the training loss of and every 1000 steps of batch training. It shows that our converges faster than and avoids the unstable training caused by abnormal data samples: two abnormal training-loss values appear at the beginning of the training stage for , while this phenomenon does not occur for .
The qualitative results of the ablation study are shown in Figure 6. We randomly take 10 groups from the test set for demonstration. We first notice that the face images generated in the Face-to-Face setting are the most similar to the original face images, since this setting serves as the reference standard. More importantly, we find that the face images generated by are also highly similar to the original faces, showing superior quality to . Besides, performs better than , confirming that the neutral prior is more robust than the gender prior in this work. Recall that, by using the CBAM, our model focuses only on the face and ignores the background, leading to blurred backgrounds in the generated face images.

4.3.3 The performance of gender and age recognition
To further evaluate the quality of our proposed AR-SPM, we conduct gender and age recognition experiments on the face images generated by our model. The Face++ API [faceplus] is utilized to estimate gender and age. Results are shown in Table 5 and Table 6, respectively. Since and obtain superior performance to and as shown in Table 4, we only present the recognition results using and . Naturally, Face-to-Face gives the upper limit of the recognition accuracy. Apart from that, we find that achieves the highest accuracy in all cases. Besides, and are marginally worse than . It should be mentioned that the AVSpeech dataset is not age-balanced (about 35% Young, 55% Mid-age and 10% Elder); the lower accuracy for the elder group comes from the limited number of elderly images in the dataset.
These results show that introducing the idea of the residual by using the neutral prior can improve the efficiency of the model training and further enhance the robustness.
Model | Male | Female | Total |
---|---|---|---|
Face-to-Face | 97.4% | 92.7% | 95.5% |
 | 95.6% | 89.8% | 93.2%
 | 97.9% | 89.9% | 94.7%
 | 87.7% | 93.6% | 90.1%
 | 97.4% | 68.0% | 85.8%
Model | Young | Mid-age | Elder | Total |
---|---|---|---|---|
Face-to-Face | 76.0% | 70.5% | 46.7% | 70.3% |
 | 53.6% | 65.5% | 33.1% | 57.1%
 | 67.6% | 66.1% | 51.2% | 65.2%
 | 55.5% | 70.4% | 23.5% | 63.0%
 | 64.8% | 64.2% | 40.4% | 62.9%
5 Conclusion
To alleviate the mismatch between speech and face in speech-based face generation, we propose a novel AR-SPM based on an end-to-end encoder-decoder structure, which utilizes an additional prior face feature to complement the speech feature in the SE network. Two prior face features (i.e., the neutral and the gender-dependent prior) are explored. In addition, we redesign the encoder and decoder by incorporating CBAM into the SE and FD to capture spatial and channel relationships and suppress noise. Results on the AVSpeech dataset show that our proposed AR-SPM accelerates the convergence of training and achieves SOTA performance. In the future, our model could be extended to eliminate the influence of attributes such as hair and image background on the speaker's face reconstruction, or applied to other fields, such as preliminary medical image generation or diagnosis from a speaker's speech.