1 Introduction
Humans are able to imagine a person’s voice solely from that person’s appearance, and to imagine a person’s appearance solely from his/her voice. Although such predictions are not always accurate, the fact that we can sense a mismatch between voice and appearance indicates the existence of a certain correlation between voices and appearances. In fact, a recent study by Smith et al. [1] revealed that the information provided by faces and voices is so similar that people can match novel faces and voices of the same sex, ethnicity, and age group at a level significantly above chance. An interesting question, then, is whether it is technically possible to predict the voice of a person only from an image of his/her face, and to predict a person’s face only from his/her voice. In this paper, we make the first attempt to develop a method that can convert speech into a voice that matches an input face image, and that can generate a face image matching the voice of input speech, by learning and leveraging the underlying correlation between faces and voices.
Several attempts have recently been made to tackle cross-modal audio/image processing tasks, including voice/face recognition [2] and audio/image generation [3, 4, 5]. The former task involves detecting which of two given face images is that of the speaker, given only an audio clip of someone speaking. Hence, this task differs from ours in that it does not involve audio/image generation. The latter task involves generating sounds from images/videos. The methods presented in [3, 4, 5] are designed to predict very short sound clips (e.g., 0.5 to 2 seconds long), such as the sounds made by musical instruments, dogs, and crying babies, and are unsuited to generating longer audio clips with richer temporal variation, such as speech utterances. By contrast, our task is cross-modal voice conversion (VC), namely converting given speech utterances where the target voice characteristics are determined by visual inputs.
VC is a technique for converting the voice characteristics of an input utterance, such as the perceived identity of the speaker, while preserving linguistic information. Potential applications of VC techniques include speaker-identity modification, speaking aids, speech enhancement, and pronunciation conversion. Typically, conventional VC methods utilize accurately aligned parallel utterances of source and target speech to train acoustic models for feature mapping [6, 7, 8]. Recently, some attempts have also been made to develop non-parallel VC methods [9, 10, 11, 12, 13, 14, 15, 16], which require no parallel utterances, transcriptions, or time alignment procedures. One approach to non-parallel VC involves a framework based on conditional variational autoencoders (CVAEs)
[11, 12, 13, 14]. As the name implies, variational autoencoders (VAEs) [17] are a probabilistic counterpart of autoencoders, consisting of encoder and decoder networks. CVAEs [18] are an extended version of VAEs in which the encoder and decoder networks can additionally take an auxiliary input. By using acoustic features as the training examples and the associated attribute labels (e.g., speaker identity) as the auxiliary input, the networks are able to learn how to convert an attribute of source speech to a target attribute according to the attribute label fed into the decoder. As a different approach, in [15] we proposed a method using a variant of a generative adversarial network (GAN) [19] called a cycle-consistent GAN (CycleGAN) [20, 21, 22]. Although this method was shown to work reasonably well, one major limitation is that it is designed to learn only mappings between a pair of domains. To overcome this limitation, we subsequently proposed in [16] a method incorporating an extension of CycleGAN called StarGAN [23]. This method is capable of simultaneously learning mappings between multiple domains using a single generator network, where the attributes of the generator outputs are controlled by an auxiliary input. StarGAN uses an auxiliary classifier to train the generator so that the attributes of the generator outputs are correctly predicted by the classifier. We further proposed a method based on a concept combining StarGAN and CVAEs, called an auxiliary classifier VAE (ACVAE) [14]. An ACVAE employs a generator with a CVAE structure and uses an auxiliary classifier to train the generator in the same way as StarGAN. Training the generator in this way can be interpreted as increasing the lower bound of the mutual information between the auxiliary input and the generator output.
In this paper, we propose extending the idea behind the ACVAE to build a model for cross-modal VC. Specifically, we use the latent code of an auxiliary face image input, encoded by a face encoder, as the auxiliary input into the speech generator, and we use a voice encoder to train the generator so that the original latent code can be recovered from the generated speech by the voice encoder. We also train a face decoder along with the face encoder to ensure that the latent code contains sufficient information to reconstruct the input face image. In this way, the speech generator is expected to learn how to convert input speech into a voice characteristic that matches an auxiliary face image input, while the voice encoder and the face decoder can be used to generate a face image that matches the voice characteristic of input speech.
2 Method
2.1 Variational Autoencoder (VAE)
Our model employs VAEs [17, 18] as building blocks. Here, we briefly introduce the principle behind VAEs.
VAEs are stochastic neural network models consisting of encoder and decoder networks. The encoder aims to encode given data $\mathbf{x}$ into a (typically) lower-dimensional latent representation $\mathbf{z}$, whereas the decoder aims to recover the data $\mathbf{x}$ from the latent representation $\mathbf{z}$. The decoder is modeled as a neural network (decoder network) that produces a set of parameters for a conditional distribution $p_\theta(\mathbf{x}|\mathbf{z})$, where $\theta$ denotes the network parameters. To obtain an encoder using $p_\theta(\mathbf{x}|\mathbf{z})$, we must compute the posterior $p_\theta(\mathbf{z}|\mathbf{x}) = p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) / p_\theta(\mathbf{x})$. However, computing the exact posterior is usually difficult since $p_\theta(\mathbf{x})$ involves an intractable integral over $\mathbf{z}$. The idea of VAEs is to sidestep the direct computation of this posterior by introducing another neural network (encoder network) for approximating the exact posterior $p_\theta(\mathbf{z}|\mathbf{x})$. As with the decoder network, the encoder network generates a set of parameters for the conditional distribution $q_\phi(\mathbf{z}|\mathbf{x})$, where $\phi$ denotes the network parameters. The goal of VAEs is to learn the parameters of the encoder and decoder networks so that the encoder distribution $q_\phi(\mathbf{z}|\mathbf{x})$ becomes consistent with the posterior $p_\theta(\mathbf{z}|\mathbf{x})$. We can show that the Kullback-Leibler (KL) divergence between $q_\phi(\mathbf{z}|\mathbf{x})$ and $p_\theta(\mathbf{z}|\mathbf{x})$ is given as

$\mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x})] = \log p_\theta(\mathbf{x}) - \mathcal{L}(\phi, \theta; \mathbf{x}),$ (1)

where $\mathcal{L}(\phi, \theta; \mathbf{x}) = \mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})]$. Here, it should be noted that since $\mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}|\mathbf{x})] \ge 0$, $\mathcal{L}(\phi, \theta; \mathbf{x})$ is shown to be a lower bound for $\log p_\theta(\mathbf{x})$. Given training examples,

$J(\phi, \theta) = \mathbb{E}_{\mathbf{x}}[\mathcal{L}(\phi, \theta; \mathbf{x})]$ (2)

can be used as the training criterion to be maximized with respect to $\phi$ and $\theta$, where $\mathbb{E}_{\mathbf{x}}[\cdot]$ denotes the sample mean over the training examples. Obviously, $J(\phi, \theta)$ is maximized when the exact posterior is obtained, i.e., $q_\phi(\mathbf{z}|\mathbf{x}) = p_\theta(\mathbf{z}|\mathbf{x})$.
One typical way of modeling $q_\phi(\mathbf{z}|\mathbf{x})$, $p_\theta(\mathbf{x}|\mathbf{z})$, and $p(\mathbf{z})$ is to assume Gaussian distributions

$q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_\phi(\mathbf{x}), \mathrm{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x}))),$ (3)
$p_\theta(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_\theta(\mathbf{z}), \mathrm{diag}(\boldsymbol{\sigma}_\theta^2(\mathbf{z}))),$ (4)
$p(\mathbf{z}) = \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I}),$ (5)

where $\boldsymbol{\mu}_\phi(\mathbf{x})$ and $\boldsymbol{\sigma}_\phi^2(\mathbf{x})$ are the outputs of an encoder network with parameter $\phi$, and $\boldsymbol{\mu}_\theta(\mathbf{z})$ and $\boldsymbol{\sigma}_\theta^2(\mathbf{z})$ are the outputs of a decoder network with parameter $\theta$. The first term of (2) can be interpreted as an autoencoder reconstruction error. Here, it should be noted that to compute this term, we must compute the expectation with respect to $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$. Since this expectation cannot be expressed in an analytical form, one way of computing it involves using a Monte Carlo approximation. However, simply sampling $\mathbf{z}$ from $q_\phi(\mathbf{z}|\mathbf{x})$ does not work, since once $\mathbf{z}$ is sampled, it is no longer a function of $\phi$, and so it becomes impossible to evaluate the gradient of $J(\phi, \theta)$ with respect to $\phi$. Fortunately, by using a reparameterization $\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, sampling from $q_\phi(\mathbf{z}|\mathbf{x})$ can be replaced by sampling from the distribution of $\boldsymbol{\epsilon}$, which is independent of $\phi$. This allows us to compute the gradient of the first term of (2) with respect to $\phi$ by using a Monte Carlo approximation of the expectation over $\boldsymbol{\epsilon}$. The second term of (2) is given as the negative KL divergence between $q_\phi(\mathbf{z}|\mathbf{x})$ and $p(\mathbf{z})$. This term can be interpreted as a regularization term that forces each element of the encoder output to be uncorrelated and normally distributed. It should be noted that when $q_\phi(\mathbf{z}|\mathbf{x})$ and $p(\mathbf{z})$ are Gaussians, this term can be expressed in closed form as a function of $\boldsymbol{\mu}_\phi(\mathbf{x})$ and $\boldsymbol{\sigma}_\phi^2(\mathbf{x})$.
Conditional VAEs (CVAEs) [18] are an extended version of VAEs, with the only difference being that the encoder and decoder networks can take an auxiliary input $\mathbf{y}$. With CVAEs, (3) and (4) are replaced with

$q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{y}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_\phi(\mathbf{x}, \mathbf{y}), \mathrm{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x}, \mathbf{y}))),$ (6)
$p_\theta(\mathbf{x}|\mathbf{z}, \mathbf{y}) = \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_\theta(\mathbf{z}, \mathbf{y}), \mathrm{diag}(\boldsymbol{\sigma}_\theta^2(\mathbf{z}, \mathbf{y}))),$ (7)

and the training criterion to be maximized becomes

$J(\phi, \theta) = \mathbb{E}_{(\mathbf{x}, \mathbf{y})}\big[\mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{y})}[\log p_\theta(\mathbf{x}|\mathbf{z}, \mathbf{y})] - \mathrm{KL}[q_\phi(\mathbf{z}|\mathbf{x}, \mathbf{y}) \| p(\mathbf{z})]\big],$ (8)

where $\mathbb{E}_{(\mathbf{x}, \mathbf{y})}[\cdot]$ denotes the sample mean over the training examples.
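To make the reparameterization trick and the closed-form KL term concrete, here is a minimal NumPy sketch following the Gaussian parameterization of (3)–(5); it is an illustration only, not tied to any particular deep learning framework:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL[ N(mu, diag(exp(log_var))) || N(0, I) ]:
    # the regularization term of the VAE training criterion.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def reparameterize(mu, log_var):
    # z = mu + sigma * eps with eps ~ N(0, I): the randomness is moved to
    # a parameter-free noise source so gradients can flow through mu, sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu = np.array([0.5, -0.3])
log_var = np.array([0.0, 0.2])
z = reparameterize(mu, log_var)       # one Monte Carlo sample of z
kl = kl_to_standard_normal(mu, log_var)
```

The KL term vanishes exactly when the encoder output matches the standard normal prior, which is what the regularization pushes toward.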
2.2 Proposed model
We use $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_T]$ and $\mathbf{F}$ to denote the acoustic feature vector sequence of a speech utterance and the face image of the corresponding speaker. Now, we combine two VAEs to model the joint distribution of $\mathbf{X}$ and $\mathbf{F}$. The encoder for speech (hereafter, the utterance encoder) aims to encode $\mathbf{X}$ into a time-dependent latent variable sequence $\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_T]$, whereas the decoder (hereafter, the utterance decoder) aims to reconstruct $\mathbf{X}$ from $\mathbf{Z}$ using an auxiliary input $\mathbf{c}$. Ideally, we would like $\mathbf{Z}$ to capture only the linguistic information contained in $\mathbf{X}$, and $\mathbf{c}$ to contain information about the target voice characteristics. Hence, we expect the encoder and decoder to work as acoustic models for speech recognition and speech synthesis, respectively, so that they can be used to convert the voice of an input utterance according to the auxiliary input $\mathbf{c}$. We use the time-independent latent code $\mathbf{c}$ of an image $\mathbf{F}$, encoded by the encoder for face images (hereafter, the face encoder), as the auxiliary input into the utterance decoder. The decoder for face images (hereafter, the face decoder) is designed to reconstruct $\mathbf{F}$ from $\mathbf{c}$. Fig. 1 shows the assumed graphical model for the joint distribution.
Our model can be formally described as follows. The utterance/face decoders and the utterance/face encoders are represented as the conditional distributions $p_\theta(\mathbf{X}|\mathbf{Z}, \mathbf{c})$, $p_\eta(\mathbf{F}|\mathbf{c})$, $q_\phi(\mathbf{Z}|\mathbf{X})$, and $q_\psi(\mathbf{c}|\mathbf{F})$, expressed using NNs with parameters $\theta$, $\eta$, $\phi$, and $\psi$, respectively. Our aim is to approximate the exact posterior $p(\mathbf{Z}, \mathbf{c}|\mathbf{X}, \mathbf{F})$ by $q_\phi(\mathbf{Z}|\mathbf{X}) q_\psi(\mathbf{c}|\mathbf{F})$. The KL divergence between these distributions is given as

$\mathrm{KL}[q_\phi(\mathbf{Z}|\mathbf{X}) q_\psi(\mathbf{c}|\mathbf{F}) \| p(\mathbf{Z}, \mathbf{c}|\mathbf{X}, \mathbf{F})] = \log p(\mathbf{X}, \mathbf{F}) - \mathcal{L}(\phi, \psi, \theta, \eta; \mathbf{X}, \mathbf{F}),$ (9)

where $\mathcal{L}(\phi, \psi, \theta, \eta; \mathbf{X}, \mathbf{F})$ denotes a variational lower bound for $\log p(\mathbf{X}, \mathbf{F})$. Hence, given the training examples of speech and face pairs, we can use

$J(\phi, \psi, \theta, \eta) = \mathbb{E}_{(\mathbf{X}, \mathbf{F})}\big[\mathbb{E}_{\mathbf{Z} \sim q_\phi(\mathbf{Z}|\mathbf{X}), \mathbf{c} \sim q_\psi(\mathbf{c}|\mathbf{F})}[\log p_\theta(\mathbf{X}|\mathbf{Z}, \mathbf{c})] + \mathbb{E}_{\mathbf{c} \sim q_\psi(\mathbf{c}|\mathbf{F})}[\log p_\eta(\mathbf{F}|\mathbf{c})] - \mathrm{KL}[q_\phi(\mathbf{Z}|\mathbf{X}) \| p(\mathbf{Z})] - \mathrm{KL}[q_\psi(\mathbf{c}|\mathbf{F}) \| p(\mathbf{c})]\big]$ (10)

as the training criterion to be maximized with respect to $\phi$, $\psi$, $\theta$, and $\eta$, where $\mathbb{E}_{(\mathbf{X}, \mathbf{F})}[\cdot]$ denotes the sample mean over the training examples. We assume the encoder/decoder distributions for $\mathbf{X}$ and $\mathbf{F}$ to be Gaussian distributions:

$q_\phi(\mathbf{Z}|\mathbf{X}) = \mathcal{N}(\mathbf{Z}; \boldsymbol{\mu}_\phi(\mathbf{X}), \mathrm{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{X}))),$ (11)
$p_\theta(\mathbf{X}|\mathbf{Z}, \mathbf{c}) = \mathcal{N}(\mathbf{X}; \boldsymbol{\mu}_\theta(\mathbf{Z}, \mathbf{c}), \mathrm{diag}(\boldsymbol{\sigma}_\theta^2(\mathbf{Z}, \mathbf{c}))),$ (12)
$q_\psi(\mathbf{c}|\mathbf{F}) = \mathcal{N}(\mathbf{c}; \boldsymbol{\mu}_\psi(\mathbf{F}), \mathrm{diag}(\boldsymbol{\sigma}_\psi^2(\mathbf{F}))),$ (13)
$p_\eta(\mathbf{F}|\mathbf{c}) = \mathcal{N}(\mathbf{F}; \boldsymbol{\mu}_\eta(\mathbf{c}), \mathrm{diag}(\boldsymbol{\sigma}_\eta^2(\mathbf{c}))),$ (14)

where $\boldsymbol{\mu}_\phi(\mathbf{X})$ and $\boldsymbol{\sigma}_\phi^2(\mathbf{X})$ are the outputs of the utterance encoder network, $\boldsymbol{\mu}_\theta(\mathbf{Z}, \mathbf{c})$ and $\boldsymbol{\sigma}_\theta^2(\mathbf{Z}, \mathbf{c})$ are the outputs of the utterance decoder network, $\boldsymbol{\mu}_\psi(\mathbf{F})$ and $\boldsymbol{\sigma}_\psi^2(\mathbf{F})$ are the outputs of the face encoder network, and $\boldsymbol{\mu}_\eta(\mathbf{c})$ and $\boldsymbol{\sigma}_\eta^2(\mathbf{c})$ are the outputs of the face decoder network. We further assume $p(\mathbf{Z})$ and $p(\mathbf{c})$ to be standard Gaussian distributions, namely $p(\mathbf{Z}) = \mathcal{N}(\mathbf{Z}; \mathbf{0}, \mathbf{I})$ and $p(\mathbf{c}) = \mathcal{N}(\mathbf{c}; \mathbf{0}, \mathbf{I})$. It should be noted that we can use the same reparameterization trick as in 2.1 to compute the gradients of $J$ with respect to $\phi$ and $\psi$.
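As an illustration, the four terms of the criterion in (10) can be assembled as follows. This is a minimal NumPy sketch in which the encoder/decoder network outputs are passed in as plain arrays; the function names are hypothetical and do not come from the paper's implementation:

```python
import numpy as np

def gaussian_log_lik(x, mu, log_var):
    # Diagonal-Gaussian log-likelihood log N(x; mu, diag(exp(log_var))),
    # summed over all dimensions.
    return -0.5 * np.sum(log_var + np.log(2 * np.pi)
                         + (x - mu) ** 2 / np.exp(log_var))

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)].
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def joint_elbo(x, f, z_mu, z_lv, c_mu, c_lv, x_mu, x_lv, f_mu, f_lv):
    # Single-sample estimate of the bracketed quantity in (10):
    # speech reconstruction + face reconstruction - the two KL terms.
    return (gaussian_log_lik(x, x_mu, x_lv)      # E[log p(X|Z,c)]
            + gaussian_log_lik(f, f_mu, f_lv)    # E[log p(F|c)]
            - kl_to_standard_normal(z_mu, z_lv)  # KL(q(Z|X) || p(Z))
            - kl_to_standard_normal(c_mu, c_lv)) # KL(q(c|F) || p(c))
```

In training, the decoder outputs would be produced from reparameterized samples of the latents; here they are simply supplied as arguments to keep the decomposition visible.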
Since there are no explicit restrictions on the manner in which the utterance decoder may use the auxiliary input $\mathbf{c}$, we introduce an information-theoretic regularization term to assist the utterance decoder output in being correlated with $\mathbf{c}$ as far as possible. The mutual information for $\mathbf{c}$ and the decoder output $\hat{\mathbf{X}} \sim p_\theta(\hat{\mathbf{X}}|\mathbf{Z}, \mathbf{c})$, conditioned on $\mathbf{Z}$, can be written as

$I(\mathbf{c}; \hat{\mathbf{X}} | \mathbf{Z}) = \mathbb{E}_{\mathbf{c} \sim p(\mathbf{c}), \hat{\mathbf{X}} \sim p_\theta(\hat{\mathbf{X}}|\mathbf{Z}, \mathbf{c}), \mathbf{c}' \sim p(\mathbf{c}'|\hat{\mathbf{X}})}[\log p(\mathbf{c}'|\hat{\mathbf{X}})] + H(\mathbf{c}),$ (15)

where $H(\mathbf{c})$ represents the entropy of $\mathbf{c}$, which can be considered a constant term. In practice, $I(\mathbf{c}; \hat{\mathbf{X}}|\mathbf{Z})$ is hard to optimize directly since it requires access to the posterior $p(\mathbf{c}|\hat{\mathbf{X}})$. Fortunately, we can obtain a lower bound of the first term of $I(\mathbf{c}; \hat{\mathbf{X}}|\mathbf{Z})$ by introducing an auxiliary distribution $r(\mathbf{c}|\hat{\mathbf{X}})$:

$\mathbb{E}_{\mathbf{c} \sim p(\mathbf{c}), \hat{\mathbf{X}} \sim p_\theta(\hat{\mathbf{X}}|\mathbf{Z}, \mathbf{c}), \mathbf{c}' \sim p(\mathbf{c}'|\hat{\mathbf{X}})}[\log p(\mathbf{c}'|\hat{\mathbf{X}})] \ge \mathbb{E}_{\mathbf{c} \sim p(\mathbf{c}), \hat{\mathbf{X}} \sim p_\theta(\hat{\mathbf{X}}|\mathbf{Z}, \mathbf{c})}[\log r(\mathbf{c}|\hat{\mathbf{X}})].$ (16)

This technique of lower-bounding mutual information is called variational information maximization [24]. The equality in (16) holds when $r(\mathbf{c}|\hat{\mathbf{X}}) = p(\mathbf{c}|\hat{\mathbf{X}})$. Hence, maximizing the lower bound (16) with respect to $r$ corresponds to approximating $p(\mathbf{c}|\hat{\mathbf{X}})$ by $r(\mathbf{c}|\hat{\mathbf{X}})$ as well as approximating $I(\mathbf{c}; \hat{\mathbf{X}}|\mathbf{Z})$ by this lower bound. We can therefore indirectly increase $I(\mathbf{c}; \hat{\mathbf{X}}|\mathbf{Z})$ by increasing the lower bound alternately with respect to $p_\theta$ and $r$. One way to do this involves expressing $r(\mathbf{c}|\hat{\mathbf{X}})$ using an NN and training it along with all the other networks. Let us use $r_\omega(\mathbf{c}|\hat{\mathbf{X}})$ to indicate $r(\mathbf{c}|\hat{\mathbf{X}})$ expressed using an NN with parameter $\omega$. The role of $r_\omega(\mathbf{c}|\hat{\mathbf{X}})$ (hereafter, the voice encoder) is to recover time-independent information about the voice characteristics of $\hat{\mathbf{X}}$. For example, we can assume $r_\omega(\mathbf{c}|\hat{\mathbf{X}})$ to be a Gaussian distribution

$r_\omega(\mathbf{c}|\hat{\mathbf{X}}) = \mathcal{N}(\mathbf{c}; \boldsymbol{\mu}_\omega(\hat{\mathbf{X}}), \mathrm{diag}(\boldsymbol{\sigma}_\omega^2(\hat{\mathbf{X}}))),$ (17)

where $\boldsymbol{\mu}_\omega(\hat{\mathbf{X}})$ and $\boldsymbol{\sigma}_\omega^2(\hat{\mathbf{X}})$ are the outputs of the voice encoder network. Under this assumption, (16) becomes a negative weighted squared error between $\boldsymbol{\mu}_\omega(\hat{\mathbf{X}})$ and $\mathbf{c}$. Thus, maximizing (16) corresponds to forcing the outputs of the face and voice encoders to be as consistent as possible. Hence, the regularization term that we would like to maximize with respect to $\phi$, $\theta$, and $\omega$ becomes

$I(\phi, \theta, \omega) = \mathbb{E}_{(\mathbf{X}, \mathbf{F})}\big[\mathbb{E}_{\mathbf{Z} \sim q_\phi(\mathbf{Z}|\mathbf{X}), \mathbf{c} \sim q_\psi(\mathbf{c}|\mathbf{F}), \hat{\mathbf{X}} \sim p_\theta(\hat{\mathbf{X}}|\mathbf{Z}, \mathbf{c})}[\log r_\omega(\mathbf{c}|\hat{\mathbf{X}})]\big],$ (18)

where $\mathbb{E}_{(\mathbf{X}, \mathbf{F})}[\cdot]$ denotes the sample mean over the training examples. Here, it should be noted that to compute $I(\phi, \theta, \omega)$, we must sample $\mathbf{Z}$ from $q_\phi(\mathbf{Z}|\mathbf{X})$, $\mathbf{c}$ from $q_\psi(\mathbf{c}|\mathbf{F})$, and $\hat{\mathbf{X}}$ from $p_\theta(\hat{\mathbf{X}}|\mathbf{Z}, \mathbf{c})$. Fortunately, we can use the same reparameterization trick as in 2.1 to compute the gradients of $I(\phi, \theta, \omega)$ with respect to $\phi$, $\theta$, and $\omega$.
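Under the Gaussian assumption (17), the bound being maximized reduces (up to constants) to a negative weighted squared error between the face code and the voice-encoder mean, as this small NumPy check illustrates; the numeric values are arbitrary:

```python
import numpy as np

def log_r(c, mu_omega, log_var_omega):
    # Log of a diagonal-Gaussian auxiliary distribution r(c | x_hat):
    # up to constants, a negative weighted squared error between the
    # face code c and the voice-encoder mean mu_omega.
    return -0.5 * np.sum(log_var_omega + np.log(2 * np.pi)
                         + (c - mu_omega) ** 2 / np.exp(log_var_omega))

c = np.array([1.0, -2.0])  # face-encoder code (hypothetical values)
close = log_r(c, np.array([1.1, -1.9]), np.zeros(2))  # consistent encoders
far = log_r(c, np.array([-1.0, 2.0]), np.zeros(2))    # inconsistent encoders
```

A voice-encoder output near the face code scores higher, so maximizing this term drives the two encoders toward agreement.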
Overall, the training criterion to be maximized becomes

$J(\phi, \psi, \theta, \eta) + \lambda I(\phi, \theta, \omega),$ (19)

where $\lambda \ge 0$ is a weight parameter.
Fig. 2 shows the overview of the proposed model.
2.3 Generation processes
Given the acoustic feature sequence $\mathbf{X}$ of input speech and a target face image $\mathbf{F}$, $\mathbf{X}$ can be converted via

$\hat{\mathbf{X}} = \boldsymbol{\mu}_\theta(\boldsymbol{\mu}_\phi(\mathbf{X}), \boldsymbol{\mu}_\psi(\mathbf{F})).$ (20)

A time-domain signal can then be generated using an appropriate vocoder. We can also generate a face image corresponding to the input speech via

$\hat{\mathbf{F}} = \boldsymbol{\mu}_\eta(\boldsymbol{\mu}_\omega(\mathbf{X})).$ (21)
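The two generation processes can be sketched end to end as below. The networks here are toy stand-ins with made-up shapes (36 feature dimensions, a 32×32 face), intended only to show which module feeds which in (20) and (21), not to reproduce the paper's trained models:

```python
import numpy as np

# Toy stand-in networks (identity-like maps); all shapes are hypothetical.
def utterance_encoder(x):    return x * 0.5                # mu_phi(X) -> Z
def face_encoder(f):         return f.mean(axis=(0, 1))    # mu_psi(F) -> code c
def utterance_decoder(z, c): return z + c.mean()           # mu_theta(Z, c) -> X_hat
def voice_encoder(x):        return np.full(4, x.mean())   # mu_omega(X) -> code c
def face_decoder(c):         return np.tile(c, (8, 8, 1))[..., :3]  # mu_eta(c) -> F_hat

x = np.ones((36, 100))           # acoustic feature sequence (D x T)
f_target = np.ones((32, 32, 3))  # target face image

# (20) cross-modal VC: encode the utterance, decode with the target face code.
x_converted = utterance_decoder(utterance_encoder(x), face_encoder(f_target))

# (21) face generation: voice-encode the speech, then decode a face image.
f_predicted = face_decoder(voice_encoder(x))
```

The converted features would then be passed to a vocoder to synthesize a waveform.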
2.4 Network architectures
Utterance encoder/decoder: As detailed in Fig. 3, the utterance encoder/decoder networks are designed using fully convolutional architectures with gated linear units (GLUs) [25]. The output of the GLU block used in the present model is defined as $\mathbf{h}' = \mathrm{BN}_1(\mathbf{W} * \mathbf{h}) \odot \mathrm{sigmoid}(\mathrm{BN}_2(\mathbf{V} * \mathbf{h}))$, where $\mathbf{h}$ is the layer input, $\mathbf{W} *$ and $\mathbf{V} *$ denote convolution layers, $\mathrm{BN}_1$ and $\mathrm{BN}_2$ denote batch normalization layers, and $\mathrm{sigmoid}$ denotes a sigmoid gate function. We used 2D convolutions to design the convolution layers in the encoder and decoder, where $\mathbf{X}$ is treated as an image of size $D \times T$ with 1 channel ($D$ and $T$ being the feature dimension and the number of frames).
Face encoder/decoder: The face encoder/decoder networks are designed using architectures inspired by those introduced in [26] for conditional image generation.
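The GLU block described above can be sketched as follows; this is a minimal NumPy version in which 1×1 convolutions and inference-style batch normalization (no learned scale/shift) stand in for the real layers, with arbitrary weight shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(h, eps=1e-5):
    # Per-channel normalization over the spatial axes (inference-style).
    mean = h.mean(axis=(1, 2), keepdims=True)
    var = h.var(axis=(1, 2), keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def conv2d_1x1(h, w):
    # 1x1 convolution expressed as a channel mix: (c,h,w) x (o,c) -> (o,h,w).
    return np.einsum('chw,oc->ohw', h, w)

def glu_block(h, w1, w2):
    # GLU: BN(conv1(h)) elementwise-gated by sigmoid(BN(conv2(h))).
    return batch_norm(conv2d_1x1(h, w1)) * sigmoid(batch_norm(conv2d_1x1(h, w2)))

x = rng.standard_normal((1, 36, 100))  # acoustic features as a 1-channel image
w1 = rng.standard_normal((8, 1))       # linear-path weights (8 output channels)
w2 = rng.standard_normal((8, 1))       # gate-path weights
y = glu_block(x, w1, w2)
```

The sigmoid gate lets each output channel attenuate itself based on the input, which is what makes GLUs effective as a data-driven activation in these fully convolutional stacks.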
Voice encoder: As with the utterance encoder/decoder, the voice encoder is designed using a fully convolutional architecture with GLUs. As shown in Fig. 3, the voice encoder is designed to produce a time sequence of the means (and variances) of latent vectors. Here, we expect each of these latent vectors to represent information about the voice characteristics of the input speech within a different time region, which must be time-independent. One way of implementing (17) would be to add a pooling layer after the final layer so that the network produces the time average of the latent vectors. However, rather than the time average of these values, we would like each of these values to be as close to $\mathbf{c}$ as possible. Hence, here we choose to implement (17) by treating the regression target as a version of the latent code $\mathbf{c}$ generated from the face encoder, broadcast along the time axis, so that the voice encoder output and target arrays have compatible shapes.

3 Experiments
To evaluate the proposed method, we created a virtual dataset consisting of speech and face pairs by combining the Voice Conversion Challenge 2018 (VCC2018) [27] and Large-scale CelebFaces Attributes (CelebA) [28] datasets. First, we divided the speech data in the VCC2018 dataset and the face image data in the CelebA dataset into training and test sets. For each set, we segmented the speech and face image data according to gender (male/female) and age (young/aged) attributes. We then treated each pair consisting of a speech signal and a face image randomly selected from groups with the same attributes as virtually paired data. This means that the correlation between each speech and face image pair was artificial. Despite this, we believe that testing with this dataset can still provide useful insight into the ability of the present method to capture and leverage the underlying correlation to convert speech or to generate images in a cross-modal manner.
All the face images were downsampled to 32×32 pixels, and all the speech signals were sampled at 22,050 Hz. For each utterance, a spectral envelope, a logarithmic fundamental frequency (log $F_0$), and aperiodicities (APs) were extracted every 5 ms using the WORLD analyzer [29, 30]. 36 mel-cepstral coefficients (MCCs) were then extracted from each spectral envelope using the Speech Processing Toolkit (SPTK) [31]. The aperiodicities were used directly without modification. The signals of the converted speech were obtained from the converted acoustic feature sequences using the WORLD synthesizer.
We implemented two baseline methods for comparison, both of which assume the availability of the gender and age attribute labels assigned to each data sample. One is a naive method that simply adjusts the mean and variance of the feature vectors of the input speech for each feature dimension so that they match those of the training examples with the same attributes as the input speech. We refer to this method as “Baseline 1”. The other is a two-stage method, which performs face attribute detection followed by attribute-conditioned VC. For the face attribute detector, we used the same architecture as the face encoder described in Fig. 3, with the only difference being that we added a softmax layer after the final layer so that the network produced the probabilities of the input face image being “male” and “young”. We trained this network using gender/age attribute labels. For the attribute-conditioned VC, we used ACVAE-VC [14], also trained using gender/age attribute labels. We refer to this method as “Baseline 2”.
We conducted ABX tests to compare how well the voice of the speech generated by each method matched the face image input, where “A” and “B” were converted speech samples obtained with the proposed and baseline methods and “X” was the face image used for the auxiliary input. In these listening tests, “A” and “B” were presented in random order to eliminate bias due to the order of stimuli. Eleven listeners participated in our listening tests. Each listener was presented with 30 “A”, “B”, “X” triplets and was asked to select “A”, “B”, or “fair” by evaluating which of the two better matched “X”. The results are shown in Fig. 4. As the results reveal, the proposed method significantly outperformed Baseline 1 and performed comparably to Baseline 2. It is particularly noteworthy that the performance of the proposed method was comparable to that of Baseline 2 even though the baseline methods had the advantage of using the attribute labels. Audio examples are provided at [32].
Fig. 5 shows several examples of the face images predicted by the proposed method from female and male speech. As can be seen from these examples, the gender and age of the predicted face images are reasonably consistent with those of the input speech, demonstrating an interesting effect of the proposed method.
4 Conclusions
This paper described the first attempt to solve the cross-modal VC problem by introducing an extension of our previously proposed non-parallel VC method called ACVAE-VC. Through experiments using a virtual dataset combining the VCC2018 and CelebA datasets, we confirmed that our method could convert input speech into a voice that matches an auxiliary face image input and generate a face image that matches input speech reasonably well. We are also interested in developing a cross-modal text-to-speech system, in which the task is to synthesize speech from text with voice characteristics determined by an auxiliary face image input.
Acknowledgements: We thank Mr. Ken Shirakawa (Kyoto University) for his help in annotating the virtual corpus during his summer internship at NTT. This work was supported by JSPS KAKENHI 17H01763.
References
 [1] H. M. J. Smith, A. K. Dunn, T. Baguley, and P. C. Stacey, “Concordant cues in faces and voices: Testing the backup signal hypothesis,” Evolutionary Psychology, vol. 14, no. 1, pp. 1–10, 2016.
[2] A. Nagrani, S. Albanie, and A. Zisserman, “Seeing voices and hearing faces: Cross-modal biometric matching,” arXiv:1804.00326 [cs.CV], 2018.
[3] L. Chen, S. Srivastava, Z. Duan, and C. Xu, “Deep cross-modal audio-visual generation,” arXiv:1704.08292 [cs.CV], 2017.
[4] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg, “Visual to sound: Generating natural sound for videos in the wild,” arXiv:1712.01393 [cs.CV], 2018.
[5] W.-L. Hao, Z. Zhang, and H. Guan, “CMCGAN: A uniform framework for cross-modal visual-audio mutual generation,” in Proc. AAAI, 2018.
 [6] Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Trans. SAP, vol. 6, no. 2, pp. 131–142, 1998.

[7] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Trans. ASLP, vol. 15, no. 8, pp. 2222–2235, 2007.
[8] K. Kobayashi and T. Toda, “sprocket: Open-source voice conversion software,” in Proc. Odyssey, 2018, pp. 203–210.
[9] F.-L. Xie, F. K. Soong, and H. Li, “A KL divergence and DNN-based approach to voice conversion without parallel training sentences,” in Proc. Interspeech, 2016, pp. 287–291.
[10] T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, “Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation,” in Proc. ICASSP, 2017, pp. 5535–5539.
[11] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational autoencoder,” in Proc. APSIPA, 2016.
[12] ——, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. Interspeech, 2017, pp. 3364–3368.
[13] Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,” in Proc. ICASSP, 2018, pp. 5274–5278.
[14] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder,” arXiv:1808.05092 [stat.ML], Aug. 2018.
[15] T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv:1711.11293 [stat.ML], Nov. 2017.
[16] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks,” arXiv:1806.02169 [cs.SD], Jun. 2018.
[17] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. ICLR, 2014.

[18] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-supervised learning with deep generative models,” in Adv. NIPS, 2014, pp. 3581–3589.
[19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Adv. NIPS, 2014, pp. 2672–2680.

[20] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2017, pp. 2223–2232.
[21] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” in Proc. ICML, 2017, pp. 1857–1865.
[22] Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” in Proc. ICCV, 2017, pp. 2849–2857.
[23] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” arXiv:1711.09020 [cs.CV], Nov. 2017.
 [24] D. Barber and F. V. Agakov, “The IM algorithm: A variational approach to information maximization,” in Proc. NIPS, 2003.
 [25] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. ICML, 2017, pp. 933–941.
 [26] X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2Image: Conditional image generation from visual attributes,” in Proc. ECCV, 2016.
[27] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” arXiv:1804.04262 [eess.AS], Apr. 2018.

[28] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proc. ICCV, 2015.
[29] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. E99-D, no. 7, pp. 1877–1884, 2016.
[30] https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder.
 [31] https://github.com/r9y9/pysptk.
 [32] http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/crossmodalvc/.