Voice Conversion (VC) is a technique used to change the perceived identity of a source speaker to that of a target speaker while keeping the linguistic information unchanged. It has many potential applications, such as generating new voices for TTS systems [Kain1998], dubbing in movies and videogames, speaking assistance [Kain2007, Nakamura2012] and speech enhancement [Inanoglu2009, Turk2010, Toda2012]. The technique consists of learning a mapping function between the voice features of two or more speakers. Many ways of obtaining this mapping have been explored: Gaussian mixture models (GMM) [Stylianou1998, Toda2007, Helander2010], restricted Boltzmann machines (RBM) [Chen2014, Nakashika2014], feed-forward neural networks (NN) [Sun2015], recurrent neural networks (RNN) [Desai2010, Mohammadi2014] and convolutional neural networks (CNN) [Kaneko2017].
The majority of these VC approaches make use of parallel speech data, that is, recordings of different speakers speaking the same utterances. When the recordings are not perfectly time aligned, these techniques require some form of automatic time alignment of the speech data between the different speakers, which is often difficult and not robust. Methods that don’t require parallel data have also been explored: some of them need a high degree of supervision [Dong2015, MengZhang2008], requiring transcripts of every speech recording and lacking the ability to accurately capture non-verbal information. Methods involving Generative Adversarial Networks [Goodfellow2014] have also been proposed [Kaneko2018, Kameoka2018, Kaneko2019]: while often producing realistic results, they only allow the conversion of speech samples with a fixed or maximum length.
We propose a voice conversion method that doesn’t rely on parallel speech recordings or on other kinds of supervision and is able to convert samples of arbitrary length. It consists of a Generative Adversarial Network architecture made of a single generator and a single discriminator. The generator takes high definition spectrograms of speaker A as input and converts them to spectrograms of speaker B. A siamese network is used to maintain linguistic information during the conversion by the generator. An identity loss is also used to strengthen the linguistic connection between the source and generated samples. We are able to translate spectrograms with a time axis that is arbitrarily long. To accomplish this, we split spectrograms along the time axis, feed the resulting samples to the generator, concatenate pairs of generated samples along the time axis and feed them to the discriminator. This allows us to obtain a generated concatenated spectrogram that doesn’t present any discontinuities at the concatenated edges.
We finally show that the same technique previously described can also translate a music sample of one genre to another genre, proving that the algorithm is flexible enough to perform different kinds of audio style transfer.
2 Related Work
Generative Adversarial Networks (GANs) [Goodfellow2014] have been widely used in the context of image generation and image-to-image translation [Radford2016, Karras2018, Brock2019, Isola2017, Zhu2017, Liu2017, Liu2019]. Applying GAN architectures designed for images to other kinds of data, such as audio, is possible and has been explored before. [Donahue2018a] shows that generating audio with GANs is possible, using a convolutional architecture on waveform data and spectrogram data. [Kaneko2018, Kameoka2018, Kaneko2019] propose different GAN architectures to perform voice conversion by translating other features (MCEPs, log $F_0$, APs) instead of spectrograms. We choose a single generator and discriminator architecture as explained in [Amodio2019], where a siamese network is also used to preserve content information in the translation. The proposed TraVeL loss aims at making the generator preserve vector arithmetic in the latent space produced by the siamese network: the transformation vector between two images (or spectrograms) of the source domain (or speaker) must be the same as the transformation vector between the same images converted by the generator to the target domain. In this way the network doesn’t rely on pixel-wise differences (cycle-consistency constraint) and proves to be more flexible at translating between domains with substantially different low-level pixel features. This is particularly effective for the conversion of speech spectrograms, which can differ visibly from speaker to speaker. Furthermore, by not relying on pixel-wise constraints, we are also able to work with audio data other than speech, such as music, translating between totally different music genres.
Given audio samples in the form of spectrograms from a source domain (speaker, music genre, etc…), our goal is to generate realistic audio samples of the target domain while keeping content (linguistic information in the case of voice translation) from the original samples (see Fig. 1).
3.1 Spectrogram Splitting and Concatenation
Let $A$ be the source domain and $B$ the target domain. $a \in A$ and $b \in B$ are the spectrogram representations of the audio samples in the training dataset, each with shape $M \times t$, where $M$ represents the height of the spectrogram (mel channels in the case of mel-spectrograms) and where $t$, the time axis, varies from sample to sample. In order to be able to translate spectrograms with a time axis of arbitrary length, we extract from the training spectrograms the crops $a^1$ and $b^1$, each with shape $M \times L$, where $L$ is a constant. We then split each $a^1$ along the time axis, obtaining $a^{1/2}_1$ and $a^{1/2}_2$ with shape $M \times \frac{L}{2}$. Translating each pair $(a^{1/2}_1, a^{1/2}_2)$ with a generator $G$ results in pairs $(\hat{b}^{1/2}_1, \hat{b}^{1/2}_2)$: concatenating them together along the time axis results in $\hat{b}^1$ with shape $M \times L$, where $\hat{b}^1 = [\hat{b}^{1/2}_1, \hat{b}^{1/2}_2]$. We finally feed the real samples $b^1$ and the generated and concatenated samples $\hat{b}^1$ to a discriminator $D$. With this technique the generator is forced to generate realistic samples with no discontinuities at their edges along the time axis, so that when concatenated with adjacent spectrogram samples the final spectrograms look realistic to the discriminator. After training, when translating a spectrogram, we first split it into sequential $M \times \frac{L}{2}$ samples (we use padding if $t$ is not a multiple of $\frac{L}{2}$), feed them to the generator and concatenate them back together into the original shape.
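The inference-time procedure described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the function names, the chunk width of 32 frames, the 192 x 250 example shape and the choice of padding with the minimum value of the spectrogram are all assumptions made for the example, and `identity_generator` is only a stand-in for a trained generator.

```python
import numpy as np

def translate_spectrogram(spec, generator, chunk_width):
    """Split `spec` (mel_channels x time) into fixed-width chunks, translate, rejoin."""
    n_mels, n_frames = spec.shape
    # Pad the time axis on the right so that its length is a multiple of chunk_width.
    # Padding with the minimum value (roughly "silence") is an assumption of this sketch.
    pad = (-n_frames) % chunk_width
    padded = np.pad(spec, ((0, 0), (0, pad)), mode="constant",
                    constant_values=spec.min())
    # Sequential, non-overlapping chunks along the time axis.
    chunks = np.split(padded, padded.shape[1] // chunk_width, axis=1)
    translated = [generator(chunk) for chunk in chunks]
    # Concatenate along the time axis and drop the padded frames.
    return np.concatenate(translated, axis=1)[:, :n_frames]

def identity_generator(chunk):
    # Hypothetical stand-in for a trained generator network.
    return chunk

spec = np.random.default_rng(0).uniform(-1.0, 1.0, size=(192, 250))
out = translate_spectrogram(spec, identity_generator, chunk_width=32)
```

With a real generator, `chunk_width` would correspond to the width of the samples the generator was trained on.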
3.2 Adversarial Loss
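MelGAN-VC relies on a generator $G$ and a discriminator $D$. $G$ is a mapping function from the distribution $A$ to the distribution $B$. We call $\hat{B}$ the generated distribution. $D$ distinguishes between the real $B$ and the generated $\hat{B}$. Given training samples $a^1$ and $b^1$ with shape $M \times L$, the generator must learn the mapping function. Notice that with $G_{\text{cat}}(a^1)$ we consider the function that takes the spectrogram $a^1$ with shape $M \times L$ as input, splits it along the time axis into $(a^{1/2}_1, a^{1/2}_2)$ with shape $M \times \frac{L}{2}$, feeds each one of the two samples to $G$ and concatenates the outputs to obtain a final $M \times L$ spectrogram. An adversarial loss is used: we notice that the hinge loss [Zhang2018] performs well for this task. Thus we use the following adversarial losses for $D$ and $G$
$$\mathcal{L}_{D,\text{adv}} = -\mathbb{E}_{b^1}\left[\min\left(0, -1 + D(b^1)\right)\right] - \mathbb{E}_{a^1}\left[\min\left(0, -1 - D(G_{\text{cat}}(a^1))\right)\right] \tag{1}$$
$$\mathcal{L}_{G,\text{adv}} = -\mathbb{E}_{a^1}\, D(G_{\text{cat}}(a^1)) \tag{2}$$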
The discriminator $D$ iteratively learns how to distinguish real samples of distribution $B$ from generated samples of distribution $\hat{B}$, while the generator $G$ iteratively learns how to improve its mapping to increase the loss of $D$. In this way $G$ generates the distribution $\hat{B}$ as similar to $B$ as possible, achieving realism in the generated samples.
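As an illustration of the adversarial objective, the hinge formulation can be written in a few lines. The sketch below assumes the commonly used form of the GAN hinge loss associated with [Zhang2018]; the function names and the example discriminator scores are hypothetical.

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    # The discriminator is pushed to score real samples above +1
    # and generated samples below -1; scores beyond the margin add no loss.
    real_term = np.mean(np.maximum(0.0, 1.0 - d_real))
    fake_term = np.mean(np.maximum(0.0, 1.0 + d_fake))
    return float(real_term + fake_term)

def g_hinge_loss(d_fake):
    # The generator tries to raise the discriminator score of its outputs.
    return float(-np.mean(d_fake))

d_real = np.array([1.5, 0.2])   # hypothetical scores on real spectrograms
d_fake = np.array([-2.0, 0.5])  # hypothetical scores on generated spectrograms
d_loss = d_hinge_loss(d_real, d_fake)
g_loss = g_hinge_loss(d_fake)
```

Note that `np.maximum(0, 1 - x)` is the same quantity as `-min(0, -1 + x)`, which is how the hinge terms are often written.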
3.3 TraVeL Loss
Originally introduced in [Amodio2019], the TraVeL loss (Transformation Vector Learning loss) aims at keeping transformation vectors between encodings of pairs of samples $[(a_i, a_j), (\hat{b}_i, \hat{b}_j)]$ equal, where $\hat{b} = G(a)$ are the generated samples with shape $M \times \frac{L}{2}$. This allows the generator to preserve content in the translation without relying on pixel-wise losses such as the cycle-consistency constraint [Zhu2017], which makes the translation between complex and heterogeneous domains substantially more difficult. We define a transformation vector between $(x_i, x_j)$ as
$$t(x_i, x_j) = x_j - x_i \tag{3}$$
We use a cooperative siamese network $S$ to encode samples in a semantic latent space and formulate a loss to preserve vector arithmetic in the space such that
$$t(S(a_i), S(a_j)) = t(S(\hat{b}_i), S(\hat{b}_j)) \tag{4}$$
where $S(x)$ is the output vector of the siamese network $S$ for an input sample $x$. Thus the loss is the following
$$\mathcal{L}_{(G,S),\text{TraVeL}} = \mathbb{E}_{(a_1, a_2) \sim A}\left[-\operatorname{cos\_sim}(t_{12}, t'_{12}) + \lVert t_{12} - t'_{12} \rVert_2^2\right], \quad t_{12} = t(S(a_1), S(a_2)), \quad t'_{12} = t(S(G(a_1)), S(G(a_2))) \tag{5}$$
We consider both cosine similarity and euclidean distance so that both angle and magnitude of transformation vectors must be preserved in latent space. The TraVeL loss is minimized by both $G$ and $S$: the two networks must ’cooperate’ to satisfy the loss requirement. $S$ can learn a trivial function to satisfy (5), thus we add the standard siamese margin-based contrastive loss [Melekhov2016, Ong2017] to eliminate this possibility.
$$\mathcal{L}_{S,\text{margin}} = \mathbb{E}_{(a_1, a_2) \sim A}\, \max\left(0, \delta - \lVert t_{12} \rVert_2\right) \tag{6}$$
where $\delta$ is a fixed value. With this constraint, $S$ encodes samples such that in latent space each encoding must be at least $\delta$ apart from every other encoding, saving the network from collapsing into a trivial function.
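To make the two siamese-network constraints concrete, here is a small NumPy sketch operating directly on encoding vectors. It is an illustration under stated assumptions, not the reference implementation: the cosine term is taken with a negative sign so that minimizing the loss aligns the two transformation vectors, the function names and the `eps` stabilizer are hypothetical, and the toy encodings stand in for real siamese outputs.

```python
import numpy as np

def travel_loss(s_a1, s_a2, s_g1, s_g2, eps=1e-8):
    """TraVeL-style loss on batches of siamese encodings with shape (batch, dim)."""
    t = s_a2 - s_a1        # transformation vectors between source encodings
    t_hat = s_g2 - s_g1    # transformation vectors between translated encodings
    # Cosine similarity constrains the angle between the two vectors.
    cos = np.sum(t * t_hat, axis=-1) / (
        np.linalg.norm(t, axis=-1) * np.linalg.norm(t_hat, axis=-1) + eps)
    # Squared euclidean distance constrains their magnitude as well.
    sq_dist = np.sum((t - t_hat) ** 2, axis=-1)
    return float(np.mean(-cos + sq_dist))

def margin_loss(s_a1, s_a2, delta):
    """Penalize pairs of encodings that are closer than `delta` in latent space."""
    dist = np.linalg.norm(s_a2 - s_a1, axis=-1)
    return float(np.mean(np.maximum(0.0, delta - dist)))

# Toy encodings: the translated pair has the same transformation vector (3, 4).
s_a1, s_a2 = np.array([[0.0, 0.0]]), np.array([[3.0, 4.0]])
s_g1, s_g2 = np.array([[1.0, 1.0]]), np.array([[4.0, 5.0]])
loss_travel = travel_loss(s_a1, s_a2, s_g1, s_g2)
loss_margin = margin_loss(s_a1, s_a2, delta=10.0)
```

In the toy example the translated encodings are shifted but keep the same transformation vector, so the TraVeL term sits at its minimum, while the margin term is positive because the two source encodings are only 5 units apart.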
3.4 Identity Mapping
We notice that when training the system for a voice conversion task with the constraints explained above, while the generated voices sound realistic, some linguistic information is lost during the translation process. This is to be expected given the reconstruction flexibility of the generator under the TraVeL constraint. We extract samples of shape $M \times \frac{L}{2}$ from original spectrograms in domain $B$ and we adopt an identity mapping [Taigman2017, Zhu2017] to solve this issue.
The identity mapping constraint isn’t necessary when training for audio style transfer tasks that are different from voice conversion, as there is no linguistic information to be preserved in the translation.
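A rough sketch of an identity mapping constraint is given below: the generator is fed samples that already belong to the target domain and is penalized for changing them. The use of a mean squared error as the distance, the function name and the toy generators are assumptions of this example.

```python
import numpy as np

def identity_loss(generator, b_batch):
    # Target-domain samples should pass through the generator unchanged;
    # mean squared error is an assumed choice of distance.
    return float(np.mean((generator(b_batch) - b_batch) ** 2))

b_batch = np.zeros((2, 192, 16))  # hypothetical batch of target-domain samples
loss_unchanged = identity_loss(lambda x: x, b_batch)
loss_shifted = identity_loss(lambda x: x + 0.5, b_batch)
```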
3.5 MelGAN-VC Loss
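The final losses for $D$, $G$ and $S$ are the following
$$\mathcal{L}_D = \mathcal{L}_{D,\text{adv}}$$
$$\mathcal{L}_G = \mathcal{L}_{G,\text{adv}} + \alpha\, \mathcal{L}_{G,\text{id}} + \beta\, \mathcal{L}_{(G,S),\text{TraVeL}}$$
$$\mathcal{L}_S = \beta\, \mathcal{L}_{(G,S),\text{TraVeL}} + \gamma\, \mathcal{L}_{S,\text{margin}}$$
While $\mathcal{L}_{(G,S),\text{TraVeL}}$ aims at making the generator preserve low-level content information without relying on pixel-wise constraints, $\mathcal{L}_{G,\text{id}}$ influences the generator to preserve high-level features. Tweaking the weight constant $\alpha$ allows us to balance the two content-preserving constraints. A high value of $\alpha$ will result in generated samples that have a high-level structure similar to that of the source samples, but that generally resemble the style of the target samples less closely and are thus less realistic. On the other hand, eliminating the identity mapping component from the loss ($\alpha = 0$) will generally result in more realistic translated samples whose structure is less similar to that of the source ones.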
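The way the individual terms are balanced can be illustrated with a short sketch. The weight names `alpha` (identity), `beta` (TraVeL) and `gamma` (margin), the assignment of terms to the generator and siamese objectives, and all numeric values below are assumptions chosen for the example, not values reported here.

```python
def generator_loss(adv, travel, identity, alpha, beta):
    # Adversarial term plus the two content-preserving terms.
    return adv + alpha * identity + beta * travel

def siamese_loss(travel, margin, beta, gamma):
    # The siamese network cooperates on the TraVeL term and is
    # kept from collapsing by the margin term.
    return beta * travel + gamma * margin

# Hypothetical per-batch loss values and weights.
with_identity = generator_loss(adv=1.0, travel=0.5, identity=2.0, alpha=1.0, beta=10.0)
without_identity = generator_loss(adv=1.0, travel=0.5, identity=2.0, alpha=0.0, beta=10.0)
siamese_total = siamese_loss(travel=0.5, margin=0.2, beta=10.0, gamma=10.0)
```

Setting `alpha` to zero removes the identity term entirely, which corresponds to the configuration described for audio style transfer tasks other than voice conversion.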
4 Implementation Details
We use fully convolutional architectures for the generator ($G$), discriminator ($D$, PatchGAN discriminator [Isola2017]), and siamese ($S$) networks. $S$ outputs a vector of length . For our experiments we choose . $G$ relies on a u-net architecture, with convolutions for downscaling and sub-pixel convolutions [Shi2016] for upscaling, to eliminate the possibility of checkerboard artifacts of transposed convolutions. Following recent trends in GAN research [Miyato2018, Zhang2018], each convolutional filter of both $G$ and $D$ is spectrally normalized, as this greatly improves training stability. Batch normalization is used in and . After experimenting with different loss weight values, we choose , , when training for voice conversion tasks, while we eliminate the identity mapping constraint () for any other kind of audio style transfer. During training, we choose , Adam [Kingma2015] as the optimizer, as the learning rate for and as the learning rate for and [Heusel2017, Zhang2018], while we update multiple times for each and update. We use audio files with a sampling rate of 16 kHz. We extract spectrograms in the mel scale with log-scaled amplitudes (Fig. 2), normalizing them between -1 and 1 to match the output of the $\tanh$ activation of the generator. The following hyperparameters are used: , , , . We notice that a higher value of $L$ allows the network to model longer range dependencies, while increasing the computational cost. To invert the mel-spectrograms back into waveform audio, the traditional Griffin-Lim algorithm [Griffin1983] is used, which, thanks to the high dimensionality of the spectrograms, doesn’t result in a significant loss in quality.
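The normalization step can be sketched as a simple linear rescaling of log-amplitude values. The decibel range of -100 to 0 dB, the clipping behaviour and the function names are assumptions of this example; the only property taken from the text is that spectrogram values end up between -1 and 1.

```python
import numpy as np

def normalize_db(spec_db, min_db=-100.0, max_db=0.0):
    # Clip to the assumed dynamic range, then map [min_db, max_db] to [-1, 1].
    clipped = np.clip(spec_db, min_db, max_db)
    return 2.0 * (clipped - min_db) / (max_db - min_db) - 1.0

def denormalize_db(spec_norm, min_db=-100.0, max_db=0.0):
    # Inverse mapping, applied to generator outputs before waveform inversion.
    clipped = np.clip(spec_norm, -1.0, 1.0)
    return (clipped + 1.0) / 2.0 * (max_db - min_db) + min_db

spec_db = np.array([[-100.0, -50.0, 0.0],
                    [-120.0, -25.0, 10.0]])  # hypothetical log-amplitude values in dB
spec_norm = normalize_db(spec_db)
spec_back = denormalize_db(spec_norm)
```

Values outside the assumed range are clipped, so the round trip only recovers the original spectrogram up to that clipping.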
We experiment with the ARCTIC dataset (http://www.festvox.org/cmu_arctic/) for voice conversion tasks. We perform intra-gender and inter-gender voice translation. In both cases MelGAN-VC produces realistic results with clearly understandable linguistic information that is preserved in the translation. We also extract audio from a number of online videos featuring speeches of Donald Trump. The extracted audio data is noisier and more heterogeneous than the speech data from the ARCTIC dataset, as the Donald Trump speeches were recorded in many different real-world conditions. Training MelGAN-VC for voice translation using the real-world noisy data as source or target predictably results in noisier translated speech with less understandable linguistic information, but the final generated voices are overall realistic. We finally experiment with the GTZAN dataset (http://marsyas.info/downloads/datasets.html), which contains 30-second samples of musical pieces belonging to multiple genres. We train MelGAN-VC to perform genre conversion between different musical genres (Fig. 3). After training with and without the identity mapping constraint we conclude that it is not necessary for this task, where high-level information is less important, and we decide not to use it during the rest of our experiments in genre conversion, as this greatly reduces computational costs. If it is used, however, we notice that the translated music samples have a stronger resemblance to the source ones, and in some applications this result could be preferred. Translated samples of speech and music are available at https://youtu.be/3BN577LK62Y.
We proposed a method to perform voice translation and other kinds of audio style transfer that doesn’t rely on parallel data and is able to translate samples of arbitrary length. The generator-discriminator architecture and the adversarial constraint result in highly realistic generated samples, while the TraVeL loss proves to be an effective constraint for preserving content in the translation without relying on cycle-consistency. We conducted experiments and showed the flexibility of our method on substantially different tasks. We believe it is important to discuss the possibility of misuse of our technique, especially given the level of realism achievable by it and by other methods. While applications such as music genre conversion don’t appear to present dangerous uses, voice conversion can easily be misused to create fake audio data for political or personal reasons. It is crucial to also invest resources into developing methods to recognize fake audio data.