The performance of neural text-to-speech (TTS) models has improved dramatically in terms of speech quality over the past few years. While most innovations in the TTS field have centered on enhancing the quality of single-speaker and multi-speaker models trained with a pre-defined speaker set, deployment in field applications requires TTS systems with various capabilities. One popular demand is instant speaker adaptation to build a personalized TTS system. Personalized TTS aims to analyze and control the underlying speech factors to imitate the user's voice characteristics. However, these factors are not rigorously defined in a scientific manner and are generally known to be entangled, which makes it difficult to control each component. In personalized TTS, the main objective is to adapt to a new speaker's voice characteristics with limited data. To meet this demand, zero-shot multi-speaker TTS (ZSM-TTS), a sub-branch of research under the umbrella of speaker adaptation, has recently gained enormous attention from researchers. ZSM-TTS seeks to train a multi-speaker TTS model that can generate a speech sample in the voice of a new speaker who was not present in the training dataset, given only a reference utterance and without further fine-tuning the model.
Some previous works have tackled ZSM-TTS by attaching a pre-trained speaker encoder from speaker verification to existing TTS models. Meanwhile, an effective style modeling method was proposed in which a bank of style vectors and their weights are learned in an unsupervised manner. Another line of work exploits a meta-learning approach, utilizing an episodic training scheme with phoneme and style discriminators. However, although these approaches focus on improving the speaker embedding extraction, the conditioning scheme, and the training method, they disregard the fact that the number of speakers in current TTS training datasets is far smaller than the total population, which is not sufficient to learn the entire speaker space. This directly results in poor generalization to unseen speakers at inference, leading to unsatisfactory performance.
The main challenge of ZSM-TTS is the speaker domain shift problem, which arises when a speaker outside of the training dataset must be synthesized properly. It is commonly known that there is a strong bias towards the speakers from the training dataset. In order to overcome this challenge, we propose to train a TTS model with adversarial speaker-consistency learning (ASCL). The ASCL scheme generates an additional speech sample using a query speaker obtained from external untranscribed audio datasets, which are readily available from various sources. The generated sample is then used for adversarial training with a newly proposed speaker-consistency discriminator. Training in this way exposes the model to a larger speaker pool than the limited training dataset, hence inducing better speaker generalization for ZSM-TTS.
The proposed method directly addresses the aforementioned speaker domain shift problem by expanding the speaker pool for ZSM-TTS. The ASCL scheme is built on the architecture of variational inference TTS (VITS) and the inverse transformation capability of its normalizing flow module. We demonstrate the effectiveness of ASCL by comparing it with a baseline using subjective and objective scores. Our results show that the proposed method outperforms the baseline in terms of speech quality and speaker similarity.
Our contributions are twofold:
We propose adversarial speaker-consistency learning (ASCL), a novel way to train a TTS model to address the speaker domain shift problem.
The proposed method leverages external untranscribed audio datasets, which are available from various sources, to expose the TTS model to a larger speaker pool.
2.1 VITS for multi-speaker TTS
Our method extends the original VITS, leveraging the invertibility of its normalizing flow module. The VITS architecture is composed of a posterior encoder, a prior encoder, a decoder, a duration predictor, and a discriminator. The prior encoder is further divided into a text encoder and a normalizing flow which transforms the text-conditional prior distribution into a more complex distribution.
VITS combines the variational autoencoder formulation with an adversarial training scheme to generate audio in a phoneme-to-wav fashion without separate vocoder training. VITS is trained to maximize the variational lower bound of the conditional log-likelihood:

log p_θ(y|x) ≥ E_{q_φ(z|y)} [ log p_θ(y|z) − log ( q_φ(z|y) / p_θ(z|x) ) ],

where x, y, and z denote the input phoneme sequence, the target waveform, and the latent acoustic embedding sequence, respectively. The essential part of the VITS training strategy is the distribution matching between the posterior distribution, q_φ(z|y), and the prior distribution, p_θ(z|x), via a Kullback-Leibler divergence loss. The alignment between the text and audio sequences is learned via the monotonic alignment search (MAS) algorithm originally proposed in Glow-TTS.
In the multi-speaker setting, a speaker embedding vector g is conditioned on the normalizing flow, so that the text-conditional prior embedding sequence is trained to become speaker-independent via the forward transformation at training. At inference, VITS generates speech from a speaker embedding vector g and the text-conditional prior embedding sequence z_f, exploiting the inverse transformation of the normalizing flow as follows: ~y = Dec_θ(f^{-1}_θ(z_f, g)), where Dec_θ denotes the decoder.
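The invertibility that this inference path relies on can be illustrated with a toy affine coupling layer conditioned on a speaker vector. This is a minimal numpy sketch, not the actual VITS flow (which stacks several such layers with WaveNet-style conditioning); the scale/shift functions here are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy affine coupling layer: half of z passes through unchanged, and the
# other half is scaled/shifted by functions of the untouched half and the
# speaker embedding g. Because z1 is preserved, the transform is invertible.
def coupling_forward(z, g):
    z1, z2 = np.split(z, 2)
    s = np.tanh(z1 + g.mean())   # placeholder scale network
    t = z1 * g.mean()            # placeholder shift network
    return np.concatenate([z1, z2 * np.exp(s) + t])

def coupling_inverse(y, g):
    z1, y2 = np.split(y, 2)
    s = np.tanh(z1 + g.mean())   # recomputed exactly from the preserved half
    t = z1 * g.mean()
    return np.concatenate([z1, (y2 - t) * np.exp(-s)])

z = rng.normal(size=8)   # stands in for a prior embedding sequence
g = rng.normal(size=4)   # stands in for a speaker embedding vector
recon = coupling_inverse(coupling_forward(z, g), g)
print(np.allclose(recon, z))  # the flow is exactly invertible
```

The same speaker vector must be supplied to both directions; swapping it between the forward and inverse pass is exactly the mechanism the proposed method exploits in Section 3.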
3 Proposed method
Our method builds on top of the original VITS architecture and training methodology. The proposed method resolves the speaker domain shift problem with a two-stage approach. First, we generate a speech sample of a query speaker at each training iteration utilizing external untranscribed audio datasets, so that the TTS model is exposed to an extensively large speaker set. Second, a speaker-consistency discriminator is introduced to determine, in an adversarial manner, whether the generated speech sample follows the speaker identity of the given speaker embedding vector.
3.1 Speech generation from a query speaker
At each training iteration, we randomly draw a text and waveform pair (x_s, y_s) from a TTS dataset as a support set and another waveform y_q from an external untranscribed audio dataset as a query, forming an input tuple (x_s, y_s, y_q), where the speaker sets of the TTS dataset and the untranscribed dataset are disjoint. Our goal is to generate an output pair (~y_s, ~y_q) which is evaluated by two different objectives. At training, the posterior encoder first receives y_s as an input and outputs a latent acoustic embedding sequence z. z then goes through the decoder to generate ~y_s in an autoencoding fashion. On the other hand, we can also generate ~y_q, which contains the content of y_s and the speaker identity of y_q, by exploiting the forward and inverse transformations of the flow module with the speaker embedding vectors g_s and g_q extracted from y_s and y_q, respectively, as follows: ~y_q = Dec_θ(f^{-1}_θ(f_θ(z, g_s), g_q)).
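The order of operations in this equation can be sketched with stand-in functions: an additive map replaces the speaker-conditional flow and tanh replaces the decoder. All names and the toy transforms are illustrative assumptions, not the paper's modules.

```python
import numpy as np

# Toy sketch of the ASCL sample path ~y_q = Dec(f^{-1}(f(z, g_s), g_q)).
def flow_forward(z, g):   # f_theta: strip the support speaker's identity
    return z - g
def flow_inverse(z, g):   # f_theta^{-1}: inject the query speaker's identity
    return z + g
def decoder(z):           # stand-in for Dec_theta
    return np.tanh(z)

z = np.array([0.5, -0.2, 0.1])      # posterior latent of the support utterance
g_s = np.array([0.3, 0.3, 0.3])     # support speaker embedding
g_q = np.array([-0.3, -0.3, -0.3])  # query speaker embedding

z_text = flow_forward(z, g_s)            # speaker-independent prior embedding
y_q = decoder(flow_inverse(z_text, g_q))  # content of y_s, identity of y_q
print(y_q.shape)  # (3,)
```

The key design point is that no transcript of the query waveform is ever needed: the content comes from the support pair, and only a speaker embedding is extracted from the query audio.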
The motivation behind using external untranscribed audio data is to extensively expand the speaker set of the training data, since only a relatively small number of speakers are available within TTS datasets due to the requirement of high-quality paired text and audio. By utilizing untranscribed datasets, which are easily accessible from various sources including but not limited to speaker verification and automatic speech recognition datasets, the number of speakers we gain becomes enormously larger than that of TTS datasets. This exposes the TTS model to a more diverse speaker pool at training and enhances the overall speaker generalization.
3.2 Adversarial learning
Since the ground truth for ~y_q does not exist, the reconstruction loss of VITS cannot be imposed on it. In order to circumvent this problem, an adversarial learning technique is employed to evaluate the generated output pair (~y_s, ~y_q) and the corresponding speaker embedding vectors (g_s, g_q). We introduce a speaker-consistency discriminator, D_sc, which takes a waveform and a speaker embedding vector as an input pair. D_sc tries to determine whether the input waveform is consistent with the given speaker embedding in terms of speaker identity. Utilizing the generated outputs and ground truth samples, D_sc distinguishes the real pairs (y_s, g_s) from the generated pairs (~y_q, g_q). Following the least-squares GAN formulation, the adversarial objectives for the speaker-consistency discriminator and the generator are:

L_adv(D_sc) = E[ (D_sc(y_s, g_s) − 1)^2 + D_sc(~y_q, g_q)^2 ],
L_adv(G) = E[ (D_sc(~y_q, g_q) − 1)^2 ],

where we set the target labels for real and generated pairs to 1 and 0, respectively.
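A least-squares adversarial objective of this kind can be sketched in a few lines; the scalar discriminator scores below are placeholders for whatever `D_sc` outputs on (waveform, embedding) pairs.

```python
import numpy as np

# Hypothetical least-squares objectives for a speaker-consistency
# discriminator: real pairs are pushed toward label 1, generated pairs
# toward 0, and the generator tries to make generated pairs score 1.
def d_loss(score_real, score_fake):
    return float(np.mean((score_real - 1.0) ** 2) + np.mean(score_fake ** 2))

def g_loss(score_fake):
    return float(np.mean((score_fake - 1.0) ** 2))

# A perfect discriminator (real -> 1, fake -> 0) drives its loss to zero,
# while a fully fooled discriminator (fake -> 1) zeroes the generator loss.
print(d_loss(np.ones(4), np.zeros(4)))  # 0.0
print(g_loss(np.ones(4)))               # 0.0
```

One practical consequence of the squared-error form is that gradients stay informative even for confidently classified samples, which is the usual motivation for preferring it over the saturating cross-entropy GAN loss.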
To prevent ASCL from affecting the training of unrelated modules such as the posterior encoder, text encoder, and duration predictor, we apply a stop-gradient operator to z to restrain the back-propagation of the gradient flow. The training pipeline of ASCL built on VITS (ASCL-VITS) is shown in Fig. 1.
3.3 Speaker-consistency discriminator
We modify the original multi-scale discriminator (MSD) architecture proposed in MelGAN, adding a speaker embedding vector as a conditional input. Unlike the original MSD, where the discriminator is a mixture of three sub-discriminators that take audio segments at three different scales, our speaker-consistency discriminator operates on the raw audio scale without any sub-discriminators. Each convolutional layer performs a downsampling operation with a kernel size of 4, which captures features from the smoothed waveform at different scales. A speaker embedding vector is added to the input of each convolutional layer to evaluate the speaker consistency at each downsampling stage. A post 1-D convolutional layer with a kernel size of 3 is added at the end of the stack to produce the output. The detailed block diagram of D_sc is shown in Fig. 2.
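The layer pattern above can be sketched in plain numpy. This is a single-channel toy with random weights: the layer count, channel widths, and the way the speaker embedding is injected (via its mean, standing in for a learned projection) are all illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Strided 1-D convolution (valid padding) for a single channel.
def conv1d(x, w, stride):
    k = len(w)
    return np.array([np.dot(x[i:i + k], w)
                     for i in range(0, len(x) - k + 1, stride)])

# Toy speaker-consistency discriminator: a stack of stride-2 convolutions
# with kernel size 4, the speaker embedding added before every layer, and a
# kernel-3 post convolution reduced to a scalar score.
def d_sc(wave, g, n_layers=3):
    h = wave
    for _ in range(n_layers):
        h = h + g.mean()                                   # speaker conditioning
        h = np.tanh(conv1d(h, rng.normal(size=4) * 0.1, stride=2))
    return float(conv1d(h, rng.normal(size=3) * 0.1, stride=1).mean())

score = d_sc(rng.normal(size=1024), rng.normal(size=256))
print(np.isfinite(score))  # a single real-valued consistency score
```

Re-injecting the embedding at every downsampling stage, rather than only at the input, is what lets the discriminator judge speaker consistency at multiple temporal resolutions.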
4 Experiments and results
4.1 Implementation details
The proposed method adds on to the original VITS using the official implementation code (https://github.com/jaywalnut310/vits) with a few modifications.
4.1.1 Speaker encoder module
In order to extract the speaker embedding vectors of a large set of speakers, a pre-trained speaker encoder from the speaker verification task is employed. We used the Fast ResNet-34 model with contrastive equilibrium learning (CEL) trained on VoxCeleb2; its official implementation is available at github.com/msh9184/contrastive-equilibrium-learning. We trained the speaker encoder on 80-dimensional log mel-spectrogram inputs to extract a 512-dimensional speaker embedding vector. A linear layer which reduces the embedding dimension to 256, followed by a ReLU activation and a linear projection, is added to the pre-trained speaker encoder to obtain the final speaker embedding vector. The weights of the pre-trained speaker encoder are frozen throughout the training and inference of ASCL-VITS to consistently draw the speaker embedding vectors from a single learned speaker embedding space.
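The projection head described above can be sketched as follows; the weights are random placeholders for learned parameters, and the 256-dimensional output of the final projection is an assumption (the paper specifies 256 only for the first linear layer).

```python
import numpy as np

rng = np.random.default_rng(0)

# Head on top of the frozen speaker encoder:
# 512 -> 256 linear, ReLU, then a final linear projection (assumed 256-dim).
W1, b1 = rng.normal(size=(256, 512)) * 0.01, np.zeros(256)
W2, b2 = rng.normal(size=(256, 256)) * 0.01, np.zeros(256)

def project(e512):
    h = np.maximum(W1 @ e512 + b1, 0.0)  # linear + ReLU
    return W2 @ h + b2                   # final linear projection

g = project(rng.normal(size=512))  # final speaker embedding vector
print(g.shape)  # (256,)
```

Only this head would be trained with the TTS model; keeping the encoder itself frozen is what guarantees that support- and query-speaker embeddings live in the same learned space.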
4.1.2 VITS speaker conditioning
While conditioning the speaker embedding vector on all submodules is known to enhance overall TTS performance, we restrict our VITS model to conditioning the speaker embedding on the normalizing flow module and the duration predictor in order to evaluate the proposed method effectively. Moreover, since duration is related to the speech rate of the reference given at inference as well as the identity of the speaker, we train a separate reference encoder to extract an utterance-level reference embedding vector to condition the duration predictor. The reference encoder follows the same architecture as that of the global style token model, with an output dimension of 256.
4.2 Experimental settings and datasets
4.2.1 Datasets
For ZSM-TTS evaluation, we used the VCTK dataset, which consists of 108 speakers and approximately 44 hours of recorded speech. We selected 11 speakers as an in-domain test set, following prior work, and used the remaining 97 speakers as the paired training dataset. We combined the audio data of LibriTTS-clean-100, LibriTTS-clean-360, and LibriTTS-other-500 from the LibriTTS dataset to build an untranscribed dataset for training. The combined dataset contains 2,311 speakers with approximately 554 hours of speech recordings. The LibriTTS-test-clean dataset, which is composed of 39 speakers with 9.56 hours of speech recordings, is used for the out-of-domain evaluation. We downsampled all audio to 24 kHz for training and inference. We extract linear spectrograms with a 1024-point Fast Fourier Transform (FFT), a hop size of 256, and a window size of 1024. Then, an 80-dimensional mel-scale filterbank is applied to convert the linear spectrograms to mel-spectrograms.
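The feature extraction parameters above can be sketched with a self-contained numpy STFT and a triangular mel filterbank; real pipelines would use a library such as librosa or torchaudio, and the filterbank construction here is a standard textbook recipe rather than the paper's exact code.

```python
import numpy as np

SR, N_FFT, HOP, WIN, N_MELS = 24000, 1024, 256, 1024, 80

def stft_mag(y):
    """Magnitude spectrogram: Hann window, hop 256, 1024-point FFT."""
    window = np.hanning(WIN)
    frames = [y[i:i + WIN] * window for i in range(0, len(y) - WIN + 1, HOP)]
    return np.abs(np.fft.rfft(frames, n=N_FFT, axis=1)).T  # (513, T)

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank():
    """80 triangular filters equally spaced on the mel scale up to SR/2."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(SR / 2), N_MELS + 2)
    bins = np.floor((N_FFT + 1) * mel_to_hz(mels) / SR).astype(int)
    fb = np.zeros((N_MELS, N_FFT // 2 + 1))
    for i in range(N_MELS):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l: fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c: fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

y = np.random.default_rng(0).normal(size=SR)  # 1 s of noise as a stand-in
S = stft_mag(y)                 # linear spectrogram, 513 frequency bins
M = mel_filterbank() @ S        # mel-spectrogram, 80 bands
print(S.shape, M.shape)
```

With these settings, one second of 24 kHz audio yields 90 frames, each with 513 linear-frequency bins collapsed to 80 mel bands.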
4.2.2 Experimental setup
We first built a VITS model with the proposed ASCL scheme, referred to as ASCL-VITS. To demonstrate the effectiveness of ASCL, we trained a vanilla VITS model with the aforementioned pre-trained speaker encoder, referred to as Multi-speaker VITS. This baseline has the same architecture as ASCL-VITS but without ASCL.
All models were trained on the VCTK training set of 97 speakers. To evaluate the baseline and proposed models, we conducted experiments in two settings. First, we generated speech samples for the remaining 11 VCTK speakers to evaluate on in-domain unseen speakers. Second, we randomly drew 20 speakers from the LibriTTS test-clean dataset and generated speech samples to evaluate on out-of-domain unseen speakers. We randomly selected 25 samples from each model for each measure and experimental setup.
4.2.3 Evaluation method
To evaluate the effectiveness of ASCL, we first conducted a subjective test to measure the quality of the generated speech samples from each model with a mean opinion score (MOS). We also compared the generated speech samples from each model in terms of speaker similarity by measuring a similarity mean opinion score (SMOS). Both measures are on a 5-point scale ranging from 1 to 5. The participants were asked to evaluate the quality of samples in terms of intelligibility and naturalness for the MOS test, and to rate the speaker similarity between the given speech sample and the ground truth sample for the SMOS test. For the evaluation, 14 participants rated the MOS and SMOS tests on VCTK and LibriTTS unseen speakers. The resulting scores are presented with 95% confidence intervals in Table 1.
Along with the above subjective tests, we also conducted an objective test which measures the cosine similarity between the two speaker embedding vectors extracted from the generated sample and the ground truth sample, respectively. This score, known as speaker embedding cosine similarity (SECS), ranges from -1 to 1; the closer the score is to 1, the closer the two speaker embedding vectors are in terms of speaker similarity. We used the publicly available SpeechBrain toolkit (https://speechbrain.github.io/), which includes a speaker encoder trained on a speaker verification task.
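The SECS metric itself is a one-line cosine similarity; the vectors below are placeholders for embeddings a speaker encoder such as SpeechBrain's would produce.

```python
import numpy as np

def secs(e_gen, e_gt):
    """Speaker embedding cosine similarity in [-1, 1]."""
    return float(np.dot(e_gen, e_gt) /
                 (np.linalg.norm(e_gen) * np.linalg.norm(e_gt)))

e = np.array([0.6, 0.8])   # placeholder speaker embedding (unit norm)
print(secs(e, e))          # 1.0  (identical embeddings)
print(secs(e, -e))         # -1.0 (opposite embeddings)
```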
As presented in Tables 1 and 2, our proposed method, ASCL-VITS, demonstrates its effectiveness with higher scores in MOS, SMOS, and SECS on both VCTK and LibriTTS unseen speakers. One notable observation is that Multi-speaker VITS produced multiple samples in which the speaker identity switched mid-utterance in the LibriTTS unseen-speaker experiment; we did not observe this phenomenon in samples generated by ASCL-VITS. This shows that our method is more robust in terms of speaker generalization in out-of-domain cases compared to the baseline. The overall results show that ASCL is an effective and suitable method for enhancing the speaker generalization of a TTS model for the ZSM-TTS task.
In this paper, adversarial speaker-consistency learning (ASCL) was proposed to mitigate the speaker domain shift problem of ZSM-TTS. We first generate speech of a query speaker drawn from the extensive speaker set of untranscribed datasets, and the model learns to generate samples of unseen speakers via adversarial learning with the speaker-consistency discriminator. The ASCL-applied VITS model outperforms the baseline on ZSM-TTS in both in-domain and out-of-domain experimental settings in terms of quality and speaker similarity. The experimental results demonstrate that ASCL is effective in synthesizing high-quality speech samples given a reference audio from an unseen speaker. For future work, we will further develop ASCL by extending it to other speech factors, such as emotion, prosody, and dialect, utilizing generative models for more diverse zero-shot multi-factor text-to-speech scenarios.
This work was supported by the Institute of Information & Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-00456, Development of Ultra-high Speech Quality Technology for Remote Multi-speaker Conference System).
-  J. Shen et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Calgary, AB, Canada, 15–20 April 2018, pp. 4779–4783.
-  A. van den Oord et al., "WaveNet: A generative model for raw audio," 2016, arXiv:1609.03499.
-  R. Skerry-Ryan et al., "Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron," in Proc. Int. Conf. Mach. Learn., 2018, pp. 4693–4702.
-  E. Cooper et al., "Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Barcelona, Spain, 4–8 May 2020, pp. 6184–6188.
-  E. Casanova et al., "SC-GlowTTS: An efficient zero-shot multi-speaker text-to-speech model," 2021, arXiv:2104.05557.
-  E. Casanova, J. Weber, C. Shulby, A. Junior, E. Gölge, and M. Ponti, "YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone," 2022, arXiv:2112.02418.
-  Y. Wang et al., "Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis," in Proc. Int. Conf. Mach. Learn., Stockholm, Sweden, 10–15 July 2018, pp. 5180–5189.
-  D. Min, D. Lee, E. Yang, and S. Hwang, "Meta-StyleSpeech: Multi-speaker adaptive text-to-speech generation," in Proc. Int. Conf. Mach. Learn., 2021, pp. 7748–7759.
-  J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proc. Int. Conf. Mach. Learn., 2021, pp. 5530–5540.
-  J. Kim, S. Kim, J. Kong, and S. Yoon, "Glow-TTS: A generative flow for text-to-speech via monotonic alignment search," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 8067–8077.
-  K. Kumar et al., "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Proc. Adv. Neural Inf. Process. Syst., 2019.
-  B. Xu, N. Wang, T. Chen, and M. Li, "Empirical evaluation of rectified activations in convolutional network," 2015, arXiv:1505.00853.
-  S. Mun, W. Kang, M. Han, and N. Kim, "Unsupervised representation learning for speaker recognition via contrastive equilibrium learning," 2020, arXiv:2010.11433.
-  J. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in Proc. Interspeech, Hyderabad, India, 2–6 September 2018, pp. 1086–1090.
-  J. Yamagishi, C. Veaux, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," 2019.
-  H. Zen et al., "LibriTTS: A corpus derived from LibriSpeech for text-to-speech," in Proc. Interspeech, Graz, Austria, 16–19 September 2019, pp. 1526–1530.
-  Y. Wang et al., "Tacotron: Towards end-to-end speech synthesis," in Proc. Interspeech, Stockholm, Sweden, 20–24 August 2017, pp. 4006–4010.
-  H. Heo, B. Lee, J. Huh, and J. Chung, "Clova baseline system for the VoxCeleb speaker recognition challenge 2020," 2020, arXiv:2009.14153.
-  M. Ravanelli et al., "SpeechBrain: A general-purpose speech toolkit," 2021, arXiv:2106.04624.
-  D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Representations, 2015.
-  D. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learn. Representations, 2014.
-  L. Dinh, D. Krueger, and Y. Bengio, "NICE: Non-linear independent components estimation," 2015, arXiv:1410.8516.
-  L. Dinh, J. Sohl-Dickstein, and S. Bengio, "Density estimation using real NVP," in Proc. Int. Conf. Learn. Representations, 2017.
-  J. Donahue, S. Dieleman, M. Binkowski, E. Elsen, and K. Simonyan, "End-to-end adversarial text-to-speech," 2020, arXiv:2006.03575.
-  J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," in Proc. Adv. Neural Inf. Process. Syst., vol. 33, 2020, pp. 17022–17033.
-  S. Lee, H. Yoon, H. Noh, J. Kim, and S. Lee, "Multi-SpectroGAN: High-diversity and high-fidelity spectrogram generation with adversarial style combination for speech synthesis," in Proc. AAAI Conf. Artif. Intell., 2021, pp. 13198–13206.
-  N. Kumar, S. Goel, A. Narang, and B. Lall, "Normalization driven zero-shot multi-speaker speech synthesis," in Proc. Interspeech, Brno, Czechia, 30 August – 3 September 2021, pp. 1354–1358.
-  X. Mao et al., "Least squares generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vision, Venice, Italy, 2017, pp. 2813–2821.
-  I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014.
-  B. Tong et al., "Adversarial zero-shot learning with semantic augmentation," in Proc. AAAI Conf. Artif. Intell., 2018, pp. 2476–2483.
-  R. Gao et al., "Zero-VAE-GAN: Generating unseen features for generalized and transductive zero-shot learning," IEEE Trans. Image Process., vol. 29, 2020, pp. 3665–3680.
-  A. Grover, M. Dhar, and S. Ermon, "Flow-GAN: Combining maximum likelihood and adversarial learning in generative models," in Proc. AAAI Conf. Artif. Intell., 2018, pp. 3069–3076.
-  A. Grover, C. Chute, R. Shu, Z. Cao, and S. Ermon, "AlignFlow: Cycle consistent learning from multiple domains via normalizing flows," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 4028–4035.