
Adversarial Speaker-Consistency Learning Using Untranscribed Speech Data for Zero-Shot Multi-Speaker Text-to-Speech

Several recently proposed text-to-speech (TTS) models have achieved human-level quality in single-speaker and multi-speaker TTS scenarios with a set of pre-defined speakers. However, synthesizing a new speaker's voice from a single reference audio, commonly known as zero-shot multi-speaker text-to-speech (ZSM-TTS), remains a very challenging task. The main challenge of ZSM-TTS is the speaker domain shift problem that arises when generating speech for a new speaker. To mitigate this problem, we propose adversarial speaker-consistency learning (ASCL). The proposed method first generates additional speech of a query speaker using external untranscribed datasets at each training iteration. The model then learns to consistently generate a speech sample of the same speaker as the corresponding speaker embedding vector by employing an adversarial learning scheme. The experimental results show that the proposed method is effective compared to the baseline in terms of quality and speaker similarity in ZSM-TTS.





1 Introduction

The performance of neural text-to-speech (TTS) models has dramatically improved in terms of speech quality in recent years. While most innovations in the TTS field have centered on enhancing the quality of single-speaker and multi-speaker models trained with a pre-defined speaker set, deployment in field applications requires TTS systems with broader capabilities. One popular demand is instant speaker adaptation to build a personalized TTS system. Personalized TTS aims to analyze and control the underlying speech factors to imitate the user’s voice characteristics. Nonetheless, these factors are not rigorously defined in a scientific manner and are generally known to be entangled, which makes it difficult to control each component. In personalized TTS, the main objective is to adapt to a new speaker’s voice characteristics with the limited data available. To realize such a demand, zero-shot multi-speaker TTS (ZSM-TTS), a research sub-branch under the umbrella of speaker adaptation, has recently gained enormous attention from researchers. ZSM-TTS seeks to train a multi-speaker TTS model that, given a reference utterance, can generate a speech sample in the voice of a new speaker not present in the training dataset, without further fine-tuning the model.

Some previous works have tackled ZSM-TTS by attaching a pre-trained speaker encoder from speaker verification to existing TTS models [3], [4], [5], [6]. Meanwhile, an effective style modeling method was proposed in [7], where a bank of style vectors and their weights are learned in an unsupervised manner. On the other hand, [8] exploits a meta-learning approach by utilizing an episodic training scheme with phoneme and style discriminators. Although these approaches focus on improving speaker embedding extraction, the conditioning scheme, and the training method, they disregard the fact that the number of speakers in current TTS training datasets is far smaller than the total population, which is not sufficient to learn the entire speaker space. This directly results in poor generalization to unseen speakers at inference, leading to unsatisfactory performance.

The main challenge of ZSM-TTS is the speaker domain shift problem, which arises when a speaker outside of the training dataset must be synthesized properly; it is commonly known that there is a strong bias towards the speakers in the training dataset. To overcome this challenge, we propose to train a TTS model with adversarial speaker-consistency learning (ASCL). The ASCL scheme generates an additional speech sample for a query speaker obtained from external untranscribed audio datasets, which are readily available from various sources. The generated sample is then used for adversarial training with a newly proposed speaker-consistency discriminator. Training in this way exposes the model to a larger speaker pool than the limited training dataset, inducing better speaker generalization for ZSM-TTS.

The proposed method directly addresses the aforementioned speaker domain shift problem by expanding the speaker pool for ZSM-TTS. The ASCL scheme is built on the architecture of variational inference TTS (VITS) [9] and the inverse transformation capability of its normalizing flow module. We demonstrate the effectiveness of ASCL by comparing it with the baseline using subjective and objective scores. Our results show that the proposed method outperforms the baseline in terms of speech quality and speaker similarity.

Our contributions are twofold:

  1. We propose adversarial speaker-consistency learning (ASCL), a novel way to train a TTS model to address the speaker domain shift problem.

  2. The proposed method leverages external untranscribed audio datasets, which are available from various sources, to expose the TTS model to a larger speaker pool.

2 Background

2.1 VITS for multi-speaker TTS

Our method extends the original VITS [9], leveraging the invertibility of its normalizing flow module. The VITS architecture is composed of a posterior encoder, a prior encoder, a decoder, a duration predictor, and a discriminator. The prior encoder is further divided into a text encoder and a normalizing flow f_θ, which transforms the text-conditional prior distribution into a more complex distribution. VITS combines the variational autoencoder formulation with an adversarial training scheme to successfully generate audio in a phoneme-to-wav fashion without a separate vocoder training. VITS is trained to maximize the variational lower bound of the conditional log-likelihood:

log p_θ(y|x) ≥ E_{q_φ(z|y)} [ log p_θ(y|z) − log ( q_φ(z|y) / p_θ(z|x) ) ],

where x, y, and z denote the input phoneme sequence, the target waveform, and the latent acoustic embedding sequence, respectively. The essential part of the VITS training strategy is the distribution matching between the posterior distribution q_φ(z|y) and the prior distribution p_θ(z|x) via the Kullback-Leibler divergence loss. The alignment between the text and audio sequences is learned via the monotonic alignment search (MAS) algorithm originally proposed in Glow-TTS.

In the multi-speaker setting, a speaker embedding vector g is conditioned on the normalizing flow, where the text-conditional prior embedding sequence is trained to become speaker-independent via the forward transformation at training. At inference, VITS generates speech from a speaker embedding vector g and the text-conditional prior embedding sequence z_f by exploiting the inverse transformation of the normalizing flow as follows: ~y = Dec_θ(f^-1_θ(z_f, g)), where Dec_θ denotes the decoder.
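
As a concrete illustration of this invertibility, the following toy sketch (our own example, not the paper's model) implements a speaker-conditioned affine coupling layer, the standard building block of normalizing flows like the one in VITS, and checks that the inverse transformation exactly recovers the input:

```python
import numpy as np

rng = np.random.default_rng(0)

def coupling_forward(z, g, W, b):
    """Affine coupling: keep the first half of z, transform the second half
    conditioned on the first half and the speaker embedding g."""
    z1, z2 = np.split(z, 2)
    h = np.tanh(W @ np.concatenate([z1, g]) + b)  # conditioner network
    scale, shift = np.split(h, 2)
    return np.concatenate([z1, z2 * np.exp(scale) + shift])

def coupling_inverse(zf, g, W, b):
    z1, y2 = np.split(zf, 2)
    h = np.tanh(W @ np.concatenate([z1, g]) + b)  # same conditioner as forward
    scale, shift = np.split(h, 2)
    return np.concatenate([z1, (y2 - shift) * np.exp(-scale)])

dim, gdim = 8, 4                               # illustrative dimensions
W = rng.normal(size=(dim, dim // 2 + gdim))
b = rng.normal(size=dim)
z = rng.normal(size=dim)                       # latent sequence element
g = rng.normal(size=gdim)                      # speaker embedding

zf = coupling_forward(z, g, W, b)
z_rec = coupling_inverse(zf, g, W, b)
assert np.allclose(z, z_rec)  # exact invertibility, the property VITS relies on
```

The same conditioner is evaluated in both directions, so the inverse is exact regardless of the conditioner's complexity; this is what allows VITS to run the flow forward at training and backward at inference.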

3 Proposed method

Our method builds on top of the original VITS architecture and training methodology. The proposed method resolves the speaker domain shift problem with a two-stage approach. Firstly, we generate a speech sample of a query speaker at each training iteration utilizing the external untranscribed audio datasets such that the TTS model is exposed to an extensively large speaker set. Secondly, a speaker-consistency discriminator is introduced to determine if the generated speech sample follows the speaker identity of the given speaker embedding vector in an adversarial manner.

3.1 Speech generation from a query speaker

At each training iteration, we randomly draw a text and waveform pair (x_s, y_s) from a TTS dataset as a support set and another waveform y_q from an external untranscribed audio dataset as a query, forming an input tuple (x_s, y_s, y_q), where the speaker sets of the TTS dataset and the untranscribed dataset are disjoint. Our goal is to generate an output pair (~y_s, ~y_q), which is evaluated by two different objectives. At training, the posterior encoder first receives y_s as an input and outputs a latent acoustic embedding sequence z_s. z_s then goes through the decoder to generate ~y_s in an autoencoding fashion. On the other hand, we can also generate ~y_q, which contains the content of y_s and the speaker identity of y_q, by exploiting the forward and inverse transformations of the flow module with the speaker embedding vectors g_s and g_q extracted from y_s and y_q, respectively, as follows: ~y_q = Dec_θ(f^-1_θ(f_θ(z_s, g_s), g_q)).
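
The forward-then-inverse chain above can be sketched with a toy speaker-conditioned flow; `flow_forward`, `flow_inverse`, and `decoder` below are illustrative stand-ins, not the actual VITS modules:

```python
import numpy as np

def flow_forward(z, g):
    # elementwise affine flow whose parameters depend on the speaker embedding g
    return z * np.exp(g) + g

def flow_inverse(zf, g):
    return (zf - g) * np.exp(-g)

def decoder(z):
    return np.tanh(z)  # placeholder for the waveform decoder

rng = np.random.default_rng(1)
z_s = rng.normal(size=16)          # latent acoustic sequence of the support utterance
g_s = rng.normal(size=16) * 0.1    # support-speaker embedding (broadcast per-dim here)
g_q = rng.normal(size=16) * 0.1    # query-speaker embedding

z_f = flow_forward(z_s, g_s)       # forward pass: strip the support speaker's identity
y_q = decoder(flow_inverse(z_f, g_q))  # inverse pass: impose the query speaker's identity

# sanity check: inverting with the *same* speaker recovers z_s exactly
assert np.allclose(flow_inverse(z_f, g_s), z_s)
```

The forward transformation maps the speaker-dependent latent into the (ideally speaker-independent) prior space, and the inverse transformation with a different embedding re-specializes it to the query speaker.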

The motivation behind using external untranscribed audio data is to extensively expand the speaker set of the training data, since only a relatively small number of speakers are available within TTS datasets due to the requirement of high-quality text-audio pairing. By utilizing untranscribed datasets, which are easily accessible from various sources including but not limited to speaker verification and automatic speech recognition datasets, the number of speakers we gain becomes enormously larger than that of TTS datasets. This exposes the TTS model to a more diverse speaker pool at training and enhances overall speaker generalization.

3.2 Adversarial learning

Since the ground truth for ~y_q does not exist, the reconstruction loss of VITS cannot be imposed on it. To circumvent this problem, an adversarial learning technique is employed to evaluate the generated output pair (~y_s, ~y_q) together with the corresponding speaker embedding vectors (g_s, g_q). We introduce a speaker-consistency discriminator, D_sc, which takes a waveform and a speaker embedding vector as an input pair. D_sc tries to determine whether the input waveform is consistent with the given speaker embedding in terms of speaker identity. Utilizing the generated outputs and ground truth samples, D_sc distinguishes the real pairs (y_s, g_s) and (y_q, g_q) from the generated pairs (~y_s, g_s) and (~y_q, g_q). The adversarial objectives for the speaker-consistency discriminator and the generator are formulated as follows:

L_D = E[ (D_sc(y, g) − 1)^2 ] + E[ D_sc(~y, g)^2 ],
L_G = E[ (D_sc(~y, g) − 1)^2 ],

where (y, g) denotes a real pair and (~y, g) a generated pair with ~y ∈ {~y_s, ~y_q}; the generator in this case is the VITS architecture itself. The adversarial objectives follow LS-GAN [29] instead of the original GAN [30] loss for stable training.
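
A minimal sketch of these LS-GAN objectives, assuming the standard least-squares targets of 1 for real pairs and 0 for generated pairs:

```python
import numpy as np

# d_real / d_fake stand in for discriminator outputs D_sc(., .) on real and
# generated (waveform, speaker-embedding) pairs.

def d_loss(d_real, d_fake):
    # push real pairs toward 1, generated pairs toward 0
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def g_loss(d_fake):
    # the generator tries to make generated pairs score 1 ("real")
    return np.mean((d_fake - 1.0) ** 2)

d_real = np.array([0.9, 1.1, 0.8])
d_fake = np.array([0.2, 0.1, 0.3])
print(d_loss(d_real, d_fake), g_loss(d_fake))  # ≈ 0.0667 and ≈ 0.6467
```

Compared to the original GAN's log-loss, the least-squares form penalizes samples proportionally to their distance from the target label, which avoids vanishing gradients for confidently classified fakes.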

To prevent the ASCL branch from affecting the training of unrelated modules such as the posterior encoder, the text encoder, and the duration predictor, we use a stop-gradient operator to restrain the back-propagation of the gradient flow. The training pipeline of ASCL built on VITS (ASCL-VITS) is shown in Fig. 1.

Figure 1: The architecture of ASCL-VITS scheme and the training pipeline.
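
The stop-gradient operation can be illustrated with PyTorch's `detach`; the tensors below are stand-ins for the actual latents and ASCL-branch parameters, not the paper's code:

```python
import torch

z = torch.randn(4, requires_grad=True)  # stands in for a latent from the posterior encoder
w = torch.randn(4, requires_grad=True)  # stands in for ASCL-branch parameters

loss = (w * z.detach()).sum()           # ASCL-style loss computed on a detached latent
loss.backward()

assert z.grad is None                   # no gradient reaches the posterior-encoder side
assert torch.equal(w.grad, z.detach())  # the ASCL branch itself still receives gradients
```

Detaching at the branch point lets the adversarial loss shape only the flow and decoder path while leaving the reconstruction-trained modules untouched.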

3.3 Speaker-consistency discriminator

We modify the original multi-scale discriminator (MSD) architecture proposed in MelGAN [11] by adding a speaker embedding vector as a conditional input. Unlike the original MSD, which is a mixture of three sub-discriminators that take audio segments at three different scales, our speaker-consistency discriminator operates on the raw audio scale without any sub-discriminators.

D_sc is composed of a stack of six 1-D strided and grouped convolutional layers, each followed by a leaky ReLU activation [12]. Each convolutional layer performs a downsampling operation with a kernel size of 4, capturing features from the smoothed waveform at different scales. A speaker embedding vector is added to the input of each convolutional layer to evaluate speaker consistency at each downsampling stage. A post 1-D convolutional layer with a kernel size of 3 is added at the end of the stack to produce the output. The detailed block diagram of D_sc is shown in Fig. 2.

Figure 2: The block diagram of the speaker-consistency discriminator.
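
A PyTorch sketch of such a discriminator; the channel widths, strides, group counts, and the per-layer linear projection of the speaker embedding are our assumptions for details the text does not fully specify:

```python
import torch
import torch.nn as nn

class SpeakerConsistencyDiscriminator(nn.Module):
    """Six strided 1-D convolutions (kernel 4) with leaky ReLU, the speaker
    embedding added to each layer's input, and a final kernel-3 convolution."""

    def __init__(self, spk_dim=256, channels=(16, 64, 256, 256, 256, 256)):
        super().__init__()
        self.convs = nn.ModuleList()
        self.spk_proj = nn.ModuleList()
        in_ch = 1
        for out_ch in channels:
            groups = 4 if in_ch >= 4 else 1
            self.convs.append(nn.Conv1d(in_ch, out_ch, kernel_size=4,
                                        stride=2, padding=1, groups=groups))
            # project the speaker embedding so it can be added channel-wise
            self.spk_proj.append(nn.Linear(spk_dim, in_ch))
            in_ch = out_ch
        self.post = nn.Conv1d(in_ch, 1, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, wav, g):
        # wav: (B, 1, T) raw waveform; g: (B, spk_dim) speaker embedding
        x = wav
        for conv, proj in zip(self.convs, self.spk_proj):
            x = x + proj(g).unsqueeze(-1)  # add speaker embedding to the layer input
            x = self.act(conv(x))
        return self.post(x)                # per-frame consistency scores

d = SpeakerConsistencyDiscriminator()
scores = d(torch.randn(2, 1, 1024), torch.randn(2, 256))
print(scores.shape)  # torch.Size([2, 1, 16]): T is halved by each of the 6 layers
```

Injecting the (projected) speaker embedding before every downsampling stage lets the discriminator judge speaker consistency at multiple temporal resolutions within a single raw-audio discriminator.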

4 Experiments and results

4.1 Implementation details

The proposed method builds on the original VITS using the official implementation code with a few modifications.

4.1.1 Speaker encoder module

To extract speaker embedding vectors for a large set of speakers, a pre-trained speaker encoder from the speaker verification task is employed. We used the Fast ResNet-34 model with contrastive equilibrium learning (CEL) [13] trained on VoxCeleb2 [14], whose official implementation is publicly available. The speaker encoder takes 80-dimensional log mel-spectrograms as input and extracts a 512-dimensional speaker embedding vector. A linear layer that reduces the embedding dimension to 256, followed by a ReLU activation and a linear projection, is added on top of the pre-trained speaker encoder to obtain the final speaker embedding vector. The weights of the pre-trained speaker encoder are frozen throughout the training and inference of ASCL-VITS so that speaker embedding vectors are consistently drawn from a single learned speaker embedding space.
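
The projection head can be sketched as follows; the weight values are random stand-ins, and only the dimensions (512 to 256, then a 256-d projection) follow the text:

```python
import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(256, 512)) * 0.02, np.zeros(256)  # reduction layer
W2, b2 = rng.normal(size=(256, 256)) * 0.02, np.zeros(256)  # final projection

def project(e):
    h = np.maximum(W1 @ e + b1, 0.0)  # linear + ReLU
    return W2 @ h + b2                # linear projection

e = rng.normal(size=512)  # output of the frozen speaker encoder
g = project(e)
assert g.shape == (256,)  # final speaker embedding vector fed to the TTS model
```

Only this small head is trained with the TTS model; the encoder behind it stays frozen, so all speakers are embedded in one fixed space.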

4.1.2 VITS speaker conditioning

While conditioning the speaker embedding vector on all submodules is known to enhance overall TTS performance, we restrict our VITS model to conditioning the speaker embedding on the normalizing flow module and the duration predictor in order to evaluate the proposed method effectively. Moreover, since the duration is related to the speech rate of the reference given at inference as well as to the identity of the speaker, we train a separate reference encoder to extract an utterance-level reference embedding vector for conditioning the duration predictor. The reference encoder follows the same architecture as that of the global style token model [7], with an output dimension of 256.

4.2 Experimental settings and datasets

4.2.1 Datasets

For the ZSM-TTS evaluation, we used the VCTK dataset [15], which consists of 108 speakers and approximately 44 hours of speech recordings. We selected 11 speakers as an in-domain test set following [5] and [6], and used the remaining 97 speakers as the paired training dataset. We combined the audio data of LibriTTS-clean-100, LibriTTS-clean-360, and LibriTTS-other-500 from the LibriTTS dataset [16] to build an untranscribed dataset for training. The combined dataset contains 2,311 speakers with approximately 554 hours of speech recordings. The LibriTTS-test-clean dataset, which is composed of 39 speakers with 9.56 hours of speech recordings, is used for the out-of-domain evaluation. We downsampled the audio to 24 kHz for training and inference. We extract linear spectrograms with a 1024-point Fast Fourier Transform (FFT), a hop size of 256, and a window size of 1024. An 80-dimensional mel-scale filterbank is then applied to convert the linear spectrograms to mel-spectrograms.
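
A NumPy sketch of this front end, using the stated FFT, hop, and window sizes with a standard triangular mel filterbank (the windowing function, mel-scale variant, and normalization are our assumptions):

```python
import numpy as np

def stft_mag(y, n_fft=1024, hop=256, win=1024):
    """Magnitude spectrogram: frame, window, and take |rFFT| of each frame."""
    w = np.hanning(win)
    frames = [y[i:i + win] * w for i in range(0, len(y) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), n=n_fft, axis=1)).T  # (513, T)

def mel_filterbank(sr=24000, n_fft=1024, n_mels=80):
    """Triangular filters evenly spaced on the (HTK-style) mel scale."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = mel2hz(np.linspace(0.0, hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising slope
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling slope
    return fb

y = np.random.default_rng(3).normal(size=24000)  # 1 s of noise at 24 kHz
spec = stft_mag(y)             # linear spectrogram, (513, frames)
mel = mel_filterbank() @ spec  # mel spectrogram, (80, frames)
assert spec.shape[0] == 513 and mel.shape[0] == 80
```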

4.2.2 Experimental setup

We first built a VITS model with the proposed ASCL scheme, referred to as ASCL-VITS. To demonstrate the effectiveness of ASCL, we trained a vanilla VITS model with the aforementioned pre-trained speaker encoder, referred to as Multi-speaker VITS. This baseline has the same architecture as ASCL-VITS but without ASCL.

All models were trained on the VCTK training set of 97 speakers. To evaluate the baseline and proposed models, we conducted experiments in two settings. First, we generated speech samples from the remaining 11 speakers of the VCTK dataset to evaluate on in-domain unseen speakers. Second, we randomly drew 20 speakers from the LibriTTS test-clean dataset and generated speech samples to evaluate on out-of-domain unseen speakers. For each measure and experimental setup, we randomly selected 25 samples from each model.

4.2.3 Evaluation method

To evaluate the effectiveness of ASCL, we first conducted a subjective test measuring the quality of the generated speech samples from each model with the mean opinion score (MOS). We also compared the generated speech samples in terms of speaker similarity by measuring the similarity mean opinion score (SMOS). Both measures are on a 5-point scale ranging from 1 to 5. Participants were asked to evaluate the quality of samples in terms of intelligibility and naturalness for the MOS test, and to rate the speaker similarity between a given speech sample and the ground truth sample for the SMOS test. In total, 14 participants rated the MOS and SMOS tests on the VCTK and LibriTTS unseen speakers. The resulting scores are presented with 95% confidence intervals in Table 1.


Along with the above subjective tests, we also conducted an objective test measuring the cosine similarity between the two speaker embedding vectors extracted from the generated sample and the ground truth sample, respectively. This score, known as the speaker embedding cosine similarity (SECS), ranges from -1 to 1; the closer the score is to 1, the closer the two speaker embedding vectors are in terms of speaker similarity. We used the publicly available SpeechBrain toolkit [19], which includes a speaker encoder trained on a speaker verification task.
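
SECS itself is a plain cosine similarity between embedding vectors; a minimal sketch follows (the embedding dimension is illustrative, and the embeddings are random stand-ins for speaker-encoder outputs):

```python
import numpy as np

def secs(e_gen, e_gt):
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    return float(np.dot(e_gen, e_gt) /
                 (np.linalg.norm(e_gen) * np.linalg.norm(e_gt)))

rng = np.random.default_rng(4)
e_gt = rng.normal(size=192)                # ground-truth speaker embedding
assert np.isclose(secs(e_gt, e_gt), 1.0)   # identical embeddings score 1
assert np.isclose(secs(-e_gt, e_gt), -1.0) # opposite embeddings score -1
```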

Model               | MOS (VCTK) | SMOS (VCTK) | MOS (LibriTTS) | SMOS (LibriTTS)
Ground Truth        | 4.76±0.03  | 4.68±0.03   | 4.62±0.03      | 4.47±0.04
Multi-speaker VITS  | 4.20±0.04  | 3.81±0.05   | 4.01±0.04      | 3.11±0.06
ASCL-VITS           | 4.50±0.03  | 4.11±0.04   | 4.37±0.04      | 3.52±0.05
Table 1: MOS and SMOS on VCTK and LibriTTS unseen speakers

Model               | SECS (VCTK, ↑) | SECS (LibriTTS, ↑)
Ground Truth        | 0.78           | 0.70
Multi-speaker VITS  | 0.35           | 0.19
ASCL-VITS           | 0.37           | 0.23
Table 2: SECS comparison between the baseline and proposed model on VCTK and LibriTTS unseen speakers

4.3 Results

As presented in Tables 1 and 2, our proposed method, ASCL-VITS, demonstrates its effectiveness with higher MOS, SMOS, and SECS scores on both VCTK and LibriTTS unseen speakers. One notable observation is that Multi-speaker VITS produced multiple samples in which the speaker identity switched mid-utterance in the LibriTTS unseen-speaker experiment, whereas we did not observe such a phenomenon in the samples generated by ASCL-VITS. This shows that our method is more robust in terms of speaker generalization in out-of-domain cases compared to the baseline. The overall results show that ASCL is an effective and suitable method for enhancing the speaker generalization of a TTS model for the ZSM-TTS task.

5 Conclusions

In this paper, adversarial speaker-consistency learning (ASCL) is proposed to mitigate the speaker domain shift problem of ZSM-TTS. We first generate speech of a query speaker using an extensive speaker set from untranscribed datasets, and the model learns to generate samples of an unseen speaker via adversarial learning with the speaker-consistency discriminator. The ASCL-applied VITS model outperforms the baseline in ZSM-TTS in both in-domain and out-of-domain experimental settings in terms of quality and speaker similarity. The experimental results demonstrate that ASCL is effective in synthesizing high-quality speech samples given a reference audio from an unseen speaker. For future work, we will further develop ASCL by extending it to other speech factors, such as emotion, prosody, and dialect, utilizing generative models for more diverse zero-shot multi-factor text-to-speech scenarios.


This work was supported by the Institute of Information & Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-00456, Development of Ultra-high Speech Quality Technology for Remote Multi-speaker Conference System).