1 Introduction
With the advance of deep learning, modern text-to-speech (TTS) systems have adopted end-to-end pipelines that can generate speech signals approaching human levels of naturalness. For example, Tacotron-based approaches [Wang2017TacotronTE] [shen2018natural] first map linguistic features of the textual input into spectrograms, and then use a vocoder model [oord2016wavenet] to obtain the corresponding speech signals. Such encoder-decoder network architectures with attention mechanisms are now widely used, and have been shown to achieve remarkable performance in synthesizing speech signals [jia2018transfer].
In this paper, we aim to solve a novel task in multi-speaker TTS systems: how to create new synthetic, fictional speaker profiles for use in speech synthesis. The motivation behind this task is twofold. First and foremost, creating new synthetic speaker profiles helps to obtain a greater variety of voice profiles, which is itself a crucial aspect of multi-speaker TTS systems. Secondly, current multi-speaker TTS systems are limited to synthesizing speech signals with the same speaker profiles as their training data. This limitation comes from the widely used speaker embedding vectors [skerry2018towards] [gibiansky2017deep] in TTS systems. In particular, each speaker embedding vector captures the voice characteristics of one speaker in the training data. This one-to-one correspondence between speaker embedding vectors and individual speakers places constraints on the generalization capability of such systems. As a result, they struggle to generate speech signals with speaker profiles that are new and different from the training data, which hinders the diversity of voice profiles in multi-speaker TTS systems.

To solve this problem, we propose to use generative networks to model speaker profiles in TTS systems, which breaks down the one-to-one correspondence between speaker embedding vectors and speaker identities. In particular, we use a variational auto-encoder (VAE) [kingma2013auto] to synthesize speaker profiles along with the TTS model. The trained generative model learns the probabilistic distribution of speaker profiles and creates synthetic speaker profiles by sampling from this distribution. It is worth noting that the speaker profile modelling is decoupled from speech synthesis, so that it solely learns the probabilistic distribution of speaker characteristics.
To enhance the quality of synthetic speaker profiles, we propose two specific techniques when training the VAE for speaker profiles: i) a triplet loss on the latent space of the VAE, to encourage a more structured latent space. The triplet loss [chechik2010large] is a standard approach in metric learning and has shown promising performance in both face recognition [schroff2015facenet] and speaker recognition [chung2020in]; and ii) a shuffle mechanism of spectrograms to learn a disentangled representation of speaker profiles. The challenge addressed here is disentanglement: during speech synthesis, many underlying factors (e.g., speaker profiles, speech prosody and emotional status) are often mixed together, and disentangling these factors so that a certain speaker profile can be explicitly generated is not easy. During training, the shuffle mechanism feeds TTS models with spectrograms from the same speaker but with different speech content. The only consistency is the speaker identity, and by doing so, speech characteristics other than speaker profiles are filtered out.
2 Methodology
Our work is based on the Amazon multi-speaker TTS system (https://aws.amazon.com/polly/). In essence, we replace the discrete speaker embedding vectors in the multi-speaker TTS system and use a variational auto-encoder to learn speaker profile information from reference spectrograms. An overview of the proposed multi-speaker TTS system is depicted in Figure 1.
2.1 Multi-speaker TTS system
Standard TTS systems nowadays use an end-to-end pipeline based on the sequence-to-sequence (seq2seq) modelling [sutskever2014sequence] with an attention paradigm [bahdanau2015neural]. It takes textual features as the input and produces spectrogram frames, which are then converted into waveforms. As shown in Figure 1, the TTS system includes an encoder, an attention-based decoder, and a neural vocoder [shen2018natural].
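As a rough, hypothetical sketch of such a pipeline (module choices, names and dimensions below are illustrative assumptions, not those of any particular system):

```python
import torch
import torch.nn as nn

class Seq2SeqTTS(nn.Module):
    """Schematic seq2seq TTS model: phoneme ids -> mel-spectrogram frames.
    Module choices and sizes are illustrative only."""

    def __init__(self, n_symbols=100, enc_dim=256, dec_dim=512, n_mels=80):
        super().__init__()
        self.symbol_emb = nn.Embedding(n_symbols, enc_dim)
        self.encoder = nn.LSTM(enc_dim, enc_dim // 2, batch_first=True,
                               bidirectional=True)                 # phonetic encoder
        self.attn = nn.MultiheadAttention(embed_dim=dec_dim, num_heads=1,
                                          kdim=enc_dim, vdim=enc_dim,
                                          batch_first=True)        # decoder attention
        self.prenet = nn.Linear(n_mels, dec_dim)                   # previous frame -> query
        self.decoder = nn.GRUCell(2 * dec_dim, dec_dim)
        self.mel_out = nn.Linear(dec_dim, n_mels)

    def forward(self, phonemes, n_frames=200):
        # Encode the phoneme sequence into a memory for the attention mechanism.
        memory, _ = self.encoder(self.symbol_emb(phonemes))        # (B, T_text, enc_dim)
        frame = torch.zeros(phonemes.size(0), self.mel_out.out_features)
        state = torch.zeros(phonemes.size(0), self.mel_out.in_features)
        mels = []
        for _ in range(n_frames):                                  # autoregressive decoding
            query = self.prenet(frame).unsqueeze(1)                # (B, 1, dec_dim)
            context, _ = self.attn(query, memory, memory)          # attend over encoder outputs
            state = self.decoder(torch.cat([query.squeeze(1), context.squeeze(1)], dim=-1), state)
            frame = self.mel_out(state)
            mels.append(frame)
        # A neural vocoder then converts the predicted spectrogram frames into a waveform.
        return torch.stack(mels, dim=1)                            # (B, n_frames, n_mels)
```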
To explicitly model speaker profiles in TTS, embedding vectors are used to represent each speaker in the training data. These vectors are concatenated into the attention layer of the decoder, propagating the speaker profile information into the synthesized speech signals. Such speaker embedding vectors can be initialized in different ways, for example with a uniform distribution [gibiansky2017deep], with the Glorot initialization [glorot2010understanding], or with embeddings extracted from speaker recognition and verification networks [variani2014deep] [heigold2016end] [snyder2018x]. Fixed-dimensional vectors extracted from neural networks are also widely used in the field of speaker recognition, under terminologies such as x-vector and d-vector. A sketch of this initialization and concatenation is given below.
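A minimal sketch of how such a per-speaker embedding table could be initialized and concatenated with the phonetic encoder output (all names and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

n_speakers, spk_dim = 2870, 32                        # illustrative sizes

# One embedding vector per training speaker (the one-to-one mapping discussed above).
speaker_table = nn.Embedding(n_speakers, spk_dim)
nn.init.xavier_uniform_(speaker_table.weight)         # Glorot initialization
# Alternatively: nn.init.uniform_(speaker_table.weight, -0.1, 0.1), or copy
# pre-trained d-vectors / x-vectors into speaker_table.weight.

def add_speaker_to_encoder(enc_out, speaker_ids):
    """Broadcast the speaker embedding over time and concatenate it with the
    phonetic encoder output, which then feeds the attention-based decoder."""
    spk = speaker_table(speaker_ids)                          # (B, spk_dim)
    spk = spk.unsqueeze(1).expand(-1, enc_out.size(1), -1)    # (B, T_text, spk_dim)
    return torch.cat([enc_out, spk], dim=-1)                  # (B, T_text, enc_dim + spk_dim)
```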
2.2 Proposed speaker profile modelling
Herein we propose to use a variational auto-encoder to synthesize speaker profiles. As shown in Figure 1, the encoder of the VAE is used as the reference encoder to obtain the speaker embedding vector $z$. It takes a reference spectrogram as input and outputs the mean $\mu$ and the variance $\sigma^2$ of a 32-dimensional diagonal Gaussian distribution. As defined in [kingma2013auto], the speaker embedding $z$ is computed as

$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \tag{1}$$

where $\odot$ signifies an element-wise product and $\epsilon$ is sampled from a standard Gaussian distribution. After obtaining the speaker embedding $z$, it is concatenated with the phonetic encoder output at the attention layer.
In the proposed method, synthetic speaker profiles are created by sampling from the 32-dimensional diagonal Gaussian distribution in which the latent variable $z$ lies. With a well-trained VAE, this Gaussian distribution represents all speaker profile characteristics learned from the training data, and sampling from it is equivalent to randomly choosing one combination of these characteristics as the synthetic speaker profile (a minimal sketch is given below).
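A minimal sketch of the reference encoder, the reparameterization of Eq. (1), and the prior sampling used to create fictional speakers; the GRU backbone and layer sizes are assumptions for illustration, while the 32-dimensional diagonal Gaussian follows the text above:

```python
import torch
import torch.nn as nn

class SpeakerVAEEncoder(nn.Module):
    """Reference encoder: mel-spectrogram -> (mu, log_var) of a 32-dim diagonal Gaussian."""

    def __init__(self, n_mels=80, z_dim=32):
        super().__init__()
        self.net = nn.GRU(n_mels, 128, batch_first=True)   # illustrative backbone
        self.to_mu = nn.Linear(128, z_dim)
        self.to_log_var = nn.Linear(128, z_dim)

    def forward(self, mel):                  # mel: (B, T, n_mels)
        _, h = self.net(mel)                 # final hidden state, (1, B, 128)
        h = h.squeeze(0)
        return self.to_mu(h), self.to_log_var(h)

def reparameterize(mu, log_var):
    """Eq. (1): z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

# Training: z comes from a reference spectrogram of the target speaker, e.g.
#   mu, log_var = encoder(reference_mel); z = reparameterize(mu, log_var)
# Inference: a new, fictional speaker profile is created by sampling the prior,
# and then concatenated at the attention layer of the decoder.
z_new = torch.randn(1, 32)                   # z ~ N(0, I)
```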
Besides, to enhance the quality of synthetic speaker profiles, we propose to use two specific techniques when training the VAE: i) a triplet loss and ii) a shuffle mechanism.
Triplet loss. To encourage a structured latent space of speaker embeddings, we apply a triplet loss to the latent variable $z$. The triplet loss enforces that embedding vectors of the same speaker are close to each other, while embeddings of different speakers are pushed far apart. As defined in [chechik2010large], the triplet loss can be formulated as follows:
$$\mathcal{L}_{\text{triplet}} = \max\big( \lVert f(A) - f(P) \rVert^2 - \lVert f(A) - f(N) \rVert^2 + \alpha,\ 0 \big), \tag{2}$$

where $A$ is an anchor input, i.e., a baseline utterance; $P$ is a positive input, i.e., another utterance from the same speaker as $A$; and $N$ is a negative input, i.e., an utterance from a different speaker than $A$. $f(\cdot)$ is the VAE encoder that extracts embedding vectors from input utterances. The distance from the anchor input to a positive input is minimized, while the distance from the anchor input to a negative input is maximized. $\alpha$ is a margin between positive and negative pairs, whose value is kept fixed in the experiments.
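A minimal sketch of the triplet loss of Eq. (2) on the latent embeddings, using squared Euclidean distances; the margin value shown is a placeholder, not the value used in our experiments:

```python
import torch

def triplet_loss(z_anchor, z_pos, z_neg, margin=0.2):
    """Eq. (2): hinge on squared distances between anchor/positive and anchor/negative pairs.
    margin=0.2 is a placeholder value."""
    d_ap = (z_anchor - z_pos).pow(2).sum(dim=-1)   # same speaker, different utterance
    d_an = (z_anchor - z_neg).pow(2).sum(dim=-1)   # different speaker
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()

# z_* are the VAE-encoder embeddings f(A), f(P), f(N) of the anchor, positive
# and negative utterances, e.g. z_anchor = reparameterize(*encoder(mel_A)).
```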
Shuffle mechanism. One challenge in modelling speaker profiles is how to disentangle the speaker profile information from other mixed speech characteristics, such as speech prosody, environmental noise and emotional status. To this end, we apply a shuffle mechanism that dynamically selects the input reference spectrogram at each epoch. This mechanism randomly chooses the reference spectrogram from one of the output speaker's spectrograms, so that during training the only consistency between the reference spectrogram and the output spectrogram is their corresponding speaker identity. By doing so, speech characteristics other than speaker profiles are filtered out after training. Compared to other approaches for disentangling representations in TTS [hsu2018hierarchical] [hsu2019disentangling], the proposed method is both simple and effective, as demonstrated in our experimental results; a minimal sketch is given below.
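A minimal sketch of this shuffle mechanism (the data-structure layout and field names are assumptions for illustration):

```python
import random

def pick_reference_mel(utterances_by_speaker, speaker_id, target_utt_id):
    """Pick the reference spectrogram for one training example: a randomly chosen
    utterance of the *same* speaker, re-drawn every time the example is formed
    (i.e., at each epoch), so that speaker identity is the only factor shared
    between the reference and the target spectrogram."""
    candidates = [u for u in utterances_by_speaker[speaker_id] if u["utt_id"] != target_utt_id]
    ref = random.choice(candidates) if candidates else utterances_by_speaker[speaker_id][0]
    return ref["mel"]
```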
3 Experimental Evaluation
This section provides an experimental evaluation of the proposed method for building synthetic speaker profiles. It is worth mentioning that evaluating synthesized speech signals from TTS systems has always been a challenging research topic and remains an open question. Several subjective metrics have been used in past works to evaluate synthesized speech signals, such as the estimated mean-opinion score (MOS), which relies on the judgement of human annotators. To promote the objective evaluation of synthesized speech signals, we propose to use three objective metrics, summarized as follows:
- Distinctiveness. The proposed method is first assessed with the distinctiveness metric, which quantifies the diversity of synthetic speaker profiles and gives a global picture of this diversity. Specifically, we use the false-accept rate (FAR) obtained from a given speaker verification model as the distinctiveness: a smaller FAR indicates that the speaker profiles of a set of synthesized speech signals are more diverse.
- Speaker similarity. Different from the distinctiveness metric, which provides a global picture of speaker diversity, we use cosine similarity scores between speaker profiles to conduct a local comparison. In particular, we examine the similarities of synthetic speaker profiles along an interpolation line, i.e., using interpolation to traverse between pairs of existing speaker profiles. This provides a zoomed-in view of the learned VAE latent space and checks how the synthetic speaker profile changes from one to the other.
- Intelligibility. The previous two metrics quantify how synthetic speaker profiles differ from each other, from both global and local perspectives. However, there exists a trivial solution to obtain distinctive speaker profiles: making synthetic speaker profiles random noise (they would indeed differ from each other, but this is a meaningless solution). To prevent this, we use the intelligibility metric to measure the quality of synthetic speaker profiles and synthesized speech signals. Similar to [taylor2021confidence], we use the word error rate (WER) of a given speech recognition model as the intelligibility: a smaller WER means better intelligibility of the synthesized speech signals.
The proposed method and its comparisons are trained on an Amazon internal multi-speaker corpus, which contains more than 700k utterances from 2870 speakers. Eight V100 GPUs are used for training, and each model typically takes days of training time to converge. We use the ADAM optimizer to update model weights, minimizing the loss between the original and generated mel-spectrograms.
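For illustration, the overall training objective could be sketched as follows; the L1 reconstruction criterion, the KL and triplet weights, and the margin are placeholders rather than the exact values used in our system:

```python
import torch
import torch.nn.functional as F

def tts_vae_loss(pred_mel, target_mel, mu, log_var, z_a, z_p, z_n,
                 kl_weight=1e-3, triplet_weight=1e-2, margin=0.2):
    """Total training loss: mel reconstruction + KL regularizer + triplet term (Eq. 2).
    All weights and the margin are placeholder values."""
    recon = F.l1_loss(pred_mel, target_mel)                            # original vs. generated mel
    kl = -0.5 * torch.mean(1.0 + log_var - mu.pow(2) - log_var.exp())  # KL to the N(0, I) prior
    d_ap = (z_a - z_p).pow(2).sum(dim=-1)
    d_an = (z_a - z_n).pow(2).sum(dim=-1)
    trip = torch.clamp(d_ap - d_an + margin, min=0.0).mean()
    return recon + kl_weight * kl + triplet_weight * trip

# Weights are updated with ADAM, e.g.
#   opt = torch.optim.Adam(model.parameters(), lr=1e-3)
```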
3.1 Distinctiveness
To quantify how synthesized speech signals differ from each other in terms of their speaker profile characteristics, we use the false-accept rate (FAR) from the speaker verification model [chung2020in] as the distinctiveness metric. It is worth emphasizing why the FAR can be used here: assuming the given speaker verification model is reasonably good, a false acceptance (and thus an increase in FAR) occurs when two compared speech signals have similar speaker profiles, i.e., their similarity score is high. In other words, it is hard for the speaker verification model to tell the speech signals apart in terms of their speaker identities. In that case, the FAR is large, which indicates that the speaker profiles of the speech signals are less distinctive. For good performance, i.e., diverse synthetic speaker profiles, we expect the FAR to be small.
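A sketch of how this FAR-based distinctiveness could be computed from speaker-verification embeddings of the synthesized utterances; the embeddings and the threshold are assumed to come from the given verification model:

```python
import itertools
import numpy as np

def distinctiveness_far(embeddings, threshold):
    """False-accept rate over all pairs of synthesized utterances, where each pair
    carries (by construction) two different synthetic speaker profiles. A pair whose
    cosine similarity exceeds the verification threshold counts as a false accept."""
    scores = []
    for a, b in itertools.combinations(embeddings, 2):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        scores.append(cos)
    return float(np.mean(np.asarray(scores) > threshold))   # lower FAR -> more distinctive profiles

# The threshold can be set at e.g. the 60th/70th/80th percentile of verification scores,
# and the FAR of each method is then normalized by that of the d-vector baseline (Table 1).
```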
Table 1 reports the normalized FAR values of the proposed method and its comparisons, calculated by dividing by the FAR of the baseline method. As we can see, the proposed VAE outperforms the d-vector that is used as the speaker profile embedding in multi-speaker TTS systems. To provide a more comprehensive view of the FAR, three threshold percentiles are listed in the table. At all these thresholds, the proposed VAE with a triplet loss and a shuffle mechanism achieves the best performance, i.e., the most diverse synthetic speaker profiles.
Table 1: Normalized FAR at three verification-threshold percentiles (lower is better; values are relative to the d-vector baseline).

| Speaker profile modelling | 60th percentile | 70th percentile | 80th percentile |
|---|---|---|---|
| d-vector | 1.000 | 1.000 | 1.000 |
| Variational auto-encoder (VAE) | 0.802 | 0.831 | 0.945 |
| VAE with a triplet loss | 0.734 | 0.686 | 0.836 |
| VAE with a triplet loss and shuffle | 0.716 | 0.657 | 0.774 |
3.2 Speaker similarity
In this section, we aim to provide a zoomed-in view of synthetic speaker profiles by comparing them with existing speaker profiles in the training data. Specifically, we take pairs of naturally spoken speech signals, encode them with the VAE encoder, and then linearly interpolate in the latent space to obtain interpolated synthetic speaker profile vectors. Each obtained speaker profile vector is concatenated into the attention layer of the TTS decoder to generate speech spectrograms with the corresponding speaker profile.
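A minimal sketch of this interpolation in the latent space (variable names are illustrative):

```python
import torch

def interpolate_speaker_profiles(z_start, z_end, n_steps=11):
    """Linear interpolation between two latent speaker embeddings.
    Each interpolated z is fed to the TTS decoder to synthesize speech
    with an intermediate, synthetic speaker profile."""
    weights = torch.linspace(0.0, 1.0, n_steps)
    return [(1.0 - w) * z_start + w * z_end for w in weights]

# z_start, z_end are the embeddings produced by the VAE reference encoder
# for two natural utterances from different speakers.
```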

We use cosine similarity scores from the speaker verification network [chung2020in] to measure the similarity/difference between speaker profiles. As shown in Figure 2, the interpolated synthetic speaker profiles from the proposed method show a smooth and gradual transformation from one natural speaker profile to the other, which demonstrates the continuity of the learned latent space of z. Compared to the baseline approach using d-vectors, where a sudden jump between adjacent interpolation weights can be observed, our synthetic speech signals also have more diverse speaker profiles, which further supports the distinctiveness results in Section 3.1.
3.3 Intelligibility
To evaluate the intelligibility of synthesized speech signals, we use the word error rate (WER) from the Amazon speech recognition system (https://aws.amazon.com/transcribe/). As shown in [taylor2021confidence], ASR-based metrics such as the WER can perform reliably for evaluating the intelligibility of TTS systems, at a level comparable to paid human annotators/listeners.
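For reference, a simple word-level WER and the normalization used in Table 2 could be sketched as follows; the ASR transcripts themselves come from the speech recognition system, and text normalization is omitted here:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein (edit) distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Normalized WER of a method = mean WER over its synthesized utterances,
# divided by the mean WER of the d-vector baseline (the 1.000 rows in Table 2).
```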
Table 2 shows the normalized WER results of the proposed method and its comparisons, calculated by dividing by the WER of the baseline method. Here, we compare different methods using a given number of synthetic speaker profiles generated by each method. As we can see, the proposed VAE with a triplet loss achieves the best performance with the smallest WER, showing the benefit of the triplet loss during training, as it enforces a more structured latent space on z. On the other hand, the shuffle mechanism has a negligible effect on the intelligibility of synthesized speech signals.
Table 2: Normalized WER for different numbers of synthetic speaker profiles (lower is better; values are relative to the d-vector baseline).

| Speaker profile modelling | 1 profile | 250 profiles | 500 profiles |
|---|---|---|---|
| d-vector | 1.000 | 1.000 | 1.000 |
| Variational auto-encoder (VAE) | 0.926 | 0.983 | 0.924 |
| VAE with a triplet loss | 0.914 | 0.971 | 0.878 |
| VAE with a triplet loss and shuffle | 0.914 | 0.970 | 0.879 |
4 Conclusion
In this paper, we propose to use generative models to synthesize speaker profiles in TTS systems. Specifically, a variational auto-encoder is used to learn the probabilistic distribution of speaker profiles, with a triplet loss to regularize its latent space and a shuffle mechanism to disentangle speaker information. By doing so, the proposed method enables TTS systems to generate synthetic speaker profiles that have not been seen in training data. The effectiveness of the proposed method has been demonstrated in our experiments.