Building Synthetic Speaker Profiles in Text-to-Speech Systems

02/07/2022
by Jie Pu, et al.

The diversity of speaker profiles in multi-speaker TTS systems is a crucial aspect of their performance, as it measures how many different speaker profiles a TTS system can synthesize. However, this important aspect is often overlooked when building multi-speaker TTS systems, and there is no established framework to evaluate this diversity. The reason is that most multi-speaker TTS systems are limited to generating speech signals with the same speaker profiles as their training data. They often use discrete speaker embedding vectors that have a one-to-one correspondence with individual speakers. This correspondence limits TTS systems and hinders their capability to generate unseen speaker profiles that did not appear during training. In this paper, we aim to build multi-speaker TTS systems that have a greater variety of speaker profiles and can generate new synthetic speaker profiles that differ from the training data. To this end, we propose to use generative models with a triplet loss and a specific shuffle mechanism. In our experiments, the effectiveness and advantages of the proposed method are demonstrated in terms of both the distinctiveness and intelligibility of the synthesized speech signals.

1 Introduction

With the advance of deep learning, modern text-to-speech (TTS) systems have adopted end-to-end pipelines and can generate speech signals approaching the human level of naturalness. For example, Tacotron-based approaches [Wang2017TacotronTE] [shen2018natural] first map linguistic features of the textual input into spectrograms, and then use a vocoder model [oord2016wavenet] to obtain the corresponding speech signals. Such an encoder-decoder network architecture with an attention mechanism is now widely used, and has been shown to achieve remarkable quality in synthesized speech signals [jia2018transfer].

In this paper, we aim to solve a novel task in multi-speaker TTS systems: how to create new synthetic, fictional speaker profiles for use in speech synthesis. The motivation behind this task is twofold: first and foremost, creating new synthetic speaker profiles helps to obtain a greater variety of voice profiles, which is itself a crucial aspect of multi-speaker TTS systems. Secondly, current multi-speaker TTS systems are limited to synthesizing speech signals with the same speaker profiles as their training data. Such a limitation comes from the widely used speaker embedding vectors [skerry2018towards] [gibiansky2017deep] in TTS systems. In particular, each speaker embedding vector captures the voice characteristics of one speaker in the training data. This one-to-one correspondence between speaker embedding vectors and individual speakers places constraints on the generalization capability of TTS systems. As a result, they struggle to generate speech signals with speaker profiles that are new and different from the training data, which hinders the diversity of voice profiles in multi-speaker TTS systems.

Figure 1: Overview of the proposed multi-speaker text-to-speech system. The proposed method focuses on speaker profile modelling, which propagates the speaker profile information into the generated speech signals. In previous methods, speaker profile information is represented as a fixed-dimensional vector, called a speaker/voice profile embedding. In contrast, we propose to use a variational auto-encoder with a triplet loss to model speaker profiles. The obtained speaker profile is then fed into the attention layer of the decoder to generate speech signals with the corresponding speaker profile.

To solve this problem, we propose to use generative networks for modelling speaker profiles in TTS systems, which breaks the one-to-one correspondence between speaker embedding vectors and speaker identities. In particular, we choose a variational auto-encoder (VAE) [kingma2013auto] to synthesize speaker profiles along with the TTS model. The trained generative model learns the probability distribution of speaker profiles and then creates synthetic speaker profiles by sampling from this distribution. It is worth noting that the speaker profile modelling is decoupled from speech synthesis, in order to solely learn the probability distribution of speaker characteristics.

To enhance the quality of synthetic speaker profiles, we propose to use two specific techniques when training the VAE for speaker profiles: i) a triplet loss on the latent space of the VAE, in order to encourage a more structured latent space. The triplet loss [chechik2010large] is a standard approach in metric learning and has shown promising performance in both face recognition [schroff2015facenet] and speaker recognition [chung2020in]. ii) Another challenge is disentanglement: during speech synthesis, many underlying factors (e.g., speaker profile, speech prosody and emotional state) are mixed together, and disentangling these factors to explicitly generate a certain speaker profile is not easy. Therefore, we propose a shuffle mechanism on the spectrograms to learn a disentangled representation of speaker profiles. During training, this mechanism feeds the TTS model with spectrograms that come from the same speaker but have different speech content. The only consistency is the speaker identity, and by doing so, speech characteristics other than the speaker profile are filtered out.

2 Methodology

Our work is based on Amazon's multi-speaker TTS system (https://aws.amazon.com/polly/). In essence, we replace the discrete speaker embedding vectors in the multi-speaker TTS system with a variational auto-encoder that learns speaker profile information from reference spectrograms. An overview of the proposed multi-speaker TTS system is depicted in Figure 1.

2.1 Multi-speaker TTS system

Standard TTS systems nowadays use an end-to-end pipeline based on sequence-to-sequence (seq2seq) modelling [sutskever2014sequence] with an attention paradigm [bahdanau2015neural]. The system takes textual features as input and produces spectrogram frames, which are then converted into waveforms. As shown in Figure 1, the TTS system includes an encoder, an attention-based decoder, and a neural vocoder [shen2018natural].

To explicitly model speaker profiles in TTS, embedding vectors are used to represent each speaker in the training data. These vectors are concatenated into the attention layer of the decoder, propagating the speaker profile information into the synthesized speech signals. These speaker embedding vectors can be initialized in different ways, such as sampling from a uniform distribution [gibiansky2017deep], using the Glorot initialization [glorot2010understanding], or extracting embeddings from speaker recognition and verification networks [variani2014deep] [heigold2016end] [snyder2018x]. Fixed-dimensional vectors extracted from such networks are also widely used in the field of speaker recognition, under terminologies such as x-vector and d-vector.
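As an illustration of this conditioning step, the following sketch broadcasts a per-speaker embedding over time and concatenates it with the phonetic encoder outputs before they reach the attention-based decoder. The tensor shapes, function name and PyTorch framing are assumptions for illustration, not the exact Amazon Polly implementation.

```python
import torch

def condition_on_speaker(encoder_out: torch.Tensor,
                         speaker_emb: torch.Tensor) -> torch.Tensor:
    """Concatenate a speaker embedding to every encoder time step.

    encoder_out: (batch, time, enc_dim) phonetic encoder outputs.
    speaker_emb: (batch, spk_dim) one embedding vector per speaker.
    """
    batch, time, _ = encoder_out.shape
    # Broadcast the speaker vector along the time axis ...
    speaker_emb = speaker_emb.unsqueeze(1).expand(batch, time, -1)
    # ... and concatenate, so the attention-based decoder sees the
    # speaker profile at every decoding step.
    return torch.cat([encoder_out, speaker_emb], dim=-1)
```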

2.2 Proposed speaker profile modelling

Herein we propose to use a variational auto-encoder to synthesize speaker profiles. As shown in Figure 1, the encoder of the VAE is used as the reference encoder to obtain the speaker embedding vector z. It takes a reference spectrogram as input and outputs the mean \mu and the variance \sigma^2 of a 32-dimensional diagonal Gaussian distribution. As defined in [kingma2013auto], the speaker embedding z is obtained as

z = \mu + \sigma \odot \epsilon,     (1)

where \odot signifies an element-wise product and \epsilon is sampled from a standard Gaussian distribution. After obtaining the speaker embedding z, it is concatenated with the phonetic encoder output at the attention layer.

In the proposed method, synthetic speaker profiles are created by sampling from the 32-dimensional diagonal Gaussian distribution in which the latent variable z lies. With a well-trained VAE, this Gaussian distribution represents all speaker profile characteristics learned from the training data, and sampling from it is equivalent to randomly choosing one combination of these characteristics as the synthetic speaker profile.
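A minimal sketch of the two ways the latent speaker profile is obtained, assuming the reference encoder predicts the mean and log-variance of the 32-dimensional diagonal Gaussian and that the conventional standard-normal VAE prior is used for sampling new profiles; the names and PyTorch framing are illustrative.

```python
import torch

LATENT_DIM = 32  # dimensionality of the speaker-profile latent z

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Training time, Eq. (1): z = mu + sigma (element-wise) eps, eps ~ N(0, I)."""
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)
    return mu + sigma * eps

def sample_new_speaker(batch: int = 1) -> torch.Tensor:
    """Inference time: draw a synthetic, unseen speaker profile from the prior."""
    return torch.randn(batch, LATENT_DIM)
```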

In addition, to enhance the quality of synthetic speaker profiles, we use two specific techniques when training the VAE: i) a triplet loss and ii) a shuffle mechanism.

Triplet loss. To encourage a structured latent space of speaker embeddings, we apply a triplet loss on the latent variable z. The triplet loss enforces that embedding vectors of the same speaker are close to each other, while embeddings of different speakers stay far apart. As defined in [chechik2010large], the triplet loss can be formulated as follows:

L = max( ||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + \alpha, 0 ),     (2)

where A is an anchor input, i.e., a baseline utterance; P is a positive input, i.e., another utterance from the same speaker as A; and N is a negative input, i.e., an utterance from a speaker different from A. f is the VAE encoder that extracts embedding vectors from input utterances. The distance from the anchor input to a positive input is minimized, while the distance from the anchor input to a negative input is maximized. \alpha is a margin between positive and negative pairs.
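For concreteness, here is a sketch of Eq. (2) on a batch of anchor/positive/negative speaker embeddings produced by the VAE encoder; the squared-Euclidean distance and the margin value of 0.2 are placeholder assumptions, not the exact training configuration.

```python
import torch

def triplet_loss(z_anchor: torch.Tensor,
                 z_positive: torch.Tensor,
                 z_negative: torch.Tensor,
                 margin: float = 0.2) -> torch.Tensor:
    """Eq. (2): pull same-speaker embeddings together, push different speakers apart."""
    d_pos = (z_anchor - z_positive).pow(2).sum(dim=-1)  # ||f(A) - f(P)||^2
    d_neg = (z_anchor - z_negative).pow(2).sum(dim=-1)  # ||f(A) - f(N)||^2
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```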

Shuffle mechanism. One challenge in modelling speaker profiles is how to disentangle the speaker profile information from other, mixed-in speech characteristics, such as speech prosody, environmental noise and emotional state. To this end, we apply a shuffle mechanism that dynamically selects the input reference spectrogram at each epoch. This mechanism randomly chooses the reference spectrogram from among the output speaker's spectrograms, so that during training the only consistency between the reference spectrogram and the output spectrogram is their corresponding speaker identity. By doing so, speech characteristics other than the speaker profile are filtered out after training. Compared to other approaches for disentangling representations in TTS [hsu2018hierarchical] [hsu2019disentangling], the proposed method is both simple and effective, as demonstrated in our experimental results.
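A possible realization of the shuffle mechanism in the data pipeline, assuming utterances are grouped by speaker ID; the data structures and function name are illustrative.

```python
import random

def pick_reference(target_utt_id: str,
                   speaker_id: str,
                   utts_by_speaker: dict) -> str:
    """Choose a reference spectrogram from the same speaker but, whenever
    possible, a different utterance than the target. Re-drawn every epoch,
    so only the speaker identity stays consistent across reference/output."""
    candidates = [u for u in utts_by_speaker[speaker_id] if u != target_utt_id]
    if not candidates:  # speaker has only one utterance
        return target_utt_id
    return random.choice(candidates)
```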

3 Experimental Evaluation

This section provides an experimental evaluation of the proposed method for building synthetic speaker profiles. First of all, it is worth mentioning that evaluating synthesized speech signals from TTS systems has always been a challenging research topic and remains an open question. Several subjective metrics have been used in past work to evaluate synthesized speech signals, such as the mean opinion score (MOS), which relies on the judgement of human annotators. To promote the objective evaluation of synthesized speech signals, we use three objective metrics, summarized as follows:

  • Distinctiveness. The proposed method is first assessed with the distinctiveness metric, which quantifies the diversity of synthetic speaker profiles and gives a global picture of that diversity. Specifically, we use the false-accept rate (FAR) obtained from a given speaker verification model as the distinctiveness measure: a smaller FAR indicates that the speaker profiles of a set of synthesized speech signals are more diverse.

  • Speaker similarity. Different from the distinctiveness metric, which provides a global picture of speaker diversity, we use cosine similarity scores between speaker profiles to conduct a local comparison. In particular, we examine the similarities of synthetic speaker profiles along an interpolation line, i.e., we use interpolation to traverse between pairs of existing speaker profiles. This provides a zoomed-in view of the learned VAE latent space, and checks how the synthetic speaker profile changes from one existing profile to the other.

  • Intelligibility. The previous two metrics quantify how synthetic speaker profiles differ from each other, from both global and local perspectives. However, there exists a trivial solution for obtaining distinctive speaker profiles: making the synthetic speaker profiles random noise (they would indeed differ from each other, but this is a meaningless solution). To prevent this, we use the intelligibility metric to measure the quality of synthetic speaker profiles and synthesized speech signals. Similar to [taylor2021confidence], we use the word error rate (WER) of a given speech recognition model as the intelligibility measure: a smaller WER means better intelligibility of the synthesized speech signals.

The proposed method and its comparisons are trained on an Amazon internal multi-speaker corpus, which contains more than 700k utterances from 2870 speakers. Eight V100 GPUs are used for training, and each model typically takes several days to converge. We use the ADAM optimizer to update the model weights, minimizing the loss between the original and generated mel-spectrograms.

3.1 Distinctiveness

To quantify how synthesized speech signals differ from each other in terms of their speaker profile characteristics, we use the false-accept rate (FAR) from the speaker verification model [chung2020in] as the distinctiveness metric. It is worth emphasizing why the FAR can be used here: assuming the given speaker verification model is reasonably good, a false acceptance (and thus an increase in FAR) happens when two compared speech signals have similar speaker profiles, i.e., their similarity score is high. In other words, it is hard for the speaker verification model to tell the speech signals apart in terms of their speaker identities. In that case, the FAR is large, which indicates that the speaker profiles of the speech signals are less distinctive. For good performance, i.e., diverse synthetic speaker profiles, we expect the FAR to be small.
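The following sketch illustrates the idea behind the FAR-based distinctiveness score: score every pair of synthesized utterances with speaker verification embeddings and count the fraction of pairs falsely accepted as the same speaker. The cosine scoring and thresholding shown here are illustrative assumptions; in the paper the resulting FAR is further normalized by the d-vector baseline.

```python
import itertools
import numpy as np

def false_accept_rate(embeddings: np.ndarray, threshold: float) -> float:
    """Fraction of utterance pairs whose verification score exceeds the
    acceptance threshold, i.e. pairs falsely accepted as the same speaker.
    Smaller values indicate more distinctive (diverse) speaker profiles."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    accepts, total = 0, 0
    for i, j in itertools.combinations(range(len(unit)), 2):
        accepts += int(unit[i] @ unit[j] > threshold)
        total += 1
    return accepts / total
```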

Table 1 reports the normalized FAR values of the proposed method and its comparisons, calculated by dividing by the FAR of the baseline method. As we can see, the proposed VAE outperforms the d-vector baseline used as speaker profile embeddings in multi-speaker TTS systems. To give a more comprehensive view of the FAR, three threshold percentiles are listed in the table. At all of these thresholds, the proposed VAE with a triplet loss and a shuffle mechanism achieves the best performance, i.e., the most diverse synthetic speaker profiles.

Speaker profile modelling           | Threshold percentiles
                                    | 60th  | 70th  | 80th
d-vector                            | 1.000 | 1.000 | 1.000
Variational auto-encoder (VAE)      | 0.802 | 0.831 | 0.945
VAE with a triplet loss             | 0.734 | 0.686 | 0.836
VAE with a triplet loss and shuffle | 0.716 | 0.657 | 0.774
Table 1: Normalized False Accept Rates (FAR) of the distinctiveness measurement. The FAR performance of d-vector is the baseline, and also the denominator for normalization. The best performance is in bold.

3.2 Speaker similarity

In this section, we aim to provide a zoomed-in view of synthetic speaker profiles by comparing them with existing speaker profiles in the training data. Specifically, we take pairs of naturally spoken speech signals, encode them with the VAE encoder, and then linearly interpolate in the latent space to obtain interpolated synthetic speaker profile vectors. The obtained speaker profile vector is concatenated into the attention layer of the TTS decoder to generate speech spectrograms of the corresponding speaker profile.
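A sketch of the interpolation traversal, assuming the two reference utterances have already been encoded into latent vectors z1 and z2. For simplicity, the cosine similarity here is computed directly on the latent vectors; in the paper the comparison is made on the synthesized speech via the speaker verification network.

```python
import numpy as np

def interpolate_profiles(z1: np.ndarray, z2: np.ndarray, steps: int = 11):
    """Linearly traverse the latent space between two speaker profiles."""
    for alpha in np.linspace(0.0, 1.0, steps):
        yield alpha, (1.0 - alpha) * z1 + alpha * z2

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Each interpolated z would be fed to the TTS decoder to synthesize speech;
# plotting cosine(z, z1) against alpha mirrors the curves in Figure 2.
```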

Figure 2: Cosine similarity scores between interpolated and existing speaker profiles. For a pair of existing speaker profiles, their encoded latent variables are z_1 and z_2. The interpolated speaker profiles are created by a weighted sum z = (1 - \alpha) z_1 + \alpha z_2, where the x-axis represents the weight \alpha. The y-axis shows the cosine similarity between the interpolated profile and an existing one. The label Proposed denotes the VAE with a triplet loss and shuffle, and Ideal stands for the ideal case (only theoretically possible) of manifold smoothness.

We use cosine similarity scores from the speaker verification network [chung2020in] to measure the similarity/difference between speaker profiles. As shown in Figure 2, the interpolated synthetic speaker profiles from the proposed method show a smooth and gradual transition from one natural speaker profile to the other, which demonstrates the continuity of the learned manifold space of z. Compared to the baseline approach using d-vectors, where the similarity exhibits a sudden jump between adjacent interpolation weights, our synthetic speech signals also have more diverse speaker profiles, which further supports the distinctiveness result in Section 3.1.

3.3 Intelligibility

To evaluate the intelligibility of synthesized speech signals, we use the word error rate (WER) from the Amazon speech recognition system (https://aws.amazon.com/transcribe/). As shown in [taylor2021confidence], ASR-based metrics such as the WER perform reliably for evaluating the intelligibility of TTS systems, at a level comparable to paid human annotators/listeners.
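As a sketch of the metric itself, assuming the jiwer package for WER computation and an external transcription step that produces hypotheses from the synthesized audio; both are stand-ins rather than the exact evaluation pipeline.

```python
from jiwer import wer  # standard word-error-rate implementation

def normalized_wer(reference_texts, hypothesis_texts, baseline_wer: float) -> float:
    """WER of ASR transcriptions of the synthesized speech, divided by the
    d-vector baseline's WER as in Table 2 (smaller means more intelligible)."""
    return wer(reference_texts, hypothesis_texts) / baseline_wer
```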

Table 2 shows the normalized WER results of the proposed method and its comparisons, calculated by dividing by the WER of the baseline method. Here, we compare the different methods using varying numbers of synthetic speaker profiles generated by each of them. As we can see, the proposed VAE with a triplet loss achieves the best performance with the smallest WERs, showing the benefit of the triplet loss during training, as it enforces a more structured latent space on z. The shuffle mechanism, on the other hand, has a negligible effect on the intelligibility of synthesized speech signals.

Speaker profile modelling           | Number of speaker profiles
                                    | 1     | 250   | 500
d-vector                            | 1.000 | 1.000 | 1.000
Variational auto-encoder (VAE)      | 0.926 | 0.983 | 0.924
VAE with a triplet loss             | 0.914 | 0.971 | 0.878
VAE with a triplet loss and shuffle | 0.914 | 0.970 | 0.879
Table 2: Normalized Word Error Rates (WER) of the intelligibility measurement. The WER performance of d-vector is the baseline, and also the denominator for normalization. The best performance is in bold.

4 Conclusion

In this paper, we propose to use generative models to synthesize speaker profiles in TTS systems. Specifically, a variational auto-encoder is used to learn the probability distribution of speaker profiles, with a triplet loss to regularize its latent space and a shuffle mechanism to disentangle speaker information. By doing so, the proposed method enables TTS systems to generate synthetic speaker profiles that were not seen in the training data. The effectiveness of the proposed method has been demonstrated in our experiments.

References