The fast-paced development of neural end-to-end TTS synthesis has enabled the generation of speech approaching human levels of naturalness [22, 27, 19, 20]. Such models directly map an input text to a sequence of acoustic features using an encoder-decoder network architecture [23]. In addition to the input text, there are many other sources of variation in speech, including speaker, background noise, channel properties (e.g., reverberation), and prosody. These attributes can be accounted for in the synthesis model by learning a latent representation as an additional input to the decoder [21, 28, 8, 15].
Prosody [26] collectively refers to stress, intonation, and rhythm in speech. Annotations for such factors are rarely available for training. Many recent end-to-end TTS models aiming to capture these factors extract a latent representation from the target speech, and factorize the observed attributes, such as speaker and text information, out of the prosody latent space [21, 8, 7, 1]. These approaches extract a single latent variable for an entire utterance, requiring a single global representation to capture the full space of variation across speech signals of arbitrary length. In contrast, the model proposed in [13] uses a fine-grained structure to encode the prosody associated with each phoneme in the input sequence from the aligned target spectrogram. This system can synthesize speech that closely resembles the prosody of a provided reference utterance, and can control local prosody by varying the values of the corresponding latent features.
There exist applications where generating samples of natural speech corresponding to the same text with different prosody is desirable. For example, samples with diverse prosodic variations could be useful in data augmentation for ASR where a limited amount of real speech is available. Using a generative framework, such as a VAE [11], to represent the fine-grained latent variable naturally enables sampling of different prosody features for each phoneme. The prior over each latent variable is commonly modeled using a standard Gaussian distribution. Since the prior is independent at each phoneme, the generated audio often exhibits discontinuous and unnatural artifacts such as long pauses between syllables or sudden increases in energy or fundamental frequency ($F_0$).
A simple way to ameliorate these unnatural samples is to scale down the standard deviation of the prior distribution during generation, which decreases the likelihood of sampling outlier values. However, this also suppresses the diversity of the generated audio samples and does not eliminate discontinuities, since consecutive samples are still independent. Attempts to introduce temporal correlation to sequential latent representations [4, 6] often adopt an autoregressive decomposition of the prior and posterior distributions, parameterizing both using neural networks. More recently, [24] introduced the vector-quantized VAE (VQ-VAE) and a two-stage training approach for generating high-fidelity speech samples, in which the posterior is trained for reconstruction and an autoregressive (AR) prior is trained separately to fit the posteriors extracted from the training set.
This paper uses a two-stage training approach similar to [24]. We extend Tacotron 2 [20] to incorporate a quantized fine-grained VAE (QFVAE) where the latent representation is quantized into a fixed number of classes. We find that using a quantized representation improves the naturalness over audio samples generated from the continuous latent space, while still ensuring reasonable diversity across samples. In the first stage, the TTS model is trained in a teacher-forced setting to maximize the likelihood of the training set. In the second stage, an autoregressive (AR) prior network is trained to fit the VAE posterior distribution over the training data learned in the first stage. This network learns to model the temporal dynamics across latent features. Samples characterized by latent features can be drawn from the AR prior by providing an initial state. We compare two AR prior fitting schemes, one in the continuous space and another in the quantized latent space.
We evaluate the proposed model from several perspectives. Sample naturalness and completeness are measured using an ASR system trained on real speech data, in addition to subjective listening tests evaluating the naturalness of the generated speech. Sample diversity is evaluated by the average standard deviation per phoneme in three measurable prosody attributes. Lastly, the benefit of the model as a data augmentation method is demonstrated by training an ASR system on audio samples generated from the TTS systems.
2 Quantized Fine-grained VAE TTS Model
The fine-grained VAE structure used to model prosody at the phoneme level, similar to that in [13], is shown in Fig. 1. The VAE component is integrated with the encoder of the Tacotron 2 model [20], and the target spectrogram is provided as an extra input to the encoder in order to extract latent prosody features.
The spectrogram is first aligned with the phoneme sequence using attention [25] according to phoneme encodings from the output of the encoder. The aligned spectrogram is then sent to the VAE to extract a sequence of latent representations for the prosody which is also aligned with the phoneme sequence. Finally, the phoneme encodings are concatenated with the latent representations and sent to the decoder. The system is trained by optimizing the fine-grained evidence lower bound (ELBO) loss:

$$\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{q(z \mid X, Y)}\left[-\log p(X \mid Y, z)\right] + \sum_{n=1}^{N} D_{\mathrm{KL}}\left(q(z_n \mid X, Y) \,\|\, p(z_n)\right) \qquad (1)$$

where the first term is the reconstruction loss and the second term is the KL divergence between prior and posterior. The prior $p(z_n)$ is chosen to be $\mathcal{N}(0, I)$. $z$ represents the sequence of latent features and $z_n$ corresponds to the latent representation for the $n$-th of $N$ phonemes. $X$ is the aligned spectrogram and $Y$ represents the phoneme encoding.
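As an illustration of the KL term described above, the following minimal sketch evaluates the closed-form KL divergence between a diagonal Gaussian posterior and a standard Gaussian prior for each phoneme and sums the result over the sequence. The function names and the toy posterior parameters are assumptions made for this example; they are not taken from the model implementation.

```python
import math


def kl_to_standard_normal(mean, std):
    """KL( N(mean, diag(std^2)) || N(0, I) ) for one phoneme's latent vector."""
    return sum(
        0.5 * (s * s + m * m - 1.0) - math.log(s)
        for m, s in zip(mean, std)
    )


def fine_grained_kl(posterior_means, posterior_stds):
    """Sum the per-phoneme KL terms over the whole phoneme sequence."""
    return sum(
        kl_to_standard_normal(m, s)
        for m, s in zip(posterior_means, posterior_stds)
    )


# Hypothetical posterior parameters: 3 phonemes, 3-dimensional latents.
means = [[0.0, 0.0, 0.0], [0.5, -0.5, 0.0], [1.0, 0.0, 0.0]]
stds = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
kl_total = fine_grained_kl(means, stds)
```

Because the term is a sum over phonemes, a posterior that matches the prior at one phoneme contributes nothing there, regardless of how far the other phonemes deviate.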
2.1 Vector Quantization
Vector quantization, shown in Fig. 2, is performed after the latents are drawn from the posterior distribution: each latent vector is assigned the quantized embedding closest to it in Euclidean distance.
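The assignment step can be sketched as a nearest-neighbour lookup. The codebook values, latent vectors, and function names below are hypothetical and chosen only to make the example self-contained.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent, codebook):
    """Return (index, embedding) of the codebook entry nearest to `latent`."""
    index = min(
        range(len(codebook)),
        key=lambda k: squared_distance(latent, codebook[k]),
    )
    return index, codebook[index]


# Hypothetical codebook with 4 classes and 3-dimensional embeddings.
codebook = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [-1.0, 0.5, 0.0], [0.0, 2.0, -1.0]]
# One sampled latent vector per phoneme.
latents = [[0.1, -0.2, 0.0], [0.9, 1.2, 0.8], [-0.7, 0.4, 0.1]]
indices = [quantize(z, codebook)[0] for z in latents]
```

Minimizing the squared distance selects the same embedding as minimizing the Euclidean distance, so the square root is omitted.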
Unlike the original VQ-VAE [24], which used a one-hot posterior distribution, we maintain the Gaussian form of the posterior from the standard VAE to make it possible to experiment with continuous or discrete representations within the AR prior. Therefore the phoneme-level ELBO loss is used to train the VQ-VAE as before. The gradient from the reconstruction loss term is back-propagated to the latent encoder by copying, at each step, the gradient received by the quantized embedding directly to the corresponding latent vector in the continuous space. Furthermore, to update the quantized embeddings, the following quantization and commitment losses are optimized together with the ELBO:

$$\mathcal{L}_{\mathrm{VQ}} = \sum_{n=1}^{N} \left( \left\| \mathrm{sg}(z_n) - e_n \right\|_2^2 + \left\| z_n - \mathrm{sg}(e_n) \right\|_2^2 \right) \qquad (2)$$

where $z_n$ is the continuous latent vector for the $n$-th phoneme, $e_n$ is the quantized embedding assigned to it, and $\mathrm{sg}(\cdot)$ is the stop-gradient operator, which is the identity in the forward pass and has zero partial derivatives in the backward pass. The first term is the quantization loss, which moves the embeddings towards the latent vectors computed by the encoder in order to minimize the error introduced by quantization. The second term is the commitment loss, which encourages the continuous latent vector to remain close to the quantized embedding, preventing the embedding space from expanding too fast. The total loss optimized by the model is the sum of the VAE loss and the VQ loss, $\mathcal{L}_{\mathrm{VAE}} + \mathcal{L}_{\mathrm{VQ}}$.
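To make the role of the stop-gradient operator concrete, the following sketch writes out the gradients of the two terms for a single phoneme. The notation is an assumption made here for illustration: $z$ denotes the continuous latent vector, $e$ the quantized embedding assigned to it, $\mathcal{L}_{\mathrm{rec}}$ the reconstruction loss, and both terms are taken to be unweighted squared Euclidean distances.

```latex
\begin{aligned}
\frac{\partial}{\partial e}\lVert \mathrm{sg}(z) - e \rVert_2^2 &= 2\,(e - z), &
\frac{\partial}{\partial z}\lVert \mathrm{sg}(z) - e \rVert_2^2 &= 0, \\
\frac{\partial}{\partial z}\lVert z - \mathrm{sg}(e) \rVert_2^2 &= 2\,(z - e), &
\frac{\partial}{\partial e}\lVert z - \mathrm{sg}(e) \rVert_2^2 &= 0, \\
\frac{\partial \mathcal{L}_{\mathrm{rec}}}{\partial z} &:= \frac{\partial \mathcal{L}_{\mathrm{rec}}}{\partial e} & &
\text{(gradient copied through the quantizer)}.
\end{aligned}
```

Under these assumptions the quantization term only moves the embedding towards the latent vector, the commitment term only moves the latent vector towards the embedding, and the reconstruction gradient reaches the encoder as if the quantizer were the identity.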
3 Autoregressive prosody prior
Once the TTS model is trained, the fine-grained VAE encoder shown in Fig. 1 can be used to compute parameters for the posterior distribution over latents from a reference spectrogram. Since the posterior is derived from a real speech spectrogram, samples from it will be natural and coherent across phonemes. However, without such a reference spectrogram, the model does not expose a method to generate natural samples. We therefore train an AR prior to model such temporal coherency in the latent feature sequence from the posterior. This AR prior is trained separately without affecting the training of the posterior. Because the encoder network uses vector quantization after sampling from an underlying posterior distribution in the continuous latent space, we compare two different prior fitting schemes.
The first scheme aims to fit a Gaussian AR prior in the continuous latent space so that the prior and the posterior at each time step come from the same family of distributions. A single-layer LSTM is used to model the prior, and is trained using teacher forcing on the latent feature sequence from the posterior. The output at each step of the sequence is a diagonal Gaussian distribution whose mean and standard deviation are functions of the previous latent features. This prior is also conditioned on the phoneme encoding $Y$:

$$p(z_n \mid z_{<n}, Y) = \mathcal{N}\left(\mu_n(z_{<n}, Y),\ \sigma_n(z_{<n}, Y)\right) \qquad (3)$$

where $\mu_n$ and $\sigma_n$ are outputs of the prior LSTM. The LSTM is trained by additionally minimizing the same KL divergence in the continuous latent space as in Eq. (1), except that the original prior $p(z_n)$ is replaced with $p(z_n \mid z_{<n}, Y)$. During generation, the network is only provided with the phoneme encoding and an all-zero initial state. Samples drawn from the continuous AR prior for each phoneme are quantized using the same embedding space.
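The continuous scheme can be sketched as follows. This is a toy illustration under stated assumptions, not the model's implementation: `toy_prior_step` is a hypothetical stand-in for the prior LSTM that looks only at the previous latent vector and the current phoneme encoding (a real LSTM would carry the full history in its state), and the phoneme encodings are given the same dimension as the latents purely for convenience.

```python
import math
import random


def kl_diag_gaussians(mean_q, std_q, mean_p, std_p):
    """KL( N(mean_q, diag(std_q^2)) || N(mean_p, diag(std_p^2)) )."""
    total = 0.0
    for mq, sq, mp, sp in zip(mean_q, std_q, mean_p, std_p):
        total += math.log(sp / sq) + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp) - 0.5
    return total


def toy_prior_step(previous_latent, phoneme_encoding):
    """Hypothetical stand-in for the prior LSTM: (mean, std) of the next latent."""
    mean = [0.8 * z + 0.1 * y for z, y in zip(previous_latent, phoneme_encoding)]
    std = [0.3] * len(previous_latent)
    return mean, std


def teacher_forced_kl(post_means, post_stds, post_samples, phoneme_encodings):
    """Sum of per-step KL(posterior || AR prior), feeding posterior samples as history."""
    previous = [0.0] * len(post_means[0])
    total = 0.0
    for mq, sq, z, y in zip(post_means, post_stds, post_samples, phoneme_encodings):
        mp, sp = toy_prior_step(previous, y)
        total += kl_diag_gaussians(mq, sq, mp, sp)
        previous = z  # teacher forcing: condition the next step on the posterior sample
    return total


def sample_sequence(phoneme_encodings, rng):
    """Generate latents one phoneme at a time, starting from an all-zero history."""
    previous = [0.0] * len(phoneme_encodings[0])
    sequence = []
    for y in phoneme_encodings:
        mean, std = toy_prior_step(previous, y)
        previous = [rng.gauss(m, s) for m, s in zip(mean, std)]
        sequence.append(previous)
    return sequence


# Hypothetical 3-dimensional phoneme encodings for a 3-phoneme utterance.
encodings = [[0.2, -0.1, 0.0], [0.0, 0.3, 0.1], [-0.2, 0.0, 0.4]]
generated = sample_sequence(encodings, random.Random(0))
```

Unlike independent sampling, each generated latent here depends on the one drawn before it, which is what allows the prior to model temporal coherence across phonemes.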
Alternatively, the AR prior can be fit directly in the discrete latent space, in which case the prior at each step is a categorical distribution over the quantization embedding space $\{e_k\}_{k=1}^{K}$. This is similar to training a neural language model [16]. To enable training the prior with KL divergence in the discrete space, a single-sample estimate of the posterior distribution is used, hence the training objective becomes the cross-entropy loss:

$$\sum_{n=1}^{N} D_{\mathrm{KL}}\left(\hat{q}_n \,\|\, \hat{p}_n\right) = \sum_{n=1}^{N} \sum_{k=1}^{K} \hat{q}_n(e_k) \log \frac{\hat{q}_n(e_k)}{\hat{p}_n(e_k)} = -\sum_{n=1}^{N} \log \hat{p}_n(\tilde{e}_n) \qquad (4)$$

where $\hat{q}_n$ and $\hat{p}_n$ are the discrete posterior and prior multinomial distributions at step $n$, respectively, and $\tilde{e}_n$ is a single sample drawn from the posterior distribution. Hence the estimated posterior distribution is one-hot at embedding $\tilde{e}_n$ (i.e. $\hat{q}_n(e_k) = 1$ only when $e_k = \tilde{e}_n$, and $\hat{q}_n(e_k) = 0$ otherwise), which yields the cross-entropy loss form. As before, a single-layer LSTM is used to model the prior and the output at each step is a categorical distribution after a softmax activation function. The phoneme encoding $Y$ is also used during training and generation.
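The reduction from KL divergence to cross-entropy can be checked numerically for a single step. The logits, the sampled class index, and the function names below are hypothetical values chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_categorical(q, p):
    """KL(q || p) for two categorical distributions, treating 0 * log(0) as 0."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0.0)


def cross_entropy(sampled_index, prior_probs):
    """Negative log-probability that the prior assigns to the sampled class."""
    return -math.log(prior_probs[sampled_index])


# Hypothetical prior output over 4 embedding classes and one posterior sample.
prior_probs = softmax([2.0, 0.5, -1.0, 0.0])
sampled_index = 1
one_hot = [1.0 if k == sampled_index else 0.0 for k in range(len(prior_probs))]
```

With a one-hot posterior estimate, every term of the KL sum vanishes except the one for the sampled class, so `kl_categorical(one_hot, prior_probs)` and `cross_entropy(sampled_index, prior_probs)` agree.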
4 Experiments
We evaluated multispeaker TTS models on the LibriTTS dataset [29], a multispeaker English corpus of approximately 585 hours of read audiobooks sampled at 24 kHz. It covers a wide range of speakers, recording conditions, and speaking styles. We used a model following [8], replacing the Gaussian mixture VAE (GMVAE) with a QFVAE. Output audio is synthesized using a WaveRNN vocoder [10].
To evaluate how much information was lost by quantizing the latent prosody representation, we measured reconstruction performance by encoding a reference signal and using the resulting posterior to reconstruct the same signal. In addition, we measured the naturalness of the synthesized speech using subjective listening tests, and speech intelligibility using speech recognition performance with a pretrained ASR model. We also evaluated the diversity of the prosody in samples from different models. A good system should be able to generate natural audio samples with a relatively large prosody diversity.
Finally, we demonstrated that samples from the proposed model could be used to augment real speech training data to improve the performance of a speech recognition system. We encourage readers to listen to audio examples on the accompanying web page (https://google.github.io/tacotron/publications/prosody_prior).
4.1 Reconstruction Performance
Reconstruction, or copy synthesis, performance is measured using $F_0$ frame error (FFE) [3] and mel-cepstral distortion (MCD) [12] computed from the first 13 MFCCs, which reflect how well a model captures the pitch and timbre of the original speech, respectively. Lower values are better for both metrics. Results are shown in Table 1, illustrating that quantizing the prosody latents in the QFVAE degrades reconstruction compared to the baseline, since some information is discarded as part of the quantization process.
Table 1: Reconstruction performance (lower is better).
|Model||FFE||MCD|
|Baseline fine-grained VAE||0.18||8.6|
|QFVAE 32 classes||0.32||11.3|
|QFVAE 256 classes||0.26||10.2|
|QFVAE 1024 classes||0.22||9.5|
Compared to the global VAE model, which used a single 32-dimensional latent vector to represent the prosody across a full utterance, the baseline fine-grained VAE used a 3-dimensional latent for each phoneme and achieved significantly better reconstruction performance. The QFVAE models have higher FFE and MCD than the baseline; however, their performance improves as the number of classes increases. With 1024 classes, QFVAE performance approaches that of the baseline.
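The two reconstruction metrics used in this subsection can be sketched as below. Several details are assumptions made for this example rather than facts about the evaluation setup: the reference and synthesized frames are taken to be already time-aligned and of equal length, an $F_0$ value of 0.0 marks an unvoiced frame, and a 20% relative deviation is used as the pitch error tolerance.

```python
import math


def mel_cepstral_distortion(reference_frames, synthesized_frames):
    """Mean MCD in dB over time-aligned frames of cepstral coefficients."""
    scale = 10.0 / math.log(10.0)
    per_frame = [
        scale * math.sqrt(2.0 * sum((r - s) ** 2 for r, s in zip(ref, syn)))
        for ref, syn in zip(reference_frames, synthesized_frames)
    ]
    return sum(per_frame) / len(per_frame)


def f0_frame_error(reference_f0, synthesized_f0, tolerance=0.2):
    """Fraction of frames with a voicing error or an F0 deviation above `tolerance`.

    An F0 value of 0.0 marks an unvoiced frame (assumed convention).
    """
    errors = 0
    for ref, syn in zip(reference_f0, synthesized_f0):
        ref_voiced, syn_voiced = ref > 0.0, syn > 0.0
        if ref_voiced != syn_voiced:
            errors += 1  # voicing decision error
        elif ref_voiced and abs(syn - ref) > tolerance * ref:
            errors += 1  # gross pitch error on a frame voiced in both signals
    return errors / len(reference_f0)


# Hypothetical F0 tracks in Hz for four aligned frames.
ffe = f0_frame_error([100.0, 0.0, 200.0, 150.0], [105.0, 0.0, 0.0, 200.0])
# Hypothetical single-frame, single-coefficient example.
mcd = mel_cepstral_distortion([[1.0]], [[0.0]])
```

In the toy $F_0$ tracks, the third frame is a voicing error and the fourth deviates by more than 20%, so two of the four frames count as errors.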
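Table 2: Naturalness (MOS), WER, and diversity (average per-phoneme standard deviation of relative energy, $F_0$, and duration) of generated samples.
|Model||Prior||MOS||WER||Energy std.||$F_0$ std.||Duration std.|
|Baseline||Indep. scale=0.2||2.80 ± 0.08||26.8%||0.35||38||16|
|Baseline||Indep. scale=0.0||4.04 ± 0.06||10.6%||0.09||8||8|
|Baseline||AR continuous||2.37 ± 0.07||32.0%||0.48||32||21|
|QFVAE||AR discrete||3.45 ± 0.07||11.5%||0.39||32||16|
|QFVAE||AR continuous||3.98 ± 0.06||8.4%||0.29||24||12|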
4.2 Sample Naturalness and Diversity
We conducted subjective listening tests on 100 utterances synthesized for each of 10 speakers, in which native English speakers were asked to rate the naturalness of speech samples on a 5-point scale in increments of 0.5. Results are reported in terms of mean opinion score (MOS).
Complementary to naturalness, we also report the word error rate (WER) from an ASR model trained on real speech from the LibriTTS training set and evaluated on speech synthesized from transcripts in the LibriTTS test set (note that WER on LibriTTS is different from WER on LibriSpeech). We used the sequence-to-sequence ASR model from [9]. This metric verifies that the synthesized speech contains the full content of the input text. However, even if the full text is synthesized, we expect the WER to increase for speech samples with prosody that is very inconsistent with that of real speech.
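For reference, WER is conventionally computed as the word-level edit distance between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below follows that standard definition; the function name and the example sentences are made up for illustration, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    # previous_row[j]: edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# A dropped word in the synthesized speech shows up as one deletion.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

A synthesized utterance that skips or garbles words therefore raises the WER even when the remaining words are recognized correctly.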
Finally, we evaluated the diversity of samples from the proposed prior by measuring the standard deviation of three prosody attributes computed for each phoneme: relative energy, fundamental frequency ($F_0$), and duration. The duration of a phoneme is represented by the number of frames where the decoder attention assigns maximum values to that phoneme, $F_0$ is computed by the YIN pitch tracker [5], and the relative energy is the ratio of the average signal magnitude within a phoneme to the average magnitude of the entire utterance. The standard deviation of each of these metrics is computed within each phoneme, and the average standard deviation across all the phonemes is reported. To estimate these statistics, we synthesized 100 random samples for each of 3 randomly selected utterances from the test set, each using 3 different speaker IDs.
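The diversity statistic described above can be sketched as follows. The function names and the toy measurements are assumptions for this example, and the population standard deviation is used here because the text does not specify whether the population or the sample estimator was used.

```python
import statistics


def average_per_phoneme_std(samples):
    """Average over phonemes of the standard deviation across samples.

    samples[s][n] holds one attribute (e.g. duration in frames) measured for
    phoneme n in the s-th synthesized sample of the same utterance.
    """
    num_phonemes = len(samples[0])
    per_phoneme = [
        statistics.pstdev([sample[n] for sample in samples])
        for n in range(num_phonemes)
    ]
    return sum(per_phoneme) / num_phonemes


def relative_energy(phoneme_signal, utterance_signal):
    """Ratio of the average magnitude within a phoneme to that of the utterance."""
    phoneme_mean = sum(abs(v) for v in phoneme_signal) / len(phoneme_signal)
    utterance_mean = sum(abs(v) for v in utterance_signal) / len(utterance_signal)
    return phoneme_mean / utterance_mean


# Hypothetical durations (in frames) of 2 phonemes across 3 samples.
durations = [[10, 20], [12, 20], [14, 20]]
duration_diversity = average_per_phoneme_std(durations)
```

In the toy data only the first phoneme varies across samples, so the second phoneme contributes zero and halves the reported average.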
Results are shown in Table 2. Comparing independent sampling strategies from the baseline model, decreasing the scale reduced the diversity in each attribute, while increasing naturalness MOS and improving WER. This indicates a trade-off between diversity and naturalness that results from naive sampling from the baseline system. A scaling factor of 0.2 is the largest under which the system still generated somewhat natural audio samples. Samples using a larger scale were so poor (as reflected in the very high WER) that we did not conduct listening tests with them. Fitting an AR prior over the baseline’s continuous latent space results in worse naturalness than sampling independently for each phoneme with a moderate scale of 0.2, although the diversity becomes higher.
QFVAE samples always result in reasonable MOS and WER regardless of the type of prior. This indicates the benefit of the regularization imposed by quantizing the latent space during training, even if the discrete representation is not used directly by the prior. The most natural results, with the highest MOS and lowest WER, resulted from using an AR prior fit in the continuous space. Samples generated using this prior have MOS close to the baseline with neutral prosody (scale = 0.0), but with lower WER. However, independent samples had better diversity metrics, once again reflecting a similar, but less pronounced, diversity-quality trade-off to the baseline model. Finally, the AR prior in the discrete space gives similar diversity and naturalness to the independent prior. We conjecture that using a single sample to estimate the discrete KL divergence brings uncertainty to the model, resulting in a prior with large variance at each step.
4.3 Data Augmentation
One potential application of the proposed TTS model is to sample synthetic speech to help train ASR models, as a data augmentation procedure. We trained ASR models on synthesized audio from different TTS models and evaluated how well the resulting recognizers generalized when evaluated on real speech.
For each TTS model, we synthesized speech for the full set of training transcripts in LibriTTS with randomized speaker IDs. The synthesized speech was downsampled to a rate of 16 kHz and then used to train the ASR model. We used an end-to-end encoder-decoder ASR model with additive attention [2]. The model, training strategy, and associated hyperparameters followed LAS-4-1024 in [18].
As shown in Table 3, the WER of the ASR model when trained on real LibriTTS speech is 7.2%. Similar to [14], synthesizing speech using a baseline multispeaker Tacotron 2 model results in a significantly degraded WER of 20.0%, indicating that samples from this model did not capture the full space of variation of real speech.
In a copy synthesis setting where the ground truth reference speech is provided, the global VAE [8] improves on the Tacotron baseline, and the fine-grained VAE (Baseline) improves even further. These results indicate the importance of explicitly modeling prosody variation (VAE models) as well as speaker variation (Tacotron 2).
Randomly sampling independently at each phoneme using the baseline fine-grained VAE performs significantly worse than copy synthesis. The lowest WER is obtained when the scale is set to an intermediate value, reflecting a reasonable trade-off between naturalness and diversity. As in Table 2, the best QFVAE performance results from sampling independently at each phoneme.
To explore the improvement from diversity, the best-performing QFVAE was used to generate 10 copies of the training data with varying prosody to train the ASR system. This gives 12.5% WER, which is close to the copy synthesis result. Negligible improvement is found when adding more copies for the Tacotron 2 model, since they do not contain much variation in prosody. Finally, we found that training on synthetic speech (oversampled ten times) together with real speech in a data augmentation configuration resulted in a 16% relative WER reduction compared to the model trained on real speech alone.
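Table 3: WER of ASR models trained on speech from different sources and evaluated on real speech.
|Training data||Prior||WER|
|Multispeaker Tacotron 2||-||20.0%|
|GMVAE-Tacotron [8] copy synthesis||-||16.4%|
|Baseline copy synthesis||-||11.8%|
|10 samples per transcript|
|Real speech + QFVAE||Indep.||6.0%|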
We repeated the evaluation of the data augmentation experiment on the LibriSpeech corpus [17] in Table 4. The real speech used in this comparison was the union of the LibriTTS and LibriSpeech training sets, while the TTS model was trained on LibriTTS, which contained material that was not in LibriSpeech. Using the QFVAE for data augmentation resulted in relative WER reductions of 14% and 8% on the LibriSpeech test-clean and test-other sets, respectively.
Table 4: WER on the LibriSpeech test sets.
|Training data||test-clean||test-other|
|Real speech + QFVAE (Indep.) 10 samples||3.8%||11.4%|
This paper proposed a quantized fine-grained VAE TTS model, and compared different prosody priors to synthesize natural and diverse audio samples. A set of evaluations for naturalness and diversity was provided. Results showed that the quantization improved the sample naturalness while retaining a similar diversity. Sampling from an AR prior further improved the naturalness. When generated samples were used to train ASR systems, we demonstrated a potential application that used prosody variations for data augmentation.
The authors thank Daisy Stanton, Eric Battenberg, and the Google Brain and Perception teams for their helpful feedback and discussions.
References
[1] (2019) Effective use of variational embedding capacity in expressive end-to-end speech synthesis. arXiv:1906.03402.
[2] (2018) State-of-the-art speech recognition with sequence-to-sequence models. In Proc. ICASSP, pp. 4774–4778.
[3] (2009) Reducing F0 frame error of F0 tracking algorithms under noisy conditions with an unvoiced/voiced classification frontend. In Proc. ICASSP.
[4] (2016) A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems.
[5] (2002) YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America 111 (4), pp. 1917–1930.
[6] (2015) Variational recurrent auto-encoders. In Proc. International Conference on Learning Representations (ICLR).
[7] (2019) Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In Proc. ICASSP.
[8] (2019) Hierarchical generative modeling for controllable speech synthesis. In Proc. International Conference on Learning Representations (ICLR).
[9] (2019) On the choice of modeling unit for sequence-to-sequence speech recognition. In Proc. Interspeech, pp. 3800–3804.
[10] (2018) Efficient neural audio synthesis. In Proc. International Conference on Machine Learning (ICML), pp. 2415–2424.
[11] (2014) Auto-encoding variational Bayes. In Proc. International Conference on Learning Representations (ICLR).
[12] (1993) Mel-cepstral distance measure for objective speech quality assessment. In Proc. IEEE Pacific Rim Conference on Communications Computers and Signal Processing, Vol. 1, pp. 125–128.
[13] (2019) Robust and fine-grained prosody control of end-to-end speech synthesis. In Proc. ICASSP, pp. 5911–5915.
[14] (2018) Training neural speech recognition systems with synthetic speech augmentation. arXiv:1811.00707.
[15] (2019) A generative adversarial network for style modeling in a text-to-speech system. In Proc. International Conference on Learning Representations (ICLR).
[16] (2010) Recurrent neural network based language model. In Proc. Interspeech.
[17] (2015) LibriSpeech: an ASR corpus based on public domain audio books. In Proc. ICASSP, pp. 5206–5210.
[18] (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. In Proc. Interspeech.
[19] (2018) Deep Voice 3: 2000-speaker neural text-to-speech. In Proc. International Conference on Learning Representations (ICLR).
[20] (2018) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proc. ICASSP, pp. 4779–4783.
[21] (2018) Towards end-to-end prosody transfer for expressive speech synthesis with Tacotron. In Proc. International Conference on Machine Learning (ICML).
[22] (2017) Char2Wav: end-to-end speech synthesis. In Proc. International Conference on Learning Representations (ICLR).
[23] (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
[24] (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems.
[25] (2017) Attention is all you need. In Advances in Neural Information Processing Systems.
[26] (2010) Experimental and theoretical advances in prosody: a review. In Language and Cognitive Processes, pp. 905–945.
[27] (2017) Tacotron: towards end-to-end speech synthesis. In Proc. Interspeech, pp. 4006–4010.
[28] (2018) Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis. In Proc. International Conference on Machine Learning (ICML), pp. 5167–5176.
[29] (2019) LibriTTS: a corpus derived from LibriSpeech for text-to-speech. In Proc. Interspeech.