Significant developments have taken place in neural end-to-end text-to-speech (TTS) synthesis models, which generate high-fidelity speech with a simplified pipeline [char2wav, tacotron, deepvoice, wavmel]. Such systems usually incorporate an encoder-decoder neural network architecture [seq2seq] that maps a given text sequence to a sequence of acoustic features. More recent advances in such models enable the use of crowd-sourced data by disentangling and controlling different attributes such as speaker identity, noise, recording channels and prosody [hsu2018hierarchical, styletoken, gan4tts]. The focus of this paper, prosody, is a collection of attributes including fundamental frequency ($F_0$), energy and duration [prosody]. Efforts have been made to model and control these attributes by factorizing the latent attributes (e.g. prosody) from observed attributes (e.g. speaker). Although most of these works use latent representations at the utterance level, which capture the salient features of the utterance [adverserial, battenberg2019effective, transfer, hsu2018hierarchical], fine-grained prosody aligned with the phone sequence can be captured using the techniques recently proposed in [finegrained]. This model provides localized prosody control that achieves more variability and higher robustness to speaker perturbations.
Even though prosody attributes such as $F_0$ and energy can be treated as latent features, interpreting the phone-level latent space is still difficult since latent dimensions can be entangled with each other. Moreover, the coherence of prosody within a word (e.g. accented syllables), the noise level and the channel properties are important attributes that are not captured at the phone level alone. Respecting the hierarchical nature of spoken language, and aiming at interpretation of prosody at a fine scale such as a single vowel, this paper aims to achieve disentangled control of each prosody attribute at different levels.
This paper proposes a multilevel model based on Tacotron 2 [tacotron2] integrated with a hierarchical latent variable model. In addition to the prosody representation at the utterance level, representations are also extracted at the word and phone levels. Apart from utterance-level characteristics such as noise and channel properties, phone-level prosodic features are expected to capture fine-grained information associated with each phone, and word-level features are expected to capture the prosody of each word while maintaining a natural prosody structure within the word. To better interpret the representation of each latent dimension, the original VAE is replaced by a conditional VAE driven by the information contained in the previous latent dimensions. This setup gives a hierarchy in which finer-level features are conditioned on coarser ones, and the latent variables at each level are hierarchically factorized. The proposed model is thus referred to as a fully-hierarchical VAE. Furthermore, imposing a training schedule on each latent dimension results in a phone-level representation that reflects a consistent ordering of prosody attributes. Finally, we assess the disentanglement property of our model on the three most significant attributes.
2 Prior work
Abundant research has been performed on learning latent representations for styles and prosody [latentstyle, latentsynth], such as the use of an utterance-level VAE in [latentsynth2]. Our multilevel model is based on the fine-grained VAE structure, extending the idea in [finegrained], and is closely related to the hierarchical VQVAE model in [vqvae2]. While the latter uses downsampling to extract coarser features for image processing, our proposed model takes advantage of the hierarchical structure of spoken language. The multilevel alignment is also similar to the multilevel information extraction model proposed by [semisup].
Meanwhile, the unsupervised learning of disentangled latent representations has been explored extensively in recent years in various scenarios, including speech recognition [disentangleasr, disentangleasr2]. Progress has been made mostly in the direction of learning independently distributed latent variables without associating them with the actual latent factors [unsupdisentangle, unsupdisentangle2, unsupdisentangle3]. However, [icmlbest] demonstrated the impossibility of unsupervised learning of disentangled representations without any inductive bias. They pointed out that there exists an infinite number of bijective mappings from the learned latent space to another space with the same marginal distribution, but in which the two spaces are fully entangled. Other works try to learn disentangled representations via semi-supervised learning [semidisentangle, semidisentangle2, habib2019semi], which guides a subset of latent variables to learn some labelled features, or via adversarial training [gandisentangle, gandisentangle2]. The hierarchical decomposition most similar to our approach is proposed by [hfvae], but the intention of that decomposition is to facilitate learning statistically independent random variables.
3 Multilevel prosody modeling structure
Unlike the utterance-level VAE [hsu2018hierarchical], where a single latent feature is extracted for each utterance, the fine-grained VAE [finegrained] aligns the target spectrogram with the phone sequence and extracts a sequence of phone-level latent prosody features. These latent prosody features are concatenated with their corresponding phone encodings before being sent to the decoder. Our proposed multilevel model, which extends the fine-grained VAE, is illustrated in Fig. 1.
This structure is integrated with the encoder of Tacotron 2 [tacotron2]. Location-sensitive attention [transformer] is used to align the target spectrogram with the encoding of each phone at step 1. After the aligned target spectrogram is obtained, the average of the spectrograms associated with the phones in each word is calculated using the known phone-word alignment. The word-level latent prosody features are then extracted from these averaged spectrograms. Phone-level latent features are extracted conditioned on the word-level latent features, and both sets of features are concatenated, again using the phone-word alignment. These features are used by the decoder for reconstruction. The system is optimized with the multilevel evidence lower bound (ELBO):
In Eq. (1), the word and phone indices run up to the number of words and the number of phones in the utterance respectively, and the weighting coefficients have the same function as in [unsupdisentangle]. The latent variable refers to the sequence of concatenated phone-level and word-level latent features, with a mapping from each phone index to the corresponding word index and separate latent features associated with each word and each phone. The model incorporates the utterance-level feature by conditioning the other fine-grained latent features on the utterance-level latent feature and subtracting the corresponding term in Eq. (1) accordingly. For a more complete model, one could also introduce a dependency on text and speaker information in the posteriors, namely on the speaker embedding and on the phone and word encodings.
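The two uses of the phone-word alignment described above (averaging phone-aligned spectrogram features within each word, then attaching each word-level latent back to its phones) can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the paper: the function names, the list-based feature vectors and the `phone_to_word` index list are all assumptions made for the example.

```python
def pool_phones_to_words(phone_feats, phone_to_word):
    """Average phone-aligned feature vectors over the phones of each word.

    phone_feats[n] is the (hypothetical) aligned feature vector of phone n;
    phone_to_word[n] is the index of the word that phone n belongs to.
    Returns a dict mapping word index -> averaged feature vector.
    """
    sums, counts = {}, {}
    for feat, w in zip(phone_feats, phone_to_word):
        if w not in sums:
            sums[w] = [0.0] * len(feat)
            counts[w] = 0
        sums[w] = [s + x for s, x in zip(sums[w], feat)]
        counts[w] += 1
    return {w: [s / counts[w] for s in sums[w]] for w in sums}


def concat_word_and_phone(phone_latents, word_latents, phone_to_word):
    """Concatenate each phone-level latent with the latent of its word."""
    return [list(zp) + list(word_latents[w])
            for zp, w in zip(phone_latents, phone_to_word)]


# Three phones; the first two belong to word 0, the last to word 1.
alignment = [0, 0, 1]
pooled = pool_phones_to_words([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], alignment)
combined = concat_word_and_phone([[0.1], [0.2], [0.3]],
                                 {0: [1.0], 1: [2.0]}, alignment)
```

In a real system the pooled vectors would be fed to the word-level VAE and the concatenated features to the decoder; both are omitted here.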
4 Interpretable Conditional VAE Structure
In the previous multilevel structure, the VAE layer models the data distribution in the latent space as a multi-dimensional Gaussian distribution with a diagonal covariance matrix, which is based on the assumption that different latent dimensions have independent effects on the prosody attributes. A corresponding graphical model is illustrated in Fig. 2, where the graph with label 1 shows the generation sub-graph and label 2 shows the corresponding inference sub-graph. When inferring the posterior of one latent dimension, the model in label 2, in which the latent dimensions are independent of each other, implies that once the observation is given, knowing what another dimension represents provides no further information. However, if the latent dimensions control disentangled factors and one dimension is known to represent the energy of the phone, this does add information about the remaining dimensions: they must capture attributes other than energy. Therefore, this conditional dependency, shown by the graph with label 3, should be modeled when inferring the posterior distribution.
To incorporate the dependency into inference, we further extend the hierarchical structure described in the previous section to include a conditional VAE that follows an auto-regressive decomposition of the posterior. The model structure is shown in the right part of Fig. 2. Unlike auto-regressive density estimators [NIPS2013_5060, NIPS2017_6828], which use recurrent structures directly on top of the latent variables, our recurrent model conditions on the projection of the latent variables and extracts latent dimensions one at a time. Specifically, when extracting a given latent dimension, the input to the VAE is the aligned spectrogram concatenated with the sum of all previously extracted latent features. Because these latent dimensions, after projection, are used directly by the decoder, they effectively represent the prosody attributes being captured. Using these features together with the aligned spectrogram as inputs to the VAE implicitly encourages the current latent dimension to extract information about prosody attributes other than those already represented. The training objective for this VAE model can still be written in the ELBO form as shown in Eq. (2), where the expectation in the KL divergence between the auto-regressive posterior and the prior is estimated with a single sample.
where the conditioning values are samples of the previously extracted latent dimensions, drawn from their posterior distributions. The prior is an isotropic standard normal distribution over the latent dimensions. When applied to the multilevel framework, this loss function is minimized at each time step for each level, and the phone and word index subscripts of Eq. (1) can be added directly to each latent variable. With the two approaches combined, so that the auto-regressive decomposition of the posterior is applied both across different levels and across different latent dimensions, the model covers the full hierarchy from phone to utterance level.
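For readers who want the auto-regressive decomposition written out, the following is a sketch in assumed notation, not a reproduction of Eq. (2): $X$ stands for the aligned spectrogram, $z_i$ for the $i$-th latent dimension, $D$ for the number of latent dimensions, and $\tilde{z}_{<i}$ for single samples of the earlier dimensions. With a prior that factorizes over dimensions, the chain rule for the KL divergence gives a sum of per-dimension terms, and the outer expectation is what a single sample would approximate.

```latex
\begin{align}
q(z \mid X) &= \prod_{i=1}^{D} q(z_i \mid X, z_{<i}),
\qquad
p(z) = \prod_{i=1}^{D} \mathcal{N}(z_i;\, 0, 1), \\
D_{\mathrm{KL}}\big(q(z \mid X) \,\|\, p(z)\big)
&= \sum_{i=1}^{D} \mathbb{E}_{q(z_{<i} \mid X)}
   \Big[ D_{\mathrm{KL}}\big(q(z_i \mid X, z_{<i}) \,\|\, p(z_i)\big) \Big] \\
&\approx \sum_{i=1}^{D}
   D_{\mathrm{KL}}\big(q(z_i \mid X, \tilde{z}_{<i}) \,\|\, p(z_i)\big),
\qquad \tilde{z}_{<i} \sim q(z_{<i} \mid X).
\end{align}
```

Each term in the last line is a KL divergence between two one-dimensional Gaussians and therefore has a closed form.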
Finally, disentangled prosody features are observed to be extracted in an energy-duration-$F_0$ order when guided by scheduled training across latent dimensions. Scheduled training refers to the process in which the first latent dimension is trained for a certain number of steps before the second dimension starts to train. The remaining dimensions are started consecutively in the same way. Therefore, when scheduling is imposed together with the conditional VAE, the first latent dimension captures the energy information, and the second dimension, being aware that energy has already been represented by the first dimension, seeks a representation of the duration.
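The scheduling described above amounts to a step-dependent mask over latent dimensions. The sketch below is an illustration only: the function name `active_dims`, the fixed interval `steps_per_dim` and the idea of masking out dimensions that have not started yet are assumptions, since the text states only that each dimension starts after the previous one has trained for a certain number of steps.

```python
def active_dims(step, num_dims, steps_per_dim):
    """Return a 0/1 mask over latent dimensions for a given training step.

    Dimension i is assumed to start training at step i * steps_per_dim;
    dimensions that have not started yet are marked 0 so that the caller
    can, for example, zero them out before they reach the decoder.
    """
    return [1 if step >= i * steps_per_dim else 0 for i in range(num_dims)]


# With a hypothetical interval of 1000 steps and 3 latent dimensions:
for step in (0, 1000, 2500):
    print(step, active_dims(step, num_dims=3, steps_per_dim=1000))
```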
The proposed models are evaluated on the LibriTTS multi-speaker audiobook dataset [zen2019libritts] and the Blizzard Challenge 2013 single-speaker audiobook dataset [Blizzard]. LibriTTS includes approximately 585 hours of read English audiobooks at a 24 kHz sampling rate. It covers a wide range of speakers, recording conditions and speaking styles. The latent space is expected to control the prosody without affecting speaker characteristics. The Blizzard Challenge 2013 dataset, on the other hand, contains 147 hours of US English speech with highly varying prosody, recorded by a female professional speaker.
Three attributes are considered in this paper for fine-grained prosody interpretation: $F_0$, energy and duration. To evaluate each attribute quantitatively, we use the decoder alignment attention weights to obtain the duration of a phone by counting the frames whose attention weights peak at that specific phone. After obtaining the first and last frames of the phone and converting them to signal sample indices, the energy is estimated as the average signal magnitude within that sample range divided by the average signal magnitude of the whole utterance. $F_0$ is measured similarly, as the average $F_0$ estimated by an $F_0$ tracker [yin] over the frames of the phone. To decrease the variance due to bad alignments, we exclude 50 samples at both margins.
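The duration and energy measurements above can be sketched as follows. This is a minimal illustration under stated assumptions: the attention matrix is a list of per-frame weight lists, the phone segment is already given as signal sample indices (the frame-to-sample conversion depends on the hop size and is left out), and the 50-sample margin is interpreted as signal samples. The function names are hypothetical.

```python
def phone_durations(attention, num_phones):
    """Count, for each phone, the decoder frames whose attention peaks on it.

    attention[t][n] is the attention weight of decoder frame t on phone n.
    """
    durations = [0] * num_phones
    for frame in attention:
        peak = max(range(num_phones), key=lambda n: frame[n])
        durations[peak] += 1
    return durations


def relative_energy(signal, start, end, margin=50):
    """Average magnitude of signal[start:end], trimmed by `margin` samples
    on both sides, divided by the average magnitude of the whole utterance."""
    lo, hi = start + margin, end - margin
    if hi <= lo:
        raise ValueError("segment too short for the requested margin")
    segment = signal[lo:hi]
    seg_mean = sum(abs(x) for x in segment) / len(segment)
    utt_mean = sum(abs(x) for x in signal) / len(signal)
    if utt_mean == 0:
        raise ValueError("utterance is silent")
    return seg_mean / utt_mean


# Toy usage: three decoder frames over two phones, and a synthetic signal
# whose middle segment is three times louder than the rest.
durations = phone_durations([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]], num_phones=2)
energy = relative_energy([1.0] * 100 + [3.0] * 200 + [1.0] * 100, 100, 300)
```

The same segment boundaries could be reused to average the output of an $F_0$ tracker over the phone.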
Finally, the mel-cepstral distortion (MCD) and the $F_0$ Frame Error (FFE) [ffe], which is a combination of the Gross Pitch Error (GPE) and the Voicing Decision Error (VDE), are used to quantify the reconstruction performance. FFE evaluates the reconstruction of the $F_0$ track, and MCD evaluates the timbral distortion. We strongly encourage readers to listen to the samples on the demo page [hierarchical_demo].
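How FFE combines GPE and VDE can be illustrated with a small sketch. It follows the commonly used definitions rather than anything specified in this paper: a frame counts as a gross pitch error when both tracks are voiced and the synthesized $F_0$ deviates by more than 20% from the reference, and as a voicing error when the voicing decisions differ. The 20% tolerance, the use of 0 to mark unvoiced frames and the assumption that the two tracks are already time-aligned are all assumptions of this example.

```python
def f0_errors(ref_f0, syn_f0, tol=0.2):
    """Return (GPE, VDE, FFE) for two equal-length F0 tracks.

    A value of 0 marks an unvoiced frame. GPE is normalized by the number
    of frames voiced in both tracks; VDE and FFE by the total frame count.
    """
    if len(ref_f0) != len(syn_f0) or not ref_f0:
        raise ValueError("tracks must be non-empty and of equal length")
    n = len(ref_f0)
    both_voiced = pitch_err = voicing_err = 0
    for r, s in zip(ref_f0, syn_f0):
        ref_voiced, syn_voiced = r > 0, s > 0
        if ref_voiced != syn_voiced:
            voicing_err += 1
        elif ref_voiced:
            both_voiced += 1
            if abs(s - r) > tol * r:
                pitch_err += 1
    gpe = pitch_err / both_voiced if both_voiced else 0.0
    vde = voicing_err / n
    ffe = (pitch_err + voicing_err) / n
    return gpe, vde, ffe


# Toy tracks in Hz: one gross pitch error and two voicing errors in 5 frames.
gpe, vde, ffe = f0_errors([100, 100, 0, 0, 200], [100, 130, 0, 50, 0])
```

Because GPE and VDE use different denominators, FFE is in general not their plain sum.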
5.1 Reconstruction Performance
Table 1 shows reconstruction performance measured using FFE and MCD for the first 13 MFCCs. Lower is better for both metrics. The fine-grained VAE with 3-dimensional latent features and the fully-hierarchical VAE achieve similar reconstruction performance. Furthermore, the progressive improvements from the global VAE to the system with a 3-dimensional latent space can be interpreted as follows. Introducing the fine-grained VAE gives a significant drop in VDE, as the fine-grained VAE can capture phone-level energy and duration information. Conditioning the posterior distribution on the speaker embedding at the encoder side significantly reduces GPE because speaker identity is closely related to the average $F_0$. Increasing the latent space size to 3 again significantly reduces the error, which confirms that $F_0$ information is the last to be captured.
|Model||GPE||VDE||FFE||MCD|
|Phone-level VAE 2d (no spk.)||0.33||0.18||0.35||10.5|
|Phone-level VAE 2d||0.25||0.15||0.28||9.0|
|Phone-level VAE 3d||0.10||0.12||0.18||8.6|
|Phone-level conditional VAE||0.10||0.13||0.20||9.2|
5.2 Multilevel Controllability
The model selected for the demonstration is trained on the LibriTTS dataset, and both the phone-level and word-level latent spaces are 3-dimensional. To illustrate clearly the effect of controlling a single attribute at different levels, we traverse one dimension of the latent features to control a certain attribute while keeping the other dimensions constant. We demonstrate the control of a single vowel or a word using phone-level or word-level latent features in Fig. 4.
The effect on a single vowel is clear. Increasing $F_0$ raises the harmonic frequencies within that phone. Increasing energy brightens the area in the box while darkening the rest, as signals are normalized. Increasing duration stretches the corresponding area. Because the influence at the word level contains a mixture of effects on vowels and consonants, changing the dimension that controls energy also changes the duration of the phone n, which affects the duration of the word. However, the most significant changes can still be interpreted as the same three attributes exerted on all the vowels in the word. Meanwhile, the traversal of each latent dimension for a phone at different word-level prosody settings is shown in Fig. 3. The control of each latent attribute is linear across the traversed range of each dimension. The word-level control shifts the phone-level curves up and down, as the phone-level latent features are conditioned on the word-level ones.
Furthermore, adjusting word-level prosody features retains the prosody distribution within a word, while phone-level adjustment requires manual assignment to keep it natural. This effect can be evaluated by generating utterances with phone-level or word-level independent sampling of one latent feature. Subjective mean opinion score (MOS) test results are shown in Table 2, where random samples of the latent dimension controlling $F_0$ were used. Consequently, the word-level independent sampling sounds more natural, as the prosody structure within each word is retained as neutral prosody.
5.3 Improved phone-level interpretability
The interpretability is improved by the disentanglement of the three prosody attributes with the conditional VAE model. To illustrate this improvement, a vowel was selected and its $F_0$, energy and duration were measured with the method in Sec. 5. For each model, 100 samples were generated by drawing one latent dimension from a standard Gaussian distribution while keeping the other dimensions constant. Then, the standard deviations of the measured attributes were calculated and scaled to lie in a similar range. Next, the ratio between the standard deviation of the attribute under control and that of the attribute with the second-largest deviation was obtained to represent the disentanglement in that dimension. This was repeated for each latent dimension, and the sum of the ratios is used as a general measure of the degree of disentanglement. Although the ratio does not capture the exact entanglement, it still reflects the degree of disentanglement, because a disentangled system should show a much larger variation in one factor than in the other two when only one latent dimension is varied. The experiment with each model was repeated 5 times with different random seeds to show a consistent improvement in disentanglement. Results are displayed in the form of mean ± standard deviation in Table 3 and Table 4, where the mean is the average of the summed ratios and the standard deviation is taken across repetitions.
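The ratio-based measure described above can be sketched as follows. This is an illustration only: it assumes the attribute under control in each dimension is the one with the largest scaled standard deviation, and the `scales` divisors that bring the attributes into a similar range are hypothetical inputs, since the exact scaling is not specified in the text.

```python
import statistics


def disentanglement_score(measurements, scales):
    """Sum over latent dimensions of (largest std) / (second-largest std).

    measurements[d][a] is the list of values of attribute a (for example
    F0, energy, duration) measured while only latent dimension d is varied.
    scales[a] is a divisor that puts the std of attribute a into a range
    comparable with the other attributes.
    """
    total = 0.0
    for per_attribute in measurements:
        stds = sorted(
            (statistics.pstdev(values) / scale
             for values, scale in zip(per_attribute, scales)),
            reverse=True,
        )
        if stds[1] == 0:
            raise ValueError("second-largest deviation is zero")
        total += stds[0] / stds[1]
    return total


# Toy data: two latent dimensions, three attributes, two samples each.
toy = [
    [[0.0, 2.0], [0.0, 1.0], [0.0, 0.5]],  # dimension 0 mostly moves attribute 0
    [[0.0, 1.0], [0.0, 4.0], [0.0, 1.0]],  # dimension 1 mostly moves attribute 1
]
score = disentanglement_score(toy, scales=[1.0, 1.0, 1.0])
```

A larger score means that each dimension moves one attribute much more than the others; repeating the computation over several seeds gives the mean and standard deviation reported per model.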
|Model||Average variance ratio|
|Baseline fine-grained VAE|
|Model||Average variance ratio|
|Baseline fine-grained VAE|
Even though the standard deviation increases, using the conditional VAE model improves the degree of disentanglement on average. As the variation in prosody is found to be linearly correlated with the latent dimension, each attribute can be adjusted by traversing the corresponding latent dimension. Additionally, when a training schedule is imposed on the latent dimensions, the order in which prosody attributes are captured is always found to be energy, duration and $F_0$ on both datasets. Energy is the amplitude of the signal, which is directly related to the reconstruction loss and is therefore easier to capture, and $F_0$ is the last, which coincides with the findings from the reconstruction evaluation. Moreover, for silence the first dimension captures the duration, since that is the most significant attribute. The effect of latent features on a consonant can also be categorized into these three attributes in spectrograms, but is hard to interpret directly from the audio.
This paper proposes a fully-hierarchical model that achieves multilevel control of prosody attributes. The model consists of a hierarchical structure across different levels covering phone, word and utterance. In addition, a conditional VAE that also adopts a hierarchical structure across all latent dimensions is applied at the phone and word levels. Experimental results demonstrate improved interpretability by showing improved disentanglement, and the order in which prosody attributes are extracted is explained. Furthermore, the difference between phone-level and word-level control effects is analyzed.
The authors thank Daisy Stanton, Eric Battenberg, and the Google Brain and Perception teams for their helpful feedback and discussions.